Polyglot Assembly

Writing assembly code that runs on multiple architectures.

A few years back, while I worked as a Solaris engineer at Oracle, I posted an april fools joke to one of our internal mailing lists. Do you remember Solaris? Pepperidge farm remembers. Unfortunatelly Solaris 12 was never released (not as such, anyway) and many of the devs were laid off shortly afterwards, but that's a different story... The april fools e-mail read as follows:

Hi all,
   can I please get a codereview for my new Hello World program?
The purpose of the program is to output a greeting.
As the code is very short, I'm inlining it at the end of this email.
The code has been tested on sparc and x86.
Only 64 bit arch is supported - please compile with cc -m64.
cstyle reports no issues.

Thanks!
Vojtech

$ cat hello.c

const float main[] = {
         -5.0076923e-29, -6.02841476e-21, 1.75860328e-33, -4.3672462e-34,
         -2.03652633e-33, 3.00046579e-32, -6.99961556e-33, -4.36343733e-34,
         -253599.734, 1.87611745e-33, -4.36724253e-34, -2.03652633e-33,
         2.62426763e-32, -4.36343733e-34, -253599.859, -1.05886372e-37,
         -2.84236475e-29, -4.2805483e-28, -7.27646377e-27, -3.28364893e-28,
         -7.3422524e-38, -8.52233404e-38, -7.19531561e-38, -2.84236445e-29,
         -6.02842122e-21, 2.3509887e-38, -7.3422524e-38, -8.52233404e-38,
         -6.02842122e-21, 2.3509887e-38, -7.3422524e-38, -8.52233404e-38,
         1.69276854e-42, 1.58918456e-40, -7.11823707e-31, 3.30286048e-42,
         1.26058568e-39, 6.72382769e-36, 5.90304592e+22, 2.02799612e-19,
         1.17234334e+27, 9.30333261e-40, 1.7976867e-38, 0.0
};

I expect main() not being a function and instead being an array of numbers won't be news to many people, this has been done before. (If you've never seen this before, have a look at that blogpost, it's pretty good!)

The focus of this post instead is this line:

The code has been tested on sparc and x86.

Any code that a Solaris dev looked to push to the core repository had to be rigorously tested on both architectures targeted by Solaris: x86 and SPARC. And this code doesn't really look like it could run on multiple architectures, right? ... But it does. And it does that with a trick I call a polyglot assembly code. Maybe that's a little too pompous a name for a silly joke, but I didn't know what else to call it...

Linux, x86-64 & ARM

I recently decided to revive the idea and, since both Solaris and SPARC are fairly obscure nowadays, to port the code over to Linux on x86-64 and ARM.

Here's the new code:

#include <stdint.h>

const uint64_t _start[] __attribute__((section(".text"))) = {
    0xe3a00001909032eb, 0xe3a0200ce28f1014,
    0xef000000e3a07004, 0xe3a07001e3a00000,
    0x6c6c6548ef000000, 0x0a214d5241202c6f,
    0x18ec834800000000, 0x4800000045058b48,
    0xa20fc03148240489, 0x0c24548908245c89,
    0x142444c710244c89, 0x01c0c74800000a21,
    0x0001c7c748000000, 0x480124748d480000,
    0x050f00000015c2c7, 0x480000003cc0c748,
    0x6c654820050fff31, 0x00000000202c6f6c,
};

It's still just a Hello, World! program.

To build it, I recommend gcc -static -nostdlib. I found out the special section attribute is necessary for gcc to put the code in the right section.

You can also get pre built binaries here.

Please keep in mind that the resulting binaries are still platform-specific. The polyglot trick's purpose is to make it possible to build for x86-64 and ARM from the same source file. While the executable code inside the ELF file is the same bit by bit for both platforms, the ELF header and section layout is somewhat different and unfortunatelly there seems to be no way around this (at minimum, the kernel checks the ELF architecture byte and won't load the program unless it matches).

So anyway, how does it work?

diagram

The basic idea is pretty simple: The code starts with a magical snippet that gets interpreted by both architectures and basically performs a conditional jump based on which architecture is runing the code.

Architecture #1 decodes this bit as a an effective no-op instruction (in my example it's actually an ALU instruction whose output is ignored) and simply continues to the area marked blue in the diagram. Arch-#1-specific code is stored there.

Architecture #2 decodes the initial part as a jump and proceeds to jump over to the yellow area where the code for arch #2 follows.

How do you come up with the magical architecture-selecting prolog?

It turns out x86 is actually a pretty good choice for arch #2, because it has a short near-jump instruction, just 2 bytes. Together with 2 NOPs (1 byte each) this makes up the length of one regular (non-Thumb) ARM instruction.

The three x86 instructions, a JMP and two NOPs, can be arbitrarily reordered and also the jump distance can be adjusted, and as I found out this is enough degrees of freedom to find a valid nop-like ARM instruction.

Searching the instruction space

The original Solaris code was a result of basically just hokus-pokus. In the Linux rewrite I tried to come up with a somewhat more automated way of finding the right combo. There's a hackish python script and a Makefile rule that generate a big plaintext list of all NOP/NOP/2-byte JMP permutations along with their ARM representations.

From the list it's apparent that the JMP, NOP, NOP ordering yields a safe ARM addsls instruction most of the time, at least as long as the jump distance is small enough for the addsls to not touch the sp or pc registers...

In my case I only need to JMP 52 bytes ahead. This results in addsls r3, r0, r11, ror #5 on ARM, which is a perfectly harmless instruction. Maybe a bit too complex (a conditional add with a bit rotation), but harmless.

In summary, the magical bytes that make an x86 CPU jump ahead and an ARM CPU continue are:

eb 32 90 90

The code

The source code for this mini-project can be found on my github.

The C file quoted above effectively contains the following two code paths for x86-64 and ARM, respectively:

x86:
    # create a hello message on the stack
    sub     $24, %rsp
    movq    msg(%rip), %rax
    mov     %rax, (%rsp)
    xor     %rax, %rax
    cpuid
    mov     %ebx, 8(%rsp)
    mov     %edx, 12(%rsp)
    mov     %ecx, 16(%rsp)
    movl    $0x0a21, 20(%rsp)

    # write syscall
    mov     $1, %rax
    mov     $1, %rdi
    lea     1(%rsp), %rsi
    mov     $21, %rdx
    syscall

    # exit syscall
    mov     $60, %rax
    xor     %rdi, %rdi
    syscall

msg:
    .ascii  " Hello, "
    .int 0x0

arm:
    # write syscall
    mov     %r0, $1
    adr     %r1, msg
    mov     %r2, $len
    mov     %r7, $4
    swi     $0

    # exit syscall
    mov     %r0, $0
    mov     %r7, $1
    swi     $0

msg:
    .ascii  "Hello, ARM!\n"
    .equ len, . - msg
    .int 0x0

I decided to automate the whole process of generating the C file and the binaries for x86-64 and ARM. There's a big Makefile that goes through the arduous process of first generating ARM code from the source ARM asm file, dumping the compiled ARM, feeding that to a Python script with a template of the x86 + compiled ARM code, compiling that, generating the C source from that, and finally compiling the C source for both x86 and ARM.

Running the Makefile requires having an arm cross-compiling gcc installed along with a matching gdb.

Practical applications

There are none :-)

At least as far as I know. As stated earlier, the resulting binary is still platform-specific, Which means the trick can't be used to make multi-platform ELFs :-|

In theory it could be used to create a multi-platform shellcode, but I find it unlikely that it would actually be useful in practice.

Multi-platform kernels maybe? But then again boot procedures differ quite a lot between architectures, so again it's probably not very practical...

So yeah, it's really just a joke done for fun and to perhaps learn something new...