Assembler
The project only contains small programs to help me learn assembler.
gas Contains examples using the GNU Assembler.
nasm Contains examples using the Netwide Assembler.
c C programs used for viewing the generated assembler code.
Registers
rax caller saved.
rbx callee saved.
rdi caller saved. Used to pass 1st argument to functions
rsi caller saved. Used to pass 2nd argument to functions
rdx caller saved. Used to pass 3rd argument to functions
rcx caller saved. Used to pass 4th argument to functions
r8 caller saved. Used to pass 5th argument to functions
r9 caller saved. Used to pass 6th argument to functions
rbp callee saved. The stack base pointer
rsp callee saved. The stack pointer
r10 caller saved
r11 caller saved
r12 callee saved
r13 callee saved
r14 callee saved
r15 callee saved
Caller saved
These registers might be changed when making function calls and it is the caller's responsibility to save them.
Callee saved
These registers are preserved across function calls; a function that uses them must save and restore them before returning.
The Stack
The stack consists of memory locations reserved at the end of the memory area allocated to the program. The ESP register points to the top of the stack, which is the most recently pushed item. The stack grows towards lower addresses: PUSH first decreases ESP (the stack pointer) and then stores the data at the new top of the stack.
POP moves the data at the top of the stack to a register or a memory location and then increases ESP.
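As a rough illustration of the PUSH/POP behaviour described above, here is a minimal C model of a 32-bit stack. The names Machine, machine_init, push32 and pop32 are made up for this sketch, the stack size is arbitrary, and there are no overflow checks; it only shows in which direction ESP moves.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a tiny stack: 64 bytes, i.e. 16 four-byte slots. */
enum { STACK_BYTES = 64 };

typedef struct {
    uint8_t  mem[STACK_BYTES]; /* backing memory, "addresses" 0..63 */
    uint32_t esp;              /* offset of the current top of the stack */
} Machine;

/* An empty stack starts with ESP just past the highest address. */
void machine_init(Machine *m) {
    m->esp = STACK_BYTES;
}

/* PUSH: decrease ESP by 4, then store the value at the new top. */
void push32(Machine *m, uint32_t value) {
    m->esp -= 4;
    memcpy(&m->mem[m->esp], &value, 4);
}

/* POP: load the value at the top, then increase ESP by 4. */
uint32_t pop32(Machine *m) {
    uint32_t value;
    memcpy(&value, &m->mem[m->esp], 4);
    m->esp += 4;
    return value;
}
```

Pushing 10 and then 20 leaves ESP at 56 in this model, and the values come back in reverse order when popped.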
When a C-style function call is made (using the 32-bit x86 convention), the caller places the required arguments on the stack, and the call instruction then pushes the return address onto the stack as well.
param2 8(%esp)
param1 4(%esp)
return address <- (%esp)
So ESP points to the top of the stack, where the return address is. If we used the POP instruction to get the parameters, the return address would have to be popped first and might be lost in the process. This can be avoided using indirect addressing, as in using 4(%esp) to access the parameters without incrementing ESP. But what if the function itself needs to push data onto the stack? This would also change the value of ESP and throw off the indirect addressing.
Instead, the common practice is to copy the current value of ESP into EBP and then use indirect addressing with EBP, which will not change when the PUSH/POP instructions are used. The calling function might also be using EBP for the same reason, so the old value of EBP is first pushed onto the stack (decreasing ESP), and only then is the current ESP value stored in EBP to enable indirect addressing.
param2 12(%esp)
param1 8(%esp)
return address <- 4(%esp)
esp ->old EBP <- (%esp)
_main:
pushl %ebp
mov %esp, %ebp
....
movl %ebp, %esp
popl %ebp
ret
Resetting the ESP register value ensures that any data placed on the stack within the function but not cleaned off will be discarded when execution returns to the main program (otherwise, the RET instruction could return to the wrong memory location).
Now, since we are using EBP we can place additional data on the stack without affecting how input parameter values are accessed. We can use EBP with indirect addressing to create local variables:
param2 12(%ebp)
param1 8(%ebp)
return address 4(%ebp)
esp, ebp -> old EBP (%ebp)
local var1 -4(%ebp)
local var2 -8(%ebp)
local var3 -12(%ebp)
But what would happen if the function now uses the PUSH instruction to push data onto the stack?
Well, it would overwrite one or more local variables, since ESP was not affected by the usage of EBP.
We need some way of reserving space for these local variables so that ESP points to -12(%ebp) in our case.
_main:
pushl %ebp
mov %esp, %ebp
subl $12, %esp # reserve 12 bytes for local variables.
Also, when the function returns, the parameters are still on the stack, which might not be expected by the calling function. What you should do is reset the stack to the state before the call, when there were no parameters on the stack. You can do this by adding 4, 8, 12 (whatever the total size of the parameters is) to ESP after the call returns.
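The whole sequence above (arguments, return address, prologue, locals, epilogue and caller cleanup) can be traced with a small C model. This is only a sketch: Cpu, cpu_push, cpu_pop, cpu_at_ebp and call_add are hypothetical names, 0xdeadbeef stands in for a real return address, and memory is modelled as an array of 4-byte words.

```c
#include <stdint.h>

enum { STACK_WORDS = 32 };      /* hypothetical stack size in 4-byte words */

typedef struct {
    uint32_t mem[STACK_WORDS];  /* mem[i] models the word at byte address 4*i */
    uint32_t esp, ebp;          /* byte addresses, as in the text */
} Cpu;

void cpu_push(Cpu *c, uint32_t v) {
    c->esp -= 4;
    c->mem[c->esp / 4] = v;
}

uint32_t cpu_pop(Cpu *c) {
    uint32_t v = c->mem[c->esp / 4];
    c->esp += 4;
    return v;
}

/* Read the word at disp(%ebp); disp may be negative. */
uint32_t cpu_at_ebp(const Cpu *c, int32_t disp) {
    return c->mem[(c->ebp + (uint32_t)disp) / 4];
}

/* Models a call to a two-parameter function and returns p1 + p2 as the
   callee sees them through EBP. */
uint32_t call_add(Cpu *c, uint32_t p1, uint32_t p2) {
    cpu_push(c, p2);                  /* caller: push arguments right to left */
    cpu_push(c, p1);
    cpu_push(c, 0xdeadbeef);          /* stands in for the address pushed by call */

    cpu_push(c, c->ebp);              /* pushl %ebp */
    c->ebp = c->esp;                  /* movl %esp, %ebp */
    c->esp -= 12;                     /* subl $12, %esp */
    c->mem[(c->ebp - 4) / 4] = 111;   /* local var1 at -4(%ebp) */
    cpu_push(c, 999);                 /* a later push lands below the locals */

    uint32_t sum = cpu_at_ebp(c, 8) + cpu_at_ebp(c, 12);

    c->esp = c->ebp;                  /* movl %ebp, %esp */
    c->ebp = cpu_pop(c);              /* popl %ebp */
    cpu_pop(c);                       /* ret pops the return address */
    c->esp += 8;                      /* caller: addl $8, %esp drops the params */
    return sum;
}
```

Starting with ESP at 128, the model ends with ESP back at 128 and EBP restored, and the extra push does not touch the local at -4(%ebp) because the 12 bytes were reserved first.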
Inspecting the stack
When you start a program in lldb
you can take a look at the stack pointer memory location using:
$ lldb ./out/cli 10 20
(lldb) breakpoint set --file cli.s --line 9
(lldb) run
(lldb) register read rsp
rsp = 0x00007fff5fbfeb98
(lldb) memory read --size 4 --format x 0x00007fff5fbfeb98
0x7fff5fbfeb98: 0x850125ad 0x00007fff 0x850125ad 0x00007fff
0x7fff5fbfeba8: 0x00000000 0x00000000 0x00000002 0x00000000
What I'm trying to figure out is where argc might be. We can see that the row starting at 0x7fff5fbfeba8 contains the value 2, which matches our two parameters (the program name and the argument).
What I was missing was that when using a C runtime argc is passed in rdi and not on the stack. I was looking for the value on the stack.
Compare while(flag) to while(flag == true)
while(flag == true):
while`main:
0x100000f70 <+0>: pushq %rbp
0x100000f71 <+1>: movq %rsp, %rbp
0x100000f74 <+4>: movl $0x0, -0x4(%rbp) ## padding? (probably the slot for main's return value)
0x100000f7b <+11>: movb $0x1, -0x5(%rbp) ## flag = true
0x100000f7f <+15>: movl $0x5, -0xc(%rbp) ## a = 5
0x100000f86 <+22>: cmpl $0x5, -0xc(%rbp) ## a == 5
0x100000f8a <+26>: jne 0x100000f94 ; <+36> at while.cc:10
0x100000f90 <+32>: movb $0x0, -0x5(%rbp) ## flag = false
0x100000f94 <+36>: movl -0xc(%rbp), %eax ## move a into eax
0x100000f97 <+39>: addl $0x1, %eax ## increment a
0x100000f9a <+42>: movl %eax, -0xc(%rbp) ## move incremented value back into a
0x100000f9d <+45>: movb -0x5(%rbp), %al ## move flag into al
0x100000fa0 <+48>: andb $0x1, %al ## AND 1 and flag
0x100000fa2 <+50>: movzbl %al, %ecx ## zero-extend al into ecx
0x100000fa5 <+53>: cmpl $0x1, %ecx ## flag == true
0x100000fa8 <+56>: je 0x100000f86 ; <+22> at while.cc:7
0x100000fae <+62>: xorl %eax, %eax
0x100000fb0 <+64>: popq %rbp
0x100000fb1 <+65>: retq
Compared to using while(flag):
while`main:
0x100000f70 <+0>: pushq %rbp
0x100000f71 <+1>: movq %rsp, %rbp
0x100000f74 <+4>: movl $0x0, -0x4(%rbp) ## padding? (probably the slot for main's return value)
0x100000f7b <+11>: movb $0x1, -0x5(%rbp) ## flag = true
0x100000f7f <+15>: movl $0x5, -0xc(%rbp) ## a = 5
0x100000f86 <+22>: cmpl $0x5, -0xc(%rbp) ## a == 5
0x100000f8a <+26>: jne 0x100000f94 ; <+36> at while.cc:10
0x100000f90 <+32>: movb $0x0, -0x5(%rbp) ## flag = false
0x100000f94 <+36>: movl -0xc(%rbp), %eax ## move a into eax
0x100000f97 <+39>: addl $0x1, %eax ## increment a
0x100000f9a <+42>: movl %eax, -0xc(%rbp) ## move incremented value back into a
0x100000f9d <+45>: testb $0x1, -0x5(%rbp) ## AND 1 and flag
0x100000fa1 <+49>: jne 0x100000f86 ; <+22> at while.cc:7 ## branch if not equal
0x100000fa7 <+55>: xorl %eax, %eax
0x100000fa9 <+57>: popq %rbp
0x100000faa <+58>: retq
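The source file while.cc is not shown here, so the following is only a guess at what the two listings were compiled from, written as C rather than C++. Both listings test the flag at the bottom of the loop, which is the shape a do/while loop usually produces, so the sketch assumes that. The function names are made up, and the functions return a (instead of 0 as main does) so that the result can be checked.

```c
#include <stdbool.h>

/* Assumed shape of the first listing: the flag is compared with true,
   which the compiler turned into movzbl + cmpl $0x1. */
int loop_explicit(void) {
    bool flag = true;
    int a = 5;
    do {
        if (a == 5) {
            flag = false;
        }
        a++;
    } while (flag == true);
    return a;
}

/* Assumed shape of the second listing: the flag is tested directly,
   which the compiler turned into a single testb. */
int loop_implicit(void) {
    bool flag = true;
    int a = 5;
    do {
        if (a == 5) {
            flag = false;
        }
        a++;
    } while (flag);
    return a;
}
```

Both variants behave the same; the only difference visible in the listings is that the explicit comparison costs a few extra instructions in an unoptimized build.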
Inspecting images
To list the current executable and its dependent images:
(lldb) target modules list
or
(lldb) image list
You can dump the object file using:
(lldb) target modules dump objfile /Users/danielbevenius/work/assembler/gas/out/cli
You can show the sections using:
(lldb) image dump sections
Linking and Loading
Using chmod +x any file can be set to be an executable, but this only tells the kernel to read the file into memory and to look for a header to determine the executable format. This header is often referred to as magic, which is a known number identifying a certain type of executable format.
Magic numbers:
\x7FELF Executable and Linkable Format. Native in Linux and UNIX though not supported by OS X
#! Script. The kernel runs the interpreter named after #! and passes it the path of the script as an argument
0xcafebabe Multi-arch (universal) binaries for OS X
0xfeedface OS X native binary format 32 bit
0xfeedfacf OS X native binary format 64 bit
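To make the list above concrete, here is a small sketch that classifies a buffer by its first bytes. classify_magic and its return strings are hypothetical names chosen for this example. It assumes the multi-arch header is stored big-endian and that a Mach-O header is written in the byte order of its target CPU, so on x86 the bytes of 0xfeedfacf appear in the file as cf fa ed fe.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns a short label for the executable format, based only on the
   first bytes of a file. The labels mirror the list above. */
const char *classify_magic(const unsigned char *buf, size_t len) {
    /* The string is split so that 'E' is not parsed as part of \x7f. */
    if (len >= 4 && memcmp(buf, "\x7f" "ELF", 4) == 0) return "ELF";
    if (len >= 2 && buf[0] == '#' && buf[1] == '!') return "script";
    if (len < 4) return "unknown";

    /* The first four bytes read as a big-endian number ... */
    uint32_t be = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                  ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
    /* ... and as a little-endian number. */
    uint32_t le = ((uint32_t)buf[3] << 24) | ((uint32_t)buf[2] << 16) |
                  ((uint32_t)buf[1] << 8) | (uint32_t)buf[0];

    if (be == 0xcafebabe) return "multi-arch";
    if (le == 0xfeedface || be == 0xfeedface) return "Mach-O 32-bit";
    if (le == 0xfeedfacf || be == 0xfeedfacf) return "Mach-O 64-bit";
    return "unknown";
}
```

Reading the first four bytes of out/loop into a buffer and passing them to this function should give "Mach-O 64-bit", given the MH_MAGIC_64 value that otool reports for that file.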
Mach-Object Binaries
Mach-Object (Mach-O) is a legacy of OS X's NeXTSTEP origins. The header can be found in /usr/include/mach-o/loader.h
struct mach_header {
uint32_t magic; /* mach magic number identifier */
cpu_type_t cputype; /* cpu specifier */
cpu_subtype_t cpusubtype; /* machine specifier */
uint32_t filetype; /* type of file */
uint32_t ncmds; /* number of load commands */
uint32_t sizeofcmds; /* the size of all the load commands */
uint32_t flags; /* flags */
};
struct mach_header_64 {
uint32_t magic; /* mach magic number identifier */
cpu_type_t cputype; /* cpu specifier */
cpu_subtype_t cpusubtype; /* machine specifier */
uint32_t filetype; /* type of file */
uint32_t ncmds; /* number of load commands */
uint32_t sizeofcmds; /* the size of all the load commands */
uint32_t flags; /* flags */
uint32_t reserved; /* reserved */
};
The two are in fact mostly identical apart from the reserved field, which exists only in mach_header_64 and is unused.
You can find the filetypes in the same header:
#define MH_OBJECT 0x1 /* relocatable object file */
#define MH_EXECUTE 0x2 /* demand paged executable file */
#define MH_FVMLIB 0x3 /* fixed VM shared library file */
#define MH_CORE 0x4 /* core file */
#define MH_PRELOAD 0x5 /* preloaded executable file */
#define MH_DYLIB 0x6 /* dynamically bound shared library */
#define MH_DYLINKER 0x7 /* dynamic link editor */
#define MH_BUNDLE 0x8 /* dynamically bound bundle file */
#define MH_DYLIB_STUB 0x9 /* shared library stub for static */
/* linking only, no section contents */
#define MH_DSYM 0xa /* companion file with only debug */
/* sections */
I think MH simply stands for Mach Header.
You can inspect the header of a file using:
$ otool -hV out/loop
Mach header
      magic cputype cpusubtype  caps    filetype ncmds sizeofcmds      flags
MH_MAGIC_64  X86_64        ALL LIB64     EXECUTE    15       1200   NOUNDEFS DYLDLINK TWOLEVEL PIE
Load commands:
$ otool -l out/loop
The kernel is responsible for allocating virtual memory (LC_SEGMENT_64), creating the main thread, and code signing and encryption.
Load command 1
cmd LC_SEGMENT_64
cmdsize 392
segname __TEXT
vmaddr 0x0000000100000000
vmsize 0x0000000000001000
fileoff 0
filesize 4096
maxprot 0x00000007
initprot 0x00000005
nsects 4
flags 0x0
So this will map filesize (4096) bytes, starting at fileoff 0 in the file, to vmaddr 0x0000000100000000.
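The maxprot and initprot values in the load command above are bit masks of the Mach VM protection flags, where read is 0x1, write is 0x2 and execute is 0x4. A minimal sketch for decoding them follows; prot_to_string and the short PROT_* names are made up for this example and are not the names used in the system headers.

```c
/* Mach VM protection bits; the values match VM_PROT_READ, VM_PROT_WRITE
   and VM_PROT_EXECUTE, the short names are hypothetical. */
enum { PROT_R = 0x1, PROT_W = 0x2, PROT_X = 0x4 };

/* Writes an "rwx"-style string (3 characters plus terminator) into out. */
void prot_to_string(unsigned prot, char out[4]) {
    out[0] = (prot & PROT_R) ? 'r' : '-';
    out[1] = (prot & PROT_W) ? 'w' : '-';
    out[2] = (prot & PROT_X) ? 'x' : '-';
    out[3] = '\0';
}
```

With the values shown for __TEXT, maxprot 0x7 decodes to rwx and initprot 0x5 decodes to r-x, so the segment starts out readable and executable but not writable.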
Sections:
__text main program code
__stubs, __stub_helper stubs used in dynamic linking
LC_MAIN
Replaces LC_UNIXTHREAD from Mountain Lion onward and is responsible for starting the binary's main thread. For example, using out/loop once again:
Load command 11
cmd LC_MAIN
cmdsize 24
entryoff 3929
stacksize 0
For dynamically linked executables the loading of libraries and the resolving of symbols is done in user mode by the dynamic linker, which is named by the LC_LOAD_DYLINKER load command.
OS X uses .dylib whereas Linux uses .so for dynamic libraries. dyld works with segments and the sections within them. The dynamic linker is started by the kernel following the LC_LOAD_DYLINKER load command:
$ otool -l out/loop
...
Load command 7
cmd LC_LOAD_DYLINKER
cmdsize 32
name /usr/lib/dyld (offset 12)
The default is dyld (dynamic link editor), which runs in user mode. Its source is available at http://www.opensource.apple.com/source/dyld.
$ otool -tV out/loop
out/loop:
(__TEXT,__text) section
_main:
0000000100000f59 subq $0x8, %rsp
0000000100000f5d movabsq $0x0, %r12
0000000100000f67 leaq values(%rip), %r13
0000000100000f6e movq (%r13,%r12,4), %rsi
0000000100000f73 leaq val(%rip), %rdi
0000000100000f7a callq 0x100000f96 ## symbol stub for: _printf
0000000100000f7f incq %r12
0000000100000f82 cmpq $0x5, %r12
0000000100000f86 jne 0x100000f6e
0000000100000f88 movl $0x2000001, %eax
0000000100000f8d movq $0x0, %rdi
0000000100000f94 syscall
Now, notice the callq instruction, which is our call to _printf. The comment says that this is a symbol stub, so what are these?
This is an external undefined symbol, and the code is generated with a call into the symbol stub section.
$ dyldinfo -lazy_bind out/loop
lazy binding information (from lazy_bind part of dyld info):
segment section address index dylib symbol
__DATA __la_symbol_ptr 0x100001010 0x0000 libSystem _printf
So let's take a look at the sections again and look at the __stubs section:
$ otool -l out/loop
Section
sectname __stubs
segname __TEXT
addr 0x0000000100000f96
size 0x0000000000000006
offset 3990
align 2^1 (2)
reloff 0
nreloc 0
flags 0x80000408
reserved1 0 (index into indirect symbol table)
reserved2 6 (size of stubs)
And recall that the call to the stub looked like this:
0000000100000f7a callq 0x100000f96 ## symbol stub for: _printf
We can see that addr matches the target address of the callq instruction.
$ lldb out/loop
(lldb) breakpoint set --name main
Now, we want to follow the code when the callq is executed for the first time. On that first call dyld_stub_binder is invoked and does the symbol binding:
-> 0x100000f96 <+0>: jmpq *0x74(%rip) ; (void *)0x0000000100000fac
0x100000f9c: leaq 0x65(%rip), %r11 ; (void *)0x0000000000000000
0x100000fa3: pushq %r11
0x100000fa5: jmpq *0x55(%rip) ; (void *)0x00007fff8eca4148: dyld_stub_binder
So we will be in libdyld.dylib`dyld_stub_binder
There is a cache for dynamic libraries that can be found in: /private/var/db/dyld/
Print the symbols of an object file:
$ nm -m out/loop
0000000100000000 (__TEXT,__text) [referenced dynamically] external __mh_execute_header
0000000100000f59 (__TEXT,__text) external _main
(undefined) external _printf (from libSystem)
(undefined) external dyld_stub_binder (from libSystem)
0000000100001018 (__DATA,__data) non-external val
0000000100001025 (__DATA,__data) non-external values
Make the linker trace SEGMENTS:
$ export DYLD_PRINT_SEGMENTS=1
For more environment variables see man dyld.
Signals
/usr/include/sys/signal.h
Show info about a raw address
(lldb) image lookup --address 0x100000f78
Address: overflow[0x0000000100000f78] (overflow.__TEXT.__stubs + 12)
Summary: overflow`symbol stub for: printf
Breakpoint using address
(lldb) breakpoint set --address 0x100000f47
Displaying the stack
The equivalent of gdb's x/20x $rsp would be:
(lldb) memory read --count 20 --size 4 --format x $rsp
printf
Print with zero padding instead of blank
$ printf "%010x" 3
0000000003$
The first zero after the percent sign is the padding character, which can either be 0 or, if left out, blank padding will be used. 10 is the field width and x is for unsigned hexadecimal.
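The same format string works with the C library's printf family. The sketch below wraps snprintf so the result can be inspected as a string; the two function names are made up for this example.

```c
#include <stdio.h>

/* Formats value as hexadecimal, zero-padded to a field width of 10. */
int format_zero_padded_hex(char *out, size_t size, unsigned value) {
    return snprintf(out, size, "%010x", value);
}

/* Same field width without the 0 flag: padded with blanks instead. */
int format_blank_padded_hex(char *out, size_t size, unsigned value) {
    return snprintf(out, size, "%10x", value);
}
```

For the value 3 the first function produces 0000000003, matching the shell example, while the second produces nine blanks followed by 3.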
Instruction Pointer Relative addressing (RIP)
RIP addressing is a mode where address references are provided as 32-bit displacements from the current instruction pointer. One of the advantages of RIP addressing is that it makes it easier to generate PIC, which is code that does not depend on where it is loaded. This is important for shared objects as they don't know where they will be loaded. In x64, references to code and data are done using instruction pointer relative (RIP) addressing modes.
Position Independent Code (PIC)
When the linker creates a shared library it does not know where in the process's address space it might be loaded. This causes a problem for code and data references which need to point to the correct memory locations.
My view of this is that the linker takes multiple object files and merges their sections, like .text, .data etc. Merge might not be a good description; rather, it adds them sequentially to the resulting object file. If the source files refer to absolute locations in their .data sections, these might not be in the same place after linking into the resulting object file. Solving this problem can be done using position independent code (PIC) or load-time relocation.
There is an offset between the text and data sections. The linker combines all the text and data sections from all the object files and therefore knows the sizes of these sections. So the linker can rewrite the instructions using offsets and the sizes of the sections.
But x86 requires absolute addressing, does it not?
If we need a relative address (relative to the current instruction pointer, which 32-bit x86 has no instruction for reading directly), a way to get it is to use CALL some_label like this:
call some_label
some_label:
pop eax
call causes the address of the next instruction to be saved on the stack and then jumps to some_label. pop eax pops that address into eax, which now holds the value the instruction pointer had at some_label.
PIC is implemented using a Global Offset Table (GOT), which is a table of addresses in the .data section. When an instruction refers to a variable it does not use an absolute address (which would require relocation) but instead refers to an entry in the GOT, which is located at a well-known place in the data section. The entry in the GOT holds the absolute address. So this is a sort of relocation, but in the data section instead of in the code section, which is what was done for load-time relocation. Doing this in the data section, which is not shared and is writable, does not cause any issues. Also, relocations in the code section have to be done per variable reference and not per variable, as is the case when using a GOT.
So that covers variables, but for function calls a Procedure Linkage Table (PLT) is used. This is part of the text section. Instead of calling a function directly, a call is made to an entry in the PLT, which performs the actual call. This is sometimes called a trampoline, which I've seen on occasion when inspecting/dumping in lldb but did not know what it meant. This allows for lazy resolution of function calls. Also, every PLT entry has an entry in the GOT.
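The lazy resolution described above can be modelled in plain C with a function pointer that plays the role of a GOT entry. This is only an analogy, not how dyld or any real dynamic linker is implemented: real_double stands in for a library function, lazy_resolver for the binder, and call_double for the PLT stub. All of these names are made up for the sketch.

```c
typedef int (*func_ptr)(int);

/* Stands in for a function that lives in a shared library. */
static int real_double(int x) {
    return 2 * x;
}

static int resolver_calls = 0;      /* counts how often binding happened */
static int lazy_resolver(int x);

/* One GOT-like slot: it initially points at the resolver, not at the
   real function. */
static func_ptr got_slot = lazy_resolver;

/* The first call lands here: "look up" the symbol, patch the slot so that
   later calls skip the resolver, then forward the original call. */
static int lazy_resolver(int x) {
    resolver_calls++;
    got_slot = real_double;
    return got_slot(x);
}

/* PLT-like stub: callers always jump through the slot. */
int call_double(int x) {
    return got_slot(x);
}

/* Exposes the counter so the binding behaviour can be observed. */
int get_resolver_calls(void) {
    return resolver_calls;
}
```

Calling call_double twice runs the resolver only once; the second call goes straight to real_double through the patched slot, which is the point of binding lazily.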
Only position independent code is supposed to be included into shared objects (SO) as they should have an ability to dynamically change their location in RAM.
Load-time relocation
This process might take some time during loading, which might be a performance hit depending on the type of program being written. Since the text section needs to be modified during loading (to perform the actual relocations), it is not possible to have it shared by multiple processes.
Instruction Pointer Relative addressing (RIP)
References to code and data in x64 are done with instruction pointer relative addressing. So instructions can use references that are relative to the current instruction (or the next one) and don't require them to be absolute addresses. This works for offsets of up to 32 bits, but for programs that are larger than that this offset will not be enough. One could use absolute 64-bit addresses for everything, but more instructions are required to perform simple operations and most programs will not require this. The solution is to introduce code models to cater for all needs. The compiler should be able to take an option where the programmer can say that this object file will not be linked into a large program, or that this compilation unit will be included in a huge library and that 64-bit addressing should be used.
In 64-bit mode, the encoding for the old 32-bit immediate offset addressing mode is now a 32-bit offset from the current RIP, not from 0x00000000 like before. You only need to know how far away the target is from the currently executing instruction (technically the next instruction).
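The "technically the next instruction" detail can be checked against the stub disassembly earlier in this document. There, jmpq *0x74(%rip) sits at 0x100000f96 and the next instruction is at 0x100000f9c, so the instruction is 6 bytes long. A small sketch of the calculation follows; rip_relative_target is a made-up helper name.

```c
#include <stdint.h>

/* Effective address of a RIP-relative operand. RIP already points at the
   next instruction when the displacement is applied, so the instruction
   length is added first. disp is sign-extended to 64 bits. */
uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len,
                             int32_t disp) {
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}
```

For the stub, 0x100000f96 + 6 + 0x74 gives 0x100001010, which is the __la_symbol_ptr address that dyldinfo reported for _printf.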