Assembly is a low-level programming language. Low-level languages are close to the machine and hard for humans to read. We have already written programs in languages like C, C++, and Python. In the article on compiling C programs, we talked about what happens when we compile a program: the source code is translated into a set of binary instructions. Assembly is just a human-readable representation of those binary instructions. But why do we need Assembly language?
What is the Assembly language?
Suppose there is a CPU instruction whose binary form is 10111000. An instruction like this might move data from one place to another, pop a value off the stack, and so on. In hexadecimal the same value is b8. Both forms represent the same CPU instruction, but the hexadecimal value is shorter and easier to remember and use.
We can go further and map a short word to each CPU instruction. Take moving data as an example. Imagine that the hexadecimal code for moving some data is b8. We can assign the word MOV to that instruction, so whenever we want b8 we write MOV instead. That is more human-friendly than the binary and hexadecimal representations. (In fact, the instruction above moves a value into the eax register. We'll talk more about this later.)
Let's see some examples of these opcodes and Assembly instructions.
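As a rough illustration, here are a few 32-bit x86 instructions next to the bytes an assembler emits for them. The listing assumes NASM syntax in 32-bit mode; the byte values in the comments are the standard encodings for these particular forms.

```nasm
bits 32                 ; assemble for 32-bit mode (assumption for this sketch)

mov eax, 0x1            ; b8 01 00 00 00   b8 = move a 32-bit constant into eax
mov ebx, 0x5            ; bb 05 00 00 00   bb = the same operation, but into ebx
int 0x80                ; cd 80            cd = software interrupt
nop                     ; 90               do nothing
ret                     ; c3               return from a function
```

Notice that the constant 0x1 appears in the bytes as 01 00 00 00: x86 stores multi-byte values with the least significant byte first.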
Why do we need to learn Assembly?
Why learn Assembly? If you are going to learn reverse engineering (RE), you need a solid understanding of assembly language. In RE we don't have access to the source code of a program, but we can use a disassembler to get Assembly instructions from the binary. If you know assembly, you can work out what those instructions do, and from there get an idea of the high-level code and its structure.
Assembly is also very helpful when we write shellcode. Shellcode is a small sequence of CPU instructions used as the payload when exploiting a system. Since shellcode runs directly on the CPU without any compiling or linking, it consists purely of opcodes. Writing raw opcodes by hand is very hard, so we write the shellcode in assembly and then convert it to opcodes.
Sometimes we need to write programs directly in assembly, for example for some microcomputers such as real-time monitoring devices. A great advantage of programs written in assembly is that they can be very fast, because we write them for a specific device with its hardware architecture in mind.
Structure of an Assembly program
By now you should have a clear idea of what assembly is and what we use it for, so we can start our journey into Assembly language. First of all, let's see the structure of an assembly program.
    .intel_syntax noprefix
    .section .data
    .section .bss
    .section .text
    .global _start
    _start:
        mov eax, 0x1
        mov ebx, 0x5
        int 0x80
At the top of the program there is the line .intel_syntax noprefix. This line tells the assembler which Assembly syntax we use. Here we have used the Intel syntax, which is the one I use most of the time.
The same program can be written in AT&T assembly syntax as follows.
    .section .data
    .section .text
    .globl _start
    _start:
        movl $1, %eax
        movl $5, %ebx
        int $0x80
The following are some of the differences.
- AT&T syntax puts a "%" sign before a register name while Intel syntax does not. For example, when referring to eax, AT&T uses %eax.
- For instructions that use two operands, AT&T puts the source first and the destination second, while Intel puts the destination first and the source second. For example, to copy the value of the esp register to the ebp register, AT&T uses mov %esp, %ebp, but Intel uses mov ebp, esp.
- In AT&T syntax, we write -0x4(%ebp) to indirectly access a memory location, while in Intel syntax the same reference is written DWORD PTR [ebp - 0x4]. You can learn more about this in the tutorial on moving data with Assembly.
Most of the textbooks I have seen use AT&T syntax. Many tools such as GDB and objdump also use AT&T as their default syntax. I don't know the reason for that. Personally, I prefer the Intel syntax because of its clean formatting.
You can see these differences in the two programs above. AT&T uses movl where Intel uses mov; the l suffix marks a 32-bit (long) operand. It prefixes constant values with a dollar sign, as in $1, and register names with a percent sign, as in %eax and %ebx. The operand order also differs. In Intel syntax we use mov eax, 0x1 to move 1 into eax, with the destination first and the source second. In AT&T syntax we use movl $1, %eax, with the source first and the destination second.
Next in the program we can see some sections. First, there is a section called .data. We use it to store the data of the program, such as variables, constants, and strings. Since the program above is very small, it doesn't put anything in the .data section. We will see how to use that section in later articles.
Next, there is a section called .text. This is where we put our program instructions, such as mov eax, 0x1.
What are CPU registers?
In a normal Windows/Linux environment you have probably moved data or files from a source location to a destination. The Assembly MOV instruction is similar, but it behaves like copy/paste rather than cut/paste: when we move a file, the data is removed from the source, whereas MOV leaves the source unchanged.
Allocating Storage for Initialized Data
Data that is given a value when it is declared is known as initialized data, and it is located in the .data section.
DB - Allocates a Byte
DW - Allocates 2 Bytes (This size is known as a Word)
DD - Allocates 4 Bytes (This is twice the size of a Word and is known as a Doubleword)
DQ - Allocates 8 Bytes (This is four Words and is known as a Quadword)
DT - Allocates 10 Bytes
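A minimal sketch of how these directives might look in a NASM source file. The labels (letter, message, year, counter, big) and their values are made up for illustration.

```nasm
section .data
    letter  db 0x41                 ; 1 byte holding the ASCII code of 'A'
    message db "Hello", 0x0a        ; 6 bytes: five characters plus a newline
    year    dw 2024                 ; 2 bytes (a word)
    counter dd 100000               ; 4 bytes (a doubleword)
    big     dq 0x1122334455667788   ; 8 bytes (a quadword)
```

Each label stands for the address of the first byte it defines, so an instruction can later refer to the data by name.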
Allocating Storage for Uninitialized Data
Data that doesn't have a value when it is declared is known as uninitialized data, and it is located in the .bss section.
RESB - Reserve a Byte
RESW - Reserve a Word
RESD - Reserve a Doubleword
RESQ - Reserve a Quadword
REST - Reserve Ten Bytes
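A short sketch of the reserve directives in NASM. The number after each directive is a count of units, not a value, so resb 64 reserves 64 bytes and resq 4 reserves four quadwords. The labels are hypothetical.

```nasm
section .bss
    buffer  resb 64     ; 64 bytes, e.g. room for a line of input
    count   resw 1      ; one word (2 bytes)
    total   resd 1      ; one doubleword (4 bytes)
    table   resq 4      ; four quadwords (32 bytes)
```

Nothing in this section takes up space in the file on disk; the memory is set aside when the program is loaded.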
Moving data in assembly
Even in a simple assembly program, we use the mov instruction many times. The source and destination can be registers or memory locations, and the source can also be a constant value. Sometimes we copy a value from the stack to a register. At other times we copy the value of esp to ebp. For all such tasks, we can use the mov instruction.
Let's see the syntax of the mov instruction. Since we use Intel assembly syntax, the instruction format is as follows.
    mov destination, source
(If you prefer AT&T assembly syntax, put the source first and the destination second.)
OK, let's see some examples of moving data from one place to another.
    mov eax, 0x1
This Assembly instruction will copy the value 0x1 (0x1 is the hexadecimal representation of one) into the eax register.
The above instruction copies a constant value directly into a register. We call this immediate mode.
    mov eax, ebx
Can you tell what the above code does? It will copy whatever value is found in ebx to the eax register.
Here is a practical example of a register-to-register mov. I used the GDB debugger to demonstrate it.
    (gdb) i r ebp esp
    ebp            0xbffffd38       0xbffffd38
    esp            0xbffffcb8       0xbffffcb8
First I examined esp and ebp using the info registers command in GDB to see what those registers hold. The esp register holds the value 0xbffffcb8 and ebp holds the value 0xbffffd38.
If you're not familiar with GDB, please refer to our old tutorials on debugging binaries with GDB.
Next, I used x/i $eip to see the Assembly instruction that will be executed next. You can see that the instruction is mov ebp, esp, and you already know what it is going to do. After that, I used the ni command (ni stands for next instruction) to execute that instruction on the CPU.
    (gdb) x/i $eip
    0x80483f5 <main+1>: mov ebp,esp
    (gdb) ni
As we know, the CPU will copy whatever is found in esp to the ebp register. Let's see if that is true.
    (gdb) i r ebp esp
    ebp            0xbffffcb8       0xbffffcb8
    esp            0xbffffcb8       0xbffffcb8
I used the info registers command again to see what's in esp and ebp. You can see the esp register holds the value 0xbffffcb8. That is the old value of esp, so esp has not changed.
What about ebp? It also holds the value 0xbffffcb8. So we can clearly see that the value of esp was copied to ebp. Great. Now we have seen it in practice.
In the following image, you can see a layout of a 32-bit register.
OK. Now you know how to copy data from one register to another. Now focus on the following line of code.
    mov eax, [ebx]
If you look closely, you will see that this is not the same as the previous one. Here the source (ebx) is enclosed in square brackets. What does that mean? We are not copying the value of ebx into eax. Instead, the ebx register acts as a pointer. You may have come across pointers in C programming.
The ebx register holds a memory address. We go to that address and copy whatever is found there into the eax register.
The following instruction is very similar to the above one.
    mov [ebx], eax
It takes the value from eax. Then it takes the value of ebx, treats it as a memory address, and copies the value of eax to that address.
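The forms of mov covered so far can be put together in one small program. This is a sketch in NASM syntax, assuming a 32-bit Linux build (for example nasm -f elf32 followed by ld -m elf_i386); the variable name value is made up for illustration.

```nasm
global _start

section .data
    value dd 7                  ; hypothetical 4-byte variable holding 7

section .text
_start:
    mov ebx, value              ; immediate: ebx = address of value
    mov eax, [ebx]              ; load: eax = the 4 bytes at that address (7)
    mov ecx, eax                ; register to register: ecx = 7
    mov eax, 0x9                ; immediate: eax = 9
    mov [ebx], eax              ; store: the memory at value now holds 9

    mov ebx, [value]            ; read the variable back as the exit status
    mov eax, 0x1                ; exit system call number
    int 0x80                    ; the program exits with status 9
```

Running echo $? after the program shows 9, which confirms that the store through the pointer in ebx changed the variable.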
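It's also possible to copy a constant value to a location pointed to by a register, like the following.
    mov dword [esp], 0x1
This will copy the value 0x1 into the location pointed to by the value of esp. The dword keyword tells the assembler to write four bytes, since neither operand is a register that would imply a size. Here you can see it in practice.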
    (gdb) i r esp
    esp            0xbffffc50       0xbffffc50
    (gdb) x/x 0xbffffc50
    0xbffffc50:     0x080485e0
First I examined the value of esp. It is 0xbffffc50. Now we treat this as a memory address and check what is at that location. We can examine it by entering the command x/x 0xbffffc50. You can see that it holds the value 0x080485e0.
Now we examine eip to check the next instruction and use the ni command to execute it on the CPU.
    (gdb) x/i $eip
    0x80484bc <main+40>: mov [esp],0x1
    (gdb) ni
We saw that the next instruction is mov [esp],0x1. According to the theory, 0x1 should now be copied to the location pointed to by esp. Let's examine that location again to check.
    (gdb) i r esp
    esp            0xbffffc50       0xbffffc50
    (gdb) x/x 0xbffffc50
    0xbffffc50:     0x00000001
Yes. Everything went as expected. By now the concept of pointed-to locations should be clear.
Now, what about the following example?
    (gdb) i r esp eax
    esp            0xbffffc50       0xbffffc50
    eax            0xbffffc6c       -1073742740
    (gdb) x/x 0xbffffc50
    0xbffffc50:     0x00000000
    (gdb) x/i $eip
    0x804844d <main+21>: mov [esp],eax
    (gdb) ni
    (gdb) x/x 0xbffffc50
    0xbffffc50:     0xbffffc6c
I won't explain it in depth. If you understood the previous example, you can see what's going on here. It takes the value of eax and copies it to the location pointed to by the value of the esp register.
Can you work out what the following Assembly instruction does?
mov eax, [esp+0x5c]
The concept is the same as mov eax, [esp]. But this time we add a hexadecimal value to the value of the esp register. What does that mean? Read the following session and try to understand it.
    (gdb) i r esp eax
    esp            0xbffffc50       0xbffffc50
    eax            0x30     48
    (gdb) x/x 0xbffffc50 + 0x5c
    0xbffffcac:     0x66666666
    (gdb) x/i $eip
    0x8048471 <main+57>: mov eax, [esp+0x5c]
    (gdb) ni
    (gdb) i r eax
    eax            0x66666666       1717986918
When we execute mov eax, [esp+0x5c], the following happens.
First, we get the value of esp. It is 0xbffffc50.
Next, we add 0x5c to it. The answer is 0xbffffcac. You can use a calculator to check it.
Then we treat this answer as a memory address, go to that location, and copy whatever is found there into the eax register.
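The same base-plus-offset addressing is how elements of an array are reached. A minimal sketch in NASM syntax, assuming a 32-bit Linux build; the array numbers and the use of esi as the base register are choices made for this example.

```nasm
global _start

section .data
    numbers dd 10, 20, 30       ; hypothetical array of three doublewords

section .text
_start:
    mov esi, numbers            ; esi = address of the first element
    mov ebx, [esi+0x8]          ; address = esi + 8, which is the third element (30)
    mov eax, 0x1                ; exit system call number
    int 0x80                    ; the program exits with status 30
```

Each element is 4 bytes long, so offsets 0x0, 0x4, and 0x8 select the first, second, and third element.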
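Let's take another example.
    mov [eax], [ebx]
What would happen here?
First, we get the value of the ebx register and treat it as a memory address. Let's call this address A.
Next, we take the value of eax and treat that as a memory address. Let's call this address B.
Then we copy whatever is found at address A into address B. Note that x86 does not accept this as a single instruction, because mov cannot take two memory operands. In practice the value is first loaded from address A into a register and then stored from that register to address B.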
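Because x86 has no form of mov that takes two memory operands, a copy from address A to address B is normally done in two steps through a spare register. A minimal sketch in NASM syntax, assuming a 32-bit Linux build; the labels src and dst and the choice of ecx as the temporary register are made up for illustration.

```nasm
global _start

section .data
    src dd 42                   ; hypothetical source value (address A)
    dst dd 0                    ; hypothetical destination (address B)

section .text
_start:
    mov ebx, src                ; ebx = address A
    mov eax, dst                ; eax = address B
    mov ecx, [ebx]              ; step 1: load the value at A into ecx
    mov [eax], ecx              ; step 2: store ecx to B

    mov ebx, [dst]              ; read the destination back as the exit status
    mov eax, 0x1                ; exit system call number
    int 0x80                    ; the program exits with status 42
```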
Finally, it's possible to do something like the following.
    mov al, 0x1
This is a cool trick we often use in shellcoding. Here al is not a separate register. It is a part of the eax register. To understand this, refer to the following image.
Since we are talking about the 32-bit architecture, a CPU register is 32 bits long. The least significant two bytes of eax are called the ax register, which is 16 bits long. The ax part can be divided further into two 8-bit parts, ah (the high byte) and al (the low byte). You can learn more about this in our CPU registers tutorial.
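A small sketch of how writes to al and ah affect eax, in NASM syntax and assuming a 32-bit Linux build. It also hints at why the trick is popular in shellcode: mov eax, 0x1 assembles to b8 01 00 00 00, which contains zero bytes, while mov al, 0x1 assembles to just b0 01.

```nasm
global _start

section .text
_start:
    mov eax, 0x11223344     ; eax = 0x11223344, ax = 0x3344, ah = 0x33, al = 0x44
    mov al, 0x01            ; only the lowest byte changes: eax = 0x11223301
    mov ah, 0x02            ; only bits 8 to 15 change:     eax = 0x11220201

    xor ebx, ebx            ; ebx = 0
    mov bl, al              ; bl = 0x01, so the exit status will be 1
    xor eax, eax            ; clear all 32 bits of eax first
    mov al, 0x1             ; eax = 1 (exit system call number), no zero bytes emitted
    int 0x80                ; the program exits with status 1
```

Writing to al leaves the upper 24 bits of eax untouched, which is why eax is cleared with xor before the final mov al, 0x1.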
Understand the system calls
Programs get support from the operating system to do certain things. What actually happens here is that the program calls the kernel to handle these requests. Windows systems have their own kernel, and Linux distributions use the Linux kernel. When a program wants to print a string to the screen, it loads the arguments (such as the string) and makes a specific system call. When the kernel receives the call, it does what the program asked for.
There are many system calls for doing various things, such as the exit syscall or the syscall used for printing. But how do we identify a system call? There is a unique number for every system call. When we want to use a syscall, we call it by its number.
1) Load EAX with the system call number
Every system call has a unique syscall number. Before we call the kernel to handle the system call, we must specify that number so the kernel knows which system call to execute. We store the syscall number in the EAX register.
If you are using a Linux machine, the list of system calls and their numbers can be found in the header file located at /usr/include/asm/unistd.h.
2) Load arguments into registers
A system call is like a function. (Just imagine it as a function; it is not actually one.) So when we call it, we can supply arguments. For example, when we call the exit system call we can provide a status value, which indicates whether the program completed successfully or not. With int 0x80 on 32-bit Linux, the arguments are passed in the registers EBX, ECX, EDX, ESI, EDI, and EBP, in that order.
3) Call the kernel
This is the final step in executing the system call. We use the int 0x80 instruction to pause the execution of our assembly program and hand our system call request over to the kernel.
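The three steps above can be sketched with a system call that takes several arguments. The example below prints a string with the write system call (number 4 on 32-bit x86 Linux) and then exits. It is written in NASM syntax and assumes a 32-bit build (for example nasm -f elf32 followed by ld -m elf_i386); the labels msg and len are made up for illustration.

```nasm
global _start

section .data
    msg db "Hello", 0x0a        ; hypothetical message; 0x0a is a newline
    len equ $ - msg             ; length of the message in bytes (6)

section .text
_start:
    mov eax, 0x4                ; step 1: system call number (4 = write)
    mov ebx, 0x1                ; step 2: first argument, file descriptor 1 (standard output)
    mov ecx, msg                ;         second argument, address of the bytes to print
    mov edx, len                ;         third argument, number of bytes to print
    int 0x80                    ; step 3: call the kernel

    mov eax, 0x1                ; system call number (1 = exit)
    mov ebx, 0x0                ; exit status 0
    int 0x80
```

The kernel reads the system call number from eax and the arguments from ebx, ecx, and edx, so the order of the mov instructions before int 0x80 does not matter.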
Example 1: Simple exit program
    global _start
    section .data
    section .bss
    section .text
    _start:
        mov eax, 0x1
        mov ebx, 0x5
        int 0x80
Think about the above code. It makes the exit system call, which causes the program to exit. There are three steps in this system call.
First, we copied 0x1 to the eax register. That is the syscall number: the number of the exit syscall is one. This is how we tell the kernel which syscall we want to execute.
As the second step, we copied 0x5 into ebx. What is it for? It is the status value. When a program quits in Linux, it returns a special value called the exit status, which indicates whether the program exited successfully or not. A program that exits successfully returns zero. We saw the same convention with return values in C programming. Here we returned five (0x5 in hexadecimal). The value is our choice, and the shell reports it as a number from 0 to 255. After the program has exited, we can read that value by entering echo $? in our Linux terminal.
As the third step, int 0x80 hands the request over to the kernel.
Example 2: Add two numbers
    global _start
    section .data
    section .bss
    section .text
    _start:
        mov eax, 0x1
        mov ecx, 0x2
        add eax, ecx
        mov ebx, eax
        mov eax, 0x1
        int 0x80
The first thing to note is the .data and .text sections. Variables and other data are stored in the .data section, while the actual program instructions are in the .text section. In this program there is nothing in the .data section. Because this is a very small program, we don't need it.
The first assembly instruction is mov eax, 0x1. The mov instruction is used to move a value from one place to another. Those places may be a register, a memory location, or a place on the stack. This is not the same as moving a file in Windows or Linux. When we move a file in an OS, we cut it from the source and paste it into the destination. Here we just copy the value.
You can see that the mov instruction takes two arguments. Remember that we are coding with Intel syntax, so the first one is the destination and the second one is the source. This instruction copies the hexadecimal value 0x1 into eax. The second instruction works the same way and copies 0x2 into ecx.
The next instruction is add eax, ecx. Here we see another instruction called add. It adds its two arguments and saves the result in the first argument. So after this instruction, the sum of eax and ecx (3) is saved in eax.
Now our task is complete: we have added two numbers. Finally, we want to exit the program and output the result of the calculation. We use the exit system call for this. For that, we need to save the status value in ebx and fill eax with the system call number. The result of the calculation is currently in eax, so I copied eax into ebx first. Then I set eax to 1, the syscall number for exit. Finally, I used int 0x80 to call the kernel to handle the system call.
Assemble and run a program
Now let's see how we can build a binary from assembly source code. The tool used for this process is called an assembler, and the process is called assembling. There are various assemblers such as NASM and AS. In this example, I'm using the NASM assembler.
The commands below demonstrate how to use the NASM assembler to assemble the source code.
First of all, let's see which files are available in our current working directory. We can use the ls command for this. This is purely to understand which files are created by the process.
    ~/nasm/exit$ ls
    file.asm
Now let's use nasm. We have to give the tool some arguments. The -f flag selects the format of the output file. Here I'm on an Intel-based CPU with a 64-bit word size, so I'm using elf64 as the format.
    ~/nasm/exit$ nasm -f elf64 file.asm -o file.x
I have specified the output file with the -o flag. The extension is not necessary. It can be anything, or you can even leave the file without an extension.
Now let's run the ls command again to see which files were created by the assembler.
    ~/nasm/exit$ ls
    file.asm file.x
We can see there is a file called file.x created in the same working directory. That is the assembler's output, an object file. Now we can use a linker to combine object files and build the final executable. In this example there is only a single file. The tool we are going to use is ld. As with the command above, we can specify the output file. I've named it just file.
ld file.x -o file
Now we can use the ls command again to see what was generated. Great. The output file is listed as file. As you know, executable files on Linux or macOS don't need any special extension such as .exe or .com on Windows.
    ~/nasm/exit$ ls
    file file.asm file.x
Now I run it. All we have to do is type ./ followed by the program name. Finally, I used the command echo $? to see the exit status code of the program after it ran. We can see the value 5, which is the one we gave in our source code.
    ~/nasm/exit$ ./file
    ~/nasm/exit$ echo $?
    5
Here you can see what happens when we run it. Since this program does not print anything, we use the echo $? command to see its exit status.