Assembly programming for beginners
Assembly is a low-level programming language. Low-level languages are close to the machine and hard for humans to read. We have already written some programs in languages like C, C++, and Python. In the Compiling C programs article, we talked about what happens when we compile a program: the source code is translated into a set of binary instructions. Assembly is just a human-readable representation of those binary instructions. But why do we need Assembly language?
What is Assembly programming?
Suppose there is a CPU instruction whose binary form is 10111000. An instruction may do a task like moving data from one place to another or popping a value off the stack. The hexadecimal value of 10111000 is b8. We can use either form to represent the CPU instruction, but when we compare them, the hexadecimal value is easier to remember and use.
We can go further and map a simple word to each CPU instruction. Let's think about moving some data. Imagine that the hexadecimal code for moving some data is b8. We can assign the word MOV to that instruction, so whenever we want to use b8 we can write MOV instead. That's more human-friendly than the binary and hexadecimal representations. (Actually, b8 is the opcode for moving a value into the eax register. We'll talk more about this later.)
Let's see some examples of these opcodes and Assembly instructions.
int 0x80
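To make the mapping between bytes and mnemonics concrete, here is a minimal Python sketch. Python is used only for illustration and is not part of the assembly toolchain. The OPCODES table and the decode function are hypothetical helpers that cover just three x86 opcodes: b8 (move a 4-byte value into eax), bb (move a 4-byte value into ebx), and cd (int with a 1-byte interrupt number). A real disassembler handles far more cases.

```python
# Toy decoder for three x86 opcodes. Illustrative only: the OPCODES table
# and decode() are hypothetical helpers, not a real disassembler.
OPCODES = {
    0xB8: ("mov eax", 4),  # opcode followed by a 4-byte immediate
    0xBB: ("mov ebx", 4),  # opcode followed by a 4-byte immediate
    0xCD: ("int", 1),      # opcode followed by a 1-byte interrupt number
}


def decode(code: bytes) -> list[str]:
    """Turn raw machine code into Intel-syntax assembly lines."""
    lines = []
    i = 0
    while i < len(code):
        mnemonic, imm_size = OPCODES[code[i]]
        # x86 stores immediates in little-endian byte order.
        imm = int.from_bytes(code[i + 1 : i + 1 + imm_size], "little")
        sep = ", " if mnemonic.startswith("mov") else " "
        lines.append(f"{mnemonic}{sep}{hex(imm)}")
        i += 1 + imm_size
    return lines


# The binary and hexadecimal forms name the same byte.
assert int("10111000", 2) == 0xB8

machine_code = bytes.fromhex("b801000000" "bb05000000" "cd80")
for line in decode(machine_code):
    print(line)
```

Running it prints mov eax, 0x1, then mov ebx, 0x5, then int 0x80, one per line.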
Why do we need to learn Assembly language?
What is the use of learning Assembly? In cyber security and ethical hacking, Assembly is a must. If you are going to learn reverse engineering, you need a solid understanding of assembly language. In reverse engineering, we don't have access to the source code of a program, but we can use a disassembler to get the Assembly instructions from the binary. If you know assembly, you can work out what the program does and then get an idea of the high-level code and its structure.
Assembly is also very helpful when we write shellcode. Shellcode is a set of CPU instructions delivered to a system as a payload. Since shellcode runs directly on the CPU without any compiling or linking, it consists purely of opcodes. Writing opcodes by hand is very hard, so what we do is write the shellcode in assembly and convert it to opcodes.
Sometimes we need to write programs directly in assembly, for example for microcomputers such as real-time monitoring devices. A great advantage of programs written in assembly is their performance, because we write them for a specific device with its hardware architecture in mind.
Structure of an Assembly program
By now you should have a clear idea of what assembly is and what we use it for. Now we can start our journey into Assembly language. First of all, let's look at the structure of an assembly program.
.global _start
.intel_syntax noprefix
.section .data
.section .bss
.section .text
_start:
mov eax, 0x1
mov ebx, 0x5
int 0x80
Near the top of the program there is the line .intel_syntax noprefix. This line tells the assembler which Assembly syntax we use. Here we have used the Intel syntax, which is what I use most of the time.
Below is the same program written in AT&T assembly syntax.
.globl _start
.section .data
.section .bss
.section .text
_start:
movl $1, %eax
movl $5, %ebx
int $0x80
The following are some of the differences.
AT&T syntax puts a "%" sign before a CPU register name while Intel syntax does not. For example, when referring to eax, AT&T uses %eax.
For instructions that take two operands, AT&T puts the source first and the destination second, while Intel puts the destination first and the source second. For example, to copy the value of the esp register into the ebp register, AT&T uses mov %esp, %ebp, while Intel uses mov ebp, esp.
In AT&T syntax, we write -0x4(%ebp) to access a memory location indirectly. In Intel syntax, DWORD PTR [ebp - 0x4] is the equivalent reference. You can learn more about this in the tutorial on moving data with Assembly.
Most textbooks I have seen use AT&T syntax. Many tools such as GDB and objdump also use AT&T as their default syntax. Actually, I don't know the reason for that.
You can clearly see these differences in the two programs above. AT&T uses movl where Intel uses mov. It puts a percent sign in front of register names (%eax, %ebx) and a dollar sign in front of immediate values ($1). The operand order also differs: in Intel syntax we write mov eax, 0x1 to move 1 into eax, with the destination first and the source second, while in AT&T we write movl $1, %eax, with the source first and the destination second.
Personally, I prefer Intel syntax because it reads more cleanly.
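The differences between the two syntaxes can be sketched as a tiny converter. This is only an illustration in Python under strong assumptions: intel_to_att and att_operand are hypothetical helpers, they handle only 32-bit register and immediate operands (no memory references), and they always append the l suffix to two-operand instructions.

```python
# Minimal Intel-to-AT&T converter for register/immediate operands only.
# Hypothetical helper for illustration; memory operands are not supported.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "esp", "ebp"}


def att_operand(op: str) -> str:
    """Prefix registers with % and immediate values with $."""
    op = op.strip()
    return f"%{op}" if op in REGISTERS else f"${op}"


def intel_to_att(line: str) -> str:
    """Convert e.g. 'mov eax, 0x1' to 'movl $0x1, %eax'."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [att_operand(o) for o in rest.split(",")]
    if len(operands) == 1:
        # Single-operand instructions such as int keep their mnemonic.
        return f"{mnemonic} {operands[0]}"
    # Intel order is destination, source; AT&T is source, destination.
    operands.reverse()
    return f"{mnemonic}l {', '.join(operands)}"


for intel in ["mov eax, 0x1", "mov ebp, esp", "int 0x80"]:
    print(f"{intel:15} -> {intel_to_att(intel)}")
```

For the three sample lines it produces movl $0x1, %eax, then movl %esp, %ebp, then int $0x80.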
Next in the program, we can see some sections. First, there is the .data section. We use it to store the data the program uses, such as variables, constants, and strings. Since the above program is very small and simple, it doesn't put anything in the data section. We will see how to use that section in later articles.
Next, there is the .text section. This is where we put our program instructions, such as mov eax, 0x1.
Allocating Storage for Initialized Data
Data that is given a value when it is declared is known as initialized data, and it is located in the .data section. In NASM, the following directives allocate it.
DB - Allocates a Byte
DW - Allocates 2 Bytes (This size is known as a Word)
DD - Allocates 4 Bytes (This is twice the size of a Word and is known as a Doubleword)
DQ - Allocates 8 Bytes (This is four Words and is known as a Quadword)
DT - Allocates 10 Bytes
Allocating Storage for Uninitialized Data
Data that has no value when it is declared is known as uninitialized data, and it is located in the .bss section.
RESB - Reserve a Byte
RESW - Reserve a Word
RESD - Reserve a Doubleword
RESQ - Reserve a Quadword
REST - Reserve Ten Bytes
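As a rough illustration of the sizes in the two lists above, the following Python sketch computes what each directive occupies. It assumes NASM behaviour, where a D* directive stores its value in little-endian byte order and a RES* directive takes a count of units to reserve. UNIT_SIZE, define, and reserve are hypothetical names used only for this example.

```python
# Unit sizes in bytes for the data directives listed above.
# UNIT_SIZE, define() and reserve() are hypothetical helpers.
UNIT_SIZE = {"B": 1, "W": 2, "D": 4, "Q": 8, "T": 10}


def define(directive: str, value: int) -> bytes:
    """Bytes that e.g. 'DD 0x12345678' places in .data (little-endian)."""
    size = UNIT_SIZE[directive.upper()[1:]]  # "DD" -> "D"
    return value.to_bytes(size, "little")


def reserve(directive: str, count: int) -> int:
    """Number of bytes that e.g. 'RESD 3' sets aside in .bss."""
    size = UNIT_SIZE[directive.upper()[3:]]  # "RESD" -> "D"
    return size * count


print(define("DB", 0x41).hex())        # one byte
print(define("DW", 0x1234).hex())      # two bytes, low byte first
print(define("DD", 0x12345678).hex())  # four bytes, low byte first
print(reserve("RESQ", 2))              # two quadwords
print(reserve("REST", 1))              # one ten-byte unit
```

The output is 41, 3412, 78563412, 16, and 10. The reversed byte order in the second and third lines comes from x86 being little-endian.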
Understand the system calls
Programs get support from the operating system to do some special things. What actually happens is that the program calls the kernel to handle these requests. Windows systems have their own kernel, while Linux distributions use the Linux kernel. When a program wants to print a string to the screen, it loads the arguments (such as the string) and makes a specific system call. When the kernel receives the call, it does what the program asked.
There are many system calls for doing various things, such as the exit syscall or the syscall used for printing. But how do we identify a system call? Every system call has a unique number. When we want to use a syscall, we call it by its number.
1) Load EAX with the system call number
Every system call has a unique syscall number. Before we call the kernel to handle the system call, we have to specify that number so the kernel knows which system call to execute. We store the syscall number in the EAX register.
If you are using a Linux machine, the list of system calls and their numbers can be found in the header file located at /usr/include/asm/unistd.h.
2) Load arguments into registers
A system call is like a function. (Just imagine it as a function; it is not actually one.) So when we call it, we can supply arguments. For example, when we call the exit system call we can provide a status value, which indicates whether the program completed successfully or not. With int 0x80, the arguments go in the registers EBX, ECX, EDX, ESI, EDI, and EBP, in that order. EAX is not used for an argument because it already holds the system call number. Linux system calls take at most six arguments, so these registers are enough.
3) Call the kernel
This is the final step in executing the system call. We use the int 0x80 instruction to pause the execution of our assembly program and hand over our system call request to the kernel.
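The three steps can be modelled with a toy example in Python. This does not make a real system call. SYSCALLS and int_0x80 are hypothetical stand-ins for the kernel, and the register file is just a dictionary. The numbers 1, 3, and 4 are the 32-bit Linux numbers for exit, read, and write.

```python
# A toy model of the int 0x80 convention. SYSCALLS and int_0x80() are
# hypothetical stand-ins for the kernel, used only to show the data flow.
SYSCALLS = {1: "exit", 3: "read", 4: "write"}  # 32-bit Linux numbers


def int_0x80(regs: dict) -> str:
    """Pretend to be the kernel: read eax, then the argument registers."""
    name = SYSCALLS[regs["eax"]]
    if name == "exit":
        return f"exit(status={regs['ebx']})"
    return f"{name}(fd={regs['ebx']}, buf={hex(regs['ecx'])}, count={regs['edx']})"


regs = {"eax": 0, "ebx": 0, "ecx": 0, "edx": 0}
regs["eax"] = 1        # step 1: system call number (exit)
regs["ebx"] = 5        # step 2: first argument (status value)
print(int_0x80(regs))  # step 3: hand the request to the kernel
```

This prints exit(status=5), which shows how the kernel side sees the request: the number in eax selects the call, and the remaining registers carry its arguments.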
Example 1: Simple exit program
global _start
section .data
section .bss
section .text
_start:
mov eax, 0x1
mov ebx, 0x5
int 0x80
Look at the above code. It makes the exit system call, which causes the program to exit. There are three steps in this system call.
First, we copied 0x1 to the eax register. That is the syscall number; the number of the exit syscall is one. This is how we tell the kernel which syscall we want to execute.
As the second step, we copied 0x5 into ebx. What is that for? It is the status value. When a program quits in Linux, it reports a special value called the exit status, which indicates whether the program exited successfully or not. A program that exits successfully returns zero. Yes, you are correct, we met the same convention with return values in C programming. Here we returned five (0x5). The value is our choice; we can return any number from 0 to 255. After the program has exited, we can read that value by entering echo $? in our Linux terminal.
Finally, as the third step, we executed int 0x80 to hand the request over to the kernel.
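The exit status is visible to whichever process started the program, not only to the shell. As a small sketch, the Python snippet below starts a child process that exits with a chosen status and reads it back, which is the same information echo $? shows. The function name run_and_get_status is made up for this example, and the child is a Python one-liner rather than our assembly program.

```python
import subprocess
import sys


def run_and_get_status(status: int) -> int:
    """Start a child process that exits with `status`; return what the parent sees."""
    child = subprocess.run(
        [sys.executable, "-c", f"import sys; sys.exit({status})"]
    )
    return child.returncode


print(run_and_get_status(5))  # same value as in the assembly example
print(run_and_get_status(0))  # zero means success by convention
```

It prints 5 and then 0.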
Example 2: Add two numbers
global _start
section .data
section .bss
section .text
_start:
mov eax, 0x1
mov ecx, 0x2
add eax, ecx
mov ebx, eax
mov eax, 0x1
int 0x80
The first thing to note is the .data and .text sections. Variables and other data are stored in the .data section, while the actual program instructions are in the .text section. In this program, the .data section is empty because such a small program doesn't need it.
The first assembly instruction is mov eax, 0x1. The mov instruction moves a value from one place to another. Those places may be a register, a memory location, or a place on the stack. This is not like moving a file in Windows or Linux, though. When we move a file in an OS, we cut the source and paste it into the destination. Here we just copy the value, and the source keeps it.
Here you can see that the mov instruction takes two operands. Remember that we are coding in Intel syntax, so the first one is the destination and the second one is the source. This instruction copies the hexadecimal value 0x1 into eax. The second instruction works the same way and copies 0x2 into ecx.
The next instruction is add eax, ecx. Here we meet another instruction called add. It adds its two operands and saves the result in the first one. So after this instruction, the sum of eax and ecx is saved in eax.
Now our task is complete: we have added two numbers. Finally, we want to exit the program and output the result of the calculation. We use the exit system call for this, so we need to put the status value in ebx and the system call number in eax. The result of the calculation is currently in eax, so I first copied eax into ebx. Then I set eax to 1, the syscall number for exit. Finally, I used int 0x80 to call the kernel to handle the system call.
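To trace the register values step by step, here is a minimal simulation in Python. It is a sketch under simplifying assumptions: the hypothetical run function understands only mov and add with register or immediate operands, and it stops at int, where the real program would hand control to the kernel.

```python
def run(program: list[str]) -> dict:
    """Execute a few Intel-syntax mov/add lines on a toy register file."""
    regs = {"eax": 0, "ebx": 0, "ecx": 0, "edx": 0}
    for line in program:
        mnemonic, _, rest = line.partition(" ")
        if mnemonic == "int":
            break  # the kernel takes over here
        dst, src = [part.strip() for part in rest.split(",")]
        # A source operand is either a register name or an immediate value.
        value = regs[src] if src in regs else int(src, 0)
        if mnemonic == "mov":
            regs[dst] = value
        elif mnemonic == "add":
            # Registers are 32 bits wide, so the sum wraps around.
            regs[dst] = (regs[dst] + value) & 0xFFFFFFFF
        else:
            raise ValueError(f"unsupported instruction: {line}")
    return regs


example_2 = [
    "mov eax, 0x1",
    "mov ecx, 0x2",
    "add eax, ecx",
    "mov ebx, eax",
    "mov eax, 0x1",
    "int 0x80",
]
print(run(example_2))
```

At the point of int 0x80 the simulated registers hold eax = 1 (the exit syscall number) and ebx = 3 (the sum, used as the exit status), so running the real program and then echo $? should show 3.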
Assemble and run a program
Now let's see how to build a binary from assembly source code. The tool used for this is called an assembler, and the process is called assembling. There are various assemblers such as NASM and AS. In this example, I'm using the NASM assembler.
In the commands below, I demonstrate how to use the NASM assembler to assemble the source code.
First of all, let's see which files are in our current working directory. We can use the ls command for that.
thilan@ubuntu:~/nasm/exit$ ls
file.asm
Now let us use nasm. We have to give the tool some arguments. The -f flag selects the output format of the object file. Here I'm on a 64-bit Intel-based CPU, so I'm using elf64 as the format.
thilan@ubuntu:~/nasm/exit$ nasm -f elf64 file.asm -o file.x
I have specified the output file with the -o flag. The extension doesn't matter here. It can be anything, or you can even leave it out.
Now let's run the ls command again to see which files the assembler created.
thilan@ubuntu:~/nasm/exit$ ls
file.asm file.x
We can see there is a file called file.x in the same working directory. That is our object file. Now we can use a linker to combine object files and build the final executable. In this example, there is only a single file. The tool we are going to use is ld. As with the command above, we can specify the output file. I've named it just file.
ld file.x -o file
Now we can use the ls command again to see what was generated. Great. The output is listed as file. As you know, executable files on Linux or Mac don't need a special extension such as the exe or com used on Windows.
thilan@ubuntu:~/nasm/exit$ ls
file file.asm file.x
Now I run it. All we have to do is type ./ followed by the program name. Then I used the command echo $? to see the exit status of the program after it ran. We can see the value 5, which is what we set in our source code.
thilan@ubuntu:~/nasm/exit$ ./file
thilan@ubuntu:~/nasm/exit$ echo $?
5
Since this program does not print anything, the exit status shown by echo $? is the only output we can observe.