Jul 02, 2020

Reverse engineering tutorial for beginners

So you want to learn reverse engineering. That's great. Reverse engineering (RE) is used in areas such as malware analysis, exploit development, and software cracking. In this document, we are going to work through a reverse engineering example. First we write a simple program in C, then we disassemble it and try to understand it at the assembly level.

What is reverse engineering and why do we use it?

Before we continue to the reversing part, let's clear up some basic ideas. Reverse engineering is the process of disassembling a binary and understanding the structure of that program. You can refer to the "Compiling C programs" article to see what happens when we compile a program. In short, the procedure is as follows.

First, we write the code in a language like C or C++. Let's assume we write code to print something on the screen. We can use functions like printf() or putchar(). The C programming language tells us how to use those functions and which data we should supply.

Next, we use a compiler to build a binary from the source code. A binary is a collection of machine instructions. There are various machine instructions like MOV, SUB, ADD, etc. Each of these instructions does a specific task. For example, we use the MOV instruction to copy data from one place to another. So how do we identify these machine instructions? Every instruction is encoded as a unique number (or code) called an opcode.

Let's take the instruction INT 0x80. On 32-bit Linux this instruction is commonly used to hand control to the kernel. Its encoding is cd 80: cd represents the INT instruction and 80 is the argument (or operand). Now think about MOV ECX, ESP. It copies the value of the ESP register into ECX. Its encoding is 89 e1. Here 89 is the opcode for a MOV that stores a 32-bit register into a register or memory location, and e1 is a second byte that encodes both operands: ESP as the source and ECX as the destination. We'll talk more about opcodes in our shellcoding tutorials. Until then, a rough idea is enough.

But how does the compiler generate these machine instructions? (Keep in mind that a compiler is also a program written in some language.) The compiler knows how to build assembly instructions for a task. For example, if a high-level program adds two numbers, the compiler builds a set of assembly instructions to do the same task. It'll copy the two numbers into two registers and add them.

In the following image, you can see a memory layout.

There are rows of bytes, and one byte is equal to 8 bits. Both CPU instructions and data are stored in memory. In the upper left you can see some stored CPU instructions. In the bottom right you can see a set of data bytes such as 00, 6f, 6c, etc. Because Intel systems are little-endian, a multi-byte value is stored with its least significant byte first, so when memory is displayed as multi-byte words the characters of the string appear in reverse order.

Now the compilation process is done. So what is reverse engineering? A compiled binary contains only machine instructions in the form of opcodes, so we can't simply read the source code from it. But a disassembler can extract those opcodes from the binary and translate them into the corresponding assembly instructions. It is very hard to understand a program by looking at raw opcodes, but assembly instructions are a little clearer and closer to human language: words like MOV and ADD are more readable than bytes like 5f and 4c. We can read the generated assembly and work out what the source code does. I hope you now have a basic idea of reverse engineering.

So what can we do with reverse engineering? In the malware analysis industry, anti-virus researchers use reverse engineering to understand the behavior of malware. They don't have access to the malware's source code, so they disassemble the binary and look at what it does. Sometimes they find vulnerabilities and weak points in the malware. Then they write a patch and release it. This is an interesting topic, and we'll talk more about malware in future articles.

In exploit development, we reverse a program to find vulnerabilities. If we find one, we can write an exploit to take advantage of it.

A simple program in C

First we write a simple C program. I think you can read the code and determine what it does.


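#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atoi, exit */

int main(int argc, char const *argv[])
{
	/* Expect exactly one command-line argument. */
	if (argc != 2)
	{
		printf("Please input a number\n");
		exit(1);
	}

	/* atoi() converts the argument string to an integer. */
	if (atoi(argv[1]) == 0)
	{
		printf("Input number is zero\n");
	}else{
		printf("Input number is non-zero\n");
	}

	return 0;
}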

Here you can see I used two if statements. First I check whether argc is 2. If argc is not equal to 2, we know the user has not passed exactly one command-line argument, so we show an error message and exit the program. If the user has given a number as an argument, we continue.

After that check, we convert the input string to an integer using the atoi() function. You may know that it stands for ASCII to integer (this is why we included the stdlib.h header file). Then we check whether the number the user entered is equal to zero, and use printf() to display the result. A very simple program. Now we compile it on a Linux machine with GCC. In this example we are using a Linux distribution, but the theory is the same on every OS. After we learn this we can understand assembly instructions on other platforms too.

~/programming/c$ gcc if.c -o if -mpreferred-stack-boundary=2

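I used an additional GCC argument called -mpreferred-stack-boundary=2. It tells GCC to keep the stack aligned to 2^2 = 4 bytes instead of the default 16, so the compiler leaves out the extra stack alignment and padding instructions and the disassembly is easier to read.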

Let's run it and see what happens.


~/programming/c$ ./if
Please input a number
~/programming/c$ ./if 2
Input number is non-zero
~/programming/c$ ./if 0
Input number is zero

It works differently when we supply different inputs.

Disassembling the binary

Now we can use GDB to examine the inner workings of the program.


~/programming/c$ gdb -q ./if

Here is the disassembly of the main function.

Dump of assembler code for function main:
0x08048424 <main+0>: push ebp
0x08048425 <main+1>: mov ebp,esp
0x08048427 <main+3>: sub esp,0x4
0x0804842a <main+6>: cmp DWORD PTR [ebp+0x8],0x2
0x0804842e <main+10>: je 0x8048448 <main+36>
0x08048430 <main+12>: mov DWORD PTR [esp],0x8048540
0x08048437 <main+19>: call 0x8048350 <puts@plt>
0x0804843c <main+24>: mov DWORD PTR [esp],0x1
0x08048443 <main+31>: call 0x8048360 <exit@plt>
0x08048448 <main+36>: mov eax,DWORD PTR [ebp+0xc]
0x0804844b <main+39>: add eax,0x4
0x0804844e <main+42>: mov eax,DWORD PTR [eax]
0x08048450 <main+44>: mov DWORD PTR [esp],eax
0x08048453 <main+47>: call 0x8048340 <atoi@plt>
0x08048458 <main+52>: test eax,eax
0x0804845a <main+54>: jne 0x804846a <main+70>
0x0804845c <main+56>: mov DWORD PTR [esp],0x8048556
0x08048463 <main+63>: call 0x8048350 <puts@plt>
0x08048468 <main+68>: jmp 0x8048476 <main+82>
0x0804846a <main+70>: mov DWORD PTR [esp],0x804856b
0x08048471 <main+77>: call 0x8048350 <puts@plt>
0x08048476 <main+82>: mov eax,0x0
0x0804847b <main+87>: leave
0x0804847c <main+88>: ret
End of assembler dump.

If you compile and disassemble the binary on a different machine, you may not see exactly the same disassembly as above. That is because different compilers, versions, and settings generate different code. But the main parts and the logic stay the same.

The push ebp, mov ebp,esp and sub esp,0x4 instructions are added by the compiler; they form the function prologue. I won't explain them in depth here because I posted separate tutorials on the function prologue, function epilogue, etc.

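You can see a sub esp,0x4 instruction above. What does it do? It reserves 4 bytes on the stack. The main function shown above has no local variables, so the compiler uses this slot to pass the single argument to each function it calls, which is why you will see a value written to [esp] before every call.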

Understanding the logic of the program

Let's focus on the following couple of assembly instructions.

0x0804842a <main+6>: cmp DWORD PTR [ebp+0x8],0x2
0x0804842e <main+10>: je 0x8048448 <main+36>

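First of all, let's clear up what DWORD PTR [ebp+0x8] is. You know the main function expects two arguments called argc and argv. At the assembly level, we can access them at fixed offsets from ebp: after the prologue, [ebp] holds the saved ebp and [ebp+0x4] holds the return address, so ebp+0x8 is argc and ebp+0xc is argv. DWORD PTR means we are working with a 4-byte value at that address.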

Next, we use the cmp instruction with DWORD PTR [ebp+0x8] and 0x2 as operands. The cmp instruction compares two values by subtracting the second from the first; it throws the difference away and records the outcome in the EFLAGS register. As you know, the EFLAGS register is 4 bytes (32 bits) long. Individual bits in it act as flags, and each flag can be set or cleared. If the two operands of cmp are equal, the zero flag (ZF) in EFLAGS is set; if they are not equal, ZF is cleared. If you want to learn more about the EFLAGS register read this document.

Now what does the je 0x8048448 instruction do? je stands for "jump if equal". It depends entirely on the previous comparison: it jumps to the given address if the two operands were equal. But how does je know the result of the previous instruction? It looks in the EFLAGS register and checks whether ZF is set. If the condition is met, execution jumps to the given memory address (so the next instruction is at 0x8048448). If the condition is not met, execution continues in the normal flow (the next instruction is at 0x08048430).

So what is happening here is the following. If we don't supply an argument, argc is 1, so cmp finds that argc is not equal to 2 and leaves ZF cleared in the EFLAGS register. The je instruction then looks at ZF, sees that the condition is not met, and doesn't jump to the given address. So the four instructions below are executed.

0x08048430 <main+12>: mov DWORD PTR [esp],0x8048540
0x08048437 <main+19>: call 0x8048350 <puts@plt>
0x0804843c <main+24>: mov DWORD PTR [esp],0x1
0x08048443 <main+31>: call 0x8048360 <exit@plt>

What they do is simply print an error message and exit the program. We can find the error message string by examining memory address 0x8048540.


(gdb) x/s 0x8048540
0x8048540: "Please input a number"

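We store this memory address at the top of the stack and call the puts function. But why? puts needs one argument (a pointer to a string), and it is passed on the stack. Our source code called printf(), but the compiler is free to replace a printf() of a constant string ending in a newline with puts(), which appends the newline itself. That is also why the string above has no trailing newline. After that, we store 0x1 at the top of the stack (this is the exit status) and call the exit function.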

What if we supply a number as an argument to the program? Since argc is now 2, the cmp instruction sets ZF in the EFLAGS register and je redirects execution to 0x8048448. So the following set of instructions will be executed.

0x08048448 <main+36>: mov eax,DWORD PTR [ebp+0xc]
0x0804844b <main+39>: add eax,0x4
0x0804844e <main+42>: mov eax,DWORD PTR [eax]
0x08048450 <main+44>: mov DWORD PTR [esp],eax
0x08048453 <main+47>: call 0x8048340 <atoi@plt>
0x08048458 <main+52>: test eax,eax
0x0804845a <main+54>: jne 0x804846a <main+70>
0x0804845c <main+56>: mov DWORD PTR [esp],0x8048556
0x08048463 <main+63>: call 0x8048350 <puts@plt>
0x08048468 <main+68>: jmp 0x8048476 <main+82>
0x0804846a <main+70>: mov DWORD PTR [esp],0x804856b
0x08048471 <main+77>: call 0x8048350 <puts@plt>
0x08048476 <main+82>: mov eax,0x0
0x0804847b <main+87>: leave
0x0804847c <main+88>: ret

So at this point our first if statement is over. It decided the flow of the program.

Now let's focus on the next if statement.

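The following set of assembly instructions converts our input to an integer. Do you remember we learned in our C programming tutorial that argv holds arguments in string form? That is why we used the atoi (ASCII to integer) function to convert it to an integer. The add eax,0x4 steps from argv[0] to argv[1], since each pointer is 4 bytes on this 32-bit system, and the mov that follows loads the string pointer stored there, which is then placed on the stack as the argument to atoi.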

0x08048448 <main+36>: mov eax,DWORD PTR [ebp+0xc]
0x0804844b <main+39>: add eax,0x4
0x0804844e <main+42>: mov eax,DWORD PTR [eax]
0x08048450 <main+44>: mov DWORD PTR [esp],eax
0x08048453 <main+47>: call 0x8048340 <atoi@plt>

Now eax holds our input in integer form, because atoi returns its result in eax. As the next step, we check whether eax is zero. But how do we do that at the assembly level? Let's see.

0x08048458 <main+52>: test eax,eax
0x0804845a <main+54>: jne 0x804846a <main+70>

What do the above two instructions do? test is another assembly instruction that takes two operands. It performs a bitwise AND of them, discards the result, and updates the flags. ANDing eax with itself gives eax, so test eax,eax sets ZF (ZF will be 1) if eax is zero. If eax is not zero, test clears ZF (its value will be 0).

Now the next instruction is jne. What does it do? jne stands for "jump if not equal". You may remember that je jumps to the given address if ZF is set (ZF's value is 1). jne is the opposite of je, so it jumps to the given location if ZF is not set.

Let's assume our input number is zero. Now eax holds zero, so test eax,eax sets ZF. jne checks ZF and finds that its value is 1, so it doesn't jump to the given address and execution continues in the normal flow. The following instructions are executed next. By now you can read them and understand what they do.

0x0804845c <main+56>: mov DWORD PTR [esp],0x8048556
0x08048463 <main+63>: call 0x8048350 <puts@plt>
0x08048468 <main+68>: jmp 0x8048476 <main+82>

Let's examine what is at 0x8048556.


(gdb) x/s 0x8048556
0x8048556: "Input number is zero"

Yes. It prints the string we expected and jumps to 0x8048476. What's at 0x8048476?

0x08048476 <main+82>: mov eax,0x0
0x0804847b <main+87>: leave
0x0804847c <main+88>: ret

Here the program exits normally.

Now, what if our input is not zero? The value of eax is not zero, so test eax,eax doesn't set ZF, and jne jumps to the given memory address. The following set of instructions will be executed.

0x0804846a <main+70>: mov DWORD PTR [esp],0x804856b
0x08048471 <main+77>: call 0x8048350 <puts@plt>
0x08048476 <main+82>: mov eax,0x0
0x0804847b <main+87>: leave
0x0804847c <main+88>: ret

That path also prints a string. Let's examine it too.


(gdb) x/s 0x804856b
0x804856b: "Input number is non-zero"

Yes. Everything happens as expected. After printing it, the program exits normally.

Ok, guys. Now I think you understand many things in this document. I'll write more interesting stuff on these topics. Thanks for reading.
