The point of a disassembler is to take an input series of bytes and output an architecture-specific interpretation of those bytes. For example, a typical disassembler targeting the x86 architecture will take the following bytes: 55 8B EC B8 FF 00 00 00 33 DB 93, and produce a readable representation of those bytes similar to below:
55 push ebp 8B EC mov ebp, esp B8 FF 00 00 00 mov eax, 0FFh 33 DB xor ebx,ebx 93 xchg eax,ebx
The process involves looking at the opcode(s), getting the instruction length, parsing out extra information in the instruction such as displacements, relative/absolute destinations, register/memory affected, etc. — basically a large amount of lookups and parsing. Fortunately, there are libraries for this. The disassembly engine used in this example will be BeaEngine due to its simplicity. Capstone Engine is also a great engine that supports many architectures, a clean and thread-safe API, and a permissive license among other things. After all of this is implemented, the actual challenge of parsing executable files comes into play. This issue will be the topic of this post.
There are two common ways of disassembling a file: linearly and recursively. In the case of linear disassembly, the disassembler begins reading the first instruction at an address in the binary and continues reading until some termination condition, a termination condition being a set amount of instructions decoded, the end of a block, or an error condition such as an unknown opcode. The code for linear disassembly is straightforward and is shown below. The termination condition in the example code will stop printing when a RET instruction is hit.
DISASM disasm = { 0 }; disasm.EIP = (UIntPtr)pStartingAddress; int iLength = UNKNOWN_OPCODE; do { iLength = DisasmFnc(&disasm); fprintf(stdout, "0x%X -- %s\n", disasm.EIP, disasm.CompleteInstr); disasm.EIP += iLength; } while (!IsRet(disasm.Instruction) && iLength != UNKNOWN_OPCODE); |
The “algorithm” is (very) easy to write, and with knowledge into the format of the file being disassembled proves to be pretty reliable. For example, the Portable Executable (PE) format on Windows provides information on all executable sections and their sizes on disk and in memory with alignment. The ELF format on Linux provides the same relevant information. Using this information, a disassembler knows the exact range to disassemble to produce reliable output. The major drawback with this technique is that there is no reliable way to separate useless code from executing code. Any unused code/data inserted intentionally (or not) into the target area to disassemble will be listed. Looking at this in an assembly dump usually sticks out because the instructions will be nonsensical relative to surrounding code. Also any use of instruction interleaving, i.e. a jump into the middle of an instruction — usually for obfuscation purposes — will be missed by the disassembler.
The second type of way to disassemble a file is to do it recursively, that is to say that the disassembler will (try to) follow the control path of the actual program. The involves analyzing the destinations of any control flow instructions: calls, jumps, and returns. For every CALL instruction encountered, the address of the next instruction must be pushed on a stack, and the disassembly continues on at the CALL address. This continues on, recursively if need be for multiple CALLs, until a RET instruction is hit. Once a RET instruction is hit, the top of the call stack is popped off and disassembly continues on from that point. This is pretty much exactly how execution happens in a program. Also, for every unconditional jump instruction, the disassembly merely continues at the target destination. The sample code is a bit more complex, but not by much
DISASM disasm = { 0 }; disasm.EIP = (UIntPtr)pStartingAddress; int iLength = UNKNOWN_OPCODE; do { iLength = DisasmFnc(&disasm); fprintf(stdout, "0x%X -- %s\n", disasm.EIP, disasm.CompleteInstr); if (IsCall(disasm.Instruction)) { m_retStack.push(disasm.EIP + iLength); disasm.EIP = ResolveAddress(disasm); } else if (IsJump(disasm.Instruction)) { disasm.EIP = ResolveAddress(disasm); } else if (IsRet(disasm.Instruction)) { if (!m_retStack.empty()) { disasm.EIP = m_retStack.top(); m_retStack.pop(); } else { break; } } else { disasm.EIP += iLength; } } while (iLength != UNKNOWN_OPCODE); |
This technique has its own benefits and drawbacks. The major benefit is that (theoretically) only exectuable code will be disassembled. This means that only relevant and executing code will be shown to the user. Also, the approximate or exact number of instructions to disassemble does not need to be known like in the linear technique. With recursive disassembly, you provide starting set(s) of instructions and then begin tracing control flow into those. Obfuscation techniques such as instruction interleaving will also be discovered. This technique does have a major drawback, however. CALLs or JMPs made indirectly cannot be deciphered. For example, the destinations of instructions such as JMP [ESI+0x4], CALL EBX, CALL [0xAABBCCDD] where 0xAABBCCDD contains an import fixed up at runtime, and so on, cannot be followed with the disassembler. This means that there are a lot of edge cases to consider when encountering instructions such as these in terms of knowing where to go next and making sure that the call stack is consistent.
The sample code provides a trivial implementation of both of these techniques. To see how it performs, there are also two functions provided. TestFunction1 demonstrates how a recursive disassembler follows control flow. Compare the two outputs:
Linear
0x1146670 -- call dword ptr [0114B008h] 0x1146676 -- ret
Recursive
0x1146670 -- call dword ptr [0114B008h] 0x754218E0 -- mov eax, dword ptr fs:[00000018h] 0x754218E6 -- mov eax, dword ptr [eax+24h] 0x754218E9 -- ret 0x1146676 -- ret
The second example, TestFunction2, shows how the recursive disassembler skips over instructions that are not executed.
0x66680 -- push ebp 0x66681 -- mov ebp, esp 0x66683 -- mov eax, 000000FFh 0x66688 -- call 000666AAh 0x6668D -- xor ebx, ebx 0x6668F -- xchg eax, ebx 0x66690 -- jmp 000666B1h 0x66692 -- cmp ecx, AABBCCDDh 0x66698 -- push 00000000h 0x6669A -- push 00000000h 0x6669C -- push 00000000h 0x6669E -- push 00000000h 0x666A0 -- call dword ptr [0006B0A0h] 0x666A6 -- pop ebp 0x666A7 -- mov esp, ebp 0x666A9 -- ret
Overall, each approach has its benefits and drawbacks. With good knowledge of an executable files format, a linear disassembler works perfectly fine for showing a disassembly listing. Typically, disassemblers with a focus on code analysis, i.e. IDA Pro, will use a recursive approach and have a sophisticated analysis engine to complement it.
The Visual Studio 2015 RC project for this example can be found here. The source code is viewable on Github here.
Follow on Twitter for more updates
Good content, but I dont understand why you chose BeaEngine for this tutorial. Capstone API is way more clean & deadly simple, making it better for educated post like this.
A huge problem with BeaEngine is that it misses few hundred X86 instructions, and also decodes many instruction wrongly. So much that some important projects had to abandon it (to switch to Capstone)
Comment by fuji — July 24, 2015 @ 12:54 AM
@fuji
Thank you; a very fair criticism. I used BeaEngine here because I distributed the code with a DLL build of the engine and it was a matter of convenience. BeaEngine has only one function – Disasm – but I’d have to get cs_{open, disasm, free, close} with Capstone. However, you raise good points about Capstone being more accurate in its disassembly and given the very small amount of code in this post, I’ll end up going back at some point soon and rewriting it using Capstone. I’ll add an additional disclaimer giving preference to Capstone due to the accuracy issues that you mentioned as well.
Comment by admin — July 24, 2015 @ 4:24 PM