Updated May 10, 2023
Introduction to Reverse Engineering with Python
Python is a high-level, interpreted language. It is often called a scripting language, but that does not make it any less of a programming language; it simply sits a long way from machine code. So what makes Python so interesting? If you have any experience in pen-testing, or have talked with web security experts or malware analysts, you will have noticed that many of them suggest Python as their primary language for developing malware or exploits.
While some people may prefer C, C++, or Perl, I would vote for Python, because it is useful not only for writing a program but also for breaking one.
What is Reverse Engineering?
Reverse engineering is a very broad concept and cannot be captured in a one-line definition. The core idea is to break a piece of code down into simpler parts, understand it, modify and enhance it for your own purpose, and then reassemble it to fit your needs. To make this more concrete, here is a very common example.
Take an Android phone. Manufacturers create a stock ROM and ship it to their customers, but it often contains a lot of bloatware and becomes laggy. So people on websites like XDA and Android Central reverse engineer their way into the ROM, enhance it, and make it run smoothly. One practical example is the CyanogenMod ROM.
That was just an example to show what the term means. Reverse engineering software follows the same idea but is far more complicated than modifying a ROM.
Compilation and Python
If you have experience with Python, you know that a Python script, whether it is a virus, payload, trojan, or any other file, only works on computers where Python is installed. Say I have written a trojan that can bypass any antivirus and I deploy it on a Windows system. If that system does not have the Python interpreter installed, the script will not run. So the script, together with everything it depends on, has to be packaged into an executable (loosely called compiling it) before it is deployed on the Windows system.
Reverse Engineering Windows Executables
Now that you know Python scripts must be converted before they can run on a Windows machine without Python, you need a tool that turns Python scripts into an executable. One such tool is py2exe, a packaging tool that converts Python scripts into standalone Windows programs. There are also tools that go the other way and turn Windows executables built from Python back into Python scripts. One of them is PyInstaller Exe Rebuilder.
PyInstaller Exe Rebuilder is a tool to recompile/reverse engineer PyInstaller-generated executables without having access to the source code. When you launch the EXE, its contents are unpacked at run time. This includes the .pyc files (Python code compiled to bytecode). Basically, tools like PyInstaller and py2exe package libraries and dependencies together so you can run the ‘standalone’ EXE without having to download them or prepare the machine with a Python interpreter.
Another toolkit takes you very close to the source code: pyREtic, a toolkit for reverse engineering obfuscated Python bytecode. It allows you to turn an object in memory back into source code without directly accessing the bytecode on disk. This is useful when the application’s .pyc files on disk are obfuscated in one of many ways.
Reverse Engineering The Hard Way
The approach above is easy to understand and apply when you have at least a basic knowledge of Python. But it is not always enough. Sometimes the Python scripts have no documentation or comments, and there are too many files to understand all by yourself. There is an excellent book on this problem, although I will not go into it in much detail here.
The book is “Working Effectively with Legacy Code” by Michael Feathers. It is not tied to Python or any other language and gives you an approach to reverse engineering code in almost any language. When trying to understand a piece of code, the key question is why you want to understand it.
The approach differs depending on whether you want to reverse engineer the code to modify it or to port it. In both cases, instrumenting the legacy code with a battery of scaffolding tests and with tracing/logging is the crucial path on the long, hard slog towards understanding the code and modifying it safely and responsibly.
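The tracing/logging idea can be sketched with a small decorator. This is only an illustration and is not taken from the book; the names traced and legacy_discount are hypothetical, and legacy_discount stands in for any undocumented function you are trying to understand.

```python
import functools
import logging

logger = logging.getLogger("legacy_trace")


def traced(func):
    """Log every call to func together with its arguments and result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        logger.info("return %s -> %r", func.__name__, result)
        return result
    return wrapper


@traced
def legacy_discount(price, rate):
    # Hypothetical stand-in for an undocumented legacy function.
    return price - price * rate


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    legacy_discount(200, 0.25)
```

Because the wrapper does not change the return value, it can be added to one function at a time and removed again once the behaviour is understood and covered by a test.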
Reverse Engineering Tools
There is another method that makes this a bit easier, and you can use it alongside the steps above. There is a tool called Epydoc. It inspects the code and generates documentation for it. The result will not be as good as hand-written documentation, but it will at least give you an idea of how the code works. From there, you can start writing your own documentation. After documenting part of the code, you can run the tool again to regenerate the documentation for the remaining part.
You can also use an IDE to analyze the code. An IDE typically gives you code completion, but more importantly, in this case, it lets you ctrl-click on a variable to see where it comes from. This really speeds things up when you want to understand other people’s code.
You also need to learn a debugger. In tricky parts of the code, you have to step through them in a debugger to see what the code does. Python’s pdb works, but many IDEs have integrated debuggers that make debugging easier. Pyreverse from Logilab and PyNSource from Andy Bulka are helpful, too, for UML diagram generation.
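To give a feel for what a documentation generator does, here is a minimal sketch that builds a one-line-per-item outline of a module. It uses the standard inspect module as a stand-in and is not Epydoc’s own API; the function name outline is hypothetical, and json is used only because it is always available.

```python
import inspect
import json  # any importable module works; json is only an example target


def outline(module):
    """Return one text line per public function or class in module."""
    lines = []
    for name, obj in inspect.getmembers(module):
        if name.startswith("_"):
            continue
        if inspect.isfunction(obj) or inspect.isclass(obj):
            try:
                signature = str(inspect.signature(obj))
            except (TypeError, ValueError):
                signature = "(...)"  # some objects expose no signature
            doc = inspect.getdoc(obj) or "TODO: undocumented"
            summary = doc.splitlines()[0]
            lines.append(f"{name}{signature}: {summary}")
    return lines


if __name__ == "__main__":
    print("\n".join(outline(json)))
```

Items that come back marked as undocumented are a natural place to start writing your own notes.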
Bringing code content into a visual UML model helps programmers and software engineers review an implementation, identify potential bugs or deficiencies, and look for possible improvements. A UML class model can be produced directly from source code, so you can reverse a snapshot of your code base into UML classes and then build class diagrams from them.
Apart from this, developers may reverse a code library into UML classes and construct a model with them, for example by reversing a generic collection framework and developing their own framework by extending the generic one. The same approach applies to Python source code.
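The first step of such a tool, extracting classes, base classes, and methods from source code, can be sketched with the standard ast module. This is an assumption-laden illustration and not how Pyreverse or PyNSource are implemented; SOURCE and class_model are hypothetical names, and ast.unparse needs Python 3.9 or newer.

```python
import ast

# Hypothetical source code to analyze.
SOURCE = '''
class Shape:
    def area(self):
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.14159 * self.radius ** 2
'''


def class_model(source):
    """Map each class name to its base classes and method names."""
    model = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            bases = [ast.unparse(base) for base in node.bases]
            methods = [item.name for item in node.body
                       if isinstance(item, ast.FunctionDef)]
            model[node.name] = {"bases": bases, "methods": methods}
    return model


if __name__ == "__main__":
    for name, info in class_model(SOURCE).items():
        print(name, info["bases"], info["methods"])
```

The resulting dictionary holds the same information a class diagram shows: one box per class, an inheritance arrow per base, and a list of operations.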
Objects and Primers
To fully understand Python’s inner workings, you should first become familiar with how Python compiles and executes code. When code is compiled in Python, the result is a code object. A code object is immutable and contains all the information the interpreter needs to run the code. A byte code instruction is an opcode value followed by arguments when required. In older versions of CPython the opcode took one byte and the argument bytes were optional, while CPython 3.6 and later use a fixed two bytes per instruction. Data is referenced using an index into one of the other properties of the code object, such as its tuple of constants or its tuple of names.
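As a minimal sketch, the code object of any function can be inspected through its __code__ attribute. The function add_tax is a hypothetical example; the exact contents of co_consts and co_code depend on the CPython version, so no output is shown.

```python
def add_tax(price):
    rate = 0.2
    return price + price * rate


code = add_tax.__code__

print(type(code).__name__)   # the code object type
print(code.co_varnames)      # local variable names, including arguments
print(code.co_consts)        # constants that the byte code refers to by index
print(len(code.co_code))     # length of the raw byte code string

# Code objects are immutable: assigning to an attribute fails.
try:
    code.co_consts = ()
except AttributeError:
    is_immutable = True
else:
    is_immutable = False
print(is_immutable)
```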
A byte code string looks like this:
\x64\x02\x64\x08\x66\x02
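Raw byte strings like this are hard to read directly. As a minimal sketch, the standard dis module can list the instructions of a function in readable form; make_pair is a hypothetical example, and the exact opcode names and numbers vary between CPython versions.

```python
import dis


def make_pair(first, second):
    return (first, second)


# Each instruction has a byte offset, a readable opcode name and an argument.
for instruction in dis.get_instructions(make_pair):
    print(instruction.offset, instruction.opname, instruction.argrepr)

opnames = [instruction.opname for instruction in dis.get_instructions(make_pair)]
```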
Python byte code operates on a stack of items: instructions push values onto the stack and pop them off again. A more ambitious step is to decompile the byte code back into readable Python source code, complete with object and function names. Python code can be distributed in binary form by using the marshal module. This module serializes and deserializes code objects with its dump and load functions (dumps and loads for in-memory bytes).
The most commonly encountered binary format is a compiled Python file (.pyc), which contains a magic number, a timestamp, and a serialized code object. Recent CPython versions add further header fields, such as a flags word and the source size. The Python interpreter usually produces this file type as a cache of the compiled object to avoid having to parse the source multiple times. The reverse engineering techniques described here rely on this ease of access to byte code and type information.
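A minimal round trip through marshal looks like the sketch below. The source string and the file name "<example>" are placeholders; note that the marshal format is specific to the Python version that produced it.

```python
import marshal

source = "result = 6 * 7"
code = compile(source, "<example>", "exec")

blob = marshal.dumps(code)        # code object -> bytes
restored = marshal.loads(blob)    # bytes -> code object

# Run the restored code object in a fresh namespace.
namespace = {}
exec(restored, namespace)
print(namespace["result"])
```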
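The layout of a .pyc file can be checked with a short sketch. It assumes CPython 3.7 or newer, where the header is 16 bytes long; older versions use a shorter header, so the offset below would need adjusting. The file name hello.py and its contents are hypothetical.

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "hello.py")
    with open(source_path, "w") as handle:
        handle.write("greeting = 'hello'\n")

    # Compile the source file to an explicit .pyc path.
    pyc_path = py_compile.compile(
        source_path, cfile=os.path.join(workdir, "hello.pyc")
    )

    with open(pyc_path, "rb") as handle:
        data = handle.read()

magic = data[:4]                  # magic number identifying the Python version
# Assumption: 16-byte header (CPython 3.7+), then the marshalled code object.
code = marshal.loads(data[16:])

namespace = {}
exec(code, namespace)
print(magic == importlib.util.MAGIC_NUMBER, namespace["greeting"])
```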
With access to a code object’s byte code, code logic can be modified or even replaced entirely. Extracting type information can aid in understanding the program’s design and in identifying the purpose of functions and objects.
The obfuscation and hardening of application byte code will always be a race between the implementers and those seeking to break it. To defend against byte code retrieval, the logical first step is a runtime translation solution.
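As a small illustration of modifying a code object, the sketch below swaps one constant and builds a new function from the patched code. The function greet is hypothetical, and code.replace() needs Python 3.8 or newer. Because code objects are immutable, the change produces a new object instead of editing the old one.

```python
import types


def greet():
    return "hello"


original = greet.__code__

# Build a new constants tuple with one value swapped.
patched_consts = tuple(
    "goodbye" if const == "hello" else const for const in original.co_consts
)
patched = original.replace(co_consts=patched_consts)

# Wrap the patched code object in a new function.
patched_greet = types.FunctionType(patched, globals(), "patched_greet")

print(greet())
print(patched_greet())
```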
Properties of a code object could be stored in any signed, encrypted, or otherwise obfuscated format that is de-obfuscated or translated during runtime and used to instantiate a new object. One could even change how variable name lookups work within the interpreter to obfuscate naming information. A developer could further mitigate reversing attempts by adding a translation layer between the lookup of the actual names and the names within the source code.
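A toy version of runtime translation is sketched below: the serialized code object is stored in an obfuscated form and only restored when it is about to run. The single-byte XOR, the KEY value, and the function names are all hypothetical choices for illustration; XOR is obfuscation, not encryption.

```python
import marshal

KEY = 0x5A  # hypothetical single-byte key; XOR is obfuscation, not encryption


def obfuscate(code):
    """Serialize a code object and XOR every byte with KEY."""
    return bytes(byte ^ KEY for byte in marshal.dumps(code))


def load_obfuscated(blob):
    """Reverse the XOR at run time and rebuild the code object."""
    return marshal.loads(bytes(byte ^ KEY for byte in blob))


stored = obfuscate(compile("answer = 40 + 2", "<hidden>", "exec"))

namespace = {}
exec(load_obfuscated(stored), namespace)
print(namespace["answer"])
```

Anyone who can read the loader can undo the translation, which is why this remains a race between implementers and those seeking to break the scheme.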
Conclusion
Now, after reading all this, you may want to go and experiment with a few of the tools out there. Here are some Python-based tools that can help you find your way into reverse engineering:
- Paimei
- Sulley
- The Carrera Collection
- PyEmu
- IDAPython
- ImmDbg
All of these are great pieces of code, but they become outstanding when they are used together. Keep in mind this is in no way a complete list, just the ones that I use the most and that I think show how Python’s flexibility can make a task as complex as reverse engineering manageable.