Reverse-Engineering checklist

Not all those who wander are lost, but how did you end up here?

This page is a work in progress

The idea of this page is to provide a decision tree of steps to follow when reverse-engineering a new malware sample. I won’t go into details about each step, but I’ll try to provide some useful links, references, and the high-level approaches that are possible.

I plan to update this page once in a while with the steps I follow when analysing a sample. Anyway, let’s start.


First, check what kind of file you’re dealing with. This can be done manually, or automated with yara. For your convenience, you can use exe_kind.yar:

yara exe_kind.yar .
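If yara is not at hand, a rough version of the same triage can be sketched in Python. This is not the content of exe_kind.yar: the function name and the marker table are illustrative assumptions, built from well-known byte signatures (the PyInstaller archive cookie, the .NET metadata root, the Go build info header, and the ZIP header used by modern Office documents). A hit is only a hint, because these byte strings can also occur by chance.

```python
# Illustrative marker table; these are NOT the rules from exe_kind.yar.
MARKERS = {
    "pyinstaller": b"MEI\x0c\x0b\x0a\x0b\x0e",  # PyInstaller archive cookie magic
    "dotnet": b"BSJB",                          # .NET metadata root signature
    "golang": b"\xff Go buildinf:",             # Go build info header
}


def guess_kind(data: bytes) -> list[str]:
    """Return the names of all markers found in the file contents."""
    kinds = [name for name, marker in MARKERS.items() if marker in data]
    # Modern Office documents are ZIP archives, so they start with a ZIP local file header.
    if data.startswith(b"PK\x03\x04"):
        kinds.append("office_or_zip")
    return kinds
```

Read the sample in binary mode and pass the bytes in, for example `guess_kind(open("malware.exe", "rb").read())`. An empty list means none of the markers matched.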

Then follow a hyperlink depending on what you got:


Pyinstaller

In general, Python reverse-engineering consists of two steps: dumping the bytecode, and analysing it.

With Pyinstaller the situation is usually simple, because off-the-shelf tools work. I recommend pyinstxtractor-ng; an older alternative is pyinstxtractor.

python3 ~/opt/pyinstxtractor.py malware.exe

If you managed to get the bytecode, continue from →Python Bytecode.
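The later steps depend on knowing which Python version built the sample. The extractors report it, but it can also be read straight from the archive cookie at the end of the executable. The sketch below assumes the cookie layout used since PyInstaller 2.1 (magic, package length, TOC offset, TOC length, Python version, Python library name, all big-endian); older archives lack the last field and are not handled. The function name is my own.

```python
import struct

COOKIE_MAGIC = b"MEI\x0c\x0b\x0a\x0b\x0e"
# Assumed layout: magic, package length, TOC offset, TOC length,
# Python version, Python library name (network byte order).
COOKIE_FORMAT = "!8sIIii64s"


def pyinstaller_python_version(data: bytes):
    """Return (major, minor) of the bundled Python, or None if no cookie is found."""
    offset = data.rfind(COOKIE_MAGIC)  # the cookie sits near the end of the file
    size = struct.calcsize(COOKIE_FORMAT)
    if offset < 0 or offset + size > len(data):
        return None
    _, _, _, _, pyver, _ = struct.unpack_from(COOKIE_FORMAT, data, offset)
    # Newer PyInstaller stores 3.10 as 310; older releases stored 2.7 as 27.
    return divmod(pyver, 100) if pyver >= 100 else divmod(pyver, 10)
```

The result tells you which interpreter to pick when disassembling the extracted .pyc files.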


Python Bytecode

Approaches to Python bytecode analysis you should try are, in order:

  • pycdc - current state of the art of Python decompilation
  • pyc disassembly - disassemble the bytecode and analyse manually

pycdc

Currently pycdc gives you the biggest chance of success. If you’re lucky it’s packaged in your distro, otherwise you need to clone the repository and compile it yourself.

nix-shell -p pycdc
pycdc _extracted_malware/malware.pyc > malware.py

If it didn’t work, you’ll have to read →pyc disassembly.


pyc disassembly

Worst case, you can always read the disassembled bytecode directly. Your options are:

  • dis - built-in Python bytecode disassembler
  • xdis - pure Python library for disassembling bytecode

dis

Surprisingly, even though Python includes a dis module, I’m not aware of any built-in way to disassemble a .pyc file directly. So you will need this tiny script:

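import dis, sys, marshal

path = sys.argv[1]
with open(path, "rb") as f:
    # Skip the 16-byte .pyc header (Python 3.7+; older versions use 12 or 8 bytes).
    f.seek(16)
    # The rest of the file is a marshalled code object.
    dis.dis(marshal.load(f))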

Run it like this:

nix-shell -p python313  # pick the correct version
python3 view_pyc_file.py malware.pyc > malware.pybc

The caveat is that this depends on the Python version, so you need to disassemble with the same Python version that the file was compiled with.
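A quick way to tell whether the interpreter at hand matches is to compare the first four bytes of the .pyc file with the interpreter's own magic number. The helper below is a minimal sketch; its name and the returned keys are illustrative.

```python
import importlib.util


def pyc_header_info(header: bytes) -> dict:
    """Describe the first bytes of a .pyc file."""
    magic = header[:4]
    return {
        # The first two bytes are a little-endian number that changes with the bytecode format.
        "magic_number": int.from_bytes(magic[:2], "little"),
        # The magic of a regular .pyc file ends with CR LF.
        "looks_like_pyc": magic[2:4] == b"\r\n",
        # True when the file was compiled for the bytecode format of the running interpreter.
        "matches_interpreter": magic == importlib.util.MAGIC_NUMBER,
    }
```

Call it as `pyc_header_info(open("malware.pyc", "rb").read(16))`. If `matches_interpreter` is false, switch to another Python version before running the script above.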

If you’re lucky and it worked, go →read the bytecode. Otherwise, try →xdis.


xdis

That’s why you may also consider xdis, which is a pure Python library for disassembling Python bytecode, independent of the original interpreter version.

TODO example

This should work. If it did, go →read the bytecode. Otherwise, it’s possible you’re dealing with obfuscated bytecode, or even worse, a custom CPython build. I don’t have anything about them yet, so that’s EOF for now.


Read the bytecode

Any text editor will do. I just want to mention vscode-python-bytecode-highlight, my VS Code extension for Python bytecode syntax highlighting. It colours the bytecode nicely and makes some links clickable, but that’s it.


Delphi

First, load the sample into IDR (https://github.com/crypto2011/IDR) in a Windows VM. This may take a long time.

Then export an .IDC script.

Then import it into Ghidra using dhrake (https://github.com/huettenhain/dhrake) and its DhrakeInit script.


Dotnet

This means that the executable is a .NET assembly. Now, depending on how obfuscated the sample is:

  • Use →ilspycmd for quick analysis and triage.
  • Use →dnSpy for serious reverse-engineering.
  • Use →dnLib for automating the deobfuscation process.
  • Use →dotnet sdk to quickly unpack malware by reusing the unpacker’s code.
  • Use →Visual Studio if dotnet-sdk doesn’t work.
  • Check out →dbglib for obfuscated samples that elude dnSpy debugger.

ilspycmd

For lightweight analysis, I recommend ilspycmd.

sudo docker run -v .:/docker --rm -ti berdav/ilspycmd -c "cd /docker; /home/ilspy/.dotnet/tools/ilspycmd -p malware.exe -o out_dir"

This will decompile malware.exe to out_dir. After that you can easily open the decompiled code in your favorite editor, or use standard Linux commands (like grep) to analyse it.

The downside is that this approach is fully static: it’s not possible to debug the code, and there is no easy way to deobfuscate anything. In some cases ilspycmd will flat out refuse to decompile some of the code. In this case, you may have to turn to →dnSpy.


dnSpy

For heavyweight analysis, I recommend https://github.com/dnSpyEx/dnSpy. It’s a GUI, but it’s very powerful and contains a built-in debugger. Unfortunately, it’s Windows-only, and I prefer to stay in Linux as much as possible for reverse-engineering.


dnLib

dnSpy is based on dnLib, a .NET library for when you really need to get your hands dirty and start scripting the analysis. See for example this XWorm analysis (the idea is clear, but unfortunately the full framework was not open-sourced).

Since I mostly script in Python, I use pythonnet, a crazy library that brings .NET to Python. This allows me to use dnlib like this:

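import clr  # provided by pythonnet

# Assumes pythonnet can locate dnlib.dll (for example in the working directory).
clr.AddReference("dnlib")
from dnlib.DotNet.Emit import OpCodes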

def get_string_default_values(typeobj):
    """Get all variables initialised to a string as a dict.
    Ignore other initialisation code.
    typeobj is TypeDefMD from dnlib.DotNet."""
    static_ctor = typeobj.FindStaticConstructor()
    if not static_ctor:
        return {}

    result = {}
    code = static_ctor.Body.Instructions
    for i in range(len(code) - 1):
        if code[i].OpCode == OpCodes.Ldstr:
            if code[i+1].OpCode == OpCodes.Stsfld:
                fieldname = code[i+1].Operand.Name.String
                fieldvalue = code[i].Operand
                result[fieldname] = fieldvalue
    return result

This snippet extracts all string initialisations from a static constructor. It’s an extremely useful piece of code when writing automatic extractors for .NET stealers (of which there are plenty).


dotnet sdk

A powerful unpacking method is code reuse. For example, let’s say the decompiled malware contains this line:

GCP gCP = (GCP)Marshal.GetDelegateForFunctionPointer(GetProcAddress(hModule, Reverse(Decipher("zzljvyWaulyybJalN", 7))), typeof(GCP));

Instead of reverse-engineering Decipher, you can just reuse the code in your own small tool:

nix-shell -p dotnet-sdk
dotnet new console
vim Program.cs

And put this code in Program.cs:

internal class Program
{
	// Reverse and Decipher methods, copied from decompiled source code.

	private static void Main(string[] args)
	{
		Console.WriteLine(Reverse(Decipher("jvssHzsM", 7)));
	}
}

And just run it:

dotnet run

It goes without saying that you should be careful when running malware code. I recommend doing this in a Docker container or a virtual machine.
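When the copied routine turns out to be trivial, a Python port avoids running any sample code at all. The two strings in this example happen to decode with a plain Caesar shift followed by a reversal. That is an inference from these two strings only, not the real Decipher implementation, so treat the functions below as hypothetical stand-ins.

```python
import string


def decipher(text: str, shift: int) -> str:
    """Hypothetical stand-in for Decipher: rotate letters back by `shift`, keeping case."""
    out = []
    for ch in text:
        for alphabet in (string.ascii_lowercase, string.ascii_uppercase):
            if ch in alphabet:
                ch = alphabet[(alphabet.index(ch) - shift) % 26]
                break
        out.append(ch)  # non-letters pass through unchanged
    return "".join(out)


def reverse(text: str) -> str:
    """Stand-in for Reverse: return the string backwards."""
    return text[::-1]


print(reverse(decipher("zzljvyWaulyybJalN", 7)))  # GetCurrentProcess
print(reverse(decipher("jvssHzsM", 7)))           # FlsAlloc
```

If a port like this stops producing sensible API names on other strings from the sample, fall back to reusing the original C# code.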


Visual Studio


dbglib

Honourable mention goes to dbglib, which is my library for native low-level .NET debugging. See this DotRunPeX analysis for a tutorial on how to use it.

golang

Well, you’re in for an adventure. Golang is not the easiest language to reverse-engineer. By default the decompiled binary looks like trash, but with the help of a few tools it’ll get slightly better.

My main decompiler/disassembler is Ghidra. Most of the hints should transfer to other tools, but this guide is opinionated and I’m not going to cover them.

First, load the executable into Ghidra. Ghidra recently gained basic Golang support, so remember to select the x86:LE:32:default:golang language instead of the default choice.

The support is not great right now, though. To get the symbols, we will use GoReSym:

nix-shell -p goresym
GoReSym -t -d -p malware.exe > goresym.json
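Before importing anything, it can help to skim the recovered names. The sketch below assumes the JSON has top-level UserFunctions and StdFunctions lists whose entries carry Start and FullName fields; check those names against your own goresym.json, since I am quoting them from memory. The helper names are my own.

```python
import json


def load_goresym(path: str) -> dict:
    """Load the JSON written by GoReSym."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def list_functions(goresym: dict, include_std: bool = False):
    """Return (start_address, full_name) pairs sorted by address."""
    # "or []" covers both a missing key and a JSON null.
    functions = list(goresym.get("UserFunctions") or [])
    if include_std:
        functions += goresym.get("StdFunctions") or []
    return sorted((fn["Start"], fn["FullName"]) for fn in functions)
```

For example, `for start, name in list_functions(load_goresym("goresym.json")): print(hex(start), name)` prints the user-defined functions in address order.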

Then, load the symbols to Ghidra using a helper script (goresym.py) (TODO: modified cerberus, upload and link).

For malware, most likely the binary is obfuscated and you will still see junk function names:

But that’s still a step up compared to what you get by default.

After that, you can try ghostrings to recover strings from the binary:

Install the extension, and run several of the included scripts:

  • GoDynamicStrings.java
  • GoStaticStrings.java
  • GoKnownStrings.java
  • GoFuncCallStrings.java

Some of the scripts will take a while. Go make a coffee (you will need it).

After that, you will have transformed this complete mess:

Into something almost readable:

That’s all I have for now - EOF.

office_excel_xml

Office files include:

  • files with .docx, .xlsx, .pptx extensions
  • files with .docm, .xlsm, .pptm extensions (with macros)

A good way to start analysis is to use oletools. Use oleid to check if there are any macros:

nix-shell -p python310Packages.oletools
oleid sample.xlsx
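As a cross-check that needs no extra tools: the formats listed above are ZIP archives, and a VBA project is normally stored inside as a vbaProject.bin part (for example xl/vbaProject.bin). The sketch below only looks for that part; it does not replace oleid, and the function name is illustrative.

```python
import io
import zipfile


def find_vba_parts(data: bytes) -> list[str]:
    """Return archive members that look like embedded VBA projects."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return []  # not a ZIP-based Office file (legacy .doc/.xls use a different container)
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [name for name in archive.namelist()
                if name.lower().endswith("vbaproject.bin")]
```

An empty result on a ZIP-based file means no vbaProject.bin part was found. It does not prove the document is harmless.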

see also


Other executable type

TODO


Other notes

Check compiler (nauz)

git clone https://github.com/horsicq/Nauz-File-Detector.git
cd Nauz-File-Detector
sudo docker build . -t nauz
sudo docker run -v /home/msm/data:/home/msm/data nauz nfdc /home/msm/data/2024-12-18_ov8865sys/9c52d750eba2f72bdd38bcaf950da7f1128d5235223091d98dad2cc7146716fa

Gx64Sync

ImHex

Dumps

malduck fixpe

Themida

Entrypoint starts with a call:

e8 82 01 00 00    CALL       FUN_141fdd237
41 52             PUSH       R10
49 89 e2          MOV        R10,RSP
41 52             PUSH       R10
49 8b 72 10       MOV        RSI,qword ptr [R10 + local_res8]
49 8b 7a 20       MOV        RDI,qword ptr [R10 + local_res18]
fc                CLD
b2 80             MOV        DL,0x80

And later there is a tree of “if” statements:

  if (bVar12) {
    bVar11 = CARRY1(bVar7,bVar7);
    bVar7 = bVar7 * '\x02';
    bVar12 = bVar11;
    if (bVar7 == 0) {
      bVar7 = *local_res8;
      local_res8 = local_res8 + 1;
      bVar12 = CARRY1(bVar7,bVar7) || CARRY1(bVar7 * '\x02',bVar11);
      bVar7 = bVar7 * '\x02' + bVar11;
    }
    if (bVar12) {
      bVar11 = CARRY1(bVar7,bVar7);
      bVar7 = bVar7 * '\x02';
      bVar12 = bVar11;
      if (bVar7 == 0) {
        bVar7 = *local_res8;
        local_res8 = local_res8 + 1;
        bVar12 = CARRY1(bVar7,bVar7) || CARRY1(bVar7 * '\x02',bVar11);
        bVar7 = bVar7 * '\x02' + bVar11;
      }
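My reading of this pattern, which I have not confirmed against the sample: each nested block is an inlined "get next bit" routine of a bit-oriented decompressor. bVar7 is a tag byte that is shifted left one bit at a time, the carry is the bit that was read, and when the tag becomes zero the next input byte is loaded with a marker bit shifted in behind it. The MOV DL,0x80 in the entrypoint would then be the initial tag, which forces a reload on the first read. A Python sketch of such a reader, with class and method names of my own:

```python
class BitReader:
    """Reads input bits MSB-first using a tag byte with a marker bit."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0
        self.tag = 0x80  # only the marker bit is set, as in MOV DL,0x80

    def get_bit(self) -> int:
        # Shift the tag left; the bit that falls out is the carry.
        carry = (self.tag >> 7) & 1
        self.tag = (self.tag << 1) & 0xFF
        if self.tag == 0:
            # The marker bit just fell out: load the next byte and shift the
            # old carry (the marker) in at the bottom, like ADC DL,DL.
            byte = self.data[self.pos]
            self.pos += 1
            next_carry = (byte >> 7) & 1
            self.tag = ((byte << 1) | carry) & 0xFF
            carry = next_carry
        return carry
```

Reading eight bits from a fresh reader returns the bits of the first input byte from most significant to least significant, and the ninth read moves on to the second byte.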

The called function looks like this in Ghidra:

void FUN_141fdd237(void) {
  puVar2 = (undefined8 *)&stack0xffffffffffffffd8;
  pcVar1 = unaff_retaddr + -0x1de20b5;
  pcStack_30 = unaff_retaddr;
  if (*(int *)(unaff_retaddr + -0xf9f66e) == 0) {
    uStack_38 = 0;
    uStack_48 = 0;
    pcStack_40 = pcVar1;
    pcStack_30 = pcVar1;
    (*unaff_retaddr)();
    puVar2 = &uStack_48;
    pcVar1 = unaff_retaddr + 0x1cf;
  }
  (*(pcVar1 + 0xe42a47))(*(undefined8 *)((longlong)puVar2 + 0x20),*(undefined8 *)((longlong)puVar2 + 0x18));
  return;
}

This is the first layer of the packer. To unpack dynamically, add a breakpoint at the ret 0x20 instruction in the entrypoint and then run the binary.

The second stage is much more interesting, but I don’t have a writeup for it yet.

Powershell

  • PSDecode