DecomposeInterface

Gil Dabah edited this page Mar 19, 2021 · 20 revisions

In this page I'm going to cover how to parse the Decompose output.

The _DInst structure is very compact; it is designed to be as minimal as possible. Knowing that diStorm targets only x86 and AMD64 helped save memory space.

A few must-follow rules for using the Decompose interface:

  1. Always check that the returned instruction is valid first! Do that by comparing the 'flags' field to FLAG_NOT_DECODABLE.
  2. Access only fields that you know you are allowed to by the type of the operands. In the explanation below, every field indicates when it's set. Some fields are always set, and some are dependent on other fields.
  3. Always use the helper macros that are described on this page. In the future I might change some bits and bytes, and code that bypasses the macros will break.
  4. The last rule also applies to macros that define values rather than 'functionality', like R_NONE. Always check that a register index is set by comparing it to R_NONE. (For instance, if you compare it yourself to -1, you might get bogus results because of integer promotions).
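Rule 4 is easy to trip over, so here is a minimal sketch of the integer-promotion pitfall. The R_NONE definition below is only a stand-in so that the snippet compiles without distorm.h (in real code use the one from the header), and reg_is_set / naive_is_unset are hypothetical helper names.

```c
#include <stdint.h>

/* Stand-in for distorm.h's R_NONE so this sketch compiles on its own (assumption). */
#define R_NONE ((uint8_t)-1)

/* Correct check: R_NONE is the uint8_t value 255, the same value an unset field holds. */
int reg_is_set(uint8_t reg_index)
{
    return reg_index != R_NONE;
}

/* Pitfall: the uint8_t is promoted to int before the comparison,
   so an unset field compares as 255 == -1, which is never true. */
int naive_is_unset(uint8_t reg_index)
{
    int promoted = reg_index; /* what the compiler does implicitly */
    return promoted == -1;
}
```

With an unset register field, reg_is_set reports 0 as expected, while naive_is_unset also reports 0, which is exactly the bogus result the rule warns about.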

Again, the following structure was designed specifically for the x86 and AMD64 architectures. That's why some fields are global to the instruction, although a more appropriate place for them would be inside the Operand structure. This is done in order to spare bytes. For example, there's no reason to have an 'immediate' field in every operand, because x86 defines that an instruction can have at most 1 immediate operand. (The only exception is the ENTER instruction, you will find more information about it below).

If you wish to get the textual representation of either a register or an instruction, include 'mnemonics.h' in your project and use GET_REGISTER_NAME to get the string for a given register index. In 'mnemonics.h' you will also find other enums that will aid your parsing, like all supported registers (R_EAX, R_DS, R_CR0, etc) and opcodes (I_MOV, I_ADD, I_CALL, etc).

If you wish to convert the _DInst structure that you got as a result from the Decompose function into text, there's a new function that does just that: distorm_format. It requires the result and the old structure _DecodedInst, which will hold the textual representation of the instruction, including prefixes. This will save you the hassle of converting the operands into text and taking care of the prefixes and other subtle issues. You should use distorm_format rather than calling distorm_decode on the same instruction, because it does what its name suggests: it just formats the instruction as text.

struct _DInst:

_OffsetType addr;

  • Always set.
  • The virtual address of the instruction.
  • It is determined according to the given start address of the call to the Decompose function.

uint8_t size;

  • Always set.
  • The size of the whole instruction. Varying from 1 to 15 bytes long.

uint16_t flags;

  • Always set.
  • Very important to check this field before touching the other fields.
  • If it's set to FLAG_NOT_DECODABLE, the instruction is invalid.
  • See Flags for more information.

uint8_t segment;

  • Set when one of the operands is of type O_SMEM, O_MEM, O_DISP.
  • Helper macros: SEGMENT_GET, SEGMENT_IS_DEFAULT.
  • SEGMENT_IS_DEFAULT returns TRUE if the segment register is the default one for the operand. For instance: MOV [EBP], AL - the default segment register is SS. However, MOV [FS:EAX], AL - The default segment is DS, but we overrode it with FS, therefore the macro will return FALSE.
  • To extract the segment register index use the SEGMENT_GET macro.
  • R_NONE if not set.

uint8_t base;

  • Set when one of the operands is of type O_MEM.
  • It is the register index of the Base. E.G: for MOV [EAX+EBX*4], EDI it is R_EAX.
  • R_NONE if not set.

uint8_t scale;

  • Set when one of the operands is of type O_MEM.
  • The Scale pairs with the Index register of a memory indirection operand; the Index register itself is described in the Operand structure.
  • The scale can be one of 0, 1, 2, 4 or 8. If it's not set it is 0.

uint8_t dispSize;

  • Set when one of the operands is of type O_SMEM, O_MEM, O_DISP and the instruction has a displacement.
  • This is the size of the 'disp' field in bits.
  • If there's no displacement set, this field is 0.

uint16_t opcode;

  • Always set.
  • If the instruction is invalid it is set to I_UNDEFINED.
  • Include the file "mnemonics.h" to use the Instructions-Enum.
  • A helper macro to get the textual representation for an instruction is GET_MNEMONIC_NAME.
  • For instance, if you want to check that a decomposed instruction is 'POP', then compare this field to I_POP. Basically add a prefix of "I_" to the upcased name of the instruction you want to check. You can see the whole list in the "mnemonics.h" file.

_Operand ops[OPERANDS_NO];

  • An array of 4 _Operand's.
  • However, they might be empty.
  • See Operands for more information.

uint64_t disp;

  • Set when one of the operands is of type O_SMEM, O_MEM, O_DISP and the instruction has a displacement.
  • The only way to know that an instruction has a displacement is to check that dispSize != 0.
  • Some instructions use a displacement of 0. E.G: MOV [EBP], EAX.
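The dispSize/disp pairing can be sketched as follows. SketchInst is a hypothetical, trimmed-down stand-in for the two _DInst fields involved (so the snippet compiles without distorm.h), and format_disp is an illustrative helper, not part of diStorm.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in holding only the _DInst fields used here. */
typedef struct {
    uint8_t  dispSize; /* size of 'disp' in bits, 0 when there is no displacement */
    uint64_t disp;
} SketchInst;

/* Writes "+0x..." into out when a displacement exists, or an empty string
   otherwise. Returns 1 if a displacement was present, 0 if not. */
int format_disp(const SketchInst *inst, char *out, size_t out_len)
{
    if (inst->dispSize == 0) {      /* test dispSize, never disp itself */
        if (out_len > 0)
            out[0] = '\0';
        return 0;
    }
    snprintf(out, out_len, "+0x%llx", (unsigned long long)inst->disp);
    return 1;
}
```

Testing disp against 0 instead would wrongly drop the displacement of an instruction like MOV [EBP], EAX, whose displacement is present but equal to 0.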

_Value imm;

  • Set when one of the operands is of type O_IMM, O_IMM1&O_IMM2, O_PTR, O_PC.
  • The size of the immediate value itself is the Operand.size field.
  • See Immediate for more information.

uint16_t unusedPrefixesMask;

  • Always set.
  • This field indicates which of the prefixes of the instruction were unused.
  • There are two reasons why a prefix can be unused. The first is that it didn't affect the decoding of the instruction. E.G: db 0x66; ADD AL, 1. The 0x66 (Operand Size) prefix doesn't affect the instruction in this case and therefore is unused. The second is that there is more than one prefix of the same type (see x86 documentation). E.G: db 0x2e, db 0x3e, MOV [EAX], AL. We tried to set a segment override twice, so only the last one (0x3e) is taken into account and the first one is unused.
  • Normally instructions should not have unused prefixes. It might mean that you are disassembling invalid code (or data). Or it might mean you are disassembling an alignment instruction such as 0x66, 0x66, 0x90, which fills a space of 3 bytes to round up to the next multiple of 8/16, etc.
  • A quick check to see if the instruction has unused prefixes is 'unusedPrefixesMask != 0'.
  • So which prefixes are unused really? Since this field is a mask, the first bit denotes the first byte of the instruction, and so on, starting at 'addr' field. Basically use the following code:
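/* One mask bit per instruction byte, so walk all 16 bits (sizeof alone gives bytes, not bits). */
for (unsigned i = 0; i < sizeof(uint16_t) * 8; i++) {
 if (DecomposedInst.unusedPrefixesMask & (1u << i))
  printf("Unused prefix %02x at offset: %llx\n", CodeBuffer[DecomposedInst.addr - StartCodeOffset + i], (unsigned long long)(DecomposedInst.addr + i));
}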

uint16_t meta;

  • Always set.
  • This field holds meta information to the instruction.
  • It contains two sub-fields which should be extracted using the helper macros: META_GET_ISC, META_GET_FC.
  • META_GET_ISC returns the Instruction-Set-Class type of the instruction. E.G: ISC_INTEGER, ISC_FPU, and many more. See distorm.h for the complete list.
  • META_GET_FC returns the Flow-Control type of the instruction. E.G: FC_CALL, FC_BRANCH and others. Usually it's FC_NONE. See the rest of them inside distorm.h.
  • The meta-FC is very useful for flow control analysis.
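As a rough sketch of how the flow-control class might drive an analysis pass: the META_GET_FC macro and the FC_* values below are stand-ins so that the snippet compiles without distorm.h, and their bit layout is an assumption. The helper names are hypothetical.

```c
#include <stdint.h>

/* Stand-ins so the sketch compiles without distorm.h. The real macro and the
   FC_* values come from that header; the bit layout here is an assumption. */
#define META_GET_FC(meta) ((meta) & 0xf)
#define FC_NONE 0
#define FC_CALL 1

/* Nonzero when the instruction belongs to any flow-control class. */
int has_flow_control(uint16_t meta)
{
    return META_GET_FC(meta) != FC_NONE;
}

/* Nonzero when the instruction is a call. */
int is_call(uint16_t meta)
{
    return META_GET_FC(meta) == FC_CALL;
}
```

Because the extraction goes through the macro, the other sub-field stored in 'meta' never leaks into the comparison.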

uint16_t usedRegistersMask;

  • Set when the instruction is valid and uses registers in its operands.
  • This field is actually a mask for all the registers that are used in the operands.
  • Practically, instead of scanning for a specific register in the operands, you should use this field.
  • This field is not a replacement to the operands information! It is just a hint, hence a mask.
  • The registers are categorized to register-classes such as:
Registers Family                 Mask Name
AL, AH, AX, EAX, RAX             RM_AX
CL, CH, CX, ECX, RCX             RM_CX
DL, DH, DX, EDX, RDX             RM_DX
BL, BH, BX, EBX, RBX             RM_BX
SPL, SP, ESP, RSP                RM_SP
BPL, BP, EBP, RBP                RM_BP
SIL, SI, ESI, RSI                RM_SI
DIL, DI, EDI, RDI                RM_DI
ST(0) - ST(7)                    RM_FPU
MM0 - MM7                        RM_MMX
XMM0 - XMM15                     RM_SSE
YMM0 - YMM15                     RM_AVX
CR0, CR2, CR3, CR4, CR8          RM_CR
DR0, DR1, DR2, DR3, DR6, DR7     RM_DR

Note that RIP can be checked with the FLAG_RIP_RELATIVE. Segment registers have the 'segment' field. And R8-R15 are not mapped, I might add them in the future.
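A minimal sketch of testing the mask for a register family. The RM_* values are stand-ins so that the snippet compiles without distorm.h (take the real ones from the header), and uses_family is a hypothetical helper.

```c
#include <stdint.h>

/* Stand-in mask values for illustration only; use the RM_* macros from distorm.h. */
#define RM_AX 0x1
#define RM_CX 0x2
#define RM_BX 0x8

/* Nonzero when any register of the given family appears in the operands. */
int uses_family(uint16_t usedRegistersMask, uint16_t family)
{
    return (usedRegistersMask & family) != 0;
}
```

Since the mask is only a hint, a hit on RM_AX does not say whether AL, AX, EAX or RAX was used; for that you still need the register index in the operands.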

The following three fields describe how the instruction affects the CPU flags. It's a simple bit mask that can be tested using the following values:

Flag Name     Meaning
D_ZF          Zero Flag
D_SF          Sign Flag
D_CF          Carry Flag
D_OF          Overflow Flag
D_PF          Parity Flag
D_AF          Auxiliary Flag
D_DF          Direction Flag
D_IF          Interrupt Flag
uint16_t modifiedFlagsMask;

  • Use the above flags to check if a specific CPU flag is being modified (output) by this instruction. Only set if DF_FILL_EFLAGS is enabled, otherwise 0.

uint16_t testedFlagsMask;

  • Use the above flags to check if a specific CPU flag is being tested (input) by this instruction. Only set if DF_FILL_EFLAGS is enabled, otherwise 0.

uint16_t undefinedFlagsMask;

  • Use the above flags to check if a specific CPU flag is being undefined (output) by this instruction. Only set if DF_FILL_EFLAGS is enabled, otherwise 0.
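One way these three masks could be combined is a simple dependency check between two instructions, e.g. a CMP followed by a JZ. This is only a sketch: SketchFlags is a hypothetical stand-in for the three _DInst fields, the D_* values are stand-ins so that the snippet compiles without distorm.h, and the masks are only filled when DF_FILL_EFLAGS is enabled.

```c
#include <stdint.h>

/* Stand-in flag bits for illustration only; use the D_* macros from distorm.h. */
#define D_CF 0x01
#define D_ZF 0x40

/* Hypothetical stand-in for the three CPU-flag masks of _DInst. */
typedef struct {
    uint16_t modifiedFlagsMask;
    uint16_t testedFlagsMask;
    uint16_t undefinedFlagsMask;
} SketchFlags;

/* Nonzero when 'consumer' tests at least one flag that 'producer' modifies. */
int consumes_flags_of(const SketchFlags *consumer, const SketchFlags *producer)
{
    return (consumer->testedFlagsMask & producer->modifiedFlagsMask) != 0;
}

/* Nonzero when 'consumer' tests a flag that 'producer' leaves undefined. */
int reads_undefined_flags(const SketchFlags *consumer, const SketchFlags *producer)
{
    return (consumer->testedFlagsMask & producer->undefinedFlagsMask) != 0;
}
```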

Flags

The 'flags' field has a few more options. They are fairly advanced, but nothing special. Use the helper macros: FLAG_GET_OPSIZE, FLAG_GET_ADDRSIZE, FLAG_GET_PREFIX. FLAG_GET_OPSIZE returns the DecodeType (Decode16Bits, Decode32Bits or Decode64Bits) of the operand, thus it's the size of the operand.

FLAG_GET_ADDRSIZE returns the DecodeType (Decode16Bits, Decode32Bits or Decode64Bits) of the address, thus it's the size of the addressing used to reference memory by the operand.

It is important to understand the meaning of the two sizes. MOV EAX, EBX - the operand size is 32 in both operands. MOV [EAX], byte ptr 0 - the operand size is 8 in both operands, however the size of the register that references the memory, EAX, is obviously 32 (or Decode32Bits). One more thing about it: these are the effective operand/address sizes, rather than the input DecodeType you supply when calling the Decompose function (the two can differ when the instruction has size-override prefixes)!
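If you need the size as a number of bits rather than as a DecodeType, a conversion could look like the sketch below. The enumerator values 0, 1 and 2 are an assumption about distorm.h declared locally so that the snippet compiles on its own, and decode_type_to_bits is a hypothetical helper.

```c
/* Stand-in for the decode-type enumerators named on this page.
   The values 0, 1, 2 are an assumption; use the enum from distorm.h. */
enum { Decode16Bits = 0, Decode32Bits = 1, Decode64Bits = 2 };

/* Maps a decode type to its width in bits, or 0 for an unknown value. */
int decode_type_to_bits(int decode_type)
{
    if (decode_type < Decode16Bits || decode_type > Decode64Bits)
        return 0;
    return 16 << decode_type; /* 16, 32 or 64 */
}
```

You would feed it the result of FLAG_GET_OPSIZE or FLAG_GET_ADDRSIZE applied to the 'flags' field.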

FLAG_GET_PREFIX returns the prefix of the instruction (FLAG_LOCK, FLAG_REPNZ, FLAG_REP). Note that the string instructions CMPS and SCAS treat FLAG_REP as 'REPZ'.

There are a few more flags such as: FLAG_HINT_TAKEN, FLAG_HINT_NOT_TAKEN, FLAG_IMM_SIGNED. You can check for these flags by doing: if ((DecomposedInst.flags & FLAG_XXX) != 0). The first two are pretty self-explanatory, if you know what they mean :) FLAG_IMM_SIGNED is important if you want to know whether to treat the immediate as a signed or unsigned integer (some instructions supply this information).

FLAG_DST_WR - This flag indicates whether the first operand, that is the destination operand, is written to by the instruction. This way you can track dependencies between instructions. For example: MOV EBX, EAX -> EBX gets overwritten, hence the flag will be set. But for PUSH EBX the flag is not set, since the PUSH instruction doesn't write to the EBX register. Note that this flag is only supported in Integer instructions.

FLAG_RIP_RELATIVE indicates that an instruction in 64 bits uses RIP-relative memory addressing. In order to get the absolute target address you can use the INSTRUCTION_GET_RIP_TARGET macro. This flag will spare you the scanning for the RIP register in the operands.
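For intuition, RIP-relative addressing is relative to the address of the next instruction, so the arithmetic can be sketched as below. This is a worked example and not the macro's source: it assumes 'disp' holds the displacement sign-extended to 64 bits, and rip_target is a hypothetical helper. In real code use INSTRUCTION_GET_RIP_TARGET.

```c
#include <stdint.h>

/* Target of a RIP-relative operand: next instruction address plus displacement.
   Unsigned wrap-around makes a sign-extended negative displacement work. */
uint64_t rip_target(uint64_t addr, uint8_t size, uint64_t disp)
{
    return addr + size + disp;
}
```

For a 7-byte instruction at 0x1000 with a displacement of 0x10, the operand refers to 0x1000 + 7 + 0x10 = 0x1017.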

And last but not least, FLAG_PRIVILEGED_INSTRUCTION indicates that the instruction is a privileged instruction, one that can run only from ring 0.

Operands

As we said earlier, the 'ops' field is an array of 4 operand structures. This is probably the most important information you would want to extract from an instruction.

An operand is defined as:

typedef struct {
	/* Type of operand:
	O_NONE: operand is to be ignored.
	O_REG: index holds global register index.
	O_IMM: instruction.imm.
	O_IMM1: instruction.imm.ex.i1.
	O_IMM2: instruction.imm.ex.i2.
	O_DISP: memory dereference with displacement only, instruction.disp.
	O_SMEM: simple memory dereference with optional displacement (a single register memory dereference).
	O_MEM: complex memory dereference (optional fields: s/i/b/disp).
	O_PC: the relative address of a branch instruction (instruction.imm.addr).
	O_PTR: the absolute target address of a far branch instruction (instruction.imm.ptr.seg/off).
	*/
	uint8_t type; /* _OperandType */

	/* Index of:
	O_REG: holds global register index
	O_SMEM: holds the 'base' register. E.G: [ECX], [EBX+0x1234] are both in operand.index.
	O_MEM: holds the 'index' register. E.G: [EAX*4] is in operand.index.
	*/
	uint8_t index;

	/* Size in bits of:
	O_REG: register
	O_IMM: instruction.imm
	O_IMM1: instruction.imm.ex.i1
	O_IMM2: instruction.imm.ex.i2
	O_DISP: instruction.disp
	O_SMEM: size of indirection.
	O_MEM: size of indirection.
	O_PC: size of the relative offset
	O_PTR: size of instruction.imm.ptr.off (16 or 32)
	*/
	uint16_t size;
} _Operand;

That's pretty self-explanatory I would say. I just want to note that I decided to keep O_SMEM and O_MEM separate, since most of the time instructions have what I call a simple memory dereference: only one register is involved, and you get its index in operand.index already. When you encounter O_MEM, it really means the operand can involve two registers: the index register in operand.index and the base register in the instruction's 'base' field.

About the operand.size field: it describes the size of the object that the operand represents. It can be the size of the register if the type is O_REG, or it can be the size of the memory dereference. E.G: MOV [BX], EAX - the size of the destination is 32 bits, but the size of the register that addresses it is 16 bits. In such a case the size of the addressing register is not specified anywhere, and you should know it from the index of the register, which says R_BX. So practically, the size of the register when the type is O_REG is a bonus, because the same information can be derived from the index of the register.

I hope it is clear that the order of operands is based on the order of this array. If the operand.type is O_NONE you can stop querying the operands in the array.
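Walking the operand array might look like the sketch below. SketchOperand mirrors the _Operand layout shown above, but it and the O_* values are stand-ins so that the snippet compiles without distorm.h; count_operands is a hypothetical helper.

```c
#include <stdint.h>

#define OPERANDS_NO 4

/* Stand-in operand types for illustration; the real values live in distorm.h. */
enum { O_NONE = 0, O_REG = 1, O_IMM = 2 };

/* Stand-in with the same three fields as _Operand. */
typedef struct {
    uint8_t  type;
    uint8_t  index;
    uint16_t size;
} SketchOperand;

/* Counts the operands in use by stopping at the first O_NONE entry. */
int count_operands(const SketchOperand ops[OPERANDS_NO])
{
    int n = 0;
    while (n < OPERANDS_NO && ops[n].type != O_NONE)
        n++;
    return n;
}
```

The bound check matters: an instruction that uses all four slots has no O_NONE terminator to stop at.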

Immediate

The 'imm' (immediate) field should be treated very carefully. If you take a closer look, you will notice it is defined as a _Value type, which is defined as:

typedef union {
	/* Used by O_IMM: */
	int8_t sbyte;
	uint8_t byte;
	int16_t sword;
	uint16_t word;
	int32_t sdword;
	uint32_t dword;
	int64_t sqword; /* All immediates are SIGN-EXTENDED to 64 bits! */
	uint64_t qword;

	/* Used by O_PC: */
	_OffsetType addr;

	/* Used by O_PTR: */
	struct {
		uint16_t seg;
		/* Can be 16 or 32 bits, size is in ops[n].size. */
		uint32_t off;
	} ptr;

	/* Used by O_IMM1 (i1) and O_IMM2 (i2). ENTER instruction only. */
	struct {
		uint32_t i1;
		uint32_t i2;
	} ex;
} _Value;

Important note: As you can see, if the operand type is O_IMM, you can get the immediate value from sbyte, byte, ..., sqword, qword. At first glance it looks intimidating. Let me explain: in most cases you want to, and should, use the qword or sqword parts of the union. diStorm already sign-extends the immediates to 64 bits, so stick to them, it will be much easier for you. That said, you can happily use the other union fields for accessing the same immediate if that suits you. That's why you have the size of the immediate stored in the operand.size field.
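A minimal sketch of reading an O_IMM immediate through the 64-bit members, choosing signed or unsigned by FLAG_IMM_SIGNED. SketchValue is a trimmed-down stand-in for _Value, the flag value is a stand-in so that the snippet compiles without distorm.h, and format_imm is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in bit for illustration only; use FLAG_IMM_SIGNED from distorm.h. */
#define FLAG_IMM_SIGNED (1 << 5)

/* Stand-in for the two 64-bit members of the _Value union. */
typedef union {
    int64_t  sqword;
    uint64_t qword;
} SketchValue;

/* Formats the immediate as signed decimal or unsigned hex, depending on the
   flag. Returns the number of characters written, as snprintf does. */
int format_imm(uint16_t flags, const SketchValue *imm, char *out, size_t out_len)
{
    if (flags & FLAG_IMM_SIGNED)
        return snprintf(out, out_len, "%lld", (long long)imm->sqword);
    return snprintf(out, out_len, "0x%llx", (unsigned long long)imm->qword);
}
```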

Now let's talk about the ENTER instruction. It's the only instruction so far that has two immediates. We surely don't want to allocate a whole extra _Value structure for a rarely used instruction. Besides, since sizeof(_Value) is 8 bytes, we can easily squeeze both a 16-bit and an 8-bit immediate inside, and that's why I added the sub-structure 'ex'. So if you encounter an operand type of O_IMM1, it means you should read the immediate from 'imm.ex.i1', and from 'imm.ex.i2' for O_IMM2.
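Picking the right union member by operand type could be sketched like this. SketchValue is a trimmed-down stand-in for _Value, the O_* values are stand-ins so that the snippet compiles without distorm.h, and immediate_for_operand is a hypothetical helper.

```c
#include <stdint.h>

/* Stand-in operand types for illustration; the real values live in distorm.h. */
enum { O_IMM = 2, O_IMM1 = 3, O_IMM2 = 4 };

/* Stand-in for the members of _Value that hold immediates. */
typedef union {
    uint64_t qword;
    struct {
        uint32_t i1;
        uint32_t i2;
    } ex;
} SketchValue;

/* Returns the immediate that belongs to an operand of the given type. */
uint64_t immediate_for_operand(uint8_t type, const SketchValue *imm)
{
    switch (type) {
    case O_IMM1: return imm->ex.i1; /* first ENTER immediate */
    case O_IMM2: return imm->ex.i2; /* second ENTER immediate */
    default:     return imm->qword; /* plain O_IMM */
    }
}
```

For ENTER 0x20, 1 the first operand would yield 0x20 and the second would yield 1.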

The O_PTR type is found only in instructions that access far memory, such as: JMP FAR, CALL FAR, etc. To get both the segment/selector and the offset, use 'imm.ptr.seg' and 'imm.ptr.off'.

The O_PC type is found in branching instructions such as: JNZ, JB, JMP, CALL, etc. The value stored in the union is really the relative offset from the current instruction. Therefore if you want to calculate the target address of the branch, use the helper macro INSTRUCTION_GET_TARGET. In order to access the relative part alone, use 'imm.addr'. The size of the relative value is in the operand.size field. Note: in the past, before diStorm3 was official, 'imm.addr' used to be the absolute target address, and I might revert to that in the future. Unfortunately, I can't recall why I made this change.
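For intuition, an x86 relative branch is relative to the end of the branch instruction, so the target arithmetic can be sketched as below. This is a worked example and not the macro's source: it assumes 'imm.addr' holds the relative offset sign-extended to the width of _OffsetType (modeled here as uint64_t), and branch_target is a hypothetical helper. In real code use INSTRUCTION_GET_TARGET.

```c
#include <stdint.h>

/* Target of a relative branch: address of the next instruction plus the
   relative offset. Unsigned wrap-around handles backward branches. */
uint64_t branch_target(uint64_t addr, uint8_t size, uint64_t rel)
{
    return addr + size + rel;
}
```

A 2-byte short JMP at 0x401000 with a relative offset of -2 jumps to itself: 0x401000 + 2 - 2 = 0x401000.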


Now that you have read everything, you should take a look at Showcases and you will understand what's going on. You're ready to go, good luck.


Also a good example of how to use the Decompose API can be found at (look for the distorm_decompose function): https://github.com/gdabah/distorm/blob/master/src/distorm.c