Add MOVZ+MOVK address symbolization for ARM64 (non-PIE)#91
Add MOVZ+MOVK address symbolization for ARM64 (non-PIE)#91boppitybop wants to merge 2 commits intoGrammaTech:mainfrom
Conversation
ARM64 non-PIE executables construct absolute addresses using MOVZ+MOVK chains (2-4 instructions, each carrying a 16-bit slice with LSL #0/GrammaTech#16/GrammaTech#32/GrammaTech#48). ddisasm had no rules for this pattern, causing the instructions to retain raw immediates. The reassembled binary then contained wrong addresses. Changes: - Arm64Loader: record op_shifted facts for IMM operands with LSL - arm64_symbolization.dl: add movz_movk_chain rules (2/3/4 deep), emit symbolic_operand_candidate with G0-G3 attributes for EXEC binaries when the reconstructed value matches code or data_segment - Disassembler: add G2/G3 to AttributeMap - Add ex_movz_movk regression test
|
@boppitybop Thank you for your contribution! Before we can move forward with this MR, please sign our CLA form: https://github.com/GrammaTech/ddisasm/blob/main/GrammaTech-CLA-ddisasm.pdf Please let me know once it's done or if you have any questions. Thank you! |
|
Hi @boppitybop, Could you let me know how you are reassembling the generated assembly? I tried binary-printing using gtirb-pprinter with After changing: to the"selected processor does not support" errors went away. This is happening after applying your corresponding change in gtirb-pprinter. Could you share the exact toochain version and flags you're using to reassemble successfully? |
|
Thanks for the thorough testing, @junghee. It surfaced a bug in the initial implementation. I realised Capstone aliases So now For the arch, I guess ideally the pprinter would detect which extensions the binary uses and emit the appropriate .arch/.arch_extension directives. I think the error such as I couldn't get static binary reassembly to work on ARM64 due to these pprinter issues
For validating the MOVZ+MOVK symbolization itself, I used a dynamically-linked build of the test file ( Toolchain: aarch64-linux-gnu-gcc 9.4.0, binutils 2.34 (Ubuntu 20.04). |
junghee
left a comment
There was a problem hiding this comment.
@boppitybop Thank you for the contribution!
I've added some inline review comments -- please take a look and address them when you get a chance so we can move this forward.
| // MOVZ/MOVK with an explicit lsl #N (N > 0). | ||
| movz_movk_insn(EA, Reg, Imm band 65535, Shift, Operation):- | ||
| instruction(EA,_,_,Operation,ImmOp,RegOp,0,0,_,_), | ||
| (Operation = "MOVZ" ; Operation = "MOVK"), | ||
| op_immediate(ImmOp, Imm, _), | ||
| op_regdirect_contains_reg(RegOp, Reg), | ||
| op_shifted(EA, 1, Shift, "LSL"). | ||
|
|
||
| // MOVZ/MOVK without an explicit shift (lsl #0). | ||
| movz_movk_insn(EA, Reg, Imm band 65535, 0, Operation):- | ||
| instruction(EA,_,_,Operation,ImmOp,RegOp,0,0,_,_), | ||
| (Operation = "MOVZ" ; Operation = "MOVK"), | ||
| op_immediate(ImmOp, Imm, _), | ||
| op_regdirect_contains_reg(RegOp, Reg), | ||
| !op_shifted(EA, 1, _, _). |
There was a problem hiding this comment.
| // MOVZ/MOVK with an explicit lsl #N (N > 0). | |
| movz_movk_insn(EA, Reg, Imm band 65535, Shift, Operation):- | |
| instruction(EA,_,_,Operation,ImmOp,RegOp,0,0,_,_), | |
| (Operation = "MOVZ" ; Operation = "MOVK"), | |
| op_immediate(ImmOp, Imm, _), | |
| op_regdirect_contains_reg(RegOp, Reg), | |
| op_shifted(EA, 1, Shift, "LSL"). | |
| // MOVZ/MOVK without an explicit shift (lsl #0). | |
| movz_movk_insn(EA, Reg, Imm band 65535, 0, Operation):- | |
| instruction(EA,_,_,Operation,ImmOp,RegOp,0,0,_,_), | |
| (Operation = "MOVZ" ; Operation = "MOVK"), | |
| op_immediate(ImmOp, Imm, _), | |
| op_regdirect_contains_reg(RegOp, Reg), | |
| !op_shifted(EA, 1, _, _). | |
| movz_movk_insn(EA, Reg, Imm band 65535, Shift, Operation):- | |
| instruction_get_operation(EA, Operation), | |
| (Operation = "MOVZ" ; Operation = "MOVK"), | |
| arch.move_reg_imm(EA,Reg,Imm,_), | |
| ( | |
| // MOVZ/MOVK with an explicit lsl #N (N > 0). | |
| op_shifted(EA, 1, Shift, "LSL") | |
| ; | |
| // MOVZ/MOVK without an explicit shift (lsl #0). | |
| !op_shifted(EA, 1, _, _), | |
| Shift = 0 | |
| ), | |
| Shift % 16 = 0. |
Since the two rules are similar, we can merge them into the one above.
Also, I think it would be good to check that Shift % 16 = 0.
| const cs_arm64_op& CsOp = Details.operands[i]; | ||
| // For aliased MOVZ, we fix up the immediate operand to recover the | ||
| // original 16-bit value and shift that capstone folded away. | ||
| cs_arm64_op CsOp = Details.operands[i]; |
There was a problem hiding this comment.
Let's move this line three lines up, after the comment // Load capstone operand.
| all: out.txt | ||
| check: out.txt | ||
| ex: src.s | ||
| $(CC) -no-pie $^ -o $@ -static |
There was a problem hiding this comment.
| $(CC) -no-pie $^ -o $@ -static | |
| $(CC) -no-pie $^ -o $@ |
I tested it locally and found it only works without -static. And without it, the command matches to the one in your comment:
aarch64-linux-gnu-gcc -no-pie src.s -o ex
| @@ -0,0 +1,38 @@ | |||
| // Test: MOVZ+MOVK address construction in non-PIE ARM64 binaries. | |||
| // | |||
| // This exercises the movz_movk_pair Datalog rules. In a non-PIE executable | |||
There was a problem hiding this comment.
| // This exercises the movz_movk_pair Datalog rules. In a non-PIE executable | |
| // This exercises the movz_movk_chain Datalog rules. In a non-PIE executable |
| // | ||
| // This exercises the movz_movk_pair Datalog rules. In a non-PIE executable | ||
| // the linker fills absolute addresses into the MOVZ/MOVK immediates, so | ||
| // ddisasm must recognise the pair as constructing an address and emit |
There was a problem hiding this comment.
| // ddisasm must recognise the pair as constructing an address and emit | |
| // ddisasm must recognize the pair as constructing an address and emit |
| // ddisasm must recognise the pair as constructing an address and emit | ||
| // symbolic operand candidates with G0/G1 attributes. | ||
| // | ||
| // Expected behaviour after the fix: |
There was a problem hiding this comment.
| // Expected behaviour after the fix: | |
| // Expected behavior: |
| movz_movk_complete(EA_first, EA_last, Value):- | ||
| movz_movk_chain(EA_first, EA_last, _, Value, _), | ||
| // This chain cannot be extended by another MOVK. | ||
| ( | ||
| !next(EA_last, _) | ||
| ; | ||
| next(EA_last, EA_after), | ||
| !movz_movk_insn(EA_after, _, _, _, "MOVK") | ||
| ; | ||
| next(EA_last, EA_after), | ||
| movz_movk_insn(EA_after, Reg2, _, _, "MOVK"), | ||
| movz_movk_chain(EA_first, EA_last, Reg, _, _), | ||
| Reg2 != Reg | ||
| ). |
There was a problem hiding this comment.
| movz_movk_complete(EA_first, EA_last, Value):- | |
| movz_movk_chain(EA_first, EA_last, _, Value, _), | |
| // This chain cannot be extended by another MOVK. | |
| ( | |
| !next(EA_last, _) | |
| ; | |
| next(EA_last, EA_after), | |
| !movz_movk_insn(EA_after, _, _, _, "MOVK") | |
| ; | |
| next(EA_last, EA_after), | |
| movz_movk_insn(EA_after, Reg2, _, _, "MOVK"), | |
| movz_movk_chain(EA_first, EA_last, Reg, _, _), | |
| Reg2 != Reg | |
| ). | |
| movz_movk_complete(EA_first, EA_last, Value):- | |
| movz_movk_chain(EA_first, EA_last, Reg, Value, _), | |
| // This chain cannot be extended by another MOVK. | |
| ( | |
| !next(EA_last, _), UNUSED(Reg) | |
| ; | |
| next(EA_last, EA_after), UNUSED(Reg), | |
| !movz_movk_insn(EA_after, _, _, _, "MOVK") | |
| ; | |
| next(EA_last, EA_after), | |
| movz_movk_insn(EA_after, Reg2, _, _, "MOVK"), | |
| Reg2 != Reg | |
| ). |
| movz_movk_insn(EA, _, _, Shift, _), | ||
| movz_movk_shift_group(Shift, Group). | ||
|
|
||
| // Suppress unpaired MOVZ/MOVK from being independently symbolized. |
There was a problem hiding this comment.
This predicate unlikely_have_symbolic_immediate is used in code inference (block_heuristic) checking whether the immediate could be a possible target; it is not used in symbolization.
So, I think the following comment would be sufficient:
// Immediate in movz or movk may not be symbolic unless the instruction is part of movz_movk_chain.
I think it would also be good to add the following rule for potentially better code inference:
may_have_symbolic_immediate(EA,as(Value,address)):-
movz_movk_complete(EA_first,EA_last,Value),
movz_movk_member(EA,EA_first,EA_liast).
| if(IsAliasedMovz && CsOp.type == ARM64_OP_IMM) | ||
| { | ||
| CsOp.imm = AliasedImm16; | ||
| if(AliasedShift > 0) | ||
| { | ||
| CsOp.shift.type = ARM64_SFT_LSL; | ||
| CsOp.shift.value = AliasedShift; | ||
| } | ||
| } |
There was a problem hiding this comment.
| if(IsAliasedMovz && CsOp.type == ARM64_OP_IMM) | |
| { | |
| CsOp.imm = AliasedImm16; | |
| if(AliasedShift > 0) | |
| { | |
| CsOp.shift.type = ARM64_SFT_LSL; | |
| CsOp.shift.value = AliasedShift; | |
| } | |
| } | |
| if(IsAliasedMovz && CsOp.type == ARM64_OP_IMM && AliasedShift > 0) | |
| { | |
| CsOp.imm = AliasedImm16; | |
| CsOp.shift.type = ARM64_SFT_LSL; | |
| CsOp.shift.value = AliasedShift; | |
| } |
I think it can be simplified as above since CsOp.imm is unchanged when shift.value is 0.
Add MOVZ+MOVK address symbolization for ARM64 (non-PIE)
Fixes #90