fix: display the real details for aliases when requested, even if the alias is an uncompressed instruction #2923
Conversation
When did we add this? I need to look at it in more detail. How are aliases defined in the ISA? |
It's much older than that. It's probably as old as the noalias flag itself, since the very beginning in November 2025 or so. The noalias flag was reflecting that before noaliascompressed was introduced.
The short answer is that they aren't. The ISA never defines such a thing as an alias; it defines two things: pseudo instructions and compressed equivalents.
1- Pseudo instructions are basically assembly-time macros that allow you to write
2- Compressed equivalents are actual instructions with actual encodings. The CPU decoder is aware of them, but they happen to correspond semantically to a restricted use of an equivalent non-compressed instruction (e.g. the compressed add corresponds to a
To my humble intuition, those two things look very much the same from a user perspective. They're both "this instruction is actually the same as this other one", with the meaning of "the same" being defined in two slightly different ways each time.
Is there any precedent in other architectures that would let us go one way or the other? I know for a fact ARM has Thumb mode, which is their compressed mode, but I don't know if they have their own notion of pseudo instructions.
(PS: note that this entire PR is about separating the details from the alias text. That is, we can still go with the decision to NOT consider compressed instructions as aliases, while still allowing the real details flag to populate their details with the non-compressed equivalent details. This is very convenient for Rizin and any downstream consumers of Capstone, as it allows you to basically ignore all the compressed instructions; after all, every single one corresponds to a special case of a non-compressed instruction.) |
|
Hi @moste00, can you please give more precise examples of where LLVM returns an instruction that is/isn't an alias as expected? I'm a bit confused about what the desired result should be.
$ riscv64-linux-gnu-as -march=rv64gc -al - <<< 'add sp, sp, s0'
1 0000 2291 add sp,sp,s0
$ riscv64-linux-gnu-objdump -d -M no-aliases a.out
0: 9122 c.add sp,s0
$ cstool riscv64 2291
0 22 91 add sp, sp, s0
$ cstool riscv64+noalias 2291
0 22 91 c.add sp, s0
I don't see the inconsistency at first... |
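To make the equivalence in the transcript above concrete, here is a minimal Python sketch of how the 16-bit c.add encoding maps onto the 32-bit add encoding. The function name is hypothetical and this is not how Capstone or LLVM implement uncompression; only the bit layouts (CR format for C.ADD, R-type for ADD) are taken from the RISC-V ISA manual.

```python
def expand_c_add(halfword: int) -> int:
    """Expand a 16-bit C.ADD encoding into the 32-bit `add rd, rd, rs2` encoding.

    Hypothetical helper for illustration only.
    """
    opcode = halfword & 0b11          # quadrant; C.ADD lives in quadrant 2
    funct4 = (halfword >> 12) & 0xF   # 0b1001 is shared by C.ADD, C.JALR, C.EBREAK
    rd = (halfword >> 7) & 0x1F       # destination, also the first source
    rs2 = (halfword >> 2) & 0x1F      # second source
    # rs2 == 0 selects C.JALR/C.EBREAK and rd == 0 is the hint form,
    # so both are rejected here.
    if opcode != 0b10 or funct4 != 0b1001 or rd == 0 or rs2 == 0:
        raise ValueError("not a plain C.ADD encoding")
    # R-type ADD: funct7=0 | rs2 | rs1 | funct3=0 | rd | opcode=0x33
    return (rs2 << 20) | (rd << 15) | (rd << 7) | 0x33


raw = bytes.fromhex("2291")               # the bytes passed to cstool above
halfword = int.from_bytes(raw, "little")  # 0x9122, as objdump prints it
print(hex(halfword), hex(expand_c_add(halfword)))
```

Here rd decodes to x2 (sp) and rs2 to x8 (s0), so the expanded word is the encoding of add sp, sp, s0, which matches the text cstool prints by default.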
I just mean that compressed instructions aren't understood by the LLVM core as aliases. Maybe the CLI tools implement this on top of the core (as they should, IMO), but the core itself has a function called
There IS an equivalent of
Like you noticed, most CLI tools probably intuitively know that the user doesn't care about this pedantic distinction and quietly redefine "alias" to mean both things. But LLVM doesn't think that decompressed instructions are aliases, so we will be departing from it there.
(There are some consequences if we do this. For example, we would have no alias ID for decompressed instructions: alias IDs are only assigned to the "fake" pseudo instructions that LLVM considers aliases, while compressed instructions are real from LLVM's POV, with a real instruction ID and no alias ID.) |
|
Yea, we both understand that these aliases (pseudoinstructions) are just a programmer's convenience and, in a way, a relief from hard-coded decisions on which architecture will execute this. E.g., you just write
If you want to have an alias ID for compressed instructions, then we would have to add a table for it, right? Or even better, just link them somehow to the existing table of aliases, because there is not really a compressed alias instruction. It's just an alias that is or is not compressed. As you said, from the user perspective, an alias represents a functionality, and nobody cares whether that functionality took 2 or 4 bytes of memory :) |
|
Also, I didn't reiterate that there is no difference between the CLI tools and Capstone, because the CLI tools show the same string as cstool does. |
This is my view, but another view is that we should do EXACTLY what the LLVM core does, and the LLVM core doesn't see compressed instructions as aliases. Maybe we can give them another flag, for example
Yes, but this is its own deviation from LLVM too: we would define a manual table and maintain it with no auto-sync from LLVM. So whichever path you take, you will always have to face that you're going against LLVM convention. |
|
Let's backtrack a bit. I'm confused a lot 😅
cstool -d riscv64 67800000
0 67 80 00 00 ret
ID: 31 (jalr)
Is alias: 1698 (ret) with ALIAS operand set
Groups: jump
cstool -d riscv64 8280
0 82 80 ret
ID: 513 (c_jr)
Is alias: 1698 (ret) with ALIAS operand set
Groups: HasStdExtCOrZca jump
alias ID is ret (1698) for both |
|
Ah, so the problem is with those that are aliased only as compressed instructions, while the real instruction counterpart doesn't have an alias... |
|
@slate5 good point, actually now I'm confused too :D I didn't test ret before, but I tested another instruction (
Anyway, let's wait for @Rot127 to make the final judgement call on this, preferably according to the precedent set by ARM. Then we will see the way forward. |
|
Hehe, sext.w (c.addiw t0,0) works well for me XD
I think the only "issue" is when you have an "alias" that is itself nothing but the same mnemonic as the real instruction. Then it only makes sense to call it an "alias" (i.e., alternative name) if it represents a compressed instruction. For example,
So, it kinda makes sense. After all, the R in RISC-V means reduced, not simple :) |
ARM has aliases :D There it is easy.
Please don't introduce another table we need to maintain, unless it is easy to generate automatically. The purpose of Auto-Sync is to just use the LLVM code as much as possible. Patching a line here and there is fine. Or extending our LLVM backends to generate it for us, of course.
That case is actually a bug (from our POV, not necessarily for LLVM). It usually means that the LLVM definitions have an alias and a real instruction defined with the same mnemonic. You can search for
Personally, I wouldn't want the compressed instructions to be counted as "alias". First of all, because that matches what "alias" usually means for all other archs, so we can have some consistency between them. If one implements some tool with Capstone, they maybe don't care about the mnemonic. IF the compressed instructions are semantically equivalent to the full version of them, we could say that they are an alias. But since the encoding bytes differ, I would prefer to add an extra
So something like that:
Compressed and not-compressed
Alias
The topology is something like this:
Difference:
Bytes: 67800000
Alias ID: ret
Real ID: jalr
Detail: cs_insn.details.is_compressed == false
        cs_insn.size == 4
        if (get_alias_details)
            cs_insn.op_count == 0
        else
            cs_insn.op_count == 1
Bytes: 8280
Alias ID: ret
Real ID: c_jr
Detail: cs_insn.details.is_compressed == true
        cs_insn.size == 2
        if (get_alias_details)
            cs_insn.op_count == 0
        else
            cs_insn.op_count == 1
wdyt? |
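As a rough illustration of the proposal above, the following Python sketch models the two ret encodings with the fields from the listing. The InsnView class and both function names are hypothetical stand-ins, not Capstone's cs_insn or any existing API; the only outside fact used is the RISC-V length rule that 32-bit encodings have their two lowest bits set to 0b11.

```python
from dataclasses import dataclass


@dataclass
class InsnView:
    """Hypothetical stand-in for the fields in the listing, not Capstone's cs_insn."""
    alias_id: str
    real_id: str
    size: int
    is_compressed: bool
    op_count: int


def is_compressed(first_byte: int) -> bool:
    # RISC-V length rule: 32-bit encodings end in 0b11 in their lowest two bits;
    # 16-bit (compressed) encodings use any other value.
    return (first_byte & 0b11) != 0b11


def view_ret(raw: bytes, get_alias_details: bool) -> InsnView:
    """Build the proposed view for the two `ret` encodings from the listing."""
    compressed = is_compressed(raw[0])
    return InsnView(
        alias_id="ret",
        real_id="c_jr" if compressed else "jalr",
        size=2 if compressed else 4,
        is_compressed=compressed,
        # Operand counts are copied from the proposal: 0 for the alias view,
        # 1 for the real view.
        op_count=0 if get_alias_details else 1,
    )


for hexbytes in ("67800000", "8280"):
    for flag in (True, False):
        print(view_ret(bytes.fromhex(hexbytes), flag))
```

The point of the sketch is that the alias ID stays the same for both encodings, while the compression state is reported through a separate field rather than folded into the alias notion.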
xD very correct, indeed.
This is reasonable. The thing is, compressed instructions satisfy the second condition exactly. Unless I'm misreading the spec/programmer's manual, it really does seem to say that a compressed equivalent MUST have the same effect as the uncompressed instruction that inspired it. That's the intention in the first place: to give a size shortcut for common idioms.
Very reasonable.
We can,
My original use case remains :( I need to be able to treat compressed instructions as basically their non-compressed equivalents, or else lifting would become very painful and repetitive. So one of 3 things:
1- The
2- There is a separate flag that does the same thing as (1) but is not
3- There is a separate operands array in RISC-V other than the usual one; the real details flag operates on the usual one, the other flag operates on the other one.
Basically, I'm just circling around the idea that I need to be able to obtain the non-compressed details, and since Rizin is just a serious test-drive of Capstone, probably many other tools depending on Capstone will have the same need. |
Sorry, I lost this context while reading. Idea 2 seems good to me, but I would flip it around. By default
Because I think your lifting use case is way more common and should require only one flag instead of two. |
|
@Rot127 One final question: Does this mean we no longer treat
|
|
@Rot127 Also, one more note: it's never the case in LLVM that an alias has 2 parents. Each alias in LLVM's alias table maps to exactly 1 parent, and most of those parents are non-compressed. So this presents another difficulty (if we choose to handle it; ignoring it is always an option). Some instructions that "logically" should be aliases, for example a
We could handle this: uncompress the instruction, then, if the uncompressed form maps to an alias and the user hasn't requested alias suppression, print the alias. This way the
More work, and this whole topic is surprisingly fractal in complexity and edge cases. |
Yes, I think this follows from it.
That is a tricky one indeed. Generally the assembly output should be as LLVM does it. Being comparable to it is one of the features we have. How is the uncompression done? Does it cost a lot of runtime? @slate5 Feel free to state your opinion as well btw. |
|
@Rot127 Hey, sorry for sleeping on this for a while. I overthought this way too hard till my teeth fell out, and I think I came up with a really intuitive way to navigate this exhaustively.
Let's begin by just listing all the facts:
1- There are 2 disjoint sets of instructions: compressed and uncompressed instructions.
2- Additionally, there is a 3rd disjoint set of Pseudo-Instructions: instructions that don't really exist in binary encoding, but only as aliases to (mostly) non-compressed instructions, and the occasional compressed instruction (C_ADD_HINT in particular, a special case of C_ADD actually).
3- Some compressed instructions act as quasi-aliases,
Let's represent this as either a Venn diagram or a Finite State Machine, pick your favorite name for whatever the following diagram is trying to say:
Now here's our existing alias-printing policy, distilled to its simplest phrasing: unless otherwise specified, always print the alias form of the instruction. Forget that the C extension exists for a moment; this is how it already works for non-compressed instructions:
The simplest generalization of this policy for C instructions is almost begging to get out of this diagram, and it goes a bit like this:
1- Unless otherwise specified by flags, always uncompress compressed instructions THEN print them as aliases, i.e. walk the FSM from non-compressed to alias.
2- But flags are levers that allow you to prevent going to aliases, and there are 2 possible places you can stop at: at compressed instructions using the flag
Some issues
C_ADD_HINT ?
C_ADD_HINT is a strange creature. It's not an independent instruction at all, it's just a special subspace of the encoding space of C_ADD, with
1- Uncompression will work. I haven't tried it yet, but looking at the code for
2- HOWEVER, the aliases for C_ADD_HINT are important, they're
3- Maybe simply do whatever LLVM does? Which I suspect is almost certainly going to be (2).
4- Or just by default make C_ADD_HINT print as alias, except when +noalias is given; then it goes through the full path of expandable instructions through (2) and (3) in the diagram.
For clarity I chose to represent it in the diagram as a non-expandable instruction, but it's very probably expandable, as it's just a sub-encoding of C_ADD and
What happens when
|
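The printing policy described in the comment above (uncompress first, then walk to the alias, with flags acting as early stops) can be sketched in Python as follows. Everything here is an assumption for illustration: the Stop enum stands in for the flags, whose real names and exact semantics are not taken from the thread, and the two lookup tables are toy data, not Capstone's or LLVM's generated tables.

```python
from enum import Enum


class Stop(Enum):
    """Hypothetical stand-ins for the flags; not real Capstone option names."""
    COMPRESSED = 1    # print the instruction exactly as encoded
    UNCOMPRESSED = 2  # expand compressed forms, but print no alias
    ALIAS = 3         # default: expand, then print the alias if one exists


# Toy tables for illustration only.
UNCOMPRESS = {
    "c.jr ra": "jalr zero, 0(ra)",
    "c.add sp, s0": "add sp, sp, s0",
}
ALIAS = {"jalr zero, 0(ra)": "ret"}  # each alias has exactly one parent


def render(insn: str, stop: Stop = Stop.ALIAS) -> str:
    if stop is Stop.COMPRESSED:
        return insn
    expanded = UNCOMPRESS.get(insn, insn)  # step 1: compressed -> non-compressed
    if stop is Stop.UNCOMPRESSED:
        return expanded
    return ALIAS.get(expanded, expanded)   # step 2: non-compressed -> alias


print(render("c.jr ra"))
print(render("c.jr ra", Stop.UNCOMPRESSED))
print(render("c.jr ra", Stop.COMPRESSED))
```

Because the alias lookup always happens on the expanded form, each alias needs only its single non-compressed parent in the table, which is the property of LLVM's alias table mentioned earlier in the thread. How C_ADD_HINT should move through these steps is left open here, as it is in the comment.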
Your checklist for this pull request
Detailed description
Background:
We depart from LLVM in what we count as aliases. LLVM only counts so-called "Pseudo-Instructions", non-compressed specialized uses of normal instructions. For example, LLVM considers the 4-byte ret a pseudoinstruction that is just a specialized use of the instruction jalr.
Capstone expands the meaning of "alias" to also cover compressed-instruction equivalence. For example, Capstone considers c.add to be an alias of the appropriate add instruction, whereas LLVM does NOT consider those 2 instructions to be aliases in the ordinary sense.
The problem:
Previously we only populated the real details when an instruction was an alias, but this was checked via printAliasInstr, which is an LLVM-derived function that only considers the restricted LLVM sense of the word "alias". This has an implication: compressed equivalents don't have the details of the instruction they're equivalent to, even when the CS_OPT_DETAILS_REAL is set.
This change refactors the real details logic to also include Capstone's wider usage of "alias", namely uncompressed instructions.
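The difference between the old and new checks can be summarized in a small Python sketch. Both function names and all parameters are hypothetical and only model the decision described above; they are not the actual Capstone code touched by this PR.

```python
def populate_real_details_old(is_llvm_alias: bool, real_requested: bool) -> bool:
    # Old behavior as described: only LLVM-sense aliases (pseudo-instructions)
    # qualified for real details.
    return real_requested and is_llvm_alias


def populate_real_details_new(is_llvm_alias: bool, was_uncompressed: bool,
                              real_requested: bool) -> bool:
    # New behavior as described: an instruction that was expanded from a
    # compressed form also qualifies.
    return real_requested and (is_llvm_alias or was_uncompressed)


# A compressed equivalent that LLVM does not consider an alias:
print(populate_real_details_old(False, True))
print(populate_real_details_new(False, True, True))
```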
Test plan
...
Closing issues
...