fix: display the real details for aliases when requested, even if the alias is an uncompressed instruction by moste00 · Pull Request #2923 · capstone-engine/capstone

moste00 · 2026-05-14T19:58:06Z

Your checklist for this pull request

I've documented or updated the documentation of every API function and struct this PR changes.
I've added tests that prove my fix is effective or that my feature works (if possible)

Detailed description

Background:

We depart from LLVM in what we count as aliases. LLVM only counts so-called "Pseudo-Instructions", non-compressed specialized uses of normal instructions. For example, LLVM considers the 4-byte ret a pseudoinstruction that is just a specialized use of the instruction jalr.

Capstone expands the meaning of "alias" to also cover the equivalence between compressed and non-compressed instructions. For example, Capstone considers c.add to be an alias of the appropriate add instruction, whereas LLVM does NOT consider those two instructions to be aliases in the ordinary sense.

The problem:

Previously we only populated the real details when an instruction was an alias, but this was checked via printAliasInstr, which is an LLVM-derived function that only considers the restricted LLVM sense of the word "alias". This has an implication: compressed equivalents don't get the details of the instruction they're equivalent to, even when CS_OPT_DETAIL_REAL is set.

This change refactors the real details logic to also include Capstone's wider usage of "alias", namely uncompressed instructions.

Test plan

...

Closing issues

...

Rot127 · 2026-05-15T12:57:56Z

Capstone expands the meaning of "alias" to also cover the equivalence between compressed and non-compressed instructions. For example, Capstone considers c.add to be an alias of the appropriate add instruction, whereas LLVM does NOT consider those two instructions to be aliases in the ordinary sense.

When did we add this?
In #2869 ?

I need to look at it in more detail.
But we really should not deviate from LLVM, unless we can say "not categorizing it as alias is an LLVM bug".

How are aliases defined in the ISA?

moste00 · 2026-05-16T11:54:45Z

When did we add this?
In #2869 ?

I need to look at it in more detail.

It's much older than that, probably as old as the noalias flag itself, so since the very beginning in November 2025 or so.

The noalias flag was reflecting that before noaliascompressed was introduced.

But we really should not deviate from LLVM, unless we can say "not categorizing it as alias is an LLVM bug".

How are aliases defined in the ISA?

The short answer is that they aren't: the ISA never defines such a thing as an alias. It defines two things, pseudo instructions and compressed equivalents.

1- Pseudo instructions are basically assembly-time macros that allow you to write ret even though no such instruction exists and no binary encoding for ret exists. The assembler replaces it with a jalr and the CPU never knows the difference.

2- Compressed equivalents are actual instructions with actual encodings, and the CPU decoder is aware of them, but they happen to correspond semantically to a restricted use of an equivalent non-compressed instruction (e.g. the compressed add corresponds to a +=, something the ordinary add can also do).
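To make the two categories concrete, here is a minimal Python sketch. It is not Capstone or LLVM code, and the helper name insn_length is made up for illustration. It relies on the base ISA rule that a 32-bit encoding has its two lowest bits set to 11, while 16-bit compressed encodings use 00, 01 or 10 (longer encodings are ignored here). ret has no encoding of its own: the 4-byte form is jalr zero, 0(ra) and the 2-byte form is c.jr ra.

```python
def insn_length(first_halfword: int) -> int:
    """Length in bytes of a RISC-V instruction (16- and 32-bit encodings only)."""
    return 4 if (first_halfword & 0b11) == 0b11 else 2


# jalr zero, 0(ra): what the assembler emits for the pseudo-instruction `ret`
JALR_ZERO_RA = 0x00008067
# c.jr ra: the compressed instruction with the same effect
C_JR_RA = 0x8082

assert insn_length(JALR_ZERO_RA & 0xFFFF) == 4
assert insn_length(C_JR_RA) == 2

# Little-endian byte order, as the bytes appear in memory
print(JALR_ZERO_RA.to_bytes(4, "little").hex())  # 67800000
print(C_JR_RA.to_bytes(2, "little").hex())       # 8280
```

Both byte strings are real encodings of real instructions; "ret" only exists as text.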

To my humble intuition, those two things look very much the same from a user perspective. They're both "this instruction is actually the same as this other one", with the meaning of "the same" being defined in two slightly different ways each time.

Is there any precedent in other architectures that allows us to go one way or another? I know for a fact ARM has thumb mode, which is their compressed mode, but I don't know if they have their own notion of pseudo instructions.

(PS: note that this entire PR is about separating the details from the alias text. That is, we can still go with the decision to NOT consider compressed instructions as aliases, while still allowing the real details flag to populate their details with the non-compressed equivalent details.

This is very convenient for Rizin and any downstream consumers of Capstone, as it allows you to basically ignore all the compressed instructions; after all, every single one corresponds to a special case of a non-compressed instruction.)

slate5 · 2026-05-16T15:47:06Z

Hi @moste00, can you please give more precise examples of where LLVM returns an instruction that is/isn't an alias as expected? I'm a bit confused about what the desired result should be.
This is c.add example:

$ riscv64-linux-gnu-as -march=rv64gc -al - <<< 'add sp, sp, s0'
   1 0000 2291     	add	sp,sp,s0
$ riscv64-linux-gnu-objdump -d -M no-aliases a.out
   0:	9122            c.add	sp,s0

The assembler (as) shows the instruction as an alias (pseudoinstr) add sp,sp,s0, while the real instruction is shown by objdump. I tested LLVM, and it follows the same logic as GNU utils. Now cstool:

$ cstool riscv64 2291
 0  22 91        add	sp, sp, s0
$ cstool riscv64+noalias 2291
 0  22 91        c.add	sp, s0

I don't see the inconsistency at first...
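For readers following along, the bytes 22 91 can be split by hand. The sketch below is illustrative only (decode_cr is a made-up helper, not part of cstool or Capstone). It reads the two bytes as a little-endian halfword and extracts the fields of the compressed CR format, in which c.add has funct4 = 1001 and op = 10.

```python
def decode_cr(hex_bytes: str) -> dict:
    """Split a 16-bit CR-format encoding, given as cstool-style hex bytes, into fields."""
    raw = int.from_bytes(bytes.fromhex(hex_bytes), "little")
    return {
        "raw": raw,
        "funct4": (raw >> 12) & 0xF,   # bits 15:12
        "rd_rs1": (raw >> 7) & 0x1F,   # bits 11:7, destination and first source
        "rs2": (raw >> 2) & 0x1F,      # bits 6:2, second source
        "op": raw & 0b11,              # bits 1:0, compressed quadrant
    }


fields = decode_cr("2291")
assert fields["raw"] == 0x9122
assert fields["funct4"] == 0b1001 and fields["op"] == 0b10   # c.add
assert fields["rd_rs1"] == 2 and fields["rs2"] == 8          # x2 = sp, x8 = s0
```

So the encoding only stores two registers; the three-operand text add sp, sp, s0 is reconstructed by the printer.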

moste00 · 2026-05-16T16:01:18Z

Hi @moste00, can you please give more precise examples of where LLVM returns an instruction that is/isn't an alias as expected?

I just mean that all compressed instructions aren't understood by LLVM core as aliases, maybe the CLI tools implement this on top of the core (as they should, IMO), but the core itself has a function called printAliasInstr, and this function doesn't think that compressed equivalents are aliases. Aliases are purely ONLY pseudoinstructions, things with no encodings.

There IS an equivalent of printAliasInstr for the compressed instructions, which is uncompressInst, which will give you the equivalent non-compressed instruction of the compressed instruction you passed. But that's not an "alias" as LLVM defines it, it's a decompression.
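As a rough illustration of what "decompression" means at the encoding level, the sketch below maps a c.add halfword to the 32-bit add encoding with the same effect. It is a hand-written Python model, not the LLVM uncompressInst function, and expand_c_add is a hypothetical name. It deliberately rejects the rd = x0 hint encodings and the rs2 = x0 encodings (c.jalr, c.ebreak).

```python
def expand_c_add(halfword: int) -> int:
    """Return the 32-bit `add rd, rd, rs2` encoding equivalent to a c.add halfword."""
    funct4 = (halfword >> 12) & 0xF
    rd = (halfword >> 7) & 0x1F
    rs2 = (halfword >> 2) & 0x1F
    op = halfword & 0b11
    if funct4 != 0b1001 or op != 0b10 or rd == 0 or rs2 == 0:
        raise ValueError("not a c.add encoding handled by this sketch")
    add_opcode = 0b0110011  # R-type OP; funct3 = 0 and funct7 = 0 select add
    return (rs2 << 20) | (rd << 15) | (rd << 7) | add_opcode


# c.add sp, s0  ->  add sp, sp, s0
assert expand_c_add(0x9122) == 0x00810133
```

The input and output are both real encodings, which is the distinction drawn above: nothing here is an alias in the LLVM sense.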

Like you noticed, most CLI tools probably intuitively know that the user doesn't care about this pedantic distinction, and quietly just redefine "alias" to mean both things, but LLVM doesn't think that decompressed instructions are aliases, so we will be departing from them there.

(There are some consequences if we do this, for example we would have no alias ID for decompressed instructions, alias IDs are only assigned to the "fake" pseudo instructions that LLVM considers as aliases, compressed instructions are real from LLVM's POV, they have a real instruction ID and no alias ID.)

slate5 · 2026-05-16T16:22:14Z

Yea, we both understand that these aliases (pseudoinstructions) are just a programmer's convenience and, in a way, a relief from hard-coded decisions on which architecture will execute this. E.g., you just write add sp,sp,s0 and the assembler decides if it can be replaced by a compressed instruction or not. If it can be, then this is an alias for a compressed instruction, otherwise, this is a real instruction.

If you want to have an alias ID for compressed instructions, then we would have to add a table for it, right? Or even better, just link them somehow to the existing table of aliases, because there is not really a compressed alias instruction. It's just an alias that is or is not compressed. As you said, from the user perspective, an alias represents a functionality, and nobody cares whether that functionality took 2 or 4 bytes of memory :)

slate5 · 2026-05-16T16:30:39Z

Also, I didn't reiterate that there is no difference between CLI tools and Capstone: the CLI tools show the same string as cstool does, add sp,sp,s0 or c.add sp,s0, depending on whether you ask for -M no-aliases.
But if I understood you right, you wanna have alias ID?

moste00 · 2026-05-16T16:30:51Z

Yea, we both understand that these aliases (pseudoinstructions) are just a programmer's convenience and, in a way, a relief from hard-coded decisions on which architecture will execute this. E.g., you just write add sp,sp,s0 and the assembler decides if it can be replaced by a compressed instruction or not. If it can be, then this is an alias for a compressed instruction, otherwise, this is a real instruction.

This is my view, but another view is that we should do EXACTLY what LLVM core does, and LLVM core doesn't see compressed instructions as aliases. Maybe we can give them another flag, for example decompressed? Or redefine noaliascompressed such that it's not a subset of noalias (effectively defining two types of aliases, normal aliases and compressed aliases, both mutually exclusive).

If you want to have an alias ID for compressed instructions, then we should have to add a table for it, right?

Yes, but this is its own deviation from LLVM too: we would define a manual table and maintain it with no auto-sync from LLVM. So whichever path you take, you will always have to face that you're going against LLVM convention.

slate5 · 2026-05-16T16:44:36Z

Let's backtrack a bit. I'm confused a lot 😅
I just tested the famous ret (aka, jalr zero, ra or c.jr ra), and this is how cstool detects it:

cstool -d riscv64 67800000
 0  67 80 00 00  ret	
	ID: 31 (jalr)
	Is alias: 1698 (ret) with ALIAS operand set

	Groups: jump 

cstool -d riscv64 8280
 0  82 80        ret	
	ID: 513 (c_jr)
	Is alias: 1698 (ret) with ALIAS operand set

	Groups: HasStdExtCOrZca jump

alias ID is ret (1698) for both

slate5 · 2026-05-16T17:01:13Z

Ah, so the problem is with those that are aliased only as compressed instructions, while the real instruction counterpart doesn't have an alias...

moste00 · 2026-05-16T19:23:22Z

@slate5 good point, actually now I'm confused too :D

I didn't test ret before, but I tested another instruction (sext.w or something, the alias is sign extension but the core operation is compressed addition) and it had an invalid alias ID. So perhaps my statement doesn't apply to all decompressed instructions, but it certainly applies to some of them.

Anyway, let's wait for @Rot127 to make the final judgement call on this, preferably according to the precedent set by ARM. Then we will see the way forward.

slate5 · 2026-05-16T21:47:54Z

Hehe, sext.w (c.addiw t0,0) works well for me XD

$ cstool -rd riscv64 8122
 0  81 22        sext.w	t0, t0
	ID: 495 (c_addiw)
	Is alias: 1684 (sext.w) with REAL operand set
	op_count: 2
		operands[0].type: REG = t0
		operands[0].access: READ | WRITE
		operands[1].type: IMM = 0x0
		operands[1].access: READ

	Groups: HasStdExtCOrZca IsRV64

I think the only "issue" is when you have an "alias" that, in itself, is nothing but the same mnemonic of the real instruction. And then, it only makes sense to call it an "alias" (i.e., alternative name) if it represents a compressed instruction. For example, sext.w can be used as an alias to both addiw and c.addiw, while addi doesn't exist as an alias to a full instruction and only exists as an "alias" to a compressed one (addi t0,t0,2 can be the real instruction and there is no pseudo version of it except if it represents a compressed one, c.addi t0,2)
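The sext.w case can be checked by hand as well. The following sketch is illustrative only, with made-up helper names. It decodes the bytes 81 22 as a CI-format c.addiw (funct3 = 001, op = 01, an RV64 encoding) and applies the standard rule that sext.w rd, rs is the pseudo-instruction for addiw rd, rs, 0.

```python
def decode_c_addiw(hex_bytes: str):
    """Decode cstool-style hex bytes as c.addiw and return (rd, imm)."""
    raw = int.from_bytes(bytes.fromhex(hex_bytes), "little")
    if (raw >> 13) != 0b001 or (raw & 0b11) != 0b01:
        raise ValueError("not a c.addiw encoding")
    rd = (raw >> 7) & 0x1F
    imm = (((raw >> 12) & 1) << 5) | ((raw >> 2) & 0x1F)
    if imm >= 32:  # sign-extend the 6-bit immediate
        imm -= 64
    return rd, imm


def mnemonic_for(imm: int) -> str:
    # A zero immediate is the only case where the sext.w spelling applies.
    return "sext.w" if imm == 0 else "c.addiw"


rd, imm = decode_c_addiw("8122")
assert (rd, imm) == (5, 0)            # x5 is t0
assert mnemonic_for(imm) == "sext.w"
```

The same zero-immediate test applies to the 4-byte addiw, which is why sext.w can sit on top of both real instructions.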

So, it kinda makes sense, after all, R in RISC-V means reduced, not simple :)

Rot127 · 2026-05-17T17:54:56Z

preferably according to the precedent set by ARM

ARM has aliases :D There it is easy.

Yes but this is its own deviation from LLVM too, we will define a manual table and maintain it with no auto-sync from LLVM

Please don't introduce another table we need to maintain, unless it is easy to generate automatically.

The purpose of Auto-Sync is to just use the LLVM code as much as possible. Patching a line here and there is fine. Or extending our LLVM backends to generate it for us, of course.

I think the only "issue" is when you have an "alias" that, in itself, is nothing but the same mnemonic of the real instruction

That case is actually a bug (from our POV, not necessarily for LLVM).
If, for example, there is an "alias" instruction with the mnemonic addi this should be fixed.

It usually means that the LLVM definitions have an alias and a real instruction defined with the same mnemonic. You can search for InstAlias.*<mnemonic> in the RISCV.*.td files in the llvm-capstone repo.
We can change these definitions (remove the InstAlias) to fix it. But please leave a comment there that it is a Capstone edit.

Personally, I wouldn't want the compressed instructions to be counted as "alias".
An alias should really just be a different mnemonic or a "shortcut" writing for an instruction.

First of all, because this is what it usually means for all other archs. So we can have some consistency between them.
And second, because the alias must execute semantically the exact same way as its real counter part.

If one implements some tool with Capstone they maybe don't care about the mnemonic.
So sext.w being semantically equivalent to addiw might be enough to know for them.
This is why we have the alias feature. So people can just get the operands of the real instruction and use them, knowing that any alias of it is semantically equivalent.

IF the compressed instructions are semantically equivalent to the full version of them, we could say that they are an alias. But since the encoding bytes differ, I would prefer to add an extra decompressed flag for them and treat them as real.

So something like that:

Compressed and not-compressed

Compressed instructions are real instructions.
They are distinct from their "not-compressed" equivalents because the encoding differs.
Compressed instructions have a flag "is_compressed" set to true.
Optionally: It stores the ID of the not-compressed instruction somewhere (if we can somehow generate the mapping table for it nicely).

Alias

Alias instructions only differ in mnemonic and/or used operands from the real instruction.
Alias and real instruction byte encodings are always the same.
Alias can have two real instruction parents. Not-compressed and compressed.

The topology is something like this:

alias:       ret
             / \
          is alias of
           /      \
real:    c_jr    jalr

Difference:

Bytes:      67800000
Alias ID:   ret
Real ID:    jalr
Detail:     cs_insn.details.is_compressed == false
            cs_insn.size == 4
            if (get_alias_details)
               cs_insn.op_count == 0
            else
               cs_insn.op_count == 1

Bytes:      8280
Alias ID:   ret
Real ID:    c_jr
Detail:     cs_insn.details.is_compressed == true
            cs_insn.size == 2
            if (get_alias_details)
               cs_insn.op_count == 0
            else
               cs_insn.op_count == 1

Anything which doesn't follow this definition is a bug.
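The definition above can be written down as a small data model. This is a Python sketch of the proposal only: DecodedInsn, its fields and check_invariants are hypothetical names, not existing Capstone structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecodedInsn:
    raw: bytes                # encoding bytes as found in memory
    real_id: str              # always a real encoding: "jalr", "c_jr", ...
    alias_id: Optional[str]   # None when no alias applies
    is_compressed: bool       # hypothetical flag from the proposal

    @property
    def size(self) -> int:
        return len(self.raw)


def check_invariants(insn: DecodedInsn) -> None:
    # Compressed instructions are real instructions that are 2 bytes long.
    assert insn.is_compressed == (insn.size == 2)
    # An alias never replaces the real ID; it is only layered on top of it.
    assert insn.real_id != insn.alias_id


ret_full = DecodedInsn(bytes.fromhex("67800000"), "jalr", "ret", False)
ret_short = DecodedInsn(bytes.fromhex("8280"), "c_jr", "ret", True)
for insn in (ret_full, ret_short):
    check_invariants(insn)

# One alias, two real parents.
assert ret_full.alias_id == ret_short.alias_id
assert ret_full.real_id != ret_short.real_id
```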

wdyt?
Have I overlooked/over-read something?

moste00 · 2026-05-21T19:37:17Z

preferably according to the precedent set by ARM

ARM has aliases :D There it is easy.

xD very correct, indeed.

Personally, I wouldn't want the compressed instructions to be counted as "alias". An alias should really just be a different mnemonic or a "shortcut" writing for an instruction.

First of all, because this is what it usually means for all other archs. So we can have some consistency between them. And second, because the alias must execute semantically the exact same way as its real counter part.

This is reasonable. The thing is, compressed instructions satisfy the second condition exactly. Unless I'm misreading the spec/programmer's manual, it really does seem to say that a compressed equivalent MUST have the same effect as the uncompressed instruction that inspired it. That's the intention in the first place: to give a size shortcut for common idioms.

IF the compressed instructions are semantically equivalent to the full version of them, we could say that they are an alias. But since the encoding bytes differ, I would prefer to add an extra decompressed flag for them and treat them as real.

Very reasonable.

Optionally: It stores the ID of the not-compressed instruction somewhere (if we can somehow generate the mapping table for it nicely).

We can: the uncompressInst function is basically this table.

wdyt? Have I overlooked/over-read something?

My original use case remains :( I need to be able to treat compressed instructions as basically their non-compressed equivalents, or else lifting would become very painful and repetitive. So one of 3 things:

1- The r real details flag treats compressed instructions as a "quasi-alias". They're not an alias, sure, but the real details flag would still replace the details of an is_compressed instruction with the non-compressed details.

2- There is a separate flag that does the same thing as (1) but is not r, maybe rc (real compressed)?

3- There is a separate operands array in RISC-V other than the usual one. The real details flag operates on the usual one, the other flag operates on the other one.

Basically, I'm just circling and circling over the idea that I need to be able to obtain the non-compressed details, and since Rizin is just a serious test-drive of Capstone, probably many other tools depending on Capstone will have the same need.

Rot127 · 2026-05-22T10:55:14Z

My original use case remains :( I need to be able to treat compressed instructions as basically their non-compressed equivalents, or else lifting would become very painful and repetitive. So one of 3 things:

Sorry, I lost this context while reading.

The idea 2 seems good to me, but I would flip it around.

By default -r shows details of the real instruction for alias AND compressed. And we add an additional flag (--rc or something) which makes -r show the real details ONLY for proper alias, but not for compressed ones.

Because I think your lifting use case is way more common and should require only one flag instead of two.

moste00 · 2026-05-27T20:38:24Z

@Rot127 One final question: does this mean we no longer treat noalias as a suppression of the compressed instruction text? Since we don't classify compressed instructions as aliases, it would imply that noalias would no longer suppress their text.

noaliascompressed would still be present, but preferably renamed to nocompressed so as to not imply that compressed instructions are a subset of aliases? WDYT?

moste00 · 2026-05-27T20:46:47Z

@Rot127 Also, one more note: It's never the case in LLVM that an alias has 2 parents, each alias in LLVM's alias table maps to exactly 1 parent, and most of those parents are the non-compressed.

So this presents another difficulty (if we choose to handle it; ignoring it is always an option). Some instructions that "logically" should be aliases, for example a c_addi that logically performs a move, will not be counted as aliases in the new classification.

We could handle this: uncompress the instruction, then if the uncompression maps to an alias and the user hasn't asked for alias suppression, print the alias. This way the c_addi will first uncompress to an addi, which, if it has the right operands, will then alias-map to a mv, and the net effect is that c_addi was successfully mapped to mv if they're equivalent.
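The two-step idea can be sketched with toy tables. Everything below is illustrative Python: UNCOMPRESS and alias_of are tiny stand-ins for uncompressInst and the alias table, and they only know the c.jr ra / jalr / ret chain seen earlier in this thread, not the c_addi case.

```python
# Toy stand-in for uncompressInst: compressed mnemonic -> expander function
UNCOMPRESS = {
    "c_jr": lambda ops: ("jalr", ("zero", ops[0], 0)),
}


def alias_of(mnemonic, ops):
    """Toy stand-in for the alias table: only knows jalr zero, 0(ra) == ret."""
    if mnemonic == "jalr" and ops == ("zero", "ra", 0):
        return "ret", ()
    return None


def render(mnemonic, ops, noalias=False):
    expand = UNCOMPRESS.get(mnemonic)
    wide = expand(ops) if expand else (mnemonic, ops)
    alias = None if noalias else alias_of(*wide)
    text = alias if alias is not None else (mnemonic, ops)
    # The real ID stays the one that was decoded, compressed or not.
    return {"text": text, "real_id": mnemonic}


assert render("c_jr", ("ra",)) == {"text": ("ret", ()), "real_id": "c_jr"}
assert render("c_jr", ("ra",), noalias=True)["text"] == ("c_jr", ("ra",))
assert render("jalr", ("zero", "ra", 0))["text"] == ("ret", ())
```

The extra cost per compressed instruction in this model is one table lookup and one expansion before the alias check; whether that holds for the real implementation is not something this sketch can show.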

More work, and this whole topic is surprisingly fractal in complexity and edge cases.

Rot127 · 2026-05-28T09:18:05Z

since we don't classify compressed instructions as aliases, it would imply that noalias would no longer suppress their text.

Yes, I think this follows from it. nocompressed sounds good to me as addition.

We could handle this: uncompress the instruction, then if the uncompression maps to an alias and the user hasn't asked for alias suppression, print the alias. This way the c_addi will first uncompress to an addi, which, if it has the right operands, will then alias-map to a mv, and the net effect is that c_addi was successfully mapped to mv if they're equivalent.

That is a tricky one indeed. Generally the assembly output should be as LLVM does it. Being comparable to it is one of the features we have.

How is the uncompression done? Does it cost a lot of runtime?
Because, if it is relatively low (or can be disabled alternatively), then I am fine with it. Of course, the alias details must have the correct (compressed) ID set as "real instruction".

@slate5 Feel free to state your opinion as well btw.

moste00 · 2026-06-04T22:42:44Z

@Rot127 Hey, sorry for sleeping on this for a while.

I overthought this way too hard till my teeth fell out, and I think I came up with a really intuitive way to navigate this exhaustively.

Let's begin by just listing all the facts:

1- There are 2 disjoint sets of instructions, compressed and uncompressed instructions

2- Additionally, there is a 3rd disjoint set of Pseudo-Instructions, instructions that don't really exist in binary encoding, but only as aliases to (mostly) non-compressed instructions, and the occasional compressed instruction (C_ADD_HINT in particular, a special case of C_ADD actually)

3- Some compressed instructions act as quasi-aliases, uncompressInst maps a compressed instruction to its equivalent "verbose" form in non-compressed instructions.

Let's represent this as either a Venn diagram or Finite State Machine, pick your favorite name for whatever the following diagram is trying to say:

Now here's our existing alias-printing policy, distilled to its simplest phrasing: Unless otherwise specified, always print the alias form of the instruction.

Forget that the C extension exists for a moment, this is how it already works for non-compressed instructions:

The simplest generalization of this policy for C instructions is almost begging to get out of this diagram, and it goes a bit like this:

1- Unless otherwise specified by flags, always uncompress compressed instructions THEN print them as aliases, i.e. walk the FSM from non-compressed to alias

2- But flags are levers that allow you to prevent going to aliases, and there are 2 possible places to stop: at compressed instructions, using the flag +keepcompressed, which will refuse to uncompress; or after uncompression but before the alias mapping, using +noaliascompressed.
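The walk described in these two points can be written as a short decision function. This is a sketch of the proposed policy in Python, not of existing Capstone behaviour: Forms and choose_text are hypothetical names, the flag names are the ones proposed above, and letting keepcompressed take precedence when both flags are set is one possible choice.

```python
from typing import NamedTuple, Optional


class Forms(NamedTuple):
    compressed: Optional[str]   # None when the instruction is not compressed
    uncompressed: str
    alias: Optional[str]        # None when no alias applies


def choose_text(forms: Forms, keepcompressed: bool = False,
                noaliascompressed: bool = False) -> str:
    if forms.compressed is not None:
        if keepcompressed:       # first stop: refuse to uncompress
            return forms.compressed
        if noaliascompressed:    # second stop: uncompress, skip the alias step
            return forms.uncompressed
    # Default: walk all the way to the alias when one exists.
    return forms.alias or forms.uncompressed


c_jr_ra = Forms("c.jr ra", "jalr zero, 0(ra)", "ret")
assert choose_text(c_jr_ra) == "ret"
assert choose_text(c_jr_ra, keepcompressed=True) == "c.jr ra"
assert choose_text(c_jr_ra, noaliascompressed=True) == "jalr zero, 0(ra)"
assert choose_text(c_jr_ra, keepcompressed=True, noaliascompressed=True) == "c.jr ra"
```

Alias suppression for instructions that were never compressed (the existing noalias flag) is left out of this model on purpose.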

In diagram:

Some issues

C_ADD_HINT ?

C_ADD_HINT is a strange creature. It's not an independent instruction at all, it's just a special subspace of the encoding space of C_ADD, with rd equal to zero. This instruction pattern is the sole compressed encoding in printAliasInstr's domain, and this raises the question: when no flags are given, do we uncompress then print the alias, or print the alias directly?

1- Uncompression will work, I haven't tried it yet but looking at the code for uncompressInst tells me it's very probably just going to uncompress to add x0, ...., which in turn aliases to a nop

2- HOWEVER, the aliases for C_ADD_HINT are important: they're ntl.p1, ntl.pall, ntl.s1, ntl.all, Non-Temporal Locality Hints that tell the cache logic not to waste its time caching things that won't be reused. Yes, they are semantically NOPs, specifically designed so that oblivious simple microarchitectures can just swallow them as nops without special handling, but some microarchitectures can change their behaviour because of them, so it's important to show them in the disassembly.

3- Maybe simply do whatever LLVM does ? which I suspect is going to be almost certainly (2)

4- Or just by default make C_ADD_HINT print as alias, except when +noalias is given, then it goes through the full path of expandable instructions through (2) and (3) in the diagram.

For clarity I chose to represent it in the diagram as a non-expandable instruction, but it's very probably expandable as it's just a sub-encoding of C_ADD and uncompressInst always expands arbitrary C_ADD to an equivalent ADD.

What happens when `+keepcompressed` and `+noaliascompressed` are both given ?

With the FSM phrasing I gave above, the answer is strongly implied to be: +keepcompressed wins, as it's the "first" test performed if you imagine yourself starting in the set of compressed instructions and walking the transitions.

Or maybe validate options and exit early if both are given. Or maybe whatever textually comes last wins.

What about the details

Detail replacement is MUCH simpler: if the details are replaced, then it's ALWAYS with the details of a non-alias, non-compressed instruction.

1- For non-compressed instructions that print as aliases, the replacement details are their original details before they got printed as aliases

2- For expandable compressed instructions, the details are those of the non-compressed equivalent

3- C_ADD_HINT is again the odd one out but it probably has no details worth worrying about at all

Details are replaced according to the following logic:

1- The -r flag implies replacing details for both aliases and compressed instructions

2- Another, weaker flag, maybe -R ?, replaces details ONLY for aliases, keeps the details of compressed instruction as their compressed details (or maybe as whatever their text ended up on, whether non-compressed or alias)

3- When the 2 flags are given, -r wins because it's the strongest most aggressive one
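The three rules can be condensed into one predicate. This is again only a sketch of the proposal in Python: Kind and replace_details are made-up names, and -R is the tentative flag name floated above, not an existing cstool option.

```python
from enum import Enum, auto


class Kind(Enum):
    PLAIN = auto()        # non-compressed, printed under its own mnemonic
    ALIAS = auto()        # non-compressed, printed as an alias
    COMPRESSED = auto()   # expandable compressed instruction


def replace_details(kind: Kind, r: bool = False, R: bool = False) -> bool:
    """True when details become those of the non-alias, non-compressed form."""
    if kind is Kind.PLAIN:
        return False                 # nothing to replace
    if r:                            # -r: aliases AND compressed; wins over -R
        return True
    if R:                            # -R: aliases only
        return kind is Kind.ALIAS
    return False


assert replace_details(Kind.COMPRESSED, r=True)
assert replace_details(Kind.ALIAS, R=True)
assert not replace_details(Kind.COMPRESSED, R=True)
assert replace_details(Kind.COMPRESSED, r=True, R=True)   # -r wins
assert not replace_details(Kind.ALIAS)
```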

That's it. Sorry for the wall of text, but it's honestly astonishing how deep and branching this problem is.

fix: display the real details for aliases when requested, even if the…

f78736a

… alias is an uncompressed instruction

github-actions Bot added the RISCV Arch label May 14, 2026

Rot127 marked this pull request as draft May 18, 2026 09:58

add frm details

86c8301

treat uncompressed instruction as non-aliases

e7e5383

github-actions Bot added the CS-core-files auto-sync label May 27, 2026
