What’s in that .wasm? Introducing wasm-decompile

snazz · on April 28, 2020

This looks much nicer than the wasm2c output for that binary. I compiled it with `clang wasm.c -c -target wasm32 -O2` just like in the instructions (I'm on LLVM 10), and used the latest wasm2wat with `wasm2wat -f wasm.o` and got this instead:

  (module
    (type (;0;) (func (param i32 i32) (result f32)))
    (import "env" "__linear_memory" (memory (;0;) 0))
    (import "env" "__indirect_function_table" (table (;0;) 0 funcref))
    (func (;0;) (type 0) (param i32 i32) (result f32)
      (f32.add
        (f32.add
          (f32.mul
            (f32.load
              (local.get 0))
            (f32.load
              (local.get 1)))
          (f32.mul
            (f32.load offset=4
              (local.get 0))
            (f32.load offset=4
              (local.get 1))))
        (f32.mul
          (f32.load offset=8
            (local.get 0))
          (f32.load offset=8
            (local.get 1))))))

wasm2c (also from WABT) returns this thing: https://paste.linux.community/view/7877995f

Aardappel · on April 28, 2020

wasm2c has a different objective though: to be recompile-able again while preserving semantics. wasm-decompile was designed for readability first.

snazz · on April 28, 2020

Fair enough. I’m still surprised at just how unreadable (for me) the wasm2c output was, though. The compiler must have done quite a bit of optimizing that wasm2c was unable to undo.

Aardappel · on April 28, 2020

It doesn't actually try to undo anything, it just translates Wasm instructions 1:1 (they're in your link at lines 205-221). wasm-decompile does try to "undo" some thing, but it is generally impossible given LLVM's optimized output and how low-level Wasm is (see also article).

snazz · on April 28, 2020

Okay, that makes sense.

klodolph · on April 28, 2020

This is fascinating. For various reasons, WASM is less like a target bytecode format and more like a peculiar IR for compilers. I’m sure this has all sorts of effects on the tooling.

k__ · on April 28, 2020

What's the difference?

klodolph · on April 28, 2020

If you were designing a bytecode as a compilation target, you would provide an easy correspondence in the bytecode to basic blocks.

See: https://en.wikipedia.org/wiki/Basic_block

WASM instead provides traditional control structures. So the compiler either has to preserve control structures through to the IR, or has to work backwards from basic blocks to control structures. Both options are undesirable, from the perspective of compiler writers, and would be unnecessary if the VM were a greenfield project.

kazinator · on April 28, 2020

Nested, structured control can be broken into basic blocks without flattening it into if/goto first. It sounds easier, in fact.

In fact, the construction of a basic block from flat code is a form of recovery of control structure. If the original code had nested loops, nesting will emerge in the basic block graph.

A recursive traversal of the original structure could produce that graph more directly; it doesn't have to walk a flat list of instructions asking, "does this have a label on it which is the target of a branch". without having to walk a flat list of instructions asking questions like "is this the target of a branch".

For instance, if we walk an if/then/else AST node, we can recursively get the basic block graph for the test, the then and else part, and then integrate that into a larger basic block graph according to a rigid pattern.

tedmielczarek · on April 28, 2020

The latter is very achievable using the Relooper algorithm developed by Alon Zakai and described in the Emscripten paper: https://dl.acm.org/doi/10.1145/2048147.2048224

klodolph · on April 28, 2020

Achievable, but extra front-end work that belongs in the backend.

MaxBarraclough · on April 28, 2020

I get the impression WASM is really a very clunky representation, far from what any greenfield project would ever have arrived at. That is to say, its decisions aren't just tradeoffs that people disagree about, they're simply inappropriate for what it's trying to do. More than once I've encountered comments and blog-posts lamenting its fundamentals, e.g. [0].

[0] http://troubles.md/wasm-is-not-a-stack-machine/

tedmielczarek · on April 28, 2020

I don't think that's true. It was designed for its purpose by people that knew what they were doing based on experience gathered from implementing previous incremental steps including emscripten's original JavaScript output and asm.js (a formalization of some techniques emscripten employed). The design rationale is right there in the spec repository: https://github.com/WebAssembly/design/blob/master/Rationale....

It was built to be a better target than asm.js for compiled languages to run in JavaScript VMs and it seems to have succeeded on that front. That it's not a perfect fully-general bytecode format doesn't seem like a real knock against it, that's a much harder problem to solve. In any event it seems to be getting a lot more traction than any previous entries in the space, which is exciting!

klodolph · on April 28, 2020

The criticisms in the linked article are simply not grounded in fact, and the article was obviously written by someone without any expertise in how modern compilers work. There are also a number of basic errors in the article.

Liveness information simply doesn’t belong in the bytecode. SSA is trivial to recreate from local mutable variables (it would make a good homework assignment for someone in an undergrad “intro to compilers” class). WebAssembly is obviously a register machine.

> With this, it becomes possible to get rid of locals entirely.

Both factually incorrect and pointless. There is no tangible benefit in getting rid of locals entirely. You are merely changing out one representation (register machine) for a different one (stack machine).

There are valid criticisms of WASM, but the linked article doesn’t have any.

The weird part of WASM is the control structures. The rest of it is a fairly sensible, actually rather nice register machine. You can see that older bytecode systems like the JVM are stack machines, but newer ones tend to be register machines. This isn’t because people are getting stupid, it’s because there are legitimate reasons to prefer register machines, and on the balance of things, my observations are that people with experience in the field tend to prefer register machines.

readittwice · on April 28, 2020

I would still consider WASM a stack machine and not a register machine. Yes, there are mutable local variables in WASM but Java bytecode has them as well - which you consider a stack machine. BTW the designers of WASM explicitly call WASM a stack machine here: https://github.com/WebAssembly/design/blob/master/Rationale..... With WASM's MVP it was necessary to store e.g. loop state in local variables, thanks to recent changes this doesn't seem to be necessary anymore. I think this was the main argument that blog post considered WASM to be a register machine. javac also makes heavy use of variables in bytecode, but somehow no one considers the JVM a register machine.

> my observations are that people with experience in the field tend to prefer register machines

That's actually the opposite of my observation, they seem to prefer stack machines.

kazinator · on April 28, 2020

The clang disassembly given in the article sure makes it look like WASM is a nested expression tree, which leaves the choice of stack versus register to the implementation.

      (f32.add
        (f32.add
          (f32.mul
            (f32.load
              (local.get 0))
            (f32.load
              (local.get 1)))
          (f32.mul
            (f32.load offset=4
              (local.get 0))
            (f32.load offset=4
              (local.get 1))))
        (f32.mul
          (f32.load offset=8
            (local.get 0))
          (f32.load offset=8
            (local.get 1))))

The outer f32.add could translate into a byte code instruction that finds its two operands on a stack, or to one which gets them from registers.

The code only says that there is a f32.add call which has two operands that are the result of a f32.add and f32.mul and so on.

The implementations will agree in their treatment of locals: that there are two locals 0 and 1, which support loading at offsets and such.

Both stack and register machines can support locals.

jsjolen · on April 28, 2020

That's just the pprinter working its magic :-).

Just read the standard, it's quite clear from the semantics that it's a stack machine: https://webassembly.github.io/spec/core/exec/runtime.html#st... https://webassembly.github.io/spec/core/exec/instructions.ht...

klodolph · on April 28, 2020

Yes, that’s exactly true. The “stack machine” here can be seen as nothing more than a way of encoding the expression tree.

afiori · on April 29, 2020

The stack machine is a model for the semantics of wasm, in the sense that the safety properties of wasm are defined in terms of stack types rather than SSA of register properties (for those that are unfamiliar with stack machines, this stack is a different type of concept from the call stack, that in wasm is used store the local variables). Whether during execution it is better implemented as a register machine or a stack machine is an implementation detail.

klodolph · on April 28, 2020

Java has instructions like dup, swap etc. To me, that is the critical difference here, and where I draw the line between “stack machine” and “register machine”.

readittwice · on April 28, 2020

I have to admit this line seems arbitrary to me. So WASM is a register machine to you but if they would simply add those 2 instruction would it suddenly become a stack machine then? Those instructions would actually be trivial to add. I think those terms are relatively well defined and when you argue that WASM is a register machine even though the inventors explicitly claim it's a stack machine you should have really good arguments for that. Personally I would be surprised if you could point me to any literature that supports your definition.

klodolph · on April 28, 2020

Turing completeness sounds like an arbitrary distinction to those outside of the field of CS, but it’s not.

To me, the distinction here is that the stack machine in WASM is restricted to the point that it corresponds 1:1 with an expression tree—not even a graph, just a tree. This means that every function in Web Assembly can be thought of as a collections of statements and expressions, and the stack machine abstraction is nothing more than a serialization format for the expressions.

Maybe dial it back a bit with the challenge to point at literature. The literature has not really caught up with the existence of WASM yet.

readittwice · on April 28, 2020

> Maybe dial it back a bit with the challenge to point at literature. The literature has not really caught up with the existence of WASM yet.

It wasn't me who claimed that WASM is "obviously" a register machine, despite the inventors saying otherwise. They even explicitly state that they decided against a register machine. I guess it's then reasonable for me to ask on what definition of stack vs register you are basing this opinion on. Let me be clear: I was not asking here for literature about WASM specifically but a definition of register/stack machines that supports your claim.

WASM's instruction encoding is very much based on a stack machine. Even with the initial limitations you mentioned I don't think it qualifies as "obviously a register machine". As already mentioned in multiple comments those restrictions were already lifted with the multi value proposal.

I understand that there is a grey area, but simply claiming "obviously a register machine" doesn't seem right to me. Implementation-wise WASM is a stack machine even if it needs/needed locals to be turing-complete.

Aardappel · on April 28, 2020

The multi-value proposal breaks the ability to turn Wasm into expressions easily, and thus makes it even more of a stack machine than it already is. Dup and swap may still be added in the future.

A defining feature of a register machine is that the actual instruction encoding has direct references to source and destination registers in it. Wasm doesn't have those, it has explicit get_local instructions instead.

That said, if you turn off LLVM's WebAssemblyRegStackify pass, all LLVM IR's values will end up in locals, with little to no stack usage. Still no register machine, but a bit more of a grey area :)

afiori · on April 29, 2020

dup and swap can be implemented with locals in wasm (on well typed stacks)

drew-y · on April 28, 2020

> The weird part of WASM is the control structures.

Out of curiosity, what do you find weird about the control structure portion?

I just wrote a basic language that compiles to wasm and found the built in control structures made my life easier.

klodolph · on April 28, 2020

Open any compilers book, WASM will look alien. Control flow analysis is normally done through basic blocks. If you make a toy language and don’t do control flow analysis you won’t notice what’s so weird about WASM. If you take an existing compiler and port it to WASM, you’ll cry, then gather yourself together and read how the relooper algorithm works.

afiori · on April 29, 2020

The reason wasm is like this is that with "normal" control flow chrome would have had to implement itself the relooper algorithm internally before the MVP release.

Overall the main negative point seems to be that compilers toward wasm are more complex, but under the constraints of the web moving complexity from the client looks like a valuable positive point.

Two other negative points would possibly be performance and code size (which is the reason for a if-then-else specific instruction for example) about which I know very little. Do you think it would have made a difference here?

MaxBarraclough · on April 28, 2020

Interesting, thanks for the analysis.

> You can see that older bytecode systems like the JVM are stack machines, but newer ones tend to be register machines.

Like LLVM. Is there a good reason for WebAssembly not being SSA?

klodolph · on April 28, 2020

There’s no good reason for WebAssembly to be SSA. That is a good enough reason for it not to be.

The purpose of SSA is to make it easier to write optimizations. However, you wouldn’t be optimizing WASM anyway, you would necessarily translate it first into some kind of IR suitable for optimization passes. Might as well convert to SSA at that point, rather than bloat the bytecode by exposing implementation details of the target.

Remember that WASM’s purpose is to be portable and safe. Making it more complicated just in order to make sophisticated back-ends slightly faster is a net loss. Using SSA would also make naïve/simple backends slower.

afiori · on April 29, 2020

The control structure as nested block are a must be for wasm, because chrome compilation model of Javascript is incompatible with an LLVM-like goto-based control flow.

saagarjha · on April 28, 2020

Most bytecode isn’t optimized much.

mmastrac · on April 28, 2020

This is super handy. Pseudocode is very useful for understanding flow - so much more than actual assembly. I've always found it an order of magnitude to understand bad asm-to-C decompilation from IDA or Ghidra over perfect disassembly.

dlojudice · on April 28, 2020

> Decompile to what?

> `wasm-decompile` produces output that tries to look like a "very average programming language" while still staying close to the Wasm it represents.

> #1 goal is readability

> #2 goal is to still represent Wasm as 1:1 as possible

It seems AssemblyScript would do the job

[1] https://assemblyscript.org/

Aardappel · on April 28, 2020

AssemblyScript would certainly do worse at #2, and possibly also at #1. To be translate to Wasm or from Wasm lead to different optimal designs, see for example how these two systems deal with loads and stores.

3pt14159 · on April 28, 2020

It would be nice if the decompiled output were runnable through an interpreter so you could step through it with a debugger of some kind and rename or annotate the variables and functions as you reverse engineer what is going on.

Aardappel · on April 28, 2020

I'm the author, if anyone has specific questions :)

6nf · on April 28, 2020

I notice that your code supports the 'name' custom section as expected, and furthermore you support a few other custom sections too - 'dylink' for example. Where did you find the documentation for these sections? The reason I ask is that I don't believe the official webassembly specs talk about those sections, so I guess they are somewhat compiler specific perhaps?

Aardappel · on April 28, 2020

They are indeed not part of the spec since they are somewhat tool specific, for example the linking symbol names so far are only consumed by LLD. Docs here: https://github.com/WebAssembly/tool-conventions/blob/master/...

6nf · on April 29, 2020

Thank you!

One more question please - does this tool support naming of global variables? The official wasm documented 'name' section only supports local variable names I think?

Aardappel · on April 29, 2020

It can pick names from the name section, linking section, or import/export name, in that order of preference (see https://github.com/WebAssembly/wabt/blob/master/docs/decompi...). In the case of globals, the only way to name a global is thus if its imported or exported.

ellis0n · on April 29, 2020

I'm new in wasm code base. Where is export code generation located? How complex to rewrite export? I want to make export wasm to .acpul programming language for run wasm modules on animation cpu platform. Link to architecture schemas & docs will be a great.

hardwaregeek · on April 28, 2020

Loving the tooling around wasm getting better. I've been debugging my compiler output with hexl-mode and reading the binary format and while it's not that bad, it'd be nice to do more advanced debugging with a text format.

There was a project I saw too that intended to visualize WebAssembly's execution. That'd be extremely helpful too

cfallin · on April 28, 2020

> reading the binary format ... it'd be nice to do more advanced debugging with a text format.

Do you know about `wasm2wat` (from the WebAssembly binary toolkit, "WABT")? It produces a 1-to-1 text representation of the bytecode and is meant to always roundtrip via `wat2wasm` back to the same bytecode.

hardwaregeek · on April 28, 2020

Yeah...I should probably use that. But does it work on mangled WASM? Part of the issue was that my compiler wasn't producing valid WASM

cfallin · on April 28, 2020

Ah, no, probably doesn't do parsing recovery. But `wasm-validate` from the same toolkit will at least tell you the offset at which your wasm file has an error (I just flipped some bits in a wasm file to test this), which may be helpful!

irrational · on April 28, 2020

When I first started learning JavaScript in the late 90s, the primary way I learned new things was from reading other peoples code in my browser. Nowadays this isn't as easy since you often have to run obfuscated code through a prettifier to get it back into a human readable format, but it is still possible with some effort. I was concerned that WASM would make this impossible (despite the stated goal of "Be readable and debuggable — WebAssembly is a low-level assembly language, but it does have a human-readable text format (the specification for which is still being finalized) that allows code to be written, viewed, and debugged by hand."), but WASM-decompile gives me hope.

https://developer.mozilla.org/en-US/docs/WebAssembly/Concept...

fowl2 · on April 28, 2020

Can we compile it back to wasm again? ;P

frosted-flakes · on April 28, 2020

No.

> Its #1 goal is readability: help guide readers understand what is in a .wasm with as easy to follow code as possible. Its #2 goal is to still represent Wasm as 1:1 as possible, to not lose its utility as a disassembler. Obviously these two goals are not always unifiable.

> This output is not meant to be an actual programming language and there is currently no way to compile it back into Wasm.

vbezhenar · on April 28, 2020

Actually I thought about implementing a programming language which is a bit more pleasant to work with than raw wat format, but which still translated roughly 1 to 1 to wasm. Something like in this link, actually. But that seems outside of my capabilities and I'm not sure if it's really useful to anyone.

rhencke · on April 28, 2020