# compiler
m
Hi! I've been playing with the compiler backend for a while and, in short, I managed to make it 2.3x faster 🚀 (so far), while also somewhat simplifying the code 🌟! More in 🧵.
👀 10
😮 3
🔥 21
👍🏾 1
👍 4
😱 6
More precisely, I reimplemented the first 13 of the 79 lowering phases of the WASM backend (as it is the simplest backend) and compared them against the originals by 'compiling' the WASM stdlib (it has loads of code and was the easiest to set up). The new phases are semantically equivalent (they produce the same IR dumps as the originals) and their logic is almost 1:1; I mostly just recoded the framework.
Now, one of those first phases is function inlining. This is the single most expensive lowering, and I only sped it up by about 40%, because its work of cloning subtrees cannot be avoided (without restructuring the order of the lowerings). But the other 12 phases run 4x faster 🚀! And because most lowerings behave more like these, I expect the final outcome to shift further toward this factor. OTOH, the initial construction of the IR tree (fir2ir) will likely be slower, but it only happens once per compilation.
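To illustrate why inlining resists this kind of optimization: each call site needs its own deep copy of the callee's body, so the work scales with the amount of inlined code. A made-up, heavily simplified sketch (not the actual compiler classes; parameter substitution omitted):
```kotlin
// Hypothetical, heavily simplified IR classes -- not the real compiler code.
sealed class Element {
    abstract fun deepCopy(): Element
}

class Const(val value: Int) : Element() {
    override fun deepCopy() = Const(value)
}

class Block(val statements: List<Element>) : Element() {
    override fun deepCopy() = Block(statements.map { it.deepCopy() })
}

class FunctionDecl(val name: String, val body: Block)

// Inlining has to clone the callee body once per call site. This allocation
// and copying work is inherent to the transformation, which is why the
// inlining phase only got ~40% faster while the others got 4x.
fun inlineAt(callee: FunctionDecl): Element = callee.body.deepCopy()
```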
Still, I see at least 3 more sophisticated ways to significantly optimize the compilation. I'm afraid I won't have the free time to try them out, however.
e
What’s the trick?
m
A couple of things; I didn't measure each one separately though. The most important ones, I think, are (rough sketch below):
• A proper tree management API
• Including automatically stored back-references (bi-directional graph)
• A per-class element cache (although ideally it should be per-feature)
• Reduced indirection of symbols
• Much simplified tree-walking code
A quite interesting, though in the end I think less important, part is that the elements are connected as a linked list instead of being stored in arrays.
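To make the first two points and the linked-list bit concrete, here's a minimal sketch of what such a tree management API could look like. All names here are made up; this is not the actual BIR code:
```kotlin
// Minimal sketch of a bi-directional tree with intrusive sibling links.
open class BirElement {
    var parent: BirElement? = null
        private set
    var next: BirElement? = null   // link to the next sibling (intrusive linked list)
        private set

    private var firstChild: BirElement? = null
    private var lastChild: BirElement? = null

    // The tree management API: attaching a child wires the parent
    // back-reference and the sibling link in one place, so no lowering
    // can forget to keep them in sync.
    fun addChild(child: BirElement) {
        require(child.parent == null) { "element is already attached" }
        child.parent = this
        lastChild?.next = child
        if (firstChild == null) firstChild = child
        lastChild = child
    }

    // Walking children is a plain pointer chase -- no array, no iterator.
    fun forEachChild(action: (BirElement) -> Unit) {
        var current = firstChild
        while (current != null) {
            action(current)
            current = current.next
        }
    }
}
```
The point of funneling every attach through one method is that the back-references can never go stale.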
e
Can you share a link to the resulting code (and the diffs)?
m
Yes, I will share it once I remove some hardcoded local paths and similar things. There isn't much of a diff though, because I've added new modules instead of changing the current ones.
e
Did you patch the existing IR tree structures, or create new ones and convert between them?
m
The latter: I cloned the IR tree code and applied some minor changes, then created a converter from the current IR to the new one. (Actually, I generated it thanks to the tree-generator module.)
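Roughly, the generated converter looks like this in spirit (a self-contained toy with simplified placeholder classes, not the real IR elements):
```kotlin
// Illustrative old IR (source side of the conversion).
sealed class IrElement
class IrConst(val value: Int) : IrElement()
class IrBlock(val statements: List<IrElement>) : IrElement()

// Illustrative new IR (the BIR side).
sealed class BirElement
class BirConst(val value: Int) : BirElement()
class BirBlock(val statements: List<BirElement>) : BirElement()

class Ir2BirConverter {
    // Cache converted elements so shared references stay shared in the new tree.
    private val mapped = HashMap<IrElement, BirElement>()

    // One `when` branch per element class -- exactly the kind of repetitive
    // code that is easier to generate than to write by hand.
    fun convert(element: IrElement): BirElement = mapped.getOrPut(element) {
        when (element) {
            is IrConst -> BirConst(element.value)
            is IrBlock -> BirBlock(element.statements.map { convert(it) })
        }
    }
}
```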
b
Promising numbers 👍 I'm looking forward to seeing the changes. It would also be interesting to measure the change in memory footprint.
m
As to memory usage: I've added some bits here and removed some there, so the net outcome should be close to zero. Reducing peak memory for good would require a quite different approach.
Of course it's often hacky and there are quite a few redundant pieces. Overall, I copy-pasted the whole of compiler/ir/ir.tree and other relevant parts. My copy of the IR is named BIR, as in Backend IR, but it could also mean IR version B.
Have you tried it out? WDYT?
u
Hey @mcpiroman, we've started an internal discussion about your changes and we need some time to figure out the next steps. Obviously the improvement seems very promising, so eventually we'd like to incorporate your changes fully or partially, but right now we're still reviewing your implementation and thinking of a proper process to work on it further, so that you wouldn't end up working on something which is unmergeable. I hope to come back in ~2 weeks with some results and continue the discussion.
By the way, what other optimizations do you mean here?
> Still, I see at least 3 more sophisticated ways to significantly optimize the compilation.
m
Hi. What I meant were more general approaches that can be taken, rather than specific optimizations. With the implementation I presented, I only explored what can easily be improved without moving too many things around, and it doesn't address many current issues.

For example, memory usage has been mentioned. The IR still keeps the whole code of a module in memory at once, so, while not without some pain, the footprint can probably be reduced by a few, maybe a few dozen, percent, but not more. To actually make a difference, the IR has to be somehow sliced and processed in chunks.

Secondly, while I reduced it quite a bit, most of the time is still spent on searching the tree for a transformation opportunity rather than on performing it (except for inlining), likely also because of data locality. This, along with optimizing for the code cache, is a broad topic, but it can be improved in many ways: from replacing the by-class cache with a by-feature one, to preventing each phase from traversing the whole tree, to specifically applying code to data that is already local (a toy sketch of the by-feature idea follows below).

Then comes multithreading. It is even harder, if not impossible, to do with what I changed, but it has a strong potential to come quite naturally along with some other approach to lowering. And then, not least important, comes integration with the outside world.

Wow, that's long. I would even suggest that importing these changes as they are is not worth it, at least not without first considering those new approaches, lest everything be rewritten twice. I also think it'd be better to do something like what has been done with FIR: fork the backend code and keep both implementations for a while, so that more experiments can be made without affecting the existing code. I have more concrete ideas about how it can be done and, as discussed with @Ilmir Usmanov [JB], I might have a chance to discuss or implement them further.
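To give a toy example of the by-class vs. by-feature cache idea (entirely hypothetical code, not from my prototype):
```kotlin
// Toy illustration of a per-feature element cache. Instead of indexing
// elements by class (e.g. "all calls"), each phase registers the exact
// condition it cares about, and the index keeps those buckets up to date
// as elements are attached to the tree -- so a phase visits only the
// elements it will actually transform, instead of walking the whole tree.
// (Removal/invalidation handling is omitted for brevity.)
class IrNode(val kind: String, val isInlineCall: Boolean = false)

class FeatureIndex {
    private val features =
        mutableListOf<Pair<(IrNode) -> Boolean, MutableList<IrNode>>>()

    // Called by a phase up front: "give me my working set".
    fun registerFeature(predicate: (IrNode) -> Boolean): List<IrNode> {
        val bucket = mutableListOf<IrNode>()
        features += predicate to bucket
        return bucket
    }

    // Called once whenever an element is attached to the tree.
    fun onAttached(node: IrNode) {
        for ((predicate, bucket) in features) {
            if (predicate(node)) bucket += node
        }
    }
}

fun main() {
    val index = FeatureIndex()
    // The inlining phase asks only for calls to inline functions:
    val inlineCalls = index.registerFeature { it.isInlineCall }

    index.onAttached(IrNode("call", isInlineCall = true))
    index.onAttached(IrNode("constant"))

    println(inlineCalls.size) // prints 1 -- the phase never sees the constant
}
```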
a
I’m curious if anything ever came out of these changes