Hi. What I meant were more general approaches that could be taken, rather than point optimizations. With the implementation I presented I only explored what could easily be improved without moving too many things around; it still doesn't address many of the current issues.

For example, memory usage has been mentioned. The IR still keeps the whole code of a module in memory at once, so, while not without some pain, the footprint can probably be reduced by a few, maybe a few dozen, percent, but not more. To actually make a difference, the IR has to be somehow sliced and processed in chunks.

Secondly, while I reduced it quite a bit, most of the time is still spent searching the tree for an opportunity for a transformation rather than performing it (except for inlining), likely also because of data locality. This, along with optimizing for the code cache, is a broad topic, but it can be improved upon in several ways: replacing the by-class cache with a by-feature cache, preventing each phase from traversing the whole tree, or specifically applying code to data that is already local.

Then comes multithreading. It is even harder, if not impossible, to do with what I changed, but it has strong potential to come quite naturally along with some other approach to lowering. And then, not least important, comes integration with the outside world. Wow, that's long.
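To illustrate the by-feature cache idea, here is a minimal sketch (all names are hypothetical, not the real compiler API): one traversal up front records which nodes exhibit which feature, so a lowering phase can jump straight to its candidates instead of each phase re-walking the whole tree.

```java
import java.util.*;

// Hypothetical features a lowering phase might look for.
enum Feature { HAS_INLINE_CALL, HAS_LOCAL_CLASS }

// Toy stand-in for an IR tree node.
class IrNode {
    final String name;
    final Set<Feature> features;
    final List<IrNode> children;
    IrNode(String name, Set<Feature> features, List<IrNode> children) {
        this.name = name;
        this.features = features;
        this.children = children;
    }
}

class FeatureCache {
    private final Map<Feature, List<IrNode>> index = new EnumMap<>(Feature.class);

    // A single full traversal populates the index for every feature at once,
    // so individual phases no longer need to scan the whole tree themselves.
    void build(IrNode root) {
        for (Feature f : root.features)
            index.computeIfAbsent(f, k -> new ArrayList<>()).add(root);
        for (IrNode child : root.children) build(child);
    }

    // A phase asks only for the nodes it can actually transform.
    List<IrNode> nodesWith(Feature f) {
        return index.getOrDefault(f, Collections.emptyList());
    }
}
```

A real implementation would of course have to keep the index valid as transformations create and delete nodes, which is where most of the complexity would live; this only shows the access pattern.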
I would even suggest that importing the changes I sent, just as they are, is not worth it, at least not without first considering those new approaches, to avoid rewriting everything twice. I also think it would be better to do something like what was done with FIR: fork the backend code and keep both implementations for a while, so that more experiments can be made without affecting the existing code.
I have more concrete ideas about how it can be done and, as discussed with @Ilmir Usmanov [JB], I might have a chance to discuss or implement them further.