# kotlin-native
d
Hey guys, regarding Kotlin/Native runtime performance:
👀 4
❓ 1
I have checked the generated disassembled code and noticed that all the functions and property accessors include an `EnterFrame`/`LeaveFrame` call; that's the hottest performance issue right now, I guess. On Windows, at least, that ends up calling `__emutls_get_address` to get a thread-local variable holding the head node for the stack trace with the strict memory model. From what I have noticed, that is used for stack traces and GC. Maybe you already have plans for that, or I have missed something, but have you considered performing conservative stack processing (scanning all values from the start of the stack to the end, including stuff that might be numbers)? Stack traces could then be generated by walking down function frames, or with a library like https://github.com/certik/stacktrace. It seems that C++23 will include a [std::basic_stacktrace](https://en.cppreference.com/w/cpp/utility/basic_stacktrace) too. Even if that's not possible, if the stack traces are not required for GC, you could expose an annotation like `@NoStackTrace` to allow users to mark functions/property accessors so they don't include `EnterFrame`/`LeaveFrame` on critical paths.
l
That gets me thinking: maybe these `EnterFrame`/`LeaveFrame` calls could be removed when the compiler can prove the given function or property accessor/setter cannot throw an exception?
s
Generally, this is a known issue.
all the functions and property accessors include an `EnterFrame`/`LeaveFrame` call,
Not all. Also, it might be different for release and debug binaries. Finally, latest releases forcibly inline some of the accessors (without redundant “frame management”). Please recheck. Also, providing a particular example would help.
From what I have noticed, that is used for stack traces and GC.
Only for GC, IIRC.
performing conservative stack processing (scanning all values from the start of the stack to the end, including stuff that might be numbers)?
We don’t plan to scan stacks conservatively. This approach has a lot of drawbacks and negative consequences. For example, the runtime would need to distinguish arbitrary bits from valid pointers to objects, and this requires complicated infrastructure in the runtime.
Even if that’s not possible, if the stack traces are not required for GC, you could expose an annotation like `@NoStackTrace` to allow users to mark functions/property accessors so they don’t include `EnterFrame`/`LeaveFrame` on critical paths.
It might be possible to implement something like this, but such an annotation should also carefully disable GC while a thread executes such a function (or anything it calls).
That gets me thinking: maybe these `EnterFrame`/`LeaveFrame` calls could be removed when the compiler can prove the given function or property accessor/setter cannot throw an exception?
Yes, and make GC impossible during non-throwing functions, right?
d
@svyatoslav.scherbina I see. Regarding conservative scanning: I tried to implement it and somehow achieved it (though not for Kotlin/Native) some time ago. I'm not an expert myself, so it might not be state of the art. To avoid requiring pointer tricks, what I did was make determining whether a value is a valid pointer as fast as possible (values on the stack should already be aligned to 8 or 4 bytes, depending on the platform's pointer size). In my case I used a custom allocator. That allocator pre-allocated "chunks", maybe 8 MB each or so, and stored objects there. I also stored the minimum and maximum address of any allocated object. So the first filter for determining whether something was an allocated object was a check against the minimum and maximum address; on 64-bit that already filters out most numeric values. Then I checked that the value fell in the range of one of the allocated chunks, which was a bit more costly, but not too much. Then I believe I stored something like a magic value to determine that there was an allocated object at that address, and in general that was pretty cheap. For the allocator I also exploited the fact of knowing the size of the object. I was not supporting memory compaction, but this didn't require anything special like bit tricks. Not sure if that approach could fit your case or inspire you somehow. But since I believe that part is critical, and probably the major performance issue Kotlin/Native has right now, I hope you can find something viable that reduces overhead there. In my case, being able to annotate some methods that do not throw, and maybe even don't reference anything extra on the stack, could help, even if that's unsafe.
For example:
• This single call: https://github.com/korlibs/korge-next/blob/0b35d93836ac9af01592083577cb6fb7d39e6b28/samples/instanced-rendering/src/commonMain/kotlin/main.kt#L[…]79
• https://github.com/korlibs/korge-next/blob/0b35d93836ac9af01592083577cb6fb7d39e6b28/samples/instanced-rendering/src/commonMain/kotlin/main.kt#L[…]29
• https://github.com/korlibs/korge-next/blob/0b35d93836ac9af01592083577cb6fb7d39e6b28/kmem/src/nativeCommonMain/kotlin/com/soywiz/kmem/BufferNative.kt
These end up entering/exiting several frames, and they are not going to throw. That's a benchmark and it is a critical path. In fact, that benchmark can produce 800K sprites on the JVM at 144fps, but on Windows Kotlin/Native it is already working at maybe 20fps with 20K sprites. It works better on macOS and Linux, but still much worse than on the JVM. So even a temporary solution, like marking functions with something like `@ExperimentalKotlinNativeNoStackNoThrow`, would already be worth it.
In case it helps as a reference, this is where I'm performing the stack scanning: https://github.com/jtransc/jtransc/blob/960f6bf22cbb0a6be830bf2622385995b20a85b8/jtransc-rt/resources/cpp/GC.cpp#L501 I got the idea for tracing here: https://chromium.googlesource.com/chromium/src/+/master/third_party/blink/renderer/platform/heap/BlinkGCAPIReference.md Maybe using Oilpan directly could work? I wanted something simpler (a single file) and to learn about GCs, so I didn't use it directly. But I believe Oilpan is OSS. The good part of that one is that you can integrate it with C++, and it probably already uses a pretty fast allocator. It is precise for heap references and conservative for the stack. It doesn't require pointer tricks either, as far as I can tell.
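To illustrate the filtering idea described above, here is a rough Kotlin sketch. All names are hypothetical, and raw addresses are simulated as plain `Long` values rather than taken from a real allocator; the point is only the ordering of cheap checks before costly ones:

```kotlin
// Simulated conservative pointer filter. `chunks` stand in for pre-allocated
// allocator regions; `minAddr`/`maxAddr` bound every allocation ever made.
class ChunkFilter(private val chunks: List<LongRange>) {
    private val minAddr = chunks.minOf { it.first }
    private val maxAddr = chunks.maxOf { it.last }

    // Cheapest checks first: alignment, then the global [min, max] range
    // (which rejects most integers on 64-bit), then the per-chunk scan.
    fun mightBePointer(value: Long): Boolean {
        if (value % 8L != 0L) return false                    // not 8-byte aligned
        if (value < minAddr || value > maxAddr) return false  // outside all allocations
        return chunks.any { value in it }                     // inside a real chunk?
    }
}

fun main() {
    val filter = ChunkFilter(listOf(0x10000L..0x17FFFL, 0x40000L..0x47FFFL))
    println(filter.mightBePointer(0x10008L)) // aligned, inside a chunk -> true
    println(filter.mightBePointer(42L))      // small integer -> false
    println(filter.mightBePointer(0x20000L)) // aligned, but between chunks -> false
}
```

A real implementation would follow this with the magic-value check on the candidate object header, which the sketch omits.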
s
I tried to implement it and somehow achieved it (though not for Kotlin/Native) some time ago.
Thank you for the detailed writeup! Your approach sounds pretty much like what I mentioned above:
For example, runtime would need to distinguish arbitrary bits from valid pointers to objects, and this requires complicated infrastructure in runtime.
Regarding the performance:
Ends entering/exiting several frames.
Have you checked this on debug or release binaries?
d
I see, I thought you meant using some bits of the pointers on 64-bit to encode extra information, which is not the case for Oilpan. In the end that approach only requires a method to visit fields; everything else (allocation and GC, I guess) is handled by the library. In theory that wouldn't require extra effort from your side, since that's handled by Oilpan, as long as you extend the exposed class and define a `Trace` method that visits all the object-reference fields. I believe I explored the generated code in Debug, but fps in Release in this case only increased slightly on Windows, and it is still a few orders of magnitude slower than the JVM. You can try it yourself:
```
git clone https://github.com/korlibs/korge-next.git
gradlew :samples:instanced-rendering:runJvm
gradlew :samples:instanced-rendering:runNativeRelease
```
Please try on Windows; the performance is especially bad there. `EnterFrame`/`LeaveFrame` ends up adding overhead, and on Windows it seems to be especially bad because it uses `__emutls_get_address` instead of a plain register. But even with native TLS support, `EnterFrame` adds overhead that ideally shouldn't be there, especially when some methods could otherwise be inlined down to a memory indirection, which is probably what happens on the JVM; maybe LLVM couldn't inline them because of the `EnterFrame`/`LeaveFrame`, and that could prevent further optimizations too. Here you can see screenshots showing >100 fps with 800K sprites on the JVM (GPU-bound), while on K/N Windows in Release with 50K sprites it is running at 11fps (the screenshot is from the Release version), which is at least two orders of magnitude slower. I used IDA Free for the disassembler screenshots: https://www.hex-rays.com/products/ida/support/download_freeware/ You can see that `__emutls_get_address` does extra work with mutexes and such.
s
I believe I explored the generated code in Debug
I’m not sure this makes a lot of sense. Please recheck the generated code in Release for the functions you mentioned above. Inspecting generated LLVM IR might be easier than for machine code.
The `-Xprint-bitcode` compiler flag prints the code before LLVM optimizations (including LLVM inlining). This way it might be easier to check for `EnterFrame`/`LeaveFrame` presence.
d
`EnterFrame` is inlined, but again `__emutls_get_address` is called two times per method (for the enter and the leave), and those internal method calls are not inlined even though they are final, probably because the `EnterFrame`/`LeaveFrame` grows the function and the compiler decides not to inline it. In C++ and on the JVM, the `getByteIndex` from the screenshot would be inlined, and with constant propagation it could even become an integer constant or a couple of inlined operations, but I guess LLVM is not inlining it because of the `EnterFrame`. It also seems that `Float32Buffer.SIZE`, which is a `const val`, is initializing the singleton instead of using the value computed at compile time, though that's a separate issue. In the end there are a lot of places where `__emutls_get_address` is called for that code. What I'm doing is pretty inlineable by a compiler and should be converted into a few instructions reading/writing memory, and even when not inlined it normally wouldn't have that overhead. Also, a function that is performing arithmetic with its parameters shouldn't have the overhead of reading/writing memory at all. I'm aiming to access those properties something like 60 million times per second, so that overhead impacts the experience a lot. That's why I believe this is important to address, since it looks like the major performance issue is this one. I believe K/N could have similar performance to Java, or better with LLVM's advanced optimizations, if you manage to remove the `EnterFrame` and `LeaveFrame`.
s
I see. Please recheck with 1.5.0, it includes relevant improvements.
It also seems that the `Float32Buffer.SIZE` that is a `const val` is initializing the singleton, instead of using the computed value at compile time, though that’s a separate issue.
Yes. Please report this to YouTrack.
Also a function that is performing arithmetic with parameters shouldn’t have the overhead of writing/reading memory at all.
AFAIK, it doesn’t. Functions you examine in the screenshots don’t just perform arithmetic operations with parameters. At least they read fields from objects. If you have examples of functions that only perform arithmetic operations directly with parameters and have `EnterFrame`/`LeaveFrame`, please provide them.
I believe K/N could have similar performance to Java, or better with LLVM’s advanced optimizations, if you manage to remove the `EnterFrame` and `LeaveFrame`.
We don’t have short-term plans for removing the frame management completely. And the cases you’ve shown can be fixed with local optimizations in the compiler. Please make small self-contained reproducers for the remaining issues and report them to YouTrack; that would be a great way to contribute. I feel it’s worth repeating: conservative stack scanning is not a silver bullet. Yes, it would allow getting rid of explicit frame management (which might improve the performance in some cases), but it would definitely bring a lot of other issues, including performance ones.
d
Then there are cases where you are not adding `EnterFrame`/`LeaveFrame`? I still don't know how you are doing the precise stack scanning; I didn't dig into the implementation too much. Do you think it would be possible to omit `EnterFrame`/`LeaveFrame` in additional functions? For example, if the function is not allocating, and it is only accessing `this` and val fields/properties in `this`, then `this` should already be on a stack frame from previous function calls, since someone called it, and thus nobody should be able to deallocate any referenced objects from any children or descendants, and being val they won't be able to be mutated later. All the critical paths I use don't allocate and only use `this` and properties referenced in `this`, and if the `companion object { const val }` is optimized and doesn't require initializing the singleton, I believe the combination of both improvements would make most of my code much faster. This code:
```kotlin
class FSprites(maxSize: Int) {
  val data = FBuffer(maxSize * FSprites.STRIDE * 4)
  private val f32 = data.f32

  // These property accessors shouldn't require EnterFrame/LeaveFrame: `this` is already
  // referenced from the caller, and nobody can change `f32` since it is a val already
  // referenced by `this`, so it won't be deallocated.
  var FSprite.x: Float
    get() = f32[offset + 0]
    set(value) { f32[offset + 0] = value }
}
```
s
Then there are cases where you are not adding EnterFrame/LeaveFrame?
Yes. For example, code that doesn’t do anything with object references shouldn’t have these calls.
For example, if the function is not allocating, and it is only accessing `this` and val fields/properties in `this`, then `this` should already be on a stack frame from previous function calls, since someone called it, and thus nobody should be able to deallocate any referenced objects from any children or descendants, and being val they won’t be able to be mutated later.
This should already work similarly in 1.5.0, at least for release binaries. Please recheck on 1.5.0.
This code:
Having a self-contained reproducer would help.
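A self-contained reproducer in this spirit might look like the following Kotlin sketch. The names and numbers are illustrative (not the actual korge benchmark); it just exercises non-allocating property accessors on `this` in a hot loop, which is the shape of code to inspect for `EnterFrame`/`LeaveFrame` in the generated output:

```kotlin
import kotlin.system.measureNanoTime

// Accessor-heavy hot loop: reads/writes a FloatArray through extension
// property accessors, without allocating inside the loop.
class Positions(size: Int) {
    private val xs = FloatArray(size)

    // Indexed access dressed up as a property, like the FSprite.x accessor above.
    var Int.x: Float
        get() = xs[this]
        set(value) { xs[this] = value }

    fun step(dx: Float) {
        for (i in xs.indices) { i.x += dx }
    }
}

fun main() {
    val p = Positions(100_000)
    val ns = measureNanoTime { repeat(1_000) { p.step(1f) } }
    println("took ${ns / 1_000_000} ms")
}
```

Compiling this in Release and checking whether `step` still touches the frame-management calls would isolate the accessor overhead from the rest of the engine.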
d
@svyatoslav.scherbina I have created this repo: https://github.com/korlibs/kotlin-native-performance-experiment (check the readme for results) This version should be as fast as the inlined version: https://github.com/korlibs/kotlin-native-performance-experiment/blob/b30901e989e7b[…]88f1eeb184979b967a48a9/src/nativeMain/kotlin/noninlined/code.kt Indeed, it seems that `EnterFrame`/`LeaveFrame` is removed in a few more cases in 1.5.0! 🎉 Though accessing a `const val` from a companion initializes the companion, and thus includes the `EnterFrame`/`LeaveFrame`. I have tried with 1.5.20-dev-5655 with the same results. Going to create an issue about that. I'm going to temporarily manually inline all `const val`s in critical places in my code, and will use them again when that issue is resolved 👍. Still, the code is 2x slower than when inlining, but at least not 70x slower, which was the case on Windows with 1.4.32. Nice job!
🙂 1
👍 1
I have managed to reduce the overhead of the `currentFrame` on Windows by 4x. I have created a PR here: https://github.com/JetBrains/kotlin/pull/4339
s
Thanks! We will take a look.
👍 1
though accessing a `const val` from a companion initializes the companion, and thus includes the `EnterFrame`/`LeaveFrame`.
Btw, moving `const val` from `companion object` to file level should help. This might make manual inlining unnecessary.
🙏 1
👍 1
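A minimal sketch of that suggestion (names are hypothetical; assumes the pre-fix behavior where a companion-object access initializes the singleton):

```kotlin
// Before: reading STRIDE goes through the companion object, which must be
// initialized at the use site (bringing frame management along with it).
class FSpritesOld {
    companion object { const val STRIDE = 8 }
}

// After: a file-level const val is a plain compile-time constant,
// so no singleton initialization is needed where it is used.
const val STRIDE = 8

fun byteSize(maxSize: Int): Int = maxSize * STRIDE * 4
```

The constant's value is identical either way; only the declaration site changes, which is why this can substitute for manual inlining.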
d
Hey @svyatoslav.scherbina, how are you doing? Do you know if there is anything I can do to get this reviewed, https://github.com/JetBrains/kotlin/pull/4339, or is it just a matter of waiting? That patch helps a lot in terms of performance for my game engine on Windows, and in my case games are usually released for Windows, so it would be super nice if this could be integrated in 1.5.x at some point.
s
Hi. We’ve discovered that removing `-femulated-tls` helps a lot. But we haven’t yet investigated if it is safe. Could you please check if your pull request has any effect if `-femulated-tls` is removed? Also, please note that you can remove the flag without patching K/N, by using `-Xoverride-konan-properties`. So a workaround is already available.
👀 1
d
Will check. Thanks! According to the documentation, `TlsGetValue` is available on UWP: https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-tlsgetvalue I'll check whether the generated assembly without the emulated-tls flag uses registers directly or something.
For the record: when not using emulated TLS, it uses the `gs` segment register, which is per-thread. It should be as cheap as on Unix with that. Maybe on UWP that register is used somehow, or used differently, and that causes memory issues? `TlsGetValue` and company should reduce the overhead there a bit while staying compatible with UWP. But looking at the code, it looks like the performance is going to be better without `-femulated-tls`, so I'm going to go that way for now 👍 Thanks for pointing all this out!
👍 1