Deactivated User
04/25/2021, 11:26 AM
[...] all the functions and property accessors include an `EnterFrame`/`LeaveFrame` call, that's the hottest performance issue right now I guess.
On Windows at least, that ends up calling emutls to get a thread-local variable holding the head node for the stacktrace with the strict memory model.
From what I have noticed, that is used for stacktraces and GC.
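To make the mechanism being discussed concrete, here is a minimal sketch of explicit frame management. It is only a model under stated assumptions: `Frame`, `ShadowStack`, `enterFrame`, `leaveFrame` and `lengthOf` are hypothetical names made up for illustration, not the Kotlin/Native runtime API, and a single plain variable stands in for the per-thread head node.

```kotlin
// Hypothetical model of explicit frame management; NOT the Kotlin/Native runtime API.
class Frame(val previous: Frame?, slotCount: Int) {
    // Slots hold the object references that are live in this function call.
    val slots = arrayOfNulls<Any>(slotCount)
}

object ShadowStack {
    // Assumption: one head node. A real runtime would keep one per thread,
    // which is where the thread-local lookup cost comes from.
    private var head: Frame? = null

    fun enterFrame(slotCount: Int): Frame {
        val frame = Frame(head, slotCount)
        head = frame
        return frame
    }

    fun leaveFrame(frame: Frame) {
        check(head === frame) { "frames must be left in LIFO order" }
        head = frame.previous
    }

    // What a precise collector could do: walk the chain and collect every non-null slot.
    fun roots(): List<Any> {
        val result = mutableListOf<Any>()
        var frame = head
        while (frame != null) {
            frame.slots.filterNotNullTo(result)
            frame = frame.previous
        }
        return result
    }
}

// A function instrumented by hand the way a compiler might instrument it.
fun lengthOf(text: String): Int {
    val frame = ShadowStack.enterFrame(slotCount = 1)
    try {
        frame.slots[0] = text // register the reference so a collector can see it
        return text.length
    } finally {
        ShadowStack.leaveFrame(frame)
    }
}
```

Every instrumented call pays for one push and one pop of the head node, which is why the cost of reaching that head node matters so much on hot paths.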
Maybe you already have plans for that or I have missed something, but have you considered performing conservative stack processing (scanning all values from the start of the stack to the end, including stuff that might be numbers)?
Then generate stacktraces by walking down the function frames, or use a library like https://github.com/certik/stacktrace for stacktrace generation?
It seems that C++23 will include a [std::basic_stacktrace](https://en.cppreference.com/w/cpp/utility/basic_stacktrace) too.
Even if that's not possible, if the stacktraces are not required for GC you could expose an annotation like `@NoStackTrace` to allow users to mark functions/property accessors so that they don't include the `EnterFrame`/`LeaveFrame` on critical paths.
louiscad
04/26/2021, 6:43 PM
svyatoslav.scherbina
04/27/2021, 2:36 PM
> all the functions and property accessors include a `EnterFrame`/`LeaveFrame` call,
Not all. Also, it might be different for release and debug binaries. Finally, latest releases forcibly inline some of the accessors (without redundant “frame management”). Please recheck. Also, providing a particular example would help.
> For what I have noticed that is used for stacktraces and GC.
Only for GC, IIRC.
> performing a conservative stack processing (scan all values from the start of the stack to the end, including stuff that might be numbers)?
We don’t plan to scan stacks conservatively. This approach has a lot of drawbacks and negative consequences. For example, runtime would need to distinguish arbitrary bits from valid pointers to objects, and this requires complicated infrastructure in runtime.
> Even if that’s not possible if the stacktraces are not required for GC, you could expose an annotation like `@NoStackTrace` to allow users to mark functions/property accessors to not include the EnterFrame/LeaveFrame for critical paths.
It might be possible to implement something like this, but such an annotation should also carefully disable GC while a thread executes such a function (or anything it calls).
> That gets me thinking, maybe these `EnterFrame`/`LeaveFrame` could be removed when the compiler can prove the given function or property accessor/setter cannot throw an exception?
Yes, and make GC impossible during non-throwing functions, right?
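The drawback mentioned above (telling arbitrary bits from valid pointers) can be illustrated with a toy model of conservative scanning. This is a sketch under loud assumptions: addresses are plain `Long` values, the heap is a short list of ranges, and `HeapObject` and `conservativeRoots` are hypothetical names that do not correspond to any real runtime.

```kotlin
// Hypothetical model: addresses are plain Long values, the heap is a list of ranges.
data class HeapObject(val start: Long, val size: Long) {
    operator fun contains(word: Long): Boolean = word >= start && word < start + size
}

// Conservative scan: every stack word that falls inside an allocated object is
// treated as a reference, because the scanner cannot tell pointers from integers.
// Note that it needs an address-to-object lookup over the whole heap to work at all.
fun conservativeRoots(stackWords: List<Long>, heap: List<HeapObject>): Set<HeapObject> =
    stackWords.mapNotNull { word -> heap.firstOrNull { word in it } }.toSet()

fun main() {
    val live = HeapObject(start = 0x1000L, size = 32L)
    val garbage = HeapObject(start = 0x2000L, size = 32L)
    val stack = listOf(
        0x1000L, // a real pointer to `live`
        42L,     // an ordinary number, outside any object
        0x2008L, // an integer that merely looks like an address inside `garbage`
    )
    val roots = conservativeRoots(stack, listOf(live, garbage))
    println(roots.size) // 2: `garbage` is retained by mistake
}
```

Even in this toy form the scan needs a way to map any word to the object containing it, which hints at the runtime infrastructure the reply refers to.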
Deactivated User
04/28/2021, 7:05 PM
`@ExperimentalKotlinNativeNoStackNoThrow` would be worth already.
Deactivated User
04/28/2021, 7:33 PM
svyatoslav.scherbina
04/29/2021, 7:09 AM
> I tried to implement it and somehow achieved it (though not for Kotlin/Native) some time ago.
Thank you for the detailed writeup! Your approach sounds pretty much like what I mentioned above:
> For example, runtime would need to distinguish arbitrary bits from valid pointers to objects, and this requires complicated infrastructure in runtime.
Regarding the performance:
> Ends entering/exiting several frames.
Have you checked this on debug or release binaries?
Deactivated User
04/29/2021, 10:07 AM
git clone https://github.com/korlibs/korge-next.git
gradlew :samples:instanced-rendering:runJvm
gradlew :samples:instanced-rendering:runNativeRelease
Please try on Windows. The performance is especially bad there. `EnterFrame`/`LeaveFrame` ends up adding an overhead, and on Windows it seems to be especially bad because it uses `_emutls` instead of a plain register. But even with proper TLS support, `EnterFrame` adds an overhead that ideally shouldn't be there, especially when some methods could be inlined since they are just a memory indirection, which is probably what happens on the JVM. Maybe LLVM couldn't inline them because of the `EnterFrame`/`LeaveFrame`, and that could prevent further optimizations too.
Here you can see screenshots showing >100 fps with 800k sprites on JVM (GPU-bound), while K/N on Windows in release with 50k sprites runs at 11 fps (the screenshot is from the release version), which is at least two orders of magnitude slower.
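As a quick check of the "two orders of magnitude" figure, the numbers above can be turned into sprites processed per second. This is only back-of-the-envelope arithmetic: ">100 fps" is taken as exactly 100, so the result is a lower bound, and `spritesPerSecond` is a helper name made up for this sketch.

```kotlin
// Figures come from the message above; ">100 fps" is assumed to be exactly 100.
fun spritesPerSecond(sprites: Int, fps: Int): Long = sprites.toLong() * fps

fun main() {
    val jvm = spritesPerSecond(sprites = 800_000, fps = 100)          // 80,000,000
    val nativeWindows = spritesPerSecond(sprites = 50_000, fps = 11)  // 550,000
    println(jvm.toDouble() / nativeWindows) // roughly 145x
}
```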
I used IDA Free for the disassembler screenshots: https://www.hex-rays.com/products/ida/support/download_freeware/
You can see that `_emutls` does extra work with mutexes and so on.
svyatoslav.scherbina
04/29/2021, 12:23 PM
> I believe I explored generated code in Debug
I’m not sure this makes a lot of sense. Please recheck the generated code in Release for the functions you mentioned above. Inspecting generated LLVM IR might be easier than for machine code. The `-Xprint-bitcode` compiler flag prints the code before LLVM optimizations (including LLVM inlining). This way it might be easier to check for `EnterFrame`/`LeaveFrame` presence.
Deactivated User
04/29/2021, 1:49 PM
`EnterFrame` is inlined, but again `__emutls_get_address` is called two times per method (for the enter and the leave), and those internal method calls are not inlined even if they are final, probably because the `EnterFrame`/`LeaveFrame` code grows the function and the compiler decides not to inline it.
In C++ and on the JVM, the `getByteIndex` from the screenshot would be inlined, and with constant propagation it could even become an integer or a couple of inlined operations, but I guess LLVM is not inlining it because of the `EnterFrame`. It also seems that the `Float32Buffer.SIZE` that is a `const val` is initializing the singleton, instead of using the value computed at compile time, though that's a separate issue. But in the end there are a lot of places where `__emutls_get_address` is called for that code.
What I'm doing is pretty inlineable by a compiler and should be converted into a few instructions reading/writing memory, and even if it is not inlined, it normally wouldn't have that overhead. Also, a function that is performing arithmetic with parameters shouldn't have the overhead of writing/reading memory at all. In the end I'm aiming to access those properties around 60 million times per second, so that overhead is impacting the experience a lot.
And that's why I believe this is important to address, since it looks like this is the major performance issue. I believe K/N could have similar performance to Java, or better, with LLVM advanced optimizations if you manage to remove the `EnterFrame` and `LeaveFrame`.
svyatoslav.scherbina
04/29/2021, 3:26 PM
> It also seems that the `Float32Buffer.SIZE` that is a `const val` is initializing the singleton, instead of using the computed value at compile time, though that’s a separate issue.
Yes. Please report this to YouTrack.
> Also a function that is performing arithmetic with parameters shouldn’t have the overhead of writing/reading memory at all.
AFAIK, it doesn’t. Functions you examine in the screenshots don’t just perform arithmetic operations with parameters. At least they read fields from objects. If you have examples of functions that only perform arithmetic operations directly with parameters and have `EnterFrame`/`LeaveFrame`, please provide them.
> I believe K/N could have a similar performance to Java or better with LLVM advance optimizations if you manage to remove the `EnterFrame` and `LeaveFrame`
We don’t have short-term plans for removing the frame management completely. And the cases you’ve shown can be fixed with local optimizations in the compiler. Please make small self-contained reproducers for the remaining issues, and report them to YouTrack. That would be a great way to contribute. I feel it’s worth repeating: conservative stack scanning is not a silver bullet. Yes, it would allow getting rid of explicit frame management (which might improve the performance in some cases) but it definitely would bring a lot of other issues, including performance ones.
Deactivated User
04/29/2021, 5:42 PM
Then there are cases where you are not adding EnterFrame/LeaveFrame?
For example, if the function is not allocating, and for example it is only accessing `this` and val fields/properties in `this`, `this` should already be on a stack frame from previous function calls, since someone called it, and thus nobody should be able to deallocate any objects referenced from any children or descendants, and being val they can't be mutated later.
All the critical paths I use don't allocate and only use `this` and properties referenced in `this`, and if the `companion object { const val }` is optimized and doesn't require including the singleton, I believe the combination of both improvements would make it much faster in most of my code.
This code:
class FSprites {
    val data = FBuffer(maxSize * FSprites.STRIDE * 4)
    private val f32 = data.f32
    // These property accessors shouldn't require EnterFrame/LeaveFrame: `this` is already referenced from the caller,
    // and `f32` is a val already referenced by `this`, so nobody can change it and it won't be deallocated.
    var FSprite.x: Float get() = f32[offset + 0] ; set(value) { f32[offset + 0] = value }
}
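For readers without the korge sources at hand, here is a self-contained sketch of the same accessor pattern. Everything in it is a stand-in chosen for illustration: `Sprites`, `SpriteRef`, `ref` and the `STRIDE` value of 8 are hypothetical, a `FloatArray` replaces the `FBuffer`/`f32` view, and the constant is declared at file level rather than in a companion object so that reading it does not involve a companion singleton. The `@JvmInline` annotation is only there so the value class also compiles on the JVM.

```kotlin
// File-level constant: reading it does not go through a companion object.
// The value 8 is made up for this sketch.
const val STRIDE = 8

// Hypothetical handle: just an offset into the shared buffer.
@JvmInline
value class SpriteRef(val offset: Int)

class Sprites(maxSize: Int) {
    // FloatArray stands in for the FBuffer/f32 view used in the original code.
    private val f32 = FloatArray(maxSize * STRIDE)

    fun ref(index: Int) = SpriteRef(index * STRIDE)

    // The accessors only touch `this` and a `val` field; they allocate nothing.
    var SpriteRef.x: Float
        get() = f32[offset + 0]
        set(value) { f32[offset + 0] = value }

    var SpriteRef.y: Float
        get() = f32[offset + 1]
        set(value) { f32[offset + 1] = value }
}

fun main() {
    val sprites = Sprites(maxSize = 4)
    with(sprites) {
        val second = ref(1)
        second.x = 10f
        second.y = 20f
        println(second.x + second.y) // 30.0
    }
}
```

A snippet of roughly this size, with no external dependencies, is also the kind of self-contained reproducer that can be attached to an issue.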
svyatoslav.scherbina
04/30/2021, 6:30 AM
> Then there are cases where you are not adding EnterFrame/LeaveFrame?
Yes. For example, code that doesn’t do anything with object references shouldn’t have these calls.
> For example, if the function is not allocating, and for example it is only accessing `this` and val fields/properties in `this`, this should be already on a stackframe on previous function calls since someone called it and thus nobody should be able to deallocate any referenced objects from any children or descendant, and being val they won’t be able to be mutated later.
This should already work similarly in 1.5.0, at least for release binaries. Please recheck on 1.5.0.
> This code:
Having a self-contained reproducer would help.
Deactivated User
04/30/2021, 7:01 PM
[...] though accessing a `const val` from a companion initializes the companion, and thus includes the EnterFrame/LeaveFrame. I have tried with 1.5.20-dev-5655 with the same results. I'm going to create an issue about that. For now I'm going to manually inline all `const val` in critical places in my code, and will use them again when that issue is resolved 👍. The code is still 2x slower than with manual inlining, but at least not 70x slower, as was the case on Windows with 1.4.32. Nice job!
Deactivated User
04/30/2021, 7:04 PM
Deactivated User
05/01/2021, 11:46 AM
`currentFrame` on Windows by 4x. I have created a PR here: https://github.com/JetBrains/kotlin/pull/4339
svyatoslav.scherbina
05/06/2021, 7:35 AM
svyatoslav.scherbina
05/06/2021, 3:22 PM
> though accessing a `const val` from a companion, initializes the companion, and thus includes the EnterFrame/LeaveFrame.
Btw, moving `const val` from `companion object` to file level should help. This might make manual inlining unnecessary.
Deactivated User
05/13/2021, 10:47 AM
svyatoslav.scherbina
05/13/2021, 12:33 PM
`-femulated-tls` is removed?
Also, please note that you can remove the flag without patching K/N, by using `-Xoverride-konan-properties`. So a workaround is already available.
Deactivated User
05/13/2021, 2:10 PM
`TlsGetValue` is available on UWP here: https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-tlsgetvalue I'll check whether the generated assembly without the emulated-tls flag uses registers directly or something.
Deactivated User
05/13/2021, 7:30 PM