I've found a weird performance inconsistency with ...
# kotlin-native
k
I've found a weird performance inconsistency with my code on Linux and Windows... I'm working on a Raylib binding to K/N (game dev framework) and I've created a bunnymark test. What makes no sense to me is that on my main PC (32GB RAM, Ryzen 5 5600X and an RX5600XT) I can render max 6500 entities (on Windows) until FPS drops down to 30... On my Linux laptop that runs on 16GB RAM, i7 11th gen with Iris iGPU, I score 124200 entities until 30 FPS. Is there anything that could be causing this issue? Does K/N on Windows use a different memory allocator or something that could also be causing this?
k
have you isolated K/N from the equation? It’s possible the underlying APIs on Windows could perform differently than on Linux. Maybe try writing a small C repro that confirms this isn’t the case?
k
I've just run the exec on a live Linux USB and it managed to draw 130700 entities with the K/N binding. I've also run pure C code and it performs perfectly fine and renders correct results
So in the end there is something VERY weird going on here with K/N on Windows when it comes to performance
Also both K/N and C benchmarks run on the same lib, no difference there.
a
it sounds like you have a solid basis for making a ticket - such an example means performance could be tested and improved https://youtrack.jetbrains.com/newIssue
k
Interesting. I’ve not had to do this, but maybe try conjuring some flame graphs to see what functions are acting slowly? I wonder if it’s in cinterop land or something else
Could even do simpler timings of methods, println them, and compare between Windows and Linux
Also I think you’d have to go in with more information on what components might be slow (cinterop generated code? allocations? function calls? etc etc) to make a ticket.
k
yeah I will probably have to raise a ticket, albeit I'm still rather new to all of this and generating such info might be a tad difficult for me. Worth mentioning that as I write the binding (it hides the ugly parts of K/N, like running allocs etc everywhere, in a wrapper) there is no performance difference between the binding and writing it in a pure K/N style
k
I suggest you do something like this.
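import kotlin.time.DurationUnit
import kotlin.time.measureTimedValue

// Runs block, prints how long it took, and returns block's result.
fun <T> traceTime(functionName: String, block: () -> T): T {
  val timedResult = measureTimedValue(block)
  println("$functionName took ${timedResult.duration.toDouble(DurationUnit.MILLISECONDS)} ms.")
  return timedResult.value
}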

val functionResult = traceTime("someRaylibFunction") { someRaylibFunction() }
That would measure the amount of time it takes to perform block and print the result to stdout while allowing your program to do its thing. Peppering this around your codebase could allow for fairly easy comparisons of Windows and Linux
k
Thank you, I will do that and create some data with this.
There is only one function that I would really suspect of causing this, which is a function that draws a texture that is loaded into VRAM (GPU), which internally ofc calls OpenGL. Doing the measuring I don't see anything abnormal. On Linux it's around 5.36E-4 ms to max 6.02E-4 ms, while on Windows it's 5.0E-4 ms (dipping down to 4 not so often) and max 7/9. It also tends to occasionally drop down to 0.001028 ms on Linux and 0.001 ms on Windows, so pretty much the same. Small problem here is that I have no access to the internal OpenGL calls so I can't really investigate those
k
The internal OpenGL calls are all implemented in C and K/N wouldn’t impact their performance at all
If a single call to OpenGL is taking slightly different amounts of time on Windows and Linux, what happens when you make 10 million OpenGL calls on each platform?
I think the best way to test this would be to measure times and run your original program that renders 124,000 entities to the screen
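To put the per-call timings into per-frame terms, one rough approach is to multiply the measured time of a call by the number of entities drawn per frame and compare that against the frame budget at 30 FPS. The sketch below does that arithmetic with the figures quoted earlier in this thread; the function names `frameBudgetMs` and `perFrameCostMs` are made up for illustration, and it assumes one draw call per entity per frame.

```kotlin
// Time budget for one frame at a given frame rate, in milliseconds.
fun frameBudgetMs(fps: Int): Double = 1000.0 / fps

// Total time spent in one call site per frame, assuming it runs once per entity.
fun perFrameCostMs(perCallMs: Double, entities: Int): Double = perCallMs * entities

fun main() {
    val budget = frameBudgetMs(30)
    // Per-call timings and entity counts are the ones reported in this thread.
    val linux = perFrameCostMs(5.36e-4, 124_200)
    val windows = perFrameCostMs(5.0e-4, 6_500)
    println("Budget at 30 FPS: $budget ms")
    println("Linux draw calls per frame: $linux ms")
    println("Windows draw calls per frame: $windows ms")
}
```

With these inputs the Windows draw calls add up to roughly 3 ms of a roughly 33 ms frame, which would suggest the draw call itself is not where the time goes, if the measurements are representative. The Linux figure comes out at roughly 67 ms, above the budget, so the per-call numbers probably include some measurement overhead and should be read as rough indications only.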
k
That does make it slightly difficult as calling the function via traceTime causes the FPS to tank down almost instantly to 10s and 7s and there isn't much difference in time it took to call
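One way to keep the timing from tanking the frame rate might be to accumulate durations in memory and print a summary once, rather than calling println on every call. This is a minimal sketch, not a profiler: `TimingStats` and its methods are hypothetical names, and it assumes a Kotlin version where `TimeSource.Monotonic` is available without opt-in (1.9 or newer).

```kotlin
import kotlin.time.Duration
import kotlin.time.DurationUnit
import kotlin.time.TimeSource

// Hypothetical helper: accumulates timings per label instead of printing on every call.
class TimingStats {
    private val totals = mutableMapOf<String, Duration>()
    private val counts = mutableMapOf<String, Int>()

    // Runs block, adds its elapsed time to the running total for label, and returns block's result.
    fun <T> record(label: String, block: () -> T): T {
        val mark = TimeSource.Monotonic.markNow()
        val result = block()
        val elapsed = mark.elapsedNow()
        totals[label] = (totals[label] ?: Duration.ZERO) + elapsed
        counts[label] = (counts[label] ?: 0) + 1
        return result
    }

    fun callCount(label: String): Int = counts[label] ?: 0

    // Average time per recorded call in milliseconds, or 0.0 if nothing was recorded.
    fun averageMs(label: String): Double {
        val n = callCount(label)
        if (n == 0) return 0.0
        return totals.getValue(label).toDouble(DurationUnit.MILLISECONDS) / n
    }

    // Prints one summary line per label; call this once, e.g. when the window closes.
    fun report() {
        for ((label, total) in totals) {
            val totalMs = total.toDouble(DurationUnit.MILLISECONDS)
            println("$label: ${callCount(label)} calls, total $totalMs ms, avg ${averageMs(label)} ms")
        }
    }
}

fun main() {
    val stats = TimingStats()
    var sum = 0L
    // Stand-in workload; in the binding this would wrap the suspected draw call.
    repeat(100_000) { i -> stats.record("work") { sum += i } }
    stats.report()
}
```

The map lookups add a little overhead of their own, so the absolute numbers are still inflated, but the overhead should be similar on both platforms, which is what matters for a Windows versus Linux comparison.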
Okay, there is one function that does not match Linux at all, which is a function that returns the X and Y position values of the mouse cursor. On Linux we get 1.42E-4 ms, while on Windows min 3.0E-4 ms, max 6.0E-4 ms. Removing that function from the Windows version gave an additional 600 entities, but it does not affect the pure C code test nor the Linux K/N one
I have also found this https://kotlinlang.slack.com/archives/C3SGXARS6/p1619690840268400?thread_ts=1619349974.244300&cid=C3SGXARS6 which seems to talk about exactly the same problem: abysmal performance on Windows compared to Linux/macOS
The -femulated-tls flag was meant to be removed since 1.6 (https://youtrack.jetbrains.com/issue/KT-47605) but it's still present in konan.properties, so I wonder if this is causing all the performance issues on Windows?
k
Are you able to clarify @sergey.bogolepov?
k
It would also explain why performance is much better on Linux and macOS, as the -femulated-tls flag is not present there, as confirmed in the post I've linked earlier
k
Yeah, nice sleuthing
k
Much better than living in hell trying to comprehend why things just don't work in the code 🫢
s
Yep, -femulated-tls is most likely the cause. Unfortunately, it can’t be dropped because the toolchain we use is compiled with that flag. It can probably be solved by updating the toolchain, but Windows-specific performance problems are out of our focus at the moment.
BTW, "-femulated-tls flag was meant to be removed since 1.6":
No, it was not. The issue is purely about LLD, not that flag.
k
Ah my apologies then, misread the comment on the ticket in that case