# coroutines
m
Is all that parking overhead from coroutines? Are there approaches to get that down? It looks like 50% of the application’s CPU time is spent on locking.
👀 2
Actual CPU-bound work is rarely > 200ms in my coroutines. Lots of IO though.
Is there any way to further debug that? So that I find unnecessary dispatcher switches for example.
s
IO-bound coroutines should be run on the Dispatchers.IO dispatcher, which is specifically intended for blocking (IO) code, where threads will be doing a lot of waiting/parking.
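A minimal sketch of what that looks like (readConfig and the path parameter are made-up names, not from this thread):

import kotlinx.coroutines.*
import java.io.File

// Wrap the blocking call in Dispatchers.IO so the waiting/parking happens on the
// IO pool instead of holding up a Default/Main thread.
suspend fun readConfig(path: String): String =
    withContext(Dispatchers.IO) {
        File(path).readText() // blocking file IO
    }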
m
Yeah that’s what I’m doing. But there may be unintentional switches between Default and IO or Main and IO etc. Finding those would be helpful, should there be some. Mistakes happen 🙂
s
What do you mean by unintentional switches? Is it important which thread (on JVM/Native) your code runs on?
m
Dispatcher switches, not Thread switches. Dispatcher switches aren’t that cheap for coroutines afaik, at least not at high volume.
s
Ah, I see, performance-wise. I know a coroutine’s resumption, after a suspension, has an affinity to the same dispatcher and even thread. But I can’t comment on the exact performance impact of this… I don’t know enough about it. 🙂
m
Yeah me too. I’m trying to find info on that.
I’ve also just noticed that the results look very different with Java Flight Recorder than with CPU Profiling. I need to run more tests.
Okay I ran it another hour with the normal CPU profiler and the result looks pretty much the same: Lots of parking.
d
Last time I looked through the scheduler: CPU -> IO is free because you stay on the same thread, it just gives up its CPU lease. IO -> CPU may incur a park and thread switch if it can’t acquire a CPU lease in time.
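A small sketch to observe this (it just prints thread names; exact behaviour depends on load and coroutines version, so treat it as illustrative):

import kotlinx.coroutines.*

// Default and IO share the same worker pool, so a Default -> IO switch often
// stays on the same thread, while the way back may have to wait for a CPU slot.
fun main() = runBlocking {
    withContext(Dispatchers.Default) {
        println("Default: ${Thread.currentThread().name}")
        withContext(Dispatchers.IO) {
            println("IO:      ${Thread.currentThread().name}") // usually the same worker
        }
        println("Default: ${Thread.currentThread().name}")
    }
}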
m
I see. So basically I need to check for too many flowOn, launch(In) and withContext() calls that delegate to the Default dispatcher where it’s not strictly necessary. I’ll check the coroutines code. Maybe I can put a breakpoint somewhere and get an idea of how often it happens and where.
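A hedged sketch of the kind of call sites meant here (events, parse and store are made-up names):

import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// Each flowOn / withContext that targets another dispatcher is a potential
// dispatcher switch per element, which adds up at high volume.
fun process(events: Flow<String>, scope: CoroutineScope) {
    events
        .map { parse(it) }
        .flowOn(Dispatchers.Default)   // upstream runs on Default, emissions cross over to the collector's context
        .onEach { withContext(Dispatchers.IO) { store(it) } } // extra switch per element
        .launchIn(scope)               // collector runs in scope's dispatcher
}

private fun parse(raw: String): Int = raw.length  // placeholder CPU work
private fun store(value: Int) { /* placeholder blocking write */ }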
d
You know, I see the same unpark overhead here - CPU- or IO-bound worker, it doesn’t matter. I’ve been tinkering with a Reactor Flux as a Flow, quite a bit of CPU overhead measured both with sampling and async profiler
I’m using 1.4.0-M1 right now, because I wanted to try out sharing:
Think I’ll go all Flux for now
m
What do you use for profiling?
g
unintentional switches between Default and IO
Default and IO share threads, so there is no switch between them
d
Marc - that’s YourKit
🙏 1
b
so there is no switch between them
I think you can rely on it only in one direction, when switching from Default to IO. When you're switching back from IO to Default, I think it's very likely that there will be a thread switch.
g
why do you think so?
it’s easy to check, and no, it works in both directions: https://pl.kotl.in/8IE7FyoQf
b
let's say you have 64 coroutines running on IO. What will happen if all of them switch to the Default dispatcher and make a blocking call?
l
@bezrukov Your code has race conditions. You need to specify which coroutine is printing its thread.
b
is it? the only async piece of code is
repeat(8) {
    launch(Dispatchers.Default) {
        Thread.sleep(50) // blocking call
    }
}
that doesn't print anything. All other print statements are "sequential" in terms of coroutines
l
Ah yes, you're right, looked too fast.
b
it's absolutely the same as Andrey's example, just with the quoted code mentioned above. It just makes the Default dispatcher "busy"
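A self-contained reconstruction of the experiment being discussed (my own sketch, not the actual playground snippet): print the thread before and after an IO -> Default -> IO round trip, once with an idle Default dispatcher and once while Default is kept busy by blocking calls.

import kotlinx.coroutines.*

fun main() = runBlocking {
    // Round trip with an idle Default dispatcher.
    withContext(Dispatchers.IO) {
        println("IO before: ${Thread.currentThread().name}")
        withContext(Dispatchers.Default) {
            println("Default:   ${Thread.currentThread().name}")
        }
        println("IO after:  ${Thread.currentThread().name}")
    }

    // Keep Default busy with blocking work, as in the quoted snippet.
    repeat(8) {
        launch(Dispatchers.Default) {
            Thread.sleep(50) // blocking call occupying a Default worker
        }
    }

    // Round trip while Default is saturated; the switch may land on a different worker.
    withContext(Dispatchers.IO) {
        println("IO before: ${Thread.currentThread().name}")
        withContext(Dispatchers.Default) {
            println("Default:   ${Thread.currentThread().name}")
        }
        println("IO after:  ${Thread.currentThread().name}")
    }
}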
l
Correct me if I am wrong: coroutines on the JVM are just built on top of executors and Runnables. The compiler takes a suspend method and splits it at suspension points, with a Runnable for each split, and puts them in queues that are used like an event loop; they may be context-switched when pulled off the queue. I'd appreciate someone quantifying the overhead.
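A small, hedged illustration of that point: on the JVM a dispatcher is essentially an executor, and each resumption is handed to it as a task (a sketch of the idea, not of the actual scheduler internals):

import kotlinx.coroutines.*
import java.util.concurrent.Executors

fun main() = runBlocking {
    // Any Executor can back a dispatcher; resumptions are submitted to it.
    val executor = Executors.newSingleThreadExecutor { r -> Thread(r, "my-executor-thread") }
    val dispatcher = executor.asCoroutineDispatcher()

    launch(dispatcher) {
        println("before suspension: ${Thread.currentThread().name}")
        delay(10) // suspension point: the continuation is re-submitted to the executor
        println("after resumption:  ${Thread.currentThread().name}")
    }.join()

    dispatcher.close()
}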
@Marc Knaup With that said, if your application is IO-bound, wouldn't you expect the CPU to spend most of its time parked, i.e., waiting on IO? You said 50% of time is spent on locking; what else is it supposed to do when it mostly waits? That 50% isn't CPU usage of the system, it's the CPU usage of the app, i.e. of the few CPU-bound tasks it does, right?
m
I’m actually not sure whether my app was IO- or CPU-bound. Both usages were pretty high. I’ve done a lot of optimizations by now and am currently recording with Java Flight Recorder to check if that yields different results. Then again with CPU profiling. If it’s IO-bound I would expect a lot of parking. But not parking by coroutines, rather by the actual IO operation, e.g. in the HTTP client or the database client.
I have no idea what the coroutine parking is about. So I can merely speculate how it works under the hood ¯\_(ツ)_/¯
l
I have an app that does millions of HTTP requests (using Netty) and a little XML parsing. When I look at total server CPU usage it is 5-10%. When I do application profiling, most time (over 50%) is spent on coroutines like you show. My conclusion: everything is as expected.
m
Yeah it may also be nothing at all, just the run loop waiting for work.
I’ll have another profiling with JFR ready in a few minutes. Let’s see what happens 🙂 When I do regular CPU profiling again, if the parking is really just waiting for work in coroutine dispatchers, then the amount should be well above 50%, as the program is now more optimized.
JFR seems to ignore all parking. Or the sample rate is just too bad. It has a very low number of samples compared to regular CPU profiling 😞
443k exceptions created within half an hour. That’s 246 per second 😮 I guess that’s CancellationException? 😄
I somehow managed to fix it 😮 My latest profiling result has so few parkings compared to before that I barely see any of them now. And the CPU usage is down significantly.
Now I’m puzzled even more. I have much less CPU-bound work and the same IO-bound work as before. If parking means that dispatchers are waiting for work, there should be more parking now, not less.
Maybe it only happens after 2 hours. This one was 30 minutes. 🤷‍♂️
l
Maybe parking is done for longer periods of time, eating less CPU?
m
Yeah, maybe I had hit a big inefficiency before, where the dispatchers were almost always busy, and now the threads are rarely exhausted so they can rest in peace ¯\_(ツ)_/¯
Much better now, even after 2 hours. But I wonder where all those terminated workers are coming from (there are more).
l
Is your application open source? I would be interested in looking at the code. If those were threads I would suspect overuse of concurrency / creating too many threads. There is such a thing as using too many threads, where the overhead of switching threads outweighs the benefit of having so many and you should reduce the limit. With coroutines that is far less likely though.
m
I'm afraid it's not 😅 I also wonder if it's the amount of Flows I've created (50k+). But most of them will just idle in delay
l
I've had problems with using delay before. I have a project where I was using the STOMP protocol, which is persistent, and was sending a ping every 15 seconds on each connection. I had millions of connections using delay in a while(isActive) loop to send pings. Turns out that didn't work: after a certain number of connections there was too much overhead from creating so many coroutines, and the delays weren't accurate anymore. I had to change the design to scale better by keeping track of when the last ping was sent, based on a timestamp for each connection, basically not using delay.
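A rough sketch of the two approaches described above (Connection, sendPing and connections are made-up names, not the real project code):

import kotlinx.coroutines.*

class Connection { fun sendPing() { /* write a ping frame */ } }

// Approach 1: one coroutine per connection pinging via delay. With hundreds of
// thousands of connections this means as many timers/coroutines.
fun CoroutineScope.pingWithDelay(connection: Connection) = launch {
    while (isActive) {
        connection.sendPing()
        delay(15_000)
    }
}

// Approach 2: a single ticker coroutine that scans connections and pings only
// those whose last ping is older than the interval (timestamp-based).
fun CoroutineScope.pingWithTimestamps(connections: Collection<Connection>) = launch {
    val lastPing = HashMap<Connection, Long>()
    while (isActive) {
        val now = System.currentTimeMillis()
        for (connection in connections) {
            if (now - (lastPing[connection] ?: 0L) >= 15_000) {
                connection.sendPing()
                lastPing[connection] = now
            }
        }
        delay(1_000) // one timer for all connections instead of one per connection
    }
}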
m
Are you sure the delays weren’t accurate? Maybe too many coroutines were waking up at the same time, which caused all coroutine resumptions to experience significant backpressure.
l
I knew the delays weren't accurate because that was the client side. On the server side I had multiple servers (6, so less pressure on the server side because it is spread), and they detect if a ping didn't arrive on time and close the connection after 30 seconds, double the delay time. I would then ramp up the connections and start seeing disconnects after 400k connections. After profiling and doing a memory dump I noticed significant coroutine overhead, and I just changed the design to not use delay and then I didn't have issues.
I mean I was able to stabilize at 500k connections where the app worked as intended.
g
Maybe it’s worth reporting it and seeing what the kotlinx.coroutines team will answer; there might be some bottleneck in the dispatcher implementation.
d
Could also be sampling profilers misreporting incredibly numerous but very short calls - using a tracing profiler instead will give more accurate results
🤔 1