m

Marc Knaup

10/15/2020, 8:07 PM
Is all that parking overhead from coroutines? Are there approaches to get that down? It looks like 50% of the application’s CPU time is spent on locking.
👀 2
Actual CPU-bound work is rarely > 200ms in my coroutines. Lots of IO though.
Is there any way to further debug that? So that I find unnecessary dispatcher switches for example.
s

streetsofboston

10/15/2020, 8:27 PM
IO-bound coroutines should be run on the Dispatchers.IO dispatcher, which is specifically intended for blocking (IO) code, where threads will be doing a lot of waiting/parking.
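A minimal sketch of that pattern (fetchBlocking is a hypothetical stand-in for real blocking IO such as a JDBC or HTTP call):

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Hypothetical blocking call standing in for real IO work.
fun fetchBlocking(): String {
    Thread.sleep(100) // the thread waits here, which is acceptable on Dispatchers.IO
    return "result"
}

fun main() = runBlocking {
    // Hop to the IO pool for the blocking part instead of blocking Default/Main.
    val result = withContext(Dispatchers.IO) { fetchBlocking() }
    println(result)
}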
m

Marc Knaup

10/15/2020, 8:31 PM
Yeah that’s what I’m doing. But there may be unintentional switches between Default and IO or Main and IO etc. Finding those would be helpful, should there be some. Mistakes happen 🙂
s

streetsofboston

10/15/2020, 8:33 PM
What do you mean by unintentional switches? Is it important on which thread (if it is jvm/native) your code is run?
m

Marc Knaup

10/15/2020, 8:36 PM
Dispatcher switches, not Thread switches. Dispatcher switches aren’t that cheap for coroutines afaik, at least not at high volume.
s

streetsofboston

10/15/2020, 8:39 PM
Ah, I see, performance-wise. I know a coroutine’s resumption, after a suspension, has an affinity to the same dispatcher and even thread. But I can’t comment on the exact performance impact of this… I don’t know enough about it. 🙂
m

Marc Knaup

10/15/2020, 8:40 PM
Yeah me too. I’m trying to find info on that.
I’ve also just noticed that the results look very different with Java Flight Recorder than with CPU Profiling. I need to run more tests.
Okay I ran it another hour with the normal CPU profiler and the result looks pretty much the same: Lots of parking.
d

danny

10/16/2020, 12:03 AM
Last time I looked through the scheduler: CPU -> IO is free because you stay on the same thread, it just gives up its CPU lease. IO -> CPU may incur a park and a thread switch if it can’t acquire a CPU lease in time
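A quick way to see the CPU -> IO direction staying on the same thread (a minimal sketch to illustrate that behaviour; the exact thread reuse depends on the scheduler’s state):

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

fun main() = runBlocking(Dispatchers.Default) {
    println("Default: ${Thread.currentThread().name}")
    withContext(Dispatchers.IO) {
        // Typically prints the same DefaultDispatcher-worker-N name: the worker
        // keeps running and only trades its CPU permit for an IO permit.
        println("IO:      ${Thread.currentThread().name}")
    }
}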
m

Marc Knaup

10/16/2020, 12:35 AM
I see. So basically I need to check for too many flowOn, launch(In) and withContext() calls that delegate to the Default dispatcher where it’s not strictly necessary. I’ll check the coroutines code. Maybe I can put a breakpoint somewhere and get an idea how often it happens and where.
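For example (hypothetical code, not from the actual app), a flowOn that forces every element across a dispatcher boundary even though the upstream work is trivial would be a candidate to remove:

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    flowOf(1, 2, 3)
        .map { it + 1 }                // trivial upstream work, not CPU-heavy
        .flowOn(Dispatchers.Default)   // forces a Default hop for every element; likely unnecessary
        .collect { println(it) }
}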
d

danny

10/16/2020, 2:01 AM
You know, I see the same unpark overhead here - CPU-bound or IO-bound worker, it doesn’t matter. I’ve been tinkering with a Reactor Flux as a Flow; quite a bit of CPU overhead, measured both with a sampling profiler and async-profiler
I’m using 1.4.0-M1 right now, because I wanted to try out sharing:
Think I’ll go all Flux for now
m

Marc Knaup

10/16/2020, 2:33 AM
What do you use for profiling?
g

gildor

10/16/2020, 2:37 AM
unintentional switches between Default and IO
Default and IO share threads, so there is no switch between them
d

danny

10/16/2020, 3:52 AM
Marc - that’s YourKit
🙏 1
b

bezrukov

10/16/2020, 9:36 AM
so there is no switch between them
I think you can rely on it only in one direction, when switching from Default to IO. When you’re switching back from IO to Default, I think it’s very likely that there will be a thread switch.
g

gildor

10/16/2020, 10:00 AM
why do you think so?
it’s easy to check, and no, it works in both directions: https://pl.kotl.in/8IE7FyoQf
b

bezrukov

10/16/2020, 10:33 AM
let's say you have 64 running coroutines in IO - what will happen if all of them switch to the Default dispatcher and make a blocking call?
l

louiscad

10/16/2020, 10:51 AM
@bezrukov Your code has race conditions. You need to specify which coroutine is printing its thread.
b

bezrukov

10/16/2020, 10:55 AM
is it? the only async piece of code is
repeat(8) {
    launch(Dispatchers.Default) {
        Thread.sleep(50) // blocking call
    }
}
that doesn't print anything. All other print statements are "sequential" in terms of coroutines
l

louiscad

10/16/2020, 10:57 AM
Ah yes, you're right, looked too fast.
b

bezrukov

10/16/2020, 10:57 AM
it's absolutely the same as Andrey's example, just with the quoted code mentioned above. It just makes the Default dispatcher "busy"
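Roughly, the full scenario (a hedged reconstruction, since the whole snippet isn’t quoted here) looks like this:

import kotlinx.coroutines.*

fun main() = runBlocking {
    // Make the Default dispatcher busy with blocking work, as in the quoted snippet.
    repeat(8) {
        launch(Dispatchers.Default) {
            Thread.sleep(50) // blocking call
        }
    }
    launch(Dispatchers.IO) {
        println("IO:      ${Thread.currentThread().name}")
        withContext(Dispatchers.Default) {
            // With the Default CPU permits taken, this resumption is likely to land
            // on a different worker thread.
            println("Default: ${Thread.currentThread().name}")
        }
    }
}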
l

Luis Munoz

10/16/2020, 2:19 PM
Correct me if I am wrong: coroutines on the JVM are just built on top of executors and Runnables. A suspend method is split at its suspension points, with a Runnable for each split; these are put in queues that are used like an event loop, and a context switch may happen when one is pulled off the queue. I’d appreciate someone quantifying the overhead.
@Marc Knaup With that said, if your application is IO bound, wouldn’t you expect the CPU to spend most of its time parked, i.e. waiting on IO? You said 50% of the time is spent on locking; what else is it supposed to do when it mostly waits? That 50% isn’t CPU usage of the system, it’s CPU usage within the app, from the few CPU-bound tasks it does, right?
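One small illustration of that relationship (a sketch; Dispatchers.Default and IO actually use their own internal scheduler rather than a java.util.concurrent pool, but the dispatching model is the same): any plain executor can be turned into a dispatcher, and resumed coroutines are posted to it as Runnables.

import kotlinx.coroutines.*
import java.util.concurrent.Executors

fun main() {
    // A plain JVM thread pool used as a coroutine dispatcher.
    val dispatcher = Executors.newFixedThreadPool(2).asCoroutineDispatcher()
    runBlocking {
        launch(dispatcher) {
            println("running on ${Thread.currentThread().name}")
        }
    }
    dispatcher.close()
}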
m

Marc Knaup

10/16/2020, 2:41 PM
I’m actually not sure whether my app was IO or CPU bound. Both usages were pretty high. I’ve done a lot of optimizations by now and am currently recording with Java Flight Recorder to check if that yields different results. Then again with CPU profiling. If it’s IO bound I would expect a lot of parking. But not parking by coroutines but by the actual IO operation, e.g. in the HTTP client or the database client.
I have no idea what the coroutine parking is about. So I can merely speculate how it works under the hood ¯\_(ツ)_/¯
l

Luis Munoz

10/16/2020, 2:58 PM
I have an app that does millions of HTTP requests (using Netty) and a little XML parsing. When I look at total server CPU usage, it is 5-10%. When I do application profiling, most of the time (over 50%) is spent on coroutines, like you show. My conclusion: everything is as expected.
m

Marc Knaup

10/16/2020, 2:59 PM
Yeah it may also be nothing at all, just the run loop waiting for work.
I’ll have another profiling with JFR ready in a few minutes. Let’s see what happens 🙂 If, after I do regular CPU profiling again, the parking really is just waiting for work in the coroutine dispatchers, then the amount should be well above 50%, as the program is now more optimized.
JFR seems to ignore all parking. Or the sample rate is just too bad. It has a very low number of samples compared to regular CPU profiling 😞
443k exceptions created within half an hour. That’s 246 per second 😮 I guess that’s CancellationException? 😄
I somehow managed to fix it 😮 My latest profiling result has so few parkings compared to before that I barely see any of them now. And the CPU usage is down significantly.
Now I’m puzzled even more. I have much less CPU-bound work and the same IO-bound work as before. If parking means that dispatchers are waiting for work, there should be more parking now, not less.
Maybe it only happens after 2 hours. This one was 30 minutes. 🤷‍♂️
l

louiscad

10/16/2020, 3:51 PM
Maybe parking is done for longer periods of time, eating less CPU?
m

Marc Knaup

10/16/2020, 3:54 PM
Yeah, maybe I had hit a big inefficiency before, where the dispatchers were almost always busy, and now the threads are rarely exhausted so they can rest in peace ¯\_(ツ)_/¯
Much better now, even after 2 hours. But I wonder where all those terminated workers are coming from (there are more).
l

Luis Munoz

10/17/2020, 8:06 AM
Is your application open source? I would be interested in looking at the code. If those were threads, I would suspect overuse of concurrency / creating too many threads. There is such a thing as using too many threads, where the overhead of switching threads outweighs the benefit of having so many, and you should reduce the limit. With coroutines that is way more unlikely, though.
m

Marc Knaup

10/17/2020, 11:43 AM
I'm afraid it's not 😅 I also wonder if it's the amount of Flows I've created (50k+). But most of them will just idle in delay
l

Luis Munoz

10/17/2020, 4:15 PM
I’ve had problems with using delay before. I have a project where I was using the STOMP protocol, which is persistent, and was sending a ping every 15 seconds on each connection. I had millions of connections using delay in a while(isActive) loop to send pings. Turns out that didn’t work: after a certain number of connections there was too much overhead from creating so many coroutines, and the delays weren’t accurate anymore. I had to change the design to scale better by keeping track of when the last ping was sent, based on a timestamp per connection - basically not using delay.
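Very roughly, the two shapes (a hedged sketch of the before/after, not the actual code):

import kotlinx.coroutines.*

// Before: one delay loop per connection. At hundreds of thousands of
// connections the per-coroutine overhead added up and the pings drifted.
fun CoroutineScope.pingLoop(sendPing: suspend () -> Unit) = launch {
    while (isActive) {
        sendPing()
        delay(15_000)
    }
}

// After: a single sweeper coroutine checks per-connection timestamps instead.
class Connection {
    var lastPingAt = 0L
}

fun CoroutineScope.pingSweeper(
    connections: List<Connection>,
    sendPing: suspend (Connection) -> Unit,
) = launch {
    while (isActive) {
        val now = System.currentTimeMillis()
        for (connection in connections) {
            if (now - connection.lastPingAt >= 15_000) {
                sendPing(connection)
                connection.lastPingAt = now
            }
        }
        delay(1_000) // one timer for all connections instead of one per connection
    }
}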
m

Marc Knaup

10/17/2020, 4:20 PM
Are you sure the delays weren’t accurate? Maybe too many coroutines were waking up at the same time, which caused all coroutine resumptions to have significant backpressure.
l

Luis Munoz

10/17/2020, 4:39 PM
I knew the delays weren't accurate because that was the client side. On the server side I had multiple servers (6, so less pressure on the server side because it is spread out), and they detect if a ping didn't arrive on time and close the connection after 30 seconds - double the delay time. I would then ramp up the connections and start seeing disconnects after 400k connections. After profiling and doing a memory dump I noticed significant coroutine overhead, and I just changed the design to not use delay, and then I didn't have issues.
I mean I was able to stabilize at 500k connections where the app worked as intended.
g

gildor

10/18/2020, 1:37 PM
Maybe worth reporting it and seeing what the kotlinx.coroutines team will answer - there might be some bottleneck in the dispatcher implementation
d

danny

10/18/2020, 11:55 PM
Could also be sampling profilers misreporting incredibly numerous but very short calls - using a tracing profiler instead will give more accurate results
🤔 1