# coroutines
a
has anyone on the coroutine team (or in the community!) ever done any benchmarking of creating large numbers of coroutines (e.g., ~1M/s)? I'm working on a GraphQL server that uses coroutines, and I'm finding that if you design the execution such that you spawn a coroutine (with `async`) for every field in a GraphQL query (which could have >5k fields for very large queries), the overhead -- specifically, seemingly, in the handling of CoroutineContext (and even worse, ThreadLocalElement's...) -- is significant (I see things like `CombinedContext.fold` taking up 20% of total CPU). I'm wondering if folks have done any work on perf in this area, whether this is a known issue, etc.? What's the "recommended" upper bound on how many coroutines to create? It's totally reasonable to say "well, nothing's free!" -- it makes sense why you can't just create infinite coroutines and expect zero cost. But I haven't seen much published on this, so I'm wondering if people have had any similar experiences they'd like to share.
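To make the workload concrete, here's a minimal sketch of the pattern being described (`Field` and `resolve` are placeholders, not the actual server code):

```kotlin
import kotlinx.coroutines.*

// Placeholder types standing in for the real GraphQL machinery.
class Field(val name: String)
suspend fun resolve(field: Field): Any? = field.name // stub resolver

// One `async` per field: every call creates a child Job and folds the
// parent CoroutineContext, which is where CombinedContext.fold shows up.
suspend fun executeQuery(fields: List<Field>): List<Any?> = coroutineScope {
    fields.map { field -> async { resolve(field) } }.awaitAll()
}
```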
d
👋 unsure which GraphQL lib you are using, but AFAIK older versions of `graphql-java` were wrapping everything in completable futures. They fixed that in v22. Might be related.
k
Reading the release notes, it looks like v22 of graphql-java aimed at reducing memory pressure and they were less concerned with CPU usage. The OP is about CPU usage.
a
Also, I'm writing my own impl.
(i am technically using some graphql-java stuff, but the idea is to have a Kotlin-coroutine-first execution strategy + data fetchers)
d
I think we saw some cpu improvements because of that as well (in gql kotlin)
@Samuel Vazquez I think you were doing some benchmarks with this?
k
FWIW this is the only area of documentation I know of that alludes to coroutines being cheap. https://kotlinlang.org/docs/coroutines-basics.html#coroutines-are-light-weight
I don't know of any other resources that say specifically how cheap they are
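For reference, the example on that page is roughly the following; note it demonstrates that coroutines are memory-cheap, not what each one costs in CPU:

```kotlin
import kotlinx.coroutines.*

// Approximately the snippet from the "Coroutines are light-weight" section:
// tens of thousands of concurrent coroutines, each sleeping and printing a dot.
fun main() = runBlocking {
    repeat(50_000) {
        launch {
            delay(5000L)
            print(".")
        }
    }
}
```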
Speaking generally, 1 million coroutines per second seems like a lot.
a
i agree, but it's easy to get there -- 5000 fields in a query × 200 qps = 1M coroutines/s, with a coroutine for each field
k
FWIW, it looks like these are the benchmarks they have https://github.com/Kotlin/kotlinx.coroutines/tree/master/benchmarks/src/jmh
It's `launch` and not `async`, but I imagine a lot of the machinery for them is similar.
Also, for what it's worth, `CoroutineContext.fold` is defined in the stdlib and not in kotlinx.
d
> coroutines being cheap
well they are "cheap" compared to threads (also lighter than virtual threads), but there is always some cost
k
Right, but the original post is asking how cheap are they?
d
> Also, I'm writing my own impl.
side question 🙂 somewhat curious about that as well ⬆️ but that probably is a better conversation for #CKD2KL89G channel 🙂
k
My hunch is that trying to understand how many coroutines is too many will probably be very use-case specific. You're seeing a lot of `CoroutineContext.fold`, and perhaps that's because your coroutines are switching contexts a lot? Is that something you can try to limit?
d
also do you really need a coroutine for each field? i'd imagine a lot of them might just be reading from a property, so no need to go async there
a
> Is that something you can try to limit?
It's really any coroutine creation that causes the context folding. Even if you use fancy "undispatched" coroutines, which don't switch threads, it still does the fold operation.
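A quick way to see this (a rough sketch, not a rigorous benchmark -- use JMH for real numbers): even with an undispatched start, each `async` still creates a child Job and folds the parent context:

```kotlin
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

fun main() = runBlocking {
    val ms = measureTimeMillis {
        repeat(1_000_000) {
            // No thread switch happens here, but the child Job creation
            // and the context fold still do.
            async(start = CoroutineStart.UNDISPATCHED) { }.await()
        }
    }
    println("1M undispatched coroutines in ${ms}ms")
}
```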
@Dariusz Kuc no, I don't need a coroutine for each field. To be clear, I think I actually have a workaround and a better design given the constraints
👍 2
k
Yeah that makes sense, each async is creating a new child job that needs to be put into the context. So folding contexts is probably necessary.
a
(that design being -- use Deferred's (not unlike GraphQL-Java!) -- to ensure coroutine creation only happens when the system actually needs to wait)
> Yeah that makes sense, each async is creating a new child job that needs to be put into the context. So folding contexts is probably necessary.
exactly.
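A minimal sketch of that design, with a hypothetical `Field` API (this is not the actual implementation): synchronous fields return an already-completed `Deferred` without spawning a coroutine, and only genuinely asynchronous fields pay for `async`:

```kotlin
import kotlinx.coroutines.*

// Hypothetical field API: `valueOrNull` is non-null when the field can be
// resolved synchronously (e.g., a plain property read).
class Field(val valueOrNull: Any?) {
    suspend fun resolveSuspending(): Any? = TODO("real asynchronous resolution")
}

fun CoroutineScope.resolveField(field: Field): Deferred<Any?> =
    when (val v = field.valueOrNull) {
        null -> async { field.resolveSuspending() } // must wait: spawn a coroutine
        else -> CompletableDeferred(v)              // already in memory: no coroutine at all
    }
```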
d
well that is a workaround but... still doesn't address the original question of how many coroutines are too many 🙂
a
right right. Like you can imagine how I wrote my first iteration of the exec strategy -- just loop over fields, do `async { }` for each one, have nice suspend functions for all the field handling
very pretty code.
😛
but ultimately, too naive for scale
k
I’d still be curious to hear one of the maintainers chime in here
a
I guess the guidance I'm hoping makes its way somewhere at some point is that you really shouldn't be doing coroutine creation in some kind of "inner loop"
this is something that I probably should have just "known", but the APIs make it easy to do this "by accident" and potentially footgun yourself
k
That’s the same philosophy that makes okio folks hesitant to support asyncio with coroutines
a
im not familiar -- what is that discussion about?
k
The tl;dr is that coroutines have too much overhead for tight inner-loop I/O operations like "read n bytes from this fd".
a
ah, yep.
j
You could try `CoroutineStart.UNDISPATCHED`, but be careful.
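For anyone following along, that looks like this (a sketch; `resolveField` is a placeholder): the body starts immediately on the caller's thread instead of being dispatched, though a child Job is still created:

```kotlin
import kotlinx.coroutines.*

suspend fun resolveField(): String = "value" // placeholder resolver

fun CoroutineScope.resolveEagerly(): Deferred<String> =
    // Runs the block on the current thread up to its first suspension point.
    async(start = CoroutineStart.UNDISPATCHED) { resolveField() }
```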
a
I did that @JP Sugarbroad -- didn't help
j
Aw.
But yeah, creating coroutines is not as cheap as one would like.
The `Job` machinery looked pretty heavyweight when I last checked.
k
I think it’s safe to say they’re cheap but not free — but again, I’ll be curious to see if any of the maintainers chime in with some insight.
a
for context, these are the types of profiles I'm looking at: https://s.skevy.dev/dLkBCDmc
j
Is there code I can glance at?
a
about 2/3 of my `scopedFuture` call (which is basically just a wrapper around async, but propagates coroutineContext through Java-land) is spent manipulating context
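The real `scopedFuture` isn't shown here, but the general shape of such a wrapper might look like this sketch, assuming the `future` builder from `kotlinx-coroutines-jdk8` as the bridge:

```kotlin
import java.util.concurrent.CompletableFuture
import kotlin.coroutines.coroutineContext
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.future.future

// Hypothetical: run a suspend block as a CompletableFuture while carrying
// the caller's CoroutineContext across the Java boundary.
suspend fun <T> scopedFuture(block: suspend CoroutineScope.() -> T): CompletableFuture<T> =
    CoroutineScope(coroutineContext).future { block() }
```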
@JP Sugarbroad unfortunately no, not yet open sourced
j
what is `ThreadLocalCoroutineContextManager`?
looks like non-trivial cost in dealing with `CopyableThreadContextElement` stuff.
a
`ThreadLocalCoroutineContextManager` is something that lets me hold the coroutineContext inside of a thread local, so that I can reference the coroutineContext from a non-suspend function.
but fwiw, even when I remove that from the critical path, I still see huge amounts of time spent inside this code.
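For the curious: a minimal sketch of what such a holder could look like (hypothetical; the real `ThreadLocalCoroutineContextManager` isn't public), built on `ThreadContextElement`, which gets invoked on every dispatch:

```kotlin
import kotlin.coroutines.CoroutineContext
import kotlinx.coroutines.ThreadContextElement

// Hypothetical sketch: an element that stashes the coroutine's full context
// in a ThreadLocal while the coroutine runs, so non-suspend code can read it.
object CoroutineContextHolder {
    private val current = ThreadLocal<CoroutineContext?>()

    // Readable from plain (non-suspend) functions on the same thread.
    val currentContext: CoroutineContext? get() = current.get()

    class Element : ThreadContextElement<CoroutineContext?> {
        companion object Key : CoroutineContext.Key<Element>
        override val key: CoroutineContext.Key<Element> get() = Key

        // Called when a coroutine with this element starts running on a thread.
        override fun updateThreadContext(context: CoroutineContext): CoroutineContext? {
            val old = current.get()
            current.set(context) // `context` is the running coroutine's context
            return old
        }

        // Called when the coroutine suspends or completes on this thread.
        override fun restoreThreadContext(context: CoroutineContext, oldState: CoroutineContext?) {
            current.set(oldState)
        }
    }
}
```

Usage would be `launch(CoroutineContextHolder.Element()) { ... }`, after which non-suspend code on the same thread can read `CoroutineContextHolder.currentContext`.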
j
Yeah, might be interesting to see what's actually in your context. There's like 10 things in there.
a
yah it's a lot. Which is a problem in and of itself. But I actually removed all of those things for the bulk of the coroutines -- I put them in a "snapshot" context element, so I only had one element (+ job/dispatcher).
and note that the snapshot element is not a threadlocal element
just a normal one.
but yah, still seeing tons of overhead even with a reduced context size.
k
It might be worthwhile sharing how much CPU usage reduction you see when you reduce the context’s size
j
My next suspect would be dispatcher interceptions, which means playing in `Dispatchers.Unconfined`
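That would look something like the sketch below (whether it helps depends on how the coroutines suspend and resume; `resolveField` is a placeholder):

```kotlin
import kotlinx.coroutines.*

suspend fun resolveField(): String = "value" // placeholder resolver

// Unconfined skips dispatcher interception: the coroutine starts on the
// caller's thread and resumes on whatever thread the suspending call used.
fun CoroutineScope.resolveUnconfined(): Deferred<String> =
    async(Dispatchers.Unconfined) { resolveField() }
```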
a
i went from about 20% CPU to about 11% CPU
i could probably get it further.
j
But before that, I would find out what the `CopyableThreadContextElement` is. Those things are expensive.
k
I wonder if it scales non-linearly with the number of context elements?
e.g. 5 elements are fine, 10 are okay-ish, 20 and you're crawling
j
Don't think so.
`threadContextElements` could perhaps stand to be optimized, but it looks like the intention is that `DispatchedContinuation`s don't get created super-often.
I'd love to see an updated flame graph with the reduced context.
s
given that graphql-java works with futures, there is not much you can do. In the latest version of graphql-java they stopped wrapping synchronously resolved data (already in memory) into completable futures, alleviating the memory and CPU bottleneck that the GC was causing when cleaning up all those CFs (before that i wasn't even able to use ZGC, for example). i would imagine you must have a very good reason to write your own execution strategy implementation. In my personal experience, the less context switching you do, the better. you can always interop with the default execution strategies that graphql-java provides
k
What was the scenario that produced the above flame graph?
1M coroutines per second?
a
or thereabouts. I'm estimating, might not be exactly that.
👍 1
but that order of magnitude
k
I have a kind of wacky idea
I wonder if, rather than spawning coroutines on demand that have a singular task, you attempt to have longer lived "field resolver" coroutines
And then, when a request is received, those coroutines start doing work: fields that need to be resolved are sent through a channel for processing, and each resolved field is sent back over a channel
The number of coroutines could grow and shrink with demand to some extent
But the crux of the idea is to amortize the cost of creating a new coroutine across many different requests
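A sketch of that idea (placeholder types; the worker count and channel capacity are arbitrary choices):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

class Field(val name: String)
class ResolvedField(val name: String, val value: Any?)
suspend fun resolve(field: Field) = ResolvedField(field.name, null) // stub resolver

// Long-lived worker coroutines amortize coroutine creation across requests.
class FieldResolverPool(scope: CoroutineScope, workers: Int = 64) {
    private val requests =
        Channel<Pair<Field, CompletableDeferred<ResolvedField>>>(Channel.UNLIMITED)

    init {
        repeat(workers) {
            scope.launch {
                for ((field, reply) in requests) reply.complete(resolve(field))
            }
        }
    }

    suspend fun submit(field: Field): ResolvedField {
        val reply = CompletableDeferred<ResolvedField>()
        requests.send(field to reply)
        return reply.await()
    }
}
```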
a
good idea! thought about this too
one tricky part here is that I've essentially emulated JavaScript's process.nextTick with a custom dispatcher
in order to batch dataloader calls
so if you have a channel producer/consumer situation -- where you have a fixed number of consumers "doing work" -- I'd essentially be bounding my batch size
which...might be OK?
but its tricky.
but it's a good idea.
s
have you looked into the graphql-kotlin custom dataloader dispatching mechanism? https://opensource.expediagroup.com/graphql-kotlin/docs/server/data-loader/data-loader-instrumentation/ it works with coroutines, and it dispatches only when absolutely necessary. Very similar to the JS event loop with microtasks.
a
and I'm wondering if I might even use the channel approach on top of my Deferred approach that I mentioned above, such that we can still actually parallelize work across multiple threads that would otherwise be synchronous.
yes @Samuel Vazquez -- my solution was written before this came out and does it at the coroutine dispatcher level
but it might be reasonable to do it inside of GraphQL execution instead
s
In my personal experience with graphql-java, the fewer indirections you write on top of the engine, the better; sufficient overhead already exists with graphql-java instrumentations, which pretty much allow everyone to hook into the engine (especially the datadog agent)
You mentioned the usage of `async` for every field in a GraphQL query -- does it have to be that way, though? Do all resolvers return a deferred object? In a lot of cases resolvers are just mappers for domain data that is already in memory.
e
Flow has functions such as `.flatMapMerge()` that process an unbounded number of flows but run at most DEFAULT_CONCURRENCY of them concurrently
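For example, a sketch with placeholder types:

```kotlin
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.flow.*

class Field(val name: String)
class ResolvedField(val name: String, val value: Any?)
suspend fun resolve(field: Field) = ResolvedField(field.name, null) // stub resolver

// Bounded concurrency: at most `concurrency` resolutions in flight at once,
// rather than one coroutine per field up front.
@OptIn(ExperimentalCoroutinesApi::class)
fun resolveAll(fields: List<Field>): Flow<ResolvedField> =
    fields.asFlow()
        .flatMapMerge(concurrency = 16) { field ->
            flow { emit(resolve(field)) }
        }
```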