# coroutines
a
has anyone on the coroutine team (or in the community!) ever done any benchmarking of creating large numbers of coroutines (e.g., ~1M/s)? I'm working on a GraphQL server that uses coroutines, and I'm finding that if you design the execution such that you spawn a coroutine (with `async`) for every field in a GraphQL query (which could have >5k fields for very large queries), the overhead -- specifically, seemingly, in the handling of CoroutineContext (and even worse, ThreadLocalElement's...) -- is significant (I see things like `CombinedContext.fold` taking up 20% of total CPU). I'm wondering if folks have done any work on perf in this area, whether this is a known issue, etc.? What's the "recommended" upper bound on how many coroutines to create? It's totally reasonable to say "well, nothing's free!" -- it makes sense why you can't just create infinite coroutines and expect zero cost. But I haven't seen much published on this, so I'm wondering if people have had any similar experiences they'd like to share.
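To make the workload concrete, here's a minimal sketch of the pattern being described (`Field` and `resolve` are placeholders, not the actual server code):

```kotlin
import kotlinx.coroutines.*

// Placeholder types standing in for the real GraphQL machinery.
class Field(val name: String)
suspend fun resolve(field: Field): Any? = field.name // stub resolver

// One `async` per field: every call creates a child Job and folds the
// parent CoroutineContext, which is where CombinedContext.fold shows up.
suspend fun executeQuery(fields: List<Field>): List<Any?> = coroutineScope {
    fields.map { field -> async { resolve(field) } }.awaitAll()
}
```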
d
👋 unsure which GraphQL lib you are using, but AFAIK older versions of `graphql-java` were wrapping everything in completable futures. They fixed that in v22. Might be related.
k
Reading the release notes, it looks like v22 of graphql-java aimed at reducing memory pressure and they were less concerned with CPU usage. The OP is about CPU usage.
a
Also, I'm writing my own impl.
(i am technically using some graphql-java stuff, but the idea is to have a Kotlin-coroutine-first execution strategy + data fetchers)
d
I think we saw some cpu improvements because of that as well (in gql kotlin)
@Samuel Vazquez I think you were doing some benchmarks with this?
k
FWIW this is the only area of documentation I know of that alludes to coroutines being cheap. https://kotlinlang.org/docs/coroutines-basics.html#coroutines-are-light-weight
I don't know of any other resources that say specifically how cheap they are
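For reference, the example on that page is roughly the following; note it demonstrates that coroutines are memory-cheap, not what each one costs in CPU:

```kotlin
import kotlinx.coroutines.*

// Approximately the snippet from the "Coroutines are light-weight" section:
// tens of thousands of concurrent coroutines, each sleeping and printing a dot.
fun main() = runBlocking {
    repeat(50_000) {
        launch {
            delay(5000L)
            print(".")
        }
    }
}
```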
Speaking generally, 1 million coroutines per second seems like a lot.
a
i agree, but it's easy to get there -- 5000 fields in a query × 200 qps = 1M coroutines/s, with a coroutine for each field
k
FWIW, it looks like these are the benchmarks they have https://github.com/Kotlin/kotlinx.coroutines/tree/master/benchmarks/src/jmh
It's `launch` and not `async`, but I imagine a lot of the machinery for them is similar.
Also, for what it's worth, `CoroutineContext.fold` is defined in the stdlib and not in kotlinx.
d
> coroutines being cheap
well they are "cheap" compared to threads (also lighter than virtual threads), but there is always some cost
k
Right, but the original post is asking how cheap are they?
d
> Also, I'm writing my own impl.
side question 🙂 somewhat curious about that as well ⬆️ but that probably is a better conversation for #CKD2KL89G channel 🙂
k
My hunch is that trying to understand how many coroutines is too many will probably be very use-case specific. You're seeing a lot of `CoroutineContext.fold`, and perhaps that's because your coroutines are switching contexts a lot? Is that something you can try to limit?
d
also do you really need a coroutine for each field? i'd imagine a lot of them might just be reading from a property, so no need to go async there
a
> Is that something you can try to limit?
It's really any coroutine creation that causes the context folding. Even if you use fancy "undispatched" coroutines, which don't switch threads, it still does the fold operation.
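A quick way to see this (a rough sketch, not a rigorous benchmark -- use JMH for real numbers): even with an undispatched start, each `async` still creates a child Job and folds the parent context:

```kotlin
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

fun main() = runBlocking {
    val ms = measureTimeMillis {
        repeat(1_000_000) {
            // No thread switch happens here, but the child Job creation
            // and the context fold still do.
            async(start = CoroutineStart.UNDISPATCHED) { }.await()
        }
    }
    println("1M undispatched coroutines in ${ms}ms")
}
```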
@Dariusz Kuc no, I don't need a coroutine for each field. To be clear, I think I actually have a workaround and a better design given the constraints
👍 2
k
Yeah that makes sense, each async is creating a new child job that needs to be put into the context. So folding contexts is probably necessary.
a
(that design being -- use Deferred's (not unlike GraphQL-Java!) -- to ensure coroutine creation only happens when the system actually needs to wait)
> Yeah that makes sense, each async is creating a new child job that needs to be put into the context. So folding contexts is probably necessary.
exactly.
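A minimal sketch of that design, with a hypothetical `Field` API (this is not the actual implementation): synchronous fields return an already-completed `Deferred` without spawning a coroutine, and only genuinely asynchronous fields pay for `async`:

```kotlin
import kotlinx.coroutines.*

// Hypothetical field API: `valueOrNull` is non-null when the field can be
// resolved synchronously (e.g., a plain property read).
class Field(val valueOrNull: Any?) {
    suspend fun resolveSuspending(): Any? = TODO("real asynchronous resolution")
}

fun CoroutineScope.resolveField(field: Field): Deferred<Any?> =
    when (val v = field.valueOrNull) {
        null -> async { field.resolveSuspending() } // must wait: spawn a coroutine
        else -> CompletableDeferred(v)              // already in memory: no coroutine at all
    }
```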
d
well that is a workaround but... still doesn't address the original question of how many coroutines are too many 🙂
a
right right. Like you can imagine how I wrote my first iteration of the exec strategy -- just loop over fields, do `async { }` for each one, have nice suspend functions for all the field handling
very pretty code.
😛
but ultimately, too naive for scale
k
I’d still be curious to hear one of the maintainers chime in here
a
I guess the guidance I'm hoping makes its way somewhere at some point is that you really shouldn't be doing coroutine creation in some kind of "inner loop"
this is something that I probably should have just "known", but the APIs make it easy to do this "by accident" and potentially footgun yourself
k
That’s the same philosophy that makes okio folks hesitant to support asyncio with coroutines
a
im not familiar -- what is that discussion about?
k
The tl;dr is that coroutines have too much overhead for tight inner-loop I/O operations like "read n bytes from this fd".
a
ah, yep.
j
You could try `CoroutineStart.UNDISPATCHED`, but be careful.
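For anyone following along, that looks like this (a sketch; `resolveField` is a placeholder): the body starts immediately on the caller's thread instead of being dispatched, though a child Job is still created:

```kotlin
import kotlinx.coroutines.*

suspend fun resolveField(): String = "value" // placeholder resolver

fun CoroutineScope.resolveEagerly(): Deferred<String> =
    // Runs the block on the current thread up to its first suspension point.
    async(start = CoroutineStart.UNDISPATCHED) { resolveField() }
```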
a
I did that @JP Sugarbroad -- didn't help
j
Aw.
But yeah, creating coroutines is not as cheap as one would like.
The `Job` machinery looked pretty heavyweight when I last checked.
k
I think it’s safe to say they’re cheap but not free — but again, I’ll be curious to see if any of the maintainers chime in with some insight.
a
for context, these are the types of profiles I'm looking at: https://s.skevy.dev/dLkBCDmc
j
Is there code I can glance at?
a
about 2/3 of my `scopedFuture` call (which is basically just a wrapper around async, but propagates coroutineContext through Java-land) is spent manipulating context
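The real `scopedFuture` isn't shown here, but the general shape of such a wrapper might look like this sketch, assuming the `future` builder from `kotlinx-coroutines-jdk8` as the bridge:

```kotlin
import java.util.concurrent.CompletableFuture
import kotlin.coroutines.coroutineContext
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.future.future

// Hypothetical: run a suspend block as a CompletableFuture while carrying
// the caller's CoroutineContext across the Java boundary.
suspend fun <T> scopedFuture(block: suspend CoroutineScope.() -> T): CompletableFuture<T> =
    CoroutineScope(coroutineContext).future { block() }
```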
@JP Sugarbroad unfortunately no, not yet open sourced
j
what is `ThreadLocalCoroutineContextManager`?
looks like non-trivial cost in dealing with `CopyableThreadContextElement` stuff.
a
`ThreadLocalCoroutineContextManager` is something that lets me hold the coroutineContext inside of a thread local, so that I can reference the coroutineContext from a non-suspend function.
but fwiw, even when I remove that from the critical path, I still see huge amounts of time spent inside this code.
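For the curious: a minimal sketch of what such a holder could look like (hypothetical; the real `ThreadLocalCoroutineContextManager` isn't public), built on `ThreadContextElement`, which gets invoked on every dispatch:

```kotlin
import kotlin.coroutines.CoroutineContext
import kotlinx.coroutines.ThreadContextElement

// Hypothetical sketch: an element that stashes the coroutine's full context
// in a ThreadLocal while the coroutine runs, so non-suspend code can read it.
object CoroutineContextHolder {
    private val current = ThreadLocal<CoroutineContext?>()

    // Readable from plain (non-suspend) functions on the same thread.
    val currentContext: CoroutineContext? get() = current.get()

    class Element : ThreadContextElement<CoroutineContext?> {
        companion object Key : CoroutineContext.Key<Element>
        override val key: CoroutineContext.Key<Element> get() = Key

        // Called when a coroutine with this element starts running on a thread.
        override fun updateThreadContext(context: CoroutineContext): CoroutineContext? {
            val old = current.get()
            current.set(context) // `context` is the running coroutine's context
            return old
        }

        // Called when the coroutine suspends or completes on this thread.
        override fun restoreThreadContext(context: CoroutineContext, oldState: CoroutineContext?) {
            current.set(oldState)
        }
    }
}
```

Usage would be `launch(CoroutineContextHolder.Element()) { ... }`, after which non-suspend code on the same thread can read `CoroutineContextHolder.currentContext`.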
j
Yeah, might be interesting to see what's actually in your context. There's like 10 things in there.
a
yah it's a lot. Which is a problem in and of itself. But I actually removed all of those things for the bulk of the coroutines -- I put them in a "snapshot" context element, so I only had one element (+ job/dispatcher).
and note that the snapshot element is not a threadlocal element
just a normal one.
but yah, still seeing tons of overhead even with a reduced context size.
k
It might be worthwhile sharing how much CPU usage reduction you see when you reduce the context’s size
j
My next suspect would be dispatcher interceptions, which means playing in `Dispatchers.Unconfined`
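That would look something like the sketch below (whether it helps depends on how the coroutines suspend and resume; `resolveField` is a placeholder):

```kotlin
import kotlinx.coroutines.*

suspend fun resolveField(): String = "value" // placeholder resolver

// Unconfined skips dispatcher interception: the coroutine starts on the
// caller's thread and resumes on whatever thread the suspending call used.
fun CoroutineScope.resolveUnconfined(): Deferred<String> =
    async(Dispatchers.Unconfined) { resolveField() }
```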
a
i went from about 20% CPU to about 11% CPU
i could probably get it further.
j
But before that, I would find out what the `CopyableThreadContextElement` is. Those things are expensive.
k
I wonder if it scales non-linearly with the number of context elements?
e.g. 5 elements are fine, 10 are okay-ish, 20 and you're crawling
j
Don't think so.
`threadContextElements` could perhaps stand to be optimized, but it looks like the intention is that `DispatchedContinuation`s don't get created super-often.
I'd love to see an updated flame graph with the reduced context.
s
given that graphql-java works with futures, there is not much you can do. In the latest version of graphql-java they stopped wrapping synchronously resolved data (already in memory) into completable futures, alleviating the memory and CPU bottleneck that the GC was causing when cleaning up all those CFs (before that i wasn't even able to use ZGC, for example). i would imagine you must have a very good reason to write your own execution strategy implementation. In my personal experience, the less context switching you do, the better. you can always interop with the default execution strategies that graphql-java provides
k
What was the scenario that produced the above flame graph?
1M coroutines per second?
a
or thereabouts. I'm estimating, might not be exactly that.
👍 1
but that order of magnitude
k
I have a kind of wacky idea
I wonder if, rather than spawning coroutines on demand that have a singular task, you attempt to have longer lived "field resolver" coroutines
And then, when a request is received, those coroutines start doing work: fields that need to be resolved are sent through a channel for processing, and each resolved field is sent back over a channel
The number of coroutines could grow and shrink with demand to some extent
But the crux of the idea is to amortize the cost of creating a new coroutine across many different requests
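A sketch of that idea (placeholder types; the worker count and channel capacity are arbitrary choices):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

class Field(val name: String)
class ResolvedField(val name: String, val value: Any?)
suspend fun resolve(field: Field) = ResolvedField(field.name, null) // stub resolver

// Long-lived worker coroutines amortize coroutine creation across requests.
class FieldResolverPool(scope: CoroutineScope, workers: Int = 64) {
    private val requests =
        Channel<Pair<Field, CompletableDeferred<ResolvedField>>>(Channel.UNLIMITED)

    init {
        repeat(workers) {
            scope.launch {
                for ((field, reply) in requests) reply.complete(resolve(field))
            }
        }
    }

    suspend fun submit(field: Field): ResolvedField {
        val reply = CompletableDeferred<ResolvedField>()
        requests.send(field to reply)
        return reply.await()
    }
}
```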
a
good idea! thought about this too
one tricky part here is that I've essentially emulated JavaScript's process.nextTick with a custom dispatcher
in order to batch dataloader calls
so if you have a channel producer/consumer situation -- where you have a fixed number of consumers "doing work" -- I'd essentially be bounding my batch size
which...might be OK?
but its tricky.
but it's a good idea.
s
have you looked into the graphql-kotlin custom dataloader dispatching mechanism? https://opensource.expediagroup.com/graphql-kotlin/docs/server/data-loader/data-loader-instrumentation/ it works with coroutines, and it dispatches only when absolutely necessary. Very similar to the JS event loop with microtasks.
a
and I'm wondering if I might even use the channel approach on top of my Deferred approach that I mentioned above, such that we can still actually parallelize work across multiple threads that would otherwise be synchronous.
yes @Samuel Vazquez -- my solution was written before this came out and does it at the coroutine dispatcher level
but it might be reasonable to do it inside of GraphQL execution instead
s
In my personal experience with graphql-java, the fewer indirections you write on top of the engine, the better; sufficient overhead already exists with graphql-java instrumentations, which pretty much allow everyone to hook into the engine (especially the datadog agent)
You mentioned the usage of `async` for every field in a GraphQL query -- does it have to be that way, though? Do all resolvers return a deferred object? In a lot of cases resolvers are just mappers for domain data that is already in memory.
e
Flow has functions such as `.flatMapMerge()` that process an unbounded number of flows but run at most DEFAULT_CONCURRENCY of them concurrently
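For example, a sketch with placeholder types:

```kotlin
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.flow.*

class Field(val name: String)
class ResolvedField(val name: String, val value: Any?)
suspend fun resolve(field: Field) = ResolvedField(field.name, null) // stub resolver

// Bounded concurrency: at most `concurrency` resolutions in flight at once,
// rather than one coroutine per field up front.
@OptIn(ExperimentalCoroutinesApi::class)
fun resolveAll(fields: List<Field>): Flow<ResolvedField> =
    fields.asFlow()
        .flatMapMerge(concurrency = 16) { field ->
            flow { emit(resolve(field)) }
        }
```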