# ktor
m
Hi! Are there any general tips for debugging slow-performing Ktor backends? We have a backend service with a mildish load (~1k rpm per instance) running on Ktor and it seems to get stuck at times for no obvious reason - e.g. a ~1ms request taking a minute or more. In some cases we don't even observe the issue at the handler level. We do put blocking tasks off to the IO dispatcher and we've increased its size substantially. I suspect there's a dispatcher that ends up being a bottleneck, either the IO one or one of the Netty ones. I'm not sure from what angle to approach debugging this, though.
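(For context, the knob usually meant by "increasing the IO dispatcher's size" is a system property; a minimal sketch assuming the kotlinx.coroutines defaults, with 512 as a placeholder value.)

```kotlin
// Dispatchers.IO is sized by the kotlinx.coroutines.io.parallelism system property
// (default is max(64, number of cores)). It must be set before the dispatcher is first
// touched, so passing it as a JVM flag is safest: -Dkotlinx.coroutines.io.parallelism=512
fun main() {
    System.setProperty("kotlinx.coroutines.io.parallelism", "512")
    // ...start the Ktor server only after this point
}
```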
b
It might make sense to put blocking tasks on a separate dispatcher with its own thread pool if the IO dispatcher is getting saturated, since Ktor itself relies on the IO dispatcher in some places. It's always good to capture a thread dump and see what the threads are doing; if a lot of them are in a waiting state when things slow down, it's possible some blocking operation is gumming up a dispatcher.
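(A minimal sketch of the "separate dispatcher" suggestion; the pool size and the helper name are placeholders, not anything Ktor provides.)

```kotlin
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.withContext
import java.util.concurrent.Executors

// A dedicated pool for blocking work, kept separate from Dispatchers.IO so that
// application code can't starve Ktor's own use of the IO dispatcher.
val blockingDispatcher = Executors.newFixedThreadPool(64).asCoroutineDispatcher()

// Hop onto the dedicated pool from a suspending handler for any blocking call (JDBC etc.).
suspend fun <T> runBlockingWork(block: () -> T): T =
    withContext(blockingDispatcher) { block() }
```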
a
@madisp, can you tell me which server engine you use?
m
I'm using the netty engine
@Bruce Hamilton good point - I was already thinking of offloading things like DB operations to a separate virtual-thread pool
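(On JDK 21+ the same idea works with virtual threads; a sketch, not something Ktor ships.)

```kotlin
import kotlinx.coroutines.asCoroutineDispatcher
import java.util.concurrent.Executors

// Runs each task on its own virtual thread (JDK 21+), so the dispatcher is not bounded
// by a fixed platform-thread pool size.
val virtualThreadDispatcher =
    Executors.newVirtualThreadPerTaskExecutor().asCoroutineDispatcher()
```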
it would be really awesome if there was a non-hackish way of attaching micrometer/prometheus metrics to dispatchers
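(One hedged option for dispatcher metrics: Micrometer can instrument the backing ExecutorService before it is turned into a dispatcher. The pool size and the "blocking-pool" name are placeholders.)

```kotlin
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.binder.jvm.ExecutorServiceMetrics
import kotlinx.coroutines.asCoroutineDispatcher
import java.util.concurrent.Executors

// Wrap the executor with ExecutorServiceMetrics (pool size, queue depth, task timings),
// then expose the wrapped executor as a coroutine dispatcher.
fun meteredBlockingDispatcher(registry: MeterRegistry) =
    ExecutorServiceMetrics.monitor(registry, Executors.newFixedThreadPool(64), "blocking-pool")
        .asCoroutineDispatcher()
```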
I started monitoring the various Netty event loop groups and it looks like it's the `call` group that gets a lot of pending tasks in spikes - what exactly does that event loop group do?
is this the handler / pipeline event loop group?
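(A sketch of how that pending-task monitoring can be wired up with Micrometer, assuming you already have a reference to the `EventLoopGroup` in question - getting at Ktor's internal call group may require engine internals or a custom bootstrap. The metric name is made up.)

```kotlin
import io.micrometer.core.instrument.Gauge
import io.micrometer.core.instrument.MeterRegistry
import io.netty.channel.EventLoopGroup
import io.netty.util.concurrent.SingleThreadEventExecutor

// Register one gauge per event loop thread, reporting Netty's pendingTasks() count,
// so a single choked loop shows up even when the group average looks fine.
fun monitorEventLoopGroup(registry: MeterRegistry, group: EventLoopGroup, groupName: String) {
    group.forEachIndexed { index, executor ->
        val loop = executor as? SingleThreadEventExecutor ?: return@forEachIndexed
        Gauge.builder("netty.eventloop.pending.tasks") { loop.pendingTasks().toDouble() }
            .tag("group", groupName)
            .tag("loop", index.toString())
            .register(registry)
    }
}
```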
b
yes, that's the one
you can try increasing the number of threads available by updating `callGroupSize`, though there shouldn't be any blocking calls in there... unless some plugin or interceptor is misbehaving
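(For reference, a Ktor 2.x-style sketch of where the group sizes live; the numbers are placeholders.)

```kotlin
import io.ktor.server.engine.*
import io.ktor.server.netty.*

fun main() {
    embeddedServer(Netty, port = 8080, configure = {
        connectionGroupSize = 2 // accepts connections
        workerGroupSize = 4     // handles socket reads/writes
        callGroupSize = 8       // runs the application call pipeline
    }) {
        // install plugins / routing here
    }.start(wait = true)
}
```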
m
ok, I see pending tasks climbing up to ~500 or so under load, while `ktor_http_server_requests_active` is 30
🤔 1
io pool is 512 threads, I'll see whether I can find anything on what could be blocking this
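(A minimal way to capture a dump from inside the process while the slowdown is happening; jstack or an APM would work just as well.)

```kotlin
import java.lang.management.ManagementFactory

// Dump all JVM threads with lock information; look for DefaultDispatcher-worker and
// eventLoopGroup threads that are BLOCKED or WAITING while requests are piling up.
fun dumpThreads(): String =
    ManagementFactory.getThreadMXBean()
        .dumpAllThreads(true, true)
        .joinToString("\n") { it.toString() }
```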
I don't think increasing `callGroupSize` would help either - the group has 4 threads (we have 4 vCPUs) and only one of the threads peaks for a few seconds
so it looks like something chokes on one of the call group event loops and then tasks start piling up on that thread (I'm assuming the incoming requests are distributed somewhat uniformly across the call group)
b
I know the default active request limit is 32, so it sounds like you're hitting the cap there... if nothing is stuck in a waiting state, then it could just be the OS churning on freeing ports or some other I/O
m
hmm, an active request limit of 32 feels very low - is there a way to tune this value somewhere? do long-running requests like websockets or SSE responses count against that limit?
b
oh that's a good question, I would hope not... the property is `ktor.deployment.runningLimit`, or just `runningLimit = 100` if you're not using a config file
👍 1
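(The programmatic equivalent, as a Ktor 2.x-style sketch; 100 and 64 are placeholder values.)

```kotlin
import io.ktor.server.engine.*
import io.ktor.server.netty.*

fun main() {
    embeddedServer(Netty, port = 8080, configure = {
        runningLimit = 100     // equivalent of ktor.deployment.runningLimit
        requestQueueLimit = 64 // how many accepted requests may wait in the queue
    }) {
        // modules...
    }.start(wait = true)
}
```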
m
that's a good hint, thanks! I'll do some experiments
hmm, I might be using this wrong - what exactly is considered an active request here? For kicks, I set `runningLimit = 3`, ran 12 long(ish) requests in parallel and didn't observe any queuing
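(Roughly the shape of that experiment, as a sketch; the URL and the /slow endpoint are placeholders.)

```kotlin
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

// Fire 12 requests at a slow endpoint concurrently and print the per-request times;
// queuing would show up as a staircase pattern in the sorted durations.
fun main() = runBlocking {
    val client = HttpClient(CIO)
    val times = (1..12).map {
        async { measureTimeMillis { client.get("http://localhost:8080/slow") } }
    }.awaitAll()
    println(times.sorted())
    client.close()
}
```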
b
huh, I'm not very familiar with the mechanism. I guess it throttles the number of requests doing I/O at the same time... it determines whether or not to call https://netty.io/4.1/api/io/netty/channel/ChannelHandlerContext.html#read--
m
I got that far - it's a different notion of active requests from the one in the metrics
it looks a bit sus with regard to when the limit is actually hit though, I'll experiment a bit more and get back to you tomorrow
🙏 1