# ktor
m
Hi! Are there any general tips for debugging slow-performing Ktor backends? We have a backend service with a mildish load (~1k rpm per instance) running on Ktor and it seems to get stuck at times for no obvious reason - e.g. a ~1ms request taking a minute or more. In some cases we don't even observe the issue at the handler level. We do put blocking tasks off to the IO dispatcher and we've increased its size substantially. I suspect there's a dispatcher that ends up being a bottleneck, either the IO one or one of the Netty ones. I'm not sure from what angle to approach debugging this, though.
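(For context, the knob usually meant by "increasing the IO dispatcher's size" is a system property; a minimal sketch assuming the kotlinx.coroutines defaults, with 512 as a placeholder value.)

```kotlin
// Dispatchers.IO is sized by the kotlinx.coroutines.io.parallelism system property
// (default is max(64, number of cores)). It must be set before the dispatcher is first
// touched, so passing it as a JVM flag is safest: -Dkotlinx.coroutines.io.parallelism=512
fun main() {
    System.setProperty("kotlinx.coroutines.io.parallelism", "512")
    // ...start the Ktor server only after this point
}
```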
b
It might make sense to put blocking tasks on a separate dispatcher with its own thread pool if the IO dispatcher is getting saturated, since Ktor itself relies on the IO dispatcher in some places. It's always good to capture a thread dump and see what the threads are doing; if a lot of them are in a waiting state when things slow down, it's possible some blocking operation is gumming up a dispatcher.
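(A minimal sketch of the "separate dispatcher" suggestion; the pool size and the helper name are placeholders, not anything Ktor provides.)

```kotlin
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.withContext
import java.util.concurrent.Executors

// A dedicated pool for blocking work, kept separate from Dispatchers.IO so that
// application code can't starve Ktor's own use of the IO dispatcher.
val blockingDispatcher = Executors.newFixedThreadPool(64).asCoroutineDispatcher()

// Hop onto the dedicated pool from a suspending handler for any blocking call (JDBC etc.).
suspend fun <T> runBlockingWork(block: () -> T): T =
    withContext(blockingDispatcher) { block() }
```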
a
@madisp, can you tell me which server engine you use?
m
I'm using the netty engine
@Bruce Hamilton good point - I was already thinking of offloading things like DB operations to a separate virtual-thread pool
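(On JDK 21+ the same idea works with virtual threads; a sketch, not something Ktor ships.)

```kotlin
import kotlinx.coroutines.asCoroutineDispatcher
import java.util.concurrent.Executors

// Runs each task on its own virtual thread (JDK 21+), so the dispatcher is not bounded
// by a fixed platform-thread pool size.
val virtualThreadDispatcher =
    Executors.newVirtualThreadPerTaskExecutor().asCoroutineDispatcher()
```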
it would be really awesome if there was a non-hackish way of attaching micrometer/prometheus metrics to dispatchers
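(One hedged option for dispatcher metrics: Micrometer can instrument the backing ExecutorService before it is turned into a dispatcher. The pool size and the "blocking-pool" name are placeholders.)

```kotlin
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.binder.jvm.ExecutorServiceMetrics
import kotlinx.coroutines.asCoroutineDispatcher
import java.util.concurrent.Executors

// Wrap the executor with ExecutorServiceMetrics (pool size, queue depth, task timings),
// then expose the wrapped executor as a coroutine dispatcher.
fun meteredBlockingDispatcher(registry: MeterRegistry) =
    ExecutorServiceMetrics.monitor(registry, Executors.newFixedThreadPool(64), "blocking-pool")
        .asCoroutineDispatcher()
```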
I started monitoring the various Netty event loop groups and it looks like it's the `call` group that gets a lot of pending tasks in spikes - what exactly does that event loop group do?
is this the handler / pipeline event loop group?
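(A sketch of how that pending-task monitoring can be wired up with Micrometer, assuming you already have a reference to the `EventLoopGroup` in question - getting at Ktor's internal call group may require engine internals or a custom bootstrap. The metric name is made up.)

```kotlin
import io.micrometer.core.instrument.Gauge
import io.micrometer.core.instrument.MeterRegistry
import io.netty.channel.EventLoopGroup
import io.netty.util.concurrent.SingleThreadEventExecutor

// Register one gauge per event loop thread, reporting Netty's pendingTasks() count,
// so a single choked loop shows up even when the group average looks fine.
fun monitorEventLoopGroup(registry: MeterRegistry, group: EventLoopGroup, groupName: String) {
    group.forEachIndexed { index, executor ->
        val loop = executor as? SingleThreadEventExecutor ?: return@forEachIndexed
        Gauge.builder("netty.eventloop.pending.tasks") { loop.pendingTasks().toDouble() }
            .tag("group", groupName)
            .tag("loop", index.toString())
            .register(registry)
    }
}
```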
b
yes, that's the one
you can try increasing the number of threads available by updating `callGroupSize`, though there shouldn't be any blocking calls in there... unless some plugin or interceptor is misbehaving
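(For reference, a Ktor 2.x-style sketch of where the group sizes live; the numbers are placeholders.)

```kotlin
import io.ktor.server.engine.*
import io.ktor.server.netty.*

fun main() {
    embeddedServer(Netty, port = 8080, configure = {
        connectionGroupSize = 2 // accepts connections
        workerGroupSize = 4     // handles socket reads/writes
        callGroupSize = 8       // runs the application call pipeline
    }) {
        // install plugins / routing here
    }.start(wait = true)
}
```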
m
ok, I see pending tasks climbing up to ~500 or so under load, while `ktor_http_server_requests_active` is 30
🤔 1
io pool is 512 threads, I'll see whether I can find anything on what could be blocking this
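(A minimal way to capture a dump from inside the process while the slowdown is happening; jstack or an APM would work just as well.)

```kotlin
import java.lang.management.ManagementFactory

// Dump all JVM threads with lock information; look for DefaultDispatcher-worker and
// eventLoopGroup threads that are BLOCKED or WAITING while requests are piling up.
fun dumpThreads(): String =
    ManagementFactory.getThreadMXBean()
        .dumpAllThreads(true, true)
        .joinToString("\n") { it.toString() }
```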
I don't think increasing `callGroupSize` would help either - the group has 4 threads (we have 4 vCPUs) and only one of the threads peaks for a few seconds
so it looks like something chokes on one of the call group event loops and then tasks start piling up on that thread (I'm assuming the incoming requests are distributed somewhat uniformly across the call group)
b
I know the default active request limit is 32, so it sounds like you're hitting the cap there... if nothing is stuck in a waiting state, then it could just be the OS churning on freeing ports or some other I/O
m
hmm, an active request limit of 32 feels very low - is there a way to tune this value somewhere? do long-running requests like websockets or SSE responses count against that limit?
b
oh that's a good question, I would hope not... the property is `ktor.deployment.runningLimit`, or just `runningLimit = 100` if you're not using a config file
👍 1
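(The programmatic equivalent, as a Ktor 2.x-style sketch; 100 and 64 are placeholder values.)

```kotlin
import io.ktor.server.engine.*
import io.ktor.server.netty.*

fun main() {
    embeddedServer(Netty, port = 8080, configure = {
        runningLimit = 100     // equivalent of ktor.deployment.runningLimit
        requestQueueLimit = 64 // how many accepted requests may wait in the queue
    }) {
        // modules...
    }.start(wait = true)
}
```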
m
that's a good hint, thanks! I'll do some experiments
hmm, I might be using this wrong - what exactly is considered an active request here? For kicks, I set `runningLimit = 3`, ran 12 long(ish) requests in parallel and didn't observe any queuing
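(Roughly the shape of that experiment, as a sketch; the URL and the /slow endpoint are placeholders.)

```kotlin
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

// Fire 12 requests at a slow endpoint concurrently and print the per-request times;
// queuing would show up as a staircase pattern in the sorted durations.
fun main() = runBlocking {
    val client = HttpClient(CIO)
    val times = (1..12).map {
        async { measureTimeMillis { client.get("http://localhost:8080/slow") } }
    }.awaitAll()
    println(times.sorted())
    client.close()
}
```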
b
huh, I'm not very familiar with the mechanism. I guess it throttles the number of requests doing I/O at the same time... it determines whether or not to call https://netty.io/4.1/api/io/netty/channel/ChannelHandlerContext.html#read--
m
I got that far - it's a different notion of active requests from the one in the metrics
it looks a bit sus with regard to when the limit is actually hit though, I'll experiment a bit more and get back to you tomorrow
🙏 1