We are seeing slow responses from a third party (around 3.5s) | kotlinlang #ktor

Anders Sveen

12/07/2023, 11:28 AM

We are seeing slow responses from a third party (around 3.5s) when we get many requests in a short time (about 20 per second, all of which need to go to the third party). The "weird" thing is that the response time from our server increases much more than that (maybe not so weird, but I would love to hear your take on it). It increases so much that the callers are getting timeouts. We are using Ktor Server+Client. The client is configured with OkHttp and the server with Jetty, if that matters. So my real question is: what is going on here? And how can I find out? Are there metrics I can monitor to see what is actually blocking, instead of just the "pattern" of response times increasing? What would you do to resolve it? 🙂

AdamW

12/07/2023, 2:00 PM

Hard to say anything concrete 😅 In general OpenTelemetry is great for observability, and has OOTB support for Ktor. It’s handy in situations like this, but you might need other tools also. Can you repro the issue locally?

Anders Sveen

12/07/2023, 2:10 PM

Haven't really reproduced it locally, but it might be possible. We do use New Relic, so we have decent numbers from inside the JVM. We see that the time spent calling the external partner is consistent, but that the time taken to respond to requests goes up. My theory is that things are queued, as NR is showing an average processing time of around 3.5s while the response times (as opposed to processing time) go up to 15s. So in essence NR is saying there's a lot of waiting going on without any processing in our code. So I am looking for ways to analyze/confirm this situation. Queued time in the Ktor server could give an indication, or the number of queued processes. Similarly for the client. One track I am following is that OkHttp has a max of 5 connections to one host. When the arrival rate is 20 requests per sec, each takes 3.6 sec and we have 2 nodes (10 connections), that explains potential queuing. I'd still like to see some kind of info to confirm that though. I will try to increase the number of connections to one host in OkHttp now. Would be nice to have some info though, outside of just trying stuff. 🙂 Do you know how much parallelism the server will handle with Ktor+Jetty? I think the client is probably the limit here now.

AdamW

12/07/2023, 4:59 PM

"OkHttp has a max of 5 connections to one host."

Do you mean the connection pool? It should just create a new connection if the pool is exhausted, and the overhead of that definitely shouldn’t amount to what you’re seeing 🤔

Nils Kohrs

12/08/2023, 7:22 AM

Have a look at this thread https://kotlinlang.slack.com/archives/C0A974TJ9/p1699701136203609?thread_ts=1699701136.203609&cid=C0A974TJ9

Anders Sveen

12/08/2023, 7:26 AM

Thanks @Nils Kohrs! Trying to apply that now. 🙂

Anders Sveen

12/08/2023, 7:27 AM

@AdamW I think it is a hard limit? This is the doc from the okhttp Dispatcher:

The maximum number of requests for each host to execute concurrently. This limits requests by the URL's host name. Note that concurrent requests to a single IP address may still exceed this limit: multiple hostnames may share an IP address or be routed through the same HTTP proxy.

But I might be misinterpreting though. 🙂

Anders Sveen

12/08/2023, 7:30 AM

And if so, doesn't it make sense that the response time increases because things are queued? If I can do 10 concurrent requests (2 nodes) and they each take 3.5 sec, and I have an influx of 20 requests per sec, it will just fill up and the response time will become longer and longer?
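The reasoning above can be checked with a back-of-the-envelope calculation. This is only a sketch using the numbers quoted in the thread (20 requests per second, 3.5 s per upstream call, 2 nodes with 5 calls per host each); it assumes a constant arrival rate and constant upstream latency, and the names `QueueEstimate` and `estimate` are made up for illustration.

```kotlin
// Rough queueing estimate. Assumes constant arrival rate and constant latency.
data class QueueEstimate(
    val requiredConcurrency: Double, // in-flight calls needed to keep up (Little's law)
    val maxThroughputPerSec: Double, // calls/sec the available slots can complete
    val backlogGrowthPerSec: Double, // calls/sec piling up in the queue while overloaded
)

fun estimate(arrivalsPerSec: Double, latencySec: Double, slots: Int): QueueEstimate {
    val required = arrivalsPerSec * latencySec
    val throughput = slots / latencySec
    val growth = maxOf(0.0, arrivalsPerSec - throughput)
    return QueueEstimate(required, throughput, growth)
}

fun main() {
    // 20 req/s, 3.5 s per upstream call, 2 nodes x 5 calls per host = 10 slots
    val e = estimate(arrivalsPerSec = 20.0, latencySec = 3.5, slots = 10)
    println("in-flight calls needed: ${e.requiredConcurrency}")
    println("max throughput: ${"%.2f".format(e.maxThroughputPerSec)} req/s")
    println("backlog growth: ${"%.2f".format(e.backlogGrowthPerSec)} req/s")
}
```

With these inputs, keeping up would need about 70 calls in flight while only 10 slots exist, so the backlog keeps growing for as long as the burst lasts. That is consistent with response times climbing well past the 3.5 s upstream latency.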

Nils Kohrs

12/08/2023, 7:30 AM

It's a hard limit, we had the exact same problem you described. The number of concurrent calls was limited.
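For reference, a minimal sketch of raising that limit on a Ktor client that uses the OkHttp engine. The thread does not show the configuration that was actually applied, so this is an assumption about how it could look; the values 128 and 64 are illustrative only and should be sized against what the third party can handle.

```kotlin
import io.ktor.client.HttpClient
import io.ktor.client.engine.okhttp.OkHttp
import okhttp3.Dispatcher

// Illustrative limits, not values recommended anywhere in this thread.
val client = HttpClient(OkHttp) {
    engine {
        config {
            // `config` exposes the underlying OkHttpClient.Builder.
            dispatcher(
                Dispatcher().apply {
                    maxRequests = 128       // OkHttp default is 64
                    maxRequestsPerHost = 64 // OkHttp default is 5
                }
            )
        }
    }
}
```

Note that `maxRequests` caps the total across all hosts, so raising `maxRequestsPerHost` above it has no effect.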

Anders Sveen

12/08/2023, 7:31 AM

Great, hoping this works. 🙂 Are you using OkHttp for any specific reason? We're not married to OkHttp as the engine, but we do get better numbers in New Relic for it because it is supported. 🙂

Nils Kohrs

12/08/2023, 7:32 AM

We're using Dynatrace, and it doesn't support CIO, so we would lose tracing of the calls. The Apache client had a nasty memory leak in the past, don't know if that is fixed

Nils Kohrs

12/08/2023, 7:33 AM

Java client also doesn't produce traces with Dynatrace

Nils Kohrs

12/08/2023, 7:34 AM

So we end up with OkHttp being the best choice. It does perform very well, it's just that this little default of 4-5 concurrent requests was annoying to find out about

Anders Sveen

12/08/2023, 7:34 AM

Hehe, so same reason basically. Great, thanks for the help. 🙂

Anders Sveen

12/08/2023, 7:35 AM

I find it annoying that there isn't something like queue metrics for Ktor and/or Coroutines though. Makes these exercises a lot more touchy-feely. 😞
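For the client side at least, OkHttp's `Dispatcher` exposes `runningCallsCount()` and `queuedCallsCount()`, which can be sampled and reported to whatever metrics system is in use. The sketch below is not something anyone in the thread did; `DispatcherSnapshot` and `snapshot` are hypothetical names, and it says nothing about queueing inside the Ktor server or in coroutine dispatchers.

```kotlin
import okhttp3.Dispatcher

// Point-in-time view of OkHttp's own accounting of asynchronous calls.
data class DispatcherSnapshot(val running: Int, val queued: Int)

fun snapshot(dispatcher: Dispatcher) = DispatcherSnapshot(
    running = dispatcher.runningCallsCount(),
    queued = dispatcher.queuedCallsCount(),
)

fun main() {
    val sharedDispatcher = Dispatcher().apply { maxRequestsPerHost = 64 }
    // Hypothetical wiring: hand this same instance to the Ktor engine via
    // engine { config { dispatcher(sharedDispatcher) } } so the counts reflect real traffic.
    val s = snapshot(sharedDispatcher)
    println("okhttp running=${s.running} queued=${s.queued}")
}
```

A queued count that stays above zero under load would suggest the dispatcher limit is the bottleneck, which is the confirmation that was missing earlier in the thread.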

Nils Kohrs

12/08/2023, 7:37 AM

Dynatrace's problem is that if you want to add manual traces with their SDK, you need to start and stop them on the same thread... so it doesn't really work in this case

Anders Sveen

12/08/2023, 7:44 AM

Yeah, NR has its issues with Kotlin/Coroutines too, but it's getting better. At least now Coroutines are officially supported. 🙂

Nils Kohrs

12/08/2023, 8:04 AM

Dynatrace also doesn't support coroutines. I was able to make my own Java agent to force this support. Soon we'll also have OpenTelemetry set up, which has a lot more things supported out of the box, including coroutines

Nils Kohrs

12/08/2023, 8:05 AM

I did send Dynatrace the code to support coroutines, but they don't have any Kotlin developers, so they just can't provide support for it 😒

Anders Sveen

12/08/2023, 8:32 AM

Ah, that's weird. About time they hired some. 😉 NR dragged their feet as well, but they are supporting it now, so that's a relief. Looked into the OpenTelemetry agent too, but it didn't give as good metrics as NR at the time. Think I might have to give it a go again. 🙂 Thanks for the help. Will actually know around 12 if it worked. 🙂

AdamW

12/08/2023, 9:30 AM

Well, TIL. I just verified it’s indeed like that @Nils Kohrs - this could have become an issue down the road, thanks 🙇

Anders Sveen

12/18/2023, 2:03 PM

Hey @Nils Kohrs. Did you see any effects on memory consumption when increasing the number of concurrent requests? We're wondering if we see a slight increase because of that or if there are other issues. 🙂

Nils Kohrs

12/18/2023, 2:05 PM

Memory hasn't been an issue with it. I'm not sure whether it has increased slightly. What would you label as a slight increase?

Nils Kohrs

12/18/2023, 2:06 PM

And it probably also depends on the size of the response you're processing. We didn't have to adjust the memory resources for our pods

Anders Sveen

12/18/2023, 2:09 PM

We have quite a low average and were hovering below 200MB on heap. It is usually closer to 300MB now, so a 50% increase, but not too bad considering how low it was. 🙂 Yeah, the responses are probably medium plus-ish. Thanks, just trying to see which areas might be affected. 🙂
