# ktor
a
We are seeing slow responses from a third party (around 3.5s) when we get many requests in a short time (about 20 per second, all of which need to go to the third party). The "weird" thing is that the response time from our server increases much more (maybe not so weird, but I would love to hear your take on it). It increases so much that the callers are getting timeouts. We are using Ktor Server+Client. The client is configured with OkHttp and the server with Jetty, if that matters. So my real question is: what is going on here? And how can I know? Are there metrics I can monitor to see what is blocking, instead of just the "pattern" of response times increasing? What would you do to resolve it? 🙂
a
Hard to say anything concrete 😅 In general OpenTelemetry is great for observability, and has OOTB support for Ktor. It’s handy in situations like this, but you might need other tools also. Can you repro the issue locally?
a
Haven't really reproduced it locally, but it might be possible. We do use New Relic, so we have decent numbers inside the JVM. We see that the time spent calling the external partner is consistent, but that the time used to respond to the requests goes up. My theory is that things are queued, as NR is showing processing time (avg) around 3.5s but the response times (no processing time) go up to 15s. So in essence NR is saying there's a lot of waiting going on without processing in our code. So I am looking for ways to analyze/confirm this situation. Queued time in the Ktor server could give an indication. Or the number of queued processes. Similarly for the client. One track I am following is that OkHttp has a max of 5 connections to one host. When the fill rate is 20 requests per sec, each takes 3.6 sec, and we have 2 nodes (10 connections), that explains a potential queuing. I'd still like to see some kind of info to confirm that though. I will try to increase the number of connections to one host in OkHttp now. Would be nice to have some info though, outside of just trying stuff. 🙂 Do you know how much parallelism the server will handle with Ktor+Jetty? I think the client is probably the limit here now.
a
OkHttp has a max of 5 connections to one host.
Do you mean the connection pool? It should just create a new connection if the pool is exhausted, and the overhead of that definitely shouldn’t amount to what you’re seeing 🤔
a
Thanks @Nils Kohrs! Trying to apply that now. 🙂
@AdamW I think it is a hard limit? This is the doc from the okhttp Dispatcher:
The maximum number of requests for each host to execute concurrently. This limits requests by the URL's host name. Note that concurrent requests to a single IP address may still exceed this limit: multiple hostnames may share an IP address or be routed through the same HTTP proxy.
But I might be misinterpreting though. 🙂
And if it is so, doesn't it make sense the response time increases because things are queued? If I can do 10 requests (2 nodes) and they all take 3.5 sec, and I have an influx of 20 requests per sec it will just fill up and the response time will become longer and longer?
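A back-of-the-envelope check of that queuing theory, as a minimal sketch using Little's law (requests in flight = arrival rate × time per request). The numbers are the ones quoted in this thread; the function names are made up for illustration, and the model assumes a steady arrival rate and a constant third-party latency.

```kotlin
import kotlin.math.ceil

// Little's law: average requests in flight = arrival rate * average time per request.
fun requiredConcurrency(arrivalsPerSecond: Double, latencySeconds: Double): Double =
    arrivalsPerSecond * latencySeconds

// Highest throughput that `slots` concurrent requests can sustain at the given latency.
fun maxThroughput(slots: Int, latencySeconds: Double): Double =
    slots / latencySeconds

fun main() {
    val arrivalsPerSecond = 20.0 // from the thread
    val latencySeconds = 3.5     // from the thread
    val slots = 2 * 5            // 2 nodes x 5 concurrent requests per host

    val needed = requiredConcurrency(arrivalsPerSecond, latencySeconds)
    val capacity = maxThroughput(slots, latencySeconds)

    println("in-flight requests needed: ${ceil(needed).toInt()}")
    println("sustainable throughput (req/s): $capacity")
    println("backlog growth (req/s): ${arrivalsPerSecond - capacity}")
}
```

Under those assumptions roughly 70 requests would need to be in flight, while 10 slots sustain under 3 requests per second, so the backlog (and with it the response time) keeps growing for as long as the burst lasts.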
n
It's a hard limit, we had the exact same problem you described. The amount of concurrent calls was limited
a
Great, hoping this works. 🙂 Are you using OkHttp for any specific reason? We're not married to OkHttp as the engine, but we do get better numbers in New Relic for it because it is supported. 🙂
n
We're using Dynatrace, and it doesn't support CIO, so we would lose tracing of the calls. The Apache client had a nasty memory leak in the past, don't know if that is fixed
Java client also doesn't produce traces with Dynatrace
So that leaves us with OkHttp being the best choice. It does perform very well, just this little default of 5 concurrent requests per host was annoying to find out about
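For reference, a minimal sketch of raising that limit through the Ktor OkHttp engine, assuming the `ktor-client-okhttp` dependency and OkHttp 4.x. The limits shown are illustrative, not values from this thread; size them to your own load and to what the third party can handle.

```kotlin
import io.ktor.client.HttpClient
import io.ktor.client.engine.okhttp.OkHttp
import okhttp3.Dispatcher

// Illustrative limits only (assumption): pick values that match your traffic.
val client = HttpClient(OkHttp) {
    engine {
        // `config` runs against the underlying OkHttpClient.Builder.
        config {
            dispatcher(
                Dispatcher().apply {
                    maxRequests = 256        // OkHttp default is 64
                    maxRequestsPerHost = 128 // OkHttp default is 5
                }
            )
        }
    }
}
```

Both values matter: `maxRequestsPerHost` caps calls to a single host name, while `maxRequests` caps the total across all hosts, so raising only the per-host limit still leaves the overall cap in place.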
a
Hehe, so same reason basically. Great, thanks for the help. 🙂
Find it annoying there isn't something like queue metrics for Ktor and/or Coroutines though. Makes these exercises a lot more touchy-feely. 😞
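On the client side at least, OkHttp's `Dispatcher` exposes `runningCallsCount()` and `queuedCallsCount()`, which can be polled as a rough queue metric. A minimal sketch, assuming you keep a reference to the same `Dispatcher` instance you hand to OkHttp; `formatStats` and `startStatsLogging` are hypothetical helper names, and this says nothing about queuing inside the Ktor/Jetty server.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.ScheduledExecutorService
import java.util.concurrent.TimeUnit

// Formats one sample; swap println below for your metrics library of choice.
fun formatStats(running: Int, queued: Int): String =
    "okhttp running=$running queued=$queued"

// Polls the two counters on a daemon thread and prints them every `periodSeconds`.
// Returns the scheduler so the caller can shut it down.
fun startStatsLogging(
    running: () -> Int,
    queued: () -> Int,
    periodSeconds: Long = 5,
): ScheduledExecutorService {
    val scheduler = Executors.newSingleThreadScheduledExecutor { r ->
        Thread(r, "okhttp-dispatcher-stats").apply { isDaemon = true }
    }
    scheduler.scheduleAtFixedRate(
        { println(formatStats(running(), queued())) },
        0, periodSeconds, TimeUnit.SECONDS,
    )
    return scheduler
}

// Wiring (assumes okhttp3.Dispatcher is on the classpath and this exact
// instance is the one configured on the OkHttp client):
//   val sharedDispatcher = okhttp3.Dispatcher()
//   startStatsLogging(sharedDispatcher::runningCallsCount, sharedDispatcher::queuedCallsCount)
```

A queued count that stays above zero while the running count sits at the per-host limit would be consistent with the queuing theory discussed above.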
n
Dynatrace's problem is that if you want to add manual traces with their SDK, you need to start and stop the trace on the same thread... so it doesn't really work in this case
a
Yeah, NR has its issues with Kotlin/Coroutines too, but it's getting better. At least now Coroutines are officially supported. 🙂
n
Dynatrace also doesn't support coroutines. I was able to make my own Java agent to force this support. Soon we'll also have OpenTelemetry set up, which does have a lot more things supported out of the box, including coroutines
I did send Dynatrace the code to support coroutines, but they don't have any Kotlin developers, so they just can't provide the support for it 😒
a
Ah, that's weird. About time they hired some. 😉 NR dragged their feet as well, but are supporting it now so that's a relief. Looked into the OpenTelemetry agent too, but it didn't give as good metrics as NR at the time. Think I might have to give it a go again. 🙂 Thanks for the help. Will actually know around 12 if it worked. 🙂
a
Well, TIL. I just verified it’s indeed like that @Nils Kohrs - this could have become an issue down the road, thanks 🙇
a
Hey @Nils Kohrs. Did you see any effects on memory consumption when increasing the number of concurrent requests? We're wondering if we see a slight increase because of that or if there are other issues. 🙂
n
Memory hasn't been an issue with it. I'm not sure whether it has increased slightly. What would you label as a slight increase?
And it probably also depends on the size of the response you're processing. We didn't have to adjust the memory resources for our pods
a
We have quite a low average and were hovering below 200MB of heap. It is usually closer to 300MB now, so about 50% more, but not too bad considering how low it was. 🙂 Yeah, the responses are probably medium plus-ish. Thanks, just trying to see which areas might be affected. 🙂