# http4k
z
Hi everybody! We operate several Kotlin/http4k services in the AWS cloud. We regularly notice that single pods are "freaking out": first, the overall CPU usage goes up (which seems to happen when the GC switches from minor to major collections, see below). This stays stable for several hours, although after a while we see performance impacts. Eventually, the CPU usage "explodes" until the memory is freed and everything is fine again (as before; see "1" and "2" in the pictures below). We see this behaviour in basically all of our 6 Kotlin/http4k services. Any ideas or suggestions would be highly appreciated, since we are a bit out of ideas. 😔

Environment:
• Java 21
  ◦ eclipse-temurin (based on OpenJDK)
  ◦ JAVA_OPTS: "-XX:+UseParallelGC -XX:ActiveProcessorCount=2 -XX:MaxRAMPercentage=65.0 -XX:MinRAMPercentage=60.0"
• http4k: 5.14.0.0
• Server: Undertow (but also Netty)
• ~30 requests / second
• AWS, Kubernetes:
  ◦ requests: { memory: 1720Mi, cpu: 500m }
  ◦ limits: { memory: 1720Mi, cpu: 2000m }

Observations:
• When the overall CPU usage starts to increase, minor GCs stop completely.
• Self-healing effect (over ~12 hours).
• Response time increases massively in the "explosion" phase (we currently have a 10s timeout on the client side).
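For reference: with -XX:MaxRAMPercentage=65.0 against the 1720Mi container limit (which the JVM's container support should pick up by default on Java 21), the maximum heap works out to roughly 0.65 × 1720 MiB ≈ 1118 MiB.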
image.png
r
Stupid question, but you are sure that the process heals? It's not just that the orchestrator kills it and starts a new one? I ask only because I've defined graphs that made this mistake 😳
z
Good point. 🙂 In fact we see both: in the last 24 hours, k8s restarted the pod in one out of four cases; the other three healed themselves.
r
Can you reproduce & take a heap dump? Something (a cache?) must be hanging on to references in a way that means it can ultimately release them, so is presumably designed not to cause a memory leak, but it's releasing them way too late, sending the GC into a loop.
z
No, unfortunately not. Pods in prod are running on a JRE-only image (so we can't take a heap dump with the usual tooling). In general it is hard to reproduce (locally); it seems to be independent of the traffic and usually happens only after the pod has reached a certain age (a couple of days at minimum).
d
This does seem unusual - we've run much heavier workloads with Undertow in K8s in the past and haven't ever found problems related to http4k. Can you share the other details of your apps - what are they doing/talking to, what other libraries have you got in play, etc.?
z
The services are quite simple: either they call a database or another service (never both, all in the same VPC), do some simple data transformations and return JSON (http4k-format-jackson). We basically use forkhandles' Result4k, in one case http4k-connect-redis, in one case the official mongo client, and the okhttp client for calls to other services.
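To give a rough idea of the shape, a minimal sketch (route, types and values are made up for illustration, not our actual code):

```kotlin
import org.http4k.client.OkHttp
import org.http4k.core.Body
import org.http4k.core.Method.GET
import org.http4k.core.Request
import org.http4k.core.Response
import org.http4k.core.Status.Companion.OK
import org.http4k.core.with
import org.http4k.format.Jackson.auto
import org.http4k.routing.bind
import org.http4k.routing.path
import org.http4k.routing.routes
import org.http4k.server.Undertow
import org.http4k.server.asServer

// Hypothetical payload; the real services return similarly small JSON documents.
data class Summary(val id: String, val value: Long)

val summaryLens = Body.auto<Summary>().toLens()

// Outgoing calls to the other service in the VPC go through the okhttp-backed client
// (not wired in below, shown only to mirror the setup described above).
val upstream = OkHttp()

val app = routes(
    "/summary/{id}" bind GET to { req: Request ->
        val id = req.path("id") ?: "unknown"
        // ... call the database or the upstream service here, transform, and respond
        Response(OK).with(summaryLens of Summary(id = id, value = 42))
    }
)

fun main() {
    app.asServer(Undertow(8080)).start()
}
```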
d
In order to halve your problem space, can you try swapping out Undertow for something else like Jetty? I'd probably go with Jetty 11, since 12 is still quite new.
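The swap itself should only touch the server config line; a sketch assuming the http4k-server-jetty module (the Jetty 11-specific artifact may be named differently):

```kotlin
import org.http4k.core.Request
import org.http4k.core.Response
import org.http4k.core.Status.Companion.OK
import org.http4k.server.Jetty
import org.http4k.server.asServer

// Same HttpHandler code as before; only the server backend changes
// (requires the http4k-server-jetty dependency instead of http4k-server-undertow).
fun main() {
    val app = { _: Request -> Response(OK).body("ok") }
    app.asServer(Jetty(8080)).start()
}
```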
z
Originally we started with Netty and switched some of the services to Undertow (so we currently have both running). We haven't tried Jetty yet, though; we will do that. We are also considering Helidon …
j
UseParallelOldGC, aka PS MarkSweep, is an STW collector, so as soon as that kicks in for any length of time, you'll see things stop. Your memory graph shows the heap, but can you also show the other pools? It would be interesting to look at other metrics as well - for example, response times of upstream services and the number of requests in the queue. Undertow has different threads for different purposes, and it would be good to know how many of each you have and how many are currently in use. This can be obtained via JMX.
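Most of that is available from the standard platform MXBeans; a minimal sketch of what could be logged periodically (Undertow's own worker-pool MBeans are separate and not shown here):

```kotlin
import java.lang.management.ManagementFactory

// Sketch: dump the per-pool memory usage, GC counters and thread counts exposed
// by the platform MXBeans; could be logged on a schedule or behind an internal endpoint.
fun logJvmStats() {
    ManagementFactory.getMemoryPoolMXBeans().forEach { pool ->
        val u = pool.usage
        println("pool=${pool.name} used=${u.used} committed=${u.committed} max=${u.max}")
    }
    ManagementFactory.getGarbageCollectorMXBeans().forEach { gc ->
        println("gc=${gc.name} collections=${gc.collectionCount} timeMs=${gc.collectionTime}")
    }
    val threads = ManagementFactory.getThreadMXBean()
    println("threads live=${threads.threadCount} peak=${threads.peakThreadCount}")
}
```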
z
> Your memory graph shows the heap, but can you also show the other pools?
You mean something like this?
> we've run much heavier workloads with Undertow in K8s in the past and haven't ever found problems related to http4k
Our guess is that we have some kind of misconfiguration - that somehow the interaction of the k8s resources, the server in use (Undertow, Netty, etc.) and the JVM is not optimal. Are there any good practices when it comes to resources, Undertow config and JVM setup? There are so many degrees of freedom (which JVM version, which distribution, additional JAVA_OPTS configuration, k8s resources, etc.) ... 😳
j
Can you post a bit more history of that graph, to where it goes back to a low heap?
z
🤔 hmmmm …
j
My hunch is that it's a memory leak, or at least something holding on to things that you don't need. For example, it resets at exactly 7am, then grows constantly... and then the GC will use up more and more of the CPU.
Of course the JVM will only do a GC when it needs to free up memory... so you often need more information.
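Purely as a hypothetical illustration of that pattern (nothing taken from the actual services), a cache that only evicts on a schedule rather than by size or age would produce exactly this shape:

```kotlin
import java.time.Duration
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Hypothetical illustration of the suspected pattern: entries accumulate for a whole
// day and are only dropped by a scheduled wipe, so the heap grows steadily and is
// then released all at once (e.g. "at exactly 7am").
object DailyCache {
    private val entries = ConcurrentHashMap<String, ByteArray>()
    private val scheduler = Executors.newSingleThreadScheduledExecutor()

    init {
        val day = Duration.ofDays(1).toMillis()
        scheduler.scheduleAtFixedRate({ entries.clear() }, day, day, TimeUnit.MILLISECONDS)
    }

    fun put(key: String, value: ByteArray) {
        entries[key] = value // never evicted until the next scheduled clear
    }

    fun get(key: String): ByteArray? = entries[key]
}
```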