# http4k
z
Hi everybody! We operate several Kotlin/http4k services in the AWS cloud. We regularly notice that single pods are "freaking out": first, the overall CPU usage goes up (which seems to happen when the GC switches from minor to major collections, see below). This stays stable for several hours, although after a while we see performance impacts. Eventually, the CPU usage "explodes" until the memory is freed and everything is fine again (as before; see "1" and "2" in the pictures below). We see this behaviour in basically all of our 6 Kotlin/http4k services. Any ideas or suggestions would be highly appreciated, since we are a bit out of ideas. 😔

Environment:
• Java 21
  ◦ eclipse-temurin (based on OpenJDK)
  ◦ JAVA_OPTS: "-XX:+UseParallelGC -XX:ActiveProcessorCount=2 -XX:MaxRAMPercentage=65.0 -XX:MinRAMPercentage=60.0"
• http4k: 5.14.0.0
• Server: Undertow (but also Netty)
• ~30 requests / second
• AWS, Kubernetes:
  ◦ requests: { memory: 1720Mi, cpu: 500m }
  ◦ limits: { memory: 1720Mi, cpu: 2000m }

Observations:
• When the overall CPU usage starts to increase, minor GCs stop completely.
• Self-healing effect (over ~12 hours).
• Response time increases massively in the "explosion" phase (we currently have a 10s timeout on the client side).
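For reference: with -XX:MaxRAMPercentage=65.0 against the 1720Mi container limit (which the JVM's container support should pick up by default on Java 21), the maximum heap works out to roughly 0.65 × 1720 MiB ≈ 1118 MiB.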
image.png
r
Stupid question, but you are sure that the process heals? It's not just that the orchestrator kills it and starts a new one? I ask only because I've defined graphs that made this mistake 😳
z
Good point. 🙂 In fact we see both: in the last 24 hours, k8s restarted the pod in one out of four cases; the other three healed themselves.
r
Can you reproduce & take a heap dump? Something (a cache?) must be hanging on to references in a way that means it can ultimately release them, so is presumably designed not to cause a memory leak, but it's releasing them way too late, sending the GC into a loop.
z
No, unfortunately not. Pods in prod are running on a JRE-only image (so we can't take a heap dump with the usual tooling). In general it is hard to reproduce (locally); it seems to be independent of the traffic and usually happens only after the pod has reached a certain age (a couple of days at minimum).
d
This does seem unusual - we've run much heavier workloads with Undertow in K8s in the past and haven't ever found problems related to http4k. Can you share the other details of your apps - what are they doing/talking to, what other libraries have you got in play, etc.?
z
The services are quite simple: either they call a database or another service (never both, all in the same VPC), do some simple data transformations and return JSON (http4k-format-jackson). We basically use forkhandles' Result4k, in one case http4k-connect-redis, in one case the official mongo client, and the okhttp client for calls to other services.
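To give a rough idea of the shape, a minimal sketch (route, types and values are made up for illustration, not our actual code):

```kotlin
import org.http4k.client.OkHttp
import org.http4k.core.Body
import org.http4k.core.Method.GET
import org.http4k.core.Request
import org.http4k.core.Response
import org.http4k.core.Status.Companion.OK
import org.http4k.core.with
import org.http4k.format.Jackson.auto
import org.http4k.routing.bind
import org.http4k.routing.path
import org.http4k.routing.routes
import org.http4k.server.Undertow
import org.http4k.server.asServer

// Hypothetical payload; the real services return similarly small JSON documents.
data class Summary(val id: String, val value: Long)

val summaryLens = Body.auto<Summary>().toLens()

// Outgoing calls to the other service in the VPC go through the okhttp-backed client
// (not wired in below, shown only to mirror the setup described above).
val upstream = OkHttp()

val app = routes(
    "/summary/{id}" bind GET to { req: Request ->
        val id = req.path("id") ?: "unknown"
        // ... call the database or the upstream service here, transform, and respond
        Response(OK).with(summaryLens of Summary(id = id, value = 42))
    }
)

fun main() {
    app.asServer(Undertow(8080)).start()
}
```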
d
In order to halve your problem space, can you try swapping out Undertow for something else like Jetty? I'd probably go with Jetty 11, since 12 is still quite new.
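The swap itself should only touch the server config line; a sketch assuming the http4k-server-jetty module (the Jetty 11-specific artifact may be named differently):

```kotlin
import org.http4k.core.Request
import org.http4k.core.Response
import org.http4k.core.Status.Companion.OK
import org.http4k.server.Jetty
import org.http4k.server.asServer

// Same HttpHandler code as before; only the server backend changes
// (requires the http4k-server-jetty dependency instead of http4k-server-undertow).
fun main() {
    val app = { _: Request -> Response(OK).body("ok") }
    app.asServer(Jetty(8080)).start()
}
```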
z
Originally we started with Netty and switched some of the services to Undertow (so we currently have both running). We haven't tried Jetty yet, though; we will do that. We are also considering Helidon …
j
UseParallelOldGC, aka PS MarkSweep, is an STW collector, so as soon as that kicks in for any length of time, you'll see things stop. Your memory graph shows the heap, but can you also show the other pools? It would be interesting to look at other metrics as well - for example, response times of upstream services and the number of requests in the queue. Undertow has different threads for different purposes, and it would be good to know how many of each you have and how many are currently in use. This can be obtained via JMX.
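Most of that is available from the standard platform MXBeans; a minimal sketch of what could be logged periodically (Undertow's own worker-pool MBeans are separate and not shown here):

```kotlin
import java.lang.management.ManagementFactory

// Sketch: dump the per-pool memory usage, GC counters and thread counts exposed
// by the platform MXBeans; could be logged on a schedule or behind an internal endpoint.
fun logJvmStats() {
    ManagementFactory.getMemoryPoolMXBeans().forEach { pool ->
        val u = pool.usage
        println("pool=${pool.name} used=${u.used} committed=${u.committed} max=${u.max}")
    }
    ManagementFactory.getGarbageCollectorMXBeans().forEach { gc ->
        println("gc=${gc.name} collections=${gc.collectionCount} timeMs=${gc.collectionTime}")
    }
    val threads = ManagementFactory.getThreadMXBean()
    println("threads live=${threads.threadCount} peak=${threads.peakThreadCount}")
}
```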
z
> Your memory graph shows the heap, but can you also show the other pools?
You mean something like this?
> we've run much heavier workloads with Undertow in K8s in the past and haven't ever found problems related to http4k
Our guess is that we have some kind of misconfiguration - that somehow the interaction of the k8s resources, the server in use (Undertow, Netty, etc.) and the JVM is not optimal. Are there any good practices when it comes to resources, Undertow config and JVM setup? There are so many degrees of freedom (which JVM version, which distribution, additional JAVA_OPTS configuration, k8s resources, etc.) ... 😳
j
Can you post a bit more history of that graph, to where it goes back to a low heap?
z
🤔 hmmmm …
j
My hunch is that it's a memory leak, or at least something holding on to things that you don't need. For example, it resets at exactly 7am, then grows constantly... and then the GC will use up more and more of the CPU.
Of course the JVM will only do a GC when it needs to free up memory... so you often need more information.
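Purely as a hypothetical illustration of that pattern (nothing taken from the actual services), a cache that only evicts on a schedule rather than by size or age would produce exactly this shape:

```kotlin
import java.time.Duration
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Hypothetical illustration of the suspected pattern: entries accumulate for a whole
// day and are only dropped by a scheduled wipe, so the heap grows steadily and is
// then released all at once (e.g. "at exactly 7am").
object DailyCache {
    private val entries = ConcurrentHashMap<String, ByteArray>()
    private val scheduler = Executors.newSingleThreadScheduledExecutor()

    init {
        val day = Duration.ofDays(1).toMillis()
        scheduler.scheduleAtFixedRate({ entries.clear() }, day, day, TimeUnit.MILLISECONDS)
    }

    fun put(key: String, value: ByteArray) {
        entries[key] = value // never evicted until the next scheduled clear
    }

    fun get(key: String): ByteArray? = entries[key]
}
```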