Has anyone had problems with memory in a Ktor server app? | kotlinlang #server

Vinicius Araujo

04/24/2024, 2:53 PM

Has anyone had problems with the memory usage of a Ktor server app? I have a Ktor Netty server deployed on an ECS Fargate instance, and I see that memory usage keeps increasing almost linearly. When it reaches about 70% of capacity, the health check starts timing out and the instance gets replaced, and then the behavior starts again. I took a heap dump, and it doesn't look like anything in the app is leaking memory. Maybe the number of requests is exceeding the instance's capacity? Any help would be really appreciated.

Task1.hprof

Andrew O'Hara

04/24/2024, 3:55 PM

I believe Netty uses a static thread pool, so the server and request rate are unlikely to be to blame for high memory usage. For contrast, I believe the Helidon server uses disposable virtual threads, and its presumably unbounded request rate has caused memory spikes for me. Some theories for your situation:
• You might just have a memory leak in your application. Are you performing any in-memory caching, or using any persistent data structures?
• I personally don't find Netty to be the most stable server. Have you tried a different server implementation?
• It's odd that your app would die at 70% memory utilization. I'm not familiar with how Netty would react to too many requests, but perhaps you just need to scale out to handle the load.

🙏 1

James Richardson

04/24/2024, 5:02 PM

I'd suggest being careful about sharing hprofs (or just not doing it); you can give out more information than you might think. An hprof doesn't contain (that I know of) the Xmx settings, but the size of the objects in the profile is small, only 20MB or so. There are a few threads that seem to be doing something with mongo. Is it possible that you're waiting for a slow operation? What is the Fargate config? You're running with Java 11. What is the Java config?

Vinicius Araujo

04/24/2024, 6:47 PM

@James Richardson thanks for the tip, I'll attach a sanitized version of the hprof here. The instance config is 512 MB of memory, and the JVM config is -Xms128m -Xmx256m. It is running in an amazoncorretto:11 container. I'll dig more into the mongo situation; I know it does make a lot of requests, but I have no hints that they are slow.

Task1-sanitized.hprof

Vinicius Araujo

04/24/2024, 6:53 PM

Hey @Andrew O'Hara, thanks for the tips. I'll try other engines to see if that behavior changes. Netty is the default, right? Which one do you choose in your projects?

Andrew O'Hara

04/24/2024, 7:06 PM

I personally like to use Jetty

James Richardson

04/25/2024, 11:16 AM

It's a bit of a mystery. You say -Xmx is 256m, but there is only 21MB of heap objects. There are very few threads, and quite a few of those threads are doing mongo things. My immediate thought is that, for whatever reason, there is no free worker to handle a health endpoint call, so the server is killed, and it has nothing to do with memory. However, this doesn't match "70% capacity"...
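The "no free worker" theory can be illustrated with a small, self-contained sketch. It is not Ktor or Netty code and does not model their actual threading; it only assumes a bounded pool of workers, with a latch standing in for slow blocking calls (for example, database requests). The class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch: when every worker of a bounded pool is blocked, a trivial health probe times out. */
final class HealthCheckStarvation {

    /**
     * Blocks `blockedWorkers` threads of a fixed pool of `poolSize` threads,
     * then submits a trivial "health" task to the same pool.
     * Returns true only if the health task answers within `timeoutMillis`.
     */
    static boolean healthAnswersInTime(int poolSize, int blockedWorkers, long timeoutMillis) {
        ExecutorService workers = Executors.newFixedThreadPool(poolSize);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < blockedWorkers; i++) {
                workers.submit(() -> {
                    try {
                        release.await(); // stand-in for a slow blocking call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The health task queues behind the blocked tasks if no thread is free.
            Future<String> health = workers.submit(() -> "OK");
            health.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the probe would be reported as a failed health check
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        } finally {
            release.countDown();   // unblock the stand-in tasks
            workers.shutdownNow(); // and stop the pool
        }
    }

    public static void main(String[] args) {
        System.out.println("one worker free:     " + healthAnswersInTime(4, 3, 500));
        System.out.println("all workers blocked: " + healthAnswersInTime(4, 4, 500));
    }
}
```

In the second call the process is idle on CPU and has plenty of free heap, yet the probe still times out, which is consistent with a container being replaced without any exception in the logs.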

Cies

04/25/2024, 12:26 PM

@James Richardson we do

-XX:+UseContainerSupport -XX:MaxRAMPercentage=80.0 -XX:ActiveProcessorCount=1

where our instances have only one (v)CPU. 80% RAM leaves some room for the OS and OS services. UseContainerSupport is a bit magical; try reading up on it. We sometimes increase the RAM on our instances, and this set of flags makes the JVM scale along nicely.

James Richardson

04/25/2024, 12:36 PM

Yeah - agreed (I'm not the OP), but now UseContainerSupport is enabled by default. I tend to use

"-XX:+AlwaysActAsServerClassMachine",
"-XX:InitialRAMPercentage=60.0",
"-XX:MinRAMPercentage=60.0",
"-XX:MaxRAMPercentage=70.0",
"-XX:+ExitOnOutOfMemoryError",

The active processor count is a good suggestion, but I think it might not be required on 11.0.16+ (OP is on 11.0.23), as the JVM can pick up the CPU share from cgroups. (Not 100% sure on that.)
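One way to check what a given set of flags actually produces is to print the limits from inside the container. The sketch below uses only `java.lang.Runtime`; the class name and the helper `heapForPercentage` are hypothetical, and the percentage arithmetic is only an approximation of how the JVM sizes the heap from `MaxRAMPercentage`. The 512 figure is the task memory mentioned earlier in the thread, where `-Xmx256m` corresponds to 50%.

```java
/** Sketch: report the limits the running JVM picked up, plus rough percentage arithmetic. */
final class JvmLimits {

    /** Maximum heap the JVM will attempt to use, in MiB (may be slightly below -Xmx). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    /** Approximate heap size in MiB implied by a percentage of the container memory. */
    static long heapForPercentage(long containerMiB, double maxRamPercentage) {
        return Math.round(containerMiB * maxRamPercentage / 100.0);
    }

    public static void main(String[] args) {
        // Values seen by this JVM; inside a container they reflect the detected limits.
        System.out.println("max heap (MiB):       " + maxHeapMiB());
        System.out.println("available processors: " + Runtime.getRuntime().availableProcessors());

        // Rough targets for a 512 MiB task under the two percentages quoted above.
        System.out.println("70% of 512 MiB: " + heapForPercentage(512, 70.0));
        System.out.println("80% of 512 MiB: " + heapForPercentage(512, 80.0));
    }
}
```

Running this once with the current flags and once with the percentage-based flags shows whether the heap ceiling and CPU count match expectations. Whatever the heap is not given remains available for metaspace, thread stacks and off-heap buffers.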

🙏 1

Vinicius Araujo

04/25/2024, 12:57 PM

@James Richardson just to clarify, the heap dump was taken in the middle of the container's life; maybe that's why it doesn't show the full heap being used (if I take too long to do it, I may lose the connection in the middle of the dump). I'm attaching a snapshot of the ECS memory monitoring to illustrate the pattern. What bugs me most is that it doesn't even reach the upper end of the memory capacity. I have also frequently monitored the heap with jhsdb jmap, and GC seems to work normally.

Vinicius Araujo

04/25/2024, 1:19 PM

This is a snapshot I took minutes before an outage. It looks like the problem is not related to the heap, since there is a lot of free space in all generations.

JVM version is 11.0.23+9-LTS

using thread-local object allocation.
Mark Sweep Compact GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 268435456 (256.0MB)
   NewSize                  = 44695552 (42.625MB)
   MaxNewSize               = 89456640 (85.3125MB)
   OldSize                  = 89522176 (85.375MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 40435712 (38.5625MB)
   used     = 6246584 (5.957206726074219MB)
   free     = 34189128 (32.60529327392578MB)
   15.448185999544165% used
Eden Space:
   capacity = 35979264 (34.3125MB)
   used     = 5063816 (4.829231262207031MB)
   free     = 30915448 (29.48326873779297MB)
   14.074262330658014% used
From Space:
   capacity = 4456448 (4.25MB)
   used     = 1182768 (1.1279754638671875MB)
   free     = 3273680 (3.1220245361328125MB)
   26.540599149816178% used
To Space:
   capacity = 4456448 (4.25MB)
   used     = 0 (0.0MB)
   free     = 4456448 (4.25MB)
   0.0% used
tenured generation:
   capacity = 89522176 (85.375MB)
   used     = 26105088 (24.895751953125MB)
   free     = 63417088 (60.479248046875MB)
   29.160470808931187% used
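The snapshot above shows roughly 124 MB of heap capacity (38.56 MB new generation plus 85.38 MB tenured), most of it free. If the container's memory graph really keeps climbing, the growth would have to sit outside the heap: metaspace, thread stacks, direct buffers or other native allocations. A minimal sketch for logging that breakdown from inside the process, using only the standard `java.lang.management` beans (the class name is hypothetical):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

/** Sketch: compare heap usage with the non-heap memory the JVM itself accounts for. */
final class MemoryBreakdown {

    /** Builds one report line each for heap, non-heap, every buffer pool, and the thread count. */
    static List<String> report() {
        List<String> lines = new ArrayList<>();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

        MemoryUsage heap = memory.getHeapMemoryUsage();
        lines.add("heap used/committed (bytes): " + heap.getUsed() + "/" + heap.getCommitted());

        // Non-heap covers metaspace, the code cache and similar JVM-managed areas.
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();
        lines.add("non-heap used/committed (bytes): " + nonHeap.getUsed() + "/" + nonHeap.getCommitted());

        // Buffer pools cover NIO direct and mapped buffers allocated through the JDK.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            lines.add("buffer pool '" + pool.getName() + "': count=" + pool.getCount()
                    + ", used bytes=" + pool.getMemoryUsed());
        }

        // Every live thread also reserves stack memory outside the heap.
        lines.add("live threads: " + ManagementFactory.getThreadMXBean().getThreadCount());
        return lines;
    }

    public static void main(String[] args) {
        report().forEach(System.out::println);
    }
}
```

Logging these lines periodically and comparing them with the ECS graph would show which area, if any, grows over time. This is only a partial view: native allocations made outside these beans are not counted, and Netty may, depending on its configuration, allocate off-heap memory that does not appear in the buffer pools. Starting the JVM with `-XX:NativeMemoryTracking=summary` and running `jcmd <pid> VM.native_memory summary` gives a fuller picture of JVM-internal native memory.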

James Richardson

04/25/2024, 1:21 PM

Yes - I think it's something to do with the health check not responding, and therefore the container gets killed.

Vinicius Araujo

04/25/2024, 1:32 PM

I can't see why Ktor would stop responding periodically after some amount of time; the way I see it, the app doesn't even have time to log an exception. Anyway, I will increase the memory capacity to check whether it's related to memory at all, and apply the given recommendations. Thank you guys for the useful insights.

👍 1

Illustrator

04/26/2024, 9:33 AM

I guess you will need to have a systemd file and configure it to restart on failure.

Fred Friis

04/30/2024, 6:47 PM

look for stuff like runCatching in the source code, it's very commonly abused and my understanding is it can cause issues like these

👍 1

Vinicius Araujo

05/02/2024, 5:10 PM

Hey guys I tried increasing the memory to 1G and the behavior continued, this time it took more than a week to restart and failed at 50%. That implies it is really a memory issue, right?

Andrew O'Hara

05/02/2024, 6:30 PM

I guess so 🤷 . But curious that it fails only around 50%. Question though: is it really all that bad the app dies after a week? It's very good practice to have multiple containers running for redundancy, so a dead container shouldn't cause any issues until ECS replaces it. In fact, if you add autoscaling, you'll probably have containers going up and down all the time. For one of my own services that autoscales regularly, the oldest container is only 18 hours old.

Vinicius Araujo

05/02/2024, 6:38 PM

Yes, most of the time we deploy new versions in less than a week, which replaces the container, so it shouldn't be a problem. What metrics do you use to autoscale a JVM-based app?

Andrew O'Hara

05/02/2024, 6:39 PM

It can be difficult to find the best metric to scale on, but CPU is often quite reliable.

Andrew O'Hara

05/02/2024, 6:42 PM

It appears your memory usage is quite consistent, so it would probably not be good to scale on. But if your metrics change over time, then you can certainly save $$$ by auto-scaling

🙏 1
