# server
Has anyone had memory problems with a Ktor server app? I have a Ktor Netty server deployed in an ECS Fargate instance, and I see that the memory usage keeps increasing almost linearly. When it reaches about 70% of capacity, the health check service starts receiving timeouts and replaces the instance, and the behavior starts again. I've taken a heap dump and it doesn't look like anything in the app is leaking memory. Maybe the quantity of requests exceeds the instance's capacity? Any help would be really appreciated.
I believe Netty uses a static thread pool, so the server and request rate are unlikely to be to blame for high memory usage. For contrast, I believe the Helidon server uses disposable virtual threads, and its presumably unbounded request rate has caused memory spikes for me. Some theories for your situation: • You might just have a memory leak in your application. Are you performing any in-memory caching, or using any persistent data structures? • I personally don't find Netty to be the most stable server. Have you tried a different server implementation? • It's odd that your app would die at 70% memory utilization. I'm not familiar with how Netty would react to too many requests, but perhaps you just need to scale out to handle the load.
🙏 1
I'd suggest being careful (i.e., not doing it) when sharing hprofs: you can give out more information than you might think. An hprof doesn't contain (that I know of) the Xmx settings, but the size of objects in the profile is small, only 20MB or so. There are a few threads that seem to be doing something with Mongo. Is it possible that you're waiting for a slow operation? What is the Fargate config? You're running on Java 11; what is the Java config?
@James Richardson thanks for the tip, I'll attach a sanitized version of the hprof here. The instance config is 512MB memory, and the JVM config is `-Xms128m -Xmx256m`. It is running on a container. I'll dig more into the Mongo situation; I know it does make a lot of requests, but I've got no hints that they are slow.
Hey @Andrew O'Hara, thanks for the tips. I'll try other engines to see if that behavior changes. Netty is the default, right? Which one do you choose in your projects?
I personally like to use Jetty
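For reference, switching engines in Ktor is usually a small change when using `embeddedServer`. A minimal sketch, assuming Ktor 2.x with the `ktor-server-jetty` artifact on the classpath in place of `ktor-server-netty` (the `/health` route is just illustrative):

```kotlin
// Sketch: running the same Ktor application on Jetty instead of Netty.
// Assumes the ktor-server-jetty dependency; module code itself is unchanged.
import io.ktor.server.application.call
import io.ktor.server.engine.embeddedServer
import io.ktor.server.jetty.Jetty
import io.ktor.server.response.respondText
import io.ktor.server.routing.get
import io.ktor.server.routing.routing

fun main() {
    embeddedServer(Jetty, port = 8080) {
        routing {
            // Example health endpoint of the kind ECS might probe.
            get("/health") { call.respondText("OK") }
        }
    }.start(wait = true)
}
```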
It's a bit of a mystery. You say -Xmx is 256m, but there are only 21MB of heap objects. There are very few threads, and quite a few of those threads are doing Mongo things. My immediate thought is that for whatever reason there is no free worker to handle a health endpoint call, so the server is killed, and it's nothing to do with memory. However, this doesn't match "70% capacity"...
@James Richardson we do
```
-XX:+UseContainerSupport -XX:MaxRAMPercentage=80.0 -XX:ActiveProcessorCount=1
```
where our instances have only one (v)CPU. 80% RAM leaves some room for the OS and OS services. UseContainerSupport is a bit magical; try reading up on it. We sometimes increase RAM on our instances, and this flag set makes the JVM scale along nicely.
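One quick way to sanity-check those flags is to print what the JVM actually sees inside the container. A minimal sketch (with `-XX:MaxRAMPercentage=80.0`, `maxMemory()` should come out near 80% of the container's memory limit, and `availableProcessors()` should reflect the `ActiveProcessorCount` / cgroup CPU share):

```kotlin
// Sketch: verifying the effective heap and CPU limits the JVM derived
// from the container config and the flags above.
fun main() {
    val rt = Runtime.getRuntime()
    println("max heap:   ${rt.maxMemory() / (1024 * 1024)} MB")
    println("processors: ${rt.availableProcessors()}")
}
```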
Yeah, agreed (I'm not the OP), but UseContainerSupport is enabled by default now. I tend to use:
The active processor count is a good suggestion, but I think it might not be required on 11.0.16+ (the OP is on 11.0.23), as the JVM can pick up the CPU share from cgroups. (Not 100% sure on that.)
🙏 1
@James Richardson just to clarify, the heap dump was taken in the middle of the container's life; maybe that's why it doesn't show the full heap being used (if I take too long to do it, I may lose the connection in the middle of the dumping process). I'm attaching a snapshot of the ECS memory monitoring to illustrate the patterns. What bugs me most is that it doesn't even reach higher memory capacity. I have also frequently monitored the heap with `jhsdb jmap`, and GC seems to work normally.
This is a snapshot I took minutes before an outage. It looks like the problem is not related to the heap, since there is a lot of free space in all generations:
```
JVM version is 11.0.23+9-LTS

using thread-local object allocation.
Mark Sweep Compact GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 268435456 (256.0MB)
   NewSize                  = 44695552 (42.625MB)
   MaxNewSize               = 89456640 (85.3125MB)
   OldSize                  = 89522176 (85.375MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 40435712 (38.5625MB)
   used     = 6246584 (5.957206726074219MB)
   free     = 34189128 (32.60529327392578MB)
   15.448185999544165% used
Eden Space:
   capacity = 35979264 (34.3125MB)
   used     = 5063816 (4.829231262207031MB)
   free     = 30915448 (29.48326873779297MB)
   14.074262330658014% used
From Space:
   capacity = 4456448 (4.25MB)
   used     = 1182768 (1.1279754638671875MB)
   free     = 3273680 (3.1220245361328125MB)
   26.540599149816178% used
To Space:
   capacity = 4456448 (4.25MB)
   used     = 0 (0.0MB)
   free     = 4456448 (4.25MB)
   0.0% used
tenured generation:
   capacity = 89522176 (85.375MB)
   used     = 26105088 (24.895751953125MB)
   free     = 63417088 (60.479248046875MB)
   29.160470808931187% used
```
Yes, I think it's something to do with the health check not responding, and therefore the container gets killed.
I can't see why Ktor would stop responding periodically after an amount of time; the way I see it, the app doesn't even have time to log any exception. Anyway, I will increase the memory capacity to check if it's related to memory at all, and apply the given recommendations. Thank you guys for the useful insights.
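On the "starved health check" theory: if every worker is blocked on a slow Mongo call, even a trivial `/health` route can time out. A minimal sketch of keeping blocking work off the request threads, assuming Ktor 2.x and kotlinx-coroutines on the classpath (`slowMongoQuery` is a hypothetical stand-in for the app's real data access):

```kotlin
// Sketch: keep the health route trivial and shift blocking work to
// Dispatchers.IO so request-handling threads stay free to answer probes.
import io.ktor.server.application.Application
import io.ktor.server.application.call
import io.ktor.server.response.respondText
import io.ktor.server.routing.get
import io.ktor.server.routing.routing
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

fun Application.module() {
    routing {
        get("/health") {
            // No blocking work here: respond immediately.
            call.respondText("OK")
        }
        get("/report") {
            val result = withContext(Dispatchers.IO) {
                // Hypothetical blocking call standing in for a slow Mongo query.
                slowMongoQuery()
            }
            call.respondText(result)
        }
    }
}

// Placeholder simulating a slow, blocking driver call.
fun slowMongoQuery(): String {
    Thread.sleep(500)
    return "done"
}
```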
👍 1
I guess you will need to have a systemd file and configure it to restart on failure.
Look for stuff like runCatching in the source code; it's very commonly abused, and my understanding is it can cause issues like these.
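For context on why runCatching gets flagged: it catches every `Throwable`, not just `Exception`, so JVM errors (and, in coroutine code, `CancellationException`) get silently boxed into a `Result` instead of propagating. A minimal stdlib-only sketch of the swallowing behavior:

```kotlin
// Sketch: runCatching swallows even an OutOfMemoryError, which would
// normally be allowed to crash the JVM so the container gets replaced.
fun main() {
    val result = runCatching {
        throw OutOfMemoryError("simulated")
    }
    // The error is captured into a failed Result rather than propagating.
    println(result.isFailure)                  // true
    println(result.exceptionOrNull()?.message) // simulated
}
```

If such errors are swallowed in request handlers, the process can limp along unresponsive instead of failing fast, which matches the "app has no time to log any exception" symptom only loosely, but it's cheap to audit.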
👍 1
Hey guys, I tried increasing the memory to 1GB and the behavior continued; this time it took more than a week to restart, and it failed at 50%. That implies it is really a memory issue, right?
I guess so 🤷 . But it's curious that it fails only around 50%. Question, though: is it really all that bad that the app dies after a week? It's very good practice to have multiple containers running for redundancy, so a dead container shouldn't cause any issues until ECS replaces it. In fact, if you add autoscaling, you'll probably have containers going up and down all the time. For one of my own services that autoscales regularly, the oldest container is only 18 hours old.
Yes, most of the time we deploy new versions in less than a week, which replaces the container, so it shouldn't be a problem. What metrics do you use to autoscale a JVM-based app?
It can be difficult to find the best metric to scale on, but CPU is often quite reliable.
It appears your memory usage is quite consistent, so it would probably not be a good metric to scale on. But if your metrics change over time, then you can certainly save $$$ by autoscaling.
🙏 1