# server
a
I feel like the answer to this question is going to be "It depends on your specific circumstances", but I'll try to give context to avoid that. So, I often run into issues where my high-load servers just stop responding to health checks for no apparent reason, and then a vicious cycle occurs where the replacements are brought up just as fast as the next server dies. The only thing that breaks the cycle is to overprovision and then wait for autoscaling to go back down to the baseline. I can't help but feel like the Jetty (12) server is just running out of threads. Here are some metrics from an example service:

Request Characteristics
• 20k requests / minute
• 10 ms average response time
• 45 ms p99 response time
• RESTful API with a few different request archetypes:
  ◦ Roughly 95% of requests gather some info from another RESTful service, then push a job onto an in-memory queue for an executor to bulk insert into DynamoDB and ElasticSearch; they're very fast thanks to aggressive caching
  ◦ 4% of requests query an enormous DynamoDB index, with up to tens of thousands of sort keys per hash key; it's possible they could return a lot of data due to some poor design decisions
  ◦ 1% of requests query an enormous ElasticSearch cluster; I can't guarantee the requests are efficient, but there are safeguards to guarantee sane page sizes
• Health checks are absolutely minimal; the request handlers statically return 200, so if even those are timing out (which they are), then the server is in serious trouble

Hardware Characteristics
• Amazon ECS service
• Scaling on a 75% CPU target, typically with only 2 tasks at a time
• Allocated 0.5 vCPU with 50% typical utilization
• Allocated 2 GB RAM with 50% typical utilization

Are my containers just massively undersized despite the low utilization? Is it irresponsible to operate with a minimum container count of just 2? Do I need to tweak Jetty settings at this scale? Would I have better luck with a virtual thread server?
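For extra context, the hot path is shaped roughly like this (an illustrative sketch only, not the real handlers; `Job`, `jobQueue`, `lookupFromDownstream` and the batch sizes are all made-up names and numbers):

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.TimeUnit

data class Job(val id: String, val payload: String)

// Bounded so back-pressure shows up as rejected offers instead of silent memory growth.
val jobQueue = LinkedBlockingQueue<Job>(10_000)

// Stand-in for the aggressively cached call to the other RESTful service.
fun lookupFromDownstream(id: String): String = "cached-info-for-$id"

// The 95% archetype: cached downstream lookup, enqueue, return immediately.
fun handleWrite(id: String) {
    val info = lookupFromDownstream(id)
    jobQueue.offer(Job(id, info))
}

// A background executor drains the queue and bulk-writes each batch.
val bulkWriter = Executors.newSingleThreadScheduledExecutor().apply {
    scheduleWithFixedDelay({
        val batch = mutableListOf<Job>()
        jobQueue.drainTo(batch, 500)
        if (batch.isNotEmpty()) {
            println("bulk inserting ${batch.size} jobs into DynamoDB / ElasticSearch")
        }
    }, 1, 1, TimeUnit.SECONDS)
}
```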
d
Lots of questions: What type of services are we talking about? Typical microservices with a DB or some type of API gateway or proxy? Can we assume k8s or similar? What is your load profile per 24hrs? What are the readiness checks doing? You're right on one thing... It depends 😉 . But we will try to help 🙃
a
Hey Dave. I've updated my original question with the data you asked for. The 24 hr load profile is below so a graph can accompany it. Request patterns roughly follow a daily sinusoid; however, you can see we've had some instability lately.
[image: 24 hr load profile graph]
The requests to other RESTful services and ElasticSearch all share a single Java 11 HttpClient. I wonder if it's running out of threads, and so all the Jetty threads are stuck waiting for a client thread to become available 🤔.
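If it helps, the kind of shared Java 11 HttpClient I'm describing would be built roughly like this (an illustrative sketch; the executor size, thread naming, and timeout are my own choices, not what's actually deployed). Giving it a named, bounded executor at least makes its threads easy to spot in a thread dump:

```kotlin
import java.net.http.HttpClient
import java.time.Duration
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger

private val clientThreadCount = AtomicInteger()

// One client shared by the downstream REST calls and the ElasticSearch calls.
val sharedHttpClient: HttpClient = HttpClient.newBuilder()
    .executor(Executors.newFixedThreadPool(8) { runnable ->
        Thread(runnable, "outbound-http-${clientThreadCount.incrementAndGet()}")
            .apply { isDaemon = true }
    })
    .connectTimeout(Duration.ofSeconds(2))
    .build()
```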
d
You've got lots to play with here, but seems like you should instrument the outgoing calls and find out where the time is being spent? You can easily split into one HTTP client per destination and then reverse proxy into one port if you need to distinguish.

Switching jetty to loom or helidon server/client would be a low impact experiment for traffic, even behind a randomised feature flag so you can tweak it in live and compare.

Then you've got CPU bound: maybe parsing large JSON is the blocker here? Are you on Moshi or is there some Jackson here? On the face of it, the CPU should be good, but finding hotspots with a profiler might help.

Is auto scaling delay a problem? Min size of 2 is fine imho as long as you can scale quickly. How bursty is the traffic profile?

Once you've got timings, you could build a simple test harness with the chaos engine to add latency and throw traffic at it locally. Or just pull some levers and observe. 🙃
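For example, timing the outgoing calls per destination could look something like this (a rough sketch using an http4k Filter; `timedOutbound` and the println are just illustrative, and you'd wrap whichever client handler you actually use):

```kotlin
import org.http4k.core.Filter
import org.http4k.core.HttpHandler
import org.http4k.core.then

// Wraps an outgoing http4k client handler and logs how long each call takes.
fun timedOutbound(name: String, client: HttpHandler): HttpHandler =
    Filter { next ->
        { request ->
            val start = System.nanoTime()
            try {
                next(request)
            } finally {
                val tookMs = (System.nanoTime() - start) / 1_000_000
                println("outbound[$name] ${request.uri.host} took ${tookMs}ms")
            }
        }
    }.then(client)
```

Creating one wrapped client per destination would also give you the per-destination split mentioned above.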
d
It sounds like you are using Kubernetes. Is this application the only one that has issues with failing probes? Our k8s cluster seems to have trouble with probes in general, despite a static 200 response from ktor (netty underneath) and significantly lower volumes than what you describe. There are multiple open issues against Kubernetes in that area. Just bringing this up here as it might not even be something that can or should be fixed in the server configuration itself.
c
what is your jetty config?
I think the Java 11 HttpClient can't run out of threads because it's non-blocking.
a
When this happened to me on Elastic Beanstalk, it was because I stopped using AWS's Aurora as my Postgres DB and started using Digital Ocean's Postgres service. In Digital Ocean, I'd manually whitelist the AWS instance for access to the DB, but once autoscaling kicks in, the IP of the new EC2 box changes and, just like that, the new EC2 instance and the others are automatically rejected. At some point you get lucky and one of your old IPs pops back into service. I fixed that by making a static EC2 instance and running a Postgres instance there that listed my Elastic Beanstalk as an allowed resource. Sounds like you're on DynamoDB, so I don't think you have the same issue. Just thought I'd share in case it jiggles anything loose in your journey to debugging :)
j
Are you using ECS with envoy?
a
@dave
> but seems like you should instrument the outgoing calls and find out where the time is being spent?
I'm sure I should. That being said, it's not something I'm familiar with, so I'd probably have to find more dedicated time to look into it later.
> Switching jetty to loom or helidon server/client would be a low impact experiment for traffic
Yeah, wouldn't hurt to try. Might just take a while to determine whether it had a positive impact. No issue = no issue? Or no issue = dormant issue? 🤔
> Then you've got CPU bound
I don't know if I am. I think the CPU spikes are almost always cold starts from a deployment or death spiral.
> Is auto scaling delay a problem?
I don't think so... it seems to take about 10 seconds to start a container, and then another 10 seconds to register it with the target group. That seems pretty fast, especially compared to bare EC2. The issue with the death spiral is that when one server dies, the second one gets overwhelmed by the additional traffic soon after, and the autoscaler only ever adds a single instance at a time, followed by a delay. So perhaps what I mean to say is yes, it does seem to be a problem.
> How bursty is the traffic profile?
Overall very consistent. But the ElasticSearch queries are user-driven and bursty; I have no idea what kind of performance impact they have in the worst case scenario. Associated metrics below. Don't mind the spikes so much; those are either death spirals or the MySQL upgrade we did this morning, where our multi-AZ failover didn't seem to work as advertised.
@Dominik Sandjaja
> It sounds like you are using Kubernetes.
No, this is ECS, but I imagine the important bit is that it IS containerized.
@James Richardson
> Are you using ECS with envoy?
This is ECS, but there are no envoys or sidecars
@christophsturm
> I think the Java 11 HttpClient can't run out of threads because it's non-blocking.
You're probably right. However I believe it has a connection pool. I wonder if THAT can run out.
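If I'm reading the JDK docs right, the HTTP/1.1 keep-alive pool is tuned with system properties rather than the builder, and as far as I can tell the client opens a new connection when the pool is empty rather than blocking on it. Something like this, purely as an illustration (the values are made up, and they need to be set before the first client is built):

```kotlin
// JDK networking properties for the java.net.http client's connection reuse.
fun main() {
    System.setProperty("jdk.httpclient.connectionPoolSize", "64") // 0 (the default) means unlimited idle connections
    System.setProperty("jdk.httpclient.keepalive.timeout", "30")  // seconds an idle connection is kept alive
    // ...then build the shared HttpClient as usual.
}
```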
> what is your jetty config?
I'm using the http4k wrapper around Jetty 12, so I don't know if I can accurately say. But it appears to me the only thing http4k modifies from the standard config is to add a handler and tweak the stop mode.
d
Yes - if you need something custom, then we'd encourage you to take the raw Jetty server config as a template and tweak accordingly.
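For illustration only, the raw shape is roughly this; it's not http4k's actual template, and the pool sizes and thread name are made-up values to show where the knobs are:

```kotlin
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.server.ServerConnector
import org.eclipse.jetty.util.thread.QueuedThreadPool

// A bare Jetty 12 server with an explicitly sized thread pool.
fun customJettyServer(listenPort: Int): Server {
    val threadPool = QueuedThreadPool(200, 8).apply { name = "jetty-worker" }
    val server = Server(threadPool)
    val connector = ServerConnector(server)
    connector.port = listenPort
    server.addConnector(connector)
    // The http4k Jetty backend would normally attach its Handler here before start().
    return server
}
```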
a
I'm not sure if and what I SHOULD modify in the Jetty config 😅 . But you're right, it absolutely can be overridden.
d
It is quite rare tbh. The last time I had to do it was for a raw API proxy server instead of a standard service
a
@Adrian Garcia
> Sounds like you're on DynamoDB, so I don't think you have the same issue
It sounds like you had a networking issue. But then that would result in ZERO performance rather than the degraded performance I get. Or do I misunderstand?
j
We had some random connectivity issues with envoy (aka App Mesh), and it was easier to disable it. One other thought is about health check settings: if the unhealthy count etc. is very low, it might not take much for the service to just be killed.
a
@James Richardson I couldn't get App Mesh to work at all 😅, so I do all my routing with target groups. That's an interesting angle to look at it from, even if it would only address the symptom. My settings are the defaults set by CloudFormation: an unhealthy node is one that fails two consecutive health checks at a 30-second interval. Seems pretty reasonable to me? But I could try to make it even more forgiving to reduce the likelihood of a death spiral. To be clear, it always appears to be the Target Group health check that fails, and not the container-level one (even though they both call the same endpoint).
j
You mention aggressive caching; is it possible that that causes the vicious cycle? That the replacement servers can’t handle the load because they are still busy filling the cache?
a
> is it possible that that causes the vicious cycle?
An interesting thought. It's a multi-layer cache, with a non-aggressive in-memory cache fed by an aggressive distributed Redis cache. It could be that filling the in-memory cache slows things down too much, but I have my doubts 🤔. To be clear, the cache is only filled one item at a time on a miss; it doesn't pre-emptively fill.
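Shape-wise it's roughly this (an illustrative sketch, not the real cache; `redisGet` and `loadFromSource` are stand-ins, and I've left out the write-back to Redis and any expiry):

```kotlin
import java.util.concurrent.ConcurrentHashMap

// In-memory layer in front of Redis; each key is filled individually on a miss.
class TwoLayerCache(
    private val redisGet: (String) -> String?,      // stand-in for the distributed Redis lookup
    private val loadFromSource: (String) -> String  // stand-in for the underlying data source
) {
    private val local = ConcurrentHashMap<String, String>()

    fun get(key: String): String =
        local.getOrPut(key) {
            redisGet(key) ?: loadFromSource(key)    // populate only on a miss; no pre-emptive fill
        }
}
```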
a
@Andrew O'Hara Yes, it was a networking issue, but performance was fine until the first load balancer event kicked in. Then I'd get zero performance.