# github-workflows-kt
v
@LeoColman I'm getting this multiple times today:
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - I/O exception (org.jetbrains.kotlin.org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://bindings.krzeminski.it:443: The target server failed to respond
when executing workflow scripts, both locally and on GitHub Actions. They still work, because in that situation the request is retried and the second or third attempt succeeds. But there still seems to be "some" issue.
l
Interesting. Did you do any investigation / find any logs?
This seems to be a reverse proxy issue, but it wouldn't justify happening randomly
v
No, didn't investigate, just observed while doing something else
p
v
Well, "went away" is too optimistic 😄
p
should have written "didn't see it anymore so far" 😄 @Vampire are you still seeing it?
v
Right now I'm not executing workflows, but yesterday I saw it quite often
p
no failures in the service metrics, so if it's "our fault", the problem is either in the way we record metrics or in the reverse proxy
v
Not necessarily. It could, for example, also be that the server only accepts and queues X requests, and if X+1 requests come in, the extra ones might not be answered, or something like that. đŸ€·â€â™‚ïž
l
Could be, and that could be in the reverse proxy as well. Theoretically there's enough bandwidth and processing power to not hang any connection. I doubt it's a server-wide issue, or we'd see more failures. My first guess is a configuration problem on the reverse proxy. Not very easy to reproduce, but we can look at the Caddy logs to see whether they give us any hints on whether that's the problem
v
The "problem" is, that the you don't know how many failures are there, as for that error condition the resolution engine just retries and in the end finished successfully. You just see it manually if you look at the output.
p
I'm wondering what exactly "The target server failed to respond" means - did Kotlin get any response, or did it time out while waiting for one (then I'd expect a timeout-related message)?
ah, "NoHttpResponseException" - now it's clear
v
I just started a loop:
for i in {01..100}; do echo $i; rm -rf ~/.m2/repository/actions/checkout*; .github/workflows/build.main.kts; done
and it failed on the 5th try.
p
locally, that's good - easy to reproduce
feel free to create a ticket on GitHub
v
Yes, I had also seen it locally yesterday
Second failure on the 12th try
Actually that one was a different one, "Connection reset"
p
I'll be able to take a look tomorrow morning
an idea for investigation: if it reproduces with this kind of frequency, let's do a git bisect on the latest changes (last ~1-2 weeks)
v
I have no idea whether it happens with a local server. That loop was executing against the production server.
p
ah, sorry... you're right, my incorrect assumption
Here's an idea how to proceed with this issue, let me know what you think, if it makes sense or if you have a better one: https://github.com/typesafegithub/github-workflows-kt/issues/1975#issuecomment-2900119371
Another idea, maybe simpler: have an atomic counter that is incremented when receiving a request and decremented when sending a response, and expose the counter's value somehow, even through a separate endpoint or a metric
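(A minimal sketch of that counter idea, assuming a Ktor-on-Netty server like the bindings server; the port and the /debug/in-flight endpoint are made up for illustration:)
```kotlin
import io.ktor.server.application.*
import io.ktor.server.engine.*
import io.ktor.server.netty.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import java.util.concurrent.atomic.AtomicInteger

// Number of requests currently being processed: incremented when a call enters
// the pipeline, decremented when it leaves, whether it succeeded or failed.
private val inFlightRequests = AtomicInteger(0)

fun main() {
    embeddedServer(Netty, port = 8080) {
        intercept(ApplicationCallPipeline.Monitoring) {
            inFlightRequests.incrementAndGet()
            try {
                proceed()
            } finally {
                inFlightRequests.decrementAndGet()
            }
        }
        routing {
            // Hypothetical diagnostic endpoint; the real server could expose
            // the value as a metric instead.
            get("/debug/in-flight") {
                call.respondText(inFlightRequests.get().toString())
            }
        }
    }.start(wait = true)
}
```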
@LeoColman can I ask you to pull some logs from the reverse proxy?
ok, we have a concrete next step advised by the Ktor support: https://kotlinlang.slack.com/archives/C0A974TJ9/p1747988699947299?thread_ts=1747947460.471269&cid=C0A974TJ9 I think we need to tweak log4j2.xml to set the logging level for Netty to DEBUG
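(Side note: for local experiments the same effect can also be achieved programmatically instead of editing the file; a rough sketch, assuming log4j-core is on the classpath:)
```kotlin
import org.apache.logging.log4j.Level
import org.apache.logging.log4j.core.config.Configurator

// Equivalent of giving io.netty a DEBUG-level <Logger> entry in log4j2.xml,
// but applied at runtime, e.g. while reproducing the issue in the IDE.
fun enableNettyDebugLogs() {
    Configurator.setLevel("io.netty", Level.DEBUG)
}
```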
v
Should be possible without a re-deploy. Just change the file in the container; as long as the container does not restart, the edited file will be used. If you then hammer the server with requests again, you should hopefully be able to find the needed information. You might want to use pure requests to the server instead of calling the main.kts script, which will make the testing faster and maybe give you information like the request IDs. The "failing" requests will most probably not have one, but you could then look at the request ID of the request that worked before, and eventually of the requests that worked after, to find the relevant section of the logs where something interesting might be logged.
👍 1
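(Something along these lines could replace the main.kts loop for hammering the server with plain requests; a rough sketch, where the artifact path is only an example based on the URLs mentioned in this thread:)
```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Fire a burst of plain GET requests against one bindings artifact and report
// every attempt that fails, so failures can be correlated with the server logs.
fun main() {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
        .uri(URI.create("https://bindings.krzeminski.it/actions/setup-java/v4/setup-java-v4.pom"))
        .GET()
        .build()
    repeat(100) { attempt ->
        runCatching { client.send(request, HttpResponse.BodyHandlers.ofString()) }
            .onSuccess { println("$attempt -> HTTP ${it.statusCode()}") }
            .onFailure { println("$attempt -> FAILED: ${it::class.simpleName}: ${it.message}") }
    }
}
```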
p
sorry, it's taking longer than I'd like for me to get to this problem... if anyone is able, feel free to take a closer look, perhaps enable more verbose logging
I've checked the reverse proxy logs (thanks Leo); they're fairly hard to analyze, but I don't see anything useful there.
I enabled more verbose logging, but:
1. I'm not sure if the config change is correct - I occasionally see messages about reloading the config, but without info about what's wrong
2. Trying to reproduce the issue - no reproduction right now 😕
I see some suspicious logs via Grafana Logs, I think they're coming from Portainer, like:
‱ 2025-05-28T09:32:13.915810+02:00 ritalee dockerd[979]: time="2025-05-28T09:32:13.915473868+02:00" level=error msg="copy stream failed" error="reading from a closed fifo" stream=stdout
‱ 2025-05-28T09:29:58.698701+02:00 ritalee sshd[1721918]: Connection closed by authenticating user root 185.188.181.61 port 45810 [preauth]
‱ 2025-05-28T09:32:26.442243+02:00 ritalee sshd[1727114]: Connection closed by invalid user odoo 196.251.84.225 port 56226 [preauth]
they may be completely unrelated to the problem we're investigating, but still...
v
I occasionally see messages about reloading the config, but without info what's wrong
That should just mean you edited the file and it therefore got reloaded; what is the exact message?
Trying to reproduce the issue - no reproduction right now
At least yesterday it still reproduced during normal usage
p
after the last change I see this which looks healthy:
2025-05-28T07:35:47.548992769Z Log4j2-TF-5-Scheduled-3 INFO Configuration source at `/app/resources/log4j2.xml` was modified on `2025-05-28T07:35:46.654Z`, previous modification was on `2025-05-28T07:09:02.257Z`
2025-05-28T07:35:47.562640458Z Log4j2-TF-4-ConfigurationFileWatcher-4 WARN RollingFileAppender 'Rolling File Appender': The bufferSize is set to 8192 but bufferedIO is not true
2025-05-28T07:35:47.563870305Z Log4j2-TF-4-ConfigurationFileWatcher-4 INFO Starting configuration XmlConfiguration[location=/app/resources/log4j2.xml, lastModified=2025-05-28T07:35:46.654Z]...
2025-05-28T07:35:47.563968376Z Log4j2-TF-4-ConfigurationFileWatcher-4 INFO Start watching for changes to /app/resources/log4j2.xml every 5 seconds
2025-05-28T07:35:47.564278738Z Log4j2-TF-4-ConfigurationFileWatcher-4 INFO Configuration XmlConfiguration[location=/app/resources/log4j2.xml, lastModified=2025-05-28T07:35:46.654Z] started.
2025-05-28T07:35:47.565135842Z Log4j2-TF-4-ConfigurationFileWatcher-4 INFO Stopping configuration XmlConfiguration[location=/app/resources/log4j2.xml, lastModified=2025-05-28T07:09:02.257Z]...
2025-05-28T07:35:47.565367256Z Log4j2-TF-4-ConfigurationFileWatcher-4 INFO Configuration XmlConfiguration[location=/app/resources/log4j2.xml, lastModified=2025-05-28T07:09:02.257Z] stopped.
and my original surprise was caused by seeing it multiple times after I edited the file. Now I realize that it might be because of broken refreshing of logs in Portainer
👌 1
At least yesterday it still reproduced during normal usage
just observed the issue
👌 1
logging after the config change shows some Netty-specific logs, but on bindings.krzeminski.it I'm still not seeing anything related to Netty đŸ€”
v
broken refreshing of logs in Portainer
You might want to display the logs for one container, not the service. When showing the service logs, the logs of shut-down containers are also mixed in, which gives these strange interleaved logs that are hard to read.
p
I did so - selected a concrete, live container
v
Hm, ok, in that case I didn't see strange behavior yet. 😕
p
It would be cool if you could help by double-checking yourself that I'm not missing something - regarding why there are no Netty logs
v
I don't see any logs at all anymore besides metrics and status.
http https://bindings.krzeminski.it/refresh/actions/setup-java/v4/setup-java-v4.pom
should at least log "something".
p
yeah, the config change doesn't seem to bring us anything
v
Oh, it was just very much delayed - now the new log lines show up in Portainer too
p
can you paste an example of a new log line? I don't see anything
v
Nah, I was talking about the normal "... from ..." logs that also were not showing up in the Portainer web view
Even if I enable ALL logging on the root logger, no Netty logs appear
On restart of the server (in the IDE, with io.netty set to ALL), there are some Netty logs during ramp-up. Maybe something additional needs to be set to make Netty log, like with Ktor where you needed to add a plugin.
p
yes, I saw them, too. The difference in setup is that we're just hot-reloading the log4j config on the hosting, not performing the full restart. Maybe Netty somehow looks at the log level only during startup? I know it wouldn't make sense because it's the logging framework that controls which messages go to the appenders, but I'm just observing this difference
v
There is no difference
Neither local nor production shows any Netty logs on requests
Netty could indeed do an is...Enabled call at startup and persist that; it would just be pretty stupid to do so, as the logging config can change at runtime. But well, that is not the problem. Netty just seems not to log anything, or it needs to be enabled explicitly like for Ktor.
👍 1
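(The "enabled explicitly like for Ktor" part presumably refers to request logging only appearing once a plugin such as CallLogging is installed; a rough sketch, using the package name from recent Ktor versions - it is spelled io.ktor.server.plugins.callloging in Ktor 2.x:)
```kotlin
import io.ktor.server.application.*
import io.ktor.server.plugins.calllogging.*
import org.slf4j.event.Level

// Without this plugin Ktor logs essentially nothing per request,
// regardless of how the logger levels are configured.
fun Application.configureCallLogging() {
    install(CallLogging) {
        level = Level.DEBUG
    }
}
```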
Did either of you do something? Right now the server is dead
l
Yes, me again
👌 1
I'll stop trying to test on production and find a way to test locally. It's too many server kills
👍 1
p
I think setting up an identical stack for testing, reachable through your custom domain, is fairly simple
v
Now the whole domain is dead 🙈
l
The server is indeed fully down, I'm investigating
network jf6e68ul3luucqqqx7aag6xb9 exists - this error is showing up on the reverse proxy, and I don't know why it's an error
😭 1
We're back
👌 1
🎉 2
p
less than two nines 😛 <joke>
😂 1
I think folks barely noticed, otherwise I'd expect more than Björn's message on #C02UUATR7RC
v
image.png
sad panda 2
😛
p
but hey, we measure it at least! 😛
🩜 1
v
You should set up additional monitoring somewhere that detects when the whole server is down, or maybe even when Grafana is down, as Grafana would otherwise be the one sending the notifications.
👍 1
p
some time ago I tried https://betterstack.com/ (on a different service), but in the free edition I think it allows up to one call per 3 minutes. Still better than nothing