# github-workflows-kt
v
@LeoColman I'm getting this multiple times today:
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - I/O exception (org.jetbrains.kotlin.org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://bindings.krzeminski.it:443: The target server failed to respond
when executing workflow scripts, both locally and on GitHub Actions. They still work, because in that situation the request is retried and the second or third attempt succeeds. But there still seems to be "some" issue.
l
Interesting. Did you do any investigation / find any logs?
This seems to be a reverse proxy issue, but it wouldn't justify happening randomly
v
No, didn't investigate, just observed while doing something else
p
v
Well, "went away" is too optimistic 😄
p
should have written "didn't see it anymore so far" 😄 @Vampire are you still seeing it?
v
Right now I'm not executing workflows, but yesterday I saw it quite often
p
no failures in the service metrics, so if it's "our fault", the problem is either in the way we record metrics or in the reverse proxy
v
Not necessarily. It could, for example, also be that the server only accepts and queues X requests, and if X+1 requests come in, the extra ones might not be answered, or something like that. đŸ€·â€â™‚ïž
l
Could be, and that could be in the reverse proxy as well. Theoretically there's enough bandwidth and processing power to not hang any connection. I doubt it's a server-wide issue, or we'd see more failures. My first guess is a configuration problem on the reverse proxy. Not very easy to reproduce, but we can look at the Caddy logs to see whether they give us any hints on whether that's the problem
v
The "problem" is, that the you don't know how many failures are there, as for that error condition the resolution engine just retries and in the end finished successfully. You just see it manually if you look at the output.
p
I'm wondering what exactly "The target server failed to respond" means - did Kotlin get any response, or did it time out while waiting for one (then I'd expect a timeout-related message)?
ah, "NoHttpResponseException" - now it's clear
v
I just started a loop:
for i in {01..100}; do echo $i; rm -rf ~/.m2/repository/actions/checkout*; .github/workflows/build.main.kts; done
and it failed on the 5th try.
p
locally, that's good - easy to reproduce
feel free to create a ticket on GitHub
v
Yes, I had also seen it locally yesterday
Second failure on the 12th try
Actually that one was a different one, "Connection reset"
p
I'll be able to take a look tomorrow morning
an idea for investigation: if it reproduces with this kind of frequency, let's do a git bisect on the latest changes (last ~1-2 weeks)
v
I have no idea whether it happens with a local server. That loop was executing against the production server.
p
ah, sorry... you're right, my incorrect assumption
Here's an idea how to proceed with this issue, let me know what you think, if it makes sense or if you have a better one: https://github.com/typesafegithub/github-workflows-kt/issues/1975#issuecomment-2900119371
Another idea, maybe simpler: have an atomic counter that is incremented when receiving a request and decremented when sending a response, and expose the counter's value somehow, even through a separate endpoint or a metric
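(A minimal sketch of that counter idea, assuming a Ktor-on-Netty server like the bindings server; the port and the /debug/in-flight endpoint are made up for illustration:)
```kotlin
import io.ktor.server.application.*
import io.ktor.server.engine.*
import io.ktor.server.netty.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import java.util.concurrent.atomic.AtomicInteger

// Number of requests currently being processed: incremented when a call enters
// the pipeline, decremented when it leaves, whether it succeeded or failed.
private val inFlightRequests = AtomicInteger(0)

fun main() {
    embeddedServer(Netty, port = 8080) {
        intercept(ApplicationCallPipeline.Monitoring) {
            inFlightRequests.incrementAndGet()
            try {
                proceed()
            } finally {
                inFlightRequests.decrementAndGet()
            }
        }
        routing {
            // Hypothetical diagnostic endpoint; the real server could expose
            // the value as a metric instead.
            get("/debug/in-flight") {
                call.respondText(inFlightRequests.get().toString())
            }
        }
    }.start(wait = true)
}
```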
@LeoColman can I ask you to pull some logs from the reverse proxy?
ok, we have a concrete next step advised by the Ktor support: https://kotlinlang.slack.com/archives/C0A974TJ9/p1747988699947299?thread_ts=1747947460.471269&cid=C0A974TJ9 I think we need to tweak log4j2.xml to set the logging level for Netty to DEBUG
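(Side note: for local experiments the same effect can also be achieved programmatically instead of editing the file; a rough sketch, assuming log4j-core is on the classpath:)
```kotlin
import org.apache.logging.log4j.Level
import org.apache.logging.log4j.core.config.Configurator

// Equivalent of giving io.netty a DEBUG-level <Logger> entry in log4j2.xml,
// but applied at runtime, e.g. while reproducing the issue in the IDE.
fun enableNettyDebugLogs() {
    Configurator.setLevel("io.netty", Level.DEBUG)
}
```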
v
Should be possible without a re-deploy. Just change the file in the container; as long as the container does not restart, the edited file will be used. If you then hammer the server with requests again, you should hopefully be able to find the needed information. You might want to use pure requests to the server instead of calling the main.kts script, which will make the testing faster and maybe give you information like the request IDs. The "failing" requests will most probably not have one, but you could then look at the request ID of the request that worked before, and eventually of the requests that worked after, to find the relevant section of the logs where something interesting might be logged.
👍 1
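(Something along these lines could replace the main.kts loop for hammering the server with plain requests; a rough sketch, where the artifact path is only an example based on the URLs mentioned in this thread:)
```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Fire a burst of plain GET requests against one bindings artifact and report
// every attempt that fails, so failures can be correlated with the server logs.
fun main() {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
        .uri(URI.create("https://bindings.krzeminski.it/actions/setup-java/v4/setup-java-v4.pom"))
        .GET()
        .build()
    repeat(100) { attempt ->
        runCatching { client.send(request, HttpResponse.BodyHandlers.ofString()) }
            .onSuccess { println("$attempt -> HTTP ${it.statusCode()}") }
            .onFailure { println("$attempt -> FAILED: ${it::class.simpleName}: ${it.message}") }
    }
}
```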
p
sorry, it's taking longer than I'd like for me to get to this problem... if anyone is able, feel free to take a closer look, perhaps enable more verbose logging
I've checked the reverse proxy logs (thanks Leo); they're fairly hard to analyze, but I don't see anything useful there.
I enabled more verbose logging, but:
1. I'm not sure if the config change is correct - I occasionally see messages about reloading the config, but without info about what's wrong
2. Trying to reproduce the issue - no reproduction right now 😕
I see some suspicious logs via Grafana Logs, I think they're coming from Portainer, like:
‱ 2025-05-28T09:32:13.915810+02:00 ritalee dockerd[979]: time="2025-05-28T09:32:13.915473868+02:00" level=error msg="copy stream failed" error="reading from a closed fifo" stream=stdout
‱ 2025-05-28T09:29:58.698701+02:00 ritalee sshd[1721918]: Connection closed by authenticating user root 185.188.181.61 port 45810 [preauth]
‱ 2025-05-28T09:32:26.442243+02:00 ritalee sshd[1727114]: Connection closed by invalid user odoo 196.251.84.225 port 56226 [preauth]
they may be completely unrelated to the problem we're investigating, but still...
v
I occasionally see messages about reloading the config, but without info what's wrong
That should just mean you edited the file and it therefore got reloaded; what is the exact message?
Trying to reproduce the issue - no reproduction right now
At least yesterday it still reproduced during normal usage
p
after the last change I see this which looks healthy:
2025-05-28T07:35:47.548992769Z Log4j2-TF-5-Scheduled-3 INFO Configuration source at `/app/resources/log4j2.xml` was modified on `2025-05-28T07:35:46.654Z`, previous modification was on `2025-05-28T07:09:02.257Z`
2025-05-28T07:35:47.562640458Z Log4j2-TF-4-ConfigurationFileWatcher-4 WARN RollingFileAppender 'Rolling File Appender': The bufferSize is set to 8192 but bufferedIO is not true
2025-05-28T07:35:47.563870305Z Log4j2-TF-4-ConfigurationFileWatcher-4 INFO Starting configuration XmlConfiguration[location=/app/resources/log4j2.xml, lastModified=2025-05-28T07:35:46.654Z]...
2025-05-28T07:35:47.563968376Z Log4j2-TF-4-ConfigurationFileWatcher-4 INFO Start watching for changes to /app/resources/log4j2.xml every 5 seconds
2025-05-28T07:35:47.564278738Z Log4j2-TF-4-ConfigurationFileWatcher-4 INFO Configuration XmlConfiguration[location=/app/resources/log4j2.xml, lastModified=2025-05-28T07:35:46.654Z] started.
2025-05-28T07:35:47.565135842Z Log4j2-TF-4-ConfigurationFileWatcher-4 INFO Stopping configuration XmlConfiguration[location=/app/resources/log4j2.xml, lastModified=2025-05-28T07:09:02.257Z]...
2025-05-28T07:35:47.565367256Z Log4j2-TF-4-ConfigurationFileWatcher-4 INFO Configuration XmlConfiguration[location=/app/resources/log4j2.xml, lastModified=2025-05-28T07:09:02.257Z] stopped.
and my original surprise was caused by seeing it multiple times after I edited the file. Now I realize that it might be because of broken refreshing of logs in Portainer
👌 1
At least yesterday it still reproduced during normal usage
just observed the issue
👌 1
logging after the config change shows some Netty-specific logs, but on bindings.krzeminski.it I'm still not seeing anything related to Netty đŸ€”
v
broken refreshing of logs in Portainer
You might want to display the logs for one container, not the service. When showing the service logs, the logs of shut-down containers are also mixed in, which gives these strange interleaved logs that are hard to read.
p
I did so - selected a concrete, live container
v
Hm, ok, in that case I didn't see strange behavior yet. 😕
p
It would be cool if you could help by double-checking yourself that I'm not missing something - regarding why there are no Netty logs
v
I don't see any logs at all anymore besides metrics and status.
http https://bindings.krzeminski.it/refresh/actions/setup-java/v4/setup-java-v4.pom
should at least log "something".
p
yeah, the config change doesn't seem to bring us anything
v
Oh, it was just very much delayed - now the new log lines show up in Portainer too
p
can you paste an example of a new log line? I don't see anything
v
Nah, I was talking about the normal "... from ..." logs that also were not showing up in the Portainer web view
Even if I enable ALL logging on the root logger, no Netty logs appear
On restart of the server (in the IDE, with io.netty set to ALL), there are some Netty logs during ramp-up. Maybe something additional needs to be set to make Netty log, like with Ktor where you needed to add a plugin.
p
yes, I saw them, too. The difference in setup is that we're just hot-reloading the log4j config on the hosting, not performing the full restart. Maybe Netty somehow looks at the log level only during startup? I know it wouldn't make sense because it's the logging framework that controls which messages go to the appenders, but I'm just observing this difference
v
There is no difference
Neither local nor production shows any Netty logs on requests
Netty could indeed do an is...Enabled call at startup and persist that; it would just be pretty stupid to do so, as the logging config can change at runtime. But well, that is not the problem. Netty just seems not to log anything, or it needs to be enabled explicitly like for Ktor.
👍 1
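(The "enabled explicitly like for Ktor" part presumably refers to request logging only appearing once a plugin such as CallLogging is installed; a rough sketch, using the package name from recent Ktor versions - it is spelled io.ktor.server.plugins.callloging in Ktor 2.x:)
```kotlin
import io.ktor.server.application.*
import io.ktor.server.plugins.calllogging.*
import org.slf4j.event.Level

// Without this plugin Ktor logs essentially nothing per request,
// regardless of how the logger levels are configured.
fun Application.configureCallLogging() {
    install(CallLogging) {
        level = Level.DEBUG
    }
}
```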
Did either of you do something? Right now the server is dead
l
Yes, me again
👌 1
I'll stop trying to test on production and find a way to test locally. It's too many server kills
👍 1
p
I think setting up an identical stack for testing, reachable through your custom domain, is fairly simple
v
Now the whole domain is dead 🙈
l
The server is indeed fully down, I'm investigating
network jf6e68ul3luucqqqx7aag6xb9 exists - this error is showing up on the reverse proxy, and I don't know why it's an error
😭 1
We're back
👌 1
🎉 2
p
less than two nines 😛 <joke>
😂 1
I think folks barely noticed, otherwise I'd expect more than Björn's message on #C02UUATR7RC
v
image.png
sad panda 2
😛
p
but hey, we measure it at least! 😛
🩜 1
v
You should set up additional monitoring somewhere that detects when the whole server is down, or maybe even when Grafana is down, as Grafana would otherwise be the one sending the notifications.
👍 1
p
some time ago I tried https://betterstack.com/ (on a different service), but in the free edition I think it allows up to one call per 3 minutes. Still better than nothing