# github-workflows-kt
p
we're sorry for any instability of the bindings server in recent days - we're facing some issues with the hosting infra. If you see any further issues, please let us know (preferably in this thread)! CC @LeoColman
v
It failed, for example:
• 2025-01-06 18:07 CET
• 2025-01-06 18:32 CET
• 2025-01-07 23:32-23:41 CET (13 times)
• 2025-01-08 02:36 CET
🙏 1
p
Was it always like "service unavailable", or some other failures?
v
I think it was always "the server refused to answer"
👌 1
l
Yes, I'm sorry guys, I had a complete meltdown on the server and am still getting things back to a good enough running point
👍 1
Under disaster recovery for 50+ containers since the 6th
v
Ouch, good luck
l
It's up and running for a few days. No downtime is expected, but a little bit of instability might still come around. Thanks for your patience 🙂
🙌 1
v
What do you mean by "for a few days"? That it has been running for a few days, or that it will be running for the next few days? The former would be wrong, as it failed today, yesterday, and the day before yesterday. If the latter, what happens after that?
As the downloaded bindings are not cached, every workflow run that does the YAML validation (usually all of them) fails when the server is not working. Maybe we need to enhance the generated YAML validation so that the GHA cache is used or something like that. 😕
p
> Maybe we need to enhance the generated YAML validation so that the GHA cache is used or something like that. 😕
I also thought about a fallback mode where, if the remote service is down, a local instance is started and used. But caching can be good enough for existing, non-modified workflows
v
The problem again, though, is with the generated bindings changing over time: a new input is added; locally, the Maven local repo is deleted, the new version is downloaded, and the new input is used; GHA uses the cached version without the new input => compilation fails.
p
yeah, there are the edge cases, but I thought of the two proposed solutions as an imperfect mitigation that would work most of the time
v
I'm not so sure. Added inputs are not that uncommon.
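For reference, a rough sketch of what such a cache-based mitigation could look like as an extra step in the consistency-check job, assuming the script's dependencies (including the generated bindings) end up in the runner's local Maven repository; the step name, path, and key are illustrative, not something the library currently generates. Keying the cache on the workflow scripts would at least invalidate it whenever a script changes, which covers the "new input is used" case, though not a binding that changes behind an unchanged script:

```yaml
# Hypothetical step, added before the .main.kts script is executed.
- name: Cache workflow script dependencies
  uses: actions/cache@v4
  with:
    # Assumed location where the script's @file:DependsOn artifacts are resolved to.
    path: ~/.m2/repository
    # Hashing the scripts invalidates the cache whenever a workflow script changes.
    key: workflow-bindings-${{ runner.os }}-${{ hashFiles('.github/workflows/*.main.kts') }}
    restore-keys: |
      workflow-bindings-${{ runner.os }}-
```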
l
> What do you mean by "for a few days"?
I restarted it on day 6 (the disaster was day 5 -> 6). On day 7, backups filled the disk (a newer and quicker-to-fix disaster). And then there is today, where I'm touching it a lot to get it to 100%. So what I meant is that it has been running since the disaster, but still facing a bit of instability because the server is horribly unstable right now.
👌 1
I expect it to be 100% ASAP. I also have a personal stack in there that I'm suffering without 😂
v
l
Now this is odd, it's currently up and serving requests
image.png
v
Well, what can I say, this was minutes ago 😕
l
@Piotr Krzemiński Any ideas?
v
I restarted again, and right now it is still failing in the same way
Ah, now it succeeded after the 3rd try
l
Ok, I deleted some hanging containers, maybe they were stealing the proxy
👌 1
Please keep letting me know if there's anything weird still happening
👍 1
And I'm incredibly sorry for all of this
v
Well, happens, thanks for the quick action
Still failing
l
it just failed now?
Hmm..
Is it timing out? Host not found?
I'm wondering if putting my stacks up is making it restart the DNS and thus cutting the connection mid-pipeline
Let me stop doing that for a couple of hours and see
v
No, still the same
```
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - I/O exception (org.jetbrains.kotlin.org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://bindings.krzeminski.it:443: The target server failed to respond
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - Retrying request to {s}->https://bindings.krzeminski.it:443
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - I/O exception (org.jetbrains.kotlin.org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://bindings.krzeminski.it:443: The target server failed to respond
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - Retrying request to {s}->https://bindings.krzeminski.it:443
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - I/O exception (org.jetbrains.kotlin.org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://bindings.krzeminski.it:443: The target server failed to respond
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - Retrying request to {s}->https://bindings.krzeminski.it:443
> Task :preprocessBranchesAndPrsWorkflow FAILED
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - I/O exception (org.jetbrains.kotlin.org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://bindings.krzeminski.it:443: The target server failed to respond
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - Retrying request to {s}->https://bindings.krzeminski.it:443
[Incubating] Problems report is available at: file:///D:/a/spock/spock/build/reports/problems/problems-report.html
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - I/O exception (org.jetbrains.kotlin.org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://bindings.krzeminski.it:443: The target server failed to respond
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - Retrying request to {s}->https://bindings.krzeminski.it:443
.github\workflows\branches-and-prs.main.kts:22:1: warning: file 'codecov:codecov-action:v5' not found
.github\workflows\branches-and-prs.main.kts:22:1: warning: artifactResolutionException: Could not transfer artifact codecov:codecov-action:pom:v5 from/to https___bindings.krzeminski.it_ (https://bindings.krzeminski.it/): transfer failed for https://bindings.krzeminski.it/codecov/codecov-action/v5/codecov-action-v5.pom
```
I mean, that's from when it failed. Right now the restart seems to maybe have helped again.
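For context on why this fails the build: the workflow script pulls its action bindings from the bindings server at script-compilation time, so a fresh runner with an empty local Maven repository has to reach https://bindings.krzeminski.it before the YAML can be regenerated. A minimal script header along these lines (the library version and the checkout binding are illustrative; the codecov dependency matches the warning in the log):

```kotlin
#!/usr/bin/env kotlin

// The library itself is resolved from Maven Central.
@file:Repository("https://repo.maven.apache.org/maven2/")
@file:DependsOn("io.github.typesafegithub:github-workflows-kt:3.0.0") // version illustrative

// Action bindings are resolved from the bindings server - including the
// "codecov:codecov-action:pom:v5" artifact that failed to transfer above.
@file:Repository("https://bindings.krzeminski.it/")
@file:DependsOn("actions:checkout:v4")
@file:DependsOn("codecov:codecov-action:v5")
```

With the server refusing connections, resolving those artifacts fails after the retries shown in the log, the script never compiles, and the consistency check fails even for an unchanged workflow.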
l
I do see a lot of 404s here, which would correspond to the file-not-found errors around, but I do believe we handle these cases
v
The 404s you see are for the checksums, I think, which should be fine, unless you have a proxy in front that does not forward them but swallows them.
At least judging from the screenshot above.
But there it also requests the .md5 after the .sha1 404ed, so it should be fine there.
l
Still flaky? No containers are being cycled as far as I can see and DNS should be resolving correctly
v
No builds have run since I last wrote, so no idea right now.
p
@LeoColman do you have some metrics/logs from the load balancer (if it's not subject to the issues visible here)? Then we could see if there are any requests that the bindings server doesn't manage to respond to.
l
We don't have a load balancer, only a reverse proxy. It doesn't report any errors, but it does restart every time a new reverse-proxy entry appears. So, for example, a failing container would restart the reverse proxy once every minute. I believe I have that covered, as I don't have any more failing containers, so I need to know if the problem still happens.
👍 1
p
> We don't have a load balancer, only a reverse proxy
yeah, sorry - we have just one container, so there is nothing to balance the traffic between 🙂 I meant the reverse proxy: anything that exposes the endpoint to the world and is the first on our side to get the call from the net
l
I'm assuming we didn't have more problems?
v
Not from my side up to now
👍 1
p
for anyone else impacted, we've added a feature that can help you mitigate the problem in a limited way, see https://kotlinlang.slack.com/archives/C02UUATR7RC/p1736764191614309