# github-workflows-kt
p
we're sorry for any instability of the bindings server in recent days - we're facing some issues with the hosting infra. If you see any further issues, please let us know (preferably in this thread)! CC @LeoColman
v
It failed, for example:
• 2025-01-06 18:07 CET
• 2025-01-06 18:32 CET
• 2025-01-07 23:32-23:41 CET (13 times)
• 2025-01-08 02:36 CET
🙏 1
p
Was it always like "service unavailable", or some other failures?
v
I think it was always "the server refused to answer"
👌 1
l
Yes, I'm sorry guys, I had a complete meltdown on the server and am still getting things back to a good enough running point
👍 1
Under disaster recovery for 50+ containers since the 6th
v
Ouch, good luck
l
It's up and running for a few days. No downtime is expected, but a little bit of instability might still come around. Thanks for your patience 🙂
🙌 1
v
What do you mean by "for a few days"? That it has been running for a few days, or that it will be running for the next few days? The former would be wrong, as it failed today, yesterday, and the day before yesterday. If the latter, what happens after that?
As the downloaded bindings are not cached, every workflow run that does the YAML validation (usually all of them) fails when the server is not working. Maybe we need to enhance the generated YAML validation so that the GHA cache is used or something like that. 😕
p
> Maybe we need to enhance the generated YAML validation so that the GHA cache is used or something like that. 😕
I also thought about a fallback mode where, if the remote service is down, a local instance is started and used. But caching can be good enough for existing, non-modified workflows
v
The problem again, though, is with the generated bindings changing over time: a new input is added; locally, the Maven local repo is deleted, the new version is downloaded, and the new input is used; GHA uses the cached version without the new input => compilation fails.
p
yeah, there are the edge cases, but I thought of the two proposed solutions as an imperfect mitigation that would work most of the time
v
I'm not so sure. Added inputs are not that uncommon.
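For reference, a rough sketch of what such a cache-based mitigation could look like as an extra step in the consistency-check job, assuming the script's dependencies (including the generated bindings) end up in the runner's local Maven repository; the step name, path, and key are illustrative, not something the library currently generates. Keying the cache on the workflow scripts would at least invalidate it whenever a script changes, which covers the "new input is used" case, though not a binding that changes behind an unchanged script:

```yaml
# Hypothetical step, added before the .main.kts script is executed.
- name: Cache workflow script dependencies
  uses: actions/cache@v4
  with:
    # Assumed location where the script's @file:DependsOn artifacts are resolved to.
    path: ~/.m2/repository
    # Hashing the scripts invalidates the cache whenever a workflow script changes.
    key: workflow-bindings-${{ runner.os }}-${{ hashFiles('.github/workflows/*.main.kts') }}
    restore-keys: |
      workflow-bindings-${{ runner.os }}-
```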
l
> What do you mean by "for a few days"?
I restarted it on day 6 (the disaster was day 5 -> 6). On day 7, backups filled the disk (a newer and quicker-to-fix disaster). And then there is today, where I'm touching it a lot to get it to 100%. So what I meant is that it has been running since the disaster, but still facing a bit of instability because the server is horribly unstable right now.
👌 1
I expect it to be 100% ASAP. I also have a personal stack in there that I'm suffering without 😂
v
l
Now this is odd, it's currently up and serving requests
image.png
v
Well, what can I say, this was minutes ago 😕
l
@Piotr Krzemiński Any ideas?
v
I restarted again, and right now it is still failing in the same way
Ah, now it succeeded after the 3rd try
l
Ok, I deleted some hanging containers, maybe they were stealing the proxy
👌 1
Please keep letting me know if there's anything weird still happening
👍 1
And I'm incredibly sorry for all of this
v
Well, happens, thanks for the quick action
Still failing
l
it just failed now?
Hmm..
Is it timing out? Host not found?
I'm wondering if putting my stacks up is making it restart the DNS and thus cutting the connection mid-pipeline
Let me stop doing that for a couple of hours and see
v
No, still the same
```
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - I/O exception (org.jetbrains.kotlin.org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://bindings.krzeminski.it:443: The target server failed to respond
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - Retrying request to {s}->https://bindings.krzeminski.it:443
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - I/O exception (org.jetbrains.kotlin.org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://bindings.krzeminski.it:443: The target server failed to respond
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - Retrying request to {s}->https://bindings.krzeminski.it:443
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - I/O exception (org.jetbrains.kotlin.org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://bindings.krzeminski.it:443: The target server failed to respond
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - Retrying request to {s}->https://bindings.krzeminski.it:443
> Task :preprocessBranchesAndPrsWorkflow FAILED
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - I/O exception (org.jetbrains.kotlin.org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://bindings.krzeminski.it:443: The target server failed to respond
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - Retrying request to {s}->https://bindings.krzeminski.it:443
[Incubating] Problems report is available at: file:///D:/a/spock/spock/build/reports/problems/problems-report.html
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - I/O exception (org.jetbrains.kotlin.org.apache.http.NoHttpResponseException) caught when processing request to {s}->https://bindings.krzeminski.it:443: The target server failed to respond
[main] INFO org.jetbrains.kotlin.org.apache.http.impl.execchain.RetryExec - Retrying request to {s}->https://bindings.krzeminski.it:443
.github\workflows\branches-and-prs.main.kts:22:1: warning: file 'codecov:codecov-action:v5' not found
.github\workflows\branches-and-prs.main.kts:22:1: warning: artifactResolutionException: Could not transfer artifact codecov:codecov-action:pom:v5 from/to https___bindings.krzeminski.it_ (https://bindings.krzeminski.it/): transfer failed for https://bindings.krzeminski.it/codecov/codecov-action/v5/codecov-action-v5.pom
```
I mean, that's from when it failed. Right now the restart seems to maybe have helped again.
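For context on why this fails the build: the workflow script pulls its action bindings from the bindings server at script-compilation time, so a fresh runner with an empty local Maven repository has to reach https://bindings.krzeminski.it before the YAML can be regenerated. A minimal script header along these lines (the library version and the checkout binding are illustrative; the codecov dependency matches the warning in the log):

```kotlin
#!/usr/bin/env kotlin

// The library itself is resolved from Maven Central.
@file:Repository("https://repo.maven.apache.org/maven2/")
@file:DependsOn("io.github.typesafegithub:github-workflows-kt:3.0.0") // version illustrative

// Action bindings are resolved from the bindings server - including the
// "codecov:codecov-action:pom:v5" artifact that failed to transfer above.
@file:Repository("https://bindings.krzeminski.it/")
@file:DependsOn("actions:checkout:v4")
@file:DependsOn("codecov:codecov-action:v5")
```

With the server refusing connections, resolving those artifacts fails after the retries shown in the log, the script never compiles, and the consistency check fails even for an unchanged workflow.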
l
I do see a lot of 404s here, which would correspond to the file-not-found errors around, but I do believe we handle these cases
v
The 404s you see are for the checksums, I think, which should be fine, unless you have a proxy in front that does not forward them but swallows them.
At least judging from the screenshot above.
But there it also requests the .md5 after the .sha1 404ed, so it should be fine there.
l
Still flaky? No containers are being cycled as far as I can see and DNS should be resolving correctly
v
No builds have run since I last wrote, so no idea right now.
p
@LeoColman do you have some metrics/logs from the load balancer (if it's not subject to the issues visible here)? Then we could see if there are any requests that the bindings server doesn't manage to respond to.
l
We don't have a load balancer, only a reverse proxy. It doesn't report any errors, but it does restart every time a new reverse-proxy entry appears. So, for example, a failing container would restart the reverse proxy once every minute. I believe I have that covered, as I don't have any more failing containers, so I need to know if the problem still happens.
👍 1
p
> We don't have a load balancer, only a reverse proxy
yeah, sorry - we have just one container, so there is nothing to balance the traffic between 🙂 I meant the reverse proxy: anything that exposes the endpoint to the world and is the first on our side to get the call from the net
l
I'm assuming we didn't have more problems?
v
Not from my side up to now
👍 1
p
for anyone else impacted, we've added a feature that can help you mitigate the problem in a limited way, see https://kotlinlang.slack.com/archives/C02UUATR7RC/p1736764191614309