thanks <@U0152P3VB5J> for noticing and reporting a...
# github-workflows-kt
p
thanks @Vampire for noticing and reporting a problem with Personal Access Tokens not being able to list versions for some actions (ref), and @LeoColman for driving a fix by adding a feature to authorize as a GitHub app where this problem doesn't exist 🎉 the power of community and open-source success baby
🩜 1
👌 1
v
Is something borked with the GitHub App? Errorcount increases and my workflows are failing with 500 for the maven-metadata.xml 😞 (cc @LeoColman)
For example
<https://bindings.krzeminski.it/Wandalen/wretry.action__main___major/maven-metadata.xml>
p
in the logs I see only:
Copy code
2025-04-01 04:30:17,005 <INFO > <eventLoopGroupProxy-4-5            > <[]> <{request-id=y585sbwpwgngn8t}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /Wandalen/wretry.action__main___major/maven-metadata.xml in 19799ms
no stack trace, very long latency...
ideas for next steps: ‱ check if the token is fine, e.g. isn't marked as invalidated in GitHub's UI or something ‱ try to reproduce locally
unfortunately I won't be able to take care of this today, so all hope is in Leo
@Vampire can you try useLocalBindingsServerAsFallback to unblock your workflows? (you'll have to regenerate the YAMLs)
Both private keys (what I called "token" previously) look healthy
v
I will not be able to regenerate, unless I use a local server with a working token. And the option wouldn't help anyway, unless I provide it with a token.
You should probably document that
Btw. with a locally started server with token it worked.
p
> You should probably document that yes, I'm planning to create a new page in the docs, devoted to the server, including what to do if it malfunctions
v
I meant documenting that you cannot use it if dynamic versions (anything else) are needed unless a token is provided that works with all used actions.
👍 1
Maybe the cached access token is expired somehow? Or maybe it has to do with summer-time switch?
But at 3:25 AM GMT+2 it still worked while at least since 4:41 AM GMT+2 it is broken
Btw. the client id should probably also be configured via env variable, that makes it easier to run an own instance with own GitHub App
👍 1
And probably the installation id whatever that is
👍 1
Copy code
2025-04-01 08:31:25,912 <TRACE> <eventLoopGroupProxy-4-8            > <[]> <{request-id=okucrz5scomwqmh}> <          io.ktor.client.plugins.HttpCallValidator> Processing exception java.io.EOFException: Failed to parse HTTP response: the server prematurely closed the connection for request <https://api.github.com/app/installations/62885502/access_tokens>
2025-04-01 08:31:25,913 <DEBUG> <eventLoopGroupProxy-4-8            > <[]> <{request-id=okucrz5scomwqmh}> <                        io.ktor.server.Application> Unhandled: GET - /Wandalen/wretry.action__main___major/maven-metadata.xml. Exception class java.io.EOFException: Failed to parse HTTP response: the server prematurely closed the connection
java.io.EOFException: Failed to parse HTTP response: the server prematurely closed the connection
	at io.ktor.client.engine.cio.UtilsKt$readResponse$2.invokeSuspend(utils.kt:172) ~[ktor-client-cio-jvm-3.1.1.jar:3.1.1]
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) ~[kotlin-stdlib-2.1.20.jar:2.1.20-release-217]
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:100) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
	at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:113) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
	at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:89) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
	at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:586) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:820) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:717) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:704) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
2025-04-01 08:31:25,923 <INFO > <eventLoopGroupProxy-4-8            > <[]> <{request-id=okucrz5scomwqmh}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /Wandalen/wretry.action__main___major/maven-metadata.xml in 20016ms
So it fails to get an access token after 20 seconds
p
interesting, why did it suddenly stop working...? đŸ€” maybe something wrong with egress (networking)?
but on the other hand, fetching actions' manifests works. Maybe it's a different endpoint
did you do anything special to get the stack trace, or did I miss it in the logs?
v
Look at the log level 🙂
By default the server is configured to log INFO+
p
oh... 💡
we should be able to dynamically change the log level
v
I did
v
Exactly
p
I just didn't think of it in the morning, TIL
v
Like I configured the logging, the config file is not in the jar but separate and watched for changes
👍 1
So I just opened the console to the docker container and exchanged INFO by ALL in the file
👍 1
Alternatively you can also set
LOG_LEVEL
env variable and restart the container, but with editing the file you do not have service interuption
And do not loose the state that might be part of the problem
p
but separate and watched for changes
thanks, valuable info!
v
Network-wise it seems to work, I just did the access token request from the docker container using curl and it worked
p
I meant networking on Leo's cluster, maybe DNS broke or something?
v
I did it from the actual bindings server docker container that is right now running the bindings server
p
ah! got it
v
It does work fine when I run the server locally with the app private key
đŸ€” 1
p
to be checked if it stopped working upon deployment, but thinking quickly, I think the timeline doesn't match
v
According to Grafana uptime is 5.75 days, so I'd say no
The docker container even says uptime 32 days 😕
p
the latest deployment was on Sunday: https://github.com/typesafegithub/github-workflows-kt/actions/runs/14151082606, so I think Grafana is lying
v
I think the docker container is lying with 32 days, but the container also says "last update" was "2025-03-26 164307" which matches what Grafana says
p
if so, then the question is why didn't the Sunday's deployment perform the actual image refresh
a few questions to answer here 🙂
v
definitely
v
Yeah, no, that is not the image used. It uses
sha256:a5b10be062341cb9bdbe2abbf3026145ca58fca5e09df4d29977978fef24459f
What is
TRIGGER_IMAGE_PULL
?
p
Maybe the web hook is gone, but then why doesn't the request fail? Maybe by default the exit code is 0...
v
Yeah, but what is the value? 🙂
👍 1
Not for public channel of course
p
I'm afraid GitHub doesn't let me view it, I'd have to hack around to get it in a workflow
It just lets me set a new value
v
Yeah, it's secret. 😄 And GitHub will also not print in but mask it if you try to print it. You would need to use tmate and then get it via SSH or something like that.
But anyway I don't see any changes since that image that should change anything, just minor library updates
👍 1
I'll restart the server now and see whether that at least fixes it for now, agreed?
p
sure
v
Yes, after restart it works again now, even without re-pull
p
đŸ€” very weird. Thanks for mitigating the problem!
maybe Leo will know something more about this probelm
So the
curl
per-se is successful, it just did not do what it should have done. I wonder why the manual run then actually triggered the service to be updated, unless Leo did it manually.
p
thanks for noticing the response đŸ€Šâ€â™‚ïž I didn't think of it
it hasn't been working for months
v
https://docs.portainer.io/user/docker/containers/webhooks says webhooks are only for portainer business edition. Maybe he had some evaluation period where it worked but is not anymore as it is the community edition?
💡 1
p
I'll add an assertion that the call was successful, and ask Leo to create the webhook
👌 1
v
Ah, it was probably this stack webhook:
Hm, no, that webhook gives a different message
Copy code
{
  "details": "Object not found inside the database",
  "message": "Unable to find the stack by webhook ID"
}
Ah, it probably updated last week because of Leo updating https://github.com/LeoColman/MyStack/commit/a7e847b151061ad96fce0ec690b3bbc4d473ad88
So we have: ‱ externalize client id ‱ externalize installation id ‱ better error logging if access token retrieval fails after 20 seconds (currently only a stacktrace on DEBUG level) ◩ Use https://hub4j.github.io/github-api/ for talking to GitHub API for increased reliability and built-in error-handling ‱ fail on failing auto-deployment ‱ fix auto-deployment ‱ hopefully find out why access token retrieval failed unless restarted and how to fix it (probably with GH API error handling there is more and important information i.e. the error message by GH) ‱ fall back to PAT if authenticating with the GitHub app fails + record a metric for such events ‱ anything else?
👍 1
p
I'd add a better error message logged together with the HTTP 500, right now it was just
500 Internal Server Error: GET - /Wandalen/wretry.action__main___major/maven-metadata.xml in 19799ms
and even without the stack trace, I'd expect more info
v
I also wonder that even with log level
ALL
there was not that much information, nothing about what ktor tried to do, only the
EOFException
at
TRACE
and the stacktrace at
DEBUG
. I would have expected some more logging.
But besides that, the missing details about the problem is probably the same as recently with the PAT-problem.
You try to parse the response as access token response but you get some error response that fails to parse and thus gives that exception
👍 1
Probably also the same can happen at the other two
body
calls where you get tagger from tag or author from commit
👍 1
p
it's this kind of code that is easy to write the happy path handling for, and easy to forget about the "what if" path 😄
v
Yeah, better to use a GitHub API wrapper that does proper error handling 😄 😛
p
hmm, I cannot recall if I didn't use the library because no decent one/official was available, or I thought I'll quickly code it on my own... now I see https://hub4j.github.io/github-api/, maybe there's something Kotlin-specific? If you have experience with a specific lib, let me know!
v
With the PAT-problem you said you just didn't want to use a big lib for "just one call", which obviously it is not. 😄
That's the one I typically used for talking to GitHub API too
p
I obviously missed the moment when the project grew sufficiently to justify adding a dependency on the lib 😛
which I'm partially happy about
could you edit your list of action items to add using the lib?
👌 1
v
Well, I usually use the
net.wooga.github
Gradle plugin which uses that API under-the-hood and exposes it for usage in custom tasks
👍 1
l
I'll read this with more details later today, but @Piotr KrzemiƄski briefed me of what is going on and I'm up-to-date 🙂
👍 1
👌 1
p
and we figured out one more action item: fall back to PAT if authenticating with the GitHub app fails + record a metric for such events
v
added above
👍 1
l
We fixed the webhook for deploy The previous one was offline, but a new docker image made the server fetch in 5 minutes and deploy the new image so deploy was working. Not it's working as intended
v
Where is that webhook defined?
l
Github Actions Secrets, I think. @Piotr KrzemiƄski has access
It's generated by portainer at the host
v
I meant the webhook side, not the GHA side
Where in portainer? I only found one on the stack that seems to only work with business edition
l
I only found one on the stack that seems to only work with business edition
That's the one I don't really know what it means with the toggleable features, but we tested and the deploy worked
p
to be precise: the webhook was called successfully, but I haven't actually checked if the request was honored by Portainer
đŸ€” 1
v
Really? Interesting. I copied that link and sent it a POST request but it said the hook was not found, but also with a different detail message than the workflow showed
Now the call to the webhook indeed is successful. What did you do to make the webhook work again?
l
Just moved from polling to webhook 😂
p
I got a new webhook URL from Leo and put it under the secret in GH
v
That is not alternative, is it? I understood that it polls and additionally you can trigger the webhook.
Alos, triggering the webhook does not do anything.
l
I understood that it polls and additionally you can trigger the webhook.
Hmm.. Maybe, I don't really know
to be precise: the webhook was called successfully, but I haven't actually checked if the request was honored by Portainer
It didn't honor. It's still on an old image
v
I just triggered the webhook "successfully", but the container is still the one I restarted at 12:00 CEST
Probably because the webhook just gets the compose file from your repo and the web ui says if there was a change it would apply it
But there was no change
To work as intended the re-pull image would need to work and that only works on business feature
l
So we'd need the repull from business edition anyway?
Damn. Let me investigate further, but I think you're right. I'll see if there's a workaround without having an explicit version
v
unless you invent something else somehow that triggers it
We could for example add a "shutdown" api to the server so that it quits and then portainer will hopefully start a new container, but it would probably still not re-pull
l
Could use watchdog
v
you would probably need to somehow make a pull from inside a container, but that does probably not work
l
Not watchdog 😂
v
You would probably need to make a commit to https://github.com/LeoColman/MyStack/blob/main/github-workflows-kt/docker-compose.yml that changes it with some UUID so that there is a change and then trigger the webhook
l
that changes it with some UUID
Could even be the version tag
v
We do not recommend using Watchtower in a commercial or production environment
😄
But yeah, that watchtower is basically what I meant. You map the docker socket from the host into a container, so that the container can then tell the docker daemon to pull the image.
l
It shouldn't be used in production because the alternative is better (using the right tool for the job), but our alternative is to implement it by hand, which I think is worse
đŸ€·â€â™‚ïž 1
Maybe not feasible on the short term, but do you know if moving from Portainer to K8s is something feasible? I don't have that hard love for Portainer, and this would eventually solve many problems on all my stacks hahah
v
I used neither so far, sorry
p
I personally have bad experience with k8s, maybe because I haven't had an opportunity to learn it properly. For me, it makes simple things overly complex. But if you know it and manage it, and the service works, I'm fine with it 😄
l
Watchtower enabled for our container only. Let's see how it goes. Updates should happen as soon as the docker image is created in dockerhub and the webhook is no longer necessary (probably worked only as a placebo anyways)
❀ 2
p
ok, I'll remove calling the webhook, thanks!
👌 1
l
time="2025-04-02T234321Z" level=warning msg="Could not do a head request for \"krzema12/github-workflows-kt-jit-binding-server:latest@sha256:4d302b1122bed8f58bcf56f4bbe6fa27517d9f1d653cb76feaf626c4439b06b8\", falling back to regular pull." container=/github-workflows-kt_github-workflows-kt.1.xojfjx7xhyho6xdwu6ikur6f0 image="krzema12/github-workflows-kt-jit-binding-server:latest@sha256:4d302b1122bed8f58bcf56f4bbe6fa27517d9f1d653cb76feaf626c4439b06b8"
time="2025-04-02T234321Z" level=warning msg="Reason: Parsed container image ref has no tag: docker.io/krzema12/github-workflows-kt-jit-binding-server@sha256:4d302b1122bed8f58bcf56f4bbe6fa27517d9f1d653cb76feaf626c4439b06b8" container=/github-workflows-kt_github-workflows-kt.1.xojfjx7xhyho6xdwu6ikur6f0 image="krzema12/github-workflows-kt-jit-binding-server:latest@sha256:4d302b1122bed8f58bcf56f4bbe6fa27517d9f1d653cb76feaf626c4439b06b8"
time="2025-04-02T234322Z" level=info msg="Session done" Failed=0 Scanned=1 Updated=0 notify=no
Some problems with watchtower
😭 2
p
@LeoColman getting failures when listing versions (e.g. https://bindings.krzeminski.it/actions/checkout/maven-metadata.xml) logs:
```java.lang.IllegalStateException: Missing environment variables for generating an auth token. There are two options:
1. Create a personal access token at https://github.com/settings/tokens.
The token needs to have public_repo scope. Then, set it in
GITHUB_TOKEN
env var.
With this approach, listing versions for some actions may not work.
2. Create a GitHub app, and generate a private key. Then, set it in
APP_PRIVATE_KEY
env var.
With this approach, listing versions for all actions works.```
APP_PRIVATE_KEY
is filled,
GITHUB_TOKEN
is empty - I think we should remove
GITHUB_TOKEN
entirely, let me try it
the same is happening
ok, works - I needed to point to the newest image manually in Portainer
v
Yeah, that's the portainer issue Leo mentioned above probably
p
the two env vars (for the client and installation ID) were absent, I had to re-add them đŸ€”
l
In the container itself? I think they were set for the environment
p
Yes, sorry, maybe - so now we may have duplicated env vars
l
Ok, it was actually problematic. I didn't tell the stack to get the env variables from the env
Your solution was the right one for a temp fix
👍 1
👌 1
v
As far as I can see most points of above are handled. Missing is ‱ migrating remaining calls to GitHub API lib ‱ adding a metric when falling back to token, currently only a warning is logged 🙂
👍 1
p
yep, both tracked in the issues
👌 1
v
Something is going on again
Copy code
2025-04-11 07:36:37,166 <TRACE> <eventLoopGroupProxy-4-6            > <[]> <{request-id=whdh0up66n95zs5}> <typesafegithub.workflows.shared.internal.GithubApi> REQUEST <https://api.github.com/repos/actions/cache/git/matching-refs/tags/v> failed with exception: java.io.EOFException: Not enough data available
2025-04-11 07:36:37,166 <TRACE> <eventLoopGroupProxy-4-6            > <[]> <{request-id=whdh0up66n95zs5}> <          io.ktor.client.plugins.HttpCallValidator> Processing exception java.io.EOFException: Not enough data available for request <https://api.github.com/repos/actions/cache/git/matching-refs/tags/v>
2025-04-11 07:36:37,167 <DEBUG> <eventLoopGroupProxy-4-6            > <[]> <{request-id=whdh0up66n95zs5}> <                        io.ktor.server.Application> Unhandled: GET - /actions/cache__restore___major/maven-metadata.xml. Exception class java.io.EOFException: Not enough data available
java.io.EOFException: Not enough data available
	at io.ktor.utils.io.ByteReadChannelOperationsKt.readByte(ByteReadChannelOperations.kt:48) ~[ktor-io-jvm-3.1.2.jar:3.1.2]
	at io.ktor.utils.io.ByteReadChannelOperationsKt$readByte$1.invokeSuspend(ByteReadChannelOperations.kt) ~[ktor-io-jvm-3.1.2.jar:3.1.2]
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) ~[kotlin-stdlib-2.1.20.jar:2.1.20-release-217]
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:100) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
	at kotlinx.coroutines.internal.LimitedDispatcher$Worker.run(LimitedDispatcher.kt:113) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
	at kotlinx.coroutines.scheduling.TaskImpl.run(Tasks.kt:89) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
	at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:586) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:820) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:717) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:704) ~[kotlinx-coroutines-core-jvm-1.10.1.jar:?]
Doing the same request with the same bearer token from my machine works without problems 😕
I restarted the container, now it'S works again
thank you color 1
p
ack, @LeoColman we should get the alerting up and running 😬
👍 1
this time the problem is different than previously when it stopped working, right?
v
No
The only difference is that we now have more logging
👍 1
Well, last incident we only got the 500 line so it could well be a different problem
p
I cannot investigate right now (crunch at work), let's see how it behaves after the restart
v
But it is really strange that the container cannot get the response while with the same bearer token I locally can get it
👍 1
Right now it works again
If it would be rate limiting, there should be an according response, not just nothing
p
and also if it was rate limiting, the restart wouldn't fix the issue because the rate depend on the number of incoming calls - we don't have e.g. some worker threads that could make the calls on their own
I'm wondering if it has something to do with the large number of "runnable" threads. Probably the thread pool just grows to a particular upper cap, and it's only around 100 so it doesn't seem bad, I just observed it
v
Could well be the problem though, yes
l
You guys can add yourselves to https://grafana.bindings.colman.com.br/alerting/notifications?search= When I create the alert, Grafana will send an e-mail to you when the conditions (>=1 error) are met
👍 1
p
added myself as well, but there's this warning next to our contact points - does it work in practice?
l
No, currently it's doing nothing. I wanted you to take a look at the platform before adding an alert
What we are missing is defining how many errors for an alert and I think we are good to create an alert
v
1? 🙂
👍 1
p
Yeah, any failure is worth investigating
BTW, I usually call an "error" a user error, so HTTP 4xx. Failures are server faults, so HTTP 5xx. It's good to agree our common terminology
l
Ok for me to email only on server failures i.e. 5xx
👍 1
Alerts for 5xx setup to our e-mails 🎉
👌 2
Now to test it.... đŸ€”
v
Of course 5xx only, 4xx does not make sense unless we abuse it. 😄
I would just not use error and failure, especially as if I would use it the other way around, in accordance with tests, were a failure is the softer thing that means a should have been 1 but was 2, while an error is an unexpected exception or something else unexpected. So best just be explicit and call it user error and server error or 4xx error and 5xx error, then there can hardly be misunderstandings. 🙂
👍 1