# github-workflows-kt
p
@LeoColman we've got some failures last night, and I'm struggling to find logs for it via Portainer. Do you have another way of accessing the logs? I tried SSHing into the container ("Container console"), but I don't see a log file anywhere, probably because the app logs only to stdout and nothing writes it to a file. As a quick idea for [Feature] Collect, Store, and make available Application Logs, we could modify the Log4j config to also write the logs to a file, with some upper limit; a RollingFileAppender should do the job
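A minimal sketch of what that could look like (assuming a standard Log4j2 XML config; file names, paths, and limits here are illustrative, not the actual setup):

```xml
<!-- Hypothetical log4j2.xml sketch: keep stdout logging and add a size-capped rolling file. -->
<Configuration>
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d %-5p %c - %m%n"/>
    </Console>
    <!-- Rolls over when the active file hits the size limit; old files are gzipped. -->
    <RollingFile name="File" fileName="/logs/server.log"
                 filePattern="/logs/server-%i.log.gz">
      <PatternLayout pattern="%d %-5p %c - %m%n"/>
      <Policies>
        <SizeBasedTriggeringPolicy size="10 MB"/>
      </Policies>
      <!-- Upper bound on how many rolled files are kept, so disk usage stays capped. -->
      <DefaultRolloverStrategy max="5"/>
    </RollingFile>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Console"/>
      <AppenderRef ref="File"/>
    </Root>
  </Loggers>
</Configuration>
```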
v
Just click on the right service and then there is a button "Service Logs": https://ritalee.colman.com.br/#!/1/docker/services/isqaiiw22kqob554dwtf3r807/logs
If it's not there, it is probably gone until file logging is enabled I guess :-(
👍 1
p
I know but I see only logs from like 2-3 hours ago, interleaved with some random logs from yesterday. Can you check if you can track down these failures in the logs?
v
Last updated at 2025-04-30 23:19:32
So even with file logging the logs would have been gone unless they were written to a persisted volume; just changing the logging config would not have been enough for this case
Ah, wait, it happened at 3 AM GMT, so after the update :-/
Yeah, the logs seem to be the last few messages of the old container, interleaved with the last 2 hours or so of the current container.
I fear the older ones are gone
p
ok, let me add the file appender
storing the files on a persistent volume is another issue I'm not tackling here
v
Maybe it would be better to have something like Logtail or similar
Maybe Grafana Loki is something similar
p
what do you mean? it's a separate tool, right? we also had https://dozzle.dev/ some time ago
I thought keeping the logs on the disk can always be handy if another tool fails, like now
so I mean it rather as a backup option
v
Btw. the links in Grafana e-mails are dead @LeoColman, they are pointing to localhost
🤦‍♂️ 1
p
I got an e-mail from Grafana today around 5 AM CET
l
• A fix for the auto-deploy
• Grafana Loki
Are both on my to-do list. I think Loki will help with the log situation. I also think we can persist it to disk to make sure, and log-rotate on the volume itself. I can work on that after fixing the auto deploy
👍 1
👌 1
v
> I got an e-mail from Grafana today around 5 AM CET
Yes, me too, but try to click the link in it 😉
p
ah, got it!
failures again - maybe let's deploy the Log4j2 config change since it won't hurt?
v
Hm, strange, I don't see the failures in the log, though the time frame should be covered
So yeah, do it
👍 1
p
ok, redeployed with file-based logs
👌 1
the failures are happening again, and I discovered that our log files are limited to just 1 KB each... a leftover from testing 😅 gonna set it to 10 MB each and redeploy
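For reference, assuming the Log4j2 setup sketched earlier, the per-file cap is the `size` attribute on the triggering policy (snippet illustrative, not the actual config):

```xml
<Policies>
  <!-- was effectively 1 KB from testing; bump the per-file cap to 10 MB -->
  <SizeBasedTriggeringPolicy size="10 MB"/>
</Policies>
```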
v
This should not have failed
Just reran it and it worked
p
ok, redeployed with fixed logging - hopefully we'll solve the mystery soon
👌 1
```
root@87dd34e499b2:/logs# grep -Ri "error" *
server.log:2025-05-02 12:15:51,962 <INFO > <eventLoopGroupProxy-4-8            > <[]> <{request-id=x6xj2ii2d658alm}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /github/codeql-action__analyze___major/maven-metadata.xml in 8760ms
server.log:2025-05-02 12:15:51,964 <INFO > <eventLoopGroupProxy-4-6            > <[]> <{request-id=4geip7b434ptze2}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /github/codeql-action__init___major/maven-metadata.xml in 8779ms
server.log:2025-05-02 12:15:53,071 <INFO > <eventLoopGroupProxy-4-7            > <[]> <{request-id=ypjk1l5coe6n9l2}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /github/codeql-action__init___major/maven-metadata.xml in 7ms
server.log:2025-05-02 12:15:53,154 <INFO > <eventLoopGroupProxy-4-7            > <[]> <{request-id=uvi9egn6qeks3zw}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /github/codeql-action__analyze___major/maven-metadata.xml in 2ms
server.log:2025-05-02 12:15:55,215 <INFO > <eventLoopGroupProxy-4-1            > <[]> <{request-id=d8ki05ivaabd9vi}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /github/codeql-action__init___major/maven-metadata.xml in 3ms
server.log:2025-05-02 12:15:55,263 <INFO > <eventLoopGroupProxy-4-1            > <[]> <{request-id=dpgdzynuxlegw27}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /github/codeql-action__analyze___major/maven-metadata.xml in 4ms
```
reproduces by going to https://bindings.krzeminski.it/github/codeql-action__analyze___major/maven-metadata.xml
and works locally o.O (with a PAT)
what's weird, I see no details in the log
questions:
• what kind of cached result makes the server fail? likely related to the fix to treat failed generation as failures, not 404s
• why aren't there more details, like the full stack trace? is it disabled? shall we set a more detailed logging level?
• why is the problem non-deterministic (the refresh helped)?
v
I think the 404 fix just demasked it. We store the `Result` in the cache, so exceptions during generation are cached too, as usually they should deterministically reappear anyway, hopefully. But maybe `null` should be cached, to have the 404 cached but not exceptions. So basically we should probably stop having `Result` in the cache, but only `Result#right`. If there are exceptions, just try again next time, and if it fails again, that's maybe fine and should be fixed anyway. That the details are missing is maybe still caused by not using the lib for the GH calls but ktor-client? You probably again have to raise the log level to `ALL` to get the cause displayed, until implementing the switch to the lib.
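A minimal Kotlin sketch of that caching idea (names and types are hypothetical, not the actual server code): cache successful results and explicit "not found" answers, but never cache exceptions, so a transient failure is retried on the next request.

```kotlin
import java.util.Optional
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch: Optional.empty() stands for a cached 404/"not found".
val metadataCache = ConcurrentHashMap<String, Optional<String>>()

fun fetchMetadataCached(coords: String, fetch: (String) -> String?): String? =
    metadataCache.getOrPut(coords) {
        // If fetch(...) throws, getOrPut stores nothing, so the next request
        // retries instead of replaying a cached exception.
        Optional.ofNullable(fetch(coords))
    }.orElse(null)
```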
p
> I think the 404 fix just demasked it.
yeah, I also think so. What you propose about storing only the result of a successful generation makes sense!
unfortunately, cannot say when I'm able to fix it (likely next week), but if you have a spare moment... 🙂
l
I might be able to do it this weekend. I think it's more important than the auto deploy
💙 1
Could you create a ticket for information clarity, please?
p
Here's the ticket: https://github.com/typesafegithub/github-workflows-kt/issues/1938, however I may be able to attack it this weekend as well, so if I were to choose, I'd prefer if you tried to fix the deployment, it's fairly tedious to redeploy manually... But of course, up to you!
👍 2
Actually it needs more tweaking, will take care of it today or tomorrow
Fix deployed in the image, the server needs to get the newest image - feel free to refresh, or I'll do it in the evening
l
I can redeploy
thank you color 1
v
Something is majorly broken right now. I get constantly failing workflows. Grafana 5 minutes ago showed an uptime of 1.5 minutes; now it shows 34.7 seconds and doesn't change. On Portainer, Rita Lee is shown as Down, and I also cannot drill into it, being told "Failed loading environment Unauthorized"
Hm, right now it seems to be working again and I can also open Rita again 😕
l
Yes, sorry, I should probably have warned before
It's me
v
Ah, ok
l
I'm trying to get an auto-updater going and I updated it way too many times.
👌 1
I'm impressed at how I try to do stuff at bad Sunday hours but I still manage to break someone's legs 😂
v
Well, my legs are always whirling, especially at night 😄
p
to avoid this downtime, ideally we'd make the deployment first spin up a new container, then swap which container the reverse proxy points to, and then ditch the old container. I think @LeoColman tried something like this a while ago, but with the current infra it turned out to be very tricky and we punted on it for now
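A hedged sketch of that idea, assuming the service runs as a Docker Swarm service behind the reverse proxy (service and image names are made up, not the actual stack): flipping the rolling-update order makes Swarm start the new container before stopping the old one, so the proxy always has a healthy target.

```yaml
# Hypothetical docker stack / compose snippet, not the real deployment config.
services:
  bindings-server:                          # made-up service name
    image: example/bindings-server:latest   # made-up image
    deploy:
      replicas: 1
      update_config:
        order: start-first        # start the new task before stopping the old one
        failure_action: rollback  # roll back automatically if the update fails
      rollback_config:
        order: start-first
```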