# github-workflows-kt
p
@LeoColman we've got some failures last night, and I'm struggling to find logs for it via Portainer. Do you have another way of accessing the logs? I tried SSHing into the container ("Container console"), but I don't see a log file anywhere, probably because the app logs only to stdout and nothing writes it to a file. As a quick idea for [Feature] Collect, Store, and make available Application Logs, we could modify the Log4j config to also write the logs to a file, with some upper limit; a RollingFileAppender should do the job
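A minimal sketch of what that could look like (assuming a standard Log4j2 XML config; file names, paths, and limits here are illustrative, not the actual setup):

```xml
<!-- Hypothetical log4j2.xml sketch: keep stdout logging and add a size-capped rolling file. -->
<Configuration>
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d %-5p %c - %m%n"/>
    </Console>
    <!-- Rolls over when the active file hits the size limit; old files are gzipped. -->
    <RollingFile name="File" fileName="/logs/server.log"
                 filePattern="/logs/server-%i.log.gz">
      <PatternLayout pattern="%d %-5p %c - %m%n"/>
      <Policies>
        <SizeBasedTriggeringPolicy size="10 MB"/>
      </Policies>
      <!-- Upper bound on how many rolled files are kept, so disk usage stays capped. -->
      <DefaultRolloverStrategy max="5"/>
    </RollingFile>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Console"/>
      <AppenderRef ref="File"/>
    </Root>
  </Loggers>
</Configuration>
```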
v
Just click on the right service and then there is a button "Service Logs": https://ritalee.colman.com.br/#!/1/docker/services/isqaiiw22kqob554dwtf3r807/logs
If it's not there, it is probably gone until file logging is enabled I guess :-(
👍 1
p
I know but I see only logs from like 2-3 hours ago, interleaved with some random logs from yesterday. Can you check if you can track down these failures in the logs?
v
Last updated at 2025-04-30 23:19:32
So even with file logging the logs would have been gone unless they were written to a persisted volume; just changing the logging config would not have been enough for this case
Ah, wait, it happened at 3 AM GMT, so after the update :-/
Yeah, the logs seem to be the last few messages of the old container, interleaved with the last 2 hours or so of the current container.
I fear the older ones are gone
p
ok, let me add the file appender
storing the files on a persistent volume is another issue I'm not tackling here
v
Maybe it would be better to have something like Logtail or similar
Maybe Grafana Loki is something similar
p
what do you mean? it's a separate tool, right? we also had https://dozzle.dev/ some time ago
I thought keeping the logs on the disk can always be handy if another tool fails, like now
so I mean it rather as a backup option
v
Btw. the links in Grafana e-mails are dead @LeoColman, they are pointing to localhost
🤦‍♂️ 1
p
I got an e-mail from Grafana today around 5 AM CET
l
• A fix for the auto-deploy
• Grafana Loki
Are both on my to-do list. I think Loki will help with the log situation. I also think we can persist it to disk to make sure, and log-rotate on the volume itself. I can work on that after fixing the auto deploy
👍 1
👌 1
v
> I got an e-mail from Grafana today around 5 AM CET
Yes, me too, but try to click the link in it 😉
p
ah, got it!
failures again - maybe let's deploy the Log4j2 config change since it won't hurt?
v
Hm, strange, I don't see the failures in the log, though the time frame should be covered
So yeah, do it
👍 1
p
ok, redeployed with file-based logs
👌 1
the failures are happening again, and I discovered that our log files are limited to just 1 KB each... a leftover from testing 😅 gonna set it to 10 MB each and redeploy
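For reference, assuming the Log4j2 setup sketched earlier, the per-file cap is the `size` attribute on the triggering policy (snippet illustrative, not the actual config):

```xml
<Policies>
  <!-- was effectively 1 KB from testing; bump the per-file cap to 10 MB -->
  <SizeBasedTriggeringPolicy size="10 MB"/>
</Policies>
```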
v
This should not have failed
Just reran it and it worked
p
ok, redeployed with fixed logging - hopefully we'll solve the mystery soon
👌 1
```
root@87dd34e499b2:/logs# grep -Ri "error" *
server.log:2025-05-02 12:15:51,962 <INFO > <eventLoopGroupProxy-4-8            > <[]> <{request-id=x6xj2ii2d658alm}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /github/codeql-action__analyze___major/maven-metadata.xml in 8760ms
server.log:2025-05-02 12:15:51,964 <INFO > <eventLoopGroupProxy-4-6            > <[]> <{request-id=4geip7b434ptze2}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /github/codeql-action__init___major/maven-metadata.xml in 8779ms
server.log:2025-05-02 12:15:53,071 <INFO > <eventLoopGroupProxy-4-7            > <[]> <{request-id=ypjk1l5coe6n9l2}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /github/codeql-action__init___major/maven-metadata.xml in 7ms
server.log:2025-05-02 12:15:53,154 <INFO > <eventLoopGroupProxy-4-7            > <[]> <{request-id=uvi9egn6qeks3zw}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /github/codeql-action__analyze___major/maven-metadata.xml in 2ms
server.log:2025-05-02 12:15:55,215 <INFO > <eventLoopGroupProxy-4-1            > <[]> <{request-id=d8ki05ivaabd9vi}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /github/codeql-action__init___major/maven-metadata.xml in 3ms
server.log:2025-05-02 12:15:55,263 <INFO > <eventLoopGroupProxy-4-1            > <[]> <{request-id=dpgdzynuxlegw27}> <                        io.ktor.server.Application> 500 Internal Server Error: GET - /github/codeql-action__analyze___major/maven-metadata.xml in 4ms
```
reproduces by going to https://bindings.krzeminski.it/github/codeql-action__analyze___major/maven-metadata.xml
and works locally o.O (with a PAT)
what's weird, I see no details in the log
questions:
• what kind of cached result makes the server fail? likely related to the fix to treat failed generation as failures, not 404s
• why aren't there more details, like the full stack trace? is it disabled? shall we set a more detailed logging level?
• why is the problem non-deterministic (the refresh helped)?
v
I think the 404 fix just demasked it. We store the `Result` in the cache, so exceptions during generation are cached too, as usually they should deterministically reappear anyway, hopefully. But maybe `null` should be cached, to have the 404 cached but not exceptions. So basically we should probably stop having `Result` in the cache, but only `Result#right`. If there are exceptions, just try again next time, and if it fails again, that's maybe fine and should be fixed anyway. That the details are missing is maybe still caused by not using the lib for the GH calls but ktor-client? You probably again have to raise the log level to `ALL` to get the cause displayed, until implementing the switch to the lib.
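A minimal Kotlin sketch of that caching idea (names and types are hypothetical, not the actual server code): cache successful results and explicit "not found" answers, but never cache exceptions, so a transient failure is retried on the next request.

```kotlin
import java.util.Optional
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch: Optional.empty() stands for a cached 404/"not found".
val metadataCache = ConcurrentHashMap<String, Optional<String>>()

fun fetchMetadataCached(coords: String, fetch: (String) -> String?): String? =
    metadataCache.getOrPut(coords) {
        // If fetch(...) throws, getOrPut stores nothing, so the next request
        // retries instead of replaying a cached exception.
        Optional.ofNullable(fetch(coords))
    }.orElse(null)
```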
p
> I think the 404 fix just demasked it.
yeah, I also think so. What you propose about storing only the result of a successful generation makes sense!
unfortunately, cannot say when I'm able to fix it (likely next week), but if you have a spare moment... 🙂
l
I might be able to do it this weekend. I think it's more important than the auto deploy
💙 1
Could you create a ticket for information clarity, please?
p
Here's the ticket: https://github.com/typesafegithub/github-workflows-kt/issues/1938, however I may be able to attack it this weekend as well, so if I were to choose, I'd prefer if you tried to fix the deployment, it's fairly tedious to redeploy manually... But of course, up to you!
👍 2
Actually it needs more tweaking, will take care of it today or tomorrow
Fix deployed in the image, the server needs to get the newest image - feel free to refresh, or I'll do it in the evening
l
I can redeploy
thank you color 1
v
Something is majorly broken right now. I get constantly failing workflows. Grafana 5 minutes ago showed an uptime of 1.5 minutes; now it shows 34.7 seconds and doesn't change. On Portainer, Rita Lee is shown as Down, and I also cannot drill into it, being told "Failed loading environment Unauthorized"
Hm, right now it seems to be working again and I can also open Rita again 😕
l
Yes, sorry, I should probably have warned before
It's me
v
Ah, ok
l
I'm trying to get an auto-updater going and I updated it way too many times.
👌 1
I'm impressed at how I try to do stuff at bad Sunday hours but I still manage to break someone's legs 😂
v
Well, my legs are always whirling, especially at night 😄
p
to avoid this downtime, ideally we'd make the deployment first spin up a new container, then swap which container the reverse proxy points to, and then ditch the old container. I think @LeoColman tried something like this a while ago, but with the current infra it turned out to be very tricky and we punted on it for now
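A hedged sketch of that idea, assuming the service runs as a Docker Swarm service behind the reverse proxy (service and image names are made up, not the actual stack): flipping the rolling-update order makes Swarm start the new container before stopping the old one, so the proxy always has a healthy target.

```yaml
# Hypothetical docker stack / compose snippet, not the real deployment config.
services:
  bindings-server:                          # made-up service name
    image: example/bindings-server:latest   # made-up image
    deploy:
      replicas: 1
      update_config:
        order: start-first        # start the new task before stopping the old one
        failure_action: rollback  # roll back automatically if the update fails
      rollback_config:
        order: start-first
```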