# random
p
is there any OpenTelemetry expert here? 🙂 together with @LeoColman we're trying to configure the monitoring stack to provide basic metrics (not just raw spans that we can browse via Jaeger), and we're having a hard time understanding the setup. What we have now is a setup that does provide some metrics, but after several hours, the metrics simply disappear. We're not sure which component to blame
t
@Riccardo Lippolis?
r
I think this question is hard to answer without more insight into the components used in the stack and their configuration. Could you give more information?
p
@LeoColman owns the monitoring infra, but I'm aware of these configs: https://github.com/LeoColman/MyStack/tree/main/github-workflows-kt. We monitor a server that has its entry point here, and has its OTel config here. In short, what we want:
• see latency and the number of calls for the endpoints
• define metrics/graphs filtered by certain request attributes
• see spans (or the like) for each request individually
The nastiest problem right now is that the metrics disappear several hours after restarting the stack. Leo found this issue: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/31758, unfortunately auto-closed. Right now we're considering:
• a hypothesis that some dynamic IP changes and that this leads to losing the metrics
• switching to the ELK stack (no experience on my side, nor on Leo's, I think), but we're not sure whether we'd have to parse the logs to create metrics, or whether something like OTel is supported there as well
What we need the most is a solid understanding of which component does what, and what happens when we lose the metrics. I hope this summary is helpful!
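For context, per-endpoint latency/count metrics usually come out of something like the spanmetrics connector in the collector. A minimal sketch of that kind of pipeline (I'm assuming spanmetrics is what produces these metrics in Leo's stack; the receiver, exporter and dimension names below are illustrative, not taken from the repo):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

connectors:
  # turns incoming spans into request-count / latency metrics
  spanmetrics:
    dimensions:
      # each dimension becomes a label you can filter/graph on
      - name: http.route
      - name: http.request.method

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, spanmetrics]   # spans go to Jaeger and to spanmetrics
    metrics:
      receivers: [spanmetrics]                # generated metrics go to Prometheus
      exporters: [prometheus]
```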
@Riccardo Lippolis any idea what can be wrong here? or you need more info?
r
I have looked through the code and don't immediately see anything out of the ordinary. You are saying:
a setup that does provide some metrics, but after several hours, the metrics simply disappear
so I assume it is all working in the beginning, but then after a few hours you don't see any new metric information appear, am I right? And does this apply to logs, traces and metrics, or only specifically to metrics? In short, the way the information flows is like this:
your app
->
opentelemetry collector
->
grafana/jaeger/prometheus
so it indeed seems like at some point either the connection between your app and the otel collector breaks, or the connection between the collector and the observability backends (grafana/jaeger/prometheus) does. To determine whether the OTel collector is still receiving information, you could try a few things:
• expose the collector's internal telemetry via a prometheus exporter, so that the otel collector also sends its own metrics to your prometheus instance
• set the log level of the collector to debug, to see whether anything is still happening in the collector at the moment the metrics stop appearing
• configure a debug exporter in the otel collector, so that you can see in the logging whether the collector is attempting to export anything
(a rough sketch of what such a collector config could look like is below)
If you don't see metrics but you do see exporting going on in the logging, the issue must be in the connection between the collector and the observability backends. If you also don't see any logging, then either your app is not sending anything, the connection between your app and the otel collector is broken, or the otel collector itself is broken (which could theoretically be caused by the github issue you linked to). I hope this gives you some pointers to look at!
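To make the first and last bullet concrete, a rough sketch of the relevant collector config pieces (the exact keys depend on your collector version, e.g. newer versions configure internal metrics via `service.telemetry.metrics.readers` instead of `address`, and the pipeline/exporter names are just placeholders for whatever is already in the config):

```yaml
exporters:
  # debug exporter: logs whatever the collector attempts to export
  debug:
    verbosity: detailed

service:
  telemetry:
    logs:
      level: debug              # the collector's own log level
    metrics:
      level: detailed
      address: 0.0.0.0:8888     # internal metrics in Prometheus format; add this as a scrape target

  pipelines:
    metrics:
      receivers: [otlp]
      # keep the existing exporter(s) and add debug next to them
      exporters: [prometheus, debug]
```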
🙌 1
❤️ 2