by cheald 3491 days ago
Influx + Telegraf + Grafana is such a simple, sweet stack. No work to maintain, trivial to set up, I can ship just about anything I want into it, and reporting is fast.

With alerting in place now, I'm even happier than ever. A huge thank you to the Grafana team for solving a huge pain point!


What kind of volume are you sending into Influx? It crashed on me probably 5 times a day with only 100 requests per second.
Right now it looks like it's around 50/sec. A lot of data points get rolled up by Telegraf on individual machines, and then they're shipped in via the UDP line protocol. I've written much larger volumes, though, and never had an issue with stability.
If I may ask, how is UDP doing for you?

I checked my graphite setup once. We had 27% of metrics lost over UDP. That was bad.

Pro tip: run "netstat -anus" and look at the error counters.

About 4% err-to-received ratio. That's probably due to untuned UDP buffer sizes, though; despite the dropped packets, enough data gets through to give us the information we need.
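For anyone who wants to track that ratio over time instead of eyeballing netstat, here is a rough Python sketch. It assumes a Linux host, where the same UDP counters are exposed in /proc/net/snmp; the function names are made up for illustration, and "err-to-received" is taken to mean InErrors divided by InDatagrams.

```python
from typing import Dict


def parse_udp_counters(snmp_text: str) -> Dict[str, int]:
    """Parse the "Udp:" header/value line pair from /proc/net/snmp-style text."""
    # "UdpLite:" lines do not start with "Udp:", so they are skipped here.
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp header/value pair found")
    names, values = udp_lines[0][1:], udp_lines[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def error_ratio(counters: Dict[str, int]) -> float:
    """Receive errors divided by datagrams received (0.0 if nothing received)."""
    received = counters.get("InDatagrams", 0)
    errors = counters.get("InErrors", 0)
    return errors / received if received else 0.0


def read_system_counters(path: str = "/proc/net/snmp") -> Dict[str, int]:
    """Read the live counters; the path is an assumption that holds on Linux."""
    with open(path) as handle:
        return parse_udp_counters(handle.read())
```

If RcvbufErrors accounts for most of InErrors, the drops are most likely coming from a full socket receive buffer, which fits the untuned-buffer guess.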
Was this an older build? We had serious issues at first, but our setup is pretty stable these days.
Last I tried was 1.0.

I love everything about using Influx, but it would die and never restart, and every time it would crash on semacquire. I'll have to try it again since I need to check out this Grafana update anyway.

There's some setup involved if you're sending a decent amount of traffic to it.

The two game changers are using the UDP line protocol instead of HTTP, and making sure you are batch-processing inputs. Fixing these settings is the difference between an instance that crashes all the time and a purring one.
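To make those two points concrete, here is a minimal Python sketch of batching line-protocol points into UDP datagrams. It is an illustration, not how Telegraf does it: the function names are hypothetical, only float fields are handled, the 1400-byte payload cap is an assumed safe size for a typical MTU, and port 8089 is assumed to be where the Influx UDP listener is configured.

```python
import socket
from typing import Dict, Iterable, Iterator, List


def _escape(value: str) -> str:
    """Escape commas, equals signs and spaces in tag keys/values and field keys."""
    return value.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def format_point(measurement: str, tags: Dict[str, str],
                 fields: Dict[str, float], timestamp_ns: int) -> str:
    """Render one point as a line-protocol string (float fields only)."""
    name = measurement.replace(",", r"\,").replace(" ", r"\ ")
    tag_part = "".join(f",{_escape(k)}={_escape(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{_escape(k)}={float(v)!r}" for k, v in fields.items())
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


def batch_lines(lines: Iterable[str], max_bytes: int = 1400) -> Iterator[bytes]:
    """Join lines with newlines into payloads of at most max_bytes each.

    A single line longer than max_bytes is still emitted, alone and oversized.
    """
    batch: List[bytes] = []
    size = 0
    for line in lines:
        encoded = line.encode("utf-8")
        # The +1 is the newline that joins this line to a non-empty batch.
        extra = len(encoded) + (1 if batch else 0)
        if batch and size + extra > max_bytes:
            yield b"\n".join(batch)
            batch, size = [], 0
            extra = len(encoded)
        batch.append(encoded)
        size += extra
    if batch:
        yield b"\n".join(batch)


def send_batches(lines: Iterable[str], host: str = "127.0.0.1", port: int = 8089) -> None:
    """Send each batch as one UDP datagram; host and port are assumptions."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for payload in batch_lines(lines):
            sock.sendto(payload, (host, port))
```

Packing many points into one datagram means far fewer packets per second hit the receiver, which is the whole point of batching here.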

Sending data in batches gives serious performance improvements. Don't send metrics directly to Influx from your app. Send them to an intermediary like StatsD, which will aggregate them and forward the results.

Shameless plug: I recently published a log router in Golang. It sends data to Influx too! (github.com/agnivade/funnel)
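The aggregation idea from this comment, sketched in Python. This is a toy illustration of what an intermediary does with counters, not StatsD's actual implementation; the class name and the "count" field name are made up.

```python
from collections import defaultdict
from typing import Dict, List


class CounterAggregator:
    """Sum counter increments in memory and emit one line per counter on flush."""

    def __init__(self) -> None:
        self._totals: Dict[str, float] = defaultdict(float)

    def incr(self, name: str, value: float = 1.0) -> None:
        """Record an increment without touching the network."""
        self._totals[name] += value

    def flush(self, timestamp_ns: int) -> List[str]:
        """Return one line-protocol line per counter, then reset all totals."""
        lines = [f"{name} count={total!r} {timestamp_ns}"
                 for name, total in sorted(self._totals.items())]
        self._totals.clear()
        return lines
```

Calling incr() a hundred times between flushes produces a single point instead of a hundred writes, so the database sees one write per counter per flush interval no matter how busy the app is.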

Thank you. I'll check this out.
I use Riemann in front of Influx, which collects data and forwards it once a second. Works nicely, especially given that I aggregate some of the higher-volume metrics before sending them to Influx.
What transport are you using to secure Telegraf into InfluxDB?

(Haven't tried Telegraf yet, setting up Prometheus at the moment)

Not sure what you mean by "secure telegraf into influxdb", but we've had great success with this stack for monitoring by just embedding an HTTP server into each application that needs to be monitored. We keep the HTTP server separate from any others used by the application (i.e. it runs on a separate thread) so performance isn't impacted.
My use case is one where I have servers in different datacenters and would want to have a simple, but secure, way to fetch metrics for graphing and alerts.

So, I meant encryption in transport, authentication, etc. as many solutions work well if you're monitoring "in the clear" from the backend, but not so much over the internet.

We're deployed on AWS in multiple regions with VPNs set up between VPCs. No particular attention paid to securing the transport between Telegraf and Influx at the moment since a) it's either in an internal VPC or secured via ipsec, and b) our monitoring data is low-value enough that it doesn't warrant its own secure transport.
IIRC, Influx supports HTTPS too. So you just have to set up some certs and switch to HTTPS in the client.
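On the client side, that switch could look roughly like this in Python. It is a sketch assuming the InfluxDB 1.x /write endpoint with basic auth enabled; the host name, database name and credentials are placeholders, and build_write_request is a hypothetical helper.

```python
import base64
import urllib.parse
import urllib.request
from typing import List


def build_write_request(base_url: str, database: str, lines: List[str],
                        username: str, password: str) -> urllib.request.Request:
    """Build a POST to /write carrying newline-separated line-protocol points."""
    query = urllib.parse.urlencode({"db": database, "precision": "ns"})
    body = "\n".join(lines).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/write?{query}",
        data=body,
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


# Usage (placeholder host and credentials; not executed here):
# req = build_write_request("https://influx.example.com:8086", "metrics",
#                           ["cpu usage=0.5 1"], "user", "secret")
# urllib.request.urlopen(req, timeout=5)  # verifies the server cert by default
```

With an https:// URL, urllib checks the server certificate against the system trust store, so a self-signed cert needs its CA added to an ssl context rather than having verification turned off.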