Monitoring my Minecraft server with OpenTelemetry and Prometheus

Thanks for setting me straight :-) I updated the article to reflect that.

doabell 399 days ago

> I am a man of simple tastes, and running the “vanilla” Minecraft server as a Systemd unit on a Linux VM in the cloud

Minecraft is famously under-optimized and needy in terms of CPU frequency. If running a vanilla (no server mods) version, then using something optimized, like PaperMC, is a better idea for datacenter VMs. (Until you need to dupe sand or something.)

The other route is installing a bunch of optimization mods - some really do help.

ehnto 399 days ago

People love to bother about Java MC performance, but I ran a modded Tekkit server for like 10 years on a base Digital Ocean VM. Shoutout to Digital Ocean for having no impactful changes for 10 years too. They give me a VM, I run the thing, life is good.

From my understanding, Paper and the like are good for Minecraft servers focused around specific mini-games (rather than freeform building), and are the only sensible choice for servers with many people (or not that many people, but really underpowered hardware).

However, they may be a problem if players are sensitive to possible non-vanilla behaviour (as you mentioned, and it’s not limited to cheaty duping). Thankfully, spinning up a server with a selection of performance mods is very easy these days. Various tricks like pre-generating chunks in advance also help.

treyd 399 days ago

It's kinda nuts. The upstream Mojang server binary starts to groan if you have >4-5 players on the same server doing stuff. They've really been dropping the ball on optimization in recent years.

Paper is good enough for anyone but very technical players pushing to the limits of redstone tick timing logic, entity behavior, chunk loading mechanics, etc. These don't matter even for advanced players doing normal things.

I actually had to splurge on 2 vCPUs on Digital Ocean to avoid "skipping ticks", and it does sound pretty nuts to me. We play with 3 players at most. I would expect a server under such a load to be able to run on a slightly tuned-up toaster.
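
For context on "skipping ticks": the Minecraft server targets 20 ticks per second, which leaves a budget of 50 ms per tick. A minimal Python sketch of what that budget means, assuming you already have a mean tick time from some metrics source (the function names and the sample numbers are hypothetical, not taken from the article):

```python
TARGET_TPS = 20                      # the server aims for 20 ticks per second
TICK_BUDGET_MS = 1000 / TARGET_TPS   # which leaves 50 ms per tick


def effective_tps(mean_tick_ms: float) -> float:
    """Ticks per second the server can sustain for a given mean tick time."""
    if mean_tick_ms <= 0:
        raise ValueError("mean_tick_ms must be positive")
    # The server never runs faster than its target, only slower.
    return min(TARGET_TPS, 1000 / mean_tick_ms)


def ticks_behind(mean_tick_ms: float, seconds: float) -> float:
    """Ticks the server falls behind over `seconds` of wall-clock time."""
    return max(0.0, (TARGET_TPS - effective_tps(mean_tick_ms)) * seconds)


if __name__ == "__main__":
    # Hypothetical measurements: a healthy server and an overloaded one.
    for tick_ms in (25.0, 100.0):
        print(tick_ms, effective_tps(tick_ms), ticks_behind(tick_ms, 60))
```

Any mean tick time above the 50 ms budget means the server is falling behind, no matter how few players are online.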

It is not cheap for the cloud. Had to use some beefy variety of EC2 medium instance for 4 players or so, with a simple dash for starting it up and terminating, I think using spot instance pricing. Otherwise it cost a pretty penny. At that point I did not use any performance mods, though.

skrtskrt 399 days ago

To be fair, with the power on most people's laptops and phones now, I think we tend to lose track of just how little "1 CPU" is if you're not just running, like, a small web app.

frollogaston 398 days ago

Wasn't it always like this? There's a lot going on in the game, especially if generating new chunks, and it's in Java.

treyd 398 days ago

It was not always like this. You used to comfortably be able to handle 70+ players in a single server before Paper existed (my memory of this is from before like 2015). You'd need to allocate a lot more memory than normal, like 8 gigs instead of the normal suggestion of 1 or 2, but it could handle it without regular lag.

frollogaston 397 days ago

I forget what heap setting I used, maybe it was 2G, but the old 2010 Mac mini I had as a server would lag if just one player was exploring land quickly (maybe by boat). Was online from 1.5 beta to 1.9 release, no more than 8 players usually.

Monitoring and metric collection makes a lot of sense when you run a production system, or a personal but critical system.

Promoting a telemetry solution when it comes to a hobby server, which you host for yourself and which can’t bankrupt you by running up a massive AWS bill, doesn’t seem to make much sense when simply bottling it up in Docker and being able to restart or recreate at will is enough (mount volumes for logs and persistent data, back it up, and you’re good).

With games like Minecraft in particular there’s value in being able to have multiple servers with different worlds, perhaps different mods, etc. If you decide not to have more servers because they are snowflakes you do not have time to set up monitoring for, then you rob yourself and your players of the opportunity to have more fun.

Furthermore, containerizing it allows you to upgrade quickly as new game versions come out by simply spinning up a new container with your preexisting world as a test, and you get basic system resource usage monitoring built in.

What I think could be a more interesting exercise is a dashboard for friends or family that allows them to manage the lifetime and configuration of their respective containers.

gmuslera 399 days ago

Implementing proper monitoring in a toy system doesn't prepare you to do it in a massive critical system, but at least you may have learned something in the process, and noticed things that may not be as evident at large scale.

In any case, the fun starts when the system has more interdependent components.

I think there is value in learning which pattern is good to apply in which scenario, and I will argue that in this case the best pattern is “servers are cattle”.

One of the stretch goals for me writing this article was indeed to show between the lines how Prometheus Exporters, the OpenTelemetry Collector and Systemd can all work together. That is a very reusable skill for monitoring workloads running outside containers on Linux VMs or hosts.

jeroenhd 399 days ago

The goal of this article is to show you how to integrate with this service from just about anything. It's an ad that was fun to make as a hobby project. I doubt the goal was ever to set up a fully integrated Minecraft monitoring pipeline. At best, this is an employee at this company who decided to show the flexibility of their product by integrating with a random piece of kit they like.

Luckily, all of the interesting components are existing third party libraries so if you don't want to use their SaaS service, you can build your own Minecraft dashboard pretty easily.

I am indeed an employee of Dash0. The setup for the telemetry collector will work with anything that accepts OTLP, and with minor adjustments, the data can be sent elsewhere in other formats too, as the OpenTelemetry Collector is very flexible in that regard.

Alerting is specific to Dash0. I know of no other monitoring solution that lets you run real PromQL on logs. But there will be similar ways of accomplishing the same alerting logic.

dpe82 399 days ago

Have you never just built something for fun?

dengolius 398 days ago

Do you mean something like launching k3s on smartphones https://blog.denv.it/posts/pmos-k3s-cluster/?

I have built a panel like the one I mentioned for fun with friends!

The goal of my comment was to highlight opportunities for more fun and less what seems like toil.

Furthermore, this is an article about a telemetry solution posted on a site of that telemetry solution. They make money from this.

dewey 399 days ago

One person's toil is another person's fun.

And sometimes a person is paid to pretend toil is fun. We are talking about spending hours setting up telemetry instead of playing a game.

dewey 399 days ago

Not everyone is into gaming. I'd rather code on my side projects than use my console. Or people tweak and customize their Linux installation instead of doing work on it. Some people like to work on their cars; driving is a small part of it.

I swear I had a lot of fun doing the setup.

I am also a massive observability nerd, so YMMV :-)

koinedad 399 days ago

I’ve recently added telemetry to some “toy” apps at my house because a power outage or other unforeseen issue has caused things like my Siri-enabled garage doors to stop working. Now I get alerts through Grafana and Telegram for basically free, which comes in handy.

A garage door is a security concern.

For a game, a solution that simply restarts the container if it’s down solves the issue. You can mount game logs in a volume if you want, and you can see resource usage in the container host dashboard. What value do detailed system metrics bring?

Furthermore, you don’t care what software you run to make your garage door system Siri-enabled, as long as it does its job and is not vulnerable; whereas with a game that adds new gameplay features multiple times per month, you do want to update it frequently. Babysitting a snowflake server makes it way more difficult than it should be.

ajmurmann 399 days ago

I am currently planning to add monitoring to some toy apps I host on a Raspberry Pi cluster. The intent is that this might save me time and stress further down the road. If a new version makes performance worse, I want to see that in the data. If resource needs go up, I want to know that before it's time to move, so that I can plan without any kind of scheduling stress. (I also want to do this in part as an exercise, which is partial motivation for the cluster and most things I build that run on it. But don't tell anyone!)

Am I misguided?

Well, as far as I’m concerned, if they are toy apps, why stress? If they are going to go in production at some point, then sure; but this certainly is not happening with a family game server.

ajmurmann 399 days ago

Family game server going down can be very stressful, especially if you have kids.

Also, I've had phone tech support sessions with family that were more stressful than calls with large banks who were worried about losing very large amounts of money in case of an outage. Different stressful, but nonetheless...

> Family game server going down can be very stressful, especially if you have kids.

Telemetry does not address this, though. Shoving it into a container and assigning it a simple “restart if down” rule does. Minecraft is a flaky beast, if you run snapshots and/or mods. Metrics or not, often “start again” is all you need.

Furthermore, this is a game that adds new gameplay features multiple times per month. If you do not update it frequently and your kid misses out on a new mob, you run into the same stress. Containerizing it makes the upgrade very straightforward, and once you run a couple of containerized instances… Do you not struggle to see the value of detailed system monitoring?

[1] https://github.com/dash0hq/minecraft-server/blob/main/drople...

> Telemetry does not address this, though. Shoving it into a container and assigning it a simple “restart if down” rule does.

A Systemd unit as shown in [1] does it too, without containers and with fewer moving parts than using containers. I use containers every day at $work. I have been using containers since before Docker was a thing. In this case, it's entirely overkill: Systemd units already use the important things like cgroups.
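
Whether it comes from a container restart policy or from a Systemd unit, the "restart if down" rule boils down to a small supervision loop. A minimal Python sketch of the idea, similar in spirit to an on-failure restart policy but not how Docker or Systemd implement it (`run_with_restarts` and its parameters are hypothetical names, and the failing command is a stand-in for the real server):

```python
import subprocess
import sys
import time


def run_with_restarts(cmd, max_restarts=3, delay_s=1.0):
    """Run cmd and restart it after a non-zero exit, up to max_restarts times.

    Returns the exit codes observed, one per run.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean shutdown: do not restart
        if attempt < max_restarts:
            time.sleep(delay_s)  # short pause before the next start
    return exit_codes


if __name__ == "__main__":
    # Stand-in for the server command: a process that always exits with 1.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(run_with_restarts(failing, max_restarts=2, delay_s=0.1))
```

The loop gives up after a bounded number of restarts, so a server that crashes on startup does not spin forever.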

For the upgrade: depends. You do need a container image regardless, and I have not seen official ones. Upgrading servers in Minecraft requires upgrading clients to match, and my kids prefer to play, more than upgrade. (Unless a biome is released. Then it must be immediately available to them.) But then again, I just need to download the binary with a cURL call. And if the configurations change, Docker won't help me there one bit anyhow.

Indeed.

My personal definition of nanosecond is the time passing between the Minecraft server having a hiccup, and the first scream piercing the air.

The printer not printing is DEFCON 1 material.

jauntywundrkind 399 days ago

Seeing what computers are doing is good, actually. Period.

This is a real-time game. What the computer is doing is directly in front of your eyes.

jauntywundrkind 399 days ago

I know I sound like a freak to you, but you sound like a deranged freak to me too. Who would opt for ignorance? Who would opt not to have data? Who would opt not to see more? It's insanity to me to resist enrichment so.

Limiting yourself to only naive senses is a wild proposition to me. The scientific mindset compels us to see further: it is a wild privilege to see more, to build and use tools that expand how we can see.

harrall 399 days ago

Setting up telemetry is really easy if you’ve done it before and it’s a learning opportunity if you haven’t.

I have Dockerfiles from 10 years ago for Grafana and a time-series DB so basically you learn it once and you can bang out basic telemetry infra in an hour afterwards.

And I still actually use InfluxDB and Grafana for my hobby stuff. My current Dockerfiles just look like my old ones…

What happens if Grafana or InfluxDB is down? Who monitors the monitors?

For this, I have the impression that https://github.com/dirien/minectl might be very close to what you are thinking of. I did not try it, but I took the Minecraft Exporter from it and used it in the setup.

cpburns2009 398 days ago

> The minecraft-prometheus-exporter ... which uses Fabric, another way to run Minecraft servers with mods. Like Bukkit, Fabric was not an option for me.

Forge and its recent fork NeoForge are supported too.