| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by eieio 457 days ago

So sometimes I don't test these projects that much but I did this time. Here are a few thoughts:

My biggest goal was "make sure that my bottleneck is serialization or syscalls for sending to the client." Those are both things I can parallelize really well, so I could (probably) scale my way out of them vertically in a pinch.

So I tried to pick an architecture that would make that true; I evaluated a ton of different options but eventually did some napkin math and decided that a 64-million uint64 array with a single mutex was probably ok[1].

To validate that I made a script that spins up ~600 bots, has 100 of them slam 1,000,000 moves through the server as fast as possible, and has the other 500 request lots of reads. This is NOT a perfect simulation of load, but it let me take profiles of my server under a reasonable amount of load and gave me a decent sense of my bottlenecks, whether changes were good for speed, etc.

I had a plan to move from a single RWMutex to a row-locking approach with 8,000 of them. I didn't want to do this because it's more complicated and I might mess it up. So instead I just measure the number of nanos that I hold my mutex for and send that to a loki instance. This was helpful during testing (at one point my read lock time went up 10x!) but more importantly gave me a plan for what to do if prod was slow - I can look at that metric and only tweak the mutex if it's actually a problem.

I also took some free wins like using protobufs instead of JSON for websockets. I was worried about connection overhead so I moved to GET polling behind Cloudflare's cache for global resources instead of pushing them over websockets.

And then I got comfortable with the fact that I might miss something! There are plenty more measurements I could have taken (if there was money on the line I would have measured some things like "number of TCP connections sending 0 moves this server can support" but I was lazy) but...some of the joy of projects like this is the firefighting :). So I was just ready for that.

Oh and finally I consulted with some very talented systems/performance engineer friends and ran some numbers by them as a sanity check.

It looks like this was way more work than I needed to do! I think I could comfortable 25x the current load and my server would be ok. But I learned a lot and this should all make the next project faster to make :)

[1] I originally did my math wrong and modeled the 100x100 snapshots I send to clients as 10,000 reads from main memory instead of 100 copies of 100 uint64s, which lead me down a very different path... I'm not used to thinking about this stuff!

2 comments

phatskat 457 days ago

> To validate that I made a script that spins up ~600 bots

Funny, when I went there were just over 600 active players and things were running super smoothly, even on my mobile. Kudos!

Do you see this project and the things you’ve tried applying to other future projects?

link

eieio 457 days ago

Hah, yes, but for testing I removed all my rate limits so I pushed 1 million moves in 2 or 3 seconds, whereas now I think I rate limit people to like 3 or 4 moves a second (which is beyond what I can achieve on a trackpad going as fast as I can!) so the test isn't quite comparable!

I definitely learned a lot here. Most of my projects like this are basically just "give the internet access to my computer's memory but with rules." And now I think I've got a really good framework for doing that performantly in golang, which should make the next set of projects like this much quicker to implement.

I also just...know how to write go now. Which I did not 6 weeks ago. So that's nice.

link

cheekyfleek 456 days ago

You ain't the only one who's removed the rate limits lol. Some of these queens are clearing a whole board in like 3s, must've written something to keep a piece selected. This is turning into a race to the godliest piece hackathon.

link

eieio 456 days ago

The rate limits aren't that aggressive and have a decent amount of burst, you can get about 10 moves done in 1 second before you hit them and start getting throttled[1]. And of course you can run multiple clients (I account for this too, but I'm not that aggressive because I don't want to punish many people NAT'd behind a single IP)

[1] down to something like 3/sec?

link

cheekyfleek 456 days ago

I figured the multiple people per ip would be an issue, was wondering if that might be at play here. I thought you said it was already at 3-4/s and I doubted it based on some of what I'm seeing. 10/s tracks a little better.

As to what you should change, I can't say. It's in the wild now lol.

link

PetahNZ 454 days ago

Yea seems the rate limit should be stricter, I made a quick script to do 1 move (capture) every 150ms, doesn't seem to get throttled.

I like how generals.io did it where they explicitly made a bot only server/version.

link

phatskat 456 days ago

Six weeks is pretty quick! Can I ask what editor you use (always curious), and what other languages you have a background in?

link

eieio 456 days ago

these days I mostly use vscode / cursor, although I still really like vim and use it for languages that I know really well (mostly python these days) and quick edits.

I spent much of my professional career at Jane Street Capital, which means that I spent a long time just writing OCaml and some bash (and a tiny bit of C). I'm very comfortable with Python, and over the last year I've gotten pretty comfortable with frontend javascript. And now golang!

I could probably write semi-reasonable java, ruby, or perl if you gave me a few days to brush up on them. And it'd take me a while before I was happy putting C on the internet. Not sure otherwise.

link

sebmellen 457 days ago

Makes sense that you’re a Jane Street alum. Damn cool stuff.

link