
by trishume 1300 days ago

I think I'm pretty careful to say that this is a simplified version of Twitter. Of the features you list:

- spam detection: I agree this is a reasonably core feature and a good point. I think you could fit something here but you'd have to architect your entire spam detection approach around being able to fit, which is a pretty tricky constraint and probably would make it perform worse than a less constrained solution. Similar to ML timelines.

- ad relevance: Not a core feature if your costs are low enough. But see the ML estimates for how much throughput A100s have at dot producting ML embeddings.

- web previews: I'd do this by making it the client's responsibility. You'd lose trustworthiness, though: users with hacked clients could make troll web previews. They can already do that for a site they control, but not for an arbitrary site.

- blocks/mutes: Not a concern for the main timeline other than when using ML. When looking at replies, you'd need to fetch blocks/mutes and filter. Whether this costs too much depends on how frequently people look at replies.
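
A minimal sketch of the reply-filtering step from the last bullet, assuming blocks and mutes are kept as per-user sets of account IDs. The `Reply` type, `filter_replies`, and the data layout are hypothetical illustrations, not something from the original post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reply:
    author_id: int
    text: str


def filter_replies(replies, viewer_id, blocks, mutes):
    """Drop replies the viewer should not see.

    blocks: dict mapping user id -> set of user ids that user has blocked
    mutes:  dict mapping user id -> set of user ids that user has muted
    (both layouts are assumptions made for this sketch)
    """
    hidden = blocks.get(viewer_id, set()) | mutes.get(viewer_id, set())
    visible = []
    for reply in replies:
        if reply.author_id in hidden:
            continue  # viewer blocked or muted the author
        if viewer_id in blocks.get(reply.author_id, set()):
            continue  # author blocked the viewer
        visible.append(reply)
    return visible


replies = [Reply(2, "hi"), Reply(3, "spam"), Reply(4, "hello")]
blocks = {1: {3}, 4: {1}}  # user 1 blocked user 3; user 4 blocked user 1
mutes = {}
visible_texts = [r.text for r in filter_replies(replies, 1, blocks, mutes)]
```

The per-request cost is one set lookup per reply plus fetching the viewer's block and mute sets, which is why the total depends on how often people open reply threads.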

I'm fully aware that real Twitter has bajillions of features that I don't investigate, and you couldn't fit all of them on one machine. Many of them make up such a small fraction of load that you could still fit them. Others do indeed pose challenges, but ones similar to features I'd already discussed.


sayrer 1300 days ago

"web previews: I'd do this by making it the client's responsibility."

Actually a good example of how difficult the problem is. A very common attack is to switch a bit.ly link or something like that to a malicious destination. You would also DoS the hosts... as the Mastodon folks are discovering (https://www.jwz.org/blog/2022/11/mastodon-stampede/)

For blocks/mutes, you have to account for retweets and quotes, it's just not a fun problem.

Shipping the product is much more difficult than what's in your post. It's not realistic at all, but it is fun to think about.


sayrer 1300 days ago

Here are some pointers:

"Our approach to blocking links" https://help.twitter.com/en/safety-and-security/phishing-spa...

"The Infrastructure Behind Twitter: Scale" https://blog.twitter.com/engineering/en_us/topics/infrastruc...

"Mux" https://twitter.github.io/finagle/guide/Protocols.html#mux

I do agree that some of this could be done better a decade later (like, using Rust for some things instead of Scala), but it was all considered. A single machine is a fun thing to think about, but not close to realistic. CPU time was not usually the concern in designing these systems.


sayrer 1300 days ago

Here's the Twitter edge server from years ago: https://courses.cs.washington.edu/courses/cse551/15sp/notes/...


NavinF 1300 days ago

I'll go ahead and quote that blog post because they block HN users using the referer header.

---

"Federation" now apparently means "DDoS yourself." Every time I do a new blog post, within a second I have over a thousand simultaneous hits of that URL on my web server from unique IPs. Load goes over 100, and mariadb stops responding.

The server is basically unusable for 30 to 60 seconds until the stampede of Mastodons slows down.

Presumably each of those IPs is an instance, none of which share any caching infrastructure with each other, and this problem is going to scale with my number of followers (followers' instances).

This system is not a good system.

Update: Blocking the Mastodon user agent is a workaround for the DDoS. "(Mastodon|http\.rb)/". The side effect is that people on Mastodon who see links to my posts no longer get link previews, just the URL.

---
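
The workaround in the quoted update amounts to a pattern match on the User-Agent header. A minimal sketch in Python follows; the quote doesn't say how the blog's server applies the pattern, so the `is_blocked` helper and the sample User-Agent strings are illustrative assumptions.

```python
import re

# Pattern copied from the quoted blog post.
BLOCKED_UA = re.compile(r"(Mastodon|http\.rb)/")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent matches the quoted block pattern."""
    return BLOCKED_UA.search(user_agent) is not None


# Sample header values, assumed for illustration only.
mastodon_ua = "http.rb/5.1.0 (Mastodon/4.0.2; +https://example.social/)"
browser_ua = "Mozilla/5.0 (X11; Linux x86_64)"

mastodon_blocked = is_blocked(mastodon_ua)
browser_blocked = is_blocked(browser_ua)
```
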

I personally find this absolutely hilarious. Is that blog hosted on a Raspberry Pi or something? "Over a thousand" requests per second shouldn't even show up on the utilization graphs on a modern server. The comments suggest that he's hitting the database for every request instead of caching GET responses, but even with such a weird config a normal machine should be able to do over 10k/second without breaking a sweat.


ilyt 1300 days ago

> I personally find this absolutely hilarious. Is that blog hosted on a Raspberry Pi or something? "Over a thousand" requests per second shouldn't even show up on the utilization graphs on a modern server.

Mastodon is written in Ruby on Rails. That should really answer all your questions about the problem, but if you're unfamiliar: Ruby is slow compared to any compiled language, Rails is slow compared to near-every framework on the planet, and it isn't written that well either.


NavinF 1299 days ago

That makes sense, but I'm pretty sure jwz was whining about his blog getting DDoSed, not a Mastodon server.


vidarh 1300 days ago

While that may be funny, the number of Mastodon instances is growing rapidly, to the point where it will eventually need to be dealt with (not least because hosting on a Pi and having a badly optimized setup both happen in real life). But more to this example, it shows that passing preview responsibility to end-user clients is a far bigger problem. E.g., not many sites would be able to handle the onslaught of being linked to from a highly viral tweet if previews weren't cached.
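
To make the caching point concrete, here is a minimal sketch of a server-side preview cache, assuming previews are keyed by URL and expire after a fixed TTL. `PreviewCache`, `fake_fetch`, and the one-hour TTL are hypothetical; nothing here describes how Twitter or Mastodon actually implement previews.

```python
import time


class PreviewCache:
    """Cache link previews so each URL hits the origin at most once per TTL."""

    def __init__(self, fetch, ttl_seconds=3600.0, clock=time.monotonic):
        self._fetch = fetch      # callable: url -> preview (hypothetical)
        self._ttl = ttl_seconds  # assumed expiry, not taken from the thread
        self._clock = clock
        self._entries = {}       # url -> (expires_at, preview)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]      # fresh cache hit, origin is not contacted
        preview = self._fetch(url)
        self._entries[url] = (now + self._ttl, preview)
        return preview


origin_hits = []


def fake_fetch(url):
    """Stand-in for an HTTP fetch; records each request to the origin."""
    origin_hits.append(url)
    return {"url": url, "title": "example title"}


cache = PreviewCache(fake_fetch)
for _ in range(1000):            # a thousand viewers of the same link
    cache.get("https://example.com/post")
print(len(origin_hits))          # prints 1
```

This sketch is single-threaded. A real service would also need to coalesce concurrent misses for the same URL, otherwise a burst of simultaneous first requests could still reach the origin.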


dpkirchner 1300 days ago

FWIW, jwz uses referer checking to redirect links from HN ... for "DoS" reasons.


ilyt 1300 days ago

Well, Mastodon is criminally slow
