


	by ariskk 3391 days ago
	Hi, I am the author of the article. Thank you for taking the time to read it. The combined reach of the co-founders is very large, so being able to provably handle scale was an essential prerequisite. Additionally, the requirements of the platform extend well beyond a simple content server. Content performance is tracked in real time and fed to multiple ranking and recommendation models. Those models change frequently, so we need a way to retroactively process our data. Flexibility is key when trying to build an intelligent platform, so we decided early on to invest time in the ability to quickly iterate and experiment on algorithms, in real time, over live data. You are right that the API fleet could be implemented using the aforementioned technologies; we use Scala and thus decided to use Akka HTTP instead. The challenging part is how you manage state behind that.
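
A minimal sketch of what retroactive processing can look like, assuming the raw events are kept in an append-only log. This is illustrative Python, not the Scala pipeline described in the article; the event shape, the `replay` function, and both sets of weights are hypothetical.

```python
from collections import defaultdict

# Hypothetical append-only event log: (content_id, event_type) pairs.
EVENTS = [
    ("post-1", "view"),
    ("post-1", "like"),
    ("post-2", "view"),
    ("post-1", "view"),
    ("post-2", "like"),
    ("post-2", "like"),
]


def replay(events, weights):
    """Fold the whole log into per-content scores using the given weights."""
    scores = defaultdict(float)
    for content_id, event_type in events:
        # Event types the model does not know about contribute nothing.
        scores[content_id] += weights.get(event_type, 0.0)
    return dict(scores)


# Two hypothetical scoring models applied to the same history.
scores_v1 = replay(EVENTS, {"view": 1.0, "like": 1.0})
scores_v2 = replay(EVENTS, {"view": 0.5, "like": 3.0})
```

Because the log is never mutated, changing the model only means replaying the same events through a different function; no stored state has to be migrated.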

2 comments

aub3bhat 3391 days ago

Don't get me wrong: I am not claiming that a CQRS/stream-processing approach is useless for every application. Rather, it is unsuitable for this particular problem.

In my experience, all these features sound nice on paper, but you quickly run into practical issues that are far easier to handle when you know approximate information about the state.

E.g., developing a model? You might just want a subset or batch of the data. Doing BI/analytics? Are you going to continuously tax your server to recompute? The argument about recommender systems is also honestly flimsy. Having built and applied such systems to live traffic at very large scale (more than hundreds of millions of users), I have found there is only a small advantage in being able to quickly reconfigure flows. In most cases you have a single baseline model, which you compare against on a small fraction of the traffic. The real complexity and gains in recommender systems lie in the choice of algorithm, hyper-parameters, and features, not in continuous multi-armed bandits with 1000 different models applied simultaneously while waiting an infinite amount of time to produce any statistically meaningful answer. In fact, for a website like this one, recommender systems can only provide so much advantage.
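
The "single baseline, small fraction of traffic" setup can be sketched roughly as follows. This is a generic hash-based split, not a description of any particular production system; the function name `assign_model`, the 5% default, and the user-id format are all assumptions made for illustration.

```python
import hashlib

BUCKETS = 10_000  # resolution of the split (assumed)


def assign_model(user_id: str, candidate_fraction: float = 0.05) -> str:
    """Deterministically route a user to the baseline or the candidate model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto a bucket in [0, BUCKETS).
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    threshold = round(candidate_fraction * BUCKETS)
    return "candidate" if bucket < threshold else "baseline"


# Hypothetical user ids; the same id always gets the same assignment.
assignments = [assign_model(f"user-{i}") for i in range(10_000)]
candidate_share = assignments.count("candidate") / len(assignments)
```

Hashing the user id keeps each user on the same model across requests without storing any per-user state, which is usually what you want when comparing a candidate against the baseline.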

There are actually several really good specialized use cases, e.g. Google secmon-tools uses a system like this one [1].

[1] https://web.stanford.edu/class/cs259d/lectures/Session11.pdf


ariskk 3390 days ago

You mention the word "batch" when talking about models, and also "BI/Analytics". Since Django/Rails applications do not support either of the two, another sort of system would be needed. This is the point where, having built everything on Django with no foresight whatsoever about future requirements, we would have ended up creating DataFrames from SQL tables in Spark. Our BI guys have no experience with Spark, so we would need to load data into a DW-like solution such as BigQuery/Redshift/Impala/Presto/you-name-it. Instead of adding another sink in Flink, we would need to implement and schedule ETL jobs. Even at our current load, computing counters (e.g. likes) at read time would be slow and inefficient, which means we would need a way to pre-aggregate them. Maybe another service, possibly behind a queue? You can see where I am going. As requirements evolve, systems evolve, and with no planning beforehand, people end up with spaghetti architectures. We knew we were funded enough to run for a couple of years. We knew the site would have traffic. We were tasked with delivering an algorithmically driven product, and this is the solution we came up with.

I really do not understand how such a strong set of conclusions can be drawn out of so little information.
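
The read-time versus pre-aggregated counter trade-off mentioned above can be illustrated with a small in-memory sketch. The `LikeStore` class and its method names are hypothetical, and a real system would keep the log and the counters in separate stores rather than in one Python object.

```python
from collections import Counter


class LikeStore:
    """Toy store that keeps both the raw like events and a running count."""

    def __init__(self):
        self.events = []              # raw log of (user_id, post_id)
        self.like_counts = Counter()  # pre-aggregated count per post

    def record_like(self, user_id, post_id):
        # Write path: append the event and update the counter once.
        self.events.append((user_id, post_id))
        self.like_counts[post_id] += 1

    def count_at_read_time(self, post_id):
        # Read path A: scan every event on every read, O(number of events).
        return sum(1 for _, p in self.events if p == post_id)

    def count_preaggregated(self, post_id):
        # Read path B: a single lookup, O(1).
        return self.like_counts[post_id]


store = LikeStore()
for user, post in [("u1", "p1"), ("u2", "p1"), ("u3", "p2")]:
    store.record_like(user, post)
```

Both read paths return the same number; the difference is that the scan grows with the size of the log, while the pre-aggregated lookup moves that cost to write time.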


newman314 3391 days ago

Can you share some details around numbers and volume? "Very large" does not really convey why going down this route makes sense.


ariskk 3391 days ago

Unfortunately, I am not allowed to. The problem is that you cannot predict the volumes beforehand. 1K requests per second? 10K per second? Maybe 50K per second on special occasions? It is difficult to tell, especially when high-profile personalities are involved.

PS: we do have lots of load
