by alpos 1141 days ago
"We built a video stream processor by splitting every 1080p+, multi hour long, 30-60fps video into individual images and copying them across networks multiple times."

Not surprising that didn't go well. This strikes me as a punching bag example.

Anyone who has worked with images, video, 3D models, or even just really large blocks of text or numbers before (any kind of actually "big data") knows how much work goes into NOT copying the frames/files around unnecessarily, even in memory. Copying them across the network is just a completely naive first pass at implementing something like this.
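To make the in-memory point concrete, here is a minimal sketch using only the Python standard library. The frame size, the blank placeholder buffer, and the function names `top_half_view` and `top_half_copy` are my own assumptions for illustration, not anything from the article. It contrasts handing out a view onto an existing frame buffer with making a detached copy of the same region.

```python
WIDTH, HEIGHT, CHANNELS = 1920, 1080, 3  # assumed 1080p RGB, 1 byte per channel
ROW_BYTES = WIDTH * CHANNELS


def top_half_view(frame: bytearray) -> memoryview:
    """Return the top half of the frame without copying any pixel data."""
    return memoryview(frame)[: (HEIGHT // 2) * ROW_BYTES]


def top_half_copy(frame: bytearray) -> bytes:
    """Return the top half as a detached copy (allocates about 3 MB per call)."""
    return bytes(frame[: (HEIGHT // 2) * ROW_BYTES])


frame = bytearray(WIDTH * HEIGHT * CHANNELS)  # blank placeholder frame, about 6 MB
view = top_half_view(frame)
copy = top_half_copy(frame)

view[0] = 255  # writes through to the original buffer; the copy is unaffected
```

At 30-60 frames per second over multiple hours, the difference between the two functions is the difference between allocating nothing and allocating megabytes per frame, and that is before any network hop is involved.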

Video processing is very definitely a job where you want to bring the functions to the data. That is why graphics card APIs are built the way they are. You don't see OpenGL offering a ton of functions to copy the framebuffers into RAM so you can work on them there, only to copy them back to the video card. And if you did do that, you would quickly find out that you can be 10x to 100x more efficient by just learning compute shaders or OpenCL.

You could do this in a distributed fashion though, but it would have to look more like Hadoop jobs. I predict the final answer here, if they want to be reasonably fast as well, is going to be sending the videos to G4 instances and switching the detectors over to a shader language.

In general, if the data is much bigger than the code in bytes, move the code, not the data.

IO is almost always the most expensive part of any data processing job. If you're going to do highly scalable data processing, you need to be measuring how much time you spend on IO versus actually running your processing job, per record. That will make it dead obvious where you should spend your optimization efforts.
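A rough sketch of that measurement, assuming the job can be split into a load step and a process step per record. The names `Timings`, `profile_job`, `load_record`, and `process_record` are hypothetical, and the `clock` parameter exists only so the timer can be swapped out.

```python
import time
from dataclasses import dataclass


@dataclass
class Timings:
    io_seconds: float = 0.0
    compute_seconds: float = 0.0
    records: int = 0

    def io_fraction(self) -> float:
        """Share of total wall time spent fetching records rather than processing them."""
        total = self.io_seconds + self.compute_seconds
        return self.io_seconds / total if total > 0 else 0.0


def profile_job(load_record, process_record, keys, clock=time.perf_counter):
    """Run the job over keys, timing the load and process steps separately."""
    timings = Timings()
    for key in keys:
        start = clock()
        record = load_record(key)   # IO: disk, network, object store, ...
        loaded = clock()
        process_record(record)      # the actual processing work
        done = clock()
        timings.io_seconds += loaded - start
        timings.compute_seconds += done - loaded
        timings.records += 1
    return timings
```

If `io_fraction()` comes back close to 1.0, tuning the processing code will barely move the total; the time is going into moving the data.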

2 comments

To be fair, it is somewhat of a punching bag example, but I think what people are reacting to, though maybe not articulating well, is the presumption in favour of microservices by the powers-that-be.

Of course the only rational take on monoliths versus microservices is "use the right tool for the job".

But systems design interviews, FAANG, 'thought leaders', etc basically ignore this nuance in favour of something like the following.

Question: design pastebin (edit, I of course mean a URL shortener not pastebin)

Rational first pass but wrong Answer: Have a monolith that chucks the URL in the database.

Whereas the only winning answer is going to have a bunch of services, separate persistence and caching, a CDN, load balancing, replicas, probably a DNS and a service mesh chucked in for good measure.

I think this article shows that this is training and producing people who can't even think of the obvious first answer, because they have been so thoroughly indoctrinated.

I think the realtime requirement removes Hadoop as an option. They might have considered using HDFS as the data store instead of S3, since putting lots of objects into S3 is expensive. Or just using a big EFS volume instead of S3.

It would be nice to know how much latency there was in the microservice version vs the monolithic version.

You never get "realtime" in data processing. Actual realtime systems are a totally different animal. Mostly done in the embedded space, the design of a realtime processing system involves setting up fixed time windows for each task that needs compute time and optimizing the code for each task until it fits into the time window for it, on every execution, every time. This is done in order to provide hard guarantees on how fast a system can respond to new data flowing in. It's usually only safety critical systems that actually have such responsiveness and delivery time constraints.
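As a toy illustration of that time-window budgeting, assuming a simple fixed schedule where every task owns one slot per frame: the `Task` type, the function names, and all of the numbers in the usage note are made up for the example. Real systems establish worst-case execution times through analysis and measurement on the target hardware, not a Python check like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    window_ms: float  # fixed time slot reserved for this task in every frame
    wcet_ms: float    # worst-case execution time observed or proven for the task


def overruns(tasks):
    """Names of tasks whose worst case does not fit inside their own window."""
    return [t.name for t in tasks if t.wcet_ms > t.window_ms]


def schedule_is_feasible(tasks, frame_ms):
    """True only if every task fits its window and all windows fit in one frame."""
    total_window = sum(t.window_ms for t in tasks)
    return not overruns(tasks) and total_window <= frame_ms
```

With a hypothetical 10 ms frame, `[Task("read_sensor", 2.0, 1.5), Task("control_loop", 5.0, 6.0)]` is not feasible: `control_loop` can take 6 ms in a 5 ms window, so it has to be optimized until it fits, on every execution.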

I point this out because how we talk about a problem determines what solutions we even acknowledge as being on the table. Saying it's a realtime system when it isn't, or thinking we need realtime processing when we don't, makes people throw out solutions prematurely, and the thrown-out solutions are often the right answers.

Once you acknowledge that your system will not be "realtime" and you actually don't have the time-boxing and specific time window delivery constraints that actual realtime problem spaces have, you can weigh all of your actual options with an eye for what will be fastest and most efficient given the budget and hardware you have to throw at this problem.