| HN Mirror


by je42 2354 days ago

I think it would be helpful if you could dive deeper into why you think "Refreshing the data every five minutes in batches" is "sufficient".

From my perspective, batching is more complicated than streaming. (For example, batching requires you to define parameters like batch size and interval, while streaming does not.) Maybe batching tools are simpler than streaming tools, but I am not so sure.

Batching in general also has high(er) latency. That's why I usually don't prefer it unless:

That said, batching has an advantage over streaming: it can amortise a cost that you only pay once per batch process. With streaming you would pay that cost for each item as it arrives.
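To make the amortisation point concrete, here is a minimal sketch of a cost model. The function name, the cost figures, and the idea of treating streaming as a batch size of one are all assumptions for illustration, not measurements from any real system.

```python
def total_cost(n_items, batch_size, fixed_cost, per_item_cost):
    """Total cost when a fixed overhead is paid once per batch.

    Hypothetical model: fixed_cost stands for something like opening a
    connection or starting a job; per_item_cost is the work per record.
    """
    n_batches = -(-n_items // batch_size)  # ceiling division
    return n_batches * fixed_cost + n_items * per_item_cost


# Streaming modelled as batch_size=1: the fixed cost is paid for every item.
streaming = total_cost(1000, 1, fixed_cost=5, per_item_cost=1)

# Batching 100 items at a time: the fixed cost is paid only 10 times.
batched = total_cost(1000, 100, fixed_cost=5, per_item_cost=1)

print(streaming, batched)  # 6000 1050
```

The gap only matters when the fixed cost is large relative to the per-item work; with a fixed cost near zero the two totals converge.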

Further, the mindset requirements for engineers who work with batching are different from those for streaming.

Each of these items can be a valid concern for batching vs streaming. However, I find it difficult to value statements like "batching is the default because the industry has been doing this for years by default".

I think the industry as a whole benefits when engineers in these kinds of discussions spell out why certain conditions lead to a choice like batching.

1 comments

TeMPOraL 2354 days ago

> I think it would be helpful if you could dive deeper into why you think "Refreshing the data every five minutes in batches" is "sufficient".

Not OP, but I'm guessing because most of that data is not actionable in real time. There's zero point in getting real-time data to analysts or decision makers if they're not going to use it to make real-time decisions; arguably, it can even be counterproductive, leading to an organizational ADHD, where people fret over minute-to-minute changes when they should be focusing on daily or monthly running averages.


jacques_chester 2354 days ago

While they focus on the very-fast-updates thing, I think their technology will apply to batch cases also. In either streaming or batching I want to do the least possible work, and their claim is that they can skip a lot of unnecessary computations automatically.
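As a rough illustration of what "skipping unnecessary computations" can mean, here is a minimal sketch that keeps per-key totals up to date by applying only the new records instead of rescanning the full history. This is not their technology, just the general idea of incremental maintenance; the function names and sample data are hypothetical.

```python
from collections import defaultdict


def recompute_totals(records):
    """Full recomputation: scans every record each time it runs."""
    totals = defaultdict(int)
    for key, amount in records:
        totals[key] += amount
    return dict(totals)


def apply_delta(totals, new_records):
    """Incremental update: touches only the keys that changed."""
    for key, amount in new_records:
        totals[key] = totals.get(key, 0) + amount
    return totals


history = [("a", 10), ("b", 5), ("a", 7)]
totals = recompute_totals(history)

new = [("b", 3), ("c", 1)]
incremental = apply_delta(dict(totals), new)  # work proportional to len(new)
full = recompute_totals(history + new)        # work proportional to all records

assert incremental == full
```

The same trick works whether the new records arrive one at a time or as a nightly batch, which is why it is relevant to both cases.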

That said, I find that batch systems have enormous inertia due to simple don't-touch-it syndrome. A report got developed in 1992 for a manager who retired in 1998 and died in 2009. Each night it churns through 4 billion records in a twelve-way join that costs tens of thousands of dollars of computing time per year.

Who reads this report? Nobody. In fact, the person who asked for it read it two or three times and then stopped. But it's landed reliably in an FTP folder for 28 years and by god nobody is game to find out whether the CEO reads it religiously.
