| HN Mirror

by jerf 1470 days ago

Consider the whole data sector as:

    Generation ->
    Ingestion ->
    Transform (and possible looping back as derived data is created) ->
    Resting place ->
    Final Useful Product

Not saying that's a perfect model, just something to hang the terms I use in this post on.

And then consider that in order to get from the beginning of that process to the end, there is a certain amount of "Data Cleanup" to be done, ranging from merely validating that the data is sensible to, in the limit, literally handing huge blobs of text and unstructured data to humans and making them enter something useful into the system.

My assessment of the whole data community right now (and please, by all means react to this with your own opinions, I'm curious about them) is that the entire flush of fads going back and forth right now amounts to an argument about how exactly to distribute the necessary Data Cleanup work across that pipeline. The theoretical ideal is for everything to just be super awesome at the generation phase so that nothing else has to worry about it, but it was rapidly discovered that making the generation part that expensive inhibits the data from ever being generated. With clean data, downstream could use all sorts of database-y storage technologies and do all sorts of clever things, but the data is never clean.

The natural overreaction is to flip entirely in the other direction and just get it in and push the validation as far down the pipeline as possible. Here you get the "big piles of vaguely organized files". You get more data this way because you lower the costs of generation and to some extent ingestion, but you complicate everything downstream.

It seems to me we're currently in a phase where everyone is just sort of hoping somebody else will do it, and we're flailing around a bit.

Very opinionated: Where we're going to settle in, and where you can already see the shape forming up, is that it'll be a little mix & match at each level. Do what's easiest at each level, and you end up with the cheapest and most effective result across the pipeline considered as a whole, even though nobody working in any one part of it will be 100% happy. For instance, if Ingestion demands that the Generation at least be amenable to some tabular view, even if there are some escape hatches for generic JSON bits, then Ingestion can start operating with sensible tools (something SQL-ish like ClickHouse) instead of just having a pile of opaque nothingness. The next levels down still won't be able to count on data quality or coherence 100%, but they can start layering in cleanliness and coherence as they go, etc. There just isn't a magic solution that fits into bullet points cleanly.
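
To make the "tabular view with escape hatches" idea a bit more concrete, here's a rough sketch in Go. Everything in it is made up for illustration: the Event type, its column names, and the ParseEvent function are assumptions of mine, not anybody's real schema. The point is only that ingestion checks a handful of typed columns and passes the rest through as uninterpreted JSON for later stages to clean up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a hypothetical ingestion row: a few typed columns every producer
// must supply, plus an opaque JSON escape hatch for everything else.
type Event struct {
	Source    string          `json:"source"`
	Timestamp int64           `json:"timestamp"`       // Unix seconds (assumed convention)
	Kind      string          `json:"kind"`
	Extra     json.RawMessage `json:"extra,omitempty"` // kept as raw bytes, not interpreted here
}

// ParseEvent decodes one JSON record and validates only the tabular columns.
// Any deeper cleanup of Extra is deliberately left to later pipeline stages.
func ParseEvent(line []byte) (Event, error) {
	var e Event
	if err := json.Unmarshal(line, &e); err != nil {
		return Event{}, fmt.Errorf("not valid JSON: %w", err)
	}
	if e.Source == "" || e.Kind == "" || e.Timestamp <= 0 {
		return Event{}, fmt.Errorf("missing required column in %q", line)
	}
	return e, nil
}

func main() {
	rec := []byte(`{"source":"sensor-7","timestamp":1700000000,"kind":"reading","extra":{"temp_c":21.5}}`)
	e, err := ParseEvent(rec)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	// The typed columns are usable right away; Extra is still just bytes.
	fmt.Println(e.Source, e.Kind, e.Timestamp, string(e.Extra))
}
```

The three typed fields are the "handle" you can load into something SQL-ish; Extra is the escape hatch, and nobody at this level promises anything about what's inside it.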

(There's this "bronze/silver/gold" thing going on, which I think is silly because there's really not much benefit to trying to force an arbitrarily-deep and complicated pipeline into such classifications, but the idea is there.)

Or, in short, yes I expect to see more tabular data. It just won't be tabular for the same reason that relational DBs use tables. It'll be tables even fairly early just because you need some sort of handle on the data to do any sort of useful manipulation on it. If relational DBs use tables as an emphasis on tables qua tables of data, data lakes will use tables as defined handles on individual pieces of data to be able to manipulate them as opposed to pure unstructured piles of "something".

It reminds me of the 20+ year, still ongoing argument about where in the "Browser -> Server -> Backend Services (including DB)" stack the work needs to be done. There's a certain amount of work that has to be done. You've got a bajillion choices about where to do it, and it's been sloshing back and forth for the entire time the web has existed ("do it all in SQL procedures! Do it all on the client!") because there is no simple hard & fast correct answer that everyone can follow for every case.

Just as with that world, this reality won't stop a pile of vendors from promising they can somehow make this problem go away, but they really can't. They can reduce the accidental complexity, and that's cool and may be worth paying for, but there's essential complexity that isn't going anywhere.

(Stretching even more abstractly, I'm writing a bit about how to do stream processing with io.Reader in Go, and it reminds me a bit of that, too. In many cases stream processing is too complicated to write as a single-shot conversion from "whatever's coming in" to the golden data you're looking for, and the solution is to fold in several transforms, each comprehensible and testable, until you get what you need. The whole composed stream would be impossible to understand at once, but each piece can make sense. Trying by ideological fiat to jam it all into one piece, or forcing the wrong place to do something, is a recipe for disaster. You have to let the problem guide its own solution, or you'll end up wasting effort fighting to impose your beliefs on a system that doesn't care about them at all.)