| HN Mirror



	by agent281 1031 days ago
	I agree. Really seasoned data people are not common enough. Small companies need to buy services to lighten the load. We both seem to have a sense of the size of companies at different percentiles. At what percentile would you put your company with petabytes of data?


sanderjd 1031 days ago

Super hard to say, so ... 80th or 90th? With very low confidence.

But I do have very high confidence that the 99th percentile is much larger than petabytes (think: what's next after "exa"), and I believe that many companies these days crack into "peta" territory.

But as I saw another comment mention, I think another, probably more important, consideration besides size in bytes is cardinality and structure. So maybe this whole classification we're doing is kind of beside the point :)


agent281 1030 days ago

Yeah, it's hard to say with any certainty. I agree that the far end of the curve probably looks nothing like the "neighborhood" a couple percent away, relatively speaking.

I also agree that the variety of data plays a big part in its complexity. If you have a few petabytes of data but it's really only a handful of tables, you can really home in on the relationships. If it's a wide array of sources with many tables between them, then you have some nasty problems like entity resolution.

All happy data sets are alike; each unhappy data set is unhappy in its own way.


sanderjd 1030 days ago

> All happy data sets are alike; each unhappy data set is unhappy in its own way.

Ha, gonna steal that for some doc I write someday :)


agent281 1030 days ago

That's only fair: I stole it from Anna Karenina. :]

https://en.m.wikipedia.org/wiki/Anna_Karenina_principle#:~:t....


sanderjd 1030 days ago

Ha I know, I love that opener, despite it being super cliche to love it. Things are usually cliches for a good reason :)
