

	by jnewhouse 3358 days ago
	A standard database table isn't large enough to handle our large datasets. For example, the Hercules dataset was over 2 petabytes and even after optimization is almost 1 petabyte. Big data systems like Spark, Impala, Presto, etc. are designed to make the data look like a table, even though it is spread out across many files in a distributed filesystem. This is what we do. It's pretty common to reimplement some database features on top of these big data file formats. In our case we have very fast indexes that let us quickly fetch data, similar to an index on a PostgreSQL table.
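To make the "index over files" idea concrete, here is a minimal sketch. It is not the poster's actual system: the file layout (newline-delimited JSON), the `id` field, and the function names `build_file_with_index` and `fetch` are all assumptions made for illustration. The point is only that a small key-to-offset map lets a reader seek straight to one record instead of scanning the whole file.

```python
import json
import os
import tempfile


def build_file_with_index(path, records):
    """Write records as newline-delimited JSON and return {key: byte offset}.

    Assumes every record is a dict with a unique "id" field (hypothetical schema).
    """
    index = {}
    with open(path, "wb") as f:
        for rec in records:
            index[rec["id"]] = f.tell()  # offset where this record starts
            f.write(json.dumps(rec).encode("utf-8") + b"\n")
    return index


def fetch(path, index, key):
    """Seek directly to one record using the index, without scanning the file."""
    with open(path, "rb") as f:
        f.seek(index[key])
        return json.loads(f.readline())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_path = os.path.join(tmp, "records.jsonl")
        idx = build_file_with_index(
            data_path,
            [{"id": "a", "v": 1}, {"id": "b", "v": 2}, {"id": "c", "v": 3}],
        )
        print(fetch(data_path, idx, "b"))
```

In a real distributed setup the index would presumably also record which file (and which machine) holds each key, but the seek-instead-of-scan principle is the same.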


Alupis 3358 days ago

Well, you understand your system and requirements better than I, obviously, but...

    A standard database table isn't large enough to handle our large datasets

... isn't much of an answer as to why you're storing objects in your database.

As you already mentioned in your post, serialized objects are big - they contain all of their data, plus everything necessary to deserialize the object into something usable.

I imagine your objects have the standard mix of strings, characters, numbers, booleans, etc... why not just store those in the database and select them back out when needed? Less data in the database, and faster retrieval, since you skip serialization in both steps (storage and retrieval). Even if you have nested objects within nested objects, you can surely write out a "flat" version of the data to a couple of joined tables.
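A rough sketch of what that "flat" version could look like, using SQLite from the Python standard library. The `orders` / `order_items` schema and the helper names are hypothetical, chosen only to show one parent table joined to one child table:

```python
import sqlite3


def make_db():
    """Create an in-memory database with a parent table and a child table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute(
        "CREATE TABLE order_items ("
        "order_id INTEGER REFERENCES orders(id), sku TEXT, qty INTEGER)"
    )
    return conn


def store_flat(conn, record):
    """Write one nested record as a parent row plus one child row per item."""
    conn.execute(
        "INSERT INTO orders (id, customer) VALUES (?, ?)",
        (record["id"], record["customer"]),
    )
    conn.executemany(
        "INSERT INTO order_items (order_id, sku, qty) VALUES (?, ?, ?)",
        [(record["id"], item["sku"], item["qty"]) for item in record["items"]],
    )


def load_nested(conn, order_id):
    """Rebuild the nested record with a join; returns None if the id is unknown."""
    rows = conn.execute(
        "SELECT o.id, o.customer, i.sku, i.qty "
        "FROM orders o JOIN order_items i ON i.order_id = o.id "
        "WHERE o.id = ? ORDER BY i.rowid",
        (order_id,),
    ).fetchall()
    if not rows:
        return None
    return {
        "id": rows[0][0],
        "customer": rows[0][1],
        "items": [{"sku": sku, "qty": qty} for _, _, sku, qty in rows],
    }


if __name__ == "__main__":
    conn = make_db()
    store_flat(conn, {"id": 1, "customer": "acme",
                      "items": [{"sku": "a-1", "qty": 2}, {"sku": "b-7", "qty": 1}]})
    print(load_nested(conn, 1))
```

This works nicely on a single machine. Note that an order with no items would not come back from the inner join as written, so a real version would need a LEFT JOIN or a separate parent lookup.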

On the other hand, serializing the object is probably more "simple" to implement and use... but then you get the classical tradeoff of performance vs. convenience.

barrkel 3358 days ago

What's "the database" that you have in mind?

Start out with the idea that you have hundreds of machines in your cluster, with 1000s of TB of data. Suppose the current data efficiency is on the order of 80% - that is, 80% of the 1000s of TB is the actual bytes of the data fields. What database do you have in mind to store this data, still on the order of 1000s of TB?

You say: a couple of joined tables. So you have hundreds of machines, and the tables are not all going to fit on one machine; they're going to be scattered across hundreds of machines each. How do you efficiently do a join across two distributed tables?

It's no picnic.

If each row in one table only has a few related rows in the other table, it's much, much better to store the related data inline. Locality is key; you want data in memory right now, not somewhere on disk across the network.
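A toy illustration of the locality argument, under loudly stated assumptions: four simulated nodes, integer keys, and rows placed by `key % NUM_NODES`. None of this comes from a real system; `node_for`, `rows_moved_for_join`, and `store_inline` are made-up names. It counts how many child rows would have to cross the network to meet their parent row in a join, versus storing the children inline with the parent.

```python
NUM_NODES = 4  # assumed cluster size for the simulation


def node_for(key):
    """Toy placement rule: a row lives on the node given by key modulo cluster size."""
    return key % NUM_NODES


def rows_moved_for_join(children):
    """Count child rows that sit on a different node than their parent.

    children is a list of (child_id, parent_id) pairs. Child rows are placed
    by child_id and parent rows by parent_id, so a join on parent_id has to
    ship every mismatched child row across the network.
    """
    return sum(1 for child_id, parent_id in children
               if node_for(child_id) != node_for(parent_id))


def store_inline(parent_ids, children):
    """Nest each parent's children inside the parent record itself.

    The nested record is placed by parent_id, so reading a parent together
    with its children touches exactly one node and moves no rows.
    """
    nested = {pid: [] for pid in parent_ids}
    for child_id, parent_id in children:
        nested[parent_id].append(child_id)
    return nested


if __name__ == "__main__":
    parent_ids = list(range(8))
    # Three children per parent: "only a few related rows", as in the comment above.
    children = [(cid, cid // 3) for cid in range(24)]
    print("rows shipped for the join:", rows_moved_for_join(children))
    print("inline record for parent 2:", store_inline(parent_ids, children)[2])
```

Real engines can avoid some of this by co-partitioning both tables on the join key, but that only helps for the one key you partitioned on; inline storage gets the same locality without a join at all.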