by jnewhouse
3358 days ago
A standard database table can't handle datasets of our size. For example, the Hercules dataset was over 2 petabytes and, even after optimization, is almost 1 petabyte. Big data systems like Spark, Impala, and Presto are designed to make the data look like a table even though it is spread across many files in a distributed filesystem. That is what we do. It's pretty common to reimplement some database features on top of these big data file formats. In our case, we have very fast indexes that let us quickly fetch data, similar to an index on a PostgreSQL table.
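To make the index idea concrete, here is a toy sketch, not our actual system: it records the byte offset of each record in a flat file and then seeks straight to a record by key instead of scanning. The JSON-lines layout and the names `build_offset_index` and `fetch` are assumptions made up for this illustration.

```python
import io
import json


def build_offset_index(data_file, key_field):
    """Scan the file once and map each record's key to its byte offset."""
    index = {}
    data_file.seek(0)
    while True:
        offset = data_file.tell()
        line = data_file.readline()
        if not line:
            break
        record = json.loads(line)
        index[record[key_field]] = offset
    return index


def fetch(data_file, index, key):
    """Jump straight to the record for `key` without scanning the file."""
    data_file.seek(index[key])
    return json.loads(data_file.readline())


# Toy in-memory "file" standing in for one file in a distributed filesystem.
data_file = io.BytesIO()
for i in range(1000):
    record = {"id": i, "value": i * i}
    data_file.write((json.dumps(record) + "\n").encode("utf-8"))

index = build_offset_index(data_file, "id")
row = fetch(data_file, index, 742)
```

A real system would keep the index in its own compact structure and spread it across many files, but the lookup pattern is the same: pay for one scan up front, then answer each fetch with a single seek.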
As you already mentioned in your post, serialized objects are big - they contain all of their data, plus everything necessary to deserialize the object into something usable.
I imagine your objects hold the usual mix of strings, characters, numbers, booleans, etc. Why not just store those in the database and select them back out when needed? You'd have less data in the database and faster retrieval, since you skip serialization on both ends (storage and retrieval). Even if you have nested objects within nested objects, you can surely write out a "flat" version of the data to a couple of joined tables.
On the other hand, serializing the object is probably simpler to implement and use, but then you get the classic tradeoff of performance vs. convenience.
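Here is a rough sketch of both approaches side by side, using Python's `sqlite3` and `pickle`. The `orders` / `line_items` / `order_blobs` schema and the helper names are invented for illustration, since I don't know what your objects actually look like.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, paid INTEGER);
    CREATE TABLE line_items (order_id INTEGER, sku TEXT, quantity INTEGER);
    CREATE TABLE order_blobs (order_id INTEGER PRIMARY KEY, payload BLOB);
""")


def save_flat(conn, order):
    """Write the nested object out as plain columns in two joined tables."""
    conn.execute(
        "INSERT INTO orders VALUES (?, ?, ?)",
        (order["order_id"], order["customer"], int(order["paid"])),
    )
    conn.executemany(
        "INSERT INTO line_items VALUES (?, ?, ?)",
        [(order["order_id"], i["sku"], i["quantity"]) for i in order["items"]],
    )


def load_flat(conn, order_id):
    """Rebuild the nested object from the two tables."""
    customer, paid = conn.execute(
        "SELECT customer, paid FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT sku, quantity FROM line_items WHERE order_id = ? ORDER BY rowid",
        (order_id,),
    ).fetchall()
    items = [{"sku": sku, "quantity": qty} for sku, qty in rows]
    return {"order_id": order_id, "customer": customer, "paid": bool(paid), "items": items}


def save_blob(conn, order):
    """The 'convenient' route: serialize the whole object into one column."""
    conn.execute(
        "INSERT INTO order_blobs VALUES (?, ?)",
        (order["order_id"], pickle.dumps(order)),
    )


def load_blob(conn, order_id):
    (payload,) = conn.execute(
        "SELECT payload FROM order_blobs WHERE order_id = ?", (order_id,)
    ).fetchone()
    return pickle.loads(payload)


order = {
    "order_id": 1,
    "customer": "alice",
    "paid": True,
    "items": [{"sku": "A-100", "quantity": 2}, {"sku": "B-200", "quantity": 1}],
}
save_flat(conn, order)
save_blob(conn, order)
flat_copy = load_flat(conn, 1)
blob_copy = load_blob(conn, 1)
```

Both routes give you the same object back. The flat version takes more code, but the columns stay queryable (you can filter on `sku` directly in SQL), while the blob is opaque to the database.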