by Boxxed 906 days ago
One thing I'm confused about: why does Iceberg need a Spark deployment to function? Or am I wrong about that? I would rather avoid that ecosystem if I can.
2 comments

You don't need a Spark deployment. The first reference implementations for reading and writing were in Spark.

Now, with PyIceberg, there is read support in Python. Write support should be merged very soon (https://github.com/apache/iceberg-python/pull/41), so before long you will be able to both read and write Iceberg tables in Python. I look forward to doing data transformations in Polars for data of reasonable scale (up to 100GB or so) and writing to Iceberg tables with PyIceberg. No Spark.

Well, what about other languages? Does every language need bindings or a re-implementation? (I.e., are Iceberg tables written and queried in-process, as opposed to via a network API?)
It tends to be a matter of library dependencies rather than live clusters.

A lot of data lakes are managed using Hadoop and Spark, so I think it’s just an artefact of that.

In the end, I can’t see why you wouldn’t be able to create and manage Iceberg files directly from standard Python/JS/Java code without that legacy.
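
To make the "library, not cluster" point concrete, here is a rough sketch using only the Python standard library. It relies on the fact that an Iceberg table's top-level metadata is a JSON file, so any language with a JSON parser can take the first step of a read in-process. The metadata below is a hypothetical, heavily trimmed example (real files carry schemas, partition specs, and more), and the bucket path and the `current_manifest_list` helper are made up for illustration. A real reader would still have to parse the Avro manifest files and the Parquet data files, which is the part libraries like PyIceberg handle.

```python
import json
import tempfile
from pathlib import Path


def current_manifest_list(metadata_path):
    """Return the manifest-list location of the current snapshot, or None.

    Hypothetical helper: reads only a small subset of the table metadata.
    """
    metadata = json.loads(Path(metadata_path).read_text())
    current_id = metadata.get("current-snapshot-id")
    for snapshot in metadata.get("snapshots", []):
        if snapshot["snapshot-id"] == current_id:
            return snapshot["manifest-list"]
    return None


# Hypothetical, trimmed-down table metadata (assumed values, not a real table).
example_metadata = {
    "format-version": 2,
    "location": "s3://example-bucket/warehouse/events",
    "current-snapshot-id": 2,
    "snapshots": [
        {
            "snapshot-id": 1,
            "manifest-list": "s3://example-bucket/warehouse/events/metadata/snap-1.avro",
        },
        {
            "snapshot-id": 2,
            "manifest-list": "s3://example-bucket/warehouse/events/metadata/snap-2.avro",
        },
    ],
}

# Write the metadata to a temporary file and read it back, with no cluster
# or network service involved.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "v2.metadata.json"
    path.write_text(json.dumps(example_metadata))
    manifest_list = current_manifest_list(path)

print(manifest_list)
```

Everything here runs in a single process against files. The catalog (whatever tracks which metadata file is current) is the one piece that may sit behind a network service, depending on the setup.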