by reggieband 1852 days ago
I am absolutely no expert in any of these domains but the level of confusion described in these comments seems a little exaggerated. Is it so hard to see what is going on here?

A data lake is a centralized repository where all of a company's data is aggregated. This allows analysts to perform queries against a single data source (often masquerading as a SQL database) rather than against hundreds of distinct databases (which may be a hodgepodge of NoSQL, SQL, custom REST APIs, etc.). These "data lakes" often grow to a massive size since they typically include not only your application data (usually batch replicated from prod databases on some schedule, or in some cases streamed directly) but also data from external sources (e.g. a feed from your payment processor, compressed events from your app/website analytics, server logs, marketing and advertising sources).

Storing and processing that volume of data efficiently is a difficult task. Many companies decide to just dump that data in a raw format into cloud storage services like AWS S3. Third parties then built SQL-like interfaces that run on top of S3 (or connectors from S3 into other familiar tools like Spark). This keeps storage costs low while letting data analysts use tools they are already very familiar with. This way of handling large volumes of data stored for analysis has become very popular.
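
To make the "SQL on top of raw files" idea concrete, here is a toy sketch in Python. It is not how any particular product works: real engines query the files in place and at scale, whereas this just loads a couple of CSV blobs into an in-memory SQLite database. The names `RAW_OBJECTS` and `build_sql_facade`, and all of the data, are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw "objects" as they might sit in a bucket: plain CSV text.
RAW_OBJECTS = {
    "page_views": "path,load_ms\n/home,120\n/checkout,340\n/home,180\n",
    "payments": "order_id,amount_cents\n1001,2500\n1002,990\n",
}


def build_sql_facade(raw_objects):
    """Load each raw CSV object into an in-memory SQLite table of the same name."""
    conn = sqlite3.connect(":memory:")
    for table, text in raw_objects.items():
        rows = list(csv.reader(io.StringIO(text)))
        header, data = rows[0], rows[1:]
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', data)
    return conn


# The analyst only ever sees SQL, not the underlying files.
conn = build_sql_facade(RAW_OBJECTS)
avg_home = conn.execute(
    "SELECT AVG(CAST(load_ms AS INTEGER)) FROM page_views WHERE path = '/home'"
).fetchone()[0]
```

The point is only that the analyst writes ordinary SQL while the storage underneath stays a pile of raw files.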

But now that you have so much data stored in S3 you might start to wonder how you can control access to it. An analyst doing queries on website performance might not require access to the payment processing data. Your security team might point out that your growing analyst team has more access to sensitive company data than is required. As you negotiate big corporate deals, the other company's security team might start to red-flag unnecessary access to data (or ask you for your policies governing access to that data and how those policies are enforced).

This product seems to allow finer control over access to data stored in these kinds of data lakes. In the same way a bunch of tools appeared to create a SQL-like facade on top of the data, this tool creates a facade on top of data access control.
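
As a rough illustration of what an access-control facade could look like, here is a minimal Python sketch. I have no idea how the actual product implements this. The `POLICY` table, the role names, the `read_dataset` function and the sample rows are all hypothetical.

```python
# Hypothetical policy: which datasets and columns each role may read.
POLICY = {
    "web_analyst": {"page_views": {"path", "load_ms"}},
    "finance_analyst": {
        "page_views": {"path"},
        "payments": {"order_id", "amount_cents"},
    },
}

# Toy stand-in for data sitting in the lake.
LAKE = {
    "page_views": [{"path": "/home", "load_ms": 120}],
    "payments": [{"order_id": 1001, "amount_cents": 2500, "card_last4": "4242"}],
}


def read_dataset(role, dataset):
    """Return rows from `dataset`, restricted to the columns `role` may see."""
    allowed = POLICY.get(role, {}).get(dataset)
    if allowed is None:
        # No entry in the policy means no access at all.
        raise PermissionError(f"{role!r} may not read {dataset!r}")
    return [
        {col: val for col, val in row.items() if col in allowed}
        for row in LAKE[dataset]
    ]
```

Here the website analyst gets a `PermissionError` when asking for `payments`, and the finance analyst sees payment rows without the `card_last4` column. Every read goes through one function, so the policy is enforced in one place rather than per database.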

Not only is what they are doing completely understandable from a quick skim of the article, it also seems totally necessary. I have no doubt this is a massive market and this product has every chance to serve a real need.