	by mona_rakibe 1418 days ago
	As databases evolved and migrated to the cloud, data models changed too. Most modern cloud data warehouses and data lakes now support semi-structured schemas. Today pretty much any big data platform can work with NDJSON (newline-delimited JSON), where each line is a valid JSON object representing one record. Some even support JSON natively as a data type: [https://cloud.google.com/bigquery/docs/reference/standard-sq...] This gives data architects a great opportunity to design the most efficient data model for storage and querying. At the same time, it creates a challenge for data observability, and the reason is multi-value fields, aka arrays. Such data needs special logic to correctly calculate basic data quality KPIs like completeness and uniqueness, because some records have no values while others have multiple values per record. This blog highlights how we address monitoring for semi-structured data.
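
To make the array problem concrete, here is a minimal Python sketch of one possible way to compute these KPIs over NDJSON. It is not the method from the blog, which the post does not describe. The sample data, the `tags` field, the `array_field_kpis` helper, and both metric definitions are assumptions for illustration: completeness is taken as the share of records with at least one non-null element, and uniqueness as distinct elements divided by total elements after flattening.

```python
import json

# Hypothetical NDJSON sample: one JSON object per line; "tags" is an array field.
NDJSON = """\
{"id": 1, "tags": ["a", "b"]}
{"id": 2, "tags": []}
{"id": 3}
{"id": 4, "tags": ["a", null]}
"""

def array_field_kpis(ndjson_text, field):
    """Return (record_completeness, element_uniqueness) for an array field.

    Assumed definitions (not taken from the post):
      completeness = records with >= 1 non-null element / all records
      uniqueness   = distinct non-null elements / all non-null elements
    Elements are assumed to be hashable scalars (strings, numbers).
    """
    records = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
    populated = 0
    elements = []
    for record in records:
        # A missing field, a null, and an empty array all count as "no values".
        values = record.get(field) or []
        if not isinstance(values, list):
            values = [values]  # treat a scalar as a one-element array
        non_null = [v for v in values if v is not None]
        if non_null:
            populated += 1
        elements.extend(non_null)  # flatten across records
    completeness = populated / len(records) if records else 0.0
    uniqueness = len(set(elements)) / len(elements) if elements else 0.0
    return completeness, uniqueness

completeness, uniqueness = array_field_kpis(NDJSON, "tags")
print(f"completeness={completeness:.2f} uniqueness={uniqueness:.2f}")
```

The two metrics use different denominators: completeness is counted per record, while uniqueness is counted per flattened element. Mixing them up is the kind of mistake that arrays make easy, for example by counting a record with three values three times in a completeness check.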