Hacker News
by nerdponx 965 days ago
All of this sounds very appealing, both as a standalone tool and as a complement to something like dbt.

However the following seems like kind of an anti-feature, which I at least would want the option to disable:

> You do not need to worry about the structure of a database or parquet files

> dlt will create a nice, typed schema out of your data and will migrate it when the data changes. You can put some data contracts and Pydantic models on top to keep your data clean.

This is the opposite of what I want in 99% of projects. Most of the time, there is some kind of well-defined schema, even if it changes a little bit over time. If that schema is going to be depended upon by something like a data warehouse ELT pipeline, I want precise control over it. I do not want to hand that off to an opaque library.

Moreover, the work of actually writing out the schema is like 1% of the overall effort in consuming a new data source, and usually it turns out to be a constructive, useful exercise in pinning down assumptions, finding gaps in understanding, etc. So I see little benefit in hiding it.

A schema essentially forms a business-critical contract between two major sections of the overall data pipeline, and that is absolutely not something I want to be changing dynamically without my explicit understanding and consent.

This reminds me of the temptation I have seen in some developers (several of them ostensibly "senior") to use MongoDB for a straightforward CRUD-like application. The argument is that it's schema-less, which to me is a striking anti-feature, something I explicitly do not want!

The only time I really want this is in the rare and atypical case where I truly have no schema at all, or the schema is changing erratically and frequently in ways that I cannot reasonably anticipate and/or cannot dedicate developer resources to accommodating. That's a niche case that most people flatly do not have. Of course it's nice when a tool supports the niche use case that is very hard to deal with by conventional means (see also: OpenRefine), but it should absolutely not be the default and our tools should not encourage us to lie to ourselves that it's something we want or need.

If you just want to reduce manual grunt work, consider something like generating a schema from an OpenAPI specification / JSON Schema.
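
A minimal sketch of what I mean, assuming a flat JSON Schema object with only primitive property types. `TYPE_MAP`, `columns_from_json_schema` and the example schema are all made up for illustration; none of this is dlt's API or any other library's:

```python
import json

# Hypothetical mapping from JSON Schema primitive types to warehouse column types.
TYPE_MAP = {
    "string": "text",
    "integer": "bigint",
    "number": "double",
    "boolean": "bool",
}


def columns_from_json_schema(schema: dict) -> dict:
    """Turn a flat JSON Schema object into explicit column definitions."""
    required = set(schema.get("required", []))
    columns = {}
    for name, spec in schema.get("properties", {}).items():
        json_type = spec.get("type")
        if not isinstance(json_type, str) or json_type not in TYPE_MAP:
            # Fail loudly rather than guess: nested, union or unknown
            # types need a human decision.
            raise ValueError(f"unsupported type for column {name!r}: {json_type!r}")
        columns[name] = {
            "data_type": TYPE_MAP[json_type],
            "nullable": name not in required,
        }
    return columns


# Example input, invented for this sketch.
USER_SCHEMA = json.loads("""
{
  "type": "object",
  "required": ["id", "email"],
  "properties": {
    "id": {"type": "integer"},
    "email": {"type": "string"},
    "score": {"type": "number"}
  }
}
""")

if __name__ == "__main__":
    for name, column in columns_from_json_schema(USER_SCHEMA).items():
        print(name, column)
```

The output is a static artifact you can review and commit, so any change to the schema shows up as a diff instead of happening silently at load time.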

2 comments

ahh good old manual fine tuning and maintenance. We are adding data contracts for things like event ingestion where the schema needs to be strict, or cases where you know ahead of time what to expect.

Our experience comes from startups that usually do not have time to track down the knowledge and would rather go out and find/make their own. Here you definitely want evolution with alerts before curation - so load to raw, and curate from there. Picking data out of something without a schema is called "schema on read" and you can read about its shortcomings. So this is both robust and practical.

For the fine tuning, as I mentioned, data contracts are a PR review and some tweaks away. They will be highly configurable between strict, rule based evolution, or free evolution. Definitely use alerts for curation of evolution events!
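
To make the three modes concrete, here is a rough sketch in plain Python. It is not dlt's API; the mode names (`"strict"`, `"rules"`, `"free"`), the `apply_contract` function and the `utm_` rule are assumptions made up for illustration:

```python
from typing import Callable, Optional


def apply_contract(
    schema: dict,
    row: dict,
    mode: str,
    rule: Optional[Callable[[str, type], bool]] = None,
):
    """Return (row_to_load, evolved_schema, alerts) for one incoming row."""
    # Columns present in the row but unknown to the current schema.
    new_columns = {k: type(v) for k, v in row.items() if k not in schema}
    if not new_columns:
        return row, schema, []

    if mode == "strict":
        # Strict: any unknown column is an error.
        raise ValueError(f"contract violation, unknown columns: {sorted(new_columns)}")
    if mode == "rules":
        # Rule based: only columns accepted by the rule may be added.
        allowed = {k: t for k, t in new_columns.items() if rule is not None and rule(k, t)}
    elif mode == "free":
        # Free evolution: every new column is added.
        allowed = new_columns
    else:
        raise ValueError(f"unknown mode {mode!r}")

    evolved = {**schema, **allowed}
    loaded = {k: v for k, v in row.items() if k in evolved}
    # Every evolution event produces an alert, so curation can follow.
    alerts = [f"added column {k} ({t.__name__})" for k, t in allowed.items()]
    alerts += [f"dropped column {k}" for k in new_columns if k not in allowed]
    return loaded, evolved, alerts


if __name__ == "__main__":
    schema = {"id": int}
    row = {"id": 1, "utm_source": "hn", "debug": True}
    print(apply_contract(schema, row, "rules", rule=lambda k, t: k.startswith("utm_")))
```

The point of returning the alerts alongside the evolved schema is that evolution never happens silently, whichever mode is chosen.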

Fair enough, especially if explicit alerting is involved.

Have you considered a hybrid solution, something that generates a contract from a large corpus of data, which can then be deployed statically?

I consider "responding to change" as a somewhat different scenario from "heterogeneous but not changing". So statically generating a contract from an existing corpus supports the latter.

I could also envision some kind of graceful degradation, where you have a static contract, but you have dynamic adjustments instead of outright failures if the data does not conform to that contract.
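
Roughly what I have in mind, as a stdlib-only sketch. The function names, the contract layout and the sample corpus are all hypothetical; a real version would need nested types, sampling, and so on:

```python
import json
from collections import defaultdict


def infer_contract(corpus: list) -> dict:
    """Infer a static contract from an existing corpus of flat records."""
    seen = defaultdict(set)
    counts = defaultdict(int)
    for record in corpus:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
            counts[key] += 1
    total = len(corpus)
    return {
        key: {"types": sorted(types), "required": counts[key] == total}
        for key, types in seen.items()
    }


def load_with_degradation(contract: dict, record: dict):
    """Split a record into conforming fields and quarantined ones.

    Instead of failing outright, fields that break the contract are
    set aside, and missing required fields are reported.
    """
    clean, quarantined = {}, {}
    for key, value in record.items():
        spec = contract.get(key)
        if spec is not None and type(value).__name__ in spec["types"]:
            clean[key] = value
        else:
            quarantined[key] = value
    missing = [k for k, spec in contract.items() if spec["required"] and k not in record]
    return clean, quarantined, missing


# Sample corpus, invented for this sketch.
CORPUS = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b", "age": 30},
    {"id": 3, "name": None},
]

if __name__ == "__main__":
    contract = infer_contract(CORPUS)
    # The contract is plain JSON, so it can be reviewed and deployed statically.
    print(json.dumps(contract, indent=2, sort_keys=True))
    print(load_with_degradation(contract, {"id": "4", "name": "d", "extra": 1}))
```

Inference runs once, offline, over the corpus; at load time the contract is fixed and only the handling of violations is dynamic.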

I worked with the dlt guys on exactly that: using OpenAI functions to generate a schema for the data based on the raw data structure. You can check that work here: https://github.com/topoteretes/PromethAI-Memory (it's in the level 1 folder).

we actually spent several weeks writing an openAPI -> dlt pipeline converter. you can check what we've got here: https://github.com/dlt-hub/dlt-init-openapi

we'll continue this project but I learnt from it that most of the openAPI specs are a mess with hundreds of endpoints, incomplete definitions, lack of relations between endpoints, unique constraints etc. so there's tons of heuristics needed anyway. but sometimes it works. and is quite amazing!

if your source has a well-defined schema, we support e.g. arrow tables natively. we keep 100% of that schema: https://dlthub.com/docs/blog/dlt-arrow-loading

if you want to define your own schemas you can do it in many different ways:
- via pydantic models: https://dlthub.com/docs/general-usage/resource#define-a-sche...
- via json-schema like definitions: https://dlthub.com/docs/general-usage/resource#define-schema
- in a schema file: https://dlthub.com/docs/walkthroughs/adjust-a-schema

if you want to enforce schema and data contracts:
- you can use pydantic models to validate data (if you use a pydantic model as a table definition, this is the default)
- we have a soon-to-be-merged schema contract PR: https://github.com/dlt-hub/dlt/pull/594

My observation is that more than 1% of people are fine with auto-generated schemas. But that could be selection bias (they use our library because they like it).