| All of this sounds very appealing, both as a standalone tool and as a complement to something like dbt. However, the following seems like kind of an anti-feature, which I at least would want the option to disable: > You do not need to worry about the structure of a database or parquet files > dlt will create a nice, typed schema out of your data and will migrate it when the data changes. You can put some data contracts and Pydantic models on top to keep your data clean. This is the opposite of what I want in 99% of projects. Most of the time, there is some kind of well-defined schema, even if it changes a little bit over time. If that schema is going to be depended upon by something like a data warehouse ELT pipeline, I want precise control over it. I do not want to hand that off to an opaque library. Moreover, the work of actually writing out the schema is like 1% of the overall effort in consuming a new data source, and usually it turns out to be a constructive, useful exercise in pinning down assumptions, finding gaps in understanding, etc. So I see little benefit in hiding it. A schema essentially forms a business-critical contract between two major sections of the overall data pipeline, and that is absolutely not something I want to be changing dynamically without my explicit understanding and consent. This reminds me of the temptation I have seen in some developers (several of them ostensibly "senior") to use MongoDB for a straightforward CRUD-like application. To me, being schema-less is a striking anti-feature, something I explicitly do not want! The only time I really want this is in the rare and atypical case where I truly have no schema at all, or the schema is changing erratically and frequently in ways that I cannot reasonably anticipate and/or cannot dedicate developer resources to accommodating. That's a niche case that most people flatly do not have. 
Of course, it's nice when a tool supports the niche use case that is very hard to deal with by conventional means (see also: OpenRefine), but it should absolutely not be the default, and our tools should not encourage us to lie to ourselves that it's something we want or need. If you just want to reduce manual grunt work, consider something like generating a schema from an OpenAPI specification / JSON Schema. |
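To make that last suggestion concrete, here is a rough sketch of generating an explicit table definition from a JSON Schema document. It assumes a flat object schema where every property has a single primitive type. The type mapping, the function name `ddl_from_json_schema`, and the `orders` schema are all made up for illustration and are not part of any particular tool.

```python
# Hypothetical mapping from JSON Schema primitive types to SQL column types.
JSON_TO_SQL = {
    "string": "TEXT",
    "integer": "BIGINT",
    "number": "DOUBLE PRECISION",
    "boolean": "BOOLEAN",
}


def ddl_from_json_schema(table: str, schema: dict) -> str:
    """Render a CREATE TABLE statement from a flat JSON Schema object."""
    required = set(schema.get("required", []))
    columns = []
    for name, spec in schema["properties"].items():
        json_type = spec.get("type")
        if not isinstance(json_type, str) or json_type not in JSON_TO_SQL:
            # Fail loudly instead of guessing: nested, union, or unknown
            # types need an explicit human decision.
            raise ValueError(f"unsupported type {json_type!r} for column {name!r}")
        nullability = " NOT NULL" if name in required else ""
        columns.append(f"    {name} {JSON_TO_SQL[json_type]}{nullability}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(columns) + "\n);"


# Made-up example schema, for illustration only.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "integer"},
        "customer_email": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["order_id", "total"],
}

if __name__ == "__main__":
    print(ddl_from_json_schema("orders", ORDER_SCHEMA))
```

The generated DDL can then be reviewed and committed like any other schema change, so the schema stays explicit while the typing grunt work is automated.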
Our experience comes from startups that usually do not have time to track down that knowledge and would rather go out and find or make their own. Here you definitely want evolution with alerts before curation: load to raw, and curate from there. Picking data out of something without a schema is called "schema on read", and you can read about its shortcomings. So evolving the schema at load time, with alerts, and curating afterwards is both robust and practical.
For the fine-tuning, as I mentioned, data contracts are a PR review and some tweaks away. They will be highly configurable between strict, rule-based evolution, or free evolution. Definitely use alerts for curation of evolution events!
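As a minimal sketch of what those three modes could mean in practice, here is a library-free illustration. It is not dlt's API: the function `apply_contract`, the mode names, and the `meta_` prefix rule are assumptions invented for this example.

```python
from __future__ import annotations

from typing import Any

MODES = ("strict", "rules", "free")


def apply_contract(
    schema: dict[str, str],
    row: dict[str, Any],
    mode: str,
    allowed_prefixes: tuple[str, ...] = ("meta_",),
) -> tuple[dict[str, str], list[str]]:
    """Check one row against a schema; return (new schema, alerts).

    strict: any unknown column is an error.
    rules:  unknown columns are accepted only if they match an allowed prefix.
    free:   every unknown column is added to the schema.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}")
    new_schema = dict(schema)  # never mutate the caller's schema
    alerts: list[str] = []
    for column, value in row.items():
        if column in new_schema:
            continue
        if mode == "strict":
            raise ValueError(f"contract violation: unexpected column {column!r}")
        if mode == "rules" and not column.startswith(allowed_prefixes):
            raise ValueError(f"contract violation: {column!r} matches no rule")
        # Naive type inference from the Python value, for illustration only.
        new_schema[column] = type(value).__name__
        alerts.append(f"schema evolved: added column {column!r} ({new_schema[column]})")
    return new_schema, alerts


if __name__ == "__main__":
    schema = {"id": "int"}
    evolved, alerts = apply_contract(schema, {"id": 1, "meta_source": "api"}, "rules")
    for alert in alerts:
        print(alert)  # in a real pipeline this would go to a notification channel
```

Every accepted change produces an alert, so even free evolution leaves a trail that someone can review when curating from the raw layer.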