Hacker News
by snidane 1460 days ago
The article is not explaining the point, which I believe is: type your dicts if you want to provide strict guarantees to your downstream about data shape.

If you know precisely what the data is used for, great, go ahead: the type system is your friend.

If you don't know how the data should be used, it's often a different story. Wrapping data in hand-typed classes is a terrible idea in typical data engineering scenarios, where there might be hundreds of these API endpoints, any of which might change as the upstream sees fit. A perfect way to piss off your downstream users is to keep telling them "sorry, the data is not available because I overspecified the data type and now it failed on a TypeError again".

Usually the downstream is the domain expert: they know which fields should be used, but they don't know which ones until they start using the data. Typically the best way is to pass ALL the upstream data down, materialize extra fields, and NOT modify any existing field names, even when you think you're super smart and know better than the domain experts. Too often a "smart" engineer thought he knew better and included only some fields, only for it to be realized later that the data source contained many more gold nuggets, and it was never documented that these had been cleverly dropped.
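A minimal sketch of that pass-through approach in Python. The record layout and the field names (`price_cents`, `qty`, `total_cents`) are made up for illustration; the point is only that every upstream key survives under its original name and derived values are added alongside.

```python
from typing import Any


def enrich(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the upstream record with derived fields added.

    No upstream key is dropped or renamed; unknown keys pass through untouched.
    """
    out = dict(record)  # shallow copy that keeps ALL upstream fields
    # Materialize an extra field only when its inputs are present.
    if "price_cents" in record and "qty" in record:
        out["total_cents"] = record["price_cents"] * record["qty"]
    return out


# Hypothetical upstream payload, including a field nobody documented.
upstream = {"id": 7, "price_cents": 250, "qty": 3, "undocumented_flag": True}
row = enrich(upstream)
```

Downstream still sees `undocumented_flag`, and a record that lacks the inputs for the derived field is passed along unchanged instead of failing.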

5 comments

Another option besides types is using a schema library. You can do more things, like define custom validation rules over e.g. several fields, publish the schema as data (e.g. at an API endpoint, as OpenAPI or JSON Schema, etc.), reuse it in another language (depending on the schema system), version it explicitly, and machine-generate it if it comes from some external spec (like a DB schema).
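To make "schema as data" concrete, here is a hand-rolled sketch using only the standard library. It stands in for a real schema library and is not any particular library's API; the schema layout (`required`, `ordered`) and the `validate` helper are assumptions invented for this example.

```python
import json
from typing import Any

# Hypothetical schema, kept as plain data so it can be versioned and published.
SCHEMA = {
    "version": 1,
    "required": {"id": "int", "start": "str", "end": "str"},
    # Rule over several fields: each pair [a, b] means record[a] <= record[b].
    "ordered": [["start", "end"]],
}

TYPES = {"int": int, "str": str}


def validate(record: dict[str, Any], schema: dict[str, Any] = SCHEMA) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for key, type_name in schema["required"].items():
        if key not in record:
            errors.append(f"missing field: {key}")
        elif not isinstance(record[key], TYPES[type_name]):
            errors.append(f"{key} should be {type_name}")
    if not errors:  # only check cross-field rules on well-formed records
        for low, high in schema["ordered"]:
            if record[low] > record[high]:
                errors.append(f"{low} must not be after {high}")
    return errors


published = json.dumps(SCHEMA)  # the schema itself can be served as JSON
```

Because the schema is just data, the same document could in principle be read by a validator written in another language, which a class definition cannot offer.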

Also great for property testing / fuzzing. And other fun meta datamodel stuff like eg inferring schema from example data.

In general, programming language type systems are pretty weak in comparison because they're not very programmable. (In most languages, for most people, etc. There are fancy type systems approaching formal proof toolkits, but they're hard to use.)

This sounds specific to a particular company's organization where there are at least three different systems involved and no single source of truth. It seems like that's a problem in itself - how do you get everyone to refer to and update the same document?

Ideally everyone would be using a single type definition. Admittedly that's more common with protobufs, though, where you can't send any data that's not in the definition.

Come to think of it, that's true of plain old structs too.

This is more common than you might otherwise think. I've worked at multiple companies that have multiple systems/sources of truth for various reasons. For example, my current company has stored and handled all its transactional data in a legacy point of sale system from the early 90s. They decided to upgrade to a modern ERP system a couple of years ago, but it takes a while to fully implement and roll over to a new source system, especially for a high-transaction system that cannot go down without the company starting to lose a lot of money. Thus it's being rolled out incrementally, resulting in both systems running together and being read and written to simultaneously.
Sometimes defining who should have authority over a singular original type definition isn't possible. This is sometimes true at companies, and it's even more true in open source projects. Even when possible, single type definitions in those cases often end up as Homer-car monstrosities that are too big and difficult to construct when only a small subset of fields are needed.
The normal way to handle this is to deserialize into your application specific type, and store extraneous data in an extra field that is private but included in reserializations.
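A sketch of that pattern with a dataclass, assuming a hypothetical `Customer` payload where the application only cares about `id` and `email`. Everything else is parked in a private `_extra` dict and written back out on reserialization.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Customer:
    # Fields this application actually needs (hypothetical names).
    id: int
    email: str
    # Everything else from the payload, hidden from repr and comparisons.
    _extra: dict[str, Any] = field(default_factory=dict, repr=False, compare=False)

    @classmethod
    def from_dict(cls, payload: dict[str, Any]) -> "Customer":
        known = {"id", "email"}
        extra = {k: v for k, v in payload.items() if k not in known}
        # A missing required key raises KeyError here, at the boundary.
        return cls(id=payload["id"], email=payload["email"], _extra=extra)

    def to_dict(self) -> dict[str, Any]:
        # Unknown fields are written back out unchanged.
        return {**self._extra, "id": self.id, "email": self.email}


# A payload with a field the application has never heard of.
payload = {"id": 1, "email": "a@example.com", "new_field": [1, 2]}
c = Customer.from_dict(payload)
round_tripped = c.to_dict()
```

The application fails fast on the fields it needs, while a field added upstream yesterday still reaches whoever consumes the reserialized output.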

Because your application will fail if fields you need aren't there.

That can turn into an enormous amount of work to provide all the permutations of type conversions between 3+ classes, and then manually shuffle between them over and over. It's even harder when you don't have the power to add similar conversions for the classes you're trying to convert to/from.

Classes aren't a great abstraction when enforcing program invariants like "this object must at least have fields a and b." With a dict you can just have "dict(a=1, b=2, c=3)" and it works everywhere without serialization/deserialization and manual type conversions. Python's type checkers can't provide any safety for you if you do that, but that's a deficiency in the language, not the concept.
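For what it's worth, the "at least a and b" invariant can be checked at runtime on a plain dict. This is only a sketch: `HasAB` and `has_at_least_ab` are hypothetical names, and the `TypedDict` here just documents the minimum shape and supplies the key set.

```python
from typing import Any, Mapping, TypedDict


class HasAB(TypedDict):
    """Minimum shape: keys a and b. Other keys are not described here."""
    a: int
    b: int


# __required_keys__ is populated by TypedDict itself (Python 3.9+).
REQUIRED = HasAB.__required_keys__


def has_at_least_ab(d: Mapping[str, Any]) -> bool:
    """Runtime check: a and b are present; extra keys are fine."""
    return REQUIRED.issubset(d.keys())


ok = has_at_least_ab(dict(a=1, b=2, c=3))
missing = has_at_least_ab(dict(a=1))
```

The dict is passed around as-is, with no conversion step, and the check only looks at key presence, not at value types.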

> Classes aren't a great abstraction when enforcing program invariants like "this object must at least have fields a and b."

Can you elaborate on the functional difference between

- an enhanced dictionary where certain keys are guaranteed as part of the type, and

- a record type (class, struct, whatever) with named fields?

This is one of the things I appreciate about languages like Go and Rust (I'm sure there are others as well). If the data is static, use a struct. If the data is dynamic, use a map/HashMap. No need to worry about TypedDict vs classes vs DataClasses vs etc, and no one uses HashMap for static data (they could, but virtually no one in those communities is such a glutton for punishment).

From Zen of Python:

> There should be one--and preferably only one--obvious way to do it

Forget about DataClasses, TypedDict etc. Can't you achieve the same in Python with a class and a dict? Is there a difference, other than perhaps being overloaded with options?
A dataclass will get things like __repr__ for free.
Eq and init are also "free", and so is hash if you make the dataclass frozen.
There are also implementation details to consider. As far as I know, a dataclass is still a regular class, so attribute access goes through each instance's __dict__ in the same way; you only avoid the per-instance dict if you opt into __slots__, e.g. with @dataclass(slots=True) on Python 3.10+.
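A quick sketch of what the decorator generates compared with a hand-written class. The class names are made up; the behaviour shown is that of the standard library `dataclasses` module.

```python
from dataclasses import dataclass


class PlainPoint:
    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y


@dataclass
class Point:
    x: int
    y: int


@dataclass(frozen=True)
class FrozenPoint:
    x: int
    y: int


same_fields_equal = Point(1, 2) == Point(1, 2)            # generated __eq__ compares fields
plain_equal = PlainPoint(1, 2) == PlainPoint(1, 2)        # default __eq__ compares identity
unhashable_by_default = Point.__hash__ is None            # eq=True without frozen disables hash
frozen_hash_matches = hash(FrozenPoint(1, 2)) == hash(FrozenPoint(1, 2))
has_instance_dict = hasattr(Point(1, 2), "__dict__")      # still backed by an instance dict
print(repr(Point(1, 2)), repr(PlainPoint(1, 2)))
```

The dataclass prints its field values in its repr, while the plain class falls back to the default object repr with a memory address.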
Python is one of those languages where everything starts to look like a nail.
Yeah, I’ve programmed in Python professionally for about a decade, and I used it for hobby stuff for about 5 years before that. I still don’t feel like I’ve mastered it, whereas I felt I had mastered Go after about 2 years. I think Go implements Python’s “zen” a lot better than Python does.
A popular AWS API library does this and it is infuriating. AWS added a new field but the library hasn't been updated yet? Too bad, you can't use that field then!
True. Don't slap in types just because you can; add types when you need to work with your data. Most of the time, I worked with systems where my Python code *was* the downstream and required data to run some business logic. In that context, types make the most sense.