| HN Mirror



	by bayesian_horse 1525 days ago
	What Pandas does is notoriously hard to fit into a compile-time type system. Certainly too hard to go into the brains of scientists who didn't grow up coding. No, the code in data science isn't bad because of the lack of typing. The code is "bad" mostly because those writing it are relatively fresh from starting to program. Also there is more pressure to make things possible, often just to run it once, and neglect repeatability or scaling to larger code bases. Different emphasis. That doesn't mean an experienced full stack developer would do Data Science better, because he might lack a lot of skills that matter more in that domain.

5 comments

NumberCruncher 1525 days ago

> That doesn't mean an experienced full stack developer would do Data Science better, because he might lack a lot of skills that matter more in that domain.

This resonates with my experience. I had the opportunity to work on a DS codebase written entirely in Scala with all the typing, parallelism, actor model, whatnot. Basically I joined the company because of this technical factor. It was fun until I figured out that DS was "typed IF-THEN-ELSE written by Java devs in Scala returning stuff the users complain about with high reliability within milliseconds". Now I am happy to be back in the single-threaded, untyped Python world. Still no bugs in production, because we validate all requests to death and have unit tests and integration tests running on real data, not on mock-ups. Basically we follow the principle: if the integration test passes, our typing is just right, or at least good enough. Funnily enough, all the typing errors we catch are caused by wrongly typed data coming from the production system written in a typed programming language... what a strange world.
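A minimal sketch of what "validating requests to death" might look like in untyped Python. The field names, the `EXPECTED_FIELDS` schema and the `validate_request` helper are hypothetical illustrations, not the commenter's actual code.

```python
from typing import Any

# Hypothetical schema: field name -> expected Python type.
EXPECTED_FIELDS = {"customer_id": int, "amount": float, "currency": str}


def validate_request(payload: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the payload passed."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            actual = type(payload[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


# A record where the upstream system sent the amount as a string.
bad = {"customer_id": 42, "amount": "19.99", "currency": "EUR"}
print(validate_request(bad))  # ['amount: expected float, got str']
```

The check runs on every incoming record, so wrongly typed data from an upstream system is caught at the boundary rather than deep inside a transformation.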

bostik 1525 days ago

> integration tests running on real data, not on mock-ups.

I can see you are enjoying the life outside of a highly regulated industry. Having certain kinds of production data in tests (or feeding that to test environment) would be a major audit finding in any finance or healthcare company.

Makes for both a blessing and a curse.

bayesian_horse 1525 days ago

You can anonymize such data or get the necessary agreements for a small subset. All of which is tricky, but not impossible.

dragonwriter 1524 days ago

> You can anonymize such data

...thereby destroying the production-like features for which you want it for testing, which you then need to recreate and reintroduce, so you might as well just synthesize test data in the first place, since that's what you end up doing anyway, in effect.

bayesian_horse 1524 days ago

I know you probably just want to be annoying, but really, there is a world of difference between completely synthetic and anonymized data. Without talking about specifics, anonymizing is just the operation of making the process of deanonymization a lot harder. "Hard enough" is usually specified in some form by the regulator. You can identify an individual by their ECG data, for example, it's just really hard...

No, in actual practice you don't scrub the stuff you actually need to test.

dragonwriter 1522 days ago

> I know you probably just want to be annoying, but really, there is a world of difference between completely synthetic and anonymized data.

No, I’ve spent ~20 years in healthcare, with this as a frequently recurring issue.

> No, in actual practice you don't scrub the stuff you actually need to test.

In actual practice, the stuff you really need to test often overlaps with the stuff you are minimally required to scrub to legally deidentify the data. The most common outcome I’ve seen when trying to do this is both incurring most of the work of generating synthetic data and failing to legally deidentify the source data.

NumberCruncher 1525 days ago

What I wrote was meant in the context of data science. You cannot do ML without having access to real data, not even in highly regulated industries. Obviously you won't touch PII. But whether the real data is sitting in your train/test/validation data set or you use it for integration tests doesn't make any difference from the perspective of an audit.

jghn 1525 days ago

> make things possible, often just to run it once

This is the largest difference. When there's no expectation of code lasting beyond a very short lifespan, why go through the effort to future proof things, improve maintainability, have better ergonomics, etc?

bogwog 1524 days ago

Because science doesn't work unless experimental results are reproducible.

jghn 1524 days ago

Sure. And once you land on something worth keeping, you clean it up then. But between time point 0 and then, a whole lot of code gets written to be run very few times.

krageon 1525 days ago

> What Pandas does is notoriously hard to fit into a compile-time type system

What, mutate data with known structure? What exactly do you imagine is hard about this, let alone notoriously so?

qsort 1525 days ago

Not hard per se, but extremely unergonomic.

A pandas dataframe types to something like Iterable[SomeKindOfProductType[int, str, str, ... (78 other columns)]]. The formal type of a dataframe in the middle of a transformation is... not very useful to know.
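A rough sketch of the ergonomics problem in plain Python, with a list of named tuples standing in for a dataframe. The row types and the `add_total` step are made up for illustration; the point is only that every transformation that adds or drops a column needs its own product type.

```python
from typing import Iterable, NamedTuple


class RawRow(NamedTuple):
    store_id: int
    product: str
    price_cents: int


# One derived column and one extra input already force a second row type.
class PricedRow(NamedTuple):
    store_id: int
    product: str
    price_cents: int
    quantity: int
    total_cents: int


def add_total(rows: Iterable[RawRow], quantity: int) -> list[PricedRow]:
    """Append a quantity and a computed total to every row."""
    return [PricedRow(*row, quantity, row.price_cents * quantity) for row in rows]


print(add_total([RawRow(1, "tea", 250)], quantity=3))
```

With three columns this is tolerable; with 80 columns and a dozen intermediate steps, the declared types dwarf the logic they describe.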

smallerfish 1525 days ago

You wouldn't typically type tabular data in any language. Give it (at most) a DataFrame<StoreTransaction> type if you must, where StoreTransaction describes the structure of a row - maybe declaring only columns that model generation code was doing typed operations on (e.g. numerics vs strings) to avoid the need for reflection.

bayesian_horse 1525 days ago

Either you type the tabular data at compile time or you don't get type checking of tabular data at compile time.

The number and types of the columns aren't necessarily known at compile time, which leads to runtime errors. Even in a statically typed language, such dataframes are a kind of "dynamic typing escape hatch". As the complexity of the software increases, such mechanisms of dynamic typing and throwing runtime errors crop up all over the place.
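A small sketch of that escape hatch, using only the standard library `csv` module in place of a dataframe. The `mean_of_column` helper and the sample data are hypothetical. The function has one static signature, yet whether it works depends entirely on a header that only exists at runtime.

```python
import csv
import io


def mean_of_column(csv_text: str, column: str) -> float:
    """Average a column whose existence and type are only known at runtime."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Raises KeyError if the column is absent, ValueError if it is not numeric.
    values = [float(row[column]) for row in reader]
    return sum(values) / len(values)


good = "price,qty\n2.0,1\n4.0,3\n"
renamed = "unit_price,qty\n2.0,1\n4.0,3\n"  # upstream renamed the column

print(mean_of_column(good, "price"))  # 3.0
try:
    mean_of_column(renamed, "price")
except KeyError as exc:
    print("runtime failure, missing column:", exc)
```

Both calls type-check identically; only the second input reveals the problem, and only when the code runs.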

smallerfish 1525 days ago

Sure. If you're dealing with untyped data at runtime you can't type it at compile time. Not a new issue, and handled all the time in otherwise typed languages.

int_19h 1524 days ago

It's not very useful to type, but that's why we have type inference.

It's certainly very useful to know, even if that knowledge is indirect via the type of the resulting dataframe after all the transforms.

bayesian_horse 1524 days ago

From dabbling in F# I have a feeling such "information" would be somewhat more annoying than useful.

arinlen 1525 days ago

> What Pandas does is notoriously hard to fit into a compile-time type system. Certainly too hard to go into the brains of scientists who didn't grow up coding.

I'm not sure if that's true. Doesn't Pandas handle ETL and some analysis? There is nothing inherent to ETL that makes it a hard problem with compiled languages.

In your opinion, what does Pandas do that's hard to do with compile-time languages?

bayesian_horse 1525 days ago

I didn't say "with compile-time languages" but "with compile-time type systems". And many similar tools in a statically typed language will necessarily create a way to have one static type that doesn't care what the data inside actually looks like.

This even starts with basic Numpy and handling tensor objects. It's not easy for a type checker to understand what operations you can do with what shape of tensor. Worse, most often you don't know (or want to know) some of the dimensions or even dimensionality of some of the objects. Then it is impossible to check all of this at compile time.
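To illustrate the shape problem without depending on NumPy itself, here is a sketch with nested lists. The `Matrix` alias and `matmul` function are illustrative stand-ins. The annotation says "a matrix" but says nothing about its dimensions, so a shape mismatch can only surface as a runtime error.

```python
Matrix = list[list[float]]  # the static type carries no shape information


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two row-major matrices, checking shapes at runtime."""
    inner = len(a[0])
    if inner != len(b):
        raise ValueError(
            f"shape mismatch: ({len(a)}, {inner}) @ ({len(b)}, {len(b[0])})"
        )
    cols = len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(a))
    ]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # shape (2, 3)
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape (3, 2)

print(matmul(a, b))  # [[4.0, 5.0], [10.0, 11.0]]
try:
    matmul(a, a)  # same static types as the call above, incompatible shapes
except ValueError as exc:
    print(exc)
```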

arinlen 1525 days ago

> This even starts with basic Numpy and handling tensor objects. It's not easy for a type checker to understand what operations you can do with what shape of tensor.

That doesn't sound like a Python problem.

Instead, it sounds like the natural consequence of numpy being designed in a way where their data types aren't organized into subtypes, and leave that as runtime properties. This is a natural reflection of numpy's take on vectors, matrices, and tensors, which in terms of types are just big arrays with runtime properties.

To put things in perspective, in C++, Eigen supports static dense vectors and matrices whose size is specified and known at compile-time. I'm sure Python doesn't impose any more static type constraints than C++ does.

bayesian_horse 1524 days ago

Of course it's not a Python problem, all similar tools have the same "problem" that they can't easily fit that stuff into their type systems, so they invent some way to not care about it.

HelloNurse 1525 days ago

It isn't a matter of "compile time": explicit type declarations and definitions can often be formally sound but practically worthless.

Significant types in ETL-style applications typically come from outside (e.g. a certain CSV column in the input file contains a date in YYYYMMDD format, or maybe YYYYDDMM, figure it out, and don't forget time zones or your accounting will go wrong).
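The date ambiguity can be shown in a few lines with the standard library. The sample values and the `plausible_formats` helper are made up for illustration. Both readings of the column are the same "type" (a string, or a date), and only the data can hint at which one is meant.

```python
from datetime import datetime

raw = "20210305"
as_ymd = datetime.strptime(raw, "%Y%m%d").date()  # read as 5 March 2021
as_ydm = datetime.strptime(raw, "%Y%d%m").date()  # read as 3 May 2021
print(as_ymd, as_ydm)  # 2021-03-05 2021-05-03


def plausible_formats(value: str) -> list[str]:
    """Return the candidate formats under which the value parses at all."""
    formats = []
    for fmt in ("%Y%m%d", "%Y%d%m"):
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        formats.append(fmt)
    return formats


print(plausible_formats("20210305"))  # ['%Y%m%d', '%Y%d%m']
print(plausible_formats("20211305"))  # ['%Y%d%m']
```

A value such as `20211305` rules out one format, but many values parse under both, so the format still has to be established from outside the program.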

Then types are mostly complex but obvious and easily deduced (e.g. multiplying matrices of compatible shapes necessarily gives a matrix of a certain shape, why should the program say anything more detailed or lower-level than "do a matrix multiplication" or "do a tensor product"?); they are an often dynamic and unpredictable property of the data, not a useful abstraction.

int_19h 1524 days ago

The source code shouldn't need to say anything about the type of the resulting matrix explicitly, perhaps. But why shouldn't the type system keep track of shapes and deduce the accurate type for the result of said multiplication?

bayesian_horse 1524 days ago

Because the shape can be dynamic, for example.
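Both sides of this exchange can be sketched as a toy shape checker. This is an illustration only, not how any real type system or library does it; `Shape` and `matmul_shape` are hypothetical names, and `None` stands for a dimension that is unknown until runtime.

```python
from typing import Optional

# None marks a dimension that is not known until the data arrives.
Shape = tuple[Optional[int], Optional[int]]


def matmul_shape(left: Shape, right: Shape) -> Shape:
    """Deduce the result shape of left @ right, rejecting known mismatches."""
    inner_left, inner_right = left[1], right[0]
    if inner_left is not None and inner_right is not None and inner_left != inner_right:
        raise TypeError(f"cannot multiply {left} by {right}")
    return (left[0], right[1])


print(matmul_shape((2, 3), (3, 4)))     # (2, 4): fully deduced
print(matmul_shape((None, 3), (3, 4)))  # (None, 4): row count stays unknown
print(matmul_shape((2, None), (5, 4)))  # (2, 4): accepted, inner dimension unchecked
```

When every dimension is known, the result shape is deduced without the caller writing anything. As soon as a dimension is dynamic, the checker either has to let the operation through unchecked or reject code that would work, which is the trade-off the two commenters are arguing over.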

int_19h 1524 days ago

> What Pandas does is notoriously hard to fit into a compile-time type system.

Sounds a lot like F# type providers to me.

https://thesharperdev.com/introduction-to-fsharp-type-provid...

bayesian_horse 1524 days ago

As I said, it is hard.

And good luck teaching beginning scientists F#.