by sanderjd
586 days ago
Frankly, the heuristic I've developed over the past few years working on a team that sounds like yours is: the data scientists are probably right. If you're actually operating on a single object, i.e. the equivalent of a single row in a dataframe, then yeah, it's silly to use a dataframe library. But if you're operating on N value objects ... yeah, you probably want a dataframe with N rows and a column for each field in your object. Your mileage may vary, I guess, but I resisted this for quite a while and I now think I was the one who was wrong.
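To make the "N rows and a column for each field" shape concrete, here is a minimal sketch using only the Python standard library. The `Order` class and its fields are hypothetical, invented for illustration, and a plain dict of lists stands in for what a real dataframe library would give you.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Order:  # hypothetical value object, not from the thread
    order_id: int
    customer: str
    amount: float


orders = [
    Order(1, "ada", 20.0),
    Order(2, "bob", 35.5),
    Order(3, "ada", 12.5),
]

# Pivot N objects into one column per field, each holding N values.
# A dataframe library stores data in roughly this orientation.
columns = {f.name: [getattr(o, f.name) for o in orders] for f in fields(Order)}

# Whole-column operations now read naturally.
total_amount = sum(columns["amount"])
```

Once the data is laid out by column, questions like "what is the total amount?" become operations on one column rather than a walk over every object.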
|
Most software devs are used to working with one-dimensional collections like lists, or tree-like abstractions like dicts, or some combination of those. This is why most abstractions and complex data types are built on these. Objects are a natural progression.
But in the data world, higher-dimensional data is modeled using dataframes (analogously, tables). This is a paradigm shift for most pure software people because manipulating tables typically requires thinking in terms of sets and relations. Joining two tables requires knowing how the join columns relate and which kind of join you want (inner, full outer, left, right, cross). Aggregation and window functions require thinking in terms of sub-groupings and subsets.
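A small sketch of why the join type matters, using Python's built-in `sqlite3` as a stand-in SQL engine (not DuckDB). The `customers` and `orders` tables and their contents are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'ada'), (2, 'bob'), (3, 'cy');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 12.5), (12, 2, 35.5);
""")

# Inner join: only customers that have at least one matching order appear.
inner_rows = con.execute("""
    SELECT c.name, o.amount
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()

# Left join plus aggregation: every customer appears, grouped into one
# row each, with zero totals for customers that have no orders.
left_totals = con.execute("""
    SELECT c.name,
           COUNT(o.order_id) AS n_orders,
           COALESCE(SUM(o.amount), 0.0) AS total
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.name
""").fetchall()
```

Here the inner join silently drops the customer with no orders, while the left join keeps that customer with a count of zero. Choosing between them is a decision about sets, not about iteration.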
It's not super intuitive unless you have to manipulate and reshape large amounts of data every day, which data scientists and data engineers have to do, and which software engineers typically don't do very often. It's just a kind of muscle memory that gets developed.
I definitely had trouble at first getting software engineers to buy into DuckDB because they were unfamiliar and uncomfortable with SQL. Fortunately, some of them had a growth mindset and were willing to learn, and those folks have now acquired a very powerful tool (DuckDB) and a new way of thinking about large-data manipulation. Once data reaches a certain size, iterative constructs like "for" loops become impractical.
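As a rough illustration of the two styles, the sketch below computes the same per-customer totals with a "for" loop and with a single set-based query. It uses Python's built-in `sqlite3` rather than DuckDB, and the sample rows are invented. At this size both are instant, so it shows only the difference in shape, not in performance.

```python
import sqlite3
from collections import defaultdict

rows = [("ada", 20.0), ("bob", 35.5), ("ada", 12.5)]  # made-up sample data

# Iterative style: visit each row and update running state by hand.
loop_totals = defaultdict(float)
for customer, amount in rows:
    loop_totals[customer] += amount

# Set-based style: describe the grouping and let the engine execute it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
sql_totals = dict(
    con.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ).fetchall()
)
```

The loop spells out how to compute the result, while the query states only what is wanted, which leaves the engine free to decide how to scan and group the data.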