Polyaxon is an open source machine learning automation platform. It lets you schedule notebooks, TensorBoards, and container workloads for training ML and DL models. It also has native integration with Kubeflow's operators for distributed training.
https://dolthub.com is the cool kid right now. There are also Pachyderm, Git LFS, and IPFS.
Really, what we need is version control for data; it's not just an ML data problem. It's a little different though, because you would like to move computation to the data rather than the other way around.
I'm having trouble grasping the utility of version controlling production-sized data (as opposed to sample training data, or code). I may be missing something here, so please enlighten me.
It seems to me that to be able to time-travel in data you almost need to store the Write-Ahead Log of database transactions and be able to replay it. Debezium captures the CDC information, but it's an infrastructure-level tool rather than a version control tool.
In data science, most time-travel issues are worked around using bitemporal data modeling: which is a fancy way of saying "add a separate timestamp column to the table to record when the data was written". Then you can roll things back to any ETL point in a performant fashion. This is particularly useful for debugging recursive algorithms that get retrained every day.
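To make the "timestamp column" workaround concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table `prices`, its columns, and the helper `as_of` are hypothetical names chosen for illustration; the only assumption is that the ETL process appends new row versions instead of updating in place.

```python
import sqlite3


def as_of(conn, cutoff):
    """Return the latest version of each sku written at or before `cutoff`."""
    return conn.execute(
        """
        SELECT p.sku, p.price
        FROM prices AS p
        JOIN (
            SELECT sku, MAX(written_at) AS written_at
            FROM prices
            WHERE written_at <= ?
            GROUP BY sku
        ) AS latest
          ON latest.sku = p.sku AND latest.written_at = p.written_at
        ORDER BY p.sku
        """,
        (cutoff,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Hypothetical append-only table: every ETL run adds rows, never overwrites.
conn.execute("CREATE TABLE prices (sku TEXT, price REAL, written_at TEXT)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("a", 10.0, "2020-01-01"),
        ("b", 20.0, "2020-01-01"),
        ("a", 12.0, "2020-01-02"),  # a later ETL run rewrites sku "a"
    ],
)

print(as_of(conn, "2020-01-01"))  # state after the first ETL run
print(as_of(conn, "2020-01-02"))  # state after the second ETL run
```

Rolling back to an ETL point is then just a matter of picking the cutoff; nothing has to be replayed.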
But these are infrastructure level approaches. I'm not sure that it's a problem for a version control tool.
Tim, CEO of Liquidata, the company that built Dolt and DoltHub, here. This is how we store the version-controlled rows so that we get structural sharing across versions (i.e., 50M rows plus a one-row change becomes 50M+1 entries in the database, not 100M, with no need to replay logs):
Thanks, that looks like an interesting approach. I may have missed this in the article, but let's say I have a SQL database with 600m records, and an ETL process does massive upserts (20m records) every day, with many UPDATEs on 1-2 fields.
Wouldn't discovering what those changes are still entail heavy database queries? Unless Dolt has a hook into most SQL databases' internal data structures? Or WALs?
You have to move your data to Dolt. Dolt is a database. It's got its own storage layer, query engine, and query parser. Diff queries are fast because of the way the storage layer works.
Right now, Dolt can't easily be distributed (i.e., data must fit on one hard drive), so it's not meant for big data; it's more for data that humans interact with, like mapping tables or daily summary tables. But long term, if we can get some traction, we plan on building "big dolt", a distributed version that can scale to as big as you want.
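For readers wondering why a content-addressed storage layer makes diffs cheap, here is a toy sketch in Python. It is not Dolt's actual storage format: the `ChunkStore` class, the fixed `NUM_BUCKETS`, and the hash-bucketing scheme are all assumptions made up for illustration. The idea it shows is only that unchanged chunks keep the same hash, so two versions share them and a diff can skip them without reading any rows.

```python
import hashlib
import json

NUM_BUCKETS = 16  # hypothetical; chosen only for the illustration


def bucket_of(key):
    """Assign a row key to a bucket, stable across versions."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % NUM_BUCKETS


class ChunkStore:
    def __init__(self):
        self.chunks = {}  # content hash -> {key: row}

    def put(self, rows):
        """Store a chunk under the hash of its content and return the hash."""
        blob = json.dumps(rows, sort_keys=True).encode()
        digest = hashlib.sha256(blob).hexdigest()
        self.chunks.setdefault(digest, rows)  # identical content is stored once
        return digest

    def commit(self, table):
        """Store a table ({key: row}); a version is a tuple of chunk hashes."""
        buckets = [{} for _ in range(NUM_BUCKETS)]
        for key, row in table.items():
            buckets[bucket_of(key)][key] = row
        return tuple(self.put(b) for b in buckets)

    def diff(self, old, new):
        """Return {key: (old_row, new_row)} for rows that differ."""
        changed = {}
        for h_old, h_new in zip(old, new):
            if h_old == h_new:
                continue  # same hash means same content: skip the whole chunk
            rows_old, rows_new = self.chunks[h_old], self.chunks[h_new]
            for key in rows_old.keys() | rows_new.keys():
                if rows_old.get(key) != rows_new.get(key):
                    changed[key] = (rows_old.get(key), rows_new.get(key))
        return changed


store = ChunkStore()
table1 = {f"k{i}": {"v": i} for i in range(1000)}
v1 = store.commit(table1)

table2 = dict(table1)
table2["k7"] = {"v": -1}  # a one-row change
v2 = store.commit(table2)

print(store.diff(v1, v2))
print(len(store.chunks), "chunks stored for two versions")
```

After the second commit only one new chunk has been written, and the diff touches only that chunk. A real system would use a tree of chunks so that the comparison itself does not have to walk every chunk hash.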
So for most analytic workloads, typically a columnstore db is used due to the need for performance and advanced SQL features (windowing functions) for complex analytic queries -- which I don't expect Dolt to replace. Which means if we wanted to use Dolt's features, we would have to continuously ETL the data into Dolt, which would entail mirroring the entire database (or at least the parts we want to version control).
Dolt essentially becomes a derived database specifically used for versioning. I see how this might work for some use cases.
One of the cool things about Dolt is that you can query the diff between two commits. This functionality is available through special system tables. You specify two commits in the WHERE clause, and the query only returns the rows that changed between the commits. The syntax looks like:
`SELECT * FROM dolt_diff_$table WHERE from_commit = '230sadfo98' AND to_commit = 'sadf9807sdf'`
> In data science, most time-travel issues are worked around using bitemporal data modeling: which is a fancy way of saying "add a separate timestamp column to the table to record when the data was written".
Not quite, this is "transaction time". You also need "valid time" to be truly bitemporal. Recovering the database as of some point in time is not enough to answer questions like "when will this fact become false?" or "when did our belief about when it would become false change?", because you didn't preserve assertions about the time range over which the fact was held to be true.
In terms of implementations, ranges are better than double timestamps. They provide their own assertion of monotonicity and can be easily used in exclusion indices.
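A small Python sketch of the distinction, under the assumption that each assertion carries two half-open ranges: one for valid time and one for transaction time. The `Assertion` class, the `lookup` helper, and the sample data are hypothetical. The sketch does not enforce non-overlap the way an exclusion index would; it only shows the two-dimensional lookup.

```python
from dataclasses import dataclass
from datetime import date

FOREVER = date.max  # open-ended upper bound


@dataclass(frozen=True)
class Assertion:
    key: str
    value: str
    valid_from: date  # valid time: when the fact holds in the world
    valid_to: date
    tx_from: date     # transaction time: when the database held this belief
    tx_to: date


def contains(start, end, point):
    """Half-open range check: start <= point < end."""
    return start <= point < end


def lookup(rows, key, valid_at, known_at):
    """What did we believe at `known_at` about the value of `key` at `valid_at`?"""
    return [
        r.value
        for r in rows
        if r.key == key
        and contains(r.valid_from, r.valid_to, valid_at)
        and contains(r.tx_from, r.tx_to, known_at)
    ]


rows = [
    # Recorded on 2020-01-01: the plan is valid indefinitely.
    # Superseded on 2020-03-01, so its transaction range is closed there.
    Assertion("plan", "standard", date(2020, 1, 1), FOREVER,
              date(2020, 1, 1), date(2020, 3, 1)),
    # Recorded on 2020-03-01: we now believe the plan ends on 2020-06-01.
    Assertion("plan", "standard", date(2020, 1, 1), date(2020, 6, 1),
              date(2020, 3, 1), FOREVER),
]

# In February we still believed the plan would hold in July.
print(lookup(rows, "plan", date(2020, 7, 1), known_at=date(2020, 2, 1)))
# By April we knew it would not.
print(lookup(rows, "plan", date(2020, 7, 1), known_at=date(2020, 4, 1)))
```

A single "written at" column can answer the first kind of question (what the table looked like in February) but not when the fact itself was expected to stop being true.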
Glad I could help! The research seems to have puttered on for a while after this book was written, but appears to have fizzled out by around the turn of the millennium.
Some notion of bitemporalism showed up in SQL:2011, but it is somewhat constrained compared to what Snodgrass describes.
Not really -- in many forecasting applications in fast-changing markets, it is fairly common to dynamically retrain your recursive model to a moving window of historical data in order to adapt to your current environment (with some regularization). The length of the window depends on how fast the market changes.
For these types of recursive model applications, you cannot just fit the model once and forget about it.
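As a rough illustration of moving-window retraining, here is a pure-Python sketch. The model (a one-parameter autoregression `y[t] = a * y[t-1]`), the ridge penalty, and the function names are assumptions picked for brevity, not anyone's production setup.

```python
def fit_ar1(series, ridge=0.0):
    """Least-squares fit of y[t] = a * y[t-1], with an optional ridge penalty."""
    xs, ys = series[:-1], series[1:]
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs) + ridge
    return num / den if den else 0.0


def rolling_forecasts(series, window, ridge=0.0):
    """One-step-ahead forecasts, refitting on the trailing `window` points each step."""
    forecasts = []
    for t in range(window, len(series)):
        a = fit_ar1(series[t - window:t], ridge)  # retrain on recent history only
        forecasts.append(a * series[t - 1])       # predict series[t]
    return forecasts


# The series doubles, then starts halving: a regime change.
series = [1, 2, 4, 8, 4, 2, 1]
print(rolling_forecasts(series, window=3))
```

The first forecasts follow the doubling regime, and the coefficient drops below 1 once the window contains mostly post-change data. A shorter window adapts faster but fits on fewer points, which is where the regularization comes in.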
Honestly, I've heard people in Vegas tell me the same about their strategies vs. slots. Genuinely, if you have made money from this - well done, take it out now, congratulate yourself. If you haven't...
Thanks!
There are indeed many new players in the data versioning space (DVC and Quilt are also probably worth mentioning).
I totally agree that data management problems are not just ML related. But I personally think there are additional challenges in the space beyond version control for data: the whole area of data quality management and monitoring, for example.
I liked the analogy to DevOps: source versioning was a super critical problem to solve in software development, but it didn't stop there; things like CI/CD followed.
I believe we'll see a similar evolution in the data space.
Disclaimer: I am a co-founder of Logical Clocks. There are loads of interesting technical challenges in this "Feature Store" space. Here are just a few we address in Hopsworks:
1. To replicate models (needed for regulatory reasons), you need to commit both data and code. If you have only a few models, fine, just archive the training data. But if you have lots of models (dev+prod) and lots of data, you can't use git-based approaches where you commit metadata and make immutable copies of data. It scales (your data!) badly. We are following the ACID datalake approach (Apache Hudi), where you store diffs of your data and can issue queries like "Give me training data for these features as it was on this date".
2. You want one feature pipeline to compute features (not one for training and a different one when serving features). Your feature store should scale to store TBs/PBs of cached features to generate train/test data, but should also return feature vectors in single ms latency for online apps to make predictions. What DB has those characteristics? We say none, and we adopt a dual-DB approach with one DB for low latency and one for scale-out SQL. We use open-source NDB and Hive on our HopsFS filesystem, where both DBs and the filesystem share the same unified, scale-out metadata layer (a "rm -rf feature_group" on the filesystem also automatically cleans up Hive and feature metadata).
3. You want to be able to catalog/search for features using free-text search and have good exploratory data analysis. The systems challenge here is how to allow search on your production DB with your features. Our solution is that we provide a CDC API to our Feature Store, and automatically sync extended metadata to Elastic with an eventually consistent replication protocol. So when you 'rm -rf ..' on your filesystem, even the extended metadata in Elastic is automatically cleaned up.
4. You need to support reuse of features in different training datasets. Otherwise, what's the point? We do that using Spark as a compute engine to join features from tables containing normalized features.
https://github.com/polyaxon/polyaxon