by dodata 2307 days ago
Neat! Congrats on the launch - the demo is very helpful to understand the product. Having consumed long, painful PDF data dictionaries in the past, this is a big breath of fresh air. Excited to see where Syndetic goes!

For me, the most painful part of working with 3rd party data was actually figuring out the "match rate" to internal data. For example, you might be a consumer-facing company who hopes to add more context to your internal data by pulling in 3rd party information for existing clients. To match your internal data to a 3rd party dataset, you usually match on some hashed email (or similar identifier) to see what percentage of your consumer records will be available in the 3rd party dataset. Have you thought about something like that with your tool? Maybe you can upload a sample of hashed emails and see how different match rates pan out.
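A minimal sketch of the match-rate check described above, assuming both sides hash emails with SHA-256 after trimming and lowercasing. The normalization rule, the function names, and the sample addresses are all illustrative assumptions, not anything Syndetic has described.

```python
import hashlib


def hash_email(email: str) -> str:
    """Normalize an email (trim, lowercase) and return its SHA-256 hex digest."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def match_rate(internal_hashes, provider_hashes) -> float:
    """Fraction of distinct internal hashes that also appear in the provider set."""
    internal = set(internal_hashes)
    if not internal:
        return 0.0
    return len(internal & set(provider_hashes)) / len(internal)


# Hypothetical sample data: four internal customers, three provider records.
internal = [hash_email(e) for e in
            ["Ann@example.com", "bob@example.com", "cy@example.com", "di@example.com"]]
provider = [hash_email(e) for e in
            [" ann@example.com", "cy@example.com", "zed@example.com"]]

rate = match_rate(internal, provider)
print(f"match rate: {rate:.0%}")
```

Both parties have to agree on the normalization and hash function beforehand, otherwise the same email produces different digests and the match rate is understated.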


Yes! This has come up across multiple industries and is probably the feature on our roadmap I'm most excited about. The implementation is tricky but customers definitely care about the intersection of a provider's data with their own. Some more sophisticated providers have internal tools for generating things like sample sets customized to a prospect.

We're going to be adding a feature where we can flag fields as identifying keys and index them. We'll start with a simple intersection count ("upload 100 stock tickers, see how many records match"). Then we'll add an interactive feature to let a prospective customer generate all of the stats in the dictionary scoped down to the subset of data they care about. It's important to be able to answer questions like "for the 100 tickers I care about, how many NULLs are there for this other column?".
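As a rough sketch of the two steps above (an intersection count on a key field, then a stat scoped to the matching subset), assuming records are plain dictionaries as `csv.DictReader` would produce them. The field names, the set of NULL tokens, and the sample rows are hypothetical.

```python
# Strings treated as missing values; an assumption, real files vary.
NULL_TOKENS = {"", "NULL", "null", "NA", "N/A"}


def scope_to_keys(records, key_field, wanted_keys):
    """Return only the records whose key field is in the uploaded key list."""
    wanted = set(wanted_keys)
    return [r for r in records if r.get(key_field) in wanted]


def null_count(records, column):
    """Count records where the column is absent or holds a NULL token."""
    return sum(
        1 for r in records
        if r.get(column) is None or str(r[column]).strip() in NULL_TOKENS
    )


# Hypothetical provider dataset and a prospect's ticker list.
records = [
    {"ticker": "AAPL", "dividend_yield": "0.5"},
    {"ticker": "MSFT", "dividend_yield": ""},
    {"ticker": "TSLA", "dividend_yield": "NULL"},
    {"ticker": "IBM", "dividend_yield": "4.1"},
]
wanted = ["AAPL", "MSFT", "TSLA", "GOOG"]

subset = scope_to_keys(records, "ticker", wanted)
print(f"{len(subset)} records match {len(wanted)} uploaded tickers")
print(f"NULLs in dividend_yield for that subset: {null_count(subset, 'dividend_yield')}")
```

In practice the key field would be indexed rather than scanned, but the scoped statistic is the same computation applied to the filtered rows.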

Maybe someday we'll even get into the more general record linkage problem when there's no reliable matching key.

That sounds very useful.

I am also super impressed that you managed to present your product without mentioning "big data" or "machine learning" or AI - given that anyone that does anything these days crams those big words in.

That's good, good luck.

Thanks, I'm with you on the big data buzzwords and trying to avoid overengineering things (one of my favorite HN posts ever is https://news.ycombinator.com/item?id=8908462).

Right now the data scanning is just a fork of https://github.com/BurntSushi/xsv/ running in a container with plenty of RAM, and we've handled files in the ~20GB range with no problem. I think we could actually scale up to ~100GB files with xsv, which seems to cover 99%+ of the data providers we're running into. Providers might be processing massive amounts of data, but the eventual deliverable they share with their customers is rarely too big for one machine.

That said, we will probably move away from our super simple stack towards running a Spark cluster in the medium term. Not for "big data" (actually I expect Spark to have higher latency and possibly to be slower for a moderate-sized dataset than the Rust solution) but because we want to be able to run multiple parallel scans over the datasets for upcoming features. Some of that will involve a DAG of dependencies (e.g., do type detection first to figure out which fields are categorical, then generate visualizations where the plots are grouped by the values of the categorical field). There are also a bunch of nice libraries in the Spark world for more comprehensive stats.
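To illustrate the dependency idea on a single machine (not Spark, and not the actual pipeline), here is a small sketch using Python's standard `graphlib` to run a type-detection step before a step that groups by the detected categorical fields. The step names, the "few distinct values" heuristic, and the sample rows are assumptions made for the example.

```python
from collections import Counter
from graphlib import TopologicalSorter  # Python 3.9+


def detect_types(rows, results, max_distinct=2):
    """Crude heuristic: a field with few distinct values is categorical."""
    return {
        field: "categorical" if len({r[field] for r in rows}) <= max_distinct else "other"
        for field in rows[0]
    }


def grouped_counts(rows, results):
    """Count rows per value for each field that detect_types marked categorical."""
    types = results["detect_types"]
    return {
        field: dict(Counter(r[field] for r in rows))
        for field, kind in types.items() if kind == "categorical"
    }


# Each step maps to the set of steps that must finish before it.
dag = {"detect_types": set(), "grouped_counts": {"detect_types"}}
steps = {"detect_types": detect_types, "grouped_counts": grouped_counts}


def run_scans(rows):
    """Run every step in dependency order, passing earlier results forward."""
    results = {}
    for name in TopologicalSorter(dag).static_order():
        results[name] = steps[name](rows, results)
    return results


# Hypothetical sample rows.
rows = [
    {"ticker": "AAPL", "sector": "tech", "price": "190.1"},
    {"ticker": "MSFT", "sector": "tech", "price": "410.2"},
    {"ticker": "XOM", "sector": "energy", "price": "105.3"},
    {"ticker": "CVX", "sector": "energy", "price": "150.4"},
]
results = run_scans(rows)
print(results["grouped_counts"])
```

Steps with no dependency between them could run in parallel; `static_order()` simply serializes them, which keeps the sketch short.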