Hacker News
by akudha 1855 days ago
In one of my previous jobs (financial services firm), we needed some specific data from Census/BLS etc. We looked around to see if anyone was selling the data. They were, but they quoted tens of thousands of dollars for what amounted to downloading and parsing a few CSV files from the gov sites. We did it ourselves. That same client was paying tons of money for other data sets, which were also public datasets. The value add here is collating, cleaning etc.

What I am trying to get at - there is a market even for easily and freely available data, as long as the seller is dependable. The only problem is, you need to get your foot in the door. That might be much harder than the technical work.

2 comments

How’d they get on your firm’s radar? I.e., how’d they market their services to you?
This is one reason why we made Splitgraph [0]. We index 40k+ datasets from government data portals, and we give you one Postgres endpoint where you can query all of them together. Any table in a data “product” (live or versioned) is addressable in a SQL query with `namespace/repository:tag.table`. Splitgraph parses the query and forwards it to the appropriate data source(s), translating it to whatever upstream query language(s) are defined in the corresponding Foreign Data Wrapper (FDW).
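To make the `namespace/repository:tag.table` addressing concrete, here is a minimal sketch of splitting such a reference and turning it into a SQL statement. The helper names (`TableRef`, `parse_table_ref`, `to_sql`) and the example reference are hypothetical, not part of Splitgraph's API. The sketch assumes the `namespace/repository:tag` part is passed as one double-quoted schema identifier, since Postgres requires quoting for identifiers that contain `/` or `:`.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    namespace: str
    repository: str
    tag: str
    table: str


def parse_table_ref(ref: str) -> TableRef:
    """Split 'namespace/repository:tag.table' into its four parts.

    Raises ValueError if a separator is missing.
    """
    namespace, rest = ref.split("/", 1)
    repository, rest = rest.split(":", 1)
    # Split on the last dot so a tag such as '1.0.0' stays intact.
    tag, table = rest.rsplit(".", 1)
    return TableRef(namespace, repository, tag, table)


def quote_ident(name: str) -> str:
    """Quote a Postgres identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def to_sql(ref: TableRef, limit: int = 10) -> str:
    """Build a SELECT against the table (assumed quoting convention)."""
    schema = f"{ref.namespace}/{ref.repository}:{ref.tag}"
    return (
        f"SELECT * FROM {quote_ident(schema)}.{quote_ident(ref.table)} "
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # 'example_ns/example_repo:latest.cases' is a made-up reference.
    print(to_sql(parse_table_ref("example_ns/example_repo:latest.cases")))
```

The resulting string could then be sent through any ordinary Postgres client, which is the point of exposing everything behind one endpoint.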

Here’s an example of a query across coronavirus tables at the external data portals for Cambridge, MA and Chicago, IL. [1] In this case the FDW implements the query language of Socrata (the data portal vendor for 200+ gov agencies and municipalities).
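As a rough illustration of what "translating to the upstream query language" can mean for a Socrata portal, the sketch below builds a request URL using the `$select`, `$where`, and `$limit` parameters of Socrata's query language (SoQL). This is not Splitgraph's FDW code; `soql_url`, the domain `data.example.gov`, the dataset id `abcd-1234`, and the column names are all placeholders.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def soql_url(
    domain: str,
    dataset_id: str,
    columns: Sequence[str],
    where: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Build a Socrata resource URL carrying a simple SoQL query."""
    params = {"$select": ", ".join(columns)}
    if where:
        params["$where"] = where
    if limit is not None:
        params["$limit"] = str(limit)
    # urlencode percent-encodes the parameter names and values.
    return f"https://{domain}/resource/{dataset_id}.json?{urlencode(params)}"


if __name__ == "__main__":
    # Roughly: SELECT date, cases FROM <dataset> WHERE cases > 100 LIMIT 5
    print(
        soql_url(
            "data.example.gov",
            "abcd-1234",
            ["date", "cases"],
            where="cases > 100",
            limit=5,
        )
    )
```

A real wrapper would also have to handle joins, type mapping, and paging, which this sketch leaves out.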

I like this Ask HN thread, because it’s almost word for word the question we asked ourselves three years ago. How can we make it easy for anyone to sell data?

Reasoning backwards, we realized what needed to exist: a single “database” where anyone can connect or upload a “table” of data.

We talked more about this in 2018, in a presentation [2] emphasizing three core ideals: data should be composable, maintainable, and reproducible. This is kind of like the idea of treating “data as a product” within a “data mesh,” [3] as described by Zhamak Dehghani at ThoughtWorks.

The same ideas apply, whether sharing data within an organization or across the web. The question is, how do we connect data publishers and consumers? The answer should be roughly the same, whether the data consumer is a paying customer or a colleague. Think GitHub.com vs GitHub Enterprise. It’s the same software.

Part of our vision for the analogous Splitgraph.com is a place for developers to make money by selling data with minimal effort. That could be as simple as entering the read-only credentials to an existing database of data they want to sell. Or even (eventually) scraping the data, inserting it into a “virtual database,” configuring billing and calling it a day. It becomes a part of the “global database” that any authorized consumer can query along with all the other data on there.

[0] https://www.splitgraph.com/explore

[1] https://www.splitgraph.com/workspace/ddn?layout=hsplit&query...

[2] https://www.slideshare.net/splitgraph/splitgraph-ahl-talk

[3] https://martinfowler.com/articles/data-monolith-to-mesh.html

This is just too much jargon, too much complexity. Hedge funds don’t care about data meshes. They do care about:

- whether your data has a long history of at least 3 years
- whether it has some predictive value
- whether it can be consumed easily through an API or FTP
- whether it’s cleaned, formatted and accurate
- whether it hasn’t been tampered with historically
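Two of these checks are mechanical enough to sketch in code: the length of the history, and whether older rows were silently revised between two deliveries. The sketch below is one possible approach under stated assumptions, not how any particular vendor or fund does it. It assumes each row is a `(date, value)` tuple, and the function names are made up for illustration.

```python
import hashlib
from datetime import date
from typing import Iterable, List, Tuple

Row = Tuple[date, float]


def history_years(dates: Iterable[date]) -> float:
    """Approximate span between the earliest and latest date, in years."""
    ds = list(dates)
    return (max(ds) - min(ds)).days / 365.25


def fingerprint(rows: Iterable[Row]) -> str:
    """Order-independent SHA-256 digest of a set of rows."""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()


def history_unchanged(old: List[Row], new: List[Row], cutoff: date) -> bool:
    """True if rows dated on or before `cutoff` match in both snapshots."""
    old_part = [r for r in old if r[0] <= cutoff]
    new_part = [r for r in new if r[0] <= cutoff]
    return fingerprint(old_part) == fingerprint(new_part)


if __name__ == "__main__":
    first = [(date(2018, 1, 1), 1.0), (date(2019, 1, 1), 2.0)]
    later = first + [(date(2021, 6, 1), 3.0)]
    print(history_years(d for d, _ in later) >= 3)
    print(history_unchanged(first, later, date(2019, 12, 31)))
```

Versioned snapshots make this kind of comparison possible in the first place; without an older copy there is nothing to diff against.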

Yeah, that’s why we’re building the fabric for vendors to deliver exactly that.

As a vendor, you just worry about acquiring your data and dumping it into your favorite database. Give us the read-only credentials and we’ll take it from there. We handle distribution and give you tools to configure automatic versioning. (Ultimately we can reduce this even further, so you don’t need to maintain a database and can write directly to your virtual Postgres database at Splitgraph.)

As a consumer, you just need to connect to one database and you can query any table you’ve been granted access to, using your existing Postgres client. Or you can use a REST API that exists for every version of every dataset.

We’ve gotta add all the billing, access controls, and some web UI, but otherwise this all works now. We’re a long way from focusing on any marketplace aspect though. We’re prioritizing the intra-organizational use case for now (where they over-rely on jargon like this, btw).

This is super impressive. There is so much one can do with your service, my imagination is running wild.

Can you give some examples of how Splitgraph is used in the real world?