Launch HN: Hubble (YC S20) – Monitor data quality inside data warehouses
125 points by oliver101 2130 days ago
Hey everyone! We’re Oliver and Hamzah from Hubble (https://gethubble.io/hn). Hubble runs tests on your data warehouse so you can identify issues with data quality. You can test for things like missing values, uniqueness of data or how frequently data is added/updated.

We worked together for the last 4 years at a startup where we built and managed data products for insurers and banks. A common pattern we saw was teams taking data from their internal tools (CRM, HR system, etc.), application databases, and 3rd party data and storing it in a warehouse for analysis. However, when analysts/data scientists used the data for reports they would spot something suspicious, and the engineering team would have to manually go through the data pipelines to find the source of the problem. More often than not it was simple things like a spike in missing values because an ETL job failed, or stale data because a 3rd party data source hadn’t updated correctly. We realised that the reliability/trustworthiness of the raw data was essential before you could move on to more interesting tasks like analysis, insight or predictions.

We wanted to do this without having to write and maintain lots of individual tests in our code. So we built Hubble, which connects to a data warehouse and creates tests based on the type of data being stored (e.g. freshness of timestamps, the cardinality of strings, max value of numbers, missing values, etc.). We’ve also added the ability to write any custom tests using a built-in SQL editor. All the tests run on a schedule and you’ll get an email or Slack alert when they fail. We’re also building webhooks and an Airflow operator so you can run tests immediately after running an ETL job or trigger a process to fix a failing test.
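
To make the idea of type-driven tests concrete, here is a minimal sketch of how a column's type could be mapped to a suggested SQL check. It is not Hubble's implementation: the `Column` class, the `suggest_tests` function, the type labels, and the test names are all hypothetical, chosen only to mirror the examples listed above (freshness, cardinality, max value, missing values).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Column:
    """Hypothetical column description: a name plus a coarse type label."""
    name: str
    dtype: str  # assumed labels: "timestamp", "string" or "number"


def suggest_tests(table: str, columns: List[Column]) -> Dict[str, str]:
    """Return {test_name: SQL} with checks chosen by each column's type.

    Identifiers are interpolated directly for readability; real code
    would need to quote or validate them.
    """
    tests: Dict[str, str] = {}
    for col in columns:
        # Every column gets a missing-value count.
        tests[f"{col.name}_null_count"] = (
            f"SELECT COUNT(*) FROM {table} WHERE {col.name} IS NULL"
        )
        if col.dtype == "timestamp":
            # Freshness: the most recent timestamp in the table.
            tests[f"{col.name}_freshness"] = (
                f"SELECT MAX({col.name}) FROM {table}"
            )
        elif col.dtype == "string":
            # Cardinality: the number of distinct values.
            tests[f"{col.name}_cardinality"] = (
                f"SELECT COUNT(DISTINCT {col.name}) FROM {table}"
            )
        elif col.dtype == "number":
            # Max value, to be compared against an expected bound.
            tests[f"{col.name}_max"] = f"SELECT MAX({col.name}) FROM {table}"
    return tests
```

Calling `suggest_tests("trips", [Column("started_at", "timestamp"), Column("city", "string")])` would return four queries: a null count for each column, a freshness query for `started_at`, and a cardinality query for `city`.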

Instead of asking users to send their data to us, the tests are run in the data warehouse and we track the test results over time. Today we support BigQuery, Snowflake and Rockset (which lets us work with MongoDB and DynamoDB) and are adding more on request.

We’re planning on charging $200 a month for a few seats, and $30-50 for extra users after that.

We’re still at an early access stage but want the HN community’s feedback, so we’ve opened up access to the app for a few days. You can try it out here: https://gethubble.io/hn. We’ve added a demo data warehouse you can start with that has data on COVID-19 cases in Italy and bike-share trips in San Francisco. Thanks and looking forward to hearing your ideas, experiences and feedback!

10 comments

Customer here (comment not solicited!). We've been trying out Hubble for a month or so and it's looking really promising.

I love the idea of being able to outsource the creativity/problem solving of predicting things that could go wrong with our data to a service that specialises in just that, and I can totally see how they can automate this in a big way as they grow.

How does Hubble compare to Great Expectations or DBT for pipeline testing? It looks like there's more emphasis on automated profiling than on "having to write and maintain lots of individual tests", and obviously Hubble being a SaaS offering is the big difference?

Also any plans to profile and test file-based stores as well? There's a lot that can go wrong in a pipeline before data even reaches BigQuery or Snowflake, and you may help your customers save money if you could profile data in S3 before it goes through a potentially expensive transform process.

Best of luck, though! Data testing is a very real need in most data organizations I've been in, and I'm glad more and more tools seem to be popping up recently to help with it.

Thanks! We love DBT and take a lot of inspiration from their work. We’re putting a lot of effort into suggesting the right tests based on the data types, sources, and field names. A lot of these tests are pretty repetitive to write so we want to make it easy to spin them up.

We’ve also found that keeping a history of the state of the warehouse over time is really useful context for determining whether a test has failed (example: this table tends to update every 30-40 minutes so we’ll set a threshold at an hour).
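
As a rough illustration of deriving a freshness threshold from history, the sketch below takes the longest observed gap between updates and pads it by a safety margin. The function names and the 1.5x margin are assumptions made for this example, not how Hubble actually computes thresholds.

```python
from datetime import datetime, timedelta
from typing import List


def freshness_threshold(update_times: List[datetime],
                        margin: float = 1.5) -> timedelta:
    """Longest observed gap between consecutive updates, padded by `margin`.

    The 1.5x default is an arbitrary illustrative choice.
    """
    ordered = sorted(update_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two update times")
    return max(gaps) * margin


def is_stale(last_update: datetime, now: datetime,
             threshold: timedelta) -> bool:
    """True when the table has gone longer than `threshold` without an update."""
    return now - last_update > threshold
```

With updates arriving 30 to 40 minutes apart, the longest gap is 40 minutes, so the default margin gives a threshold of one hour, which matches the example in the comment above.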

We also handle the scheduling, which is surprisingly annoying to manage (we built a couple of internal tools for this in the past). That’s something we really missed with Great Expectations (you get this with DBT Cloud). Testing files is an interesting use case; to an extent we support this using Athena or BigQuery external tables for JSON/CSV/Parquet. We’re intentionally limiting it to SQL for now.

Very interesting tool, I am trying to do this with Dataform/Looker, and feel like some kind of inference like below would be great.

> this table tends to update every 30-40 minutes so we’ll set a threshold at an hour

Can you achieve these tests with metadata or do you need 100% read access to the database?

I also wonder if this would work as part of an Analytics Engineering CI/CD process? Something like how dbt Cloud will block pull requests that fail certain criteria.

Metadata is a valuable place for finding information like load times, rows inserted / updated. Currently we just rely on read-access and raw SQL. A common way users are doing this now (and we are internally for our analytics data) is using, for example, the Fivetran logs table to monitor ingestion times and inserted rows, rather than querying the raw tables.

For CI/CD, we absolutely want to support this, as well as stopping/conditional execution in DAGs (e.g. Airflow). We’re launching webhooks very soon.

this is interesting! running tests on data is certainly a pain point for me, and there doesn't seem to be nearly the kind of infrastructure available as for, say, tests for code functionality.

Is this open source? Sending my data to a third party is a no-go, as is having a third-party connect to the database. Something part of a managed hosting service, though, or an add-on to an existing trusted hosted service that has gone through compliance (e.g. Heroku, AWS), would be more palatable.

This was the same pain point we had when we saw how good the tools were for testing our software vs our data.

It's not open source but we can deploy on-prem (or cloud-prem more accurately) pretty easily. We’re also going to set up as an add-on available through AWS Marketplace. Feel free to shoot me an email if you want to see if this can work for you: hamzah[at]gethubble.io

Running a full table scan on BigQuery every hour can get quite expensive. Do you support some sort of deltas?

I signed up. Unlike the video, I do not see Redshift as an option. Any idea when Redshift will be supported?

How does billing per user make sense here? What prevents me monitoring thousands of tables under single user? Your workload costs will be higher than $200 here, no?

Do you have a set of fixed IPs you're connecting from to allow me to whitelist you?

Full table scans can get expensive. We’re adding support for incremental tests, so for append-only tables you’ll only test the recent rows. This is especially useful if you use partitioned tables in BigQuery.
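
A minimal sketch of what an incremental test on an append-only table could look like: only rows past a stored watermark are scanned, and the watermark advances after each run. SQLite stands in for the warehouse so the example is runnable; the function name, the use of an increasing id column as the watermark, and the null-count check are assumptions for illustration. On a partitioned BigQuery table the filter would more likely be on the partition column.

```python
import sqlite3
from typing import Tuple


def incremental_null_count(conn: sqlite3.Connection, table: str, column: str,
                           id_column: str, watermark: int) -> Tuple[int, int]:
    """Count NULLs in rows appended after `watermark`.

    Returns (null_count, new_watermark). Identifiers are interpolated
    directly for readability; the watermark is passed as a bound parameter.
    """
    row = conn.execute(
        # `column IS NULL` evaluates to 0 or 1 in SQLite, so SUM counts NULLs.
        f"SELECT COALESCE(SUM({column} IS NULL), 0), "
        f"COALESCE(MAX({id_column}), ?) "
        f"FROM {table} WHERE {id_column} > ?",
        (watermark, watermark),
    ).fetchone()
    return row[0], row[1]
```

Each run passes in the watermark returned by the previous one, so rows that were already tested are never scanned again. If no new rows have arrived, the count is zero and the watermark stays where it was.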

Actually in the first version of the product we automatically tested every column in every table. The tests are more selective now, which is partially due to cost and partially because nobody wants to navigate through 10,000 tests.

Redshift will be supported this week! We have a list of new sources to get through and it’s right at the top. We’ve been emailing over the IP for whitelisting but we’ll add it to the connection page too.

As for pricing, we’re experimenting. Our costs do scale with number of tests (more scheduled tasks, more historical results stored). At the moment we retain the last month or so of test results, which is manageable for pretty large workloads.

Looking forward to Redshift!

BTW, you don't need to navigate 10K tests... you only need to navigate the failing ones.

Co-founder of intermix.io here (which we sold in March). We came more from the performance monitoring angle (specifically for Redshift), but then shifted to a product that works horizontally across all warehouses, to track usage, workflows and user engagement. "Shift to Data Products" was the narrative we started using in Q4 2019. If you read the copy on the current intermix.io website, I think you'll find yourself nodding. (FYI - we got bought by a small PE Fund that is rolling the product into Xplenty, an ETL product).

My experience is that monitoring data quality is still an under-appreciated discipline. I've found that most teams still have a "not invented here" mentality, or don't even know they have the problem! That can lead to an "oh, we can just fix it when it happens" type of mentality. But your timing may be better than ours - we started back in 2016.

I haven't played with your product (yet), only took a look at this thread and your website. Some observations:

- SQL Editor - big plus! I think giving your users a space where they can take action is a super value-add, we didn't have that.

- nice work running the tests inside the customer's warehouse. That has two benefits for you. 1) you're not incurring the cost to crunch the metadata, it can get quite expensive, depending on the number of tables in the warehouse. 2) you're avoiding data access issues, getting access to the warehouse was always a hurdle, even though we only needed access to the system tables.

- pricing model. I think the per-seat model is the way to go. We tried charging by number of rows, and size of the warehouse (number of nodes), but then you run into weird situations with customers who are dealing with huge historic datasets, but really only look at the last 30 days of data.

My unsolicited $0.02 is that you think hard about distribution. I think you want to think about hitching your wagon to the cloud marketplaces, and Snowflake's marketplace. For example, attaching themselves to Snowflake is what made all the difference for Fivetran.

I have a bunch more scars that I can share if you care to know them :-)

Fantastic blog post, thanks for sharing.

So I guess if you had to pick arbitrary revenue/data/fte cutoffs, do you see the org chart of these adopters as you’ve described looking a certain way? Let me try to rephrase that.

Do you think there’s a step function of “here you need one DBA who is a holy librarian” and “here we need a gitlab styled data team with SLAs and the data equivalent of HR business partners who get assigned to the BU”?

Tangential to your comment but curious if you believe the human side scales akin to the infrastructure side.

Where is the blog post?

> My experience is that monitoring data quality is still an under-appreciated discipline.

We agree with this a lot, we found there are often a lot of unknown unknowns that drive data issues, and a lot of teams aren’t sure of where to start. It’s why we’re spending so much time on trying to make relevant tests in Hubble that are easy to set up and use (and then let users create custom tests once they get the hang of it).

Great point on the distribution, we do think being close to the data warehouses is really important for us, most teams already have one set up, but don’t know if what’s inside it is correct or useful. We’re looking to get set up on their marketplaces soon!

It sounds super relevant, we’d love to hear more - you can get me at hamzah[at]gethubble.io

Awesome - just followed up on your ping!

> we got bought by a small PE Fund that is rolling the product into Xplenty

I'm interested to hear more about your experience building data warehouse related products, and perhaps learnings you've had along the way. I guess selling to PE wasn't the initial goal, but I'd imagine your product is very well suited to the Redshift space.

I've been working on Snowflake related products, and their adoption speaks to a world of new problems being created, similar to your product with Redshift. I suppose the risk is being squashed by Snowflake building the feature, or businesses migrating to something new (perhaps Redshift products have suffered because of Snowflake)

Basically, what do the battle scars look like :D

there are always things the warehouses can't build themselves.

For example, with intermix.io, it was the tracing we had built for other tools like Looker and dbt. The insight was that the result of a DAG involves many different calculations across different tables. The metadata only tells you that the steps happened, but doesn't tell you in what sequence they happened, or where the "hiccup", latency, etc. occurred.

Redshift is clearly suffering from Snowflake. I wrote about that in my post-mortem. That post also has a few battle scars:

https://medium.com/@larskamp/why-we-sold-intermix-io-to-priv...

Ping me on LinkedIn if you care to hear more :-)

Cheers, appreciate it, will ping you on LI!

FYI: Snowflake seems to be a commercial marketplace that lets users download data sets (weather, marketing etc) and presumably lets people upload their data sets

https://www.snowflake.com/data-marketplace/

I assume there is an open version that's really good but less cool

Have you considered picking a different name? Searching for "Hubble" for whatever reason is going to return millions of irrelevant results for your customers.

I can't think of a worse name for SEO purposes. You'd have to fight through a well loved and well known space telescope, the astronomer it was named after, and Hubble contact lenses, which has raised ~74MM.

If a customer is looking for you specifically, they will find you (e.g. "hubble data" as stated above). If they are looking for a "data quality monitor" then the SEO will need to reflect that. The name is largely irrelevant at that point, it's merely a moniker.

In the grand scheme of problems a new company has, this is so trivially minor that I can't fathom this having any tangible effect on the success of a company. It's one thing if there's another data warehousing company called "hubble", but that's not the case you're making.

Hubble data brings up, as I would expect, data from the Hubble Space Telescope. Not one of the first page of results points to anything else but HSTS information.

The product literally just launched -- give it a few weeks, it'll show up.

I don't know who's advising you on SEO, but you will not ever outrank STSCI, NASA, ESA, AWS Open Data's HSTS archive, The Planetary Society, the National Academy of Sciences, or the ESO on "hubble data" as long as Hubble is still what people think of when they hear Hubble. The telescope and related sites/agencies/organizations have a 22 year head start building a relevant link profile in Google. And if you did, Google would get suspicious.

Hubble is fine as a name if you pick the right keywords to target in your marketing, but "Hubble data" is never going to show a link to something that isn't at least tangentially related to the telescope.

> Search for "hubble"

> See irrelevant results

> Search for "hubble data"

Problem solved. People are smart enough to modify their search if the initial results are about telescopes and not data pipelines.

One of my clients had a similar name to a global pizza chain. It hasn't been an issue at all, besides having to hear the same pizza puns over and over.

Yeah we called this project hubble long before we were worried about SEO.

Actually, the name does relate back to Edwin Hubble. We previously worked together on an internal data tool called Telescope (it was used for annotating medical images for computer vision). The telescope project slowly evolved into the product we have today. So we changed the name to our favourite telescope. I have a fondness for the Hubble telescope: there was a huge poster of it on the way into the computational physics dept. and it takes me back to the grad school days!

The main thing is to be mindful of keywords you target. Don't do as another commenter suggested and target hubble data[0] unless you apply what you make to actual Hubble data. Like AWS did with its Open Data thing that comes up for that keyword.

The telescope is older than the web and is what every single person on the planet with some access to space-related media thinks of when they think of Hubble. Think long tail, not one or two keywords. Hubble data is out unless you go with a telescope-related project, but you already rank indirectly for hubble data warehouse.

[0] https://news.ycombinator.com/item?id=24229880

As the person you may be referring to, I'd like to clarify that I was not in any way suggesting they target "hubble data." It was just an example of how a user might modify their search if they were looking for this company but found telescope content instead.

There's no sense in doing SEO for your company name, unless you're at the point where competitors are trying to outrank you for your own company name. (Which is a pretty good tactic, actually: https://www.gkogan.co/blog/alternative-pages/.) So don't target "hubble," don't target "hubble data," don't target "hubble the YC company I saw on HN a while back," don't worry about it. Try and catch the people searching for use cases or solutions instead.

>> "As the person you may be referring to"

Nope. I was referring to the person I replied to who believed that it would rank for this keyword in a few weeks.

Yeah it immediately brings to mind https://hubblestack.io/

I'm sure Jobs and Woz heard similar...

Yes of course, because of how important search engine optimization was in 1976. Nothing has changed in the business environment between now and then.
I signed up and I think the concept is promising. It was very easy to add a couple of tests. The SQL interface is handy and convenient, but sometimes still limited. It would be good to add support for some custom scripts (e.g. Python, R). Another important thing for my team would also be seamless integration with other tools (e.g. email, SMS, Slack) to notify the team about the failed test(s).

+1 for alleviating data scientists/engineers of boring, repetitive manual tasks and empowering them to focus on the more challenging stuff

What does the tech stack look like?

Is there any caching for those situations where you may read the same historical data over & over?

Yes, we store the historical value of each test so you can always scroll back through time and see the state of the data warehouse at any given point.

For example, if you have a test that counts the number of rows "COUNT(*)" - that value will be recorded. So you can look back an hour/day/week and see how many rows the table had without executing any SQL. These values are stored in a time series db, so querying history is fast.
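
To illustrate looking up a past test value without re-running any SQL, here is a small in-memory sketch of a result history with an "as of" lookup. The `ResultHistory` class and its methods are hypothetical; the comment above only says that values are kept in a time series database, so this stands in for that store rather than describing it.

```python
import bisect
from datetime import datetime
from typing import List, Optional


class ResultHistory:
    """Hypothetical in-memory store of one test's results over time."""

    def __init__(self) -> None:
        self._times: List[datetime] = []
        self._values: List[float] = []

    def record(self, at: datetime, value: float) -> None:
        """Insert a result, keeping the history sorted by time."""
        i = bisect.bisect_right(self._times, at)
        self._times.insert(i, at)
        self._values.insert(i, value)

    def value_at(self, at: datetime) -> Optional[float]:
        """Most recent recorded value at or before `at`, or None if none exists."""
        i = bisect.bisect_right(self._times, at)
        return self._values[i - 1] if i else None
```

If a row-count test recorded 100, 120 and 150 at 09:00, 10:00 and 11:00, then `value_at` for 10:30 would return 120 using only the stored history, with no query sent to the warehouse.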

Our tech stack: monolith backend in Python + Postgres + React. The tests themselves are all SQL queries and run in the data warehouse.

Do you have/think you need an on-prem version?

Yes we can run the whole stack on-prem. We realised very early that on-prem would be needed for many users. So we've made it easy to spin up Hubble in a k8s cluster in your cloud or on bare metal.