Hacker News
by physcab 2092 days ago
I've used Snowflake a fair amount. It's a decent product, probably on par with Redshift / BigQuery. Obviously there's a lot of hype and free money floating around, but my take on why they are popular is that they are basically a replacement for large Hadoop installations that have become untenable to manage over the past decade. If a company is already using Redshift or BigQuery, I'm not sure why they would switch.

I would be apprehensive in investing in Snowflake long term purely because their product is highly susceptible to being obsoleted in the next 5-10 years.

7 comments

I was at a company that switched from Redshift to Snowflake. It was a night and day difference. Faster (orders of magnitude!), cheaper, and significantly easier to work with (since everyone had their own personal view of the data to mutate/work with).

As far as I can tell, it is a unique product in the database space. Extremely well executed ideas and design.

Snowflake seems like a unique product and I can only imagine the complex math they're doing under the hood to achieve these incredible query times. MemSQL is the only real competitor I know of. Redshift is a lot less user friendly (constant need to run vacuum queries). Parquet lakes / Delta lakes don't have anything close to the performance.

Predicate pushdown filtering enabled by the Snowflake Spark connector seems really promising. Lots of companies are currently running big data analyses on Parquet files in S3. Snowflake has the opportunity to grab a huge slice of the big data market.
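
To make the predicate pushdown idea concrete, here is a toy Python sketch. It assumes data is stored in chunks that carry min/max statistics (as Parquet row groups do), so a filter can skip whole chunks without reading them. `RowGroup` and `scan_with_pushdown` are made-up names for illustration; this is not how the Snowflake Spark connector is actually implemented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of column values plus min/max statistics kept alongside it."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        # Statistics are computed once when the chunk is built, not per query.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_with_pushdown(groups: List[RowGroup], lower: int) -> Tuple[List[int], int]:
    """Return the values >= lower and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # Pushdown: look at the statistics only and skip chunks that cannot match.
        if group.max_value < lower:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v >= lower)
    return matches, groups_read


groups = [RowGroup([1, 5, 9]), RowGroup([10, 14, 19]), RowGroup([20, 25, 31])]
matches, groups_read = scan_with_pushdown(groups, lower=18)
print(matches, groups_read)
```

Only two of the three chunks are touched here. The less data the storage layer has to hand back, the bigger the saving on a large table.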

What kind of math is involved in building a faster database? Genuinely curious. I would guess maybe linear algebra, indirectly.
Not at all. I'd highly recommend CMU's 15-445/645 Intro to Database Systems course (sponsored by Snowflake lol) because they put all their lectures online on YouTube [1]! Here's what's involved in making fast databases from the syllabus [2]:

This course is on the design and implementation of database management systems. Topics include data models (relational, document, key/value), storage models (n-ary, decomposition), query languages (SQL, stored procedures), storage architectures (heaps, log-structured), indexing (order preserving trees, hash tables), transaction processing (ACID, concurrency control), recovery (logging, checkpoints), query processing (joins, sorting, aggregation, optimization), and parallel architectures (multi-core, distributed). Case studies on open-source and commercial database systems are used to illustrate these techniques and trade-offs. The course is appropriate for students that are prepared to flex their strong systems programming skills.

[1] https://www.youtube.com/playlist?list=PLSE8ODhjZXjbohkNBWQs_...

[2] https://15445.courses.cs.cmu.edu/fall2020/syllabus.html

Oof... CMU courses directly sponsored by Snowflake. Gross.
Please elaborate? I can see a lot of ways a sponsored course could go badly, but I can't immediately see which ones apply here.
>I can only imagine the complex math they're doing under the hood to achieve these incredible query times

Maybe it's cynical/paranoid, but in this age of Theranos I must ask: is it possible their algorithm excels at showing you a reasonable-looking number, rather than an accurate one?

It's SQL, if they were giving wrong answers people would notice.
It's not too terribly difficult to load test Snowflake to get a sense of scaling. JMeter does the job well. Heck, I can pass you along some sample projects I've done against them if you really wanted.
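
JMeter is the tool named above; for a rough idea of what such a load test measures, here is a minimal Python sketch instead. It fires a fixed number of queries at a given concurrency and records each latency. `fake_query` is a stand-in that just sleeps, and `load_test` is a hypothetical helper; a real test would swap in a call through your warehouse driver, which is not shown here.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median
from typing import Callable, List


def timed_call(run_query: Callable[[], None]) -> float:
    """Run one query and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start


def load_test(run_query: Callable[[], None], concurrency: int, total_queries: int) -> List[float]:
    """Issue total_queries calls with at most `concurrency` in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_call, run_query) for _ in range(total_queries)]
        return [f.result() for f in futures]


def fake_query() -> None:
    # Assumption: stands in for a real warehouse query; sleeps to simulate latency.
    time.sleep(0.01)


latencies = load_test(fake_query, concurrency=4, total_queries=8)
print(f"median latency: {median(latencies):.4f}s over {len(latencies)} queries")
```

Rerunning with increasing `concurrency` and comparing the latency distributions gives a first sense of how the system scales.
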
Yeah, Redshift is not at all comparable to Snowflake. BigQuery is much closer: it's ahead in some areas, and in the last year it has closed some of the gaps where it wasn't. BigQuery's biggest problem is that it's tied to GCP, which is a distant third in cloud market share. They have BigQuery Omni coming, which is multi-cloud, but it'll probably be a while before it's comparable to BigQuery in GCP.
The other problem with BigQuery is that you can very easily write a query that's going to cost you a lot of money to run - with Snowflake you can let it run for an hour or so, and then realise it was a bad idea and you're only out a few credits, a handful of dollars.

The killer feature for me was the query profiler - you can see WHY a query is taking a long time and optimise it - BigQuery just felt like Google were brute forcing the performance, and then charging you accordingly.

When the project I was on switched, the micro-clusters (and the ability to recluster a table) as well as the MERGE semantics beat BigQuery hands down - although those features may be out of beta now (but I've moved on to a new gig).

That's also a problem that it'd be fairly straightforward for Google to solve by automatically spinning up smaller, entirely separate serving clusters for customers who are worried about such a blowout (for a fee, obvs). It's just the serving tree (+ whatever in-memory storage service they use to do distributed joins nowadays), no need to duplicate the rest of the service. The caveat is, a smaller cluster will favor query optimizations specific to that smaller cluster. Some of those "small cluster" optimizations could hurt query performance when deployed against BQ proper with its tens of thousands of workers.

Also, BQ does explain the query plan to some extent: https://cloud.google.com/bigquery/query-plan-explanation. Not quite at the level of a "regular" SQL DB, but it does give you some info to work with when optimizing queries. If you haven't used it in a while I'd give it another try.

I believe this is exactly what slot reservations in BigQuery achieve. Instead of paying on-demand pricing that is determined by data read, you purchase a fixed number of “slots” that are shared by queries running within that particular project.
Ah OK, after reading their docs I see they've changed what "slots" used to mean in Dremel (internal version of BQ). It used to be that slots _guaranteed_ capacity, but did not limit it. Meaning that you could rely on having a certain number of workers in the cluster when you issue a query, but if Dremel had more it'd give you all it's got. Obviously this is not viable when people have to pay per terabyte read, because a ton can be read.

What they have now strikes me as an even better solution to the problem of bankrupting someone with a query. Not sure how pricing compares to Redshift et al., but pricing is the easiest thing for Google to change.
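
A rough way to compare the two billing models discussed above: on-demand cost scales with bytes read, while a slot commitment is a fixed spend regardless of how much is read. The sketch below uses placeholder prices (`price_per_tib`, `price_per_slot_month`) that are assumptions for illustration, not Google's actual rates.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def on_demand_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Cost of queries billed purely by the amount of data read."""
    return bytes_scanned / TIB * price_per_tib


def flat_rate_cost(slots: int, price_per_slot_month: float) -> float:
    """Fixed monthly cost of a slot commitment, independent of data read."""
    return slots * price_per_slot_month


def break_even_tib(slots: int, price_per_slot_month: float, price_per_tib: float) -> float:
    """Monthly TiB scanned at which both billing models cost the same."""
    return flat_rate_cost(slots, price_per_slot_month) / price_per_tib


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
one_bad_query = on_demand_cost(50 * TIB, price_per_tib=5.0)
monthly_commitment = flat_rate_cost(100, price_per_slot_month=20.0)
threshold = break_even_tib(100, price_per_slot_month=20.0, price_per_tib=5.0)
print(one_bad_query, monthly_commitment, threshold)
```

With a commitment, an accidental full-table scan gets slower rather than more expensive, which is the protection against a runaway bill that the thread is after.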

BQ slots let you do essentially that (pre-commit to a particular cluster size).
I was hitting some rough edges / complexity with BigQuery's MERGE recently, but wasn't able to ascertain any significant difference with Snowflake by scanning their docs briefly -- what aspects of the MERGE semantics are better in Snowflake in your opinion?

Wondering if this is a somewhat new feature in BQ since you used it, or if there's still a feature gap here (e.g. see https://cloud.google.com/blog/products/gcp/performing-large-...).

BQ has per-project and per-user cost controls. Normally when running new large queries one would run them under a special user with a limit on costs.
I think the obsolescence issue is complicated.

I recently saw a criticism of Palantir which went: "The company has largely succeeded, they say, not because of its technological wizardry but because its interface is slicker and more user friendly than the alternatives created by defense contractors."

A lot of the most successful tech firms started post-dot-com are decent interfaces to not-particularly-revolutionary databases. In high-end consulting and investment banking, appearances are hugely important. You can't have trash decks. It's unsurprising to me that the same is true in defense and intelligence. You can get a roof over your head and breakfast at a trashy motel or the Ritz. Everybody knows the Ritz can command a much higher price because "its interface is slicker and more user friendly than the alternatives."

I think the same thing is true here.

The Ritz has far better beds, cleaner & safer rooms, better food and is far more likely to deliver that consistently. It's not just the appearance.
A closer reading will reveal that I'm not talking about superficial appearances, but the interface. That's an important distinction.

When I start talking about the Ritz and high-end consultants, I'm discussing the interface, which of course includes the "far better beds, cleaner & safer rooms, better food..." and consistency you're trying to contrast with appearance. I would agree that those things are more than superficial and are extremely important to the experience of the user, because that's exactly the point I'm making.

The beds and concierge are nicer at the Ritz, and the interface (note: not appearance) and support are better at Palantir (or, as we're discussing here, at Snowflake).

Maybe your Ritz experiences have been different than mine, but IMHO all hotel rooms are concrete boxes with a facsimile of home stuffed inside them, copied and pasted as many times as local demand will merit.

Hotel restaurants are the same principle, except replace furnishing with food.

Stay at an aging Courtyard Marriott. Some boxes are nicer than others.
I've stayed at everything from a Motel 6, to Courtyards / Residence Inns / Sheratons between NYC and San Diego, to Four Seasons / Ritz Carltons.

I stand by my claim. The relative differentiation in niceness is swamped by their mass produced boxness.

Ironically, my favorite road chain tends to be Aloft. At least they're upfront about their capsule-esque nature, in a sort of ironic/not-ironic way?

Least favorite: Embassy Suites. shudders It's like every Disney vacationing family's fantasy about what a hotel should be... packed with every Disney vacationing family. Omelette?

The point of hotel chains, and chains in general, is the consistency of the mass-produced experience. I can walk into a DoubleTree hotel anywhere in the world and get the same welcome cookie. It's a positive, not a negative; people often enjoy knowing what they're going to get. If you prefer a more unique experience, which is perfectly understandable, then simply avoid chains perhaps?
Totally get your point of view, and I share it in vacation contexts. As the hotel chains have consolidated, they slice pennies everywhere.

When I'm traveling for business or putting my head on a pillow on a road trip, consistency makes my life easier and less stressful. I'm a gorilla-sized person :), and I would rather stay at a higher-end hotel that provides an actual bath sheet than a Marriott whatever where I have to call for 6 towels. Surprises aren't delightful at 10PM when you've been on the road for 15 hours.

I got eaten up by gnats (they claim not bed bugs) over a week at a particularly nice hotel. On the plus side, nothing came home with me, the bites healed, and they gave me enough "points" as compensation to cover a luxury hotel in Barcelona for 2 weeks. So... Future Self can look back on the experience with a smile.
Nothing ever gets obsolete once it gains a large foothold in the enterprise space. There's a reason why Oracle and IBM are worth what they are today.
> Nothing ever gets obsolete once it gains a large foothold in the enterprise space.

Lotus? Delphi?

Both still in very heavy use. In 2014, anyway, every single IBM employee had to keep a Lotus Notes window open. It was hellish.

Dunno if that's changed since Red Hat took them over.

I used Lotus Notes as recently as 2010, and I am pretty sure it's still going strong at my megacorp former employer.
Lotus is all over in government and insurance. As a mail client it is mostly dead, but the apps live on.
There is a reason, but it ain’t because of their cloud databases...
Novell, WordPerfect
WordPerfect was used in certain industries (legal especially, I think) long after it started dying everywhere else. I don't think it's an exception to this rule.
Yes but it’s dead now.
There was a post maybe two weeks ago from Tavis Ormandy (a tweet) that made the HN front page, about how he uses WordPerfect:

Tavis Ormandy (@taviso) Tweeted: @mkolsek Funny you should mention that, I was recently curious if there are any console word processors. I discovered there's a community who still use WordPerfect 5.1 for DOS. They kinda sold me on it, got it working in DOSEMU. https://t.co/t6j0c1G3w1

WordPerfect still has some users.

Last year we recruited an attorney from a firm that still uses WordPerfect for all their documents.

My school district still runs on ZENWorks.
At the end of the day, all the data warehouses run on SQL, with a bit of customization around ingestion and export. Most of them are backed by object storage (S3/GCS) and those integrations look very similar.

I wouldn't be that worried about lock-in or being made obsolete. Business logic is going to be pretty easy to port between Redshift, BigQuery, Snowflake, or whatever comes next.

> going to be pretty easy to port between Redshift, BigQuery, Snowflake, or whatever comes next.

This isn't even remotely true. Each has unique SQL syntax, and once you have a few hundred or thousand queries written using vendor-specific SQL (be it date functions or JSON), it is non-trivial to migrate.
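
As a small illustration of the date-function point, here is a sketch that renders "add N days to a date column" per dialect. The function name `add_days_sql` is hypothetical, and the syntax reflects each vendor's documented function as best I can tell (`DATE_ADD` in BigQuery, `DATEADD` in Snowflake and Redshift); check the current docs before relying on it.

```python
def add_days_sql(dialect: str, column: str, days: int) -> str:
    """Render an 'add N days' expression for the given warehouse dialect."""
    if dialect == "bigquery":
        # BigQuery takes the date first, then an INTERVAL clause.
        return f"DATE_ADD({column}, INTERVAL {days} DAY)"
    if dialect in ("snowflake", "redshift"):
        # Snowflake and Redshift take the date part first and the date last.
        return f"DATEADD(day, {days}, {column})"
    raise ValueError(f"unsupported dialect: {dialect}")


for name in ("bigquery", "snowflake", "redshift"):
    print(name, "->", add_days_sql(name, "order_date", 7))
```

One expression is easy to translate by hand. The migration cost comes from every date, JSON, and array function across hundreds of queries needing the same treatment.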

> Most of them are backed by object storage (S3/GCS)

Redshift is backed by worker instances that have their own stores in what's basically an EC2 instance. It's definitely not backed by S3 like Athena.

BigQuery and GCS are both built on top of Colossus, but they have different layers in between them.

With the newer Redshift RA3 instances you use S3-backed storage with local SSD caching.

https://aws.amazon.com/redshift/features/ra3/

Same applies to Teradata Vantage on cloud.
Sorry, probably should have been more precise. Meant to say: most users are going to interact with the warehouses via object storage for import and export of data.

Since the object store APIs are almost identical across platforms, it doesn't matter that much which warehouse you actually use for production work. It's something that does massive SQL, imports data from S3, and exports data to S3.

> most users are going to interact with the warehouses via object storage for import and export of data.

No, most are going to be using SQL IDEs to query and export data.

> I would be apprehensive in investing in Snowflake long term purely because their product is highly susceptible to being obsoleted in the next 5-10 years.

This can be said about most products and companies. What keeps them alive is how robustly they capture (and hold on to) the market, reduce costs through economies of scale, and innovate. This specific market is also very rapidly growing.

I would think it wouldn't be the same product in 5-10 years.
Lots of companies have built on top of Snowflake.