The reality is that most people don't want to manage their own Clickhouse store, and not all engineers can operate with SQL as efficiently as with code (me included). Nonetheless, this is pretty cool!
> not all engineers can operate with SQL as efficiently as with code
I don’t mean for this to sound insulting but I honestly do not think this is an acceptable take to have as a developer.
Not knowing SQL is like refusing to learn any language that has classes in it, simply because you don’t like it.
I’ve heard stories of huge corporations failing product launches because some code was written to SELECT * from a database and filtering it in-app instead of doing the queries correctly, and what’s so fun with these types of issues is that they usually don’t appear until weeks later when the table has grown to a size where it becomes a problem.
When you’re saying that you’d rather find the data in-app than in-database, you’re putting the work on an inferior party in the transaction simply because you can’t be bothered.
The code will never* find the correct data faster than the database.
* there may be exceptions, but they’re far enough between to still say “never”.
Dropping down to SQL to write a really complex query is, in my professional experience, always a poor use of time. It's far simpler to just write the dumb for-loops over your data, if you can access it.
Of course not all engineers can operate with SQL as efficiently as code -- that's the whole point. Otherwise why would we be writing code? Learning SQL intimately doesn't change that fact.
> Dropping down to SQL to write a really complex query is, in my professional experience, always a poor use of time.
We’re not talking about Assembly here, “dropping down” to SQL is something that anyone should be expected to do as soon as you’re grabbing or modifying any data from a database in any scenario where performance or integrity matters. The errors you can see in situations like this are extremely complex and databases literally exist to solve them for us.
Also, if we just completely disregard the performance for a second and focus on data security instead, how do you ensure sensitive data isn’t passed to the wrong party if you don’t care about what queries are being sent?
I mean, it doesn’t matter if it’s not “in the end” displayed to an end user in the application you’re writing, or if its not stored in the intermediary node where your code is running, that data is now unnecessarily on the wire in a situation where it never should have been in the first place. If you end up mixing one customers data with another’s and sending all of it in such a way that it could even theoretically be accessed by a third party, that’s a lawsuit waiting to happen regardless of whether it was “displayed” or “forwarded” or not.
Imagine if you sniffed the packets going to some logistics app you use on your phone and you saw meta-data for all packages in your zip code in the response, or if some widget showing you your carbon footprint actually was based on a response containing the carbon footprint of every customer in the database. Even if it’s just [user_id,co2] it’s still completely unacceptable.
Never mind scenarios where you’re modifying, adding or deleting data, those are even worse and no explanation should be necessary for why.
Obviously it greatly depends on what you're doing. If you're using a relational database as a glorified key-value store for offline or batch processing of a few hundred megabytes of data, sure. Hell, just serialize and unserialize a JSON document on every run if it's small and infrequent enough ¯\_(ツ)_/¯
If you've got a successful data hungry web service with a reasonably normalized schema and moderately complex access patterns though, you're not going to be looping over the whole thing on every page load.
Way way way slower though. I just added something to our app that took 600ms for the naive ‘search and loop’ version (and kept getting slower the more items you needed, completely unscaleable) vs the 30ms for the ‘real SQL query’ version. Guess which version got actually committed.
Did your for loop solution include concurrent access by multiple clients?
I highly doubt "engineers can not operate with SQL as efficiently as code"
can implement anything even remotely as robust as what SQL DBMS offer even for basic use cases. Are you mutating data? What will happen if the system crashes in the middle of the mutation? How are you handling concurrent writes and reads?
It's unclear whether you mean that it's simpler to make a query and iterate over the rows to massage the result in your application or to make a query and then iterate over the returned rows to make more single-row queries. (Or perhaps some secret third thing I'm not considering.)
I'll admit I'm a little curious about what exactly you mean here.
There’s a difference between writing olap and oltp sql queries. Hell, in the industry we even have a dedicated role for people who, among other things, write olap queries: data analysts. I’m assuming here that we are talking about writing complex analytical queries.
"Don't want to manage their own" has for so long been a valid excuse but cloud costs haven't been going down for so long - in many cases prices have increased - and hardware keeps getting more badass. In so many cases it's fear speaking.
A decent sized server will host a hugely capable instance that you may not have to think about for years. The scoffing down at DIY has made sense to some degree, but it just works brilliantly keeps getting to be a stronger & stronger case & most just assume reality can't actually work that well, that it'll be bad, and those folks won't always be right.
But in this case we are not even talking about own/rented HW vs cloud. It's self-hosted(even on cloud) vs SaaS softwares!
SaaS, especially in this space, can be *extremely* costly and its cost will scale up quickly as you send more traffic (either willingly or by mistake). Yes, Datadog, NewRelic etc will give you many pre-built and well-thought dashboards and some fancy AI-powered auto-detection thing but they will charge many $$$ for it. Consider that now cost management/analysis tools that were historically focused only on cloud, are now adding the same tooling for costly SaaS solutions!
I understand that many HN readers are skewed towards SaaS solutions, usually because they work at a SaaS shop, but depending on the size of the company, the overhead for managing it internally can totally be worth. There is overhead with SaaS as well...
We just left ours running for months in a Docker container. The volume is external, we just replace container image with new one, it takes 5 seconds to update, and spans are treated ephemeral. We store only 7d of data. We could use S3 but we have no use for that data in the long run.
To be fair, we wanted to get experience on ClickHouse and it's a special database need special attention to details on both ops and schema design.
I'm beginning to sound like a broken record at this point, but if you don't know SQL very well but know how to use GPT-4, you have access to enough SQL to get a lot more done than you might think.
I don’t mean for this to sound insulting but I honestly do not think this is an acceptable take to have as a developer.
Not knowing SQL is like refusing to learn any language that has classes in it, simply because you don’t like it.
I’ve heard stories of huge corporations failing product launches because some code was written to SELECT * from a database and filtering it in-app instead of doing the queries correctly, and what’s so fun with these types of issues is that they usually don’t appear until weeks later when the table has grown to a size where it becomes a problem.
When you’re saying that you’d rather find the data in-app than in-database, you’re putting the work on an inferior party in the transaction simply because you can’t be bothered.
The code will never* find the correct data faster than the database.
* there may be exceptions, but they’re far enough between to still say “never”.