Hacker News new | ask | show | jobs
by wsmith 3492 days ago
I once reduced the running time of a report from 45 minutes to 3 seconds (900x improvement) by moving the code inside the database.

If a programming language wants to stay fast it must eventually become a database. I realize this is an unpopular opinion but popularity is the wrong metric to judge by.

5 comments

576 hours -> 12 minutes (2880)

Admittedly the code I was moving away from was hilariously bad. Threaded custom build of PHP bad. Then again I haven't tried to optimise the code I wrote at all

You are absolutely right about fast code becoming a database, this is simply down to the query planner - it can try and do the least possible work for the data you actually have.

I recently used temporary per-transaction tables (CREATE TEMPORARY TABLE .. ON COMMIT DROP, basically CTEs that persist across statements, and that can be indexed) with json_to_recordset and turned a three-minute ruby ETL background process into a sub-1-second inline call.
CREATE TEMP TABLE is really awesome. Not really related, but I used it at my previous gig to optimize the unit tests. They would previously all use the same database that devs use during development, so scans over some large tables were particularly inefficient, and tests could break when someone modified table contents for some development task.

I implemented a small function where you could list the tables that you were working with, and it would create empty temporary tables with the same name in a namespace that takes precedence over the normal namespace in postgres' search_path, therefore giving you a blank slate to work with for the unit test, while the other tables were still populated with useful data. (Some of these data come from expensive infrastructure scanning jobs, so starting from an entirely empty database was not an option.)

What language were you using to do that?

Just curious - not sure if you were using Ruby like the parent above. I literally just built a library to back ActiveRecord models with temporary tables. It's very useful since you can use AR scopes off of your temporary table then (I had to do this instead of using CTE's because my DB currently doesn't support CTE's)

https://github.com/ajbdev/acts_as_temptable

Just thought I'd share it in case it helps, or to hear if anyone else has a better approach to this. (warning: it's very experimental and not production tested yet). I couldn't find any other gem that would do this.

Cool idea.

How did you resist the temptation to call it temptation?

Although I think temptable is almost the opposite of contemptible, therefore also good going in the pun department.

My code is all Ruby, and I ended up pushing all the work into SQL and just eventually selecting with find_by_sql into existing models for the results. There is possibly concurrent invocation vs updates and a race against itself, so it's also all wrapped with a serializable transaction and starts with a mutex lock on the controlling model.

Hah! I was sitting around trying to think of a clever name, but then I got tired of sitting around and just went with temptable. Had I thought of temptation I would have gone with that.

The approach I went with works really well for cases when you want to persist the temporary table through the end of the request (it works good for master/index type views that may have a lot of sums/counts and filter options available on it).

One shower idea that I had is the concept of a clojure version that instead of compiling to JVM bytecode, javascript or CLR, compiled to postgreSQL. I think that would be awesome: you could just run the same functions, seamlessly, from the database, through the web server, to the browser. And, whilst of course you need to know about the limitations of the database, it could be great for pushing code to the database seamlessly.

Unfortunately I don't think I have the skills for that, so posting here in the hopes that someone that can likes the idea :)

This is kind of what LINQ does. You write a query in C#. The structure of the query (like an abstract syntax tree) can be exposed to the query provider which can interpret it or compile it to the CLR or compile it to SQL or whatever.
A sufficiently advanced LINQ (http://wiki.c2.com/?SufficientlySmartCompiler) would do wonders in some cases, but I haven't encountered it. AFAIK, LINQ to SQL only knows about joins and some aggregate functions.

One thing that makes improving it cumbersome is that the semantics of many operations are slightly different in the database than in C#. For example, SQL Server doesn't do NaNs and infinities, supports a zillion string string collations, and its date types may not 100% map to C# ones.

Also, databases may run stored procedures faster than the SQL that LINQ genrerates on the fly because they can cache their query plans (most databases will detect that an on-the-fly query is equal to one run recently, though, so this may be somewhat of a moot argument)

Which can lead to great, or less ideal results.. Seeing some of the resulting queries via monitoring are wild though.
At work we've got a LINQ query that gets passed around a few functions, eventually getting into a core of 60 lines of LINQ logic. Colleague verified that chaining selects produces different output, but gets ran at the same speed (Chain selects in order to somewhat declare variables, Select(x => new { x, y = <query-x> }) then you can Select(xy => new { use xy.y multplie times }))

Sometimes I think I should just be using sql.. (which we do on other projects)

Yeah, I've been pretty happy without ORM in node.js, I even wrote a semi-nice wrapper so I could turn template strings into parameterized queries. Made writing a bunch of migration scripts a cakewalk.

Sometimes it's really just easier to write SQL directly.

Do you mean something like Datomic? It has been intriguing me for a while...
Hence Kx, ZODB.
I'm not too familiar with ZODB, but it looks like it tries to impose OO on a database, while in reality relational model works best with data, so making programing language able to interact with data that way would be better. I think something like JOOQ[1] is close to that.

In order to get good performance you want to minimize number of back and forth requests over network. So instead of making a request to obtain list of items, and then fetching each of the item one by one (so called N+1 select issue) you will get a better performance if you make the database send only the data you want, nothing more, nothing less.

[1] http://www.jooq.org/

> I'm not too familiar with ZODB, but it looks like it tries to impose OO on a database, while in reality relational model works best with data, so making programing language able to interact with data that way would be better.

ZODB itself is essentially a transactional single-object store, where that "single object" is usually the root node of an arbitrary (possibly cyclic) object graph. Storage depends on the backend, nowadays you'd usually use file storage for development and relstorage for any kind of deployment.

It doesn't have any query language, Python is. There are no indices beyond those that you explicitly create (eg. a dictionary would be a simple index, but other packages provide more complex indices, like B-Trees - the nice thing here is that anything that goes into the DB is fully transacted, which removes a lot of headache that you'd have with other solutions (eg. flat XML files)).

ZODB thrives when your data model is too complex to easily or efficiently map into SQL, when you don't do or need efficient complex ad-hoc queries and when 180 % performance and concurrency (although this was recently improved through the MVCC reimplementation) isn't the highest priority. Since it's pretty good at caching (much better than most ORMs) performance usually doesn't suck.

ZODB is not a database, nor is it fast.
More like a database toolkit. Performance depends on application but was clearly never the top priority. (Can still be pretty impressive in comparison due to the rather effective instance caching)
Yep, the K interpreter is pretty mediocre (it is very small and the language is inherently pretty fast, but the interpreter is otherwise not very advanced¹) but being directly integrated with a very optimized in-memory column-store database means it smokes anything else. Even the massive engineering effort in a JVM+RDBMS simply can't compete with running K directly on memory-mapped files.

1. I believe they added SIMD support a year or two ago and got a massive (2-10× IIRC) speedup.