by blibble 939 days ago
that chart of the "inefficiency of client protocols" tripped my bullshit alarm

the paper is here: https://15721.courses.cs.cmu.edu/spring2023/papers/15-networ...

it's a super-contrived example that's not using any of the functionality of the database and is just using it as "cat"

it's basically just doing cat over localhost. well, what a surprise: if you add a layer of serialisation, of course it's slower than just doing memcpy()
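
To make that comparison concrete, here is a rough sketch, not the paper's benchmark: it moves the same integers once as a single bulk copy (standing in for memcpy()) and once through a made-up text encoding that is parsed back (standing in for a client protocol). The function names and the newline-separated format are assumptions for illustration only.

```python
import struct
import time


def raw_copy(payload: bytes) -> bytes:
    # Stand-in for memcpy(): one bulk copy of the whole buffer.
    return bytes(bytearray(payload))


def serialized_copy(values: list[int]) -> list[int]:
    # Stand-in for a row-oriented protocol (hypothetical format):
    # encode every value as text, then parse every value back.
    wire = "\n".join(str(v) for v in values).encode("ascii")
    return [int(line) for line in wire.split(b"\n")]


def timed(fn, arg):
    # Return the function result and the elapsed wall-clock seconds.
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


values = list(range(200_000))
payload = struct.pack(f"<{len(values)}q", *values)  # 8 bytes per value

copied, raw_seconds = timed(raw_copy, payload)
decoded, ser_seconds = timed(serialized_copy, values)

print(f"raw copy:   {raw_seconds:.6f} s")
print(f"serialized: {ser_seconds:.6f} s")
```

The exact timings depend on the machine, so none are quoted here; the point is only that the second path does per-value work that the first one never has to do.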

if you're using your database to store files... maybe don't do that

3 comments

I know of a DSP engineer who used memcpy as a baseline to compare the speed of a sound filter. I think it is a good measure for first-principles thinking.

There are other things wrong with the talk; for one thing, it takes way too long to get to the point. DuckDB is cool and all, but most of data management is getting the data into the right format and place, handling security and stuff like that, not running some query.

memcpy seems like a reasonable baseline for a function designed to operate on things in memory

not for a database

Man, you should read the original thread, where memcpy was brought up as another example of why netcat is a bad baseline for a network protocol, and I was like, yeah, no, that part at least sort of makes sense, because that is the baseline. Sometimes I don't know why I keep commenting on this website. It is like talking with idiots all day.
Also I know some of these databases. For example, if you use something like MongoDB with its default configuration, it will be slow as molasses. It will send 20 documents over the network (default cursor batch size) and then yield its time to the operating system and wait for further instructions.

If each document is just three small fields, then you have effectively managed to receive maybe a couple of packets before the server gave up. Pitiful.

Change the batch size to maybe 2 or 20 thousand, enable network compression, increase the client read buffer size from its ridiculously low default, and this could start looking more like the data transfer we'd expect.
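
A back-of-the-envelope sketch of why the batch size matters: every batch costs one request/response round trip, so the number of round trips is the document count divided by the batch size, rounded up. The one-million-document result set is a made-up figure for illustration, and `round_trips` is a hypothetical helper, not a driver API.

```python
def round_trips(total_docs: int, batch_size: int) -> int:
    # Ceiling division: one round trip per (possibly partial) batch.
    return (total_docs + batch_size - 1) // batch_size


total_docs = 1_000_000  # hypothetical result set size

for batch_size in (20, 2_000, 20_000):
    trips = round_trips(total_docs, batch_size)
    print(f"batch size {batch_size:>6}: {trips:>6} round trips")
```

With these assumed numbers, going from 20 to 20 thousand documents per batch cuts the round trips from 50,000 to 50. In PyMongo, as far as I know, the batch size is set with `Cursor.batch_size()` and compression with the `compressors` client option.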

For a lot of data science/analytics, what you really want is "cat" of the data.

The database can't always do the data reduction and analysis you want to do quickly, and even in many of the cases where it can, trying to tell it about them in SQL and stored procedures can be pretty gross.

I say this as a huge proponent of SQL, stored procedures, and doing lots of work in the database.