Hacker News new | ask | show | jobs
by orcadk 5145 days ago
There are quite a lot of inaccuracies in this post, as well as hints that don't really mean much.

First up, the non-SQL-injection code doesn't specify the parameter types. As such, each length of monster_class will result in a different execution plan being compiled and cached by SQL Server. Granted, there's only 50 different kinds - but just specify the parameter types to avoid it.

Sure, you could setup a VLAN, but we're lightyears from having that affect our performance at this point.

Parser overhead is not an issue as long as parameter types are specified. The plan will be compiled and cached. The only advantage to stored procedures at this point is reduced network data as well as management.

Page faults - Sure, but for inserts the data will most likely be in memory already as we're writing new data. And if there's enough data, the relevant b-tree pages will most certainly be hot.

Varchars - Granted, avoid them unless you're storing string data. I have no idea where the "hundreds of bytes of of memory" comes from. SQL Server caches all data as pages, exact copies of what's stored on disk. Variable-length varchars have an overhead of exactly two bytes + 2/#varlengthCOlumns in total.

Disk IO - The transaction log is never merged with the main data file. SQL Server uses write ahead logging, writing to the log first while modifying the data pages in memory. Now and then a checkpoint is run, writing all the dirty pages to the main data file. A log is only "merged" with the data file if the server has crashed and recovery is being run automatically during startup.

"This increases the number of rows you can insert before SQL Server succumbs to memory pressure" - Since the article only deals with inserts, why are we talking about memory pressure? You'll get disk contention for sure, but memory won't realistically be an issue unless you're reading as well.

There's an alternative to keeping the potentially long running transaction open - avoiding both SqlBulkCopy and transferring datatables. The BCL has a CommandSet notion that utilitizes a TDS feature that allows batches to be streamed to SQL Server. Unfortunately you'll have to use reflection to get at it (https://github.com/ayende/rhino-etl/blob/master/Rhino.Etl.Co...), but the code has been stable for several releases and is very simple to interface. It basically allows you to create normal SqlCommands and just send a large batch of them at once, all being executed atomically with minimal locking.

Finally, with the micro optimizations being mentioned, I'm surprised he doesn't recommend to "SET NOCOUNT OFF".

Overall, some good points with some inaccuracies. I have a hard time with the closing statement "As you can see, you don’t need to abandon SQL Server to get massive performance improvements.". These are decent improvements, but this is such a simple scenario that I find it pretty optimistic to let this define whether to abandon SQL Server or not. The RDBMS problem is rarely with inserting enough data, but with reading data at the same time while avoiding locking.

3 comments

"and if there's enough data, the relevant b-tree pages will most certainly be hot"

That is not the case if you have indexes. Most secondary indexes are high-entropy (if they weren't, you probably wouldn't be storing them, is an emotional proof), so inserts on larger-than-memory data sets almost always incur a disk I/O, in a b-tree, at least.

Shameless plug: I have a talk about how to deal with this: http://www.youtube.com/watch?v=q6BnG74FZMQ

I don't agree that most secondary indexes are high cardinality. Some definitely are, but there are plenty of indexes with low cardinality, simply used for quicker scanning of those relevant values.

First up, in his example we're talking about 10k rows with a size of 35 bytes. At 8050 bytes of available space per page, that gives a fan-out of 8050/35 = 230. 10k/230*8kb = 350kb. At 350kb of inserts (hobt data, he doesn't go into secondary indexes), memory is completely irrelevant - the only force going on here is latency on writing the log to disk.

If we had a huge data set (as in, did not fit in memory, at all) with high cardinality - sure, we'd have a lot of cold leaf level pages. With no further info on his case, I can only assume most of the hobt and secondary indexes will fit in memory. At worst we'll have to read a cold leaf level page into memory to perform the addition in-memory.

As is there's no mention of even a clustered index, causing all of this to be heap inserts which is arguably one of the fastest insert methods there are (barring certain very special cases).

I said most secondary indexes are high entropy.

If you have a large enough data set that you have to do a leaf read per insert (or close to that often), it will kill your throughput.

In fact inserts almost never incur a disk IO, not until the transaction commits. Even then the IO is sequential, in the log file, not random in the mdf file.

The mdf pages will get marked "dirty" at the time of insert, and they will get flushed to disk eventually - either when checkpint time comes, or if memory gets scarce and the dirty pages get evicted.

It might seem that postponing the write until checkpoint/eviction is not meaningful, but it actually serves a purpose - for one, the transaction is able to finish and return confirmation to the user very quickly, and then if the same page is touched twice (or more) between two flushes, all but the first touches are "free".

This argument does not hold up to analysis. You can break it just by making your data set larger.
There are three options when it comes to relative RAM and data sizes:

1. All data fits in RAM 2. Active data fits in RAM, but inactive data does not 3. Active data does not fit in RAM

Case #1 is trivial, though still widely used for e.g. web sites.

Case #2 is where the relational databases have their sweet spot. For example you could have a 200Gb database with all customers and history of orders, but only a subset of those customers are placing orders at any one time, and all the new orders are stored close to each other, so only 8Gb of data is hot, and the rest is accessed occasionally. Assuming you have more than 8Gb of RAM, there will be no IO at the time of the insert, there will be some sequential IO in log file at the end of the transaction, and there will be some "random" IO come at checkpoint time. The latter will not be totally random as SQL Server will reorder all delayed writes in the most sequential manner possible to reduce the seeks (hence there is a good reason to space out checkpoints). The log flush is also sometimes shared between several adjacent transactions if they end up committing at the same time, so under load you can get less than one IO per transactions.

Case #3 is the one you refer to. In this case each btree update will push out some other old dirty page to disk, causing IO elsewhere in the database.

Now what you are saying in this comment is that because there exists case #3, there is no case #2. I think you severely underestimate the importance of the case #2.

I recommend a copy of "Inside SQL Server" by Kalen Delaney to learn more about SQL Server.

case 2 is fine, it just doesn't interest me
I agree with most of your points.

I have to give it to the author though for a genuine attempt at one of the important things when dealing with databases. There is lot of room for performance improvement in application code, before you start blaming the platform for not scaling.

Another thing is that he doesn't need to use a stored procedure, he could use prepared sql statements.

In regards to your last sentence, he could change the isolation mode if he didn't care so much about consistency.

He already is using prepared statements, that happens under the hood when you use parameterized queries. Problem is the lack of type definition (in the first revision), causing a new compilation for each length of the varchar column.
I'm not following... Could you explain a little further? My understanding is that prepared statements will be added to the plan cache, in the same way that stored procs have associated plans for each query on the proc.

One warning about this is that at times you should consider having a new plan generated - especially if the data changes over time.