Hacker News
by montanalow 4345 days ago
About 2 hours. The regex is I/O bound, running over a text column with 87GB + 68GB of Postgres TOAST. It's on a new 1TB, 3k IOPS EC2/EBS SSD volume, which seems able to sustain about 20MB/s.

20MB/sec is pretty dreadful, and 2 hours is not timely for that quantity of data.

I'm guessing you're paying a high price for the convenience of using a database? I'd run the kind of query you did with grep on the command line, directly over the source, possibly combined with a summarizing program written in Ruby.

I love the power of the unix shell. find | cat | grep would get the job done just as well if you had all the source in an accessible file tree, but I don't think you'd see any performance increase as the bottleneck is still random reads from EBS.

The single 3k iops EBS volume being used delivers a max theoretical speed of 24MB/s with 8k pages. I'm fine living with 20MB/s in practice.
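
For anyone who wants to check those figures, here is a quick back-of-the-envelope sketch. It assumes one 8 KB page per I/O operation, decimal units (1 GB = 1000 MB), and that the scan has to read the full 87GB + 68GB; the function names are made up for illustration.

```python
# Rough sanity check of the numbers quoted in this thread.
# Assumptions: one 8 KB page per I/O, decimal units, full scan of all data.
IOPS = 3_000          # provisioned IOPS on the EBS volume
PAGE_KB = 8           # Postgres page size in KB
DATA_GB = 87 + 68     # the two sizes quoted above, in GB
OBSERVED_MB_S = 20    # sustained throughput reported above


def max_throughput_mb_s(iops, page_kb):
    """Theoretical ceiling in MB/s if every I/O fetches one page."""
    return iops * page_kb / 1000


def scan_hours(data_gb, mb_per_s):
    """Hours needed to read data_gb once at mb_per_s."""
    return data_gb * 1000 / mb_per_s / 3600


ceiling = max_throughput_mb_s(IOPS, PAGE_KB)
print(ceiling)                                        # 24.0
print(round(scan_hours(DATA_GB, OBSERVED_MB_S), 2))   # 2.15
print(round(scan_hours(DATA_GB, ceiling), 2))         # 1.79
```

So the "about 2 hours" figure lines up with reading everything once at 20MB/s, and even hitting the 24MB/s ceiling would only bring it down to roughly 1.8 hours.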

In fact, Postgres does inline (de)compression and optimizes for sequential reads, so it's likely the shell would be slower for this workload, given the apples-to-oranges characteristics. I'd love to see any performance tests making this sort of comparison; they're always educational.

Even with a database it's dreadful. At 20MB/sec, they'd have to put a very low value on the time they spend waiting for it not to be cheaper and faster to buy a small server outright and put a couple of SSDs in it, at least if they do this kind of analysis more than a couple of times.

Or even load it up with enough memory to keep everything in RAM during normal operations. I can't remember the last time I worked on a system that did less than a couple of hundred MB/sec... And we generally buy servers in the $3k-$6k range, so nothing ridiculous.

Probably even faster using LC_ALL=C and parallel grep [0]

[0] http://www.gnu.org/software/parallel/man.html#example__paral...
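
If GNU parallel isn't handy, the same idea can be roughed out in Python. To be clear, this is a stand-in sketch, not grep or GNU parallel: reading files in binary mode and matching a bytes pattern plays roughly the role of LC_ALL=C (no locale-aware decoding), and a thread pool fans the files out. The names `count_matches` and `parallel_count` and the default of 4 workers are made up for illustration. Threads are used on the assumption, stated upthread, that the scan is I/O bound; a CPU-bound regex would want processes instead.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_matches(path, pattern):
    """Count lines in one file that match a bytes regex (no text decoding)."""
    regex = re.compile(pattern)
    with open(path, "rb") as handle:
        return sum(1 for line in handle if regex.search(line))


def parallel_count(root, pattern, workers=4):
    """Sum matching-line counts over every regular file under root."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda p: count_matches(p, pattern), files))
```

Calling `parallel_count("/path/to/source", rb"some_regex")` returns the total number of matching lines under that tree. Whether it beats the database still depends on how fast the underlying volume can feed it.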