Hacker News new | ask | show | jobs
by ggruschow 5414 days ago
Some things to think about:

  - 50mb is >> total cache on most systems you'll use.
  - This is far from the only function you need to optimize.
  --> The best performing algorithm in a micro-benchmark can be sub-optimal in real use.

  - You've got a fixed-for-the-day list of <10k <9 letter unique upper-case strings
  - The bulk of them you'll process are <= 4 letters (CMCSA/CMCSK be damned).
  - The median is 3 and the mean is <3.6.
  - The symbols aren't anywhere close to uniformly active
  --> Benchmark with a typical data stream, not a uniformly distributed one.

  - You're likely not even considering trading most of them.

  - How much better is your algorithm than the obvious binary-search?
  - Or worse -- plain'ol front-to-back linear search?
  - Are you sure you understand 100% of how the computer works?
    --> Yes? Mail a check to Goldman and find a better line of work.
    --> No? Test it with a quick benchmark.

  - Is binary search cache friendly?
  - Why not? Plot out the memory accesses if need be.

  - What's wrong with taking a linear-interpolated guess?
  - Is that cache friendly? .. or could it be?
  - When your guess misses, what's that cost?
  - Can you think of a cheap modification to lower that cost?
  - Is the distribution really linear?

  - What's wrong with a "perfect hash function"?
  - Does it need to be perfect?
Incidentally, hardware wins at this function. You can just generate logic to do all the if symbol == "XYZ": return 3's in parallel.
2 comments

This is a crazy smart comment. Thanks.

... is anyone really using hardware to do this, by the way?

Hardware regexing at this scale is practically a COTS part, or was when I stopped doing network security products in '05.

Unfortunately it's not that simple anymore. You actually have to do most of the order-book-building legwork to figure out if the quote is relevant: every order modification message, such as canceling an order or execution, omits the ticker. Makes sense: after all, the add-quote actually tells you the ticker, so there's no need to repeat that info.

This is where it gets tricky. FPGA based systems don't really give an edge here, and tilera-based solutions are still nascent

This craziness is what happens when you try to implement a specialized algorithm on a general-purpose state-machine. What stupid thing is a "cache" anyway, but an optimization because we are not using real computers but hugely complex state-machines called CPUs. FPGAs are as close as a real computer as we can have today.
The parallels to high end network security are again interesting here; for instance, on the more interesting network processors, memory accesses were asynchronous, and instead of a "cache" you had a manually addressed "scratch" memory with latency comparable to registers.

... because "general purpose cache" wasn't helpful, but "exploitable locality" definitely was.

That's pretty close to how the first NVIDIA GPGPU (G80) worked. There was no cache, but only different kinds of memory, of which one was as fast as registers. Optimal access to RAM memory was possible with a limited set of access patterns, so "caching" had to be managed carefully manually.

In later GPUs they (mainly due to programmer laziness :-) ) switched to a more GPU-like automatically managed general purpose cache. The problem was that in most cases, being able to quickly write code that is reasonably fast was more important to developers than being able to squish out every cycle (though I believe it's still possible to disable the cache and manage it manually...)

The problem with automatic cache is the same as all "intelligent" CPU features, the CPU tries to predict what the program is doing, the programmer has to predict what the CPU will predict to optimize for that CPU. In the next generation of the CPU, the CPU vendor will try to optimize again for programs currently around. In the end, optimizing becomes much more complex.

There's certainly an advantage to keeping the logic "dumb" and simply having the software manage everything.

veyron points this out: http://www.fixnetix.com/articles/display/129/

Personally, I think specialized hardware is a bad idea.

- The bulk of them you'll process are <= 4 letters (CMCSA/CMCSK be damned).

With 5 bits per letter, this means the bulk of stock tickers could be stored in 20 bits, so the data structure could just be one big 128 MB array, if the size of the structure was rounded up from 109 to 128 bytes.