| HN Mirror


by misterHN 3144 days ago

put data in text files, ASCII printable characters, one data point per line

put data files in directory

name data files after columns

use ".data" filename extension for data files

write a tool to create index files (append ".index" to the name of the input text file) that map record number to byte offset in data file

If all data files are smaller than 4 GiB (2^32 bytes), use a 32-bit unsigned integer to represent each byte offset in the index file

Each index file is a packed array of 32 bit integers
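The index tool described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `build_index` is hypothetical, records are taken to be separated by '\n', and offsets are written in the host's byte order, so the resulting file is only portable between machines of the same endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: scan data_path and write one uint32_t byte offset per
 * record (line) to index_path. Returns the record count, or -1 on error. */
long build_index(const char *data_path, const char *index_path)
{
    assert(data_path != NULL && index_path != NULL);
    FILE *in = fopen(data_path, "rb");
    if (!in) return -1;
    FILE *out = fopen(index_path, "wb");
    if (!out) { fclose(in); return -1; }

    long records = 0;
    uint64_t pos = 0;       /* byte offset of the character being read */
    int at_line_start = 1;  /* true when pos begins a new record */
    int ok = 1;
    int c;

    while (ok && (c = getc(in)) != EOF) {
        if (at_line_start) {
            if (pos > UINT32_MAX) {
                ok = 0;     /* offset does not fit in 32 bits */
            } else {
                uint32_t off = (uint32_t)pos;  /* host byte order (assumption) */
                ok = fwrite(&off, sizeof off, 1, out) == 1;
                records++;
            }
            at_line_start = 0;
        }
        if (c == '\n') at_line_start = 1;
        pos++;
    }
    if (ferror(in)) ok = 0;
    fclose(in);
    if (fclose(out) != 0) ok = 0;
    return ok ? records : -1;
}
```

An empty line counts as a record of length zero here, which keeps record numbers aligned with line numbers.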

Write a tool to create length files ".length" that count the number of entries in a data file

Generate .length files for all data files
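A minimal sketch of the length tool, with assumptions flagged: the comment does not say how the count is stored, so this version writes it as a single line of decimal text. The names `count_records` and `write_length_file` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Count records: one per newline, plus a final unterminated line if any.
 * Returns -1 on I/O error. */
long count_records(const char *data_path)
{
    assert(data_path != NULL);
    FILE *in = fopen(data_path, "rb");
    if (!in) return -1;
    long n = 0;
    int c, last = '\n';
    while ((c = getc(in)) != EOF) {
        if (c == '\n') n++;
        last = c;
    }
    int err = ferror(in);
    fclose(in);
    if (err) return -1;
    if (last != '\n') n++;  /* last record had no trailing newline */
    return n;
}

/* Write the record count of data_path to length_path as decimal text
 * (assumed format). Returns the count, or -1 on error. */
long write_length_file(const char *data_path, const char *length_path)
{
    long n = count_records(data_path);
    if (n < 0) return -1;
    FILE *out = fopen(length_path, "w");
    if (!out) return -1;
    int ok = fprintf(out, "%ld\n", n) > 0;
    if (fclose(out) != 0) ok = 0;
    return ok ? n : -1;
}
```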

Use mmap to access index files
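Mapping an index file might look something like the sketch below. It assumes a POSIX system and an index written as a packed array of 32-bit offsets in host byte order; `struct index_map`, `index_open` and `index_close` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle for a mapped .index file. */
struct index_map {
    const uint32_t *offsets; /* offsets[i] = byte offset of record i */
    size_t count;            /* number of records */
    size_t bytes;            /* mapping length, needed for munmap */
};

/* Map index_path read-only. Returns 0 on success, -1 on error. */
int index_open(struct index_map *m, const char *index_path)
{
    assert(m != NULL && index_path != NULL);
    int fd = open(index_path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0 ||
        (size_t)st.st_size % sizeof(uint32_t) != 0) {
        close(fd);  /* empty or truncated index: refuse to map it */
        return -1;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);      /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) return -1;
    m->offsets = p;
    m->bytes = (size_t)st.st_size;
    m->count = m->bytes / sizeof(uint32_t);
    return 0;
}

/* Release the mapping and reset the handle. */
void index_close(struct index_map *m)
{
    munmap((void *)m->offsets, m->bytes);
    m->offsets = NULL;
    m->count = m->bytes = 0;
}
```

With the file mapped, the kernel pages offsets in on demand, so looking up one record touches only the page that holds its offset.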

Use C for all of the above

The .index files exist for variable-length data values. Not every column has variable-length values. When every value in a column has the same length, the byte offset of a record is just the record number times the record length, so the .index file is redundant and should not be created. Program logic should support both uniform value length access and nonuniform value length access. The reason to prefer two access modes is to keep .index data out of the cache when it is redundant.
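The two access modes could be sketched as a single lookup function that only reads the index when one exists. Everything here is an assumption made for illustration: the `struct column` layout, the `column_get` name, and the convention that a NULL `offsets` pointer means the column is uniform with a fixed `stride` in bytes (newline included).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one column already loaded or mapped into memory. */
struct column {
    const char *data;        /* contents of the .data file */
    size_t data_len;         /* size of the .data file in bytes */
    size_t count;            /* record count, as stored in the .length file */
    const uint32_t *offsets; /* mapped .index file, or NULL if uniform */
    size_t stride;           /* bytes per record including '\n' (uniform only) */
};

/* Return a pointer to record i and store its length (without '\n') in *len.
 * Returns NULL if i is out of range or the offsets are inconsistent. */
const char *column_get(const struct column *c, size_t i, size_t *len)
{
    assert(c != NULL && len != NULL);
    if (i >= c->count) return NULL;
    size_t start, end;
    if (c->offsets == NULL) {
        /* Uniform mode: pure arithmetic, no index data is touched. */
        start = i * c->stride;
        end = start + c->stride;
    } else {
        /* Nonuniform mode: the next record's offset bounds this one. */
        start = c->offsets[i];
        end = (i + 1 < c->count) ? c->offsets[i + 1] : c->data_len;
    }
    if (end > c->data_len) end = c->data_len;
    if (start > end) return NULL;
    if (end > start && c->data[end - 1] == '\n') end--;  /* drop the newline */
    *len = end - start;
    return c->data + start;
}
```

In uniform mode the only memory read is the record itself, which is the cache saving the comment is after.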

When all of this is done, the next thing to do is write a tool to test the cache characteristics of your processor by implementing sorting algorithms and measuring their performance. Unless you are using a GPU (why?), all data your algorithm touches will go through every level of the cache hierarchy, forcing other data out. If possible, use a tool that reports hardware diagnostics such as cache miss counts. These tools may be provided by the processor vendor.
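A rough starting point for such a test is sketched below. It substitutes the C library's `qsort` for a hand-written sorting algorithm and measures wall-clock time with POSIX `clock_gettime`; the function names are hypothetical, and the numbers it produces are only as good as the timer and the quietness of the machine.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Fill v with a deterministic pseudo-random sequence (xorshift32). */
void fill_pseudo_random(uint32_t *v, size_t n, uint32_t seed)
{
    uint32_t s = seed ? seed : 1u;  /* xorshift state must be nonzero */
    for (size_t i = 0; i < n; i++) {
        s ^= s << 13;
        s ^= s >> 17;
        s ^= s << 5;
        v[i] = s;
    }
}

/* Sort v in place and return the elapsed wall-clock time in nanoseconds. */
double time_sort_ns(uint32_t *v, size_t n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    qsort(v, n, sizeof *v, cmp_u32);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
           (double)(t1.tv_nsec - t0.tv_nsec);
}

/* One measurement: nanoseconds per element for an n-element array,
 * or -1.0 if the allocation fails. */
double sort_cost_per_element(size_t n)
{
    assert(n > 0);
    uint32_t *v = malloc(n * sizeof *v);
    if (!v) return -1.0;
    fill_pseudo_random(v, n, 12345u);
    double ns = time_sort_ns(v, n);
    free(v);
    return ns / (double)n;
}
```

Calling `sort_cost_per_element` for array sizes from a few KiB up to several hundred MiB and plotting the results should show the per-element cost rising gradually with log n; steps beyond that trend are candidates for the points where the working set outgrows a cache level. Hardware counters are still needed to confirm the cause.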

Now, there is a trend to give the programmer control over cache behavior:

https://stackoverflow.com/questions/9544094/how-to-mark-some...

I don't know if this is worth exploring or a wild goose chase. It may improve performance for some tasks, but it sounds a little strange for the programmer to tell the computer how to use the cache...shouldn't the operating system do this?
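For a sense of what programmer-directed cache hints can look like, here is a small sketch using `__builtin_prefetch`, a GCC and Clang builtin. It is not necessarily the mechanism the linked question discusses. The function name `sum_streaming` and the prefetch distance are assumptions, and the hint is advisory: the processor is free to ignore it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum an array that will be read exactly once, hinting that the data has
 * no temporal locality and need not stay in the cache after use. */
uint64_t sum_streaming(const uint32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    const size_t ahead = 16;  /* prefetch distance in elements: a guess to tune */
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + ahead < n)
            __builtin_prefetch(&v[i + ahead], 0 /* read */, 0 /* no temporal locality */);
#endif
        sum += v[i];
    }
    return sum;
}
```

On other compilers the hint compiles away and the loop is a plain sum. Whether it helps at all is something only measurement on the target processor can answer.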

Anyway, that's a start.

1 comment

posnet 3144 days ago

This sounds almost identical to the datastore honeycomb.io built and described in the talk: https://www.youtube.com/watch?v=tr2KcekX2kk
