- Put the data in text files: printable ASCII characters, one data point per line.
- Put the data files in a directory.
- Name each data file after its column, and use the ".data" filename extension.
- Write a tool to create index files (append ".index" to the name of the input text file) that map record number to byte offset in the data file.
- If the data files are all < 4GB, use a 32-bit unsigned integer to represent each byte offset in the index file. Each index file is then a packed array of 32-bit integers.
- Write a tool to create length files (".length") that hold the number of entries in a data file, and generate .length files for all data files.
- Use mmap to access the index files.
- Use C for all of the above.

The index files are for variable-length data values. Not every column will have these: when every value in a column has the same length, the byte offset can be computed directly from the record number, so an .index file would be redundant. Don't create .index files in that case, and make the program logic support both uniform-length and nonuniform-length access. The reason to prefer two access modes is to keep data from the .index files out of the cache when it is redundant.

When all of this is done, the next thing to do is write a tool to test the cache characteristics of your processor by implementing sorting algorithms and measuring their performance. Unless you are using a GPU (why?), all data your algorithm touches will go through every level of the cache hierarchy, forcing other data out. If possible, use a tool that reports hardware diagnostics. These tools may be provided by the processor vendor.

Now, there is a trend toward giving the programmer control over cache behavior: https://stackoverflow.com/questions/9544094/how-to-mark-some... I don't know if this is worth exploring or a wild goose chase. It may improve performance for some tasks, but it sounds a little strange for the programmer to tell the computer how to use the cache... shouldn't the operating system do this?

Anyway, that's a start.
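
For what it's worth, here is a rough C sketch of the index, length and mmap steps described above. The names (build_index, write_length, index_open, offset_indexed, offset_uniform, index_map) are made up for illustration. It also assumes POSIX, newline-terminated records, native-endian offsets, and a .length file holding the count as decimal text; the plan above does not pin any of that down, so treat this as one possible starting point rather than the way to do it.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Build "<path><suffix>" in buf; returns -1 if it does not fit. */
static int suffixed(char *buf, size_t n, const char *path, const char *suffix)
{
    int len = snprintf(buf, n, "%s%s", path, suffix);
    return (len < 0 || (size_t)len >= n) ? -1 : 0;
}

/* Write "<data_path>.index": one uint32_t byte offset per record (line).
   Returns the record count, or -1 on error or if an offset exceeds 32 bits. */
long build_index(const char *data_path)
{
    char index_path[4096];
    if (suffixed(index_path, sizeof index_path, data_path, ".index") != 0) return -1;
    FILE *in = fopen(data_path, "rb");
    if (!in) return -1;
    FILE *out = fopen(index_path, "wb");
    if (!out) { fclose(in); return -1; }

    long count = 0;
    uint64_t pos = 0;
    int c, at_line_start = 1, ok = 1;
    while (ok && (c = getc(in)) != EOF) {
        if (at_line_start) {            /* first byte of a new record */
            uint32_t off = (uint32_t)pos;
            ok = pos <= UINT32_MAX && fwrite(&off, sizeof off, 1, out) == 1;
            count++;
        }
        at_line_start = (c == '\n');
        pos++;
    }
    if (ferror(in)) ok = 0;
    fclose(in);
    if (fclose(out) != 0) ok = 0;
    return ok ? count : -1;
}

/* Write "<data_path>.length" (assumed format: decimal text). */
int write_length(const char *data_path, long count)
{
    char path[4096];
    if (suffixed(path, sizeof path, data_path, ".length") != 0) return -1;
    FILE *out = fopen(path, "w");
    if (!out) return -1;
    int ok = fprintf(out, "%ld\n", count) > 0;
    if (fclose(out) != 0) ok = 0;
    return ok ? 0 : -1;
}

typedef struct { const uint32_t *offsets; size_t count; size_t bytes; } index_map;

/* mmap an index file read-only. Fails on an empty index (nothing to map). */
int index_open(const char *index_path, index_map *m)
{
    int fd = open(index_path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return -1; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping outlives the descriptor */
    if (p == MAP_FAILED) return -1;
    m->offsets = p;
    m->bytes = (size_t)st.st_size;
    m->count = m->bytes / sizeof(uint32_t);
    return 0;
}

void index_close(index_map *m) { munmap((void *)m->offsets, m->bytes); }

/* Nonuniform mode: look the offset up in the mmap'd index. */
uint32_t offset_indexed(const index_map *m, size_t recno)
{
    assert(recno < m->count);
    return m->offsets[recno];
}

/* Uniform mode: every record is `width` bytes plus '\n', so no index is read. */
uint64_t offset_uniform(size_t recno, size_t width)
{
    return (uint64_t)recno * (width + 1);
}
```

A caller would run build_index and write_length once per variable-length column, then use index_open and offset_indexed at query time. For fixed-width columns it would skip the index entirely and call offset_uniform, which is the point of having two access modes: the index pages never get touched, so they never compete for cache.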