|
Here's the solution that I came up with prior to following the link or reading any comments. We're given 1 million integers from 0 to 99,999,999. We only have 1MB of RAM, or an average ~8 bits for each of the million numbers. So we can't store the numbers directly since they take ~27 bits each. First thought was to use a bitset but that would require 100 million bits, and we only have ~8 million bits RAM, so that's not going to work. Also need to deal with duplicates. How about this. Something similar to a selection sort algorithm that stores deltas of distances between sorted numbers. As a number is streamed in, we start scanning from the beginning of the list until it's correct position found, where it is inserted and then push down the remaining numbers. This will be O(n^2). Since the average delta distance between numbers is about 100, we'll use 8-bits to store the delta value. Value 0 means the number is a duplicate of the current number. Values 1-254 mean add this number to the current number for the new number. Value 255 means add 255, then use the next byte as the delta value (repeat until the value != 255). (Case 1)
1 million ints exactly 100 apart:
0, 100, 200, 300, 400, ..., 99999800, 99999900
Stored as a list of 8-bit delta values:
0, 100, 100, 100, 100, ..., 100, 100 (1 million bytes total) (Case 2)
1/2 million zeros, then 1/2 million values of 99999999.
Stored as:
start: 0, 0, 0, 0, ... (1/2 million zero deltas)
then: 255, 255, 255, 255, ... (99999999 / 255 = 392,156 times repeated, which gets us to number 99,999,780)
then: 219, 0, 0, 0, 0, ... (another 1/2 million zero deltas) So the total amount of storage for Case 2, which I presume is worse case (but correct me if I'm wrong!) is:
500,000 + 392,156 + 1 + 500,000 = 1,392,157 bytes to store the delta values. 1MB = 1,048,576 bytes, so I'm over by 343,581 bytes... (so close!) We'll have to modify this scheme so that we reduce the number of 255 values, which should not be hard to do and will get us under the 1MB size limit. Or we could try something fancier like huffman coding to reduce the size of the delta values. |