| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by andrewcooke 4993 days ago
	uses reservoir sampling -http://en.wikipedia.org/wiki/Reservoir_sampling (so it presumably consumes the entire stream before giving any results; any alternative i can think of would not be "really random" unless you knew the length of the stream in advance).

3 comments

snoble 4993 days ago

the magic of reservoir sampling is the memory footprint is the size of the output while being completely random. In this particular implementation the order of the output is slightly biased but each row has an equal chance of being in the output.

link

snoble 4993 days ago

but yes. it has to wait until the stream is done before producing any output.

link

avibryant 4993 days ago

Yep, though a feature request I've put in is to respond to a ctl-c by producing the results from the stream so far... that way if it's taking a while on a large file you can interrupt and still get something useful.

link

cgs1019 4993 days ago

This breaks ctl-c in my opinion. When I ctl-c I want shit to stop, not dump (potentially large quantities of) output into my terminal.

link

wodow 4993 days ago

It could accept another signal (e.g. SIGUSR1) and have this clearly documented.

link

neilk 4993 days ago

I just added the SIGUSR1 feature to my hacky perl script (see my other comment).

e.g.

    $ (while true; do cat /usr/share/dict/words; done;) | ./randline 3 &
    [2] 93937
    
    $ kill -s SIGUSR1 93937
    declinograph
    brotheler
    woolpack

    $ kill -s SIGUSR1 93937
    lustrify
    brotheler
    bromophenol

link

cgs1019 4993 days ago

Maybe I'm overlooking something obvious but wouldn't piping through awk 'rand()<.1{print}' give a decent streaming random sample of roughly 10% of input?

link

minimax 4993 days ago

Reservoir sampling allows you to return a specific number of samples regardless of the size of the input rather than a specific fraction of samples (which is what your awk script does).

link

philsnow 4993 days ago

On the other hand, the awk script handles streaming quite nicely; If you have a few hundred machines and you want to just get a sense of log messages on all of them (in real time; say you're about to change something and want to just eyeball whether the mix of error messages changes), you could do something very quick and dirty with that awk script.

link

taltman1 4992 days ago

As minimax stated, your awk code won't provide exactly k samples. Here's a bit of awk code that implements reservoir sampling (apologies in advance for any bugs) and prints the current sample to stdout, without needing to first process the entire stream. It simply prints the set of samples to stdout whenever the sample updates, with sets of samples separated by a distinct string. It is called as follows (of course, replace dmesg with any stream of your choosing):

  dmesg | gawk -f reservoir-sample.awk k=5 record_separator='==='


  #!/usr/bin/gawk -f

  ### reservoir-sample.awk
  ###
  ### Sample k random lines from a stream, without knowing the size of the stream.
  ###
  ### (Tomer Altman)

  ### Parameters: (set from command-line)
  ##
  ## k: number of lines to sample
  ## record_separator: string used to separate iterations of the sample in stdout.

  ## Define a function for returning a number between 1 and n:
  function random_int (n) { return 1 + int(rand() * n) }

  ## Initialize the PRNG seed:
  BEGIN { srand() }

  ## For the first k lines, initialize the sample array:

  NR <= k { sample[NR] = $0 }

  ## If we've initialized the sample array, and we pick an integer between one and the current record number that is less than or equal to k, update the sample array and print it to stdout:
  NR > k && (current_random_int = random_int(NR)) <= k {

    sample[current_random_int] = $0

    for (i in sample) print sample[i]
    
    print (record_separator != "") ? record_separator : "--"
  }

link

taltman1 4990 days ago

In case anyone is interested, I've extended this code and put it up on GitHub as a script called 'samp':

https://github.com/taltman/scripts

link