Hacker News new | ask | show | jobs
by louai 951 days ago
ChatGPT 4

> Use reservoir sampling to get 100 random lines from a dictionary that doesn’t fit into memory. Give me an implementation in Python.

> Reservoir sampling is a great choice for randomly selecting a fixed number of items from a stream of unknown size, especially when the entire dataset cannot fit into memory. Here's a Python implementation for selecting 100 random lines from a large file, like a dictionary, using reservoir sampling:

  ```python
  import random

  def reservoir_sampling(file_name, k):
      reservoir = []
      with open(file_name, 'r') as file:
          for i, line in enumerate(file):
              if i < k:
                  # Fill the reservoir array
                  reservoir.append(line.strip())
              else:
                  # Randomly replace elements in the reservoir
                  # with a decreasing probability.
                  # Choose an integer between 0 and i (inclusive)
                  j = random.randint(0, i)
                  if j < k:
                      reservoir[j] = line.strip()
      return reservoir

  # Example usage
  file_path = 'path_to_your_large_file.txt'  # Replace with your file path
  sampled_lines = reservoir_sampling(file_path, 100)
  print(sampled_lines)
  ```
> This script opens a file and reads through it line by line. For the first 100 lines, it adds them directly to the reservoir. For each subsequent line, it randomly decides whether to replace an existing line in the reservoir with the new line. The probability of replacing decreases as it progresses through the file, ensuring each line has an equal chance of being included in the final sample.