|
|
|
|
|
by louai
951 days ago
|
|
ChatGPT 4 > Use reservoir sampling to get 100 random lines from a dictionary that doesn’t fit into memory. Give me an implementation in Python. > Reservoir sampling is a great choice for randomly selecting a fixed number of items from a stream of unknown size, especially when the entire dataset cannot fit into memory. Here's a Python implementation for selecting 100 random lines from a large file, like a dictionary, using reservoir sampling: ```python
import random
def reservoir_sampling(file_name, k):
reservoir = []
with open(file_name, 'r') as file:
for i, line in enumerate(file):
if i < k:
# Fill the reservoir array
reservoir.append(line.strip())
else:
# Randomly replace elements in the reservoir
# with a decreasing probability.
# Choose an integer between 0 and i (inclusive)
j = random.randint(0, i)
if j < k:
reservoir[j] = line.strip()
return reservoir
# Example usage
file_path = 'path_to_your_large_file.txt' # Replace with your file path
sampled_lines = reservoir_sampling(file_path, 100)
print(sampled_lines)
```
> This script opens a file and reads through it line by line. For the first 100 lines, it adds them directly to the reservoir. For each subsequent line, it randomly decides whether to replace an existing line in the reservoir with the new line. The probability of replacing decreases as it progresses through the file, ensuring each line has an equal chance of being included in the final sample. |
|