I built a prototype GPU-based vector search system that runs locally on a consumer PC.

Hardware: RTX 3090, consumer CPU, NVMe SSD

Dataset: ~70 million vectors (384 dimensions)

Performance: ~48 ms search latency for top-k results. This corresponds to ~1.45 billion vector comparisons per second on a single GPU.

The system uses a custom GPU kernel and a two-stage search pipeline (binary filtering + floating-point reranking).

My goal was to explore whether large-scale vector search could run efficiently on consumer hardware instead of large datacenter clusters. After thousands of hours of work and many failed attempts, the results finally became stable enough to benchmark.

I'm currently exploring how far this approach can scale. I'd be very interested to hear how others approach large-scale vector search on consumer hardware. Happy to answer questions.
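
For readers unfamiliar with the two-stage idea (binary filtering followed by floating-point reranking), here is a minimal CPU-only sketch in plain Python. It is not the GPU kernel described in the post, whose details are not given. It assumes sign-based binarization (one bit per dimension), Hamming distance for the coarse stage, and dot product for the rerank. The names `binarize`, `hamming`, `two_stage_search`, and the `n_candidates` parameter are hypothetical and chosen only for illustration.

```python
import random

def binarize(vec):
    """Pack the sign of each component into one int (assumed: 1 bit per dimension)."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Count the bits that differ between two binary codes."""
    return bin(a ^ b).count("1")

def dot(u, v):
    """Full-precision similarity used for reranking (assumed: inner product)."""
    return sum(x * y for x, y in zip(u, v))

def two_stage_search(query, vectors, codes, k=5, n_candidates=50):
    # Stage 1: coarse filter. Rank every vector by Hamming distance
    # between its binary code and the query's binary code.
    q_code = binarize(query)
    order = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))
    candidates = order[:n_candidates]
    # Stage 2: rerank only the surviving candidates with float dot products.
    candidates.sort(key=lambda i: dot(query, vectors[i]), reverse=True)
    return candidates[:k]

# Small synthetic demo: the query is a noisy copy of vector 123.
rng = random.Random(0)
dim = 384
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2000)]
codes = [binarize(v) for v in vectors]
query = [x + 0.1 * rng.gauss(0, 1) for x in vectors[123]]
print(two_stage_search(query, vectors, codes))
```

The coarse stage touches every vector but only reads 384 bits per vector, while the expensive float comparison runs on just `n_candidates` survivors. Raising `n_candidates` trades speed for recall.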
I've been iterating on the approach and managed to push the coarse search further.
Currently seeing ~100M vectors scanned in ~10 ms on a single RTX 3090 (binary stage only).
Still experimenting with trade-offs between speed and recall, but it's interesting how far this can go on consumer hardware.
Curious what kind of numbers others are seeing for large-scale vector search on GPUs.
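
As a rough sanity check on the figures quoted in this thread, the arithmetic can be written out as below. This is a back-of-the-envelope sketch, not a measurement. It assumes the 100M-vector run also uses 384 dimensions and that the binary stage stores one bit per dimension (48 bytes per vector); neither is stated for that run. The function names are hypothetical.

```python
def comparisons_per_second(n_vectors, latency_s):
    """Vectors compared per second for one query scanning the whole set."""
    return n_vectors / latency_s

def scan_bandwidth_gb_s(n_vectors, dim, latency_s, bits_per_dim=1):
    """Memory traffic implied by reading every binary code once (GB/s)."""
    bytes_scanned = n_vectors * dim * bits_per_dim / 8
    return bytes_scanned / latency_s / 1e9

# Full pipeline: ~70M vectors in ~48 ms.
full = comparisons_per_second(70e6, 0.048)

# Binary stage only: ~100M vectors in ~10 ms.
coarse = comparisons_per_second(100e6, 0.010)

# Assumed 384 bits per vector for the binary codes.
bandwidth = scan_bandwidth_gb_s(100e6, 384, 0.010)

print(f"full pipeline: {full / 1e9:.2f} billion comparisons/s")
print(f"binary stage:  {coarse / 1e9:.2f} billion comparisons/s")
print(f"implied scan bandwidth: {bandwidth:.0f} GB/s")
```

Under those assumptions the binary stage implies about 480 GB/s of memory reads, which is roughly half of the ~936 GB/s memory bandwidth NVIDIA lists for the RTX 3090. That would suggest the coarse scan is limited mainly by memory bandwidth rather than compute.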