| Not sure if it can, but Kagi can: • GitHub's previous code search was slow, limited, and did not support searching forks due to indexing challenges. A new system called Blackbird was built from scratch to address these issues. • Indexing code poses unique challenges compared to natural language documents, such as handling file changes in version control systems and deduplicating shared code across repositories. • The talk discussed techniques used in Blackbird like trigram tokenization, delta compression, caching, and dynamic shard assignment to improve indexing speed and efficiency at scale. • Architectural decisions like separating indexing from querying and using message queues let the two sides of Blackbird scale independently without competing for resources. • Data structures like geometric XOR filters were developed to efficiently estimate differences between codebases and enable features like delta compression. • Iteration speed was improved by making the system easier to change: the index version can be incremented frequently without migrations. • Resource usage was optimized through techniques such as document deduplication, caching, and compaction to reduce indexing costs. • Blackbird's design allowed it to efficiently support over 100 million code repositories, while the previous system struggled with millions. • Building custom solutions from scratch can be worthwhile when the structure of the data can be exploited to outperform generic tools in a given domain. • Anticipating and addressing scaling challenges at each order of magnitude is important to ensure a system remains performant as it grows over time. |
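For anyone wondering what "trigram tokenization" plus "document deduplication" looks like in practice, here is a minimal sketch. It is not Blackbird's code: the `TrigramIndex` class, its `add` and `search` methods, and the use of a SHA-1 content hash as the dedup key are all assumptions made for illustration. The idea is to index each distinct file content once, keyed by its hash, and to answer a substring query by intersecting the posting lists of the query's trigrams and then verifying the candidates.

```python
import hashlib
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy inverted index (hypothetical, for illustration only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> {content hash}
        self.blobs = {}                   # content hash -> file content
        self.paths = defaultdict(set)     # content hash -> {(repo, path)}

    def add(self, repo, path, content):
        """Index a file; return False if identical content was already indexed."""
        # Assumption: content is identified by a SHA-1 of its bytes.
        key = hashlib.sha1(content.encode("utf-8")).hexdigest()
        self.paths[key].add((repo, path))
        if key in self.blobs:
            # Same content seen before (e.g. in a fork): only record the path.
            return False
        self.blobs[key] = content
        for gram in trigrams(content):
            self.postings[gram].add(key)
        return True

    def search(self, query):
        """Return sorted (repo, path) pairs whose content contains query."""
        grams = trigrams(query)
        if not grams:
            raise ValueError("query must be at least 3 characters")
        # Every trigram of the query must occur in a matching document.
        candidates = set.intersection(
            *(self.postings.get(g, set()) for g in grams)
        )
        hits = set()
        for key in candidates:
            # Trigram overlap is necessary but not sufficient, so verify.
            if query in self.blobs[key]:
                hits |= self.paths[key]
        return sorted(hits)
```

Adding the same file from a repository and from its fork stores and tokenizes the content once, while `search` still reports both paths. A real system would also need the delta compression, sharding, and compaction mentioned in the summary, none of which is attempted here.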