• GitHub's previous code search was slow, limited, and did not support searching forks due to indexing challenges. A new system called Blackbird was built from scratch to address these issues.
• Indexing code poses unique challenges compared to natural language documents, such as handling file changes in version control systems and deduplicating shared code across repositories.
• The talk discussed techniques used in Blackbird like trigram tokenization, delta compression, caching, and dynamic shard assignment to improve indexing speed and efficiency at scale.
• Architectural decisions like separating indexing from querying and using message queues helped Blackbird scale independently without competing for resources.
• Data structures like geometric XOR filters were developed to efficiently estimate differences between codebases and enable features like delta compression.
• Iteration speed was improved by making the system easy to change: instead of migrating existing data, the index version is incremented whenever the format changes.
• Resource usage was optimized through techniques such as document deduplication, caching, and compaction to reduce indexing costs.
• Blackbird's design allowed it to efficiently support over 100 million code repositories, whereas the previous system struggled at a scale of millions.
• Building custom solutions from scratch can be worthwhile when the structure of a domain's data can be exploited to outperform generic tools.
• Anticipating and addressing scaling challenges at each order of magnitude is important to ensure a system remains performant as it grows over time.
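
To make the trigram tokenization and document deduplication points above more concrete, here is a minimal Python sketch of a trigram inverted index that stores each distinct file content only once. It is an illustration under stated assumptions, not Blackbird's actual implementation: the class name TrigramIndex, its methods, and the use of a SHA-1 hash of the raw content as the document key are all hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """Return the set of overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy in-memory index (hypothetical; names are not from the talk)."""

    def __init__(self) -> None:
        # trigram -> ids of the distinct contents that contain it
        self.postings: dict[str, set[str]] = defaultdict(set)
        # content id -> content, stored once no matter how many files share it
        self.blobs: dict[str, str] = {}
        # content id -> every (repo, path) where that content appears
        self.paths: dict[str, set[tuple[str, str]]] = defaultdict(set)

    def add(self, repo: str, path: str, content: str) -> bool:
        """Record a file; return True only if its content had to be indexed."""
        # Assumption: a hash of the content serves as the deduplication key.
        blob_id = hashlib.sha1(content.encode("utf-8")).hexdigest()
        self.paths[blob_id].add((repo, path))
        if blob_id in self.blobs:
            return False  # identical content (e.g. from a fork): skip indexing
        self.blobs[blob_id] = content
        for gram in trigrams(content):
            self.postings[gram].add(blob_id)
        return True

    def search(self, query: str) -> list[tuple[str, str]]:
        """Return sorted (repo, path) pairs whose content contains query."""
        grams = trigrams(query)
        if grams:
            # Candidates must contain every trigram of the query.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        else:
            # Queries shorter than three characters have no trigrams,
            # so this sketch falls back to checking every stored content.
            candidates = set(self.blobs)
        # Trigram matches are only candidates; confirm with a substring check.
        hits = [
            loc
            for blob_id in candidates
            if query in self.blobs[blob_id]
            for loc in self.paths[blob_id]
        ]
        return sorted(hits)


if __name__ == "__main__":
    index = TrigramIndex()
    source = "def hello():\n    return 1\n"
    index.add("octo/app", "main.py", source)
    index.add("fork/app", "main.py", source)  # same content, not re-indexed
    print(index.search("hello"))
```

In this sketch the forked file adds only a path entry, which is the basic idea behind saving indexing work on shared code. Techniques such as delta compression, sharding, and compaction mentioned above are outside its scope.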