| HN Mirror

by thanatos_dem 2307 days ago

If this were offered by an actual company (a first-party solution), some features would be expected that make the problem space a lot harder. Here's an "intro to search" article that's a good read, and I'll use it to highlight some of the things that'd be different in a first-party solution: https://medium.com/startup-grind/what-every-software-enginee...

(See the "Theory: the search problem" section)

Size: This only indexes ~500k public repos. A first-party solution would be expected to index every repo, public and private.

Indexing speed: This can take up to a few days to index. A first-party solution would be expected to have much lower indexing latency, on the order of seconds to minutes.

Query language: This can (and does) have its own simple query language. A first-party solution would need to build its support into the current query language without breaking backwards compatibility.

Context-dependence: A first-party solution would be expected to index private repos as well, and now the query context (the logged-in user) becomes another variable in an already multivariate problem space.

Latency: This gets harder with scale, and a first-party solution would likely provide an SLA/SLO around latency.

Access control: Same issue as context-dependence, with private repos being included.
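To make the access control and context-dependence points concrete, here's a minimal sketch of why the logged-in user becomes part of every query. Everything in it (`Repo`, `Hit`, `can_read`, `search`, the `readers` set) is a hypothetical name made up for illustration, and the in-memory list stands in for a real index. It is not how any actual code search service does this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    name: str
    private: bool
    # Hypothetical: names of users allowed to read a private repo.
    readers: frozenset = frozenset()


@dataclass(frozen=True)
class Hit:
    repo: Repo
    path: str
    line: str


def can_read(user, repo):
    """Public repos are readable by anyone; private ones only by listed readers."""
    return (not repo.private) or (user is not None and user in repo.readers)


def search(index, needle, user=None):
    """Naive substring search that drops hits the querying user may not see."""
    return [
        hit
        for hit in index
        if needle in hit.line and can_read(user, hit.repo)
    ]
```

The same query now returns different results depending on who asks, so results can't simply be cached per query string, and the permission check has to stay cheap enough to run on every hit (or be pushed into the index itself).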

There are also unknown but likely considerations around compliance and internationalization, which are quite tricky problems.

Note - I don't mean for this to be critical of the author at all. This is an awesome and useful tool, with a fantastic UX. I just want to make it clear that search at scale is a lot harder than it seems at first glance, especially as the feature requirements increase.

2 comments

fjania 2307 days ago

Engineering manager for code search at GitHub here... this is an excellent summary of many of the concerns we have as we work on code search at GitHub scale!

sdesol 2307 days ago

For GitHub, I would have to imagine that being able to search only public repos with regexp would be good enough. GitHub has many strategies, but the main one is that they want to maintain, if not expand, their open source mind share.

The more reasons you give people to go to GitHub, the better off they will be in the future. So I do agree with you that as a commercial solution, this may not be viable, but for GitHub's public repos, this can turn into a very positive thing.

marceloabsousa 2307 days ago

That might well be true, but scaling this type of service to all public repos with decent latency and update rate is a major technical challenge and likely very costly to maintain.

sdesol 2306 days ago

This is my personal observation, but GitHub appears to be a much more ambitious company, now that they are part of Microsoft. With a CEO that understands both the open source and the enterprise world and with Microsoft cash at hand, I don't think spending money to make search better would cause any concerns.

Doing technical things that GitLab, Bitbucket, etc. can't is quite valuable. It also helps with recruiting, since smart people want to work on difficult problems.

It may well be costly to maintain, but I think the operating cost would be well within the means of an incumbent that wants to maintain and expand their reach. I've been studying the code hosting space for quite some time and GitHub, from an outsider's perspective, appears to be much more focused and ambitious, which should cause serious concerns for GitLab.
