| HN Mirror

	by firesteelrain 173 days ago
	Funding could help, but it still requires PyPI/Warehouse to ship and operate a new public search interface that is safe at internet scale.

3 comments

coldtea 173 days ago

They operate a public package hosting interface, how is a search one any harder?

miketheman 173 days ago

PyPI responses are served from cache at a hit rate of 99% or higher, so there is less infrastructure to run.

Search is unbounded and does not lend itself to caching very well, as every query can contain anything.

bastawhiz 173 days ago

PyPI has fewer than one million projects. The searchable content for each package is what? 300 bytes? That's a 200 MB index. You don't even need fancy full-text search; you could literally split the query by word and do a grep over a text file. No need for Elasticsearch or anything fancy.
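
A minimal sketch of that grep-style approach, assuming a hypothetical flat index with one `name<TAB>summary` line per project (the sample entries are made up for illustration, not taken from PyPI's real data):

```python
# Hypothetical flat index: one "name<TAB>summary" line per project.
# The sample data below is made up for illustration.
INDEX_TEXT = """\
requests\tPython HTTP for Humans.
httpx\tThe next generation HTTP client.
numpy\tFundamental package for array computing in Python.
torch\tTensors and dynamic neural networks in Python.
"""


def search(index_text: str, query: str) -> list[str]:
    """Return project names whose line contains every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = []
    for line in index_text.splitlines():
        haystack = line.lower()
        # Plain substring match per word, i.e. a grep for each query term.
        if all(word in haystack for word in words):
            hits.append(line.split("\t", 1)[0])
    return hits


print(search(INDEX_TEXT, "http client"))
print(search(INDEX_TEXT, "python array"))
```

This does no ranking or stemming; it only shows that a linear scan over a small text index is enough to answer multi-word queries.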

And anyway, hit rates are going to be pretty good. You're not taking arbitrary queries, the domain is pretty narrow. Half the queries are going to be for requests, pytorch, numpy, httpx, and the other usual suspects.
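
To illustrate the hit-rate argument, here is a rough sketch that normalizes queries before caching them. The traffic sample and the `cached_search` stand-in are assumptions for illustration; real query distributions would need to be measured.

```python
from functools import lru_cache

# Made-up traffic sample in which a few popular names dominate.
QUERIES = ["requests", "numpy", "Requests", "pytorch", "numpy ", "httpx",
           "requests", "some-obscure-package", "numpy", "requests"]


@lru_cache(maxsize=1024)
def cached_search(normalized_query: str) -> tuple[str, ...]:
    # Stand-in for the real (expensive) search backend lookup.
    return (normalized_query,)


def normalize(query: str) -> str:
    """Collapse case and whitespace so equivalent queries share a cache key."""
    return " ".join(query.lower().split())


for q in QUERIES:
    cached_search(normalize(q))

info = cached_search.cache_info()
print(f"hits={info.hits} misses={info.misses}")
```

The narrower the query domain, the more requests collapse onto the same normalized keys, which is the effect the comment is betting on.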

froh 173 days ago

I wonder how a PyPI search index could be statically served and locally evaluated by `pip search`?

firesteelrain 173 days ago

PyPI servers would have to constantly rebuild a central index and make it available for download. That seems inefficient.

ptx 172 days ago

Debian is somehow able to manage it for apt.

froh 172 days ago

That depends on whether it can be downloaded incrementally.
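
One way incremental download could work, sketched under loud assumptions: the server publishes small delta files keyed by a monotonically increasing serial, and the client applies only the ones it has not seen. The `Delta` format and `apply_deltas` helper are hypothetical and not an existing PyPI or pip feature.

```python
from typing import Optional

# Hypothetical delta record: (serial, project name, summary).
# A summary of None means the project was removed.
Delta = tuple[int, str, Optional[str]]


def apply_deltas(index: dict[str, str], last_serial: int,
                 deltas: list[Delta]) -> int:
    """Apply unseen deltas in serial order and return the new last serial."""
    for serial, name, summary in sorted(deltas, key=lambda d: d[0]):
        if serial <= last_serial:
            continue  # already applied during an earlier sync
        if summary is None:
            index.pop(name, None)
        else:
            index[name] = summary
        last_serial = serial
    return last_serial


local_index = {"requests": "Python HTTP for Humans."}
serial = apply_deltas(local_index, 100, [
    (99, "requests", "stale entry, already applied"),
    (101, "httpx", "The next generation HTTP client."),
    (102, "requests", None),
])
print(serial, sorted(local_index))
```

With a scheme like this, a client that syncs regularly fetches only the changes since its last serial instead of the whole index.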

woodruffw 173 days ago

The searchable context for a distribution on PyPI is unbounded in the general case, assuming the goal is to allow search over READMEs, distribution metadata, etc.

(Which isn’t to say I disagree with you about scale not being the main issue, just to offer some nuance. Another piece of nuance is the fact that distributions are the source of metadata but users think in terms of projects/releases.)

bastawhiz 172 days ago

> assuming the goal is to allow search over READMEs, distribution metadata, etc.

Why would you build a dedicated tool for this instead of just using a search engine? If I'm looking for a specific keyword in some project's very long README, I'm searching Kagi, not npm.

I'd expect that the most you should be indexing is the data in the project metadata (setup.py). That could be unbounded, but I can't think of a compelling reason not to truncate it at a reasonable length.

woodruffw 172 days ago

You would definitely use a search engine. I was just responding to a specific design constraint.

(Note, however, that PyPI can’t index metadata from a `setup.py`, since that would involve running arbitrary code. PyPI needs to be given structured metadata, and not all distributions provide that.)
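
For distributions that do ship structured metadata, it lives in an email-header-style file (`METADATA` in wheels, `PKG-INFO` in sdists) that can be read without executing anything. A small sketch using only the standard library follows; the sample file contents are made up, and this is not a claim about how PyPI itself extracts fields.

```python
from email.parser import HeaderParser

# Made-up example of a core metadata file (the METADATA / PKG-INFO format).
METADATA = """\
Metadata-Version: 2.1
Name: example-package
Version: 1.0.0
Summary: A hypothetical package used for illustration.
Keywords: example,search

Long description body goes here.
"""

# Parsing is plain text processing: no code from the package is run.
msg = HeaderParser().parsestr(METADATA)
searchable = {field: msg[field] for field in ("Name", "Summary", "Keywords")}
print(searchable)
```

By contrast, the same fields in a `setup.py` are only known after the script runs, which is the arbitrary-code problem described above.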

coldtea 172 days ago

> The searchable context for a distribution on PyPI is unbounded in the general case, assuming the goal is to allow search over READMEs, distribution metadata, etc.

Even including those, it's what? Under 20-30 GB.

Kwpolska 172 days ago

How does the big white search box at https://pypi.org/ work? Why couldn’t the same technology be used to power the CLI? If there’s an issue with abuse, I don’t think many people would mind rate limiting or mandatory authentication before search can be used.
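
As a sketch of what per-client rate limiting could look like, here is a basic token bucket. The class name, parameters, and limits are illustrative assumptions, not anything PyPI is known to run.

```python
import time
from typing import Optional


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, capacity: float, rate: float,
                 now: Optional[float] = None):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Consume one token if available; `now` is injectable for testing."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical limits: burst of 2 searches, then 1 search per second.
bucket = TokenBucket(capacity=2, rate=1.0, now=0.0)
results = [bucket.allow(now=0.0), bucket.allow(now=0.0),
           bucket.allow(now=0.0), bucket.allow(now=1.0)]
print(results)
```

In practice one bucket would be kept per API token or client address, which is where mandatory authentication would help.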

firesteelrain 172 days ago

The PyPI website search is implemented using a real search backend (historically Elasticsearch/OpenSearch-style infrastructure) sitting behind PyPI’s application logic. Queries are tokenized, ranked, filtered, logged, and throttled. That works fine for humans interacting through a browser.

The moment you expose that same service to a ubiquitous CLI like pip, the workload changes qualitatively.

PyPI has the /simple endpoint that the CDN can handle.

It’s PyPI’s philosophy that search happens on the website, and pip has aligned with that. Understandably, pip doesn’t want to become a web scraper, so the search function remains disabled.

bastawhiz 173 days ago

PyPI has a search interface on their public website, though?

BiteCode_dev 172 days ago

If you really need it, they publish a dump regularly and you can query that.

For simple use cases, you have the web search, and you can curl it.
