Hacker News new | ask | show | jobs
by retonato 1929 days ago
In principle it is possible, but there would be some limitations:

1. Postgresql supports indexing to make regex searches faster from v9.3 (the current version is 13): https://www.postgresql.org/docs/12/pgtrgm.html . However, it makes them faster only in some (simpler) cases, in others we are back to plain old full table scan.

2. Speaking of full table scan, it is not that bad of an option, the average list of file paths for a torrent file is approximately 6 KB, that's 300 GB per 50 mln torrents, that's completely within reach of some VPS providers (like BuyVM). Still, up to a few minutes per one pass.

3. So, unless I find some efficient technology for regex search in large volumes of text, I would be able to implement only "offline" search (submit query, receive link to search results in 5-30 minutes, depending on server load).

4. Another option would be to provide listings of files for downloading as csv files (torrent_id, filepath) and let user to use command line for searching (or some text viewer/editor, though most popular ones still load all contents into ram). Compressed size would be around 1 KB per one torrent file, that's near 50 GB per 50m torrents.

1 comments

Yeah, it isn't something that is feasible if you can't tailor the infrastructure to the underlying data. An SQL backend is about as generic as it gets, hence the poor results. Enterprise DB deployments get away with it because they can justify throwing a lot more hardware behind a centralized generic data store.

Give that second link a closer look if you change your mind. It demonstrates a way of indexing 1.6 billion 80B strings in 24GB, and then returning the result of a fixed string search in 100ms. That is the reward you get when you venturing outside of the LAMP stack: less resource consumption, greater performance, increased utility.

Thank you, I will