| HN Mirror


by tal8d 1928 days ago

Yup, the most resource-constrained form (PCRE2 would be the Cadillac-DoS-my-webapp option). I don't know how abnormal my problem is, but I've run into several cases where I needed a file that had a very specific naming structure, except that a couple of substrings had to be masked because they were the information I was looking for. That can easily be handled with two fixed-string passes on a local file system, but not so much at the scale we're talking about here.

As for the indexing issues you mentioned, I'm not a Python guy, so I can't recommend a drop-in solution. But I have indexed pretty massive string datasets, and you definitely want to select an index method that was specifically devised for the key/value datatype you intend to ingest. So hashmaps are probably out :) It would also pay to adapt either the implementation or the data itself. For example, say you had a bunch of font file names you wanted to index (XyzSans.otf, AsdfSans.otf, AsdfSerif.otf, etc.): a prefix tree would be a pretty good fit, especially if you reversed all the strings, so that shared endings like "Sans.otf" become shared prefixes.

https://en.wikipedia.org/wiki/Trie

https://blog.burntsushi.net/transducers
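To make the reversed-string idea concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the names `TrieNode`, `SuffixLookup` and `ending_with` are made up for this example, and a real index at this scale would want something far more compact than nested dicts (see the transducers link above).

```python
from typing import Dict, Iterator, List


class TrieNode:
    """One node of the prefix tree: child links plus an end-of-key flag."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_key = False


class SuffixLookup:
    """Prefix tree over reversed keys, so a suffix query is a prefix walk."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, name: str) -> None:
        # Insert the name back to front: "Sans.otf" endings share one path.
        node = self.root
        for ch in reversed(name):
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def ending_with(self, suffix: str) -> List[str]:
        # Walk the reversed suffix; a missing link means no key ends with it.
        node = self.root
        for ch in reversed(suffix):
            node = node.children.get(ch)
            if node is None:
                return []
        return sorted(self._collect(node, suffix))

    def _collect(self, node: TrieNode, tail: str) -> Iterator[str]:
        # Rebuild each original name by prepending characters to the tail.
        if node.is_key:
            yield tail
        for ch, child in node.children.items():
            yield from self._collect(child, ch + tail)


index = SuffixLookup()
for font in ["XyzSans.otf", "AsdfSans.otf", "AsdfSerif.otf"]:
    index.add(font)

print(index.ending_with("Sans.otf"))  # ['AsdfSans.otf', 'XyzSans.otf']
```

The lookup cost depends on the length of the suffix plus the number of matches, not on the total number of names stored.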


retonato 1928 days ago

In principle it is possible, but there would be some limitations:

1. PostgreSQL has supported indexes that speed up regex searches since v9.3 (the current version is 13): https://www.postgresql.org/docs/12/pgtrgm.html . However, they make searches faster only in some (simpler) cases; in the others we are back to a plain old full table scan.

2. Speaking of full table scans, that is not such a bad option. The average list of file paths for a torrent file is approximately 6 KB, so that's 300 GB per 50 million torrents, which is completely within reach of some VPS providers (like BuyVM). Still, it means up to a few minutes per pass.

3. So, unless I find some efficient technology for regex search in large volumes of text, I would be able to implement only "offline" search (submit query, receive link to search results in 5-30 minutes, depending on server load).

4. Another option would be to provide the file listings for download as CSV files (torrent_id, filepath) and let users search them from the command line (or with some text viewer/editor, though the most popular ones still load the entire contents into RAM). Compressed size would be around 1 KB per torrent file, so that's nearly 50 GB per 50 million torrents.
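For option 4, a user-side search could look roughly like the Python sketch below. It assumes a gzip-compressed dump with exactly two columns (torrent_id, filepath) and no header row; the function name `search_listing` and the file name in the usage note are hypothetical, since no such dump exists yet. The file is decompressed and scanned row by row, so it never has to fit into RAM.

```python
import csv
import gzip
import re
from typing import Iterator, Tuple


def search_listing(path: str, pattern: str) -> Iterator[Tuple[str, str]]:
    """Yield (torrent_id, filepath) rows whose filepath matches the regex."""
    regex = re.compile(pattern)
    # Text mode decompresses on the fly; rows are read one at a time.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != 2:
                continue  # skip malformed rows instead of failing the scan
            torrent_id, filepath = row
            if regex.search(filepath):
                yield torrent_id, filepath
```

Usage would be something like `for torrent_id, filepath in search_listing("listing.csv.gz", r"Sans\.otf$"): print(torrent_id, filepath)`. This is still a full scan, so the run time grows with the size of the dump.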

tal8d 1928 days ago

Yeah, it isn't something that is feasible if you can't tailor the infrastructure to the underlying data. An SQL backend is about as generic as it gets, hence the poor results. Enterprise DB deployments get away with it because they can justify throwing a lot more hardware behind a centralized generic data store.

Give that second link a closer look if you change your mind. It demonstrates a way of indexing 1.6 billion 80B strings in 24GB, and then returning the result of a fixed string search in 100ms. That is the reward you get when you venture outside of the LAMP stack: less resource consumption, greater performance, increased utility.

retonato 1928 days ago

Thank you, I will