by tal8d
1928 days ago
Yup, the most resource-constrained form (PCRE2 would be the Cadillac-DoS-my-webapp option). I don't know how abnormal my problem is, but I've run into several cases where I needed a file that had a very specific naming structure - but a couple of substrings needed to be masked because that was the information I was looking for. That can easily be handled with two fixed-string passes on a local file system, but not so much at the scale we're talking about here.

As far as the indexing issues you mentioned, I'm not a Python guy, so I can't recommend a drop-in solution. But I have indexed pretty massive string datasets, and you definitely want to select an index method that was specifically devised for the key/value datatype you intend to ingest. So hashmaps are probably out :) It would also pay to adapt either the implementation or the data itself. For example, say you had a bunch of font file names you wanted to index (XyzSans.otf, AsdfSans.otf, AsdfSerif.otf, etc.): a prefix tree would be a pretty good fit, especially if you reversed all the strings, since names sharing a suffix like Sans.otf would then share a path from the root.

https://en.wikipedia.org/wiki/Trie
https://blog.burntsushi.net/transducers
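
To make the reversed-string idea concrete, here is a minimal sketch in Python (the language the parent mentioned). The class name `ReversedTrie` and its methods are made up for illustration, and a plain dict-of-dicts stands in for whatever compact structure (such as the transducers in the second link) you would actually use at scale.

```python
class ReversedTrie:
    """Prefix tree over reversed strings, so a suffix query becomes a prefix walk."""

    def __init__(self):
        self.root = {}

    def insert(self, name):
        node = self.root
        for ch in reversed(name):
            node = node.setdefault(ch, {})
        # The empty string can never be a character key, so it is safe
        # to use as the end-of-name marker; it stores the original name.
        node[""] = name

    def ending_with(self, suffix):
        # Walk the reversed suffix down from the root.
        node = self.root
        for ch in reversed(suffix):
            node = node.get(ch)
            if node is None:
                return []
        # Everything below this node shares the suffix; collect it.
        found = []
        stack = [node]
        while stack:
            current = stack.pop()
            for key, child in current.items():
                if key == "":
                    found.append(child)
                else:
                    stack.append(child)
        return sorted(found)


fonts = ReversedTrie()
for font_name in ["XyzSans.otf", "AsdfSans.otf", "AsdfSerif.otf"]:
    fonts.insert(font_name)

print(fonts.ending_with("Sans.otf"))
```

The walk costs time proportional to the length of the suffix plus the number of matches, regardless of how many names are stored. Memory is the weak point of this naive version, which is why it is only a sketch.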
|
1. PostgreSQL has supported indexes that speed up regex searches since v9.3 (the current version is 13), via pg_trgm: https://www.postgresql.org/docs/12/pgtrgm.html . However, the index helps only in some (simpler) cases; in the others we are back to a plain old full table scan.
2. Speaking of full table scans, they are not that bad an option. The average list of file paths for a torrent file is approximately 6 KB, which comes to 300 GB per 50 million torrents. That is completely within reach of some VPS providers (like BuyVM). Still, a single pass would take up to a few minutes.
3. So, unless I find some efficient technology for regex search over large volumes of text, I would be able to implement only "offline" search (submit a query, receive a link to the search results in 5-30 minutes, depending on server load).
4. Another option would be to provide the file listings for download as CSV files (torrent_id, filepath) and let users search them from the command line (or with some text viewer/editor, though the most popular ones still load the entire contents into RAM). The compressed size would be around 1 KB per torrent file, which is about 50 GB per 50 million torrents.
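
For what it's worth, the "download the CSV and search it yourself" option in point 4 could look roughly like this on the user's side. This is only a sketch: the function names and the gzip compression are assumptions, and the only thing taken from the list above is the two-column (torrent_id, filepath) layout. It streams the file row by row, so nothing close to the full 50 GB ever sits in RAM.

```python
import csv
import gzip
import io
import re


def search_listing(lines, pattern):
    """Yield (torrent_id, filepath) rows whose filepath matches the regex."""
    regex = re.compile(pattern)
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip malformed rows rather than abort a long scan
        torrent_id, filepath = row
        if regex.search(filepath):
            yield torrent_id, filepath


def search_gzip_csv(path, pattern):
    """Same scan over a gzip-compressed CSV (hypothetical file layout)."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from search_listing(handle, pattern)


# Tiny in-memory stand-in for a real listing file.
sample = io.StringIO(
    "1,fonts/XyzSans.otf\n"
    "2,docs/readme.txt\n"
    "3,fonts/AsdfSerif.otf\n"
)
hits = list(search_listing(sample, r"(Sans|Serif)\.otf$"))
print(hits)
```

The same loop is essentially what the server-side full scan in point 2 would do, so it is still linear in the size of the data; it only moves the cost to whoever runs the query.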