|
No large-scale email scraper has the budget to run the content it scrapes through an LLM. So as far as the real world goes, nothing changes: at roughly 0.2 cents per page, running ChatGPT to extract potentially obfuscated emails would cost orders of magnitude more than those addresses could ever bring in as revenue. As for the examples provided, there is nothing there that a simple regex couldn't handle, so I don't really see the benefit of introducing an LLM into the flow; it only makes things slower and more costly. john [at] company [dot] com was never a safe obfuscation in the first place, and ~99% of text obfuscations are known (because they ultimately have to be read by a human, and conventions are a thing). |
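To make the regex point concrete, here is a minimal sketch of how bracketed "at"/"dot" conventions could be undone without an LLM. The names `AT`, `DOT`, `EMAIL` and `extract_emails` are made up for illustration, and the pattern only covers a handful of known conventions (bracketed or parenthesized "at" and "dot"); a real scraper would presumably carry a much longer list.

```python
import re

# Illustrative, non-exhaustive stand-ins for "@" and "." (assumed conventions).
AT = r"(?:@|\s*[\[\(\{]\s*at\s*[\]\)\}]\s*)"
DOT = r"(?:\.|\s*[\[\(\{]\s*dot\s*[\]\)\}]\s*)"

# local part, an "@" stand-in, then two or more domain labels joined by "." stand-ins
EMAIL = re.compile(
    rf"([A-Za-z0-9._%+-]+){AT}([A-Za-z0-9-]+(?:{DOT}[A-Za-z0-9-]+)+)",
    re.IGNORECASE,
)
DOT_RE = re.compile(DOT, re.IGNORECASE)


def extract_emails(text):
    """Return normalized, lowercased addresses found in text."""
    found = []
    for match in EMAIL.finditer(text):
        local = match.group(1)
        # Collapse every "." stand-in inside the domain to a literal dot.
        domain = DOT_RE.sub(".", match.group(2))
        found.append(f"{local}@{domain}".lower())
    return found


print(extract_emails("Contact john [at] company [dot] com or jane(at)example(dot)org"))
```

Spelled-out forms without brackets ("john at company dot com") are left out on purpose, since matching them without false positives needs more care than this sketch takes.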
I've run well over a hundred million tokens (150M so far, over a couple of weeks of non-continuous running as I tweaked things) through my 2x 3090 with a 13B Llama 2 model I fine-tuned on tasks like summarization, knowledge graph generation, writing from the knowledge graph, and grammar, spelling, and transcription correction.
This type of stuff is going to be done at scale on a modest budget by anyone with the skills to tune smaller, faster models for their use case.
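For a rough sense of scale, the two figures quoted in this thread can be put side by side. This is a back-of-the-envelope sketch, not a benchmark: the 14-day window and the 1,000 tokens per page are assumptions I'm making purely for illustration, and since the run was non-continuous the daily figure is at best a lower bound.

```python
# Figures quoted in the comments above.
TOKENS_PROCESSED = 150_000_000   # "150m so far"
API_CENTS_PER_PAGE = 0.2         # quoted hosted-LLM cost per page

# Assumptions for illustration only; replace with measured values.
ELAPSED_DAYS = 14                # "a couple of weeks", read here as 14 days
TOKENS_PER_PAGE = 1_000          # hypothetical average page size in tokens


def local_pages_per_day(tokens, days, tokens_per_page):
    """Average pages processed per day on the local setup."""
    return tokens / days / tokens_per_page


def api_cost_dollars(pages, cents_per_page):
    """Cost in dollars of sending that many pages to a hosted model."""
    return pages * cents_per_page / 100


pages = local_pages_per_day(TOKENS_PROCESSED, ELAPSED_DAYS, TOKENS_PER_PAGE)
print(f"{pages:,.0f} pages/day locally")
print(f"${api_cost_dollars(pages, API_CENTS_PER_PAGE):,.2f}/day at the quoted API price")
```

Electricity and hardware costs are deliberately left out, because nothing in the thread gives numbers for them.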