by axlee 962 days ago
No large-scale email scraper has the budget necessary to run the content it scrapes through an LLM. So in the real world, nothing changes: the 0.2 cents it would cost to run ChatGPT on a page to extract potentially obfuscated emails is orders of magnitude more than that page could ever bring in as revenue.

Regarding the examples provided, there is nothing there that a simple regex couldn't achieve, so I don't really see the benefit of introducing an LLM into the flow, besides making it slower and more costly. john [at] company [dot] com was never a safe obfuscation in the first place, and ~99% of text obfuscations are known (because they ultimately have to be read by a human, and conventions are a thing).
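
To make the regex point concrete, here is a minimal sketch in Python. The patterns and the `extract_emails` name are my own illustration, not taken from any real scraper, and they only cover the bracketed at/dot convention:

```python
import re

# Hypothetical patterns for the most common textual obfuscation.
# Requiring brackets around "at"/"dot" avoids rewriting ordinary prose.
AT = re.compile(r"\s*[\[\(\{]\s*at\s*[\]\)\}]\s*", re.IGNORECASE)
DOT = re.compile(r"\s*[\[\(\{]\s*dot\s*[\]\)\}]\s*", re.IGNORECASE)
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text: str) -> list[str]:
    """Normalise bracketed at/dot tokens, then match plain addresses."""
    normalised = DOT.sub(".", AT.sub("@", text))
    return EMAIL.findall(normalised)


print(extract_emails("Reach me at john [at] company [dot] com for details"))
```

Each extra convention (spelled-out words without brackets, inserted "NOSPAM" tokens, and so on) would need its own pattern, which is where the maintenance cost of this approach comes from.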

13 comments

You do realize how possible it is to fine tune a task like this (along with a hundred others in a similar vein) on a tiny model you can scale on your own hardware?

I've run hundreds of millions (150m so far in a couple of weeks of non-continuous running as I tweaked things) of tokens through my 2x 3090 with a 13b llama2 model I fine tuned on tasks like: summary, knowledge graph generation, writing using the knowledge graph, grammar, spelling, and transcription correction, etc.

This type of stuff is going to be done at scale with a modest budget if you have the skills to tune more efficient and faster models to your use cases.

It's even easier than that. There's no need to even fine tune an LLM to do it. Here's a screenshot[1] of a 4-bit quantised version of an off the shelf open LLM (WizardLM 13B v1.2) doing it on my Mac.

[1]: https://imgur.com/a/S9jnHWJ

Yep, I use Llama2 70b for larger tasks on my MacBook and 13b for more “single use” type tasks. It’s a game changer.
That may be true, and for some tasks the accuracy may be high enough. I have gotten much more consistency in my tasks by fine tuning though.

Getting a consistently good result for one shape of input may not indicate that same performance for another shape of input for example.

The system confabulated the www subdomain of the “URL provided in the text”, right?
How does one efficiently learn how to do such things, and what kinds of problems such approaches are fruitful for?

I find there to be a giant gap in learning about this stuff between material that boils down to "use magic words and system prompts to improve results from one of the big models" and "how do LLMs work from first principles".

I still haven't found a great resource that covers this middle ground, which seems to me to be where a lot of the power of these approaches is going to reside.

So I described my approach to how I fine tune a specific task below to another user, but I'll copy it here:

> Design your tasks to be repeatable and small steps, call the OpenAI API and log all requests/responses.

> Filter out any bad responses and take a representative sample of the data you have collected from OpenAI, and train a Mistral or Llama2 model with the request/response pairs.

> Measure the quality of your model vs OpenAI for the same inputs, and then swap out the model in your workflow once happy with the results.
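
A rough sketch of the filter-and-sample step, assuming the logged pairs are dicts with `request` and `response` keys (my assumption, not an OpenAI format) and that a uniform random sample is representative enough. `build_training_set`, `to_jsonl`, and the `prompt`/`completion` field names are illustrative; check what your training tool actually expects:

```python
import json
import random


def build_training_set(pairs, is_good, sample_size, seed=0):
    """Filter logged request/response pairs and draw a random sample.

    `pairs` is a list of dicts with "request" and "response" keys (an assumed
    log format); `is_good` is a caller-supplied predicate on a response.
    """
    kept = [p for p in pairs if is_good(p["response"])]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(kept, min(sample_size, len(kept)))


def to_jsonl(examples):
    """Serialise examples as one JSON object per line."""
    return "".join(
        json.dumps({"prompt": ex["request"], "completion": ex["response"]}) + "\n"
        for ex in examples
    )
```

In practice `is_good` is where most of the effort goes: a response that parses but is wrong will be learned just as faithfully as a correct one.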

If you do this, be careful about how (or whether) you publish weights trained on OpenAI output. If they look into how the model was generated and it becomes clear you broke the ToS, they'll most likely ban you from the platform.
How would they "look into how it was generated"?
You train your model, publish it on huggingface and then write in the README:

> This is how I made this model: Design your tasks to be repeatable and small steps, call the OpenAI API and log all requests/responses.

> Filter out any bad responses and take a representative sample of the data you have collected from OpenAI, and train a Mistral or Llama2 model with the request/response pairs.

Just one example.

If you're looking for a practical guide to getting started with fine tuning, I wrote one a couple of months ago that got pretty popular here on HN. Might be helpful if you're interested in playing around with it! https://news.ycombinator.com/item?id=37484135
The industry term for that middle ground is a “moat”, and the people who are most familiar with it are getting paid for what they know, so they’re not giving it away.
I think that may be right, but if so, that seems pretty unusual to me.

I've gone through a few of these "new kinds of software becoming useful" transition periods - most notably applications moving to the web, and then native smart phone applications - and in none of those transitions was there a dearth of resources on how to spin up on doing useful things due to this "moat" concern.

Nobody was protecting their iPhone app dev moat by not publishing books and training courses on Objective-C and Xcode...

> I still haven't found a great resource that covers this middle ground, which seems to me to be where a lot of the power of these approaches is going to reside.

Read papers, build intuition, experiment.

That last part may be the most important.

I think this is the disconnect: It doesn't strike me that what I'm talking about has anything to do with "papers". So from your comment, I'm once again left wondering what you mean.

My sense is that I have a much better grasp of the foundational material here, having read in depth books and papers about that, but still can't quite wrap my head around the question of how people are actually "operationalizing" this into useful software.

But to your point about experimentation, it might just be the kind of thing where there is no path to enlightenment besides working on a project and running into and overcoming all the hurdles along the way.

huggingface is your friend.
But not at webscale. It's fine if you want to summarize something for personal use. The size of model you're talking about is still way too large if you're trying to harvest millions of e-mail addresses from billions of webpages.
I'm also looking forward to what Apple Mail and other local clients are able to do. My laptop's CPU is idle most of the time, why not use that extra CPU time to do something cool like filter spam better?
Microsoft already does that, and its Antimalware agent is the bane of my existence. It will see idle machines spin up their fans to full and drain batteries within a short few hours. No thank you!
That sounds like something that's easily fixable with battery saving options. Basically, when on battery, don't do that. That would be a good default.
Because that costs energy. The fact that your CPU is idle most of the time is why you can get hours of battery life.
When plugged into the grid, it makes sense to spend a few cents of energy a day to filter out unwanted solicitations, harassment that you may not want to see, scam emails or texts, etc.

If I didn't have to worry about my grandparents getting scammed because they were having 99.99% of it effectively filtered or warned about at one layer or another before it actually became a problem...can you imagine how much you could lower that type of fraud/abuse?

> When plugged into the grid

Exactly! Apple for instance already does this with some ML tasks that only run when your device is plugged in, I think it's a great compromise.

The grid, of course, is less sure about this compromise.
My guess is you wouldn't lower it by much because there are more incentives for attackers than for defenders to invest in these approaches, so it's likely that by the time grandmas are running LLM-based anti-fraud tooling the attackers will already be running LLM-based attacks as well.
You don't need a "model" for this - I remember a Coursera course on ML I did some years ago, and one of the exercises was email extraction. With some very basic algorithms, nothing more than a bunch of common Python libraries and a couple of days of work, it's possible to extract over 90% of emails with commonly used tricks. I'm not sure the remaining number is worth making more complicated models for it - the returns are quickly diminishing, and wasting time on spamming people who are clever enough to invent their own unique email hiding technique probably doesn't have a good ROI anyway.
Why fine tune an LLM if you can defeat most obfuscation techniques with a few regexes?
Because there's a chance your LLM might be able to still get what you need if the obfuscation technique is changed or altered.

Anyway, nothing to say you can't use both, or have a fallback system.

Is it possible to know the minimum model size / data set size it takes to train a model given certain efficiency parameters (latency, etc.)?
If OpenAI can generate those for customers they will make a killing. Export the piece out of ChatGPT you care about and run it on-prem for way less.
It's entirely possible without OpenAI doing anything else. Design your tasks to be repeatable and small steps, call the OpenAI API and log all requests/responses.

Filter out any bad responses and take a representative sample of the data you have collected from OpenAI, and train a Mistral or Llama2 model with the request/response pairs.

Measure the quality of your model vs OpenAI for the same inputs, and then swap out the model in your workflow once happy with the results.
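
The measuring step could look something like this. It is a sketch only: `agreement_rate` is a hypothetical helper, both models are passed in as plain callables, and exact string match after stripping whitespace is a crude stand-in for whatever quality metric fits the task:

```python
def agreement_rate(inputs, reference_model, candidate_model, normalise=str.strip):
    """Fraction of inputs where the candidate's output matches the reference."""
    if not inputs:
        raise ValueError("need at least one input")
    matches = sum(
        normalise(candidate_model(x)) == normalise(reference_model(x))
        for x in inputs
    )
    return matches / len(inputs)
```

Run it on inputs that were held out of the training sample, otherwise the number mostly measures memorisation.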

This is assuming you’re using a really big LLM behind a paid service. There are plenty of smaller open source models. Not sure at what point it’s not “large” but when fine tuned they are capable of matching the largest LLM in performance on narrow tasks.

Some of these open source models can even be run on your local machine. It’d be very inexpensive to run thousands of pages through it.

https://llm-leaderboard.streamlit.app/

The "smaller" open source models with adequate capabilities are still rather large and thus compute-intensive. Running thousands of pages through it on a random CPU won't happen in minutes but rather in days, and extracting emails from only thousands of pages is not very valuable.
Thousands of pages is pretty good and what I’m coming to expect on the low side for cheap (single consumer GPU or NPU) throughput with the 5…8GB models now. Heck, with some of the optimizations that Llama.cpp has made, with SafeTensors and GGUF, you can reduce the actual memory usage down.

A cheap Mac mini with apple’s neural cores is good enough that it roleplays smut with a human at human speed. We’re going to see a rapid increase in throughput to price. We’ve already got small LLMs that run on mobile phones.

Scraping is about hundreds of millions or billions of pages, not thousands.
This is a perfect task to have an LLM identify patterns and write regexes. This is not only more efficient, you can also test the code. Examples of hybrid LLM use like this are the wonderful but badly named Code Interpreter, or Eureka[1], where an LLM is used to write the objective function for an RL task to learn dexterous manipulation (watch the video, it’s crazy).

[1]: https://blogs.nvidia.com/blog/2023/10/20/eureka-robotics-res...

(not advocating for collecting email, just responding to all the comments that LLMs aren’t appropriate because of speed/cost)

0.2 cents is how much a single request costs for a well-protected website where web scrapers look for emails (e.g. LinkedIn): https://scrapingfish.com/#pricing

Paying an additional 0.2 cents per request, if it can significantly improve your success rate, is not really that much, and some people use LLMs for even simpler parsing tasks to save time on development efforts.

I don't see how that contradicts my point. With your provider, we're talking about a 100% price increase, for what can't be more than a few points of accuracy in return (and a huuuuge slowdown, because LLMs are slow). At scale, it's all about going through a lot of pages, and fast: accuracy is a bonus, and being 90% accurate is better than being 99% accurate if the throughput is divided by 10 or more.
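
Putting rough numbers on that, as a back-of-the-envelope sketch. The 1,000,000 pages per hour and one address per page are made-up figures for illustration; the 0.2 cents, the 90% and 99% accuracy, and the factor of 10 come from this thread:

```python
def emails_per_hour(pages_per_hour: float, emails_per_page: float, accuracy: float) -> float:
    """Addresses successfully extracted per hour."""
    return pages_per_hour * emails_per_page * accuracy


def cents_per_email(cents_per_page: float, emails_per_page: float, accuracy: float) -> float:
    """Cost of each successfully extracted address, in cents."""
    return cents_per_page / (emails_per_page * accuracy)


# Regex only: 0.2 cents per page (the scraping request), 90% accuracy.
regex_rate = emails_per_hour(1_000_000, 1.0, 0.90)
regex_cost = cents_per_email(0.2, 1.0, 0.90)

# Regex plus LLM: another 0.2 cents per page, 99% accuracy, throughput / 10.
llm_rate = emails_per_hour(100_000, 1.0, 0.99)
llm_cost = cents_per_email(0.4, 1.0, 0.99)

print(f"throughput ratio: {regex_rate / llm_rate:.1f}x in favour of regex")
print(f"cost per email: {regex_cost:.3f} vs {llm_cost:.3f} cents")
```

Under these assumptions the regex-only pipeline yields roughly nine times as many addresses per hour at a bit over half the cost per address.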
No one uses a saas like that one for large scale scraping (billions of requests)
True, but the example in the post (Ask HN: Who is Hiring) shows somewhere it CAN be useful: when they are not just collecting emails for generic spam, but where the addresses are more valuable because they can be used for a targeted ad (e.g. an ad for a programming job sent out to programmers). Those lists are more valuable.

One can also imagine using GPT as part of a pipeline. Maybe one can extract most emails using regexps or other similar measures, but also use computationally simple heuristics to collect paragraphs of text which are likely to contain emails (which the regexps did not manage to extract), and then use ChatGPT (or cheaper LLMs) to extract only from the most valuable paragraphs. Probably useful (for the spammers) to also be able to categorize the paragraph, so programmers get programming-spam and chefs get chef-spam.
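
A minimal sketch of such a pipeline, under assumptions of my own: the cue list, the threshold of 2, and the `llm_extract` callable (a stand-in for whatever model call you would use) are all hypothetical.

```python
import re
from typing import Callable, List

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Cheap textual cues that a paragraph may hide an address (illustrative list).
CUES = [
    re.compile(r"\be-?mail\b", re.IGNORECASE),
    re.compile(r"\bcontact\b", re.IGNORECASE),
    re.compile(r"\bdot\b", re.IGNORECASE),
    re.compile(r"\b(?:gmail|outlook|proton)\b", re.IGNORECASE),
]


def cue_score(paragraph: str) -> int:
    """Count how many distinct cues appear in the paragraph."""
    return sum(1 for cue in CUES if cue.search(paragraph))


def extract(paragraph: str, llm_extract: Callable[[str], List[str]],
            threshold: int = 2) -> List[str]:
    """Regex first; call the model only for paragraphs that look promising."""
    found = EMAIL.findall(paragraph)
    if found:
        return found
    if cue_score(paragraph) >= threshold:
        return llm_extract(paragraph)
    return []
```

The threshold is the knob that trades model cost against missed addresses; the model is never called for paragraphs the regex already handles.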

Indeed, a large handful of regexes would probably get you from, say, 98% (the unobfuscated majority) to 99.99% or better. That 0.01% is probably not worth the effort. And even then, it's probably few enough that you could fix them by hand (and add to the regex list as you go).
I get very very little "obvious" email SPAM these days. I get a great deal of email from conference badge scans, retail that I've done business with, and a great number of other sources that I often/mostly don't care about but I'm not sure how you would algorithmically prevent outside of some labor-intensive and error-prone whitelist.

I'm sure some people would prefer to basically block email by default but that's not most of us.

They don't need the budget. If you are doing scraping for nefarious purposes then you already have access to large numbers of stolen credit cards to sign up for multiple OpenAI accounts. Or there are plenty of files around with thousands of OpenAI cookies you can use to borrow someone else's OpenAI account.
There are likely only so many patterns that can exist and still be usable as a shareable and human-reversible email address. Maybe one could have the LLM identify these edge cases and write regex/code to parse future instances to convert back to a usable format.
Let's say the poster works at IBM and they want to provide a means for humans to contact them and frustrate scrapers so they say:

'My email address is my username@"the company no one ever got fired for buying from".com'

Write me a regex for that.

GPT on the other hand:

https://imgur.com/a/Hw0kcee

Yes, of course. My point was that the LLM could gradually write functions to parse most of the clever patterns as it comes across them. Use the LLM as a trapdoor when nothing else matches, and in the same text generation create a script to parse the same input next time.
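
Roughly what I have in mind, as a sketch. `PatternCache` and the `propose_rule` callable are hypothetical: `propose_rule` stands in for an LLM call that returns a regex plus a replacement template, and a rule is kept only if it compiles and matches the text that triggered it.

```python
import re
from typing import Callable, List, Optional, Tuple

# Each learned rule is (compiled pattern, replacement template).
Rule = Tuple[re.Pattern, str]


class PatternCache:
    def __init__(self, propose_rule: Callable[[str], Tuple[str, str]]):
        # Hypothetical interface: text in, (regex, template) out.
        self.propose_rule = propose_rule
        self.rules: List[Rule] = []

    def extract(self, text: str) -> Optional[str]:
        # Cheap path: try every rule learned so far.
        for pattern, template in self.rules:
            m = pattern.search(text)
            if m:
                return m.expand(template)
        # Trapdoor: ask the model for a new rule and validate it.
        raw_pattern, template = self.propose_rule(text)
        try:
            pattern = re.compile(raw_pattern)
            m = pattern.search(text)
            result = m.expand(template) if m else None
        except re.error:
            return None  # invalid regex or template: discard the proposal
        if result is None:
            return None
        self.rules.append((pattern, template))
        return result
```

A real version would also check that the expanded result looks like an address before trusting the rule, since a model can propose a pattern that matches but extracts the wrong thing.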

I agree that the LLM is more versatile and retract what I wrote earlier saying that all patterns could be accounted for. I should have said most easy to figure out patterns.

https://imgur.com/a/uFV3XD9

> No large-scale email scraper has the budget necessary to run the content it scrapes through an LLM.

That's why it's awesome to be able to locally run open-source LLaMA! Not to mention that by running it locally OpenAI is not aware of your shenanigans.

> No large-scale email scraper has the budget necessary to run the content it scrapes through an LLM

Maybe not a big third-party commercial one. A smaller, purpose-tuned one that the scraper runs on their own, OTOH...

ChatGPT does quite well on the same task when you send it multiple inputs; I think you could probably do 10 of these at a time without degradation.
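
A sketch of what batching could look like. The prompt wording, the `<number>: <address>` reply format, and the helper names are all my own assumptions, and the actual model call is left out:

```python
def batches(items, size=10):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_prompt(snippets):
    """Number the snippets so the reply can be matched back to its input."""
    lines = [
        "Extract the email address from each numbered snippet.",
        "Answer with one line per snippet in the form '<number>: <address>'.",
        "",
    ]
    lines += [f"{i}. {s}" for i, s in enumerate(snippets, start=1)]
    return "\n".join(lines)


def parse_reply(reply, expected):
    """Map snippet numbers to addresses; numbers missing from the reply map to None."""
    out = {i: None for i in range(1, expected + 1)}
    for line in reply.splitlines():
        num, sep, addr = line.partition(":")
        num = num.strip()
        if sep and num.isdecimal() and int(num) in out:
            out[int(num)] = addr.strip()
    return out
```

Keeping track of which numbers came back empty lets you retry just those snippets individually instead of resending the whole batch.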
If cost of compute is the only argument you have on this, then we're already in game over mode: LLMs won.

Cost of LLM compute will continue to fall, models will take less CPU time to run, and the hardware will become more powerful.