HN Mirror


by Apreche 876 days ago

Most people using machine learning to make search engines are replacing the search paradigm with a prompt + answer format.

I think there’s an easier way. Train an ML model to tell legit web sites apart from garbage ones. It’s just binary classification: a site should either be blocked or not.

Legit web sites being ones created by actual humans with actual content. Few to no ads. No malware, phishing, or other security threats. No content farms or SEO sites. No sites generated by other ML models. No paywalls, no pop-ups or other annoyances. Just real web sites.

You’re going to need a bunch of smart and trustworthy humans to spend hours and hours to help do this classification. But a model can help multiply the effectiveness of their efforts.
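One way to read "multiply the effectiveness" is as a triage loop: the model auto-labels the sites it is confident about and sends only the uncertain ones to the human reviewers. Below is a rough sketch of that idea in Python. The `triage` function, the `score` callable, and the 0.2/0.8 thresholds are all made up for illustration; nothing here is an existing tool.

```python
from typing import Callable, Iterable, List, Tuple


def triage(
    sites: Iterable[str],
    score: Callable[[str], float],
    low: float = 0.2,   # assumed cutoff: at or below this, treat as garbage
    high: float = 0.8,  # assumed cutoff: at or above this, treat as legit
) -> Tuple[List[str], List[str], List[str]]:
    """Split sites into (auto_legit, needs_human, auto_garbage).

    `score` is a hypothetical model returning the probability that a
    site is legit. Only the uncertain middle band goes to human reviewers.
    """
    auto_legit: List[str] = []
    needs_human: List[str] = []
    auto_garbage: List[str] = []
    for site in sites:
        p = score(site)
        if p >= high:
            auto_legit.append(site)
        elif p <= low:
            auto_garbage.append(site)
        else:
            needs_human.append(site)
    return auto_legit, needs_human, auto_garbage


# Toy scores standing in for a trained model (invented values).
fake_scores = {"blog.example": 0.95, "spam.example": 0.05, "news.example": 0.5}
legit, review, garbage = triage(fake_scores, fake_scores.get)
```

With these toy scores, only `news.example` lands in the human review queue. Whatever the humans decide can be fed back in as new training labels.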

If the model works, then yes. You can make a very simple search engine. You just tell all the web crawlers to check the model, and only add sites to the index if the model says they are good web sites and not garbage sites.
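Roughly, the crawler-side gate could look like the sketch below. It assumes the classifier is exposed as a plain function; `build_index`, `search`, and `toy_is_legit` are hypothetical names, and the keyword rule is only a placeholder for a real trained model.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_index(
    crawled: Iterable[Tuple[str, str]],
    is_legit: Callable[[str, str], bool],
) -> Dict[str, str]:
    """Keep only the pages the classifier accepts.

    `crawled` yields (url, page_text) pairs; `is_legit` is a stand-in
    for the trained binary classifier described in the post.
    """
    index: Dict[str, str] = {}
    for url, text in crawled:
        if is_legit(url, text):
            index[url] = text
    return index


def search(index: Dict[str, str], query: str) -> List[str]:
    """Naive substring search over the filtered index (no ranking)."""
    q = query.lower()
    return sorted(url for url, text in index.items() if q in text.lower())


def toy_is_legit(url: str, text: str) -> bool:
    # Placeholder rule, NOT a real model: reject pages with a spammy phrase.
    return "buy now" not in text.lower()


pages = [
    ("https://a.example/bread", "How I bake sourdough bread at home"),
    ("https://b.example/deals", "Bread machines BUY NOW best price"),
]
index = build_index(pages, toy_is_legit)
results = search(index, "bread")
```

Here `results` holds only the first URL, since the second page never makes it into the index. All the hard work is hidden inside `is_legit`.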

1 comment

krapp 876 days ago

Websites that practice SEO, use paywalls, popups or "other annoyances" often also have actual content which should be relevant to a search engine. A search engine that refused to show me Wikipedia or IMDB or any mainstream news site wouldn't be useful to most people.

Also, tools that attempt to detect ML-generated content tend not to work, and will only become less effective over time as LLMs improve.
