Hacker News
by vkkhare 1828 days ago
How do you train newer models then? From what I read, you use public datasets to train your models, but what about in the future? Won't you need some kind of data collection mechanism?

GPT-2 and GPT-3 are great, but the datasets they were trained on will soon get old.

Hey, thanks. We will not log searches or collect personal data.

There are public sources of search data, updated regularly, that we can use for transfer learning on top of large-scale language models like GPT-3. With this sort of data (phrases mapped to intents), transfer learning works well without needing massive datasets.

Having said that, the app does track the intent and topic profile of each search (not the search itself, just a label such as FoodPlaceSearchIntent) and whether the result was likely a good one, based on signals such as whether the search was repeated or rephrased (again, without recording the actual search). The models learn from that. We're also adding more signals, including anonymized upvotes/downvotes.
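
To make the idea concrete, here is a minimal sketch of what intent-level tracking could look like. It is an illustration only, not the app's actual code: the names IntentEvent, IntentSignalLog, and the repeated_or_rephrased flag are all hypothetical. The point it shows is that the record holds an intent label and an outcome signal, and the query text is never passed in at all.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentEvent:
    """Hypothetical record: an intent label plus an outcome signal, no query text."""
    intent: str        # e.g. "FoodPlaceSearchIntent"
    likely_good: bool  # derived from behaviour, not from the search itself


class IntentSignalLog:
    """Keeps only per-intent counts of likely-good and likely-bad outcomes."""

    def __init__(self):
        self.good = Counter()
        self.bad = Counter()

    def record(self, intent, repeated_or_rephrased):
        # Assumption: a repeated or rephrased search suggests a poor result.
        event = IntentEvent(intent, likely_good=not repeated_or_rephrased)
        bucket = self.good if event.likely_good else self.bad
        bucket[event.intent] += 1
        return event

    def success_rate(self, intent):
        total = self.good[intent] + self.bad[intent]
        return self.good[intent] / total if total else None


log = IntentSignalLog()
log.record("FoodPlaceSearchIntent", repeated_or_rephrased=False)
log.record("FoodPlaceSearchIntent", repeated_or_rephrased=True)
print(log.success_rate("FoodPlaceSearchIntent"))
```

A per-intent success rate like this is the kind of aggregate a model could learn from without any individual search being stored.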

Approaches like differential privacy are something we want to pursue more in the future. It's still very early days for us!

Makes sense, though I wonder what the original source of that data would be (someone like Google or Microsoft must be logging user data and then anonymizing parts of it and making them public).

Maybe also look into on-device learning. It can be hooked up efficiently with differential privacy and give more specific results.
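
As a rough illustration of how differential privacy can be applied on the device, here is a sketch of randomized response, a classic local differential privacy technique, applied to an upvote/downvote bit. It is not anything the app is known to use; the function names and the p_truth parameter are made up for this example. Each device reports its true vote only with probability p_truth and otherwise reports a fair coin flip, so no single report can be trusted, yet the aggregate rate can still be estimated.

```python
import random


def randomize_vote(upvote, p_truth, rng):
    """Report the true vote with probability p_truth, else a fair coin flip."""
    if rng.random() < p_truth:
        return upvote
    return rng.random() < 0.5


def estimate_upvote_rate(reports, p_truth):
    """Invert the noise: E[reported] = p_truth * true_rate + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


rng = random.Random(0)  # seeded only so the example is repeatable
true_votes = [i % 10 < 7 for i in range(20000)]  # 70% upvotes, synthetic data
reports = [randomize_vote(v, p_truth=0.5, rng=rng) for v in true_votes]
print(round(estimate_upvote_rate(reports, p_truth=0.5), 2))
```

With p_truth = 0.5 this scheme satisfies epsilon = ln 3 local differential privacy. A lower p_truth gives stronger privacy at the cost of a noisier estimate.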

Yes, ironically, SEO industry resources can be helpful, and we used them in putting together training data. If you're interested, there are also some good, simple, free ones to get started with, like these from Mondovo:

https://www.mondovo.com/keywords/

The Brave browser uses aggregated search history data that's been anonymized, but we're not trying to personalize results (we're looking for objectively true, rather than "true for you"), so we're not trying to replicate ad-industry-style personalization. A good set of labelled data matching phrases to intents made it simple to build some models that are surprisingly good at picking intents :)
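
For anyone curious what "labelled data matching phrases to intents" can buy you, here is a toy sketch. To be clear, this is not the model described above and it is not transfer learning: it is a plain bag-of-words naive Bayes classifier with add-one smoothing, used as a much simpler stand-in. The training phrases and the WeatherIntent label are invented for the example; only FoodPlaceSearchIntent comes from this thread.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesIntentClassifier:
    """Bag-of-words naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # intent -> word -> count
        self.intent_counts = Counter()           # intent -> number of phrases
        self.vocab = set()

    def fit(self, examples):
        for phrase, intent in examples:
            words = phrase.lower().split()
            self.intent_counts[intent] += 1
            self.word_counts[intent].update(words)
            self.vocab.update(words)
        return self

    def predict(self, phrase):
        words = phrase.lower().split()
        total = sum(self.intent_counts.values())

        def score(intent):
            counts = self.word_counts[intent]
            denom = sum(counts.values()) + len(self.vocab)
            s = math.log(self.intent_counts[intent] / total)  # log prior
            for w in words:
                s += math.log((counts[w] + 1) / denom)        # smoothed likelihood
            return s

        return max(self.intent_counts, key=score)


# Hypothetical labelled data: phrases mapped to intents.
examples = [
    ("pizza near me", "FoodPlaceSearchIntent"),
    ("best sushi restaurant", "FoodPlaceSearchIntent"),
    ("cheap tacos downtown", "FoodPlaceSearchIntent"),
    ("weather tomorrow", "WeatherIntent"),
    ("will it rain today", "WeatherIntent"),
    ("weekend weather forecast", "WeatherIntent"),
]
clf = NaiveBayesIntentClassifier().fit(examples)
print(clf.predict("sushi near downtown"))
```

Even a baseline this small separates clearly distinct intents, which is one reason a clean phrase-to-intent dataset matters more than sheer volume for this kind of task.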