by wetherbeei 2532 days ago
If you're interested in using patent data for AI, check out https://www.kaggle.com/bigquery/patents and https://www.kaggle.com/ostegm/plotting-similar-patents

Thx for posting this. Very interesting dataset: https://bigquery.cloud.google.com/table/patents-public-data:...

Do you know by any chance how the `embedding_v1` vectors were generated? The data field description says "Machine-learned vector embedding based on document contents and metadata, where two documents that have similar technical content have a high dot product score of their embedding vectors."

Could this be word2vec, GloVe, or something else like that? Maybe produced from the tf-idf-transformed sum of the word tokens in the title+abstract of each patent?
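
For anyone wanting to try the "high dot product score" property from the field description quoted above, here is a minimal sketch of ranking documents by dot product. The vectors and the `patent_a`/`patent_b`/`patent_c` IDs are made-up toy values, not real `embedding_v1` data, and `most_similar` is a hypothetical helper name.

```python
import numpy as np

# Toy 4-dimensional vectors standing in for `embedding_v1` values.
# These numbers are invented for illustration only.
embeddings = {
    "patent_a": np.array([0.9, 0.1, 0.0, 0.2]),
    "patent_b": np.array([0.8, 0.2, 0.1, 0.1]),
    "patent_c": np.array([0.0, 0.1, 0.9, 0.3]),
}


def most_similar(query_id, embeddings):
    """Rank all other documents by dot product with the query embedding."""
    query = embeddings[query_id]
    scores = {
        doc_id: float(np.dot(query, vec))
        for doc_id, vec in embeddings.items()
        if doc_id != query_id
    }
    # Highest dot product first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = most_similar("patent_a", embeddings)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

With these toy values `patent_b` ranks above `patent_c` for the query `patent_a`, since the first two vectors point in nearly the same direction.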

We (I run Google Patents) generated them using Wsabie (https://research.google.com/pubs/archive/37180.pdf), trained on the set of words in the full text to predict Cooperative Patent Classification codes. So they are summed word embeddings trained for a classification task, which works well for similarity too.
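
To make the "summed word embeddings trained for a classification task" idea concrete, here is a toy sketch. It is not the Google Patents implementation: the vocabulary, the two CPC-style labels, the dimensions, and the training data are all invented, and it uses a plain margin ranking loss over every wrong label rather than the sampled WARP loss from the Wsabie paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny vocabulary and label set, for illustration only.
vocab = {"battery": 0, "electrode": 1, "lithium": 2, "antenna": 3, "signal": 4}
labels = ["H01M", "H01Q"]
dim = 8

# Words and labels live in the same embedding space.
word_vecs = rng.normal(scale=0.01, size=(len(vocab), dim))
label_vecs = rng.normal(scale=0.01, size=(len(labels), dim))


def embed(words):
    """Document embedding = sum of the vectors of its set of known words."""
    ids = np.array(sorted({vocab[w] for w in words if w in vocab}), dtype=int)
    return word_vecs[ids].sum(axis=0), ids


def train_step(words, true_label, lr=0.1, margin=1.0):
    """One SGD step on a margin ranking loss (simplified stand-in for WARP)."""
    doc, ids = embed(words)
    scores = label_vecs @ doc
    for neg in range(len(labels)):
        if neg == true_label:
            continue
        if margin - scores[true_label] + scores[neg] > 0:
            # Gradient of the hinge term with respect to the document vector.
            grad_doc = label_vecs[neg] - label_vecs[true_label]
            label_vecs[true_label] += lr * doc
            label_vecs[neg] -= lr * doc
            # The sum passes the same gradient to every word in the document.
            word_vecs[ids] -= lr * grad_doc


train_docs = [
    (["battery", "electrode", "lithium"], 0),
    (["antenna", "signal"], 1),
]
for _ in range(200):
    for words, label in train_docs:
        train_step(words, label)

# Classification: score each label by dot product with the document vector.
predictions = [int(np.argmax(label_vecs @ embed(w)[0])) for w, _ in train_docs]

# Similarity: the same document vectors can be compared to each other.
a, _ = embed(["battery", "lithium"])
b, _ = embed(["electrode", "lithium"])
c, _ = embed(["antenna", "signal"])
sim_related = float(a @ b)
sim_unrelated = float(a @ c)
print(predictions, sim_related > sim_unrelated)
```

Nothing in training asks for document-to-document similarity directly. It falls out because words that predict the same label get pushed in the same direction, which is one way to read the "works well on similarity too" remark.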