Hacker News
by ivansavz 2540 days ago
Thx for posting this. Very interesting dataset: https://bigquery.cloud.google.com/table/patents-public-data:...

Do you know by any chance how the `embedding_v1` vectors were generated? The data field description says "Machine-learned vector embedding based on document contents and metadata, where two documents that have similar technical content have a high dot product score of their embedding vectors."
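To make the "high dot product score" wording in that description concrete, here is a minimal sketch of ranking documents by dot product. The vectors and the names `doc_a` / `doc_b` are made up for illustration; they are not values from the dataset, and the real `embedding_v1` vectors have many more dimensions.

```python
# Toy illustration: the vectors and document names below are invented,
# not taken from the patents dataset.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted by descending dot product with query."""
    scored = [(name, dot(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


query = [0.5, 0.1, -0.2]
candidates = {
    "doc_a": [0.4, 0.2, -0.1],   # points roughly the same way as query
    "doc_b": [-0.3, 0.0, 0.6],   # points roughly the opposite way
}

ranking = rank_by_similarity(query, candidates)
for name, score in ranking:
    print(name, round(score, 3))
```

Under this scoring, a higher value means the two documents are treated as more similar; no normalization is applied here, since the field description only mentions the raw dot product.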

Could this be word2vec, GloVe, or something else like that? Maybe produced from the tf-idf-transformed sum of the word tokens in the title+abstract of each patent?

1 comment

We (I run Google Patents) generated them using Wsabie (https://research.google.com/pubs/archive/37180.pdf), trained to predict Cooperative Patent Classification codes from the set of words in the full text. So they're summed word embeddings trained for a classification task, which works well for similarity too.
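A rough sketch of what "summed word embeddings trained for a classification task" can look like, under heavy assumptions: the word and label vectors below are tiny and hand-written (real ones would be learned), the CPC-style codes are only illustrative keys, and the loss shown is a plain pairwise hinge. It omits the rank weighting and negative sampling of the WARP loss described in the Wsabie paper, so it is not the actual Google Patents pipeline.

```python
# Sketch only: hand-written 2-d vectors stand in for learned embeddings.
WORD_VECS = {
    "battery": [0.9, 0.1],
    "electrode": [0.8, 0.0],
    "protein": [0.0, 0.9],
    "antibody": [0.1, 0.8],
}

# Hypothetical label embeddings keyed by CPC-style codes (illustrative only).
LABEL_VECS = {
    "H01M": [1.0, 0.0],
    "C07K": [0.0, 1.0],
}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def embed(tokens):
    """Sum the vectors of the distinct known words (a set-of-words model)."""
    dim = len(next(iter(WORD_VECS.values())))
    total = [0.0] * dim
    for tok in set(tokens):          # set(): repeated words count once
        vec = WORD_VECS.get(tok)
        if vec is not None:          # unknown words are skipped
            total = [t + x for t, x in zip(total, vec)]
    return total


def pairwise_hinge(doc_vec, pos_label, neg_label, margin=1.0):
    """Hinge loss pushing the correct label's score above a wrong label's."""
    pos = dot(doc_vec, LABEL_VECS[pos_label])
    neg = dot(doc_vec, LABEL_VECS[neg_label])
    return max(0.0, margin - pos + neg)


doc_a = embed(["battery", "electrode", "battery"])
doc_b = embed(["electrode", "battery"])
doc_c = embed(["protein", "antibody"])

# Documents that share a topic score higher against each other.
print(dot(doc_a, doc_b) > dot(doc_a, doc_c))
# Zero loss when the correct label already wins by the margin.
print(pairwise_hinge(doc_a, "H01M", "C07K"))
```

The intuition is that documents pushed toward the same classification codes end up with document vectors pointing in similar directions, which is why the dot product doubles as a similarity score.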