Do you know by any chance how the `embedding_v1` vectors were generated? The data field description says "Machine-learned vector embedding based on document contents and metadata, where two documents that have similar technical content have a high dot product score of their embedding vectors."
Could this be word2vec, GloVe, or something else like that? Maybe produced from the tf-idf-transformed sum of the word tokens in the title+abstract of each patent?
We (I run Google Patents) generated them using Wsabie (https://research.google.com/pubs/archive/37180.pdf), trained to predict Cooperative Patent Classification codes from the set of words in the full text. So they are summed word embeddings trained for a classification task, which also works well for similarity.
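To make the "summed word embeddings" idea concrete, here is a minimal sketch of how such vectors could be built and compared with a dot product. It is not the actual Google Patents pipeline: the word table, its values, and the names `WORD_VECTORS`, `embed_document`, and `dot` are made up for illustration, and the Wsabie training step that would learn the word vectors from CPC codes is left out entirely.

```python
# Toy word-embedding table. In the real system these vectors would be learned
# (per the reply above, by training against CPC codes); here they are invented.
WORD_VECTORS = {
    "battery": [0.9, 0.1, 0.0],
    "electrode": [0.8, 0.2, 0.1],
    "lithium": [0.7, 0.0, 0.2],
    "antibody": [0.0, 0.9, 0.1],
    "protein": [0.1, 0.8, 0.0],
}
DIM = 3


def embed_document(words):
    """Sum the vectors of the distinct known words (a set-of-words model)."""
    total = [0.0] * DIM
    for word in set(words):  # set(): each word counts once, however often it occurs
        vec = WORD_VECTORS.get(word)
        if vec is None:  # skip out-of-vocabulary words
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


doc_a = embed_document(["lithium", "battery", "electrode"])
doc_b = embed_document(["battery", "electrode"])
doc_c = embed_document(["antibody", "protein"])

# Documents sharing technical vocabulary should score higher than unrelated ones.
print("a.b =", dot(doc_a, doc_b))
print("a.c =", dot(doc_a, doc_c))
```

With these toy values the two battery documents get a much higher dot product than the battery/antibody pair, which is the behavior the field description promises for `embedding_v1`.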