| HN Mirror


by ep_jhu 936 days ago

One issue I always run into when implementing these approaches is the embedding model's context window being too small to represent what I need.

For example, on this project, looking at the generation of training data [1], it seems that what's actually being embedded is a single string concatenated from each book's review, title, description, etc. [2]. With max_seq_length set to 200, wouldn't a lengthy review push the description text past the cutoff so that it never gets encoded? And wouldn't that cause queries to miss books with similar descriptions whenever the reviews are topically dissimilar (e.g., discussing the author's style or the book's flow instead of the plot)?

[1] https://github.com/veekaybee/viberary/blob/main/src/model/ge...

[2] https://github.com/veekaybee/viberary/blob/main/src/model/ge...
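To make the concern concrete, here is a rough sketch of the truncation effect. It is not the project's code: whitespace splitting stands in for the model's real subword tokenizer (which would usually produce even more tokens per word), and the review and description strings are made up for illustration.

```python
def truncate_tokens(text: str, max_seq_length: int = 200) -> list[str]:
    """Keep only the first max_seq_length whitespace tokens.

    Assumption: whitespace tokens approximate what a real subword
    tokenizer would count against the sequence limit.
    """
    return text.split()[:max_seq_length]


# Hypothetical record: a 250-word review about style, then a short description.
review = " ".join(["style"] * 250)
description = "A detective hunts a killer in Victorian London"
combined = f"{review} {description}"

kept = truncate_tokens(combined, max_seq_length=200)

# The description starts at token 251, so nothing from it survives the cutoff.
description_survives = "detective" in kept
print(len(combined.split()), len(kept), description_survives)  # 258 200 False
```

If the fields were concatenated in a different order (description first), the description would survive and the tail of the review would be dropped instead, so field order matters whenever the combined text can exceed the limit.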

1 comment

alexmolas 936 days ago

I have the same problem with a project I'm working on. In my case, I chunk the documents and encode each chunk separately, then run semantic search over the chunk embeddings. It has some drawbacks, but it's the best approach I could think of.
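A minimal sketch of that chunk-then-search idea, under some loud assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the chunk sizes are arbitrary, and scoring each document by its best-matching chunk is just one possible way to aggregate chunk hits (the comment above doesn't say how results are combined).

```python
import math
from collections import Counter


def toy_embed(text: str) -> dict[str, float]:
    """Hypothetical stand-in for an embedding model: L2-normalised word counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two unit-length sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def chunk_words(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap  # requires size > overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def build_index(docs: dict[str, str], size: int = 20, overlap: int = 5):
    """Embed every chunk of every document as (doc_id, chunk, embedding)."""
    return [
        (doc_id, chunk, toy_embed(chunk))
        for doc_id, text in docs.items()
        for chunk in chunk_words(text, size, overlap)
    ]


def search(query: str, index, top_k: int = 3) -> list[tuple[str, float]]:
    """Rank documents by the score of their best-matching chunk."""
    q = toy_embed(query)
    best: dict[str, float] = {}
    for doc_id, _chunk, emb in index:
        best[doc_id] = max(cosine(q, emb), best.get(doc_id, -1.0))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Made-up documents: identical long reviews, different short descriptions.
filler = " ".join(["lyrical"] * 60)
docs = {
    "book_a": filler + " detective hunts killer in victorian london",
    "book_b": filler + " farmers build a colony on mars",
}
results = search("detective killer london", build_index(docs))
print(results[0][0])  # book_a
```

Because each chunk is short enough to be encoded in full, the description at the end of a long document still gets its own embedding. The trade-offs are a larger index and the need to choose how chunk scores roll up into a document score.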
