Hacker News
by naveedjanmo 1178 days ago
Hey! I'm the developer of Unriddle - it works using text embeddings. The document is split into small chunks, and each chunk is assigned a numerical representation, or "vector", of its semantic meaning and relation to the other chunks. When a user submits a prompt, it too is assigned a vector, which is then compared against the vectors of the chunks. The most similar chunks are then fed into GPT-4 along with the query, ensuring the total number of words doesn't exceed the context window limit.
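For readers who want to see the chunk / embed / compare steps concretely, here is a minimal sketch. It is not Unriddle's code: `chunk_text`, `toy_embed`, `cosine` and `top_chunks` are hypothetical names, and `toy_embed` is a bag-of-words stand-in for a real embedding model so the example runs with the standard library only.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text):
    """ASSUMPTION: stand-in for a real embedding model (sparse word-count vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(query, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]
```

In a real setup the vectors would come from an embedding model and would usually be stored in a vector database rather than recomputed per query; the ranking idea stays the same.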
4 comments

//The similar chunks are then fed into GPT-4 along with the query

Since GPT can use things from its context arbitrarily, does this solve the hallucination issue, even for ebooks?

Awesome - I knew about vectorising/embeddings for semantic search, but I hadn't thought of using the search results as a prompt prefix - clever!
Yeah, it's the pattern all these tools are using.

Use SentenceTransformers in Python to embed the chunks and write them to the database (Pinecone), then embed queries the same way. Use the search results as context.

What OpenAI API calls allow sending these small chunks?

When you query something like "What is this research about?" is it able to use data from all chunks?

It's just the GPT-4 API - the chunks are sent as part of a prompt. In that case it won't use data from all chunks but it will try to find any chunks that provide descriptions of the document. I've found with research papers, for example, it fetches parts of the introduction and abstract.
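A rough sketch of what "the chunks are sent as part of a prompt" can look like. This is an illustration, not the actual prompt Unriddle uses: `build_prompt`, the wording of the instructions, and the use of a word count as a stand-in for the real token limit are all assumptions.

```python
def build_prompt(question, ranked_chunks, max_words=3000):
    """Pack the best-ranked chunks into one prompt without exceeding a word budget.

    ASSUMPTION: words approximate tokens here; a real system would count
    tokens with the model's tokenizer.
    """
    header = "Answer the question using only the context below.\n\nContext:\n"
    footer = f"\n\nQuestion: {question}\nAnswer:"
    budget = max_words - len(header.split()) - len(footer.split())

    selected = []
    for chunk in ranked_chunks:  # expected to be sorted best match first
        cost = len(chunk.split())
        if cost > budget:
            break  # stop at the first chunk that no longer fits
        selected.append(chunk)
        budget -= cost

    return header + "\n\n".join(selected) + footer
```

The resulting string is what would go into the user message of a chat completion request; the model never sees chunks that were not selected.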
Oh so there is pre-processing to find the useful portions? What are you using for the pre-processing?

I feel that it's inevitable that OpenAI et al. will be able to handle large PDF documents eventually. But until then I'm sure there's a lot of value in this kind of pre-processing/chunking.

Yeah I think you're right - the 32k context window for GPT-4 (not available for everyone yet) is already enough for research papers. I'm using a library called LangChain; there's also LlamaIndex.
Can the vectorization of chunks and the search for context close to the query be done with any LLM, with only the relevant chunks then sent to OpenAI?
Vectorisation is done via OpenAI's embedding API, and the chunking/querying happens through the LangChain library. But there are a few different ways of doing it - another good library is LlamaIndex.
Thanks a lot! Do you _have_ to do vectorization and querying with the same LLM? Can someone do vectorization with one and do the querying with relevant chunks with another?
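On that last question, a sketch may help. In the pattern described in this thread, the embedding step and the answering step are separate components: chunks and queries need to go through the same embedding model so their vectors are comparable, while the answering model only ever receives text. The names below (`answer`, `toy_embed`, `echo_generate`, `VOCAB`) are hypothetical stand-ins, not any library's API.

```python
from typing import Callable, List, Sequence

VOCAB = ("cat", "dog", "tax")  # ASSUMPTION: tiny fixed vocabulary for the toy embedder


def toy_embed(text: str) -> List[float]:
    """Stand-in embedding model: counts of each vocabulary word in the text."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


def echo_generate(prompt: str) -> str:
    """Stand-in for the answering LLM: returns the prompt it was given."""
    return prompt


def answer(
    question: str,
    chunks: Sequence[str],
    embed: Callable[[str], Sequence[float]],
    generate: Callable[[str], str],
    k: int = 2,
) -> str:
    """Retrieve with `embed`, answer with `generate`; the two are independent."""
    q = embed(question)  # same embedder for the query...

    def score(chunk: str) -> float:
        return sum(x * y for x, y in zip(q, embed(chunk)))  # ...and for the chunks

    best = sorted(chunks, key=score, reverse=True)[:k]
    context = "\n\n".join(best)
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```

Swapping `toy_embed` for a local embedding model and `echo_generate` for a call to a hosted LLM would not change the structure of `answer`.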