I imagine it is something similar to the following.
Preprocessing
1. Transcribe the dataset.
2. Chunk the transcription into paragraphs.
3. Embed each paragraph and store the embedding in a vector database.
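A minimal sketch of those preprocessing steps, assuming the transcript already exists as plain text with blank lines between paragraphs. `toy_embed` is a hypothetical stand-in for a real embedding model (it only hashes words into buckets), and the plain Python list stands in for the vector database.

```python
import math
import re
import zlib


def split_paragraphs(transcript):
    """Split a transcript on blank lines and normalise whitespace in each chunk."""
    parts = re.split(r"\n\s*\n", transcript)
    return [" ".join(p.split()) for p in parts if p.strip()]


def toy_embed(text, dim=64):
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(transcript):
    """One record per paragraph; a real system would write these to a vector database."""
    return [
        {"id": i, "text": paragraph, "embedding": toy_embed(paragraph)}
        for i, paragraph in enumerate(split_paragraphs(transcript))
    ]
```

Swapping `toy_embed` for a real model should only change that one function; the chunk-then-embed-then-store shape stays the same.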
Querying
1. Convert the user's query into an embedding.
2. Query the vector database for the top N closest embeddings and fetch the paragraphs that correspond to them. To be robust against queries you have no good results for, limit how far a result can be from the user's query.
3. Using those paragraphs, craft a prompt that you will give to an LLM.
4. Do any final filtering on what you got back from the LLM.
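Roughly, the retrieval and prompt steps could look like the sketch below. It assumes the index is a list of dicts with `text` and `embedding` keys and uses brute-force cosine similarity where a vector database would do the search. The function names, the `min_similarity` cutoff of 0.3, and the prompt wording are all illustrative assumptions, not anyone's actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_embedding, index, top_n=3, min_similarity=0.3):
    """Return up to top_n (score, text) pairs, dropping anything below the cutoff."""
    scored = [
        (cosine_similarity(query_embedding, record["embedding"]), record["text"])
        for record in index
    ]
    scored = [pair for pair in scored if pair[0] >= min_similarity]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def build_prompt(question, hits):
    """Build an LLM prompt from the hits, or return None if nothing was close enough."""
    if not hits:
        return None
    context = "\n\n".join(text for _, text in hits)
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Returning `None` when nothing clears the cutoff lets the caller answer "no results" without calling the LLM at all.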
I built something similar using a variety of YouTube channels focused on NLP, AI, etc. The app is here https://huggingface.co/spaces/jamescalam/ask-youtube - you can ask things like "what is a transformer model?" or "what is semantic search?"
- Use a sentence transformer to create embeddings of the text
- Index embeddings (with transcribed text, timestamps, and video URL attached) in Pinecone's vector database
- Wrap up the querying functionality in a nice UI
(this is for the search functionality)
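To illustrate what "with transcribed text, timestamps, and video URL attached" could mean in practice, here is a sketch of one indexed record. The `Snippet` class, its field names, and the `id`/`values`/`metadata` layout are assumptions for illustration; check your vector database's docs for the exact record format it expects.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    """One transcribed chunk of a video (hypothetical structure)."""
    text: str
    video_url: str
    start_seconds: float
    end_seconds: float

    def to_record(self, snippet_id, embedding):
        """Pair the embedding with metadata so a search hit can be shown to the user."""
        return {
            "id": snippet_id,
            "values": embedding,
            "metadata": {
                "text": self.text,
                "video_url": self.video_url,
                "start_seconds": self.start_seconds,
                "end_seconds": self.end_seconds,
            },
        }

    def link(self):
        """YouTube link that starts playback at the snippet via the t= parameter."""
        separator = "&" if "?" in self.video_url else "?"
        return f"{self.video_url}{separator}t={int(self.start_seconds)}"
```

Keeping the text and timestamps in the metadata means a hit can be rendered as a quote plus a deep link without a second lookup.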
If you want to replicate the Q&A part, I also built something similar and wrote about it (https://youtu.be/coaaSxys5so) - it's essentially the same process, but we pass the retrieved text snippets to GPT-3 along with the original question and it generates an answer.
I should add, Riley used the ada embedding model (rather than sentence transformers). Performance-wise they should be similar (in their ability to encode meaning accurately), but the ada model can encode a much larger chunk of text. I don't know exact numbers, but something like 1-2 pages of text in a typical corporate PDF, whereas sentence transformers are typically limited to around a paragraph of text.
Typically you'd split the text into paragraph-sized chunks to handle this limit of sentence transformers; with GPT-3 embeddings you naturally have more flexibility there.
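As a rough sketch of that splitting, the function below cuts text into overlapping windows of words. Word count is only a proxy here: real model limits are measured in tokens, and the `max_words=200` and `overlap=20` defaults are made-up numbers you would tune per model.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into windows of at most max_words words, sharing `overlap` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end, so no tiny duplicate tail is emitted.
        if start + max_words >= len(words):
            break
    return chunks
```

With a longer-context embedding model you would just raise `max_words`; the overlap is there so a sentence cut at a boundary still appears whole in one of the neighbouring chunks.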