


	by DeWilde 1320 days ago
	This is pretty amazing. Is this approach documented or explained anywhere? I have some ideas of my own that I would love to implement similarly to this and it would help to know how to get started.

2 comments

charcircuit 1320 days ago

I imagine it is something similar to the following.

Preprocessing

1. Transcribe the dataset

2. Chunk the transcription into paragraphs.

3. Store the embedding of each paragraph into a vector database.

Querying

1. Convert the user's query into an embedding

2. Query the vector database for the top N closest embeddings and fetch the paragraphs that correspond to them. To be robust against queries you have no good results for, limit how far a result can be from the user's query.

3. Using those paragraphs, craft a prompt that you will give to an LLM.

4. Do any final filtering on what you got back from the LLM.
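
A minimal sketch of the querying steps above, in plain Python so it runs without any services. Everything named here is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, the list returned by `build_index` stands in for a vector database, and `min_similarity` is a hypothetical cutoff you would tune on your own data. The LLM call itself is left out; `build_prompt` only produces the text you would send.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(paragraphs):
    """Preprocessing: store each paragraph next to its embedding."""
    return [(p, embed(p)) for p in paragraphs]


def search(index, query, top_n=3, min_similarity=0.2):
    """Return up to top_n (score, paragraph) pairs, dropping weak matches."""
    q = embed(query)
    scored = [(cosine(q, vec), p) for p, vec in index]
    scored = [s for s in scored if s[0] >= min_similarity]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_n]


def build_prompt(query, hits):
    """Craft the LLM prompt, or return None when nothing relevant was found."""
    if not hits:
        return None
    context = "\n\n".join(p for _, p in hits)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Returning `None` when no paragraph clears the cutoff is one way to handle queries the dataset can't answer: the caller can show "no results" instead of letting the LLM guess.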


jamesbriggs 1320 days ago

I built something similar using a variety of YouTube channels focused on NLP, AI, etc. The app is here https://huggingface.co/spaces/jamescalam/ask-youtube - you can ask things like "what is a transformer model?" or "what is semantic search?"

The way I built it is documented here: https://www.pinecone.io/learn/openai-whisper/

Afaik it's the same approach as Riley, that is:

- Scrape audio of YouTube videos

- Transcribe to text with OpenAI's Whisper

- Use a sentence transformer to create embeddings of the text

- Index embeddings (with transcribed text, timestamps, and video URL attached) in Pinecone's vector database

- Wrap up the querying functionality in a nice UI

(this is for the search functionality)
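
To make the "transcribed text, timestamps, and video URL attached" part concrete, here is a rough sketch of turning transcript segments into records ready for indexing. It assumes each segment is a dict with `start`, `end`, and `text` keys (similar in shape to what Whisper reports per segment); `Record`, `segments_to_records`, and the `window` size are hypothetical names and values, not taken from the linked write-up. The embedding and the upload to the vector database are left out.

```python
from dataclasses import dataclass


@dataclass
class Record:
    id: str
    text: str
    start: float  # seconds into the video
    end: float
    url: str  # link that jumps to the start of this chunk


def timestamped_url(video_url, start):
    """Append a t=<seconds> parameter so the link opens at the right moment."""
    sep = "&" if "?" in video_url else "?"
    return f"{video_url}{sep}t={int(start)}"


def segments_to_records(video_id, video_url, segments, window=3):
    """Merge every `window` consecutive segments into one record with metadata."""
    records = []
    for i in range(0, len(segments), window):
        group = segments[i:i + window]
        start = group[0]["start"]
        records.append(
            Record(
                id=f"{video_id}-{i}",
                text=" ".join(s["text"].strip() for s in group),
                start=start,
                end=group[-1]["end"],
                url=timestamped_url(video_url, start),
            )
        )
    return records
```

Keeping the start time and URL on each record is what lets the UI link a search hit straight to the right moment in the video.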

If you want to replicate the Q&A part, I also built something similar and wrote about it (https://youtu.be/coaaSxys5so). It's essentially the same process, but we pass the retrieved text snippets to GPT-3 along with the original question and it generates an answer.


jamesbriggs 1320 days ago

I should add, Riley used the ada embedding model (rather than sentence transformers). Performance-wise they should be similar (in their ability to encode meaning accurately), but the ada model can encode a much larger chunk of text. I don't know the exact numbers, but it's something like 1-2 pages of text in a typical corporate PDF, whereas sentence transformers are typically limited to around a paragraph of text.

Typically you'd split the text into paragraph-sized chunks to handle this limit of sentence transformers; with GPT-3 embeddings you naturally have more flexibility there.
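
A rough sketch of that splitting step, assuming paragraphs are separated by blank lines and using a word count as a crude proxy for the model's real token limit. `chunk_text` and the default `max_words` value are made up for illustration; the right limit depends on the embedding model you use.

```python
import re


def chunk_text(text, max_words=120):
    """Split text into paragraphs, then cut any long paragraph into word windows."""
    chunks = []
    for para in re.split(r"\n\s*\n", text):
        words = para.split()
        # Empty paragraphs produce no chunks because the range below is empty.
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks
```

With a model that accepts longer inputs you would just raise `max_words`, or skip the inner split and embed whole paragraphs.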


DeWilde 1320 days ago

Thank you :)
