Hacker News
How to feed content to ChatGPT and make it answer questions about it?
17 points by alexander-g 1147 days ago
I recently came across several startups that appear to feed content to ChatGPT and subsequently ask questions about that content.

Examples: 1. https://flowgpt.ai 2. https://www.chatbase.co

I'm curious about how this process works. Do these startups utilize the OpenAI API to train a model?

I attempted to train a model using the OpenAI API myself, but it seems like these startups have a different approach. I was unable to achieve the desired results simply by uploading content, even after spending hours creating JSONL files.

Chatbase, for instance, has a video demonstrating a user uploading a file and then asking questions: https://youtu.be/W4lzGger7_0

6 comments

Pretty easy, take a look at Langchain tutorials on YouTube. Basically you take a set of documents, split them into smaller chunks, create embeddings for those chunks (OpenAI, Jina, etc.) and store them in a vector database. Then, when you call OpenAI's GPT-3 or GPT-4, you retrieve the chunks most relevant to the question and pass them along, so the answer is based on the document set (or very close to it). It takes some practice, but after a few repetitions you could code this from scratch within 5-10 minutes. This channel on YouTube taught me in less than 2 days:

- https://www.youtube.com/@DataIndependent
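For anyone who wants to see the idea without installing anything, here is a rough sketch of the "embed, store, search" loop. It is not Langchain or OpenAI code: `toy_embed` is a made-up bag-of-words stand-in for a real embedding model, and `InMemoryVectorStore` is a hypothetical stand-in for a real vector database.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []  # (embedding, chunk text)

    def add(self, chunks: list[str]) -> None:
        for chunk in chunks:
            self.items.append((toy_embed(chunk), chunk))

    def search(self, query: str, k: int = 2) -> list[str]:
        """Return the k chunks most similar to the query."""
        q = toy_embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


if __name__ == "__main__":
    store = InMemoryVectorStore()
    store.add([
        "Cats sleep most of the day.",
        "The invoice is due in 30 days.",
        "Python is a programming language.",
    ])
    print(store.search("When is the invoice due?", k=1))
```

In a real setup the retrieved chunks would then be pasted into the prompt sent to the chat model; swapping `toy_embed` for a real embedding model is what makes the search semantic rather than keyword-based.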

Thank you! It was an eye opener for me. We've been using a slightly different approach (at https://jopilot.net), but a vector database + Langchain lets us process a much larger amount of data.
No problem! You could probably improve it by fine-tuning GPT models on different categories of documents before doing the vector retrieval from embeddings. Fine-tuning isn't available for GPT-3.5-turbo or GPT-4 yet, so I am waiting to try out this hybrid approach when it becomes available.
1. Take text of source material. Example: extract_pdf_text(pdf_path_string)

2. Segment text every 3000 characters.

3. Generate embeddings for every text segment and save the embeddings to PineconeDB. Make sure you also save the raw text segment as additional meta information so you can use it later. https://platform.openai.com/docs/guides/embeddings/what-are-...

4. Capture user's question and generate an embedding for it using the same OpenAI API.

5. Query your PineconeDB with the question's embedding and you will get matches back.

6. Use these matches as context to hit the OpenAI chatgpt API endpoint. Example:

    Using this context answer this question: **QUESTION**

    **CONTEXT**
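
Steps 2, 3 and 6 above can be sketched in plain Python without calling any service. This is only an illustration under assumptions: `segment_text`, `make_records` and `build_prompt` are hypothetical helper names, the record layout (an id plus a metadata dict holding the raw text) is made up for the example, and the embedding and database calls from steps 3-5 are left out entirely.

```python
def segment_text(text: str, size: int = 3000) -> list[str]:
    """Step 2: cut the text into consecutive segments of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def make_records(text: str, source: str) -> list[dict]:
    """Step 3 (storage half): one record per segment, raw text kept as metadata.

    The embedding vector would be added to each record before upserting;
    it is omitted here because it requires an embedding model.
    """
    return [
        {"id": f"{source}-{n}", "metadata": {"text": segment}}
        for n, segment in enumerate(segment_text(text))
    ]


def build_prompt(question: str, matches: list[dict]) -> str:
    """Step 6: join the matched segments and put them under the question."""
    context = "\n\n".join(m["metadata"]["text"] for m in matches)
    return f"Using this context answer this question: {question}\n\n{context}"


if __name__ == "__main__":
    records = make_records("lorem ipsum " * 600, source="manual")
    print(len(records), "records")
    print(build_prompt("What does the manual say?", records[:1])[:80])
```

Cutting at a fixed character count can split a sentence in half; splitting on paragraph boundaries or overlapping the segments slightly are common tweaks if answers seem to miss context.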
Thank you for the detailed explanation!
See here https://github.com/marqo-ai/marqo/blob/mainline/examples/GPT... and https://github.com/marqo-ai/marqo/blob/mainline/examples/Spe.... Multiple examples for answering questions from documents/manuals, text, or transcripts.
Langchain --> query_with_sources function (semantic search context)

If you fine-tune, you'll mostly pick up the style and sometimes get worse factual performance. Sourcing the facts through semantic search context (RAG) is a better approach.

Source: Creator of AnyQuestions.ai which works like that plus some extra neat tricks. (released the first demo in August 2022 before the onslaught)

When are you actually billed? First, when you create embeddings at the beginning. Second, every time you call the OpenAI chat complete API?
Yes, that's correct. Then OpenAI bills you around the end of the month so you have some leeway of credit.
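
To make the two billing points concrete, here is a small back-of-the-envelope sketch. It counts tokens rather than money, since usage is metered per token and rates change. The function name and all the numbers in the usage line are hypothetical, and it assumes each question is also embedded through the paid API, as in the step-by-step comment above.

```python
def estimate_tokens_billed(
    doc_tokens: int,
    questions: int,
    question_tokens: int,
    context_tokens: int,
    answer_tokens: int,
) -> dict:
    """Rough token counts for the two billing points in a retrieval setup."""
    # One-time: embedding the documents. Recurring: embedding each question.
    embedding = doc_tokens + questions * question_tokens
    # Recurring: each chat call pays for question + retrieved context + answer.
    chat = questions * (question_tokens + context_tokens + answer_tokens)
    return {"embedding_tokens": embedding, "chat_tokens": chat}


if __name__ == "__main__":
    # Hypothetical workload: a 100k-token document set and 10 questions.
    print(estimate_tokens_billed(100_000, 10, 20, 1_500, 200))
```

Multiplying each total by the current per-token rate for the model in question gives a cost estimate.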

You can also use local embeddings, such as sentence transformer models, to save on embedding costs. That also leaves you with embeddings you control, so you can move to another LLM provider and keep embedding new queries with the same local model.
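
A rough sketch of what that can look like, assuming the `sentence-transformers` package is installed. The model name `all-MiniLM-L6-v2` is just one commonly used choice, and `sentence_transformer_embedder` and `top_match` are hypothetical helper names. The import is done lazily so the ranking code can be tried with any embedding function, local or hosted.

```python
import math
from typing import Callable, Sequence

# An embedder maps a batch of texts to one vector per text.
Embedder = Callable[[Sequence[str]], list]


def sentence_transformer_embedder(model_name: str = "all-MiniLM-L6-v2") -> Embedder:
    """Build an embedder backed by a local sentence-transformers model."""
    from sentence_transformers import SentenceTransformer  # lazy: optional dependency

    model = SentenceTransformer(model_name)

    def embed(texts: Sequence[str]) -> list:
        # encode() returns a NumPy array; convert to plain lists of floats.
        return model.encode(list(texts)).tolist()

    return embed


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_match(question: str, chunks: Sequence[str], embed: Embedder) -> str:
    """Return the chunk whose embedding is closest to the question's."""
    vectors = embed(chunks)
    q = embed([question])[0]
    best = max(range(len(chunks)), key=lambda i: cosine(q, vectors[i]))
    return chunks[best]
```

Usage would be something like `top_match("When is the invoice due?", chunks, sentence_transformer_embedder())`. Because the chat model only ever sees the retrieved text, the embedding model and the chat provider can be swapped independently, as long as documents and questions are embedded with the same model.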

This YouTube series gives a basic intro to some techniques with code. He also runs through a great flow chart that shows how it all works:

https://www.youtube.com/watch?v=ih9PBGVVOO4

I've only tried one json file I saw posted by someone, and it didn't work either. Well, it sorta worked, but it very quickly forgot that I had given it anything already.

I strongly suspect that the better results come through the APIs and not the web interface.

It's an API, yes. But how does it work? How do you feed it content and make it answer questions?