Hacker News
Show HN: ChatGPT and Document Parser = Ghost (ghostextension.com)
57 points by Ostatnigrosh 1205 days ago
I've always wanted to just upload a whole book to ChatGPT and ask questions. Obviously with the char limit that's impossible... So some buddies and I built Ghost. We have it limited to 5 pages for uploads for now, but plan on expanding the limit soon. Let me know what you guys think!
21 comments

This is not meant to be a critique, just an open question to everyone trying it - does anyone find this to be more useful than just ctrl+f?

For compiling information or getting an immediate yes/no it's likely correct - but I found ctrl+f generally gets me there faster, albeit with slightly more reading.

At least in the context of this lease agreement which does have everything well organized and uses carefully chosen keywords already.

It picks up some context questions that aren't there.

Consider the example question of "I won't be able to pay until the 9th of this month, will I get a fee?" - are you going to search for "fee"? There are 66 occurrences.

Modify the question to "If I pay on the 4th of the month, will there be any late fee?" and you get the correct answer too.

For the question "What restrictions are there on parties?" it appears to get that correctly answered. If you search for "party" you'll get 19 results that appear to be legal entity parties rather than the possibly noisy type.

I asked if I could use the rental as a foreign embassy location. It gave the reasonable answer, quoting the agreement that you could only use it as a private residence and you couldn't use it for other purposes.
5 pages fits in the context window. How exactly do you plan on expanding the limit? Without explanation we have to assume you haven't completely solved your core technical challenges.

In my testing, the biggest challenges with using, for example, OpenAI embeddings with cosine similarity are A) figuring out the section breaks or the right chunk size so that information stays in context, and B) retrieving enough chunks to get the correct hit for a query without pulling in so much extraneous information that it confuses the model.

I think that it's hard to make a parser that most optimally slices up arbitrary documents.

Since you have some larger documents preloaded I assume for those you have the embeddings search. But for user uploads you are skipping that now and just feeding all of the text extracted from the PDF into the prompt along with the query.

This explains why "What if I move out early?" for the sample document doesn't mention any of the information in the lease break section, which is definitely the most important section for moving out early. Whatever space they're projecting the question into doesn't capture that "lease break" and "moving out early" are synonyms.
It may only be retrieving the top N results with most similar embedding. If that answer is in the 3rd most similar chunk and it only fed 2 along with the query in the prompt, then GPT never got the information relevant to the question.
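To make that failure mode concrete, here is a minimal sketch of top-N retrieval by cosine similarity. The 2-d vectors are made-up toy numbers, not real embeddings, and `cosine` and `top_n` are hypothetical helper names; nothing here is known to be how Ghost works.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_n(query_vec, chunk_vecs, n):
    """Return the indices of the n chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:n]

# Toy "embeddings", invented for illustration only.
chunks = {
    "rent is due on the 1st": [1.0, 0.1],
    "pets are not allowed": [0.9, 0.3],
    "lease break: fee for leaving early": [0.6, 0.8],
}
query = [1.0, 0.0]  # stand-in for the embedding of "What if I move out early?"

names = list(chunks)
vecs = list(chunks.values())
picked = [names[i] for i in top_n(query, vecs, 2)]
print(picked)  # the lease-break chunk ranks third, so n=2 never shows it to the model
```

With n=3 the relevant chunk makes it into the prompt, which is the trade-off described above: a larger N recovers more hits but also feeds in more unrelated text.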
From the website, it seems as though they are retrieving five chunks. Also looks like they split documents by paragraph sections, unless the paragraphs are small enough- then they put a couple of them together.
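A rough sketch of that kind of splitting, assuming paragraphs are separated by blank lines. The function name `chunk_paragraphs` and the `min_chars` threshold are my own assumptions, not anything taken from the site.

```python
def chunk_paragraphs(text, min_chars=200):
    """Split on blank lines, then merge consecutive short paragraphs.

    min_chars is an assumed threshold; a good value depends on the documents.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buffer = [], ""
    for para in paragraphs:
        # Keep appending paragraphs until the buffer is long enough to stand alone.
        buffer = f"{buffer}\n\n{para}" if buffer else para
        if len(buffer) >= min_chars:
            chunks.append(buffer)
            buffer = ""
    if buffer:  # keep short trailing text rather than dropping it
        chunks.append(buffer)
    return chunks
```

This only handles the "paragraphs too small" case. Very long paragraphs would still need to be split further, and text extracted from a PDF often doesn't have clean blank-line breaks to begin with.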
Same, the devil is in the details. You basically need a good semantic search in front of GPT to feed it the best context given the question.
Any code or pseudo-code you could share that does something like that?
I didn't do any extensive testing but it seems to be really useful. However, where can we see the privacy policy? People are probably going to upload some important and confidential documents, so it's good to know how this data is being handled. The only thing I see is an asterisk after the 24 hour notice. Also, the bot answer window may have 5 pages even if 1 page is enough for the answer; this may confuse your users, since they may think there's something else on the other pages.
Hey! This is a great point. So we delete the documents within 24 hours of upload, and have a limitation to 5 pages to cut our own costs as this is just a concept.
Could you please add a more detailed policy to your site? For example, who can see, use or access the uploaded documents in any other way and whether the documents are used to gather some data, analyze or sell it?
I will make sure to add more details. In terms of privacy, those files are visible to no-one but yourself, and we delete everything.
Projects like these (using embeddings) are great, but what I'm looking for is something that can ingest an entire book (let's say a fiction book) then answer questions about the entire content (and not just by effectively doing a text search over your input, but actually "understanding" the entire contents of the book); I presume such a thing is not possible with ChatGPT (without fine-tuning), correct?
What do you think about the responses generated by this:

https://www.konjer.xyz/the-alchemist (disclaimer: built by me)

What specifically is missing from the answers in your opinion?

That's pretty interesting but ideally, I'd be able to upload my own book (txt, pdf, epub) and interact with it. It's lacking implementation details so not sure if you use embeddings, fine tuning or a novel approach.
Could using GPT3 (davinci-003) to generate embeddings, then searching your vector database for relevant excerpts, then providing the results as context for the prompt lead to something close enough?
No. That works for documentation where you do text search and extract paragraphs around the results for "context".

I want it to understand a complete fiction book and tell me about how a character grows throughout their journey from chapter 1 to chapter 12 over 350 pages.

Depending on the book you could use that to extract all excerpts where the character appears.

Then each excerpt could be fed to the LLM asking it how this part relates to the question you’d want an answer to.

Then ask for each what it shows about the character and its personality, weaknesses, etc.

And finally recursively summarise them, asking for the summary to show how the character has grown through the summarised content.

Basically ending up with a map-reduce.
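Roughly, in code. This is a sketch only: `llm` is any function that takes a prompt string and returns text (a stand-in for a real API call), and the prompt wording and `group_size` are assumptions of mine.

```python
from typing import Callable, List

def map_reduce(excerpts: List[str], question: str,
               llm: Callable[[str], str], group_size: int = 3) -> str:
    """Map: ask the question of each excerpt. Reduce: summarise notes in groups until one is left."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 or the reduce step never shrinks")
    if not excerpts:
        return ""
    # Map step: one call per excerpt.
    notes = [
        llm(f"Excerpt:\n{e}\n\nWhat does this show about: {question}")
        for e in excerpts
    ]
    # Reduce step: summarise neighbouring notes together, preserving order.
    while len(notes) > 1:
        groups = [notes[i:i + group_size] for i in range(0, len(notes), group_size)]
        notes = [
            llm("Summarise, keeping the order of events:\n" + "\n".join(g))
            for g in groups
        ]
    return notes[0]
```

Each reduce pass keeps the chunks in book order, which matters if the question is about how a character changes over time. The number of calls grows with the number of excerpts, so cost adds up quickly on a long book.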

Bigger sources, or lots of content related to the character, would lead to less accuracy, and increase the likelihood of hitting the window’s limit.

It would also be highly specific and quite brittle, although one could probably turn it into a more generic process / pipeline (i.e. what dust.tt enables).

I might have completely missed your point or overlooked some glaring flaw though, in which case please do let me know what you think.

Right now that’s not a use case supported out of the box by ChatGPT.

It also seems to be one of the most important limitations of ChatGPT, and a lot of people/teams are looking for solutions.

I work in consulting and this is literally the use case that every single client wants right now - the ability to ingest a corpus of documents into ChatGPT or similar and then have it generate responses based on natural language questions. Right now most people are faking it by running the search using some other tool like Solr/ES and then taking the snippets that are returned and assembling them into a prompt that gets passed to ChatGPT.
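The "assemble snippets into a prompt" half of that looks something like the sketch below. It assumes the snippets already come back ranked best-first from whatever search tool is in front, and it budgets by characters only to keep the example simple (a real version would count tokens). `build_prompt` and the prompt wording are hypothetical.

```python
def build_prompt(question, snippets, max_chars=2000):
    """Pack ranked snippets into a prompt until the character budget is used up."""
    header = "Answer using only the context below. Say so if the answer is not there.\n\n"
    footer = f"\nQuestion: {question}\nAnswer:"
    budget = max_chars - len(header) - len(footer)
    parts = []
    for snippet in snippets:  # assumed to arrive best-first from the search tool
        block = f"Context:\n{snippet}\n"
        if len(block) > budget:
            break  # stop at the first snippet that does not fit
        parts.append(block)
        budget -= len(block)
    return header + "".join(parts) + footer
```

Stopping at the first snippet that doesn't fit, instead of skipping it and trying smaller ones, keeps the context in ranked order. Whether that is the better choice probably depends on how good the ranking is.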
Thank you, that’s very insightful.

Which option seems to you to be the best alternative? And where do you see the future of this?

Wow, thanks so much everyone for checking out Ghost! We are currently crashing because of all the traffic. Should be up and running in 30 minutes :)
Here's a free notebook for map reduce summarization I created: https://www.wrotescan.com

It's byok. Keys are not persisted. You can choose chat-gpt-turbo or text-davinci.

Limit is 2.4M tokens per call, working to get higher too.

(This critique is unrelated to this project. It works as expected, OP, and looks good.)

How could one ever trust the output of ChatGPT?

This feels to me a bit like non-L5 autonomous driving: If I have to assist at all, it'd be easier to do it myself. In the same vein, for this project (and ChatGPT generally): Can I actually trust that the output from ChatGPT in answering my question about the document is factually correct?

e.g., if I hand it a home rental agreement and ask "What is the late move out penalty if I am 10 minutes late in dropping off the keys?", it may give the correct answer. Or it may generate a plausible-sounding answer using the words in the document that is completely (or perhaps even just slightly) incorrect.

How could I possibly know without reading it myself?

If it's using a search then it is possible to identify the paragraphs with a number in the database along with the embeddings. Then once the similar chunks are retrieved, part of the prompt could be to return the paragraph, line numbers or exact quote(s) used to answer the question.
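A sketch of what I mean, with hypothetical helper names: number the retrieved paragraphs in the prompt, ask the model to answer with a paragraph id plus an exact quote, then check mechanically that the quote really is in that paragraph.

```python
def number_paragraphs(paragraphs):
    """Prefix each paragraph with an id like [P3] so the model can cite it."""
    return "\n".join(f"[P{i}] {p}" for i, p in enumerate(paragraphs, start=1))

def quote_is_supported(quote, paragraph_id, paragraphs):
    """Check that the quoted text appears verbatim in the cited paragraph."""
    if not quote.strip():
        return False
    if not 1 <= paragraph_id <= len(paragraphs):
        return False

    def normalise(s):
        # Ignore case and whitespace differences, nothing else.
        return " ".join(s.lower().split())

    return normalise(quote) in normalise(paragraphs[paragraph_id - 1])
```

This doesn't prove the answer is right, only that the cited text exists. It does narrow the reading down to the one or two paragraphs the answer claims to rest on.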

Yours is not a good example though because "10 minutes late" is never going to be in a document like that.

Ghost is a well known blogging platform so you might want to change the name.

This seems similar to ChatPDF.com (with a 200 page limit though, instead of the 5 page limit that you have, it seems) which I suppose we'll see a lot more competitors for as the ChatGPT API expands.

There is also Ghostscript which is a postscript/pdf library, and since the site is operating on PDF content, my initial thought was that they were somehow related.
It would be great to have a little summary of the document you hit against the API like this extension: https://chrome.google.com/webstore/detail/chatgpt-suite-summ... (simply grab the prompts ;))
This is exactly the application I was thinking of when I first used ChatGPT. Using AI to summarize complex legal documents, and be able to ask questions about the document.

Have you thought of even larger knowledge-bases? like entire legal systems etc...

Anyway, amazingly executed, nice work!

A better way to do this might be to use the embedding API. That allows you to upload a text corpus and to then get vectors. You can then calculate the cosine similarity for a search string on those to get relevant results of clustered text from the uploaded corpus.
I don't get why people bother with chat interface and textual prompts. The whole concept of "prompt engineering" sounds to me like a practical joke that got out of hand.

It's like, imagine there's a complex machine with large panels full of buttons and levers - and then, someone covered the panels with tapestry. Beautiful tapestry, showing artistic interpretations of things mundane and holy, trivialities of everyday life next to impossible dreams. And then, people were told the machine is to be operated by touching that tapestry, and that the artworks are the guide to understanding it and using it effectively. And then a whole religion formed around studying patterns in the tapestry. To me, prompt engineering is that religion.

There's an actual interface to the machine hidden under all the clever wordplay. A precise, formalized one. An interface that eats tokens and spits out probabilities. I just don't get why most talk - even seemingly specialist talk - about LLMs is ignoring it entirely, and focuses on the tapestry that's just obscuring the nature of the model, effectively making everything more difficult.

>The whole concept of "prompt engineering" sounds to me like a practical joke that got out of hand.

I was on a call this morning and heard someone refer to two of their team members as "Prompt Engineers" as if that were an actual role.

My impression is that the industry in aggregate is actually trying to make it into an actual role.

Which would make sense if we were talking about humanity discovering magic is real and trying to reverse engineer it based on ancient spell books[0] - but we're not. We're talking about deep learning models made by other people, using publicly available knowledge and techniques, and often with source code and training set being publicly available too. Prompt engineering feels like people purposefully trying to treat technology as magic.

--

[0] - Or any of the scenarios equivalent to it under Clarke's third law, such as finding a crashed alien starship with a working black-box AI in it, built on a computing substrate we can't even identify, much less prod with a signal generator.

Because everyone can use text interface without knowing how to configure the low level one.
This makes sense at the UI layer, if you're making a chatbot or an NPC for a game. But if you're at the point of prompt engineering, it makes no sense to stick to the natural language interface. It's like another iteration of the idea of "programming via conversations in natural language instead of writing code" - it sounds like it makes sense, until you realize that programming languages and the mathematics underpinning them were developed specifically because natural language is nowhere near precise enough for the job.

Or, put another way, using text/conversation as user interface in a model is turning a normal engineering problem into a much harder reverse engineering problem. Why would anyone want to make life difficult for themselves this way, and ultimately turning engineering into voodoo?

Would you mind explaining this and maybe dumbing it down? Sounds useful
You can use models (OpenAI have some, there are other open-source self-hostable ones that are better if I recall correctly) that will take a sentence or a paragraph and spit out a vector. These vectors are called 'embeddings'

You then put those vectors in a vector database (e.g. pinecone, pgvector, chroma).

To run searches, you generate an embedding of the search term (could be the raw user search, could be something a model like ChatGPT was asked to transform the user's search into), then query the vector database for the n closest vectors. The trick is getting a model that generates good vectors for search (and transforming the user's query into some text that'd be useful vector(s) to search against). If feeding that into an LLM context, the next step is making sure that you get your prompt right, and don't overload the model with unrelated information (i.e. bad search results).

The key is that the vector representation embeds language concepts in how close vectors are to one another. An easy way to gain a feel for this is to look at single-word embeddings. Computerphile have a great episode on it[1]. You can take a vector for 'King', subtract the vector for 'Man' and add the vector for 'Woman' and the closest vector in that search will likely be 'Queen'. Scale up this idea to whole paragraphs (and larger vectors as a result).
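Here's the King/Man/Woman arithmetic as a toy sketch. The vectors are hand-made 2-d numbers (think "royalty" and "gender" axes) chosen so the sums work out exactly; real embeddings are learned, have hundreds of dimensions, and only land near 'Queen' approximately. `analogy` is a hypothetical helper.

```python
# Hand-made 2-d vectors: (royalty, gender). Invented for illustration only.
vocab = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def analogy(a, b, c, vocab):
    """Return the word closest to vec(a) - vec(b) + vec(c), skipping the inputs."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]

    def sq_dist(word):
        # Squared Euclidean distance from the target point.
        return sum((t - v) ** 2 for t, v in zip(target, vocab[word]))

    candidates = [w for w in vocab if w not in (a, b, c)]
    return min(candidates, key=sq_dist)

result = analogy("king", "man", "woman", vocab)
print(result)  # queen
```

Skipping the input words is the usual convention for this trick, since otherwise the nearest vector is often just one of the words you started with.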

LangChain has an example of searching a database of facts[2] (although I find their documentation pretty inaccessible - they explain their library, but don't step back from inside the weeds of what they're doing to really explain why / what's going on). Many of the features LangChain implements are distilling (or sometimes simply lifting and providing a toolkit to directly apply) LLM papers.

1: Computerphile Word Embeddings https://www.youtube.com/watch?v=gQddtTdmG_8

2: https://langchain.readthedocs.io/en/latest/use_cases/questio...

+1 to this. Maybe even some basic code to share on how to use embeddings to query ChatGPT with bigger data sets. Like thousands of phone call transcriptions, hundreds of documents or millions of user reviews? Thank you!
Just so you're aware of its existence: https://www.wordtune.com/read
What is the basic mechanic that is going on here? Searching the document then using it with one shot or multi-shot prompting?
I wanted to try it, but the PDF I tried just got an error with no info so ¯\_(ツ)_/¯
I wish I could try it, it would help me at this very moment.
Love this idea -- also curious on privacy policy
Got a 502 error, could not see the product
Self host possibility ?
Very nice. Very useful.
error uploading 5 page upload
Doesn't work