| I’ve been hacking on a project called Bookshelf (https://www.bookshelf.diy/). It lets you take an archive (say, your Substack export, a bunch of PDFs, or even saved HTML files) and turn it into a retrieval-backed GPT that your readers can query. The idea is that instead of scrolling through archives, they just ask questions. Answers are pulled only from your original content, with citations. It’s aimed at writers and researchers who want their work to be more discoverable without spinning up vector infra or fiddling with RAG pipelines.
For context: I’ve always gone back to Paul Graham’s essays for startup advice, but there’s no good way to search them semantically or contextually. So I tried indexing a few with Bookshelf and asked:
“How does PG think about evaluating founders?”
and got a clean answer sourced from “Do Things That Don’t Scale” and a couple of other essays, citations included. It was surprisingly useful.
One early test case so far is AnthropoceneGPT (https://sammatey.substack.com/p/introducing-anthropocenegpt) for Sam Matey’s newsletter. It’s seen 100+ queries. Readers say it works like a smart librarian, and Sam says it gives him ideas for what to write next.
Rough implementation:
Input: HTML/PDF exports
Chunked and embedded via OpenAI (or a local model)
Stored in a vector DB
Retrieval API is called by the custom GPT
GPT is instructed to only use retrieved chunks and cite them
Optional auth: tracks queries to give writers some telemetry
Here’s a demo GPT built on Paul Graham’s archive:
Paul Graham GPT (https://tinyurl.com/paul-graham-gpt)
Would love thoughts on:
What would make this better for writers or readers?
Any UX nits on the GPT side?
Has anyone tried doing something similar in-house? |
One UX thought on the GPT side: how prominent are the citations? And is it easy for readers to click through to the original content directly from the GPT's response? Making that flow seamless would be a huge win for verifying information and deeper engagement.
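To make the click-through idea concrete, here's a rough sketch of how the retrieval step from the "rough implementation" list could carry a title and URL along with every chunk, so the GPT always has a link to cite. This is not how Bookshelf actually does it. The `Chunk` fields, the function names, and the example.com URLs are all made up for illustration, and the bag-of-words `embed` is a toy stand-in for the OpenAI (or local) embeddings and vector DB mentioned in the post.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str  # hypothetical field: title of the source essay or post
    url: str    # hypothetical field: link back to the original


def chunk_document(text, title, url, max_words=60):
    """Split a document into fixed-size word windows that remember their source."""
    words = text.split()
    return [
        Chunk(" ".join(words[i:i + max_words]), title, url)
        for i in range(0, len(words), max_words)
    ]


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (linear scan, no vector DB)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c.text)), reverse=True)[:k]


def format_context(hits):
    """Render retrieved chunks as numbered sources the GPT can cite and link."""
    return "\n".join(
        f"[{i}] {c.title} ({c.url})\n{c.text}" for i, c in enumerate(hits, 1)
    )


if __name__ == "__main__":
    # Placeholder documents, not real essay text.
    chunks = chunk_document(
        "Founders should recruit users manually and do things that don't scale.",
        "Essay A", "https://example.com/a",
    ) + chunk_document(
        "Good hackers care about tools and interesting problems.",
        "Essay B", "https://example.com/b",
    )
    print(format_context(retrieve("how should founders recruit users", chunks, k=1)))
```

If the retrieval API returns something shaped like the `format_context` output, the GPT's instructions can tell it to cite by number and include the URL next to each claim, which gives readers a direct path back to the original.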