by marcotm 1181 days ago
I wanted to share a little side project of mine that I created while tinkering around with GPT-3.

The project uses the Algolia HN Search API [1] to retrieve the "Who is hiring?" posts from HN and then parses them with the help of GPT-3 / GPT-3.5 (I do not have API access to GPT-4 yet, but it already works quite well even with the older models). It then puts the job postings into a structured list that is hopefully easier to skim than the original postings. There are some additional features like sorting jobs by semantic similarity (based on the text embeddings from OpenAI). Filtering, sorting and saving favorites are implemented client-side, so your data and preferences remain local to your browser.
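
For anyone wondering what sorting by semantic similarity can look like, here is a minimal sketch, assuming each job posting already has an embedding vector. The function names and the toy 3-dimensional vectors are made up for illustration, and cosine similarity is just one common choice of metric, not necessarily what the site does.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as "not similar to anything"
    return dot / (norm_a * norm_b)


def sort_by_similarity(reference, jobs):
    """Order (title, embedding) pairs from most to least similar to `reference`."""
    return sorted(
        jobs,
        key=lambda job: cosine_similarity(reference, job[1]),
        reverse=True,
    )


# Toy embeddings (hypothetical; real ones have far more dimensions).
jobs = [
    ("Backend engineer", [0.9, 0.1, 0.0]),
    ("Designer", [0.0, 0.2, 0.9]),
    ("Platform engineer", [0.8, 0.3, 0.1]),
]
favorite = [1.0, 0.0, 0.0]  # embedding of a posting the user liked
ranked = [title for title, _ in sort_by_similarity(favorite, jobs)]
```

Since the embeddings are plain lists of numbers, this kind of ranking can run entirely in the browser once the vectors have been shipped to the client.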

Originally, this wasn't even meant to be a public product, but if people find it useful (and HN is fine with it), I'll try to keep it running. I've also written a short article about how the parsing works behind the scenes [2]. It's quite amazing how easy many of the classic NLP tasks have become with the newer LLMs.

Happy to answer any questions about the project!

[1] https://hn.algolia.com

[2] https://marcotm.com/articles/information-extraction-with-lar...

4 comments

You can make the intermediate step a bit more structured too via https://github.com/HackerNews/API

For example, the March thread is ID 34983767 (which you can get from the Algolia search, or from a hand-maintained list that you add to each month, since there are only so many of them).

You can then get a list of all the top-level comments at https://hacker-news.firebaseio.com/v0/item/34983767.json?pri...

And then you can pull up an individual comment at https://hacker-news.firebaseio.com/v0/item/35255027.json?pri... so you don't have to parse any of its child comments or the HTML of the page.

(late edit: and re-reading the blog post while not trying to pay half attention to a meeting... that is what you are doing)
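
A rough sketch of those two steps in Python, assuming the item endpoint and the `kids`, `deleted` and `dead` fields described in the HackerNews/API README. The helper names are made up, and the `fetch` parameter is only there so the logic can be exercised without hitting the network.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://hacker-news.firebaseio.com/v0"


def item_url(item_id):
    """URL of a single item (story or comment) in the Firebase API."""
    return f"{BASE_URL}/item/{item_id}.json"


def fetch_item(item_id):
    """Fetch one item over the network and decode its JSON body."""
    with urlopen(item_url(item_id)) as response:
        return json.load(response)


def top_level_comments(story_id, fetch=fetch_item):
    """Return the direct replies to a story, skipping deleted or dead ones."""
    story = fetch(story_id) or {}
    comments = []
    for kid_id in story.get("kids", []):
        comment = fetch(kid_id)
        # Missing items come back as null; deleted/dead ones carry a flag.
        if comment and not comment.get("deleted") and not comment.get("dead"):
            comments.append(comment)
    return comments
```

Calling `top_level_comments(34983767)` would make one request per comment, so for a large thread you would probably want to add some caching or rate limiting on top.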

Thanks for mentioning the Firebase-based API. I knew it existed, but somehow I went with the Algolia API by default. I use their HN search quite a bit, so that's probably why I stuck with them. (no affiliation)
This is really nice. I have one nitpicky comment on the blog: the font is jarring for me to read.
It's like I've stepped into an episode of Futurama!
I tried a similar thing today, parsing unstructured text (client Excel documents) and turning it into JSON. I ran into the problem that the output format kept changing and sometimes the JSON wasn't parsable.

Thanks for your prompt. It gives me some pointers on how to improve mine.

You're welcome! For the chat model, it definitely helps to let it know that you want valid, parsable JSON (and nothing else). Otherwise it tends to get chatty. ;-) Depending on your use case, you might even ask it to fix the JSON if it's not parsable.
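
As a rough illustration of the "ask it to fix the JSON" idea, here is a sketch of a parse-and-repair loop. `ask_model` is a hypothetical callable that takes a prompt string and returns the model's reply as a string; wire it up to whatever client you use. The repair prompt wording is just an example.

```python
import json


def parse_json_with_repair(ask_model, prompt, max_repairs=1):
    """Parse the model's reply as JSON, asking it to repair the reply on failure."""
    reply = ask_model(prompt)
    for attempt in range(max_repairs + 1):
        try:
            return json.loads(reply)
        except json.JSONDecodeError as error:
            if attempt == max_repairs:
                raise  # give up after the allowed number of repair rounds
            # Feed the broken reply back together with the parser's complaint.
            reply = ask_model(
                "The following was supposed to be valid JSON but failed to parse "
                f"({error.msg}). Reply with the corrected JSON and nothing else:\n"
                f"{reply}"
            )
```

Keeping `max_repairs` small avoids burning tokens on replies that are never going to parse; at that point it is usually better to log the input and move on.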
I had the problem that it changed the layout of the JSON file: {"data": [...]} or {"products":[...]}.

In your first example, you told GPT what data structure you expected. I added this to my prompt, and now it produces the JSON data consistently.
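
Something along these lines, as a sketch: spell out the expected structure in the prompt, then check the reply's top-level shape before using it. The `name` and `price` fields and the helper names are invented for illustration, not taken from the blog post.

```python
import json

# Hypothetical structure; adapt the fields to your own documents.
SCHEMA_HINT = '{"products": [{"name": string, "price": number}]}'


def build_prompt(raw_text):
    """Prompt that tells the model exactly which JSON structure to return."""
    return (
        "Extract the products from the text below. Respond with valid JSON only, "
        f"using exactly this structure: {SCHEMA_HINT}\n\n{raw_text}"
    )


def check_shape(reply, key="products"):
    """Parse the reply and return the list under `key`, or raise ValueError."""
    data = json.loads(reply)
    if not isinstance(data, dict) or set(data) != {key} or not isinstance(data[key], list):
        raise ValueError(f"expected an object with a single '{key}' list")
    return data[key]
```

With this check, a reply shaped like {"data": [...]} is rejected right away instead of silently breaking whatever consumes the output.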

Any plans for making this open source?
The core ideas for extracting the information with GPT are already available in the blog post linked above. Those are exactly the prompts I'm using. The rest is just a pretty simple Nuxt web application. So I'm not sure if open sourcing my mediocre frontend code would be of any value. Is there anything in particular you would be interested in?