HN Mirror


by ComputerGuru 777 days ago

Amazing work, I'm impressed by the scope of your project!

I must say though, whether it's jina or bge-3/flag, the embeddings (and tokenizer?) do not do a good job on tech topics. It's fine for natural words, but searching for tech concepts like "xaml", "simd", etc. causes it to fall back to tokenizing the inputs and trying to grab similar-sounding words.
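To illustrate the kind of fallback I mean, here is a toy sketch of greedy WordPiece-style splitting. The vocabulary is made up for the example and is not the real jina or bge vocabulary, and the actual models may use a different subword scheme; the point is only that a term missing from the vocabulary gets broken into fragments rather than treated as one concept.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, in the style of WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing in the vocabulary matches at this position
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"search", "x", "##am", "##l", "sim", "##d"}

print(wordpiece("search", TOY_VOCAB))  # ['search']
print(wordpiece("xaml", TOY_VOCAB))    # ['x', '##am', '##l']
print(wordpiece("simd", TOY_VOCAB))    # ['sim', '##d']
```

A natural word that is in the vocabulary stays whole, while "xaml" and "simd" end up as pieces that carry little of the original meaning.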

Also, just some constructive feedback: it would be nice if there were some way to stop it from showing the same "hn leaderboard" of results when there are no real matches because a topic is too niche. I get a lot of "Stephen Hawking has died" when searching for words the embeddings aren't familiar with.

Edit: I'm not so sure how well the sentiment analysis is working. I had the feeling that there was too much negative sentiment that didn't match up to reality, so I tried looking up things HN would feel overwhelmingly positive about like "Mr Rogers", I mean, who could feel negatively about him? The results show some serious negative spikes. Look up "Carter" and there's a massive negative peak associated with the passing of Rosalynn Carter. It was an HN submission talking about all the wonderful things the Carters did.

Also, I think the "popularity over time" needs to be scaled by the median number of votes a story got that month/year, because the trend lines just go up and up if you plot strictly the number of posts. Look at the popularity of "diesel" and you'll see what I mean - this is a term that peaked ten years ago! Or perhaps it should be some sort of keyword incidence rate or number of items with a cosine similarity index of less than x from the query rather than post score, maybe?
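A rough sketch of the keyword incidence rate idea, assuming the data is available as (year, title) pairs. The function name and the sample posts are made up for illustration and have nothing to do with how the site actually computes its chart.

```python
from collections import Counter


def incidence_rate(posts, keyword):
    """Share of posts per year whose title mentions the keyword.

    posts: iterable of (year, title) pairs (an assumed input shape).
    """
    totals = Counter()
    hits = Counter()
    needle = keyword.lower()
    for year, title in posts:
        totals[year] += 1
        if needle in title.lower():
            hits[year] += 1
    # Dividing by the yearly total removes overall site growth from the trend.
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Made-up sample data: one "diesel" post in each year, but far more posts overall in 2023.
posts = [
    (2013, "Diesel engines explained"),
    (2013, "Show HN: my side project"),
    (2023, "Diesel prices rise"),
    (2023, "New Rust release"),
    (2023, "Ask HN: favorite editor?"),
    (2023, "SIMD tricks"),
]

rates = incidence_rate(posts, "diesel")
print(rates)  # {2013: 0.5, 2023: 0.25}
```

The raw count is flat at one post per year, but the rate halves, which is closer to what you'd expect for a term that peaked years ago.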

Edit2: The dynamic "click a post to remove and recalculate similarity threshold" is awesome.

1 comment

tarasglek 776 days ago

How does one tell programmatically that any given embedding model doesn't recognize a term or word?
