| Some excellent points in there, thanks! Really drawn to the idea of storing the final recommendations in a separate file. Didn't write it that way initially because most SSGs handle "data" files in their own unique way, and one of my goals was to be as "light touch" as possible. But I guess that could be solved with some config options (set `path/to/data.json` or similar). I agree that "79% match" doesn't mean anything on its own (and is, yes, totally arbitrary in a lot of ways), but it does provide some context when browsing across the whole site. It's a way to indicate that a "90% match" is more similar than a "60% match". Felt like useful info to me, so that's why I included it. As for "why only top two?" - I'm constantly paranoid about adding too many bells and whistles to my blog and overloading the patience of whoever's taken the time to read. If I had official rules, they'd read "no carousels, and as few Calls To Action on a page as possible". I'm not super strict about it, but 1 recommendation felt too stingy and 3 felt like too many. BM25 is new to me, and I'm mostly a n00b when it comes to "proper" search. But I'll definitely do some more reading. Currently setting up a head-to-head with an embedding index and a fuzzy-search library, but don't have any scientific way to measure the results. Sounds like you may have pointed me in the direction of a missing piece of the puzzle. Thanks! |
(And you do need a JS library, it's not just a line or two of throwaway code. Client-side transclusion is a bit tricky to get right for use cases as advanced and general-purpose as ours - we use it for lots of things. Transclude other pages, transclude sections of pages, transclude section ranges, recursive transclusions... It needs to make sure styles get applied, render the content off-page for acceptable performance, rewrite paths inside the transcluded HTML so links go where you expect them to - that sort of thing.)
The percent match is also misleading because there is no sense in which it is a percentage. It just isn't. '79% match' is not 1% more similar than '78% match'. My finding with the OA embedding is that a distance of 0.01 actually corresponds to a pretty large semantic distance, and after a few more increments the suggestions are worthless. Also consider this: a distance of 0 (i.e. itself) may arguably be '100%' (hard to get more similar than itself!), but then what is a distance like 1? (And can't the cosine distance go higher?) Can you really be '0% similar', never mind '-10% similar'? It is true that 80% is better than 79%, but that's all that means, and you can present that by simply putting them in a list by distance, as you do already.
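
To make the range question concrete, here is a minimal Python sketch using toy 2-D vectors rather than real embeddings. The `naive_percent` mapping is only an assumption about how a "percent match" might be derived from a distance; it is not taken from the post being discussed. Cosine distance (1 minus cosine similarity) runs from 0 to 2, so a mapping like this can produce 0% or negative values, while sorting by distance stays meaningful.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; mathematically in [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def naive_percent(distance):
    # Hypothetical mapping, assumed for illustration only.
    return round((1.0 - distance) * 100)

# Toy vectors standing in for embeddings (illustrative, not real data).
query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

distances = {name: cosine_distance(query, vec) for name, vec in candidates.items()}

# Ranking by distance carries all the ordering information the "percent" does.
ranked = sorted(distances, key=distances.get)

for name in ranked:
    print(f"{name}: distance={distances[name]:.2f}, naive percent={naive_percent(distances[name])}")
```

The ordering in `ranked` is the only thing the number reliably conveys, which is why a plain list sorted by distance loses nothing.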