
Oh yeah, I could tinker forever. It's an amazing dataset that I think needs more attention from the ML community. Glad to see the team working at https://www.ecfr.gov/ finally making their search better, as Cornell Law has been the de facto go-to forever (for me at least).

I think an amazing eCFR search experiment would be transformer vectors in a graph, using the hierarchy, citations, and references as edges to (sub)paragraph and section nodes - perhaps even using a modified HNSW somehow. The graph that exists there now isn't leveraged enough.
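To make the idea a bit more concrete, here is a rough Python sketch of vectors living on graph nodes, with hierarchy and citation links as edges. Everything in it is an assumption for illustration: the node IDs, the two-dimensional toy vectors, the `RegGraph` class, and the `neighbor_weight` parameter are all hypothetical, and the search is a brute-force cosine scan with one hop of score propagation, not HNSW or a modified version of it.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class RegGraph:
    """Toy graph: nodes hold vectors, edges hold a relationship kind."""

    def __init__(self):
        self.vectors = {}
        self.edges = defaultdict(set)

    def add_node(self, node_id, vector):
        self.vectors[node_id] = vector

    def add_edge(self, src, dst, kind):
        # kind is a label such as "hierarchy", "citation", or "reference";
        # edges are stored in both directions for simplicity.
        self.edges[src].add((dst, kind))
        self.edges[dst].add((src, kind))

    def search(self, query, k=3, neighbor_weight=0.5):
        # Step 1: brute-force vector similarity for every node.
        base = {n: cosine(query, v) for n, v in self.vectors.items()}
        # Step 2: let each node lend a discounted share of its score
        # to its direct neighbors, so linked text can surface too.
        scores = dict(base)
        for node, score in base.items():
            for nbr, _kind in self.edges[node]:
                if nbr in scores:
                    scores[nbr] = max(scores[nbr], neighbor_weight * score)
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        return ranked[:k]


# Hypothetical nodes: a section, one of its paragraphs, and a cited section.
graph = RegGraph()
graph.add_node("sec-1", [1.0, 0.0])
graph.add_node("sec-1-a", [0.9, 0.1])
graph.add_node("sec-2", [0.0, 1.0])
graph.add_edge("sec-1", "sec-1-a", "hierarchy")
graph.add_edge("sec-1-a", "sec-2", "citation")

results = graph.search([1.0, 0.0], k=3)
for node_id, score in results:
    print(node_id, round(score, 3))
```

In this toy run, "sec-2" has zero direct similarity to the query but still gets a nonzero score because a closely matching paragraph cites it, which is the kind of signal the existing graph could contribute.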

As for this dataset itself, I already output Vespa-formatted JSON (as noted in https://github.com/maxdotio/ecfr-prepare), and the resulting vectors from inference get appended to the original JSON doc as a field.
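A minimal sketch of what appending a vector to a Vespa-style document could look like. This is not taken from the ecfr-prepare repo: the document ID, the `text` field, the `embedding` field name, and the `append_vector` helper are all hypothetical, and the `{"values": [...]}` form assumes the target field is an indexed tensor in the schema.

```python
import json


def append_vector(doc, vector, field="embedding"):
    """Return a copy of a Vespa-style put document with a vector field added.

    The field name "embedding" is an assumption; use whatever name the
    schema declares for the tensor field.
    """
    updated = dict(doc)
    updated["fields"] = dict(doc.get("fields", {}))
    # Indexed tensors can be fed as an object with a flat "values" list.
    updated["fields"][field] = {"values": [float(x) for x in vector]}
    return updated


# Hypothetical document; real IDs and fields come from the prepare step.
doc = {
    "put": "id:ecfr:paragraph::example-1",
    "fields": {"text": "Example paragraph text."},
}

# In practice the vector would come from model inference; this is a placeholder.
enriched = append_vector(doc, [0.1, 0.2, 0.3])
print(json.dumps(enriched, indent=2))
```

The helper copies the document rather than mutating it, so the original JSON stays available if inference needs to be rerun with a different model.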

I have a Vespa schema that I need to upload (it doesn't include the vector field yet, but that can be added using the Vespa vector search walkthroughs). It's been a busy day but I'll quickly try to find a place to put it for now :)

--EDIT-- Pushed the schema to the above repo, and some bash. You'll need Docker and to follow the Vespa MSMARCO instructions first at https://docs.vespa.ai/en/tutorials/text-search-semantic.html to get used to the engine.