Oh yeah, I could tinker forever. It's an amazing dataset that I think needs more attention from the ML community. Glad to see the working team at https://www.ecfr.gov/ finally making their search better, as Cornell Law has been the de facto go-to forever (for me at least).
I think an amazing eCFR search experiment would be transformer vectors in a graph, using the hierarchy, citations, and references as edges to (sub)paragraph and section nodes - perhaps even using a modified HNSW somehow. The graph that exists there now isn't leveraged enough.
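To make the graph idea a bit more concrete, here's a rough toy sketch in Python. It is not HNSW and not anything eCFR or Vespa ships: it just does a brute-force cosine search to pick seed nodes, then walks one hop along hierarchy/citation edges so that linked nodes get pulled into the results. The node ids (`sec-A`, `sec-B`, ...), the 2-d vectors, the `CfrGraph` class, and the `decay` parameter are all made up for illustration; real vectors would come from a transformer.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class CfrGraph:
    """Hypothetical toy graph: nodes are sections/paragraphs with a vector each."""

    def __init__(self):
        self.vectors = {}              # node id -> embedding
        self.edges = defaultdict(set)  # node id -> {(neighbor id, edge kind)}

    def add_node(self, node_id, vector):
        self.vectors[node_id] = vector

    def add_edge(self, src, dst, kind):
        # Store both directions so expansion can walk up or down the hierarchy.
        self.edges[src].add((dst, kind))
        self.edges[dst].add((src, kind))

    def search(self, query_vec, k=2, decay=0.5):
        # 1) Brute-force similarity against every node (no ANN index here).
        sims = {n: cosine(query_vec, v) for n, v in self.vectors.items()}
        seeds = sorted(sims, key=sims.get, reverse=True)[:k]
        scores = {n: sims[n] for n in seeds}
        # 2) One-hop expansion: a neighbor scores the better of its own
        #    similarity and a decayed copy of the seed's similarity.
        for seed in seeds:
            for neighbor, _kind in self.edges[seed]:
                candidate = max(sims[neighbor], decay * sims[seed])
                scores[neighbor] = max(scores.get(neighbor, 0.0), candidate)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


g = CfrGraph()
g.add_node("sec-A", [1.0, 0.0])
g.add_node("sec-A(a)", [0.9, 0.1])
g.add_node("sec-B", [0.0, 1.0])    # unrelated by vector, but cited by sec-A(a)
g.add_node("sec-C", [-1.0, 0.0])   # unrelated and unlinked
g.add_edge("sec-A", "sec-A(a)", "hierarchy")
g.add_edge("sec-A(a)", "sec-B", "citation")

results = g.search([1.0, 0.0], k=2)
ids = [node_id for node_id, _score in results]
print(ids)
```

The point of the toy data is that `sec-B` has zero vector similarity to the query but still shows up because a seed cites it, while `sec-C` never does. A real version would want an ANN index for step 1 and probably per-edge-kind weights instead of a single `decay`.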
As for this dataset itself, I already output to Vespa-formatted JSON (as noted in https://github.com/maxdotio/ecfr-prepare ), and the resulting vectors from the inference get appended to the original JSON doc as a field.
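Roughly, the "append the vector as a field" step could look like the sketch below. This is not the actual code from the repo. The document id, the `text` and `embedding` field names, and the `toy_embed` helper are all placeholders I made up; `toy_embed` is a hash-based stand-in for real transformer inference, and the `{"values": [...]}` tensor form assumes an indexed tensor field in the schema.

```python
import hashlib
import json


def toy_embed(text, dim=4):
    """Placeholder for model inference: returns a deterministic fake vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def append_vector(doc, field="embedding", source_field="text"):
    """Return a copy of a Vespa-style feed doc with a vector field added."""
    # Copy the outer dict and the fields dict so the input doc is untouched.
    enriched = {**doc, "fields": dict(doc["fields"])}
    vector = toy_embed(doc["fields"][source_field])
    enriched["fields"][field] = {"values": vector}
    return enriched


# Hypothetical feed document; real ids and field names come from the prepare step.
doc = {
    "put": "id:ecfr:ecfr::example-1",
    "fields": {"text": "Example paragraph text."},
}

enriched = append_vector(doc)
print(json.dumps(enriched, indent=2))
```

Swapping `toy_embed` for a real encoder is the only part that should need to change, as long as the vector length matches the tensor dimension declared in the schema.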
I have a Vespa schema that I need to upload (it doesn't include the vector field yet, but that can be added using the Vespa vector search walkthroughs). It's been a busy day but I'll quickly try to find a place to put it for now :)
Yes, I was happy to see the modern changes. I agree Cornell Law had done a better job, although I think a lot of people use Google as the search tool and then follow the link to their preferred site, since those two are always the first two results.
My experience has been with 14 CFR and 21 CFR. I would love to see any tool you come up with in the future and would be happy to give you feedback.
--EDIT-- Pushed the schema to the above repo, along with some bash. You'll need Docker, and you should follow the Vespa MSMARCO instructions at https://docs.vespa.ai/en/tutorials/text-search-semantic.html first to get used to the engine.