by ryrobes 4570 days ago
OP: Thanks everyone for the feedback!

Just a short explanation of the (still very rudimentary) query "system" (using the term loosely here)...

The tab file gets scraped and broken down into individual passages based on how it's written (aka the "riffs", even though technically they might not be riffs). A passage like

   P.M.---|  h     P.M.  h      
   |---------------------------|
   |---------------------------|
   |--------7^8--7-------------|
   |--------------------7^8--7-|
   |-0---0-----------0---------|
   |---------------------------|
becomes normalized / encoded to something like

   "5a 5a 3h 3i 3h 5a 4h 4i 4h" 
and inserted into an ElasticSearch cluster, using a non-word analyzer for indexing. (Simplified a bit here for the sake of argument: I also save all spacing, symbol markup, bar sections and palm muting, they just aren't being used in search currently.)

   "settings": {
       "index.analysis.analyzer.nonword.type": "pattern",
       "index.analysis.analyzer.nonword.pattern": "[^\\w]+"
   }...
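
To make the encoding step concrete, here is a minimal Python sketch. It is a guess reconstructed from the example above, not the actual scraper code: it assumes each token is the string number (counted from the top line) followed by a letter for the fret (0 = a, 1 = b, ... 7 = h), that notes are read left to right by column, and that symbols such as ^, h and P.M. are dropped. The names encode_tab and nonword_tokens are made up for illustration, and nonword_tokens only approximates what the pattern analyzer does at index time.

```python
import re

TAB = """
   P.M.---|  h     P.M.  h
   |---------------------------|
   |---------------------------|
   |--------7^8--7-------------|
   |--------------------7^8--7-|
   |-0---0-----------0---------|
   |---------------------------|
"""

def encode_tab(tab: str) -> str:
    """Encode a tab passage as space-separated string+fret tokens."""
    # Keep only the string rows (the ones that start with a bar line);
    # this drops the annotation row with P.M. and h markers.
    rows = [ln.strip() for ln in tab.splitlines() if ln.strip().startswith("|")]
    notes = []
    for string_no, row in enumerate(rows, start=1):
        for m in re.finditer(r"\d+", row):
            # (column, string, fret): the column gives left-to-right order.
            notes.append((m.start(), string_no, int(m.group())))
    notes.sort()
    # Assumed scheme: fret 0 -> 'a', 1 -> 'b', ... so fret 7 -> 'h'.
    return " ".join(f"{s}{chr(ord('a') + f)}" for _, s, f in notes)

def nonword_tokens(text: str) -> list:
    """Rough stand-in for the pattern analyzer: split on runs of non-word chars."""
    return [t for t in re.split(r"[^\w]+", text) if t]

encoded = encode_tab(TAB)
print(encoded)                  # 5a 5a 3h 3i 3h 5a 4h 4i 4h
print(nonword_tokens(encoded))  # one token per note
```

Because every token is made of word characters only, splitting on "[^\w]+" leaves one term per note, which is what lets positions line up for the span query.
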
On search, the same encoding function is applied to the incoming text, which is then split into tokens and put into an ordered span query with different levels of 'slop':

   "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "riff_code": "5a"
          }
        },
        {
          "span_term": {
            "riff_code": "5a"
          }
        },
        {
          "span_term": {
            "riff_code": "3h"
          }
        },
        {
          "span_term": {
            "riff_code": "3i"
          }
        }
      ],
      "slop": 6,
      "in_order": true
    } ....
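
A sketch of how a query body like the one above could be generated from an encoded riff, in Python. The helper names (build_span_query, queries_by_slop) and the slop levels other than 6 are assumptions for illustration only; the field name riff_code and the span_near / span_term structure are taken from the snippet above.

```python
import json

def build_span_query(riff_code: str, slop: int = 6, field: str = "riff_code") -> dict:
    """Turn an encoded riff ("5a 5a 3h 3i") into an ordered span_near query body."""
    # One span_term clause per token, in the order the notes were played.
    clauses = [{"span_term": {field: tok}} for tok in riff_code.split()]
    return {
        "query": {
            "span_near": {
                "clauses": clauses,
                "slop": slop,        # how many positions of "gap" are tolerated
                "in_order": True,    # the notes must appear in the same order
            }
        }
    }

def queries_by_slop(riff_code: str, slops=(0, 2, 6)) -> list:
    """Build the same query at several slop levels (levels here are made up)."""
    return [build_span_query(riff_code, slop=s) for s in slops]

print(json.dumps(build_span_query("5a 5a 3h 3i"), indent=2))
```

Running the strict (low slop) variants first and falling back to looser ones is one way to use the different levels, though that ordering is a design guess rather than something stated here.
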
I cut results off at a score of >1.1 or so, so that it doesn't show things that are way off.
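
As a rough illustration of that cutoff, here is a Python sketch that filters a search response on the client side. The helper name keep_close_matches and the sample response (ids and scores) are made up; only the hits.hits[]._score shape and the 1.1 threshold come from Elasticsearch and the text above. Elasticsearch also accepts a top-level "min_score" in the search body, which may be a simpler way to get a similar effect on the server side.

```python
def keep_close_matches(response: dict, min_score: float = 1.1) -> list:
    """Keep only the hits whose _score is above the cutoff."""
    hits = response.get("hits", {}).get("hits", [])
    # _score can be missing or null (e.g. when sorting), so treat that as 0.
    return [h for h in hits if (h.get("_score") or 0.0) > min_score]

# Made-up response, trimmed to the fields used here.
fake_response = {
    "hits": {
        "hits": [
            {"_id": "riff-1", "_score": 2.4},
            {"_id": "riff-2", "_score": 1.3},
            {"_id": "riff-3", "_score": 0.7},
        ]
    }
}

print([h["_id"] for h in keep_close_matches(fake_response)])  # ['riff-1', 'riff-2']
```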

At the time it seemed like the best way to detect patterns that are mostly similar and have the results look decent. I also experimented with the MoreLikeThis and FuzzyLikeThis query variants, but ultimately the span query gave results closer to what one would EXPECT to see (though it still has some scoring and clustering problems).

Any Lucene / ElasticSearch gurus feel free to suggest differently.