Sure thing!

I'll see if we can make a blog post, but here are a bunch of things that jump out (I work at a company that sells tickets to live events, so our users search for 'live events' like sporting events / concerts and 'performers' like teams and musicians):

- word order matters _a lot_. we did a lot of fiddling with the n-gram tokenizer (https://www.elastic.co/guide/en/elasticsearch/reference/curr...). we ended up making word order matter a good amount (e.g., 'new york' vs 'york new' return very different results; considering them the same resulted in a lot of noise)

- where the user is searching from is pretty important -- we would fetch the 25 best results and then boost (i.e., reorder) them based on the user's distance from the event venue or the sports team's home venue. we also experimented with fetching more and more results (up to 250) and then boosting from this larger result set. note that ES couldn't take location into account out of the box -- we had to manually boost on the ES output

- we set up versioning with our autocomplete endpoint so we could more easily A/B test variants (highly recommend this)

- we built a system so non-technical employees could create "synonyms." for example, "nyc" could expand to "New York City." we also worked with our data science team to get a list of bad queries that might need synonyms to improve them. (we also automatically triggered a real-time re-index on synonym creation)

- we similarly had an "expectations" tool for bug reporting and finding patterns from common bugs

- we had to add a bunch of other metadata / suffixes to our documents. for example, we might want to return a 1pm Yankees game on August 4 when someone queries "august yankees afternoon game". so we have to interpret the start time (1pm as "afternoon") and add the month to the document being searched. similarly, we want this event to return when someone queries 'nyc baseball', so we need to ensure the league/sport is associated with the event document

- we also had to add "stop words" that we ignored when querying. these include 'game(s)', 'versus', 'concert(s)', 'tickets', etc

- we have an internal definition of performer or event "popularity", and needed to normalize this so ES's "match score" made more sense. (we had limited success here)

- their documentation describes fuzziness as: `fuzziness is interpreted as a Levenshtein Edit Distance — the number of one character changes that need to be made to one string to make it the same as another string` which is overly simplistic and really messy to override (we decided against it)

- because we had two different entities in our results ('events' and 'performers'), we had to figure out how to rank them against each other (comparing results within a single entity type was generally easier). we based this on what was returned, time to event, location of event, and home location of the performer. we also added additional entities / pages on an ad-hoc basis, which further complicated things

- we also needed to exclude low-quality performers and events from our catalog (e.g., performers with no events, events with no tickets for sale)
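
To make the location point above concrete, here is a rough Python sketch of re-ranking an already-fetched result set by the user's distance from the venue. It is not the code from this project: the field names (`score`, `lat`, `lon`), the `scale_km` parameter, and the `1 / (1 + d / scale)` decay are all assumptions picked for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def rerank_by_distance(hits, user_lat, user_lon, scale_km=100.0):
    """Reorder hits by text score multiplied by a distance decay.

    `hits` is a list of dicts with hypothetical keys: score, lat, lon.
    The decay is an assumed formula, not anything ES computes for you here.
    """
    def boosted(hit):
        d = haversine_km(user_lat, user_lon, hit["lat"], hit["lon"])
        return hit["score"] * (1.0 / (1.0 + d / scale_km))
    return sorted(hits, key=boosted, reverse=True)

# Made-up example: two hits, one at a Boston venue, one at a Bronx venue.
hits = [
    {"name": "Yankees at Red Sox", "score": 10.0, "lat": 42.3467, "lon": -71.0972},
    {"name": "Red Sox at Yankees", "score": 9.0, "lat": 40.8296, "lon": -73.9262},
]
# A user searching from Manhattan.
ranked = rerank_by_distance(hits, user_lat=40.7128, user_lon=-74.0060)
```

With these made-up numbers the Bronx game moves ahead of the Boston game for a Manhattan user, even though its text score is lower. Fetching a larger candidate set (say 250 instead of 25) just means passing a longer `hits` list.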

In addition to configuring ES, it was pretty difficult to settle on a KPI because it's not that easy to put searches in the context of the entire user session... we could see if a given query resulted in: the user clicking on a result, or no search results, or the user deleting everything in the box and starting over, but we had a hard time following the user and seeing if the click led to a purchase.
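
As a rough illustration of the three query-level outcomes mentioned above, a Python sketch that buckets logged searches and computes the share of each might look like this. The log fields (`result_count`, `clicked_result_id`, `cleared_box`) are hypothetical, and this says nothing about whether a click led to a purchase, which was the hard part.

```python
from collections import Counter

def classify_search(event):
    """Bucket one logged search into a coarse outcome (assumed field names)."""
    if event["result_count"] == 0:
        return "no_results"
    if event.get("clicked_result_id") is not None:
        return "clicked"
    if event.get("cleared_box"):
        return "cleared_and_restarted"
    return "other"

def outcome_rates(events):
    """Return the fraction of searches that fall into each outcome bucket."""
    counts = Counter(classify_search(e) for e in events)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

# Made-up log entries for illustration.
log = [
    {"query": "yankees", "result_count": 12, "clicked_result_id": "evt_1"},
    {"query": "yankes", "result_count": 0},
    {"query": "new yo", "result_count": 8, "cleared_box": True},
    {"query": "nyc baseball", "result_count": 5, "clicked_result_id": "evt_7"},
]
rates = outcome_rates(log)
```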

Also, as a disclaimer, I didn't actually write any code for this project (I'm a product manager). But I did take a computational linguistics class in college and worked very closely with the developer :)