| HN Mirror


	by amelius 3526 days ago
	Yes, but the downside is that (afaik) you can't use an index to quickly retrieve the matches in order. You really have to scan your complete dataset on every search. Besides, related to this, does anybody know of a good JavaScript implementation of a 3-way merge of strings, and perhaps also of JSON-like structures?

peff 3526 days ago

You can store the values in a trie (e.g., with one node per character in the string). Exact lookup in the trie is O(string_length), like a hash table. Inexact lookup can similarly walk the tree, but explore side branches within a certain budget.

So if your string is "abc", you'd follow the node for "a", then the one for "b", but _also_ the one for "c", at a cost of 1 (because dropping the "b" incurs an edit distance of 1).
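
A minimal sketch of that budgeted trie walk, in Python. It is one common way to do it, not necessarily what peff had in mind: each trie node gets one row of the Levenshtein table for the query, and a branch is abandoned once every entry in its row exceeds the budget. The names `TrieNode`, `insert`, `fuzzy_search`, and `budget` are made up for this illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.word = None    # set on nodes that end a stored string


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word


def fuzzy_search(root, query, budget):
    """Return {stored_word: edit_distance} for stored words within budget."""
    results = {}
    # Row for the empty prefix: distance to query[:i] is i.
    first_row = list(range(len(query) + 1))
    if root.word is not None and first_row[-1] <= budget:
        results[root.word] = first_row[-1]
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, budget, results)
    return results


def _walk(node, ch, query, prev_row, budget, results):
    # row[i] = edit distance between the trie path ending at `node`
    # and the first i characters of the query.
    row = [prev_row[0] + 1]
    for i in range(1, len(query) + 1):
        skip_query_char = row[i - 1] + 1        # query[i-1] has no counterpart
        skip_stored_char = prev_row[i] + 1      # ch has no counterpart
        match_or_substitute = prev_row[i - 1] + (query[i - 1] != ch)
        row.append(min(skip_query_char, skip_stored_char, match_or_substitute))
    if node.word is not None and row[-1] <= budget:
        results[node.word] = row[-1]
    # Explore side branches only while some alignment is still within budget.
    if min(row) <= budget:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, budget, results)


root = TrieNode()
for w in ["abc", "ac", "abd", "abcd", "xyz"]:
    insert(root, w)
print(fuzzy_search(root, "abc", 1))
```

With a budget of 1, searching for "abc" finds the stored "ac" at distance 1 (the dropped "b" from the example above), while the "x" subtree is cut off after two characters. The recursion depth equals the longest stored string, which is fine for words but would need an explicit stack for very long keys.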

DAllison 3526 days ago

> Exact lookup in the trie is O(string_length), like a hash table

It's worth noting that some standard libraries (Java's JDK for one [1]) cache the value of String.hashCode() on the string object, so once a string has been hashed, looking it up in a hash table is constant time on average (but O(n) worst-case due to collisions).

[1]: http://mindprod.com/jgloss/hashcode.html
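
To make the caching concrete, here is a rough Python imitation of the idea. It is a sketch, not the JDK's code: `CachedHashString` and `hash_computations` are invented names, the polynomial with multiplier 31 follows the formula documented for Java's `String.hashCode()`, and iterating over Python code points instead of UTF-16 units is a simplification.

```python
class CachedHashString:
    """String wrapper that computes its hash once and then reuses it."""

    def __init__(self, value):
        self.value = value
        self._hash = None            # not computed yet
        self.hash_computations = 0   # instrumentation for the demo only

    def __hash__(self):
        if self._hash is None:
            self.hash_computations += 1
            h = 0
            for ch in self.value:    # O(length), but only on the first call
                h = (31 * h + ord(ch)) & 0xFFFFFFFF
            self._hash = h
        return self._hash

    def __eq__(self, other):
        # A hit still needs an equality check, which is O(length) when the
        # two objects are distinct but hold equal strings.
        return isinstance(other, CachedHashString) and self.value == other.value


key = CachedHashString("abc")
table = {key: "found"}
table[key]
table[key]
print(key.hash_computations)
```

The counter stays at 1 however many lookups follow, which is the effect described above: the O(string_length) hashing cost is paid once per string object rather than once per lookup.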

lorenzhs 3526 days ago

There is a variety of approximate string matching algorithms that speed up search by using an index. https://arxiv.org/abs/1008.1191 is one that should be fairly easy to implement. Levenshtein automata are another approach that makes the rounds on HN every now and then, but are a tough beast to implement and I wouldn't really recommend them in practice.

justin66 3526 days ago

> Yes, but the downside is that (afaik) you can't use an index to quickly retrieve the matches in order. You really have to scan your complete dataset on every search.

I'm not sure I see what you're driving at there. If you had a finite set of strings that you might have to compare, you could (for example) populate a graph or something with weighted edges representing the Levenshtein distance between strings (vertices). Offhand, it seems like your search could basically use a hash table to find the position of the vertex representing your string on an already-populated adjacency list.

It'd be big, but in reality you'd probably only populate the edges with especially high or low weights, depending on the application?
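
A small sketch of that precomputed graph, assuming a fixed set of strings and keeping only the low-weight edges. `build_neighbor_index` and `max_distance` are hypothetical names; the adjacency list is a plain dict, so finding a known string's vertex is a hash lookup.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # drop ca
                           cur[j - 1] + 1,              # drop cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def build_neighbor_index(strings, max_distance):
    """Map each string to its neighbors within max_distance, nearest first."""
    index = {s: [] for s in strings}
    # Quadratic in the number of strings: this is the slow, build-time part.
    for a, b in combinations(index, 2):
        d = levenshtein(a, b)
        if d <= max_distance:
            index[a].append((b, d))
            index[b].append((a, d))
    for neighbors in index.values():
        neighbors.sort(key=lambda pair: (pair[1], pair[0]))
    return index


words = ["kitten", "sitten", "sitting", "mitten", "banana"]
index = build_neighbor_index(words, 2)
print(index["sitten"])
print("kitte" in index)
```

Looking up "sitten" returns its neighbors already ordered by distance, without touching the other strings. A query that was never added, such as "kitte", has no vertex at all, so this structure only answers searches for strings already in the set.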

amelius 3526 days ago

But how would you perform the lookup in the hash table when the string you're searching for is not (exactly) in the hash table?

justin66 3526 days ago

Searching a new string - adding a string to the set - would require comparing against all items in the set, yes. That's a much less daunting prospect than what you said: "You really have to scan your complete dataset on every search."

Building the index, or adding to it, is not fast (without some heuristics applied) but searching using the index isn't bad.

amelius 3526 days ago

Okay, now I understand what you meant.

But what if a search needs to be fast regardless of whether the string has been searched for before?

justin66 3526 days ago

I thought the thing peff briefly described above sounded pretty good. The worst case complexity, if I'm visualizing it right, would be the length of the string you were comparing.

I remember some discussion of this in previous HN threads but I don't know what it was about...
