[cfe-dev] Spell Correction Efficiency

Matthieu Monrocq matthieu.monrocq at gmail.com
Sun Nov 21 11:58:02 PST 2010


Hi Doug,

2010/11/9 Douglas Gregor <dgregor at apple.com>

> Hi Matthieu,
>
> On Nov 8, 2010, at 3:05 PM, Matthieu Monrocq wrote:
>
> Hi Doug,
>
> We had a discussion about a month ago, I think, about the spell-correction
> algorithm used in Clang, notably for auto-completion, which is based on the
> Levenshtein distance.
>
> It turns out I just came upon this link today:
> http://nlp.stanford.edu/IR-book/html/htmledition/k-gram-indexes-for-spelling-correction-1.html
>
> The idea is to use bigrams (2-letter fragments of a word) to build an index
> of the form (bigram -> all words containing this bigram), then use this index
> to retrieve all the words with enough bigrams in common with the word you are
> currently trying to approximate.
>
> This drastically reduces the set of identifiers on which to compute the
> Levenshtein distance, especially if we first discard those whose length is
> incompatible anyway. Furthermore, the results can be ordered by the number of
> common bigrams, so that the Levenshtein distance is computed first on the
> most promising candidates, letting the maximum edit distance drop quickly.
>
> I have implemented a quick algorithm in Python (~70 lines, few comments
> though) to test it out, and I find the results quite promising.
>
> I have used the following corpus: http://norvig.com/big.txt  (~6 MB) which
> can be found on Peter Norvig's page: http://norvig.com/spell-correct.html
>
> For example, looking for the word "neaer":
>
> Number of distinct Words from the corpus: 48911
> Number of distinct Bigrams from the corpus: 1528
> Number of results: [(1, 34), (2, 1133)]
> Most promising results: (1, ['Kearney', 'NEAR', 'nearing', 'linear',
> 'learner', 'uneasy', 'nearest', 'neat', 'lineage', 'leaned', 'Nearer',
> 'Learned', 'Nearly', 'cleaner', 'cleaned', 'neatly', 'nearer', 'earned',
> 'neared', 'nearly', 'learned', 'nearby', 'Nearest', 'near', 'meanest',
> 'earnest', 'Near', 'beneath', 'gleaned', 'Beneath', 'kneaded', 'weaned',
> 'Freneau', 'guineas'])
>
> Or for the word "sppose":
>
> Number of distinct Words from the corpus: 48911
> Number of distinct Bigrams from the corpus: 1528
> Number of results: [(1, 14), (2, 136), (3, 991)]
> Most promising results: (1, ['disposal', 'opposed', 'opposing', 'Oppose',
> 'suppose', 'Dispose', 'dispose', 'apposed', 'disposed', 'oppose',
> 'supposed', 'opposite', 'supposes', 'Suppose'])
>
>
> Do you think the method is worth investigating?
>
>
> Yes, I do. Typo correction performance is very important (and has
> historically been very poor).
>
> Note that there are also more-efficient algorithms for computing the
> Levenshtein distance (e.g., O(d*min(m, n)) rather than O(m*n)). We should
> also consider those.
>
> - Doug
>
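
For reference, the bigram-filtering prototype quoted above boils down to
something like the sketch below (the names, the lowercasing and the
length-trimming threshold are my own choices here, and it won't reproduce the
numbers above exactly, but the structure is the same):

from collections import defaultdict

def bigrams(word):
    """Set of 2-letter substrings of a (lowercased) word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}

def build_index(words):
    """Map each bigram to the set of words containing it."""
    index = defaultdict(set)
    for word in words:
        for bg in bigrams(word):
            index[bg].add(word)
    return index

def candidates(index, typo, max_dist=3):
    """Group candidate words by how many of the typo's bigrams they miss,
    after trimming words whose length difference alone exceeds max_dist."""
    target = bigrams(typo)
    common = defaultdict(int)
    for bg in target:
        for word in index.get(bg, ()):
            if abs(len(word) - len(typo)) <= max_dist:
                common[word] += 1
    groups = defaultdict(list)
    for word, count in common.items():
        groups[len(target) - count].append(word)
    return dict(sorted(groups.items()))  # smaller key = more promising

The Levenshtein distance is then computed on the most promising groups first,
so that the maximum edit distance drops quickly.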

The diagonal optimization you are talking about seems to bring a 2x to 3x
speed-up, which is quite nice; however, the algorithm is tricky and I am not
yet sure that I get the correct distance in every case. I am going to sit down
and implement some fuzz testers this week to help find bugs in the various
algorithms.
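
To make sure we are talking about the same optimization, here is roughly the
banded variant I am testing (a sketch only, not the exact code; it reports the
distance when it is <= max_dist and gives up otherwise):

def bounded_levenshtein(a, b, max_dist):
    """Levenshtein distance restricted to a diagonal band of width
    2*max_dist + 1; returns None if the distance exceeds max_dist."""
    if abs(len(a) - len(b)) > max_dist:
        return None
    INF = max_dist + 1
    prev = {j: j for j in range(min(len(b), max_dist) + 1)}  # row 0
    for i in range(1, len(a) + 1):
        lo, hi = max(0, i - max_dist), min(len(b), i + max_dist)
        curr = {}
        for j in range(lo, hi + 1):
            if j == 0:
                curr[j] = i
                continue
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev.get(j - 1, INF) + cost,  # substitution / match
                          prev.get(j, INF) + 1,         # deletion
                          curr.get(j - 1, INF) + 1)     # insertion
        if min(curr.values()) > max_dist:
            return None  # the whole band already exceeds the bound
        prev = curr
    dist = prev.get(len(b), INF)
    return dist if dist <= max_dist else None

For example, bounded_levenshtein("neaer", "nearer", 3) gives 1, and a pair
whose distance exceeds the bound short-circuits to None.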

Anyway, I've been exploring a few options for typo correction these past weeks
and I now have a little ecosystem of various solutions; however, I currently
lack 3 things:
- a real lexicon (mine is about ~18k words): I would appreciate it if someone
could point me to a more representative lexicon, otherwise I may try to index
a subset of the Boost libraries and see what I can come up with
- a real benchmark: I have no idea how to "select" good (typo-ed) words for
testing / benchmarking the typo-correction algorithms. I have started by
always using the same word (at a distance of 1), but that clearly isn't
representative, and my knowledge of statistics is rather lacking, so if you
have an approach to suggest...
- a good Trie implementation, or hints on how to implement one. A Trie seems
to be the best index (quite compact yet ordered), however I only have a naive
implementation at the moment (see the sketch below).
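
For what it's worth, the naive version is essentially the textbook
dict-of-dicts Trie shown below; any pointer towards something more compact
would be welcome:

class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}  # letter -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True

    def __contains__(self, word):
        node = self.root
        for letter in word:
            node = node.children.get(letter)
            if node is None:
                return False
        return node.is_word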

My best contender so far is based on language theory and involves extracting
the set of words recognized by two automata: one is the Trie, which recognizes
all the known words, and the other is crafted for the word we are looking for
and recognizes all words within a certain distance (a Levenshtein automaton).
However, for maximal efficiency this requires a precomputed table, and that
table grows exponentially with the distance. It works for distances 1 to 3, is
already 2.8 MB for distance 4 (and takes about 70 s to compute on my machine,
though I suppose deserializing a precomputed version would be faster), and I
haven't dared to go further, as it seems to be a geometric series with a
factor of ~28!
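
I can't reasonably paste the precomputed tables here, but to give an idea of
the intersection, here is a simplified stand-in that walks the Trie above
while carrying an explicit Levenshtein row instead of using the precomputed
automaton -- it returns the same set of words, just with more work per node:

def fuzzy_lookup(trie, word, max_dist):
    """All Trie words within max_dist Levenshtein edits of `word`."""
    results = []
    first_row = list(range(len(word) + 1))  # distances from the empty prefix

    def walk(node, prefix, prev_row):
        for letter, child in node.children.items():
            row = [prev_row[0] + 1]
            for j in range(1, len(word) + 1):
                cost = 0 if word[j - 1] == letter else 1
                row.append(min(prev_row[j - 1] + cost,  # substitute / match
                               prev_row[j] + 1,         # skip `letter`
                               row[j - 1] + 1))         # skip word[j - 1]
            if child.is_word and row[-1] <= max_dist:
                results.append((prefix + letter, row[-1]))
            if min(row) <= max_dist:  # otherwise prune the whole subtree
                walk(child, prefix + letter, row)

    walk(trie.root, "", first_row)
    return sorted(results, key=lambda pair: pair[1])

Against the corpus above, fuzzy_lookup(trie, "neaer", 1) brings back pairs
such as ('near', 1) and ('nearer', 1).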

Would it be considered a loss to limit ourselves to words within distance
min(word.length()/4 + 1, 3)?

It occurred to me that one transformation (the transposition of adjacent
characters) is quite a common typo. I have toyed with the Damerau-Levenshtein
distance, but it is quite a bit harder to implement the optimizations with it.
Without it, a simple transposition costs 2 transformations (one insertion and
one deletion), so in that case a limit of 3 is rather restrictive, but I still
don't know whether looking much further is very interesting.
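
For completeness, the restricted variant I toyed with (adjacent transpositions
only, sometimes called optimal string alignment) is just the usual DP with one
extra case, which is exactly what makes the banding trickier:

def osa_distance(a, b):
    """Damerau-Levenshtein distance, restricted variant (optimal string
    alignment): an adjacent transposition counts as a single edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

With it, osa_distance("teh", "the") is 1, where the plain Levenshtein distance
is 2.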

I'd like to get some feedback before delving further :)

Also, I haven't considered how to actually plug this into the llvm/clang code
base yet; I am just prototyping at the moment.