[cfe-dev] Spell Correction Efficiency
Douglas Gregor
dgregor at apple.com
Mon Nov 8 15:13:56 PST 2010
Hi Matthieu,
On Nov 8, 2010, at 3:05 PM, Matthieu Monrocq wrote:
> Hi Doug,
>
> We had a discussion about a month ago, I think, about the spell-correction algorithm used in Clang (notably for auto-completion), which is based on the Levenshtein distance.
>
> It turns out I just came upon this link today: http://nlp.stanford.edu/IR-book/html/htmledition/k-gram-indexes-for-spelling-correction-1.html
>
> The idea is to use bigrams (two-letter substrings of a word) to build an index of the form (bigram -> all words containing this bigram), then use this index to retrieve all the words that have enough bigrams in common with the word you are trying to approximate.
>
> This drastically reduces the set of identifiers for which the Levenshtein distance has to be computed, especially if we discard up front those whose length is incompatible anyway. Furthermore, the results can be ordered by the number of common bigrams, so the Levenshtein distance can be computed first on the most promising candidates, which makes the maximum edit distance drop quickly.
>
> I have implemented a quick prototype in Python (~70 lines, few comments though) to test it out, and I find the results quite promising.
>
> I have used the following corpus: http://norvig.com/big.txt (~6 MB) which can be found on Peter Norvig's page: http://norvig.com/spell-correct.html
>
> For example, looking for the word "neaer":
>
> Number of distinct Words from the corpus: 48911
> Number of distinct Bigrams from the corpus: 1528
> Number of results: [(1, 34), (2, 1133)]
> Most promising results: (1, ['Kearney', 'NEAR', 'nearing', 'linear', 'learner', 'uneasy', 'nearest', 'neat', 'lineage', 'leaned', 'Nearer', 'Learned', 'Nearly', 'cleaner', 'cleaned', 'neatly', 'nearer', 'earned', 'neared', 'nearly', 'learned', 'nearby', 'Nearest', 'near', 'meanest', 'earnest', 'Near', 'beneath', 'gleaned', 'Beneath', 'kneaded', 'weaned', 'Freneau', 'guineas'])
>
> Or for the word "sppose":
>
> Number of distinct Words from the corpus: 48911
> Number of distinct Bigrams from the corpus: 1528
> Number of results: [(1, 14), (2, 136), (3, 991)]
> Most promising results: (1, ['disposal', 'opposed', 'opposing', 'Oppose', 'suppose', 'Dispose', 'dispose', 'apposed', 'disposed', 'oppose', 'supposed', 'opposite', 'supposes', 'Suppose'])
>
>
> Do you think the method is worth investigating?
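
For readers following along, the candidate-filtering idea in the quoted message could be sketched roughly as below. This is not Matthieu's script (which was not posted) and not Clang code. The names bigrams, build_index and candidates are hypothetical, and lowercasing words and breaking ties alphabetically are assumptions made only for this illustration.

```python
from collections import Counter, defaultdict


def bigrams(word):
    """Return the set of two-letter substrings of word.

    Lowercasing is an assumption of this sketch, not part of the original post.
    """
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def build_index(words):
    """Map each bigram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for bg in bigrams(word):
            index[bg].add(word)
    return index


def candidates(index, query, max_distance=2):
    """Return words sharing at least one bigram with query.

    Words whose length differs from the query by more than max_distance
    are skipped, since their edit distance must exceed max_distance.
    The result is ordered by number of shared bigrams, most first
    (ties broken alphabetically, an arbitrary choice).
    """
    shared = Counter()
    for bg in bigrams(query):
        for word in index.get(bg, ()):
            if abs(len(word) - len(query)) <= max_distance:
                shared[word] += 1
    ranked = sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked]
```

The edit distance would then be computed only on the returned list, starting from the front, instead of on every known identifier.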
Yes, I do. Typo correction performance is very important (and has historically been very poor).
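Note that there are also more efficient algorithms for computing the Levenshtein distance (e.g., d*min(m, n) rather than m*n, where m and n are the lengths of the two strings and d is the maximum edit distance we care about). We should also consider those.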
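
As a rough illustration of the cheaper computation mentioned above, here is a minimal sketch of a bounded edit distance that only fills the cells within max_distance of the diagonal and gives up as soon as a whole row exceeds the bound. The function name bounded_levenshtein and its None-on-failure convention are assumptions for this example, not Clang's implementation. For simplicity it still allocates a full row per step, so it shows the band restriction rather than a tuned d*min(m, n) implementation.

```python
def bounded_levenshtein(a, b, max_distance):
    """Return the edit distance between a and b if it is <= max_distance,
    otherwise None."""
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter string
    m, n = len(a), len(b)
    if n - m > max_distance:
        return None                      # length gap alone exceeds the bound
    big = max_distance + 1               # sentinel meaning "over the bound"

    # prev[i] holds the (capped) distance between a[:i] and b[:j-1].
    prev = [i if i <= max_distance else big for i in range(m + 1)]
    for j in range(1, n + 1):
        cur = [big] * (m + 1)
        if j <= max_distance:
            cur[0] = j                   # delete the first j characters of b
        # Only cells with |i - j| <= max_distance can stay within the bound.
        lo = max(1, j - max_distance)
        hi = min(m, j + max_distance)
        for i in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[i] = min(prev[i] + 1,         # skip b[j-1]
                         cur[i - 1] + 1,      # skip a[i-1]
                         prev[i - 1] + cost,  # match or substitute
                         big)                 # cap at the sentinel
        if min(cur) >= big:
            return None                  # every path is already over the bound
        prev = cur
    return prev[m] if prev[m] <= max_distance else None
```

Lowering max_distance each time a better candidate is found means later candidates are rejected after only a few rows.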
- Doug