[cfe-dev] Spell Correction Efficiency

Matthieu Monrocq matthieu.monrocq at gmail.com
Mon Nov 22 05:28:57 PST 2010


Hi Dave,

actually I try not to think :)

Truth to be told I have little experience in the domain, but I have time at
my disposal, therefore rather than thinking I simply brute-force the tests
and see what comes best. I'd rather not assume, especially performance-wise.

For the prefix / suffix implementation, I'm waiting for a "typical" lexicon
to see what comes out first. I'm in the hope than it should not matter,
unless I have a glaring bottleneck in the code, but as I said i prefer to
measure. Frankly it's the least of my worries since it suffices to reverse
the strings to get from one to the other so there is no technical difficulty
there.

I know that my trie implementation is probably lacking (as I said, it's
naive), so I am trying to see what is the state of the art on storing
dictionaries and notably how to "pack" tries to get a smaller memory
foot-print and a faster query. The idea is to gather the various approaches,
implement them and benchmark them, and I'll see what comes on top.

I'd really like to see this hash-trie implementation if you could send it to
me. I don't mind having to read C / C++ code as long as it's not obfuscated,
in fact I'll probably feel much more at ease than decoding the scientific
paper I got on the automaton theory and which took me a good week to
implement :)

Thanks,
Matthieu.

2010/11/22 David Tweed <david.tweed at gmail.com>

> On Sun, Nov 21, 2010 at 7:58 PM, Matthieu Monrocq
> <matthieu.monrocq at gmail.com> wrote
> >> We had a discussion a month ago I think about the Spell Correction
> >> algorithm which was used in CLang, notably for auto-completion, based on
> the
> >> Levenstein distance.
>
> I presume you've thought about whether suffix-tries and prefix-tries
> would work for the case of typos on computer identifiers?
>
> > - a good Trie implementation, or hints in how to implement one. A Trie
> seems
> > to be the best index (quite compact yet ordered), however I only have a
> > naive implementation at the moment.
>
> I've got a hash-trie implementation (the datastructure that is used,
> eg, to store the dictionary of hypenation exceptions in TeX) in C++
> that's heavily designed for small alphabets and integer "payloads", so
> it's unlikely to be directly suitable, but you can have  a look at it
> if you like. (Drop me a personal email.)
>
> --
> cheers, dave tweed__________________________
> computer vision reasearcher: david.tweed at gmail.com
> "while having code so boring anyone can maintain it, use Python." --
> attempted insult seen on slashdot
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20101122/4de11e2d/attachment.html>


More information about the cfe-dev mailing list