[llvm-bugs] [Bug 28780] New: "edit distance" algorithm is wrong

via llvm-bugs llvm-bugs at lists.llvm.org
Sat Jul 30 09:02:57 PDT 2016


https://llvm.org/bugs/show_bug.cgi?id=28780

            Bug ID: 28780
           Summary: "edit distance" algorithm is wrong
           Product: compiler-rt
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: compiler-rt
          Assignee: unassignedbugs at nondot.org
          Reporter: jiangg at mail.ustc.edu.cn
                CC: llvm-bugs at lists.llvm.org, nicolasweber at gmx.de
    Classification: Unclassified

In "llvm/include/llvm/ADT/edit_distance.h:ComputeEditDistance()" function, if
two strings(or arrays) are both empty, the result would be a random value, but
expected zero.
The current version on 2016.07.29 is:

    template<typename T>
    unsigned ComputeEditDistance(ArrayRef<T> FromArray, ArrayRef<T> ToArray,
                                 bool AllowReplacements = true,
                                 unsigned MaxEditDistance = 0) {
      // The algorithm implemented below is the "classic"
      // dynamic-programming algorithm for computing the Levenshtein
      // distance, which is described here:
      //
      //   http://en.wikipedia.org/wiki/Levenshtein_distance
      //
      // Although the algorithm is typically described using an m x n
      // array, only one row plus one element are used at a time, so this
      // implementation just keeps one vector for the row.  To update one
entry,
      // only the entries to the left, top, and top-left are needed.  The left
      // entry is in Row[x-1], the top entry is what's in Row[x] from the last
      // iteration, and the top-left entry is stored in Previous.
      typename ArrayRef<T>::size_type m = FromArray.size();
      typename ArrayRef<T>::size_type n = ToArray.size();

      const unsigned SmallBufferSize = 64;
      unsigned SmallBuffer[SmallBufferSize];
      std::unique_ptr<unsigned[]> Allocated;
      unsigned *Row = SmallBuffer;
      if (n + 1 > SmallBufferSize) {
        Row = new unsigned[n + 1];
        Allocated.reset(Row);
      }

      for (unsigned i = 1; i <= n; ++i)
        Row[i] = i;

      for (typename ArrayRef<T>::size_type y = 1; y <= m; ++y) {
        Row[0] = y;
        unsigned BestThisRow = Row[0];

        unsigned Previous = y - 1;
        for (typename ArrayRef<T>::size_type x = 1; x <= n; ++x) {
          int OldRow = Row[x];
          if (AllowReplacements) {
            Row[x] = std::min(
                Previous + (FromArray[y-1] == ToArray[x-1] ? 0u : 1u),
                std::min(Row[x-1], Row[x])+1);
          }
          else {
            if (FromArray[y-1] == ToArray[x-1]) Row[x] = Previous;
            else Row[x] = std::min(Row[x-1], Row[x]) + 1;
          }
          Previous = OldRow;
          BestThisRow = std::min(BestThisRow, Row[x]);
        }

        if (MaxEditDistance && BestThisRow > MaxEditDistance)
          return MaxEditDistance + 1;
      }

      unsigned Result = Row[n];
      return Result;
    }

The above algorithm was first introduced on 2015.07.13: git-svn-id:
https://llvm.org/svn/llvm-project/llvm/trunk@242069
91177308-0d34-0410-b5e6-96231b3b80d8. And the last right version is git-svn-id:
https://llvm.org/svn/llvm-project/llvm/trunk@240390
91177308-0d34-0410-b5e6-96231b3b80d8.

The above `ComputeEditDistance` function is used in
"llvm/lib/Support/StringRef.cpp:StringRef::edit_distance()". So you can
reproduce the bug via calling `StringRef::edit_distance()`. NOTE that the
returned random value may be zero.

How to fix it? The simple solution is to initialize the stack-based array
explictly: `unsigned SmallBuffer[SmallBufferSize]{};`. The safer solution is to
use RAII-style container, however, at the cost for allocating memory dynamicly
even if small number of elements.

Accordingly, unit test can be strengthened in
"llvm/unittests/ADT/StringRefTest.cpp":

    diff --git a/unittests/ADT/StringRefTest.cpp
b/unittests/ADT/StringRefTest.cpp
    index 66e5944..306dc23 100644
    --- a/unittests/ADT/StringRefTest.cpp
    +++ b/unittests/ADT/StringRefTest.cpp
    @@ -390,6 +390,11 @@ TEST(StringRefTest, Count) {
     }

     TEST(StringRefTest, EditDistance) {
    +  StringRef StrNull;
    +  StringRef StrEmpty("");
    +  EXPECT_EQ(0U, StrNull.edit_distance(""));
    +  EXPECT_EQ(0U, StrEmpty.edit_distance(""));
    +
       StringRef Str("hello");
       EXPECT_EQ(2U, Str.edit_distance("hill"));
     }

That's all.

Gang JIANG
jiangg at mail.ustc.edu.cn
http://justme0.com/
University of Science and Technology of China

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20160730/194f1424/attachment-0001.html>


More information about the llvm-bugs mailing list