Lexicographically least circular substrings |
This page describes one of the errors I have made in my published research.
In 1980, I published an algorithm to compute the lexicographically least circular shift (LCS) of a string.
The need for the algorithm arose in work I was doing (with George Lueker) on efficiently testing whether two PQ-trees were isomorphic. In the isomorphism algorithm, labels were generated that needed to be compared but labels were considered to be the same if one was a circular shift of the other, so the LCS algorithm was developed to convert each label to a canonical form (the LCS).
In September 2014, Radomír Polách wrote to tell me that there was an error in the algorithm as published (a greater-than-or-equal comparison that should have been a greater-than comparison). He provided the string “alfa” as an example of when the algorithm as published fails. It is worth noting that the faulty if-test was not actually necessary for the algorithm ‐ it was a shortcut to save a few comparisons when they were not required.
I am grateful to Radomír Polách for pointing out the error in the published version of the algorithm, which has been corrected in the C++ code below.
Subsequently, in August 2021, Tomasz Miąsko pointed out that the corrected version fails on the family of strings aaba, aaaba, aaaaba, aaaaaba, . . . (an+2ba) by returning 0 instead of n-1. In March 2023, Andrew Helwer pointed out the same error with 0010 as an example of a string for which the corrected version fails.
There is a Wikipedia article that includes a description of my algorithm if you want to catch up on what has been done to improve on the algorithm since my paper was published.
The Wikipedia article notes that “The correctness of the algorithm is somewhat difficult to understand, but it is easy to implement.” As the article notes, my algorithm is based on the Knuth-Pratt-Morris (KMP) pattern-matching algorithm. One of my favorite articles, by Richard De Millo, Richard Lipton, and Alan Perlis (see pick hits on my home page), published in 1979 the year before my LCS article, includes a great anecdote about how the KMP algorithm had been used in the text editor on a timesharing system at UC Berkeley but had eventually been removed because the code was too hard to understand so it was replaced with the naive algorithm whose worst-case time is proportional to the product of the lengths of the pattern and the string rather than sum of the lengths.
The following is a translation of the code in my original IPL article “Lexicographically least circular substrings” from Algol-60 to C++. It is based on the code in the Wikipedia article, but rather than copying the string after itself to avoid using modular arithmetic in the subscripts (indices) to access individual characters in the string, we use modular subscripts as in the original article but hide away the arithmetic in a subclass modularString of the C++ std::string class. A good compiler might optimize out most of the extra instructions required for modular arithmetic, but even without optimization these would not change the asymptotic running time of the algorithm.
/* ========================================================================
The first if-test has been corrected to be > instead of >= (as pointed
out by Radomír Polách). Additional tests by Miąsko and Helwer suggest
that the shortcut needs further correction.
IPL_VERSION controls whether the shortcut if-test will be used and if
it will be the original (incorrect) version or the corrected version.
Kellogg S. Booth
2014 September 05
======================================================================== */
#define IPL_VERSION 2 // 0 no shortcut, 1 as in IPL paper, 2 with correction
#include <string>
#include <iostream>
using namespace std;
// Encapsulate modular arithmetic within a subclass of string.
class modularString : public string
{
private:
int modIndex( int index ) const
{ return index % this->length(); }
public:
modularString( const string& initial ) : string( initial ) {}
char operator[] (int index) const
{ return string::operator[]( modIndex( index ) ); }
char& operator[] (int index )
{ return string::operator[]( modIndex( index ) ); }
};
int LCS( const modularString& b )
{
int n = b.length();
int f[ 2*n ];
for ( int i=0; i<2*n; i++ ) f[i] = -1;
int k = 0;
for ( int j=1; j<2*n; j++ )
{
#if IPL_VERSION==1
if ( j-k >= n ) // The published version uses >=, which is wrong!
return k;
#elif IPL_VERSION==2
if ( j-k > n ) // The correct version uses > instead.
return k;
#else
// The algorithm works just fine without the shortcut.
#endif
int i = f[ j-k-1 ];
while ( ( i != -1 ) && ( b[ j ] != b[ k+i+1 ] ) )
{
if ( b[ j ] < b[ k+i+1 ] )
k = j-i-1;
i = f[ i ];
}
if ( ( i == -1 ) && ( b[ j ] != b[ k+i+1 ] ) )
{
if ( b[ j ] < b [ k+i+1] )
k = j;
f[ j-k ] = -1;
}
else f[ j-k ] = i+1;
}
return k;
}
int main()
{
// The first test case is Radomír Polách's example
modularString test( "alfa" );
do
{
cout << test << endl;
int k = LCS( test );
for ( int i=0; i < k; i++ )
cout << " ";
cout << "^" << endl;
// Accept additional test cases from stdin until done
cin >> test;
} while ( cin );
}