Suffix arrays

Please refer to Wikipedia for what suffix array is.

This sections describes how to use Python APIs from fasttextsearch to create suffix arrays.

Caution

The above Wikipedia article assumes the sentinel letter $ is lexicographically smaller than any other character, while in fasttextsearch we assume it is larger than all other characters and we call the sentinel letter EOS.

The following code shows how to create the suffix array for the string banana:

Listing 1 Create suffix array for the string banana.
#!/usr/bin/env python3

import numpy as np
import textsearch

a = np.fromstring("banana", dtype=np.int8)
print(a)

suffix_array = textsearch.create_suffix_array(a)
print(suffix_array)

for i in suffix_array[:-1]:
    print(a[i:].tobytes().decode("utf-8") + "$")

"""
The output is:

[ 98  97 110  97 110  97]
[1 3 5 0 2 4 6]

anana$
ana$
a$
banana$
nana$
na$
"""

Different from the example in https://en.wikipedia.org/wiki/Suffix_array, $ is the largest symbol in fasttextsearch, so the first smallest substring is anana$ instead of $.

Please see textsearch.create_suffix_array() for details.