Suffix arrays
Please refer to Wikipedia for what suffix array is.
This sections describes how to use Python APIs from fasttextsearch to create suffix arrays.
Caution
The above Wikipedia article
assumes the sentinel letter $
is lexicographically smaller than any other
character, while in fasttextsearch we assume it is larger than all other characters
and we call the sentinel letter EOS
.
The following code shows how to create the suffix array for the string banana
:
#!/usr/bin/env python3
import numpy as np
import textsearch
a = np.fromstring("banana", dtype=np.int8)
print(a)
suffix_array = textsearch.create_suffix_array(a)
print(suffix_array)
for i in suffix_array[:-1]:
print(a[i:].tobytes().decode("utf-8") + "$")
"""
The output is:
[ 98 97 110 97 110 97]
[1 3 5 0 2 4 6]
anana$
ana$
a$
banana$
nana$
na$
"""
Different from the example in https://en.wikipedia.org/wiki/Suffix_array,
$
is the largest symbol in fasttextsearch, so the first smallest
substring is anana$
instead of $
.
Please see textsearch.create_suffix_array()
for details.