Find close matches

Assuming the suffix array was created from a text where the first query_len positions represent the query text and the remaining positions represent the reference text, textsearch.find_close_matches() returns a list indicating, for each suffix position in the query text, the two suffix positions in the reference text that immediately precede and follow it lexicographically.

The following gives an example about textsearch.find_close_matches().

#!/usr/bin/env python3

import numpy as np
import textsearch

query = "hi"
document = "howareyou"
full_text = np.fromstring(query + document, dtype=np.int8)

suffix_array = textsearch.create_suffix_array(full_text)

close_matches = textsearch.find_close_matches(
    suffix_array=suffix_array,
    query_len=len(query),
)

print("n\t\tpos\t\ttype\t\tsubstring")
print("-" * 65)
for i in range(suffix_array.size - 2):
    t = "query" if suffix_array[i] < len(query) else "document"
    sub = full_text[suffix_array[i] :].tobytes().decode("utf-8")
    print(i, suffix_array[i], t, sub, sep="\t\t")

print(close_matches)

"""
The output is:

n               pos             type            substring
-----------------------------------------------------------------
0               5               document                areyou
1               7               document                eyou
2               0               query           hihowareyou
3               2               document                howareyou
4               1               query           ihowareyou
5               9               document                ou
6               3               document                owareyou
7               6               document                reyou
8               10              document                u
9               4               document                wareyou
[[7 2]
 [2 9]]
"""

We have the query string hi and the document string howareyou.

For the first character h from the query, we can see that the first substring preceding it from the document is eyou at position 7 in the full_text and the first substring following it is howareyou at position 2 in the full_text, so the close match for h is (7, 2).

Similarly, for the second character i from the query, we can see that the first substring preceding it from the document is howareyou at position 2 in the full_text and the first substring following it is ou at position 9 in the full_text, so the close match for h is (2, 9).