Find close matches
Assuming the suffix array was created from a text where the first
query_len positions represent the query text and the remaining positions
represent the reference text, textsearch.find_close_matches()
returns a list indicating, for each suffix position in the query text, the two
suffix positions in the reference text that immediately precede and follow it lexicographically.
The following gives an example about textsearch.find_close_matches().
#!/usr/bin/env python3
import numpy as np
import textsearch
query = "hi"
document = "howareyou"
full_text = np.fromstring(query + document, dtype=np.int8)
suffix_array = textsearch.create_suffix_array(full_text)
close_matches = textsearch.find_close_matches(
suffix_array=suffix_array,
query_len=len(query),
)
print("n\t\tpos\t\ttype\t\tsubstring")
print("-" * 65)
for i in range(suffix_array.size - 2):
t = "query" if suffix_array[i] < len(query) else "document"
sub = full_text[suffix_array[i] :].tobytes().decode("utf-8")
print(i, suffix_array[i], t, sub, sep="\t\t")
print(close_matches)
"""
The output is:
n pos type substring
-----------------------------------------------------------------
0 5 document areyou
1 7 document eyou
2 0 query hihowareyou
3 2 document howareyou
4 1 query ihowareyou
5 9 document ou
6 3 document owareyou
7 6 document reyou
8 10 document u
9 4 document wareyou
[[7 2]
[2 9]]
"""
We have the query string hi and the document string howareyou.
For the first character h from the query, we can see that the first
substring preceding it from the document is eyou at position 7 in the
full_text and the first substring following it is howareyou at
position 2 in the full_text, so the close match for h is (7, 2).
Similarly, for the second character i from the query, we can see that the first
substring preceding it from the document is howareyou at position 2 in the
full_text and the first substring following it is ou at
position 9 in the full_text, so the close match for h is (2, 9).