Find close matches
Assuming the suffix array was created from a text where the first query_len positions represent the query text and the remaining positions represent the reference text, textsearch.find_close_matches() returns a list indicating, for each suffix position in the query text, the two suffix positions in the reference text that immediately precede and follow it lexicographically.
The following example demonstrates textsearch.find_close_matches().
#!/usr/bin/env python3
import numpy as np
import textsearch
query = "hi"
document = "howareyou"
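# np.fromstring in binary mode is deprecated; encode the text and copy the
# buffer to get an equivalent writable int8 array.
full_text = np.frombuffer((query + document).encode("utf-8"), dtype=np.int8).copy()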
suffix_array = textsearch.create_suffix_array(full_text)
close_matches = textsearch.find_close_matches(
    suffix_array=suffix_array,
    query_len=len(query),
)
print("n\t\tpos\t\ttype\t\tsubstring")
print("-" * 65)
for i in range(suffix_array.size - 2):
    # Positions below len(query) belong to the query, the rest to the document.
    t = "query" if suffix_array[i] < len(query) else "document"
    sub = full_text[suffix_array[i] :].tobytes().decode("utf-8")
    print(i, suffix_array[i], t, sub, sep="\t\t")
print(close_matches)
"""
The output is:
n       pos     type        substring
-----------------------------------------------------------------
0       5       document    areyou
1       7       document    eyou
2       0       query       hihowareyou
3       2       document    howareyou
4       1       query       ihowareyou
5       9       document    ou
6       3       document    owareyou
7       6       document    reyou
8       10      document    u
9       4       document    wareyou
[[7 2]
 [2 9]]
"""
The query string is hi and the document string is howareyou, so full_text is hihowareyou.
For the first query position (the suffix hihowareyou, starting with h), the nearest document suffix preceding it in the sorted order is eyou at position 7 in full_text, and the nearest document suffix following it is howareyou at position 2 in full_text, so the close match for h is (7, 2).
Similarly, for the second query position (the suffix ihowareyou, starting with i), the nearest document suffix preceding it is howareyou at position 2 in full_text, and the nearest document suffix following it is ou at position 9 in full_text, so the close match for i is (2, 9).
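The lookup described above can be reproduced without the library, which may help when checking results by hand. The following is a minimal pure-Python sketch, not the implementation used by textsearch: the function name naive_close_matches and the missing sentinel (returned when a query suffix has no document neighbour on one side) are assumptions made for illustration, and the suffix array is built by naive sorting rather than by an efficient algorithm.

```python
def naive_close_matches(text, query_len, missing=-1):
    """For each query position, return [preceding, following] document positions.

    `missing` is a hypothetical sentinel used when no document suffix exists
    on that side; the real library may handle this case differently.
    """
    # Naive suffix array: start positions sorted by the suffix they begin.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    result = [[missing, missing] for _ in range(query_len)]

    # Forward pass: track the most recent document position seen so far.
    prev_doc = missing
    for pos in suffix_array:
        if pos >= query_len:
            prev_doc = pos
        else:
            result[pos][0] = prev_doc

    # Backward pass: track the next document position in sorted order.
    next_doc = missing
    for pos in reversed(suffix_array):
        if pos >= query_len:
            next_doc = pos
        else:
            result[pos][1] = next_doc

    return result


query = "hi"
document = "howareyou"
print(naive_close_matches(query + document, len(query)))  # [[7, 2], [2, 9]]
```

Two linear passes over the sorted order are enough because the nearest document neighbour of a query suffix is simply the last document suffix seen before it, in either direction.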