If I have a file like this:
foo
bar
bat
hukarz
foo
bar
bat
Then I would like to be made aware that there is one region that is identical to another region:
foo
bar
bat
The reason is that I have have some large text files and I have identical regions, often more than one time. I want to clean them up.
Lingo4G and the Carrot2 engine defines this as Document Overlap and Pairwise Similarity, as in how to identify identical text fragments in documents and returning information useful for visualization of such regions.
I’ve tried to look at Carrot2, but it seems to add a lot of complexity. I was thinking to ask here if there are other alternatives to look at.