text processing – How can I find regions of identical content in a file?


If I have a file like this:

foo
bar
bat
hukarz
foo
bar
bat

Then I would like to be made aware that there is one region that is identical to another region:

foo
bar
bat

The reason is that I have some large text files containing identical regions, often repeated more than once, and I want to clean them up.

Lingo4G and the Carrot2 engine define this as Document Overlap and Pairwise Similarity, that is, identifying identical text fragments in documents and returning information useful for visualizing such regions.

(Image: Carrot2 engine identifying identical or similar regions in a file)

I’ve looked at Carrot2, but it seems to add a lot of complexity. Are there simpler alternatives I could look at?
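
To make the goal concrete, here is a minimal sketch of one lightweight approach in plain Python, not a replacement for Carrot2 and not a definitive implementation. It slides a window of consecutive lines over the file and reports every window that occurs more than once. The function name `find_repeated_blocks` and the `min_lines` parameter are hypothetical names chosen for illustration. It assumes exact line-for-line matches and a file that fits in memory.

```python
from collections import defaultdict


def find_repeated_blocks(lines, min_lines=3):
    """Return {block: [1-based start lines]} for every run of
    `min_lines` consecutive lines that occurs more than once.

    Hypothetical helper for illustration; matches must be exact.
    """
    positions = defaultdict(list)
    # Slide a fixed-size window over the lines and record where
    # each distinct window content starts.
    for start in range(len(lines) - min_lines + 1):
        block = tuple(lines[start:start + min_lines])
        positions[block].append(start + 1)
    # Keep only the windows that were seen at least twice.
    return {block: starts for block, starts in positions.items()
            if len(starts) > 1}


# The example file from the question, one list entry per line.
sample = ["foo", "bar", "bat", "hukarz", "foo", "bar", "bat"]

for block, starts in find_repeated_blocks(sample).items():
    print(f"starts at lines {starts}: {' / '.join(block)}")
```

For the sample above, the only repeated three-line block is foo / bar / bat, starting at lines 1 and 5. A region longer than `min_lines` shows up as several overlapping windows, so merging adjacent hits into maximal regions would be a further step. Runs of blank lines also count as matches unless they are filtered out first.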


