Since I started to blog, I presented different methods, techniques and tricks to solve problems related to Data mining, and text mining (... I'm still puzzled about the difference between text mining and text analytics).Today I would like to present a different approach to solve a common problem in the text analytics area: the document comparison problem.
Usually if you want to compare two documents in order to check if they are pretty much the same you can apply some easy tricks in order to check frequency and position of the words.
Sometime the documents require an OCR processing in order to get the "text rendition" for the documents, and such process usually is a very expensive in term of computational time (often it takes more than 10 seconds per page): what if you have to perform many comparison checks and you are short in time?
...In that case I suggest to consider the documents as images!
Image processing approach
This branch of the computer science evolved quite fast in the last decade and it achieves a level of complexity extremely high.
I'm not an expert at all, and many libraries (often free as opencv) are available and they offer several functionalities: image comparison, image recognition and so on.
Unfortunately I'm stubborn and curious and I like try to do experiments :).
So I decided to implement a very easy algorithm able to compare with high accuracy documents very similar.
1. Resize the image (in order to have images having the same number of pixels)
2. Binarize the image
3. compute the histogram distribution
4. perform the comparison considering the Kullback distance between the two empirical distributions got from the histograms.
- such algorithm is not robust to linear transformation/distortions of the image (but this is not case when you compare documents...).
- it's very fast respect the solutions that require OCR processing.
- it could be easily optimized considering sub-block of the images (and give higher weights to blocks in the middle of the image).
- through some assumptions based on eigenvectors it could be helpful also in case of image linear transformations.
I selected 10 random images (pictures, medical analysis, text, screenshot):
|Original sample set|
|Histograms for the above images|
|Noised sample set.|
|Histograms for the noised images|
I compared each possible possible combination of the pairs composed by original image and noised image and I ranked the result of the comparison.
...The test has been passed with an accuracy of 100%.
I have to say that the images are quite heterogeneous among each other, but further tests (I'll publish them in one of the next posts) highlighted that the method is quite robust (with all the limitations explained above).
Here is some of the best matching found:
|Some of the best matching found|
...I have to say that the image processing is a very fascinating science!