Since I started to blog, I presented different methods, techniques and tricks to solve problems related to Data mining, and text mining (... I'm still puzzled about the difference between text mining and text analytics).
Today I would like to present a different approach to solve a common problem in the text analytics area: the document comparison problem.The problem
Usually if you want to compare two documents in order to check if they are pretty much the same you can apply some easy tricks in order to check frequency and position of the words.
Sometime the documents require an OCR processing in order to get the "text rendition" for the documents, and such process usually is a very expensive in term of computational time (often it takes more than 10 seconds per page): what if you have to perform many comparison checks and you are short in time?
...In that case I suggest to consider the documents as images!
Image processing approach
This branch of the computer science evolved quite fast in the last decade and it achieves a level of complexity extremely high.
I'm not an expert at all, and many libraries (often free as opencv) are available and they offer several functionalities: image comparison, image recognition and so on.
Unfortunately I'm stubborn and curious and I like try to do experiments :).
So I decided to implement a very easy algorithm able to compare with high accuracy documents very similar.
Algorithm:
1. Resize the image (in order to have images having the same number of pixels)
2. Binarize the image
3. compute the histogram distribution
4. perform the comparison considering the Kullback distance between the two empirical distributions got from the histograms.
Considerations:
- such algorithm is not robust to linear transformation/distortions of the image (but this is not case when you compare documents...).
- it's very fast respect the solutions that require OCR processing.
- it could be easily optimized considering sub-block of the images (and give higher weights to blocks in the middle of the image).
- through some assumptions based on eigenvectors it could be helpful also in case of image linear transformations.
I selected 10 random images (pictures, medical analysis, text, screenshot):
Original sample set |
Histograms for the above images |
Noised sample set. |
Histograms for the noised images |
Results:
I compared each possible possible combination of the pairs composed by original image and noised image and I ranked the result of the comparison.
...The test has been passed with an accuracy of 100%.
I have to say that the images are quite heterogeneous among each other, but further tests (I'll publish them in one of the next posts) highlighted that the method is quite robust (with all the limitations explained above).
Here is some of the best matching found:
Some of the best matching found |
...I have to say that the image processing is a very fascinating science!
Stay tuned
cristian
Stay tuned
cristian
Hi, Cristian.
ReplyDeleteAlthough you like to make your experiments yourself, what I appreciate, it would be nice if you have the opportunity to check Mathias Lux blog (www.semanticmetadata.net). He maintains Lire, a Java tool that makes use of Lucene for indexing and searching, on a wide variety of image representations which include color histograms. Lire provides image search in a query-by-example fashion, and one funny thing is that you cand fill a canvas with colors and shapes and get similar images to your picture.
The point is that Lire features more than a dozen of image representation schemas (see e.g. http://www.semanticmetadata.net/2013/03/12/updates-on-lire-svn-rev-39/), this can give you some ideas.
Good post, I will be waiting for the next one.
Hi jose,
ReplyDeletethx a lot for your comment and the link you suggest: I'l look at it!.
cheers
c.
Useful information you have shared in this article.Outsource data entry services India
ReplyDeleteNice post.
ReplyDeleteAmazing content.
ReplyDeleteData Mining Process
According to sites like Comparably, eventprofs ZipRecruiter, and Salary.com, the average wedding planner salary ranges from $14 per hour to $420,000 per year, which doesn’t really answer the question on the minds of anyone trying to get into the industry: How much do wedding planners make, exactly? thanks for birthday wishes and happy heavenly birthday
ReplyDelete