Tuesday, July 10, 2012

Data Mining: Tools and Certificates

As a member of many LinkedIn groups related to data mining and text mining, I read many threads about certificates that are supposed to help with job hunting and strengthening a CV, and many other threads about miraculous tools able to solve any problem.
Is being certified really worth it?
In my experience, a certificate in a specific data mining tool can be a plus on your CV, but it doesn't really improve your knowledge of the field.
Let me explain better (which is not easy with my bad English): the certification system is a market, and its aim is to generate profit or to promote products.

Data mining tools
My question is: do you really think a tool can exist that covers every aspect of data mining?

I guess the number of data mining problems is so high that maybe we could use Cantor's diagonalization to prove that they are uncountable :)

In my opinion, the common belief that you can get tangible benefits from mining your data just by clicking here and there in a piece of software is far too naive.
The "data mining" label was created by marketing people just to summarize in a buzzword the techniques of applied statistics and applied mathematics used on the data stored on your hard disk.
I don't want to say that tools are useless, but it should be clear that tools are only a means to solve a problem, not the solution.
  • In the real world, problems are never standard, and you can very seldom take an algorithm as is to solve them! ...maybe I'm unlucky, but I have never solved a real problem with a standard method.
  • Tool X is able to load terabytes of data. So what? A good data miner should know that you do not need to consider the entire population to analyze a phenomenon: you should be able to sample your population in a way that ensures the required margin of error! ... this technique is simply called statistics!
  • If you really want to claim "I know this approach very well", you must be able to implement it yourself: only by implementing it yourself can you deeply understand in which context the algorithm works, under which conditions it performs better than other methods, and so on. Don't rely on a single paper that compares a few techniques: if you change just one of the conditions, the results can be terribly different.
  • Without theory you cannot go deep: let's consider a tool such as Mathematica or R or ... These tools give the user access to a large library of predefined algorithms and routines, they provide visualization functions to show results in a fancy way, and last but not least they provide a complete programming language to code whatever you want. I love them, but I couldn't do anything without the theory behind the problem. Mathematica can give me the algorithm to cluster a data set through k-means: but how can I be sure that it is the right algorithm for my problem? (click here for a demo).
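The sampling point in the list above can be made concrete with a small sketch. It assumes the quantity of interest is a proportion estimated from a simple random sample, and it uses the usual normal-approximation formula (Cochran's formula) with a finite population correction. The function name and its parameters are illustrative and do not come from any particular tool.

```python
import math
from statistics import NormalDist


def required_sample_size(population, margin, confidence=0.95, p=0.5):
    """Sample size needed to estimate a proportion within +/- margin.

    Assumptions (illustrative): simple random sampling, normal
    approximation, and p=0.5 as the worst-case proportion.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# One billion records, +/- 1 percentage point at 95% confidence.
print(required_sample_size(10 ** 9, 0.01))
```

Under these assumptions the required sample grows very slowly with the population size, so the ability to load every record matters far less than the ability to draw a proper sample.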
Actually, I would rather attend a course to deepen some aspects of multivariate statistics, or seminars on new methodologies for solving a problem, than pay plenty of money to learn every single detail of a tool that may not be on the market in 5 years.
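To illustrate the k-means question raised in the list above, here is a minimal sketch of my own, not the demo linked from the post. It implements plain Lloyd-style k-means by hand and runs it on two concentric rings, a synthetic data set chosen as an assumption because its natural groups are not separable by a straight line. All names are hypothetical.

```python
import math


def ring(radius, n):
    """n points evenly spaced on a circle centred at the origin."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]


def kmeans(points, centroids, iterations=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return labels


def clusters_mix_rings(labels, truth):
    """True if some cluster contains points from more than one ring."""
    return any(len({t for lab, t in zip(labels, truth) if lab == c}) > 1
               for c in set(labels))


inner, outer = ring(1.0, 40), ring(5.0, 40)
points = inner + outer
truth = [0] * len(inner) + [1] * len(outer)
labels = kmeans(points, [inner[0], outer[20]])
print("k-means mixes the two rings:", clusters_mix_rings(labels, truth))
```

With two centroids the boundary between clusters is a straight line, so no run of k-means can put the inner ring in one cluster and the outer ring in the other. The tool happily returns an answer anyway, and only the theory tells you it is the wrong one for this data.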

I know that companies often look for people certified on a famous tool just because they bought it and they need to reduce the time it takes to "integrate" a new resource into a team. Fair enough! ...but I think it is ridiculous to demand certificates as a strict requirement!
I'm really curious to know your experiences and opinions.


  1. As an individual attempting to hire truly bright minds in the data mining space, I approach certifications with a heavy dose of skepticism. Too many times have I seen individuals with multiple certifications in a single popular tool and yet have no idea about the theory behind the practice. Anyone can press a button and make something magical happen, but few actually know what just happened.

    1. ... so I could try to send you my CV!
      cheers & thanks for your contribution!

  2. Hi Cristian, how are you doing?

    Great post, I would like to give my 2 cents about this issue...

    Your post is very pertinent nowadays, because there is a boom in tools, and software vendors want to win over more people and keep their clients locked in to their tools.

    I think DM is more than tools, and I would venture to say that the tool is less important than the methodology used to run a project.



  3. I have worked in 'research' for many years - long enough to have witnessed the progression of the PC and of 'data thinking'. The PC put tools into the hands of those in the office - primarily spreadsheet, word processing and presentation oriented. These tools increased productivity in the workplace immeasurably.

    But other tools were also provided - in particular, analytic tools. Now, you can put a hammer and saw in people's hands too - and most will smash a finger or, worse, cut one off. Only a small percentage of the population could fashion anything professional with those tools. And, not surprisingly, I watched for 10 years as people ran analytic routines they knew next to nothing about.

    Now I see efforts to automate big data collection. I see firms actually advertise that the form or origin of data is unimportant - let's just collect it, amalgamate and summarise (or worse - analyze).

    So, after years of people running routines they don't understand - now it seems they want to collect and run data in those routines knowing next to nothing about the data as well.

    It astonishes me!

    I am not a mathematician or statistician. But unfortunately I know just enough to spot when things should not be done the way they are being done.

    If you do not understand your data and you do not have a methodology - YOU HAVE NOTHING! - you certainly do not have analytical statistics. Worse - you may have damaging information.

    But in the end - the models derived to illuminate usually fall far short of handling the real world - the exogenous influences dominate. In this world, right information or wrong information is almost irrelevant - outcomes are dictated largely by uncontrolled and unmeasured circumstance. Here, what the 'players' do, whether successes or failures, always finds rationalization.

  4. This is nice information about data mining tools. There are many data mining tools, such as RapidMiner, R, KNIME, NLTK, etc. If you want to know more about the data mining process, data mining services and their types, visit: http://www.loginworks.com/data-mining-services-various-type/