Wednesday, November 23, 2011

Neural Nets Tips and Tricks: add a recall output neuron

After my previous post about Neural Network frameworks, my inbox has been literally flooded!
Let me clarify something: I didn't develop my "home-made" app to compete with the major companies in this field! My experiment was done just to prove that there are so many different algorithms to train neural nets, and so many models designed for specific domains, that in my opinion the best approach to working with this elegant machine learning algorithm is to implement it in house. That's it!
Anyway, in these emails some readers suggested truly state-of-the-art models: dynamic momentum, entropy-based cost functions, and so on. I'm really happy with the collection of papers I got for free :)

As mentioned, in many cases customizing the algorithm is the only way to hit the target, but sometimes a few tricks can improve the learning even without changing the learning strategy!
Consider, for example, our XOR problem solved through neural networks.
Let's see how we can considerably reduce the number of epochs required to train the net.
As you know, Back Propagation is based on the famous delta rule, and for the "hidden to output" layer the delta is equal to:
Delta rule for the "hidden to output" layer
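For reference, a standard statement of that rule (assuming a logistic sigmoid with beta = 1 and the squared-error cost, as in the XOR post below) is, in LaTeX notation:

\Delta w_{jk} = \eta \, \delta_k \, o_j, \qquad \delta_k = (t_k - o_k)\, f'(net_k) = (t_k - o_k)\, o_k (1 - o_k)

and for a hidden neuron h the delta becomes

\delta_h = f'(net_h) \sum_k \delta_k \, w_{hk}

where the sum runs over the output neurons connected to h.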
How can we reinforce that delta to speed up the learning process without modifying the delta rule itself?
One possible answer is: duplicate the output neuron!!
So instead of changing the strategy, we slightly modify the topology of the net to obtain the following:
The modified network: notice how the delta for the neuron h1 changes

The neuron O2 has exactly the same target T as the neuron O1: basically, the heuristic consists in duplicating the target to reinforce the delta contribution.
In the above image you can see the new contribution provided by the delta of the neuron O2; when the network finds a good path in the gradient descent, the output of O1 will be similar to that of O2, and the delta for h1 will receive a double correction, because the delta of O1 will be pretty much the same as the delta of O2.
...I've done my best to explain this "by practical means"; from a theoretical perspective someone could turn up their nose at it (I apologize for that... but the soul of this blog is to favor the intuitive explanation).
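To make the trick concrete, here is a minimal sketch in Python/NumPy (illustrative only: the function names and hyperparameters are mine, not those of the implementation used for the experiments in this post; it assumes a logistic sigmoid, squared error and plain batch gradient descent):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_xor(n_outputs=1, eta=0.5, max_epochs=20000, tol=0.01, seed=0):
    """Train a 2-2-n_outputs back-propagation net on XOR.
    n_outputs=2 duplicates the target column: the "recall neuron" trick."""
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.repeat(np.array([[0.0], [1.0], [1.0], [0.0]]), n_outputs, axis=1)
    W1, b1 = rng.uniform(-1, 1, (2, 2)), rng.uniform(-1, 1, 2)
    W2, b2 = rng.uniform(-1, 1, (2, n_outputs)), rng.uniform(-1, 1, n_outputs)
    for epoch in range(1, max_epochs + 1):
        H = sigmoid(X @ W1 + b1)                  # hidden activations
        O = sigmoid(H @ W2 + b2)                  # output activations
        delta_o = (T - O) * O * (1 - O)           # delta rule at the output layer
        # every extra output neuron adds its own term to the hidden delta,
        # which is what reinforces the correction received by the hidden neurons
        delta_h = (delta_o @ W2.T) * H * (1 - H)
        W2 += eta * H.T @ delta_o
        b2 += eta * delta_o.sum(axis=0)
        W1 += eta * X.T @ delta_h
        b1 += eta * delta_h.sum(axis=0)
        if np.mean((T - O) ** 2) < tol:
            return epoch                          # converged
    return max_epochs                             # did not converge

print("plain net:    ", train_xor(n_outputs=1), "epochs")
print("recall neuron:", train_xor(n_outputs=2), "epochs")

Comparing the two calls for a few different seeds gives a quick, informal check of how much the duplicated target affects convergence.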
As usual, let me show the effects:
I ran 10 tests comparing the original configuration against the network modified with the "recall neuron", using exactly the same number of hidden neurons and the same parameter configuration.
In 7 cases the new net sped up the learning phase by roughly 210%.
In 2 cases the new net took the same time as the original net.
In 1 case the new net didn't find a solution (due to oscillation problems).
Here are some representative examples:
Original Net: convergence after 800 cycles.

Net with recall neuron: convergence after 120 cycles (notice how fast the error slumps) 
Here is another example:
Another trial with original configuration (convergence after 600 cycles)

Another example of the effectiveness of the method: convergence obtained around cycle 200.
Just to conclude, let me show how quickly the error slumps from the synapses' perspective. Below is the most representative error surface for the last trial shown above.
Error surface plot for the most important connections (referring to the above net): notice again how the error slumps very quickly.
Stay tuned
cristian

Sunday, November 13, 2011

Buy or build: a practical example to explain my point of view

I was working on the post about the relation between IFS fractals and analytical probability densities when an IT manager contacted me asking for an informal opinion about a tool for Neural Networks.
He told me: “we are evaluating two different tools to perform general-purpose data mining tasks, and we are oriented toward the tool xxx because the neural network module seems more flexible”.
My first reaction was: “sorry, but you have to solve a specific problem (the problem is a little bit complex to explain, but it is a classical function inversion problem) and you are looking for a generic suite: I’m sincerely confused!”
He justified himself by saying “If I buy a generic suite I can reuse it later for other problems!”: bloody managers! Always focused on saving $1 today to lose $10 tomorrow :)

As you can imagine, the discussion fired me up, so I started to investigate these “flexibility capabilities” (for me, this is the key point in evaluating a solution) and began a long email exchange with this enterprising manager.
Let me sum up the main questions/answers about this module.
Q: What kind of Neural Network does the suite support?
A: Back Propagation (just for multilayer perceptron) and SOM.
The suite has a nice GUI where you can combine the algorithms like Lego bricks (it is very similar to WEKA’s GUI).
Q: What kind of error function does it support (limited to BP nets)?
A: I don’t know, but I suppose just the delta based on target vs. generated output.
Q: What kind of pruning algorithms does it support?
A: It supports a weight-based algorithm, but there is no control over it.
Q: How many learning algorithms does the BP net support?
A: It supports just BP with momentum.
Q: What kind of monitors does it support to understand the net behavior?
A: It has a monitor to follow the output error on the training set.
Q: What is the cost of this suite?
A: The suite (actually it is a pretty famous data mining suite... don’t ask more about it…) costs around $10,000 for a limited number of users (this includes neither support costs, nor integration costs, nor dedicated hardware).
Q: Do you have the chance to visualize the error surface?
A: I’m a manager, what the hell does this question mean?

My recommendation: don’t buy it.
Here are the key points:

1. No words, bring real evidence!
As usual, instead of thousands of words I prefer to get direct and concrete proof of my opinion, because my motto is: "don't say I can do it in this way or in that way, just do it!" (…hoping that it makes sense in English!).
So during this weekend I decided to implement, in Mathematica, a custom tool to play with Back Propagation Neural Networks, with at least the same features described above.
The general-purpose Neural Net application built with Mathematica: different colors for neurons to depict different energy levels, a monitor to follow the learning process, full customization of the energy cost function, the activation functions, ...
My "suite" (of course, it is not really suitable for production systems) has the following features:

  • Graphical representation of the net;
  • Neurons dynamically colored by energy importance (useful to implement a pruning algorithm);
  • Dynamic cost function updating;
  • Learning coefficient decay (really useful but seldom present in common suites; see the sketch after this list);
  • Activation function monitor;
  • Automatic setting of the input/output neuron layers in compliance with the training set.
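As an example of the learning coefficient decay mentioned in the list above, here is a minimal sketch in Python (the exponential schedule and parameter names are illustrative, not the ones used in the Mathematica tool):

def eta_decay(eta0, epoch, half_life=500):
    """Exponential decay of the learning coefficient: eta halves every `half_life` epochs."""
    return eta0 * 0.5 ** (epoch / half_life)

# eta starts at 0.5 and shrinks as training proceeds
for epoch in (0, 500, 1000, 2000):
    print(epoch, round(eta_decay(0.5, epoch), 4))

A decaying eta lets the net take large steps at the beginning of training and smaller, more stable steps once it gets close to a minimum.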

2. General-purpose algorithm doesn't mean "anyone can use it to solve whatever problem"!
There is a misbelief that general-purpose algorithms like SVM, NN, simulated annealing, decision trees and so on can solve every kind of problem without human interaction, customization and expertise.
For the XOR problem, maybe even a newbie can find the right settings for a neural net, but in the real world the problems are so complex that the overall accuracy is a combination of many factors, and sometimes the algorithm choice has a limited influence on the entire picture!


Let's start by showing a toy example, the XOR problem (as you know, it is the classical example used to illustrate problems without a linear solution).
I instantiated a network having 1 hidden layer with 2 hidden neurons per layer. I fixed the beta parameter of the activation function to 1, set the learning rate eta to 0.1, and randomized the weights in the range [-1, 1] (as reported in the main tutorials).
Here is the network able to solve the problem and the related behavior of the cost function:
Neural network for the XOR problem: 2 input neurons, 2 hidden neurons, a threshold neuron and the output.
The squared error decreasing for the above network. As you can see, a stable configuration is achieved after 30000 learning cycles!!
Notice that with the standard configuration, the problem takes around 30000 cycles to be solved!!!
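For concreteness, here is a minimal sketch in Python/NumPy of that standard setup (illustrative only, not the Mathematica tool): a 2-2-1 topology, beta = 1 in the sigmoid, eta = 0.1, and weights drawn uniformly in [-1, 1].

import numpy as np

rng = np.random.default_rng()
beta, eta = 1.0, 0.1                         # activation steepness and learning rate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-beta * x))

W_in_hidden  = rng.uniform(-1, 1, (2, 2))    # 2 inputs  -> 2 hidden neurons
b_hidden     = rng.uniform(-1, 1, 2)         # threshold (bias) of the hidden layer
W_hidden_out = rng.uniform(-1, 1, (2, 1))    # 2 hidden  -> 1 output neuron
b_out        = rng.uniform(-1, 1, 1)         # threshold (bias) of the output neuron

This is the configuration that, as reported above, took around 30000 learning cycles to converge with plain back propagation.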
With a little bit of experience, the same results can be achieved in fewer than 300 cycles. Indeed, changing the learning rate eta and the network configuration, we obtain:

Another configuration to solve the XOR problem: 1 hidden layer and 7 hidden neurons. As you can see, after 300 learning cycles the error is negligible.
Observe the hidden neurons' energy bars: as expected, only one hidden neuron (H1,2) gives a concrete contribution to the network (this is in compliance with the theory that only one hidden neuron is required to solve the XOR problem!!).
A novice data miner could say: "It is easy: I'll increase the number of hidden layers and the number of neurons and the training will be faster!": TOTALLY FALSE!!
Indeed (especially for neural networks based on the classical Werbos delta rule), the complexity of the net often hinders the learning, producing oscillatory behaviors due to neuron saturation and overfitting.
A neural network composed of 2 hidden layers and 8 hidden neurons trained to solve the XOR problem. As you can see, even though the network is more complex than the former example, it is not able to solve the problem after 1000 cycles (the former net solved the problem after 300 cycles).

We have seen that even with a trivial problem there are many variables to consider in order to obtain a good solution. In the real world, the problems are much more complex, and the expectation that a standard BP can fit your needs is actually naive!

3. Benefits & thoughts.
How much time does an in-house implementation take?
I spent more or less 6 hours replicating the standard BP with the "colored neurons" version... and in this time slot, my 4-year-old daughter tried to "help" me find the best color range and the fanciest GUI :) by randomly typing on the keys of my laptop....


  • The effort to replicate the model in a faster language (like Java) could be 4 times the effort of coding it in Mathematica: that is no more than 1 week of development.
  • The customization of the standard BP for a specific problem, for example changing the objective function, can be really expensive and is often not feasible with a standardized suite.
  • The integration of a bought solution almost always takes at least 50% of the development effort, so building in house can mitigate these costs, because your solution is developed directly on your infrastructure and not adapted to work with it!
  • The advantage of developing in house a product requiring high competence lies in the deep knowledge that the company acquires, and in the deep control of the application under every different aspect: integration, HW requirements, business requirements and so on. Knowledge is the key to winning the competitive challenges of this strange world. Of course, training an in-house team with deep scientific knowledge oriented toward "problem solving" is initially expensive and risky, but in my belief the attempt to delegate these core "components", totally outsourcing this expertise to external companies, will produce an impoverishment of the company's value.

The BP simulator I implemented is totally free for students and/or academic institutions: contact me at cristian.mesiano@gmail.com.
...We will use this simulator to highlight interesting points in the amazing world of data mining.
Stay Tuned,
Cristian