Elements of IBM Watson – Machine Learning, an introduction

Machine learning forms a core part of the technology behind IBM Watson. I touched on it in my earlier posts on IBM Watson, hence this new post dedicated to the topic. There is no single clear definition of machine learning, and the definition has been changing with time, as it should! One of the first definitions was by Arthur Samuel. In 1959, Samuel defined machine learning as a “field of study that gives computers the ability to learn without being explicitly programmed”. Interestingly, there is an anecdote about Arthur Samuel. In 1949 he joined IBM’s Poughkeepsie Laboratory and worked on IBM’s first stored-program computer, the 701. He completed the first checkers program on the 701, and when it was about to be demonstrated, Thomas J. Watson Sr., the founder and President of IBM, remarked that the demonstration would raise the price of IBM stock 15 points. It did.
A lot of us might think that computers can’t do anything they are not explicitly programmed to do. In our usual experience, we write a financial accounting program and that is exactly what the computer does for us: financial accounting, or maybe employee payroll or core banking. There are numerous applications that have been programmed this way, and computers do exactly what they have been programmed to do. How can computers do something they have NOT been programmed for? Well, machine learning is all about that! Arthur Samuel managed to write a checkers program that could play checkers much better than he personally could, which is an instance of a computer learning to do things it was not explicitly programmed to do.
The obvious question that comes to mind is: why do we need computers to be able to learn without being explicitly programmed? Why do we need machine learning? The answer is easy: by explicitly programming a computer, we limit its capability to what we ourselves know how to specify. By adding the capability to learn, we can leverage computers to perform even better than us, as Arthur Samuel’s checkers program demonstrated. This is the same reason why IBM Watson’s win in Jeopardy is so important. The impact of machine learning is already seen in everyday things like better email spam filtering and face recognition, and it could lead to discovering new things in scientific research, medical diagnosis, investment strategies, laws for judicial systems, etc.
The two most commonly used types of machine learning algorithms are supervised and unsupervised. Broadly, supervised learning is where you teach the program using examples whose correct answers are known, and unsupervised learning is where you let it find structure in the data by itself.
A supervised learning algorithm analyzes labeled training data and produces an inferred function, which can be used for mapping new examples. Let’s say you want to predict whether someone will have a heart attack within a year. You collect a training data set on people, including age, weight, height, blood pressure, etc., along with whether each person had a heart attack within a year of the data being collected. Supervised machine learning combines all this existing data into a model that can make the required prediction for a new person.
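To make the idea of an “inferred function” concrete, here is a minimal sketch using a k-nearest-neighbor vote. The data points, the two chosen features and the function names are all made up for illustration; this is not a real medical model, and it is only one of many supervised algorithms that could be used.

```python
import math

# Hypothetical training set (invented numbers):
# (age, systolic blood pressure) -> had a heart attack within a year?
training_data = [
    ((35, 118), False),
    ((42, 125), False),
    ((50, 130), False),
    ((61, 160), True),
    ((68, 172), True),
    ((74, 165), True),
]


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(features, data=training_data, k=3):
    """Label a new example by majority vote of its k nearest training examples."""
    nearest = sorted(data, key=lambda row: euclidean(row[0], features))[:k]
    votes_for_true = sum(1 for _, label in nearest if label)
    return votes_for_true > k / 2


# The "inferred function" maps a new, unseen person to a predicted label.
new_person = (70, 168)
prediction = predict(new_person)
```

In practice the features would need to be scaled to comparable ranges and the model checked against data it has not seen, but the shape of the problem stays the same: labeled examples in, a prediction function out.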
Supervised learning splits into two broad categories: classification and regression. Classification assigns examples to one of a given set of categories based on a classification rule. The response takes only a few known values, such as ‘true’ or ‘false’. Classification algorithms apply to nominal, not ordinal, response values. Typical applications of classification are spam filtering, market segmentation and natural-language processing. Regression algorithms estimate the relationships among variables; the estimation target is a function of the independent variables, called the regression function. Regression is used for responses that are real numbers, such as miles per gallon for a particular car. A simple example of regression is predicting auto sales from family income in a neighborhood.
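As a rough illustration of regression, the sketch below fits a straight line by ordinary least squares to invented numbers for family income and auto sales. The figures and variable names are assumptions made up for this example. Note that the output is a real number, not a category as in classification.

```python
# Hypothetical data: average family income (in thousands) per neighborhood
# and autos sold there in a year. The numbers are invented for illustration.
incomes = [30, 40, 50, 60, 70]
sales = [11, 17, 19, 25, 28]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(incomes, sales)


def predict_sales(income):
    """The fitted regression function: income in, estimated sales out."""
    return intercept + slope * income
```

For this toy data the fitted line has a slope of 0.42 and an intercept of -1, so a neighborhood with an income of 55 would be estimated at about 22 sales.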
There is another way to classify learning algorithms: discriminative and generative. Discriminative algorithms usually process examples to find a large-margin hypothesis separating the two classes. Generative algorithms for learning classifiers use training data to separately estimate a probability model for each class. Generative models contrast with discriminative models in that a generative model is a full probabilistic model of all variables, whereas a discriminative model provides a model only for the target variable(s) conditional on the observed variables. Thus a generative model can be used, for example, to simulate (i.e. generate) values of any variable in the model, whereas a discriminative model allows only sampling of the target variables conditional on the observed quantities. However, for tasks such as classification and regression that do not require the joint distribution, discriminative models can yield superior performance. On the other hand, generative models are typically more flexible than discriminative models in expressing dependencies in complex learning tasks. In addition, most discriminative models are inherently supervised and cannot easily be extended to unsupervised learning. Application-specific details ultimately dictate whether a discriminative or a generative model is more suitable.
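The contrast can be sketched on a toy one-dimensional problem. In the illustration below, the generative side estimates a Gaussian per class (a choice made for this example) and can therefore generate new feature values, while the discriminative side is a logistic regression that models only the probability of the class given the feature. The data and all function names are made up.

```python
import math
import random

# Made-up 1-D training data: feature value -> class label (0 or 1).
xs = [1.0, 1.5, 2.0, 2.5, 5.0, 5.5, 6.0, 6.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]


def fit_generative(xs, ys):
    """Estimate a prior and a Gaussian (mean, variance) separately for each class."""
    model = {}
    for label in set(ys):
        pts = [x for x, y in zip(xs, ys) if y == label]
        mean = sum(pts) / len(pts)
        var = sum((p - mean) ** 2 for p in pts) / len(pts)
        model[label] = (len(pts) / len(xs), mean, var)
    return model


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def predict_generative(model, x):
    """Bayes rule: pick the class that maximizes prior * likelihood."""
    return max(model, key=lambda c: model[c][0] * gaussian_pdf(x, model[c][1], model[c][2]))


def sample_generative(model, label, rng):
    """Each class has its own model, so new feature values can be generated."""
    _, mean, var = model[label]
    return rng.gauss(mean, math.sqrt(var))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_discriminative(xs, ys, steps=500, lr=0.1):
    """Logistic regression: models only P(class | x), fitted by gradient descent."""
    center = sum(xs) / len(xs)  # centering the feature helps plain gradient descent
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * (x - center) + b)
            grad_w += (p - y) * (x - center)
            grad_b += p - y
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b, center


def predict_discriminative(params, x):
    w, b, center = params
    return 1 if sigmoid(w * (x - center) + b) >= 0.5 else 0


gen_model = fit_generative(xs, ys)
disc_model = fit_discriminative(xs, ys)
```

Both models classify new points, but only the generative one can answer “what does a typical member of class 1 look like?” by calling `sample_generative`.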
Coming to unsupervised learning, systems can learn to represent particular input patterns in a way that reflects the statistical structure of the overall collection of input patterns. There are many methods and algorithms by which unsupervised learning can be achieved. More often than not, multiple algorithms are used instead of one. I will discuss clustering and the self-organizing feature map (SOFM) in this blog post.
The most common unsupervised learning method is clustering, which is used in exploratory data analysis to find hidden patterns or groupings in data. It has a long history and is used in almost every field, e.g. medicine, psychology, marketing, insurance, libraries, etc. In recent years, due to the rapid increase in online documents, text clustering has become important. Clustering is a technique for finding similarity groups in data, called clusters. It groups data instances that are similar to (near) each other into one cluster and data instances that are very different (far away) from each other into different clusters. There are three aspects of clustering: the clustering algorithm(s), the distance (similarity or dissimilarity) function, and the clustering quality. There are many clustering models and algorithms, such as hierarchical clustering, the k-means algorithm, density-based algorithms like DBSCAN, sub-space clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc. Clustering is hard to evaluate, but very useful in practice. It is highly application dependent and to some extent subjective. One must remember that all clustering algorithms only group data; clusters represent only one aspect of the knowledge in the data.
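As an illustration of the three aspects (algorithm, distance function, resulting groups), here is a minimal k-means sketch. The points are invented, Euclidean distance is assumed as the distance function, and using the first k points as starting centroids is a simplification chosen to keep the example deterministic; real implementations usually initialize more carefully.

```python
import math


def euclidean(a, b):
    """Distance function: straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=10):
    """Plain k-means. The first k points serve as the initial centroids."""
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignments


# Made-up 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
```

On this toy data the points near (1, 1) end up in one cluster and the points near (8.5, 9) in the other. Note that the algorithm only groups the data; deciding what the groups mean is left to the person using it.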
The self-organizing feature map (SOM or SOFM) is a type of artificial neural network (ANN). It provides a topology-preserving mapping from a high-dimensional space to map units. Map units, or neurons, usually form a two-dimensional lattice, so the mapping goes from the high-dimensional space onto a plane. Topology preserving means that the mapping approximately preserves the neighborhood relations between points: points that are near each other in the input space are mapped to nearby map units in the SOM. The SOM can thus serve as a cluster-analysis tool for high-dimensional data. The SOM also has the capability to generalize, which means that the network can recognize or characterize inputs it has never encountered before.
A self-organizing map consists of components called nodes or neurons. Associated with each node are a weight vector (also called a codebook vector) of the same dimension as the input data vectors and a position in the map space. First, the network is initialized. There are three different types of network initialization: random initialization, initialization using initial samples, and linear initialization. The next step is training. Training is an iterative process through time; it requires a lot of computational effort and is thus time-consuming. Training consists of drawing sample vectors from the input data set and “teaching” them to the SOM. The teaching consists of choosing a winner unit by means of a similarity measure and updating the values of the codebook vectors in the neighborhood of the winner unit. This process is repeated a number of times. In one training step, one sample vector is drawn randomly from the input data set. This vector is fed to all units in the network, and a similarity measure is calculated between the input sample and each of the codebook vectors. The best-matching unit (BMU) is the unit whose codebook vector has the greatest similarity to the input sample.
The similarity is usually defined by means of a distance measure. After the best-matching unit is found, units in the SOM are updated. During the update procedure, the best-matching unit is moved a little closer to the sample vector in the input space. The topological neighbors of the best-matching unit are updated similarly. This update procedure stretches the BMU and its topological neighbors towards the sample vector. The codebook vectors tend to drift towards regions where the data is dense, while only a few codebook vectors end up where the data is sparse. In this manner, the trained self-organizing map tends to approximate the probability density function of the input data. It can then be used in the next step, which is visualization. Before a model can be reliably used, it must be validated. Validation means that the model is tested so that we can be sure it gives us reasonable and accurate values.
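The training loop described above can be sketched in a few dozen lines. This is a minimal illustration only: it assumes random initialization, Euclidean distance as the similarity measure, a Gaussian neighborhood function, and a learning rate and radius that decay linearly. All names, parameter values and the toy data are made up for the example.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_som(rows, cols, dim, rng):
    """Random initialization: one codebook vector per unit on a rows x cols lattice."""
    return {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}


def find_bmu(weights, sample):
    """Best-matching unit: the unit whose codebook vector is nearest to the sample."""
    return min(weights, key=lambda unit: euclidean(weights[unit], sample))


def update(weights, sample, bmu, lr, radius):
    """Pull the BMU and its lattice neighbors towards the sample."""
    for unit, w in weights.items():
        grid_dist = euclidean(unit, bmu)  # distance on the map, not in input space
        h = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))  # Gaussian neighborhood
        for i in range(len(w)):
            w[i] += lr * h * (sample[i] - w[i])


def train(weights, data, steps, rng, lr0=0.5, radius0=1.5):
    """Repeat: draw a random sample, find its BMU, update the neighborhood."""
    for t in range(steps):
        frac = 1.0 - t / steps  # decays linearly from 1 towards 0
        sample = rng.choice(data)
        bmu = find_bmu(weights, sample)
        update(weights, sample, bmu, lr0 * frac, max(radius0 * frac, 0.1))
    return weights


# Made-up 2-D data in two dense regions, mapped onto a 3 x 3 lattice.
rng = random.Random(0)
data = [(0.1, 0.1), (0.15, 0.05), (0.05, 0.2), (0.9, 0.9), (0.85, 0.95), (0.95, 0.8)]
som = train(init_som(3, 3, 2, rng), data, steps=300, rng=rng)
```

After training, calling `find_bmu(som, sample)` returns the lattice position a sample maps to, which is what a visualization step would typically plot.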

These were the various concepts associated with machine learning. I understand the IBM Watson team has used many machine learning algorithms to achieve its results, including some of the ones mentioned above.
