Category Archives: Analytics

IBM Watson Analytics – Powerful Analytics for Everyone

Update – Watson Analytics Beta is available now. Signup and try now @ http://www.ibm.com/analytics/watson-analytics/

On16 Sep 2014 IBM announced Watson Analytics, a breakthrough natural language-based cognitive service that can provide instant access to powerful predictive and visual analytic tools for businesses. Watson Analytics is designed to make advanced and predictive analytics easy to acquire and use.

Watson Analytics Sample 1

Watson Analytics Sample 1

Watson Analytics offers a full range of self-service analytics, including access to easy-to-use data refinement and data warehousing services that make it easier for business users to acquire and prepare data – beyond simple spreadsheets – for analysis and visualization that can be acted upon and interacted with. Watson Analytics also incorporates natural language processing so business users can ask the right questions and get results in terms familiar to their business that can be read and understood or interacted with e.g.Who are my most profitable customers and why?  or Which customers are showing signs that they might be considering defecting to a competitor? or Which sales opportunities are likely to turn into wins?  or Why are these products being returned?

I think the most amazing aspect is that Watson Analytics not only can show what is likely to happen but also what you can do about it e.g. What actions could I take that would improve my chances of closing specific deals in the pipeline? Where should we be focusing our loyalty programs?

Surely Watson Analytics is a powerful tool for any business. To make it even easier for businesses to use this technology, its being made available over cloud.  It will be hosted on SoftLayer and available through the IBM Cloud marketplace. IBM also intends to make Watson Analytics services available through IBM Bluemix to enable developers and ISVs to leverage its capabilities in their applications. Certain Watson Analytics capabilities will be available for beta test users within 30 days, and offered in a variety of freemium and premium packages starting later this year.

This is where Watson Analytics distinguishes itself from the crowded analytics marketplace. IBM makes really advanced business analytics accessible to any business user anywhere and that too without a cost! In the freemium business model a product or service is provided free to a large group of users, but a premium is charged for access to more data sources, data storage and enterprise capabilities.

Beta for the service is expected to launch September, with availability later this year. Don’t wait! Check out IBM Watson Analytics!

 

Elements of Watson – Natural Language Processing, an introduction

I would say the core component of IBM Watson is the Natural Language Processing, without which, surely Watson will not even be able to understand the question, let alone answer. NLP allows Watson to really derive the meaning of the question. How many parts are there in the question and how to interpret meaning and relation between each part.  This semantics of the question, obviously, will have direct implication on the correctness of answer for the question.

Natural Language Processing or NLP as its called, is a huge area in artificial intelligence field concerned with Human Computer interactions (HCI). I do not think of myself as knowledgeable on this subject matter but lately I did try understand its depth in respect to my interests in Watson. This blog post simply documents that study to some extent.

ELIZA written at MIT by Joseph Weizenbaum around 1964-66, named after the protagonist in George Bernard Shaw’s Pygmalion, is one of the earliest examples of NLP. ELIZA is in some sorts great grandmother of Apple’s Siri J

The first thing to know to begin with NLP will be the linguistic concepts. Grammar is at the core of any language. Grammar of a language is how sentences are constructed. It is interesting to note history of grammar that, the first systematic grammars originated in Iron Age India, with Yaska (6th century BC), Pāṇini (4th century BC) and his commentators Pingala (c. 200 BC), Katyayana, and Patanjali (2nd century BC). Tolkāppiyam is the earliest Tamil grammar is mostly dated to before the 5th century AD.

In NLP, multiple text processing components are used in sort of a pipeline of tasks performed, in order to provide value from text. Text processing components are tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and co-reference resolution.

Tokenization is breaking up a stream of text into smaller meaningful parts called tokens. Tokens are usually words, phrases or sentences demarcated by punctuation. In English, words are mostly separated from each other by blanks or white space as they are called in IT world. Thing to remember thought is that not all white space is equal e.g. “San Francisco” or “fast food” is supposed to taken as a single word. On other hand “I’m” should be two words “I am”. Further challenges to consider are abbreviations, acronyms, hyphenated words and most importantly numerical and special expressions such as telephone numbers, date format, vehicle license number.

Sentence segmentation is simply put dividing the stream of text into sentences. Let us not jump into conclusion that its basically looking for a period or other end of sentence punctuation. Remember, period is used in many more context and not only end of sentence. A period may denote an abbreviation, decimal point, or an email address. Question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang. Thus, we need specialized algorithms to really identify the end of sentence. In fact, there is special name for such algorithms. Its called Sentence Boundary Detection. I read this very interesting paper by Dan Gillick of UC-Berkley on this. https://code.google.com/p/splitta/ He claims this includes proper tokenization and models for very high accuracy sentence boundary detection (English only for now). The models are trained from Wall Street Journal news combined with the Brown Corpus which is intended to be widely representative of written English. Error rates on test news data are near 0.25%.

Another very interesting component of text processing is part of speech tagging. The basics of a language are the words. Linguist classify the words into various classes or ‘parts of speech’ (POS). In English language, they noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. The process of assigning one of the parts of speech to the given word, rather token, is called Parts Of Speech tagging. POS Tagger is an algorithm or program which does POS tagging to tokens. POS Tagger marks tokens with their corresponding word type based on the token itself and the context of the token. The very important part of context of the token. Wiki has a very interesting example in sentence “The sailor dogs the hatch.”. “dogs” is usually thought of as just a plural noun, in this context is a verb. There are variety of algorithms which are used to do POS tagging. Some of them are Viterbi algorithm, Brill Tagger, Constraint Grammar, Baum-Welch algorithm. Methods like Hidden Markov Models or visible Markov model have been used. Many machine learning methods have also been applied to the problem of POS tagging e.g. SVM, Maximum entropy classifier, maximum-entropy Markov model (MEMM), or conditional Markov model (CMM), Perceptron, and Nearest-neighbor.

Further classification by labeling sequences of words in text processing as names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. is named entity extraction. There are many approaches, linguistic grammar-based techniques as well as statistical models, for Named entity extraction or Named entity recognition (NER), as it is some times called. Conditional random fields (CRFs) are a class of statistical modeling method often applied for NER.

Parsing is the task of analyzing the grammatical structure of natural language. Parser finds which groups of words go together (as “phrases”) and which words are the subject or object of a verb. Parser generates a parse tree of a given sentence. Parse trees are something like the “sentence diagram” we learnt when we were learning English grammar in school. Parse trees can be Constituency-based or Dependency-based. A constituent is a word or a group of words that functions as a single unit within a hierarchical structure in a phrase structure grammar; on other hand, the dependency relation views the (finite) verb as the structural center of all clause structure. All other syntactic units (e.g. words) are either directly or indirectly dependent on the verb.

Chunking is also called shallow parsing and it’s basically the identification of parts of speech and short phrases e.g. noun phrase or verb phrase. Full parsing is expensive, Partial or shallow parsing can be much faster, may be sufficient for many applications and can also serve as a possible first step for full parsing.

Last but not the least, we discuss co-reference resolution. Co-reference resolution is the task of finding all expressions that refer to the same entity in a discourse. E.g. ‘Sheila said she fell down’. Here She is referring to Sheila is the co-reference resolution. There are many types of co-references, like anaphora, cataphora, split antecedents, co-referring noun phrases, etc. Anaphora is when proform follows the expression to which it refers e.g. ‘Sheila said she fell down’ She follows Sheila, to whom it refers to. Cataphora is when proform precedes the expression to which it refers e.g. e.g. ‘she fell down, Sheila said’, She precedes Sheila, to whom it refers to. Algorithms intended to resolve co-references commonly look first for the nearest preceding individual that is compatible with the referring expression. Some are deterministic or multi-pass sieve algorithms.

So these were some of the very basic components of Natural language processing which forms the core of IBM Watson.

Elements of IBM Watson – Machine Learning, an introduction

Machine learning forms core part of technology which forms IBM Watson. I discussed this as part of my earlier posts on IBM Watson. Thus this new post on this core topic of machine learning. There is not a very clear definition of machine learning and its been changing with times, as it should be! One of the first definitions were by Arthur Samuel. In 1959, Arthur Samuel defined Machine Learning as a “Field of study that gives computers the ability to learn without being explicitly programmed”. Interestingly, there is an anecdote on Arthur Samuel. In 1949 Samuel joined IBM’s Poughkeepsie Laboratory and worked on IBM’s first stored program computer, the 701. He completed the first checker program on the 701, and when it was about to be demonstrated, Thomas J. Watson Sr., the founder and President of IBM, remarked that the demonstration would raise the price of IBM stock 15 points. It did.
A lot of us might think that computers can’t do anything that they’re not explicitly programmed to. Usually in our experience we write a financial accounting program and that is exactly what the computer does for us, financial accounting; or maybe employee payroll or core banking. There are very clear numerous applications which have been programmed and computers does those exactly the way they have been programmed. How can computers do something which they have NOT been programmed for?  Well, Machine Learning is all about that! Arthur Samuel managed to write a checkers program that could play checkers much better than he personally could, and this is an instance of maybe computers learning to do things that they were not programmed explicitly to do.
The obvious question that might come to our mind will be, why do we need such possibility of ability of computers to learn without being explicitly programmed? Why do we need machine learning? The answer is easy, by explicitly programming, we are limiting the computer’s capability. By adding the capability to learn we can leverage computers to perform even better than us as demonstrated by Arthur Samuel’s checkers program. The checkers program could play checkers much better than he personally could. The same reason why IBM Watson’s win in Jeopardy is so important. The impact of machine learning is seen on simple things like enhance email spam, face recognition and could lead to discovering new things in scientific research, medical diagnosis, investment strategies, laws for judicial systems etc.
There are typically two types of most commonly used Machine learning algorithms, supervised and unsupervised. Quite obviously supervised is where you teach the program and unsupervised is where you let it learn by itself.
A Supervised Learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. Lets say, you want to predict if someone will have a heart attack within a year. You collect a learning data set on people, including age, weight, height, blood pressure, etc who had an heart attack within a year of data being collected. Supervised machine learning is combining all the existing data into a model that can predict the required analysis.
Supervised learning splits into two broad categories of classification and regression. Classification is classify examples into given set of categories based on certain classification rule. Classification can have just a few known values, such as ‘true’ or ‘false’. Classification algorithms apply to nominal, not ordinal response values. Examples of classification algorithms are spam filtering, market segmentation or natural-language processing. Regression algorithms are for estimating the relationships among variables; the estimation target is a function of the independent variables called the regression function. Regression for responses that are a real number, such as miles per gallon for a particular car. Example of simple regression is prediction of auto sales in relation with family income in a neighborhood.
The are another two classifications of learning algorithms, Discriminative learning algorithm & generative learning algorithm. Discriminative algorithms usually process examples to find a large margin hypothesis separating the two classes. Generative algorithms for learning classifiers use training data to separately estimate a probability model for each class. Generative models contrast with discriminative models, in that a generative model is a full probabilistic model of all variables, whereas a discriminative model provides a model only for the target variable(s) conditional on the observed variables. Thus a generative model can be used, for example, to simulate (i.e. generate) values of any variable in the model, whereas a discriminative model allows only sampling of the target variables conditional on the observed quantities. However, for tasks such as classification and regression that do not require the joint distribution, discriminative models can yield superior performance. On the other hand, generative models are typically more flexible than discriminative models in expressing dependencies in complex learning tasks. In addition, most discriminative models are inherently supervised and cannot easily be extended to unsupervised learning. Application specific details ultimately dictate the suitability of selecting a discriminative versus generative model.
Coming to Unsupervised Learning, systems can learn to represent particular input patterns in a way that reflects the statistical structure of the overall collection of input patterns. There are many methods and algorithms by which unsupervised learning could be achieved. More often than not, multiple algorithms are used instead of one. I will discuss Clustering &  self-organizing feature map (SOFM) in this blog post.
The most common unsupervised learning method is Clustering, which is used for exploratory data analysis to find hidden patterns or grouping in data. It has a long history, and used in almost every field, e.g., medicine, psychology, marketing, insurance, libraries, etc. In recent years, due to the rapid increase of online documents, text clustering becomes important. Clustering is a technique for finding similarity groups in data, called clusters. It groups data instances that are similar to (near) each other in one cluster and data instances that are very different (far away) from each other into different clusters.  There are three aspects of clustering; clustering algorithm(s), distance (similarity, or dissimilarity) function & Clustering quality. There are many clustering models and algorithms like  hierarchical clustering,  k-means algorithm, DBSCAN, density based algorithm, sub-space clustering, scale-up methods, neural networks based methods, fuzzy clustering, co-clustering, etc. Clustering is hard to evaluate, but very useful in practice. Clustering is highly application dependent and to some extent subjective. One must remember, All the clustering algorithms only group data. Clusters only represent one aspect of the knowledge in the data.
Self organizing feature map ( SOM or SOFM) is a type of artificial neural network (ANN) It provides a topology preserving mapping from the high dimensional space to map units. Map units, or neurons, usually form a two-dimensional lattice and thus the mapping is a mapping from high dimensional space onto a plane. The property of topology preserving means that the mapping preserves the relative distance between the points. Points that are near each other in the input space are mapped to nearby map units in the SOM. The SOM can thus serve as a cluster analyzing tool of high-dimensional data. Also, the SOM has the capability to generalize. Generalization capability means that the network can recognize or characterize inputs it has never encountered before. A self-organizing map consists of components called nodes or neurons. Associated with each node is a weight vector of the same dimension as the input data vectors and a position in the map space. At first the network is initialized. There are three different types of network initializations, random initialization, initialization using initial samples and linear initialization. The next step is training. Training is an iterative process through time. It requires a lot of computational effort and thus is time-consuming. The training consists of drawing sample vectors from the input data set and “teaching” them to the SOM. The teaching consists of choosing a winner unit by the means of a similarity measure and updating the values of codebook vectors in the neighborhood of the winner unit. This process is repeated a number of times. In one training step, one sample vector is drawn randomly from the input data set. This vector is fed to all units in the network and a similarity measure is calculated between the input data sample and all the codebook vectors. The best-matching unit (BMU) is chosen to be the codebook vector with greatest similarity with the input sample. The similarity is usually defined by means of a distance measure. After finding the best-matching unit, units in the SOM are updated. During the update procedure, the best-matching unit is updated to be a little closer to the sample vector in the input space. The topological neighbors of the best-matching unit are also similarly updated. This update procedure streches the BMU and its topological neighbors towards the sample vector. The codebook vectors tend to drift there where the data is dense, while there tends to be only a few codebook vectors where data is sparsely located. In this manner, the net tends to approximate the probability density function of the input data. The Self-Organizing Map is an approximation to the probability density function of the input data. It can be used in the next step which is visualization. Before a model can be reliably used, it must be validated. Validation means that the model is tested so that we can be sure that the model gives us reasonable and accurate values.

These were the various concepts associated with machine learning. I understand IBM Watson team has used many machine learning algorithms to achieve the results including some of the ones mentioned above.

Monetizing Telco Big Data: Location, Location, Location

This post is based on article with same name by Dr Sambit Sahu, a manager and senior research scientist at the IBM T.J. Watson Research Center. His current research focuses on Cloud and Big Data analytics for Telco and Smarter Cities.

As I read the above mentioned article, I felt that future has come to present in case of broadcasting!

Broadcasting is more than a century old,  starting towards end of 19th century, with data services offered by stock telegraph companies and “the theatre phone”, a telephonic distribution system that allowed the subscribers to listen to opera and theatre performances over the telephone lines.   The business model for broadcasting has been limited to very few ways or a combination of these, like funding, subscription, paid programming or the most popular one, advertising. There has been obviously tremendous growth in Radio and Television broadcasting industry in the past few decades. From the typical broadcast, with advent of internet and mobile, Radio and TV has option to become interactive too; yet, the traditional plain broadcast is still the most popular.

In the advertising world, contextual advertising is gaining ground. Though I am not aware of any studies which suggest effectiveness of this mode of advertising but logically it does seem that this will be likely the most effective. Contextual advertising is a form of targeted advertising where, the advertisements themselves are selected and served by automated systems based on the context of the user. For example, per Wikipedia, if the user is viewing a website pertaining to sports and that website uses contextual advertising, the user may see advertisements for sports-related companies, such as memorabilia dealers or ticket sellers. Contextual advertising is also used by search engines to display advertisements on their search results pages based on the keywords in the user’s query.

That was a lot of ‘context’ 🙂 to the article; coming to Big Data and monetizing Telco Big Data; Telcos have been the one industry which has bloomed exponentially in the last 2 decades; growing their susbscriber base through landlines, mobile telephony, internet and broadcasting services. This kind of provisioning in Telco world is called Triple play. Per Wikipedia, In telecommunications, triple play service is a marketing term for the provisioning, over a single broadband connection, of two bandwidth-intensive services, high-speed Internet access and television, and the latency-sensitive telephone.  This implies telcos have huge volume of data which can give ‘context’ to acitvities of their subscriber. This could lead to some indications of ‘behavior’ of the subscribers. This information along with location of the subscriber can potentially open lock to a huge treasure of knowledge.

Before we move forward on the monetizing, we need to also keep in mind the security and privacy issues related to usage of this data. The good news is that there exists tested and proven IBM technologies, which can surely give solutions in these areas. That, I am not covering in this blog post; but surely something to not just ponder but even decisive criteria.

In order to monetize the huge treasure of data in context of location; it needs to be converted into information which can be judiciously used to not only derive contexual advertising but also ensure that the efficiency of the advertisement peaks. Dr Sambit Sahu mentions two use-cases which is team has prepared and are going to demonstrate them in the ongoing Mobile World Congress. Below are the use-cases he mentions in his article

 In the first, the analytics platform is being used to create targeted Internet Protocol Television (IPTV) advertisements based on a customer’s profile. So, instead of every TV watcher seeing the same ad at the same time, during the same program, opt-in participants would see tailored advertisements that best match their profiles. Even individual family members would see different advertising based on knowing who is at home via cell phone location data, in conjunction with what programming they’re watching and their specific profile attributes.

The second use case, seeks to explore hyper-local targeted advertisements that will be delivered to mobile phones. In this use case, targeted advertisements and coupons will be delivered to a customer based on a better understanding of their profile, as well as current and predicted locations.

The play of the analytics technology is extremely core to these use-cases. Its the big data analytics technology which can from understanding of customer’s profile, infer intent, current and predicted locations or even predict future behaviors! This move from infering intent to predicting future behavior is which can lead to much better levels of effectiveness of the advertisement.

The model of Clow and Baack speaks of the six steps of objectives of an advertising campaign; Awareness, Knowledge, Liking, Preference, Conviction, Purchase. With real contextual advertising, telcos can move forward from branding methods of Awareness, Knowledge, Liking & Preference, closer to actual selling and sales revenues by Conviction, Purchase!