Showing posts from 2012

Introduction to the K-Nearest Neighbor (KNN) algorithm

In pattern recognition, the K-Nearest Neighbor algorithm (KNN) is a method for classifying objects based on the closest training examples in the feature space.  KNN is a type of instance-based learning, or lazy learning, in which the function is approximated only locally and all computation is deferred until classification.  The KNN algorithm is among the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small).  If k = 1, then the object is simply assigned to the class of its nearest neighbor [Source: Wikipedia].
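The majority-vote rule described above can be sketched in a few lines of Python.  This is a minimal illustration rather than the method used in the post: the function name `knn_classify`, the Euclidean distance measure and the toy points are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `training` is a list of (feature_vector, label) pairs.
    Euclidean distance is an assumption; other distance measures are possible.
    """
    # Sort the training examples by distance to the query point.
    by_distance = sorted(training, key=lambda pair: math.dist(query, pair[0]))
    # Keep the labels of the k closest examples and take a majority vote.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups of 2-D points.
training = [
    ((1.0, 1.0), "red"),
    ((1.5, 2.0), "red"),
    ((2.0, 1.5), "red"),
    ((8.0, 8.0), "blue"),
    ((8.5, 9.0), "blue"),
    ((9.0, 8.5), "blue"),
]

print(knn_classify((1.2, 1.4), training, k=3))  # all three nearest points are "red"
print(knn_classify((8.7, 8.8), training, k=1))  # k = 1: class of the single nearest point
```

Note that nothing is "trained" here: all the work happens at classification time, which is what makes KNN a lazy learner.  In practice the features would usually be scaled first so that no single variable dominates the distance.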
In today's post, we explore the application of KNN to an automobile manufacturer that has developed prototypes for two new vehicles, a car and a truck.  Before introducing the new models into its range, the manufacturer wants to determine which of the existing models in the marketplace are most l…

Feature selection for efficient modeling

Feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique of selecting a subset of relevant features for building robust learning models (Source: Wikipedia).  Data mining problems may involve hundreds, or even thousands, of variables that can potentially be used as inputs.  As a result, a great deal of time and effort may be spent examining which variables to include in the model.  Feature selection allows us to identify the most important variables to be used in the modeling process, which can lead to the following benefits:
Speed up model building: by using only the most important variables, feature selection enables us to significantly reduce processing times.  The greater the number of potential "input" variables, the greater the improvement in model building speed from using feature selection.
Reduction in time and cost of model building: c…
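As a rough illustration of picking "the most important variables", the sketch below ranks candidate inputs by the absolute value of their Pearson correlation with the target and keeps the top few.  This simple filter approach is an assumption chosen for the example; the excerpt does not say which feature selection technique the post uses, and the column names and data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_top_features(columns, target, top_n):
    """Return the names of the `top_n` columns most correlated with `target`."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical candidate inputs and target.
columns = {
    "signal":   [1, 2, 3, 4, 5, 6],
    "weak":     [2, 1, 4, 3, 6, 4],
    "noise":    [5, 1, 4, 1, 5, 2],
    "constant": [7, 7, 7, 7, 7, 7],
}
target = [2, 4, 6, 8, 10, 12]

print(select_top_features(columns, target, top_n=2))
```

A correlation filter only detects linear, one-variable-at-a-time relationships, so it is best read as a quick first screen rather than a complete selection procedure.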

An Introduction to Social Network Analysis

Social network analysis (SNA) is the methodical analysis of social networks.  Social network analysis views social relationships in terms of network theory, consisting of nodes (representing individual actors within the network) and ties (which represent relationships between the individuals).  These networks are often depicted in a social network diagram, where nodes are represented as points and ties are represented as lines.

Example of a social network diagram

Relationships in a network can be either directional or nondirectional.  In a directional relationship, one person is the initiator (or source of the relationship) while the other is the receiver (or destination of the relationship).  For example, in the diagram above, node 1269 is the source while node 3777 is the destination.  Relationships can also be described as dichotomous or valued.  A dichotomous relationship is one where the only information that exists is whether or not a relationship exists between two people, where …
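The difference between directional and nondirectional ties can be made concrete with a small data structure.  The sketch below is only an illustration: the `SocialNetwork` class and its method names are hypothetical, and the single tie it stores is the 1269 to 3777 relationship mentioned above.

```python
class SocialNetwork:
    """Nodes and ties stored as an adjacency mapping: node -> set of destinations."""

    def __init__(self, directed=True):
        self.directed = directed
        self.ties = {}

    def add_tie(self, source, destination):
        # Record the tie from source to destination and make sure both nodes exist.
        self.ties.setdefault(source, set()).add(destination)
        self.ties.setdefault(destination, set())
        if not self.directed:
            # A nondirectional tie is stored in both directions.
            self.ties[destination].add(source)

    def has_tie(self, source, destination):
        return destination in self.ties.get(source, set())

directional = SocialNetwork(directed=True)
directional.add_tie(1269, 3777)
print(directional.has_tie(1269, 3777))  # True: 1269 is the source
print(directional.has_tie(3777, 1269))  # False: the tie does not run backwards

nondirectional = SocialNetwork(directed=False)
nondirectional.add_tie(1269, 3777)
print(nondirectional.has_tie(3777, 1269))  # True: the tie holds both ways
```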

Optimizing Direct Mail Campaigns

Direct mail campaigns (and their online equivalents) continue to be a popular method to promote a company's offer to potential customers.  All of us have received letters from retail stores, financial institutions and other companies with special offers that prompt us to act quickly to take advantage of a discount, a bonus or a similar attractive proposition.  In most cases, I discard these letters without opening them, and in rare cases I open them before deciding that they don't apply to me.  My behavior is not unique; typical response rates for direct mail campaigns hover around 2% to 3%, which means most people who receive these mailers discard them.  For companies, this is obviously not good news.  Each mailer costs them money (printing and postage), so it is in their best interest not to send a mailer to someone who is unlikely to respond positively to a campaign.
So for example, a company sends out 1,000,000 direct mailers and gets a response …
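The economics of a campaign like this can be worked through with a little arithmetic.  The sketch below is purely illustrative: the 2% response rate is the low end of the typical range quoted above, while the cost of 0.50 per mailer is an assumed figure, not one taken from the post.

```python
def campaign_summary(mailers_sent, response_rate, cost_per_mailer):
    """Basic cost figures for a direct mail campaign."""
    responses = mailers_sent * response_rate
    total_cost = mailers_sent * cost_per_mailer
    return {
        "responses": responses,
        "total_cost": total_cost,
        # Money spent on mailers that produced no response.
        "cost_of_non_responders": total_cost * (1 - response_rate),
        # Undefined when nobody responds, so report None in that case.
        "cost_per_response": total_cost / responses if responses else None,
    }

# Assumed inputs: 1,000,000 mailers, 2% response rate, 0.50 per mailer (hypothetical).
summary = campaign_summary(1_000_000, 0.02, 0.50)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumed inputs, almost all of the budget goes to recipients who never respond, which is why it pays to predict who is likely to respond before mailing.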

An Introduction to Text Analytics

Text analytics, also referred to as text data mining or text mining, refers to the process of deriving high-quality information from text.  High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.  Text mining usually involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output.  Typical text mining tasks include text categorization, text clustering, concept / entity extraction, production of granular taxonomies, sentiment analysis, document summarization and entity relation modeling (i.e., learning relations between named entities).  The overarching goal is, essentially, to turn text into data for analysis via application of natural language processing (NLP) and analytical methods.  A typical application is to scan a set of documents written in a natural language and either model the document…
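The first step, structuring the input text, can be sketched as turning each document into word counts.  This is a minimal bag-of-words illustration under simple assumptions (lowercasing and a basic regular-expression tokenizer); the function names and sample documents are hypothetical, and real text mining tools do considerably more.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def term_frequencies(documents):
    """One Counter of word counts per document: the 'structured' form of the text."""
    return [Counter(tokenize(doc)) for doc in documents]

def document_frequency(frequencies):
    """For each word, the number of documents in which it appears at least once."""
    df = Counter()
    for counts in frequencies:
        df.update(counts.keys())
    return df

# Hypothetical sample documents.
documents = [
    "The service was great, great value.",
    "Poor service and slow delivery.",
    "Great delivery!",
]

tf = term_frequencies(documents)
df = document_frequency(tf)
print(tf[0]["great"])   # occurrences of "great" in the first document
print(df["service"])    # number of documents that mention "service"
```

Once text is in this tabular form, ordinary analytical methods such as classification or clustering can be applied to it.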

Using decision trees in evidence based medicine

In today's post, we explore the use of decision trees in evidence-based medicine.  In 1996 David Sackett wrote that "Evidence-based medicine is the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients" [Source: Wikipedia].
For our analysis, we start with a data set that contains data about a number of patients, all of whom suffered from the same illness.  Each of these patients responded well to one of five medications.  We will use a decision tree to understand what factors in each patient's history led to them responding well to one specific medication over the others.  We will then use our findings to generate a set of evidence-based rules or policies that can be followed by doctors to treat this illness in future patients.  As part of our analysis, we will also explore how to interpret decision trees.
Let us first look at our data set:

As can be seen, the data set contains information about the age a…
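To give a feel for how a decision tree decides which patient factor to split on, the sketch below scores each attribute by information gain and picks the best one.  This is an assumption made for illustration: the excerpt does not say which tree algorithm or splitting criterion the post uses, and the attribute names, drug labels and records are hypothetical rather than taken from the post's data set.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

def best_split(rows, attributes, target):
    """The attribute whose split gives the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical patient records (not the post's data).
patients = [
    {"blood_pressure": "high",   "sex": "F", "drug": "A"},
    {"blood_pressure": "high",   "sex": "M", "drug": "A"},
    {"blood_pressure": "normal", "sex": "F", "drug": "B"},
    {"blood_pressure": "normal", "sex": "M", "drug": "B"},
    {"blood_pressure": "low",    "sex": "F", "drug": "C"},
    {"blood_pressure": "low",    "sex": "M", "drug": "C"},
]

print(best_split(patients, ["blood_pressure", "sex"], "drug"))
```

In this toy data the first attribute separates the drugs perfectly while the second tells us nothing, so the tree would split on the first.  Each path from the root to a leaf can then be read off as a rule of the form "if this factor has this value, prescribe this medication".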