Posts

Showing posts from 2011

Data understanding part 1

In today's post we will focus on "data understanding", a crucial phase of any data mining project. Data understanding comes immediately after business understanding in the CRISP-DM methodology. Per the IBM SPSS Modeler Help, the data understanding phase of CRISP-DM involves taking a closer look at the data available for mining: accessing the data and exploring it using tables and graphics. This enables you to assess the quality of the data and describe the results of these steps in the project documentation. To get started, I used a CSV file that was sent to me recently. I dragged a Var. File node onto the modeling canvas, attached the CSV file to that node and then output the results into a Table node. On reviewing the results, it was clear that the CSV format was not working as desired - the data was not coming through in the correct columns as in the source file. I then saved the source file as an Excel Workbook (2007, 2010) and repeat
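Outside Modeler, the same read-then-preview check can be sketched in Python with pandas. This is only an illustrative analogue of the Var. File node and Table node, using a small inline sample in place of the actual CSV file (which is not shown in the post):

```python
import io
import pandas as pd

# Inline sample standing in for the CSV file (hypothetical field names).
raw = io.StringIO(
    "age,income,payment_days\n"
    "34,52000,45\n"
    "61,38000,90\n"
)

df = pd.read_csv(raw, sep=",")

# Quick sanity check, analogous to previewing the data in a Table node:
# if everything landed in a single column, the delimiter is wrong.
print(df.shape)
print(df.columns.tolist())

# If columns come through misaligned, one option is to let pandas
# sniff the delimiter: pd.read_csv("file.csv", sep=None, engine="python")
```

If the preview shows one wide column instead of the expected fields, the delimiter setting is the first thing to revisit, which mirrors the column problem described above.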

SPSS Deployment: Collaboration & Deployment Services

SPSS Deployment is all about operationalizing the predictive models developed using SPSS Modeler. Broadly, this includes Collaboration & Deployment Services (C&DS) and Decision Management (DM). In today's blog, we will focus on C&DS. C&DS provides a secure foundation for analytics. It provides the technology infrastructure to manage analytic assets (predictive models), share them securely throughout the enterprise and automate processes. This enables an organization to make better decisions consistently. There are three aspects to C&DS: 1) Collaborate - C&DS makes it possible to share and re-use assets efficiently, protect them in ways that meet internal and external compliance requirements, and publish results so that a greater number of business users can view and interact with them. 2) Automate - Automation enables the organization to make analytics a core component of its daily decision-making processes. You can construct flexi

Scoring a model in SPSS

In the previous post, "Predicting payment days for accounts receivables at hospitals", we created several model nuggets that could predict payment days based on inputs such as age, income, employment, marital status, etc., and on the observed values of the actual number of days that were taken for payment. In today's post, we will use one of those model nuggets to predict the number of days for a new set of data. This process, called "scoring the model" in SPSS, is really the main objective of the entire modeling exercise. In order to do this, we first need a new set of data that contains the same fields as the data that was used in developing the model. An easy way to test this is to use the same file that was used to train the model, but with the observed data removed from the file. From the image below, you will see that the field "payment days" is missing: We then run this data through the Auto Data Prep node. This is important because we used
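The train-then-score pattern described here is not specific to SPSS. A minimal sketch in Python with scikit-learn, using made-up numbers in place of the hospital data and a plain linear model in place of the model nugget, shows the same idea: fit on records that include the observed target, then predict for new records that carry the same input fields but no target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: two input fields (stand-ins for age, income)
# and the observed target "payment days" used to train the model.
rng = np.random.default_rng(0)
X_train = rng.uniform(20, 80, size=(100, 2))
y_train = 0.5 * X_train[:, 0] + 0.1 * X_train[:, 1] + rng.normal(0, 1, 100)

model = LinearRegression().fit(X_train, y_train)

# "Scoring": new records with the same fields, but no observed target.
X_new = np.array([[45.0, 60.0], [30.0, 25.0]])
predicted_days = model.predict(X_new)
print(predicted_days)
```

The key constraint is the one stated above: the scoring data must contain the same input fields, in the same form, as the data the model was trained on.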

Predicting payment days for accounts receivables at hospitals - part II

In part I of this post, we were able to improve the accuracy of the predictive model by running the data through the auto data prep node.  As a result, we were able to obtain a linear correlation of 0.872 between "payment days" and "$XR-payment days".  In this post, we will try to further improve the accuracy of the model. We first attach an Anomaly Detection node to the type node.  Anomaly detection models are used to identify outliers, or unusual cases, in the data.  Unlike other modeling methods that store rules about unusual cases, anomaly detection models store information on what normal behavior looks like.  This makes it possible to identify outliers even if they do not conform to any known pattern, and it can be particularly useful in applications, such as fraud detection, where new patterns may constantly be emerging.  Anomaly detection is an unsupervised method, which means that it does not require a training dataset containing known cases of fraud to us
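The unsupervised flavor of anomaly detection described above can be sketched outside SPSS with scikit-learn's IsolationForest, a different algorithm from Modeler's Anomaly Detection node but the same idea: learn what normal looks like from unlabeled data, then flag records that deviate from it. The data here is fabricated for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical records: a cluster of "normal" behavior plus two planted outliers.
rng = np.random.default_rng(1)
normal = rng.normal(loc=50, scale=5, size=(200, 2))
outliers = np.array([[120.0, 5.0], [0.0, 200.0]])
X = np.vstack([normal, outliers])

# Unsupervised: no labeled fraud cases are required to fit the detector.
detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = detector.predict(X)  # -1 = anomaly, 1 = normal

print((labels == -1).sum(), "records flagged as anomalous")
```

Because nothing about the outliers was labeled in advance, this mirrors the point above: unusual cases are caught even when they match no previously known pattern.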

Predicting payment days for accounts receivables at hospitals - part I

In today's post, we will attempt to create a model to predict payment days for accounts receivables at hospitals. In an environment where self-insured and uninsured populations are increasing, liquidating accounts receivables is a major challenge. If you knew in advance how soon each person was likely to pay, decisions could be made about the time, effort and resources that needed to be expended in order to collect that person's debt. Hospitals spend time and money attempting to collect debts before passing them on to debt collection agencies. A predictive model would help the hospital determine which debts it should try to collect itself versus which debts it should hand over directly to a debt collection agency without spending any time on them itself. This and similar decisions would help hospitals reduce the costs of debt collection, make the debt collection process more efficient and lead to several other benefits. As with any model, we first start with dat

Why neural net models are great at making predictions when the exact nature of the relationship between inputs and the output is not known

In today's post, we explore the use of data mining algorithms in creating a SKU (Stock Keeping Unit) level sales forecast. There are several ways of creating a sales forecast, including time series forecasting, simulation and scenario building. Setting these methods aside, we will instead use a neural net model to create a SKU level sales forecast. We will then test the accuracy of the forecast against actual data and try to explain why neural net models are great at making predictions when the exact nature of the relationship between inputs and the output is not known. This is important because in developing our model, in addition to historical sales data, we will use data about customer demographics as well as unemployment and inflation data. At the outset, while high-level conclusions can be drawn about the impact that demographics and other data can have on the sales forecast, the exact nature of the relationship between these inputs (historical sales data, demographics, infl
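The claim above - that a neural net can learn an input-output mapping without being told its functional form - can be illustrated with a small scikit-learn sketch. The data is synthetic (a nonlinear mix of three inputs standing in for demographics, unemployment and inflation), so this demonstrates the principle rather than the actual forecast built in the post:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic data: "sales" driven by a nonlinear combination of three inputs.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(500, 3))
y = 100 * np.sin(3 * X[:, 0]) + 50 * X[:, 1] * X[:, 2]

# The net is never told the sin() / product structure; it learns the
# mapping from examples alone.
net = MLPRegressor(hidden_layer_sizes=(32, 32), solver="lbfgs",
                   max_iter=5000, random_state=0).fit(X, y)
print("R^2 on training data:", round(net.score(X, y), 3))
```

A high fit on a relationship the model was never shown explicitly is exactly the property that makes neural nets attractive when the input-output relationship is unknown.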

Predicting academic performance of students

Academic performance of students in schools and colleges is an important factor in determining their overall success and sustainability. Traditionally, schools and colleges have measured this after the fact, i.e., after students go through tests, exams, etc. and are assigned grades based on their performance. With data mining tools, schools and colleges could predict the academic performance of students before the fact. By mining data about the historical performance of students, their demographics, etc., educational institutions could create predictive models to determine whether a specific student (with a unique profile) is likely to pass or fail an exam. Universities offering admission to prospective students could determine beforehand whether a student is likely to succeed in the program they propose to enroll in. This could be used by admissions committees to improve the quality of the students they admit, and therefore improve the standing of the school. I re
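The pass/fail prediction described here is a binary classification problem. As a minimal sketch (with fabricated student records and hypothetical fields - prior GPA and attendance rate - in place of real institutional data), a logistic regression makes the "before the fact" prediction concrete:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical student records: prior GPA and attendance rate,
# with a made-up pass/fail outcome (1 = pass).
rng = np.random.default_rng(3)
gpa = rng.uniform(1.0, 4.0, 300)
attendance = rng.uniform(0.4, 1.0, 300)
X = np.column_stack([gpa, attendance])
y = ((gpa + 2 * attendance) > 4.0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Predict before the fact for a new applicant profile.
new_student = np.array([[3.5, 0.9]])
print("pass probability:", round(clf.predict_proba(new_student)[0, 1], 2))
```

An admissions committee would use the predicted probability, not just the hard label, so that borderline applicants can be reviewed rather than auto-rejected.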