Predicting academic performance of students
I recently obtained data about student registrations, demographics and grades from a university, with the objective of building a predictive model of academic performance. Using IBM SPSS Modeler, I first prepared the data for modeling. Because the information about registrations, academic performance, demographics, etc. was spread across separate CSV files, I used the inner join by keys option in the Merge node to create a single comprehensive file with all the relevant information. Since there were upwards of 600,000 records, I used the Sample node to draw a random sample of 5% of the records with which to build my model.
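For readers who do not use SPSS Modeler, the merge and sampling steps can be sketched in plain Python. This is only an illustration of the idea, not what the Merge and Sample nodes do internally; the field names (student_id, course, age) and the records are hypothetical and do not come from the university data.

```python
import random

def inner_join(left, right, key):
    """Join two lists of dict records, keeping only keys present in both."""
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # Records with no match on the other side are dropped (inner join).
        for match in right_by_key.get(row[key], []):
            joined.append({**row, **match})
    return joined

def sample_fraction(records, fraction, seed=0):
    """Draw a random sample of roughly the given fraction, without replacement."""
    if not records:
        return []
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    k = min(len(records), max(1, round(len(records) * fraction)))
    return rng.sample(records, k)

# Hypothetical records standing in for two of the CSV files.
registrations = [
    {"student_id": 1, "course": "MATH101"},
    {"student_id": 2, "course": "HIST210"},
    {"student_id": 3, "course": "CHEM120"},
]
demographics = [
    {"student_id": 1, "age": 19},
    {"student_id": 2, "age": 23},
    {"student_id": 4, "age": 31},
]

merged = inner_join(registrations, demographics, "student_id")
sample = sample_fraction(list(range(1000)), 0.05)
```

Students 3 and 4 appear in only one of the two files, so the inner join drops them. That behavior is worth keeping in mind when the source files do not cover exactly the same students.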
I used a field called "Final Grade" as the determinant of academic performance. This field held a number of different grades (A, B, F, etc.) representing different levels of student performance. In order to simplify my analysis, I used the Derive node to create a new field called "PassedCourse" as follows:
Based on the formula above, the new field PassedCourse returns True when the Final Grade field contains A, B or C; otherwise it returns False.
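The same rule can be written as a short Python function. This is a sketch of the logic described above, not the Derive node expression itself. It assumes the grade is stored as exactly "A", "B" or "C"; variants such as "B+" would need extra handling. The sample records are made up.

```python
# Grades treated as a pass, per the rule described above.
PASSING_GRADES = {"A", "B", "C"}

def passed_course(final_grade):
    """Return True for a passing grade (A, B or C), otherwise False."""
    return final_grade in PASSING_GRADES

# Hypothetical records; only the field names mirror the article.
records = [
    {"Final Grade": "A"},
    {"Final Grade": "D"},
    {"Final Grade": "C"},
    {"Final Grade": "F"},
]
for record in records:
    record["PassedCourse"] = passed_course(record["Final Grade"])
```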
I then attached a Type node to set the new PassedCourse field to "Target" and the Final Grade field to "None", since I was going to use the PassedCourse field as my target instead of the Final Grade field.
Given that the comprehensive file contained a number of fields that I deemed irrelevant for my predictions (such as mode of registration, the course code, the subject code, etc), I used a Filter node to remove these fields. I could also have used the Feature Selection node to help determine which fields were relevant to my prediction.
In order to clean and transform the data, I used the Auto Data Prep node, after which my data was ready for modeling. I used the Auto Classifier node to create my first predictive model with this data. This node can be used for any data where the Target is a Flag field and there is at least one input field. The results showed that the two best models by overall accuracy were CHAID and C5.0. I chose the CHAID model since it had the higher accuracy of the two. The final modeling canvas was as follows:
Having created the model, I used an Analysis node to determine the quality of the model. The Analysis node showed the model had a reasonably high degree of predictive accuracy: it correctly predicted the academic performance of students approximately 85% of the time. That means that for a new student enrolling in a particular course, the model could be expected to predict whether that student would pass or fail the course with about 85% accuracy. Returning to my earlier example, an Admissions Committee could predict with about 85% accuracy how a student would fare in the program based on his or her profile. Some caveats: obviously the quality of the prediction depends on the quality of the data. Also, the utility of a model depends on the circumstances in which it is being used, so someone who understands the business and the context must be involved in interpreting the results.
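To make the accuracy figure concrete, the sketch below shows how overall accuracy is computed: the share of records where the predicted value of PassedCourse matches the actual one. It also counts each (actual, predicted) combination. The ten outcomes are invented for illustration and are unrelated to the 85% result reported above.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of records where the prediction matches the actual outcome."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Invented pass/fail outcomes for ten students.
actual    = [True, True, True,  False, True, False, True, True, False, True]
predicted = [True, True, False, False, True, True,  True, True, False, True]

overall = accuracy(actual, predicted)

# Counts of each (actual, predicted) pair show where the errors fall.
counts = Counter(zip(actual, predicted))
```

Looking at the pair counts alongside the overall figure shows whether the errors are mostly students predicted to pass who failed, or the reverse. That distinction can matter more than the headline accuracy, depending on how the model is used.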
A brief note on the two data mining algorithms referred to above:
CHAID, or Chi-squared Automatic Interaction Detection, is a classification method for building decision trees by using chi-square statistics to identify optimal splits. CHAID first examines the crosstabulations between each of the input fields and the outcome, and tests for significance using a chi-square independence test. If more than one of these relations is statistically significant, CHAID will select the input field that is the most significant (smallest p value). If an input has more than two categories, these are compared, and categories that show no differences in the outcome are collapsed together. This is done by successively joining the pair of categories showing the least significant difference. This category-merging process stops when all remaining categories differ at the specified testing level (Source: IBM SPSS Modeler Help).
The chi-squared distribution (also chi-square or χ²-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables. It is one of the most widely used probability distributions in inferential statistics, e.g., in hypothesis testing or in construction of confidence intervals. The chi-squared distribution is used in the common chi-squared tests for goodness of fit of an observed distribution to a theoretical one, the independence of two criteria of classification of qualitative data, and in confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation (Source: Wikipedia).
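The chi-square independence test that CHAID applies to each crosstabulation can be sketched in a few lines of Python. This is a minimal illustration of the test only, not of the full CHAID procedure (it does no category merging or tree building). The crosstab counts and the input field they represent are hypothetical. The p-value helper is valid only for one degree of freedom, i.e. a 2 x 2 table.

```python
import math

def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a crosstab of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two fields were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def p_value_one_dof(stat):
    """Upper-tail p-value of a chi-squared statistic with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical crosstab: rows are two categories of an input field,
# columns are counts of students who passed and who failed.
crosstab = [
    [30, 10],
    [20, 40],
]
stat, dof = chi_square_statistic(crosstab)
p_value = p_value_one_dof(stat)
```

A small p-value suggests that the input field and the outcome are not independent. Among several candidate fields, the description above says CHAID splits on the one with the smallest p-value.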
C5.0 is a sophisticated data mining tool for discovering patterns that delineate categories, assembling them into classifiers, and using them to make predictions (Source: rulequest.com). A C5.0 model works by splitting the sample based on the field that provides the maximum information gain. Each subsample defined by the first split is then split again, usually based on a different field, and the process repeats until the subsamples cannot be split any further. Finally, the lowest-level splits are reexamined, and those that do not contribute significantly to the value of the model are removed or pruned (Source: IBM SPSS Modeler Help).
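The idea of choosing the split with the maximum information gain can be illustrated as follows. This sketch computes plain entropy-based information gain for each candidate field. It is not the C5.0 implementation, which applies further refinements and pruning. The field names (attendance, campus) and the four records are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, field, target):
    """Reduction in entropy of the target after splitting the rows on a field."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).append(row[target])
    # Weighted average entropy of the subsamples produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical records with two candidate input fields.
rows = [
    {"attendance": "high", "campus": "north", "PassedCourse": True},
    {"attendance": "high", "campus": "south", "PassedCourse": True},
    {"attendance": "low",  "campus": "north", "PassedCourse": False},
    {"attendance": "low",  "campus": "south", "PassedCourse": False},
]

gains = {f: information_gain(rows, f, "PassedCourse") for f in ("attendance", "campus")}
best_field = max(gains, key=gains.get)
```

In this toy data, attendance separates the passes from the fails perfectly, while campus tells us nothing about the outcome, so the first split would be on attendance.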