Predicting payment days for accounts receivables at hospitals - part I
As with any model, we first start with data. The data I have is stored in an excel spreadsheet. So after opening up IBM SPSS Modeler, I first drag an Excel source node onto the modeling canvas. I attach the data file to this node and then output the results in a table node. My canvas looks like this:
While most the fields are self explanatory, the following fields merit some explanation:
1) Primary payer: An insurance policy that pays first when a person is covered by more than one insurance plan. This column indicates the amount that the primary payer has paid.
2) Org bal: This column reflects the final amount that is due from the patient.
3) Cur bal: This column reflects any amount due after the payment has made the payment. In most cases, this amount is zero.
4) Dish date: This column reflects the date the patient was discharged and therefore the date from when the amount is due.
5) Zero bal date: This column reflects the date when the patient settled the invoice.
6) Payment days: This column should reflect the time elapsed between the dish date and the zero bal date.
I then attach a Type node to the excel source node and make changes to the field types as follows:
1) The marital status field is set to measurement type "flag". The flag measurement type is used for data with two distinct values that indicate the presence or absence of a trait, such as true and false, Yes and No or 0 and 1. The values used may vary, but one must always be designated as the "true" value, and the other as the "false" value. Data may be represented as text, integer, real number, date, time, or timestamp (Source: IBM SPSS Help). In this data set, the marital status field contains only two values: 1 or 2 which indicate whether the person is married or not.
2) Education and primary payer fields are set to measurement type "nominal". The nominal field type is used to describe data with multiple distinct values, each treated as a member of a set, such as small/medium/large. Nominal data can have any storage—numeric, string, or date/time (Source: IBM SPSS Help).
3) Employment and income fields are set to measurement type "ordinal". The ordinal field type is Used to describe data with multiple distinct values that have an inherent order. For example, salary categories or satisfaction rankings can be typed as ordinal data. The order is defined by the natural sort order of the data elements. For example, 1, 3, 5 is the default sort order for a set of integers, while HIGH, LOW, NORMAL (ascending alphabetically) is the order for a set of strings (Source: IBM SPSS Help).
4) The role of the "payment days" field is set as "Target" since that is the field that we wish to predict. All other fields are treated as "Inputs" into the prediction.
I then add an Auto Numeric modeling node to the type node and run the node. The Auto Numeric node estimates and compares models for continuous numeric range outcomes using a number of different methods, allowing you to try out a variety of approaches in a single modeling run. You can select the algorithms to use, and experiment with multiple combinations of options (Source: IBM SPSS Help). I then attach an Analysis node to the modeling nugget and obtain the following result:
Among other results, the analysis node indicates that there is a linear correlation of 0.468 between "payment days" and "$XR-payment days", which is the field generated by the predictive model. I then add a Field Reorder node to change the order in which the fields appear in the output to make the "payment days" and "$XR-payment days" fields appear next to each other at the end of the table. I also attach a Derive node to create a new field called "Variance" which calculates the variance between "payment days" and "$XR-payment days". I output these results to a table which displays as follows:
As can be seen from the variance column above, there is a significant difference between the actual number of days that the patients paid in and the days predicted by the model. Conclusion: the quality of the model is poor and can be improved.
In order to improve the predictive quality of the model, we start with the data. I attach an Auto Data Prep node to the type node and run the node such that accuracy is optimized. I then attach a type node to the auto data prep node and re-run the auto numeric node. The analysis node now provides the following result:
The analysis node now indicates that there is a linear correlation of 0.872 between "payment days" and "$XR-payment days", which is the field generated by the predictive model. This is significantly better than the 0.468 result returned by the first model. Let us review the table output from this model: