### Predicting payment days for accounts receivables at hospitals - part I

In today's post, we will attempt to create a model to predict payment days for accounts receivables at hospitals. In an environment where self insured and uninsured populations are increasing, liquidating accounts receivables is a major challenge. If you knew in advance how soon each person was likely to pay, decisions could be made about the time, effort and resources that needed to be expended in order collect that person's debt. Hospitals spend time and money attempting to collect debts before passing them on to debt collection agencies. A predictive model would help the hospital in determining which debts they should try to collect themselves versus which debts they should hand over directly to a debt collection agency without spending anytime on it themselves. This and similar decisions will help hospitals reduce the costs of debt collection, make the debt collection process more efficient and lead to several other benefits.

As with any model, we first start with data. The data I have is stored in an excel spreadsheet. So after opening up IBM SPSS Modeler, I first drag an Excel source node onto the modeling canvas. I attach the data file to this node and then output the results in a table node. My canvas looks like this:

The data contained in the source file can be viewed by "run"ning the table node and is as follows:

While most the fields are self explanatory, the following fields merit some explanation:

1) Primary payer: An insurance policy that pays first when a person is covered by more than one insurance plan. This column indicates the amount that the primary payer has paid.

2) Org bal: This column reflects the final amount that is due from the patient.

3) Cur bal: This column reflects any amount due after the payment has made the payment. In most cases, this amount is zero.

4) Dish date: This column reflects the date the patient was discharged and therefore the date from when the amount is due.

5) Zero bal date: This column reflects the date when the patient settled the invoice.

6) Payment days: This column should reflect the time elapsed between the dish date and the zero bal date.

I then attach a Type node to the excel source node and make changes to the field types as follows:

1) The marital status field is set to measurement type "flag". The flag measurement type is used for data with two distinct values that indicate the presence or absence of a trait, such as true and false, Yes and No or 0 and 1. The values used may vary, but one must always be designated as the "true" value, and the other as the "false" value. Data may be represented as text, integer, real number, date, time, or timestamp (Source: IBM SPSS Help). In this data set, the marital status field contains only two values: 1 or 2 which indicate whether the person is married or not.

2) Education and primary payer fields are set to measurement type "nominal". The nominal field type is used to describe data with multiple distinct values, each treated as a member of a set, such as small/medium/large. Nominal data can have any storage—numeric, string, or date/time (Source: IBM SPSS Help).

3) Employment and income fields are set to measurement type "ordinal". The ordinal field type is Used to describe data with multiple distinct values that have an inherent order. For example, salary categories or satisfaction rankings can be typed as ordinal data. The order is defined by the natural sort order of the data elements. For example, 1, 3, 5 is the default sort order for a set of integers, while HIGH, LOW, NORMAL (ascending alphabetically) is the order for a set of strings (Source: IBM SPSS Help).

4) The role of the "payment days" field is set as "Target" since that is the field that we wish to predict. All other fields are treated as "Inputs" into the prediction.

I then add an Auto Numeric modeling node to the type node and run the node. The Auto Numeric node estimates and compares models for continuous numeric range outcomes using a number of different methods, allowing you to try out a variety of approaches in a single modeling run. You can select the algorithms to use, and experiment with multiple combinations of options (Source: IBM SPSS Help). I then attach an Analysis node to the modeling nugget and obtain the following result:

Among other results, the analysis node indicates that there is a linear correlation of 0.468 between "payment days" and "$XR-payment days", which is the field generated by the predictive model. I then add a Field Reorder node to change the order in which the fields appear in the output to make the "payment days" and "$XR-payment days" fields appear next to each other at the end of the table. I also attach a Derive node to create a new field called "Variance" which calculates the variance between "payment days" and "$XR-payment days". I output these results to a table which displays as follows:

As can be seen from the variance column above, there is a significant difference between the actual number of days that the patients paid in and the days predicted by the model. Conclusion: the quality of the model is poor and can be improved.

In order to improve the predictive quality of the model, we start with the data. I attach an Auto Data Prep node to the type node and run the node such that accuracy is optimized. I then attach a type node to the auto data prep node and re-run the auto numeric node. The analysis node now provides the following result:

The analysis node now indicates that there is a linear correlation of 0.872 between "payment days" and "$XR-payment days", which is the field generated by the predictive model. This is significantly better than the 0.468 result returned by the first model. Let us review the table output from this model:

The Variance2 column also looks a lot better than the Variance1 column from the table above. Conclusion: this model is superior to the first model that we developed. Can we improve the model further?

As with any model, we first start with data. The data I have is stored in an excel spreadsheet. So after opening up IBM SPSS Modeler, I first drag an Excel source node onto the modeling canvas. I attach the data file to this node and then output the results in a table node. My canvas looks like this:

The data contained in the source file can be viewed by "run"ning the table node and is as follows:

While most the fields are self explanatory, the following fields merit some explanation:

1) Primary payer: An insurance policy that pays first when a person is covered by more than one insurance plan. This column indicates the amount that the primary payer has paid.

2) Org bal: This column reflects the final amount that is due from the patient.

3) Cur bal: This column reflects any amount due after the payment has made the payment. In most cases, this amount is zero.

4) Dish date: This column reflects the date the patient was discharged and therefore the date from when the amount is due.

5) Zero bal date: This column reflects the date when the patient settled the invoice.

6) Payment days: This column should reflect the time elapsed between the dish date and the zero bal date.

I then attach a Type node to the excel source node and make changes to the field types as follows:

1) The marital status field is set to measurement type "flag". The flag measurement type is used for data with two distinct values that indicate the presence or absence of a trait, such as true and false, Yes and No or 0 and 1. The values used may vary, but one must always be designated as the "true" value, and the other as the "false" value. Data may be represented as text, integer, real number, date, time, or timestamp (Source: IBM SPSS Help). In this data set, the marital status field contains only two values: 1 or 2 which indicate whether the person is married or not.

2) Education and primary payer fields are set to measurement type "nominal". The nominal field type is used to describe data with multiple distinct values, each treated as a member of a set, such as small/medium/large. Nominal data can have any storage—numeric, string, or date/time (Source: IBM SPSS Help).

3) Employment and income fields are set to measurement type "ordinal". The ordinal field type is Used to describe data with multiple distinct values that have an inherent order. For example, salary categories or satisfaction rankings can be typed as ordinal data. The order is defined by the natural sort order of the data elements. For example, 1, 3, 5 is the default sort order for a set of integers, while HIGH, LOW, NORMAL (ascending alphabetically) is the order for a set of strings (Source: IBM SPSS Help).

4) The role of the "payment days" field is set as "Target" since that is the field that we wish to predict. All other fields are treated as "Inputs" into the prediction.

I then add an Auto Numeric modeling node to the type node and run the node. The Auto Numeric node estimates and compares models for continuous numeric range outcomes using a number of different methods, allowing you to try out a variety of approaches in a single modeling run. You can select the algorithms to use, and experiment with multiple combinations of options (Source: IBM SPSS Help). I then attach an Analysis node to the modeling nugget and obtain the following result:

Among other results, the analysis node indicates that there is a linear correlation of 0.468 between "payment days" and "$XR-payment days", which is the field generated by the predictive model. I then add a Field Reorder node to change the order in which the fields appear in the output to make the "payment days" and "$XR-payment days" fields appear next to each other at the end of the table. I also attach a Derive node to create a new field called "Variance" which calculates the variance between "payment days" and "$XR-payment days". I output these results to a table which displays as follows:

As can be seen from the variance column above, there is a significant difference between the actual number of days that the patients paid in and the days predicted by the model. Conclusion: the quality of the model is poor and can be improved.

In order to improve the predictive quality of the model, we start with the data. I attach an Auto Data Prep node to the type node and run the node such that accuracy is optimized. I then attach a type node to the auto data prep node and re-run the auto numeric node. The analysis node now provides the following result:

The analysis node now indicates that there is a linear correlation of 0.872 between "payment days" and "$XR-payment days", which is the field generated by the predictive model. This is significantly better than the 0.468 result returned by the first model. Let us review the table output from this model:

The Variance2 column also looks a lot better than the Variance1 column from the table above. Conclusion: this model is superior to the first model that we developed. Can we improve the model further?