INTRODUCTION
There is a large variety of quantitative techniques available to assist in the
development of mathematical models. But the seasoned practitioner
understands that they all do basically the same thing: they help us search
for a set of variables, weights and operators in the form of an equation.
When that equation is applied to a set of decision data, it enhances the
performance of our decision making.
The algorithms behind
our model development effort are seeking those variables that have
information content relative to the goals we have defined. Our data, and
the information content it contains, is the source of enhanced performance.
Successful
practitioners typically spend 75% to 80% of their overall modeling effort
preparing data. These efforts deal with issues such as understanding the
context of the available data fields, handling of missing data, identifying
and correcting data errors, identification and representation of interaction
effects between variables, mathematical transformation of data to obtain
different perspectives on the information content, and data representation
schemes appropriate for the type of data being utilized.
Practitioners new to
predictive analytics often overlook this last issue. The physical
representation of the data in their data set can often have significant
impact on the information content presented to the modeling technique. This
article presents a brief discussion comparing two approaches: common data
representation, and an enhanced approach for certain types of data.
DATA TYPES
Just as quantitative techniques
have strengths and weaknesses, so does our data. When considering the
context of our data, it is also important to understand the mathematical
capabilities of our data. It is trivial to point out that the mean and
standard deviation of a variable such as zip code are meaningless at best.
However, many practitioners overlook more serious considerations and
miss important data representation issues as a result.
Each variable in your
data set should be clearly identified as being either quantitative or
qualitative in nature. The characteristic of importance here is ‘order’.
There is no inherent order in a qualitative variable. Quantitative
variables, on the other hand, have an underlying order. It is beyond the
scope of this article to consider the types of mathematics that are
appropriate for the various types of quantitative variables (ordinal,
interval and continuous). Rather, we will focus on the implications of the
characteristic of ‘order’, and data representation schemes that are of use
to enhance the extraction of information content.
QUALITATIVE VARIABLES
A qualitative variable is
typically a variable that describes a set of categories. The
variable will have two or more values, each representing a category meeting
a particular set of conditions. An example of a qualitative variable is
marital_status.
For this discussion,
let’s assume that marital_status has the following values:
| Marital Status |
| Married        |
| Single         |
| Divorced       |
| Widowed        |
| Separated      |
| Other          |
The values of the
variable marital_status have no relative order. We can easily
rearrange them in any other order with no impact on the information content.
However, from a
predictive analytics perspective, we still have many questions that need to
be addressed for a field of this type.
- Are the values exhaustive? Have they captured all possible
  circumstances? The value ‘Other’ takes care of this for us.
- Are the values mutually exclusive? The fact that this is an individual’s
  current status, rather than any value that may ever have applied,
  suggests exclusivity. However, the values as stated treat ‘Separated’
  as mutually exclusive from ‘Married’. Is that the context that is truly
  desired?
- Should this variable be represented in our modeling data as one variable
  with six values, or are there other alternatives that should be
  considered?
- Do we need six values for the variable marital_status?
COLLAPSING VALUES
For marital_status, we have identified six values. Is this the appropriate
number of categories? It is important to understand that there is no “right”
answer to this question in general. The answer is always contingent on the
context of usage. For some decision environments, six values is going to be
the most appropriate representation. For others, we should ask:
- Is it sufficient to use only the values ‘Married’ and ‘Other’?
- Do we need to combine ‘Divorced’ and ‘Widowed’? Is it sufficient to
  know only that the individual was at one time married, but no longer is?
- Should we combine ‘Single’, ‘Divorced’ and ‘Widowed’? Is it sufficient
  to know that the person is not currently married?
These are empirical questions. They can only be answered in the context of
the particular decision environment we are exploring. How many values to
use, and how to collapse the values, are best answered by testing each of
the combinations and measuring the impact that the representation has on
performance.
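As a minimal sketch of what testing each combination might start from, the
collapsing questions above can be written as alternative mappings over the
six values. The scheme names and the group labels (‘Formerly Married’, ‘Not
Married’) are hypothetical, chosen only for illustration; measuring the
impact of each scheme on model performance is left to the decision
environment in question.

```python
# Hypothetical collapsing schemes for marital_status. The groupings are
# illustrative assumptions, not recommendations from the article.
SCHEMES = {
    "six_values": {},  # no collapsing: every value maps to itself
    "married_vs_other": {
        "Single": "Other",
        "Divorced": "Other",
        "Widowed": "Other",
        "Separated": "Other",
    },
    "formerly_married": {
        "Divorced": "Formerly Married",
        "Widowed": "Formerly Married",
    },
    "not_currently_married": {
        "Single": "Not Married",
        "Divorced": "Not Married",
        "Widowed": "Not Married",
    },
}


def collapse(values, mapping):
    """Replace each value by its collapsed category; unmapped values are kept."""
    return [mapping.get(v, v) for v in values]


sample = ["Married", "Single", "Divorced", "Widowed", "Separated", "Other"]
for name, mapping in SCHEMES.items():
    collapsed = collapse(sample, mapping)
    print(f"{name}: {len(set(collapsed))} distinct values -> {collapsed}")
```

Each scheme produces a candidate version of the field. Which one to keep
remains an empirical question.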
DATA REPRESENTATION ALTERNATIVES
We must also consider the impact
of different data representation schemes. In this case there are two
alternatives:
- A single variable with six values, as above, and
- Six variables, one for each of the values, using a binary representation
  for each, commonly referred to as a 1 of N representation.
|           | Married | Single | Divorced | Widowed | Separated | Other |
| Married   | 1       | 0      | 0        | 0       | 0         | 0     |
| Single    | 0       | 1      | 0        | 0       | 0         | 0     |
| Divorced  | 0       | 0      | 1        | 0       | 0         | 0     |
| Widowed   | 0       | 0      | 0        | 1       | 0         | 0     |
| Separated | 0       | 0      | 0        | 0       | 1         | 0     |
| Other     | 0       | 0      | 0        | 0       | 0         | 1     |
The 1 of N
representation allows for more flexibility. Some of our modeling techniques
may identify relationships differently than others. Some may focus only on
one of the values. Others may use more than one, but not all of the
values. Still others may use all six values. This inherent flexibility
makes the 1 of N representation appropriate for virtually all qualitative
variables.
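The 1 of N representation described above can be sketched in a few lines of
plain Python. The function name one_of_n and the constant MARITAL_VALUES are
assumptions made for this illustration; most modeling tools provide their
own equivalent.

```python
# The six values from the marital_status example, in column order.
MARITAL_VALUES = ["Married", "Single", "Divorced", "Widowed", "Separated", "Other"]


def one_of_n(value, categories):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Print one row per value, mirroring the 1 of N table.
for v in MARITAL_VALUES:
    print(f"{v:<10}", one_of_n(v, MARITAL_VALUES))
```

Every row contains exactly one 1, so each of the six new variables can be
used or ignored by a modeling technique independently of the others.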
QUANTITATIVE DATA
Let’s explore another example:
Education_Level.
| Education_Level     |
| < High School       |
| High School         |
| Some College        |
| Bachelor’s Degree   |
| > Bachelor’s Degree |
Education_Level is an example of
quantitative data. While it isn’t represented by numeric values, ‘order’ is
a significant characteristic. This is, in fact, an ordinal variable. It
would be inappropriate to perform arithmetic on these values, such as
computing a mean, even if the data were represented numerically, since the
intervals between the values are not consistent.
Just as we considered
collapsing the values of the variable Marital_Status above, the same
considerations apply here. The number of values appropriate for
Education_Level is determined purely by empirical testing in the
decision environment in which we are working.
The data
representation issues are also similar. We can obtain a number of
advantages by using a 1 of N representation for Education_Level.
|                     | < High School | High School | Some College | Bachelor’s Degree | > Bachelor’s Degree |
| < High School       | 1             | 0           | 0            | 0                 | 0                   |
| High School         | 0             | 1           | 0            | 0                 | 0                   |
| Some College        | 0             | 0           | 1            | 0                 | 0                   |
| Bachelor’s Degree   | 0             | 0           | 0            | 1                 | 0                   |
| > Bachelor’s Degree | 0             | 0           | 0            | 0                 | 1                   |
While this 1 of N
representation allows for the flexibility advantages discussed above, it
does not capture the ‘order’ characteristics of the variable
Education_Level. If this representation were used as an output
variable, for instance, your answers would either be correct or incorrect.
You would be unable to assess the degree of incorrectness, as the data
representation scheme does not capture that information.
On the other hand,
consider a different representation scheme: a Thermometer Representation.
|                     | < High School | High School | Some College | Bachelor’s Degree | > Bachelor’s Degree |
| < High School       | 1             | 0           | 0            | 0                 | 0                   |
| High School         | 1             | 1           | 0            | 0                 | 0                   |
| Some College        | 1             | 1           | 1            | 0                 | 0                   |
| Bachelor’s Degree   | 1             | 1           | 1            | 1                 | 0                   |
| > Bachelor’s Degree | 1             | 1           | 1            | 1                 | 1                   |
The logic of a
Thermometer Representation is very straightforward. An individual in the
category High_School has all of the attributes of someone in
the category <High_School, plus something else. An
individual in the category Some_College has all of the
attributes of someone in the category High_School, plus
something else. And so on.
The Thermometer
Representation allows us to capture ‘order’ in our values and, as a result,
allows us to consider degree of incorrectness.
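To make ‘degree of incorrectness’ concrete, here is a minimal sketch that
encodes Education_Level both ways and compares a true value with two wrong
predictions. The article does not name a measure of disagreement; counting
the positions where two encodings differ (Hamming distance) is an assumption
made here for illustration, as are the function names.

```python
# Education levels from lowest to highest, in column order.
EDUCATION_LEVELS = [
    "< High School",
    "High School",
    "Some College",
    "Bachelor's Degree",
    "> Bachelor's Degree",
]


def one_of_n(value, levels):
    """1 of N: a single 1 in the position of `value`."""
    return [1 if value == lv else 0 for lv in levels]


def thermometer(value, levels):
    """Thermometer: a 1 for the level itself and for every level below it."""
    rank = levels.index(value)
    return [1 if i <= rank else 0 for i in range(len(levels))]


def hamming(a, b):
    """Count the positions where two encodings differ (assumed error measure)."""
    return sum(x != y for x, y in zip(a, b))


actual = "High School"
for predicted in ["Some College", "> Bachelor's Degree"]:
    d_1n = hamming(one_of_n(actual, EDUCATION_LEVELS),
                   one_of_n(predicted, EDUCATION_LEVELS))
    d_th = hamming(thermometer(actual, EDUCATION_LEVELS),
                   thermometer(predicted, EDUCATION_LEVELS))
    print(f"{actual} vs {predicted}: 1 of N = {d_1n}, thermometer = {d_th}")
```

Under 1 of N, both wrong predictions are the same distance (2) from the true
value. Under the Thermometer Representation, the near miss scores 1 and the
far miss scores 3, so the encoding distinguishes small errors from large ones.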
While it would be
physically possible to use a Thermometer Representation on the
Marital_Status variable, discussed above, it would not make sense to
do so. A qualitative variable has no ‘order’. On the other hand,
restricting our data representation method for a quantitative variable to a
1 of N representation misses an important characteristic of the information
content available.
It is worth noting
that a Thermometer Representation also allows us to control the direction of
error. In the representation above, the logic reinforces the building of
levels from the bottom up. As a result, this representation scheme will have
a tendency to underestimate the value.
Is this what we want?
Again, it depends. If we are in a decision environment where we would
prefer to have overestimation when we are incorrect, we simply need to
invert the Thermometer Representation to achieve that result.
|                     | < High School | High School | Some College | Bachelor’s Degree | > Bachelor’s Degree |
| < High School       | 1             | 1           | 1            | 1                 | 1                   |
| High School         | 0             | 1           | 1            | 1                 | 1                   |
| Some College        | 0             | 0           | 1            | 1                 | 1                   |
| Bachelor’s Degree   | 0             | 0           | 0            | 1                 | 1                   |
| > Bachelor’s Degree | 0             | 0           | 0            | 0                 | 1                   |
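The inverted table can be generated with a one-line change to the encoding
rule: set a 1 for the level itself and every level above it, rather than
below it. This sketch only reproduces the encoding. It does not demonstrate
the direction-of-error effect, which would have to be confirmed by testing
in the decision environment. The function name is an assumption made for
illustration.

```python
# Education levels from lowest to highest, in column order.
EDUCATION_LEVELS = [
    "< High School",
    "High School",
    "Some College",
    "Bachelor's Degree",
    "> Bachelor's Degree",
]


def inverted_thermometer(value, levels):
    """Inverted thermometer: a 1 for the level itself and every level above it."""
    rank = levels.index(value)
    return [1 if i >= rank else 0 for i in range(len(levels))]


# Print one row per level, mirroring the inverted table.
for level in EDUCATION_LEVELS:
    print(f"{level:<20}", inverted_thermometer(level, EDUCATION_LEVELS))
```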
CONCLUSION
Take the time to carefully
consider the attributes of your data fields. Creatively match data
representation schemes with the characteristics of the variable in use.
This effort can have a dramatic impact on the performance of your models.
Enhanced model
performance comes from extracting as much information content as possible…
relative to the specific performance metrics you are using to measure
success.
ABOUT THE AUTHOR
THOMAS A. "TONY" RATHBURN is a Senior Consultant with The Modeling Agency.
Tony has worked with commercial and government clients to develop data mining
solutions to significant business applications since the mid-1980s. Mr.
Rathburn delivers custom workshops and keynote presentations, and consults on
a wide range of commercial assignments, many involving predictive CRM
analytics. He holds extensive data mining experience in the banking,
insurance, and financial industries.
Mr. Rathburn’s experience
includes seven years teaching MIS and Statistics at both the graduate and
undergraduate level as an instructor in the College of Business at Kent
State University. In addition to his teaching background, Tony has a broad
range of practical experience. His consulting expertise
has been concentrated in the business utilization of advanced knowledge
discovery techniques. He served as Vice President of Applied Technologies
for NeuralWare, Inc., a neural network tools and consulting company. He was
also the Research Coordinator for LakeShore Trading, Inc., a successful
futures and options trading firm on the Chicago Board of Trade. Tony may be
reached at tony@the-modeling-agency.com
All Rights Reserved by The Modeling Agency.
Copyright © 2007