direct
281.667.4200

training
888.742.2454

fax
281.652.5721


 
The Modeling Agency Quarterly Newsletter
2008-Q2 Release
 

[ April 15, 2008  |  This Edition: ]

1.  Training Schedule Update:  Want to learn how to get started in predictive analytics and take the spike out of the learning curve? Attend TMA's "Data Mining: Levels I, II & III" in Washington, DC, June 2 - 6, 2008, or San Diego, CA, September 29 - October 3, 2008

2.  Feature Article:  "A Conceptual Foundation for the Formulation of Business Predictive Analytics Projects", Thomas A. "Tony" Rathburn, Senior Consultant, The Modeling Agency

3.  Announcement:  DM Radio News, Next Topic: "Putting the Context Around Text Mining", TMA's Dean Abbott contributes

4.  Announcement:  The Data Warehousing Institute's World Conference in Chicago, Illinois, May 11 - 16, 2008

5.  Newsletter Summary

 
 

1.  TRAINING SCHEDULE UPDATE 

 

  
LEARN HOW EXPERTS MINE DATA IN WASHINGTON, DC OR SAN DIEGO, CA

The next offering of The Modeling Agency's vendor-neutral, application-oriented data mining courses is scheduled for June 2 to 6, 2008 in Washington, DC and September 29 to October 3, 2008 in San Diego, CA.  Participants will enjoy a balanced, broad and non-promotional presentation of predictive modeling without restriction to a particular tool, method or product.

 

Attendees will learn about data mining capabilities, limitations, best practices, strategies, methods, tools, techniques and applications while enjoying all the entertainment and seasonal weather that the host city has to offer.  Those in attendance will leave with a comprehensive binder of notes, illustrations and references to valuable resources.  Don't leave a powerful competitive advantage untapped: harness the valuable information and profits hidden in your data.

The last three years' offerings of the June DC courses sold out months in advance -- and only a few spaces remain as of today.   Be sure to check the remaining capacity updated daily on the course schedule page and reserve your space early.  If you're not yet ready to formalize your registration, you may submit an unofficial registration without obligation or penalty and reserve your space today while your training request is processed.
 

CHOOSE THE TRAINING THAT'S RIGHT FOR YOU
The Modeling Agency offers three data mining courses with distinct objectives.  The courses are designed to be attended independently, or as a progressive series.  While the three levels are staged as a progression, they should not be viewed simply as "introductory, intermediate and advanced."  Refer to the table below to ensure that your experience, situation and objectives align properly with the intent, scope and depth of each offering:

Course | Focus | Scope | Geared To
Data Mining: Level I | Strategy | An intensive overview of strategy, best practices and case studies | Project leaders, Stakeholders, Functional Managers
Data Mining: Level II | Methods | A tactical drill-down of the data mining process, methods, techniques and resources | Business Analysts, Functional Analysts, IT Professionals
Data Mining: Level III | Application | A hands-on application workshop as an extension to Data Mining: Level II | Practitioners, Model-builders, Decision Support Developers

 
FULL COURSE DETAILS

Detailed course outlines, instructor background, site information, a secure registration form and other course descriptions offered by TMA may be obtained through the links that follow.
 

Since The Modeling Agency is not a tools vendor, participants enjoy a balanced, broad and
non-promotional perspective of predictive analytics at desirable venues throughout the USA.
 
 

DATA MINING: LEVEL I
An Intensive Overview of Strategy, Best Practices
and Case Studies for Predictive Analytics

Course: Detailed Outline
Instructor: Tony Rathburn
Registration: On-Line Form
 
SCHEDULE AND SITE DETAILS 
June 2 & 3, 2008: Washington, DC
September 29 & 30, 2008: San Diego, CA
December 8 & 9, 2008: Las Vegas, NV
Duration and Fee:
2 Days, 1.2 CEUs
$1295 USD
 
Package Price:
$1995 Levels I & II
 
 
 

DATA MINING: LEVEL II
A Tactical Drill-Down of the Data Mining
Process, Methods, Tools and Techniques

Course: Detailed Outline 
Instructor: Dean Abbott
Registration: On-Line Form
 
SCHEDULE AND SITE DETAILS
June 4 & 5, 2008: Washington, DC 
October 1 & 2, 2008: San Diego, CA
December 10 & 11, 2008: Las Vegas, NV  
Duration and Fee:
2 Days, 1.2 CEUs
$1295 USD
 
Package Price:
$1995 Levels I & II
 
 
 

DATA MINING: LEVEL III
A Hands-On Application Workshop
for Data Mining Practitioners

Course: Detailed Outline 
Instructor: Dean Abbott
Registration: On-Line Form
 
SCHEDULE AND SITE DETAILS
June 6, 2008: Washington, DC  
October 3, 2008: San Diego, CA
December 12, 2008: Las Vegas, NV  
Duration and Fee:
1 Day, 0.6 CEUs
$695 USD
 
Package Price:
$595 With Level II
 
 
 

 
Courses May Be Delivered At Your Site

Call (888) 742-2454 or send an email inquiry to receive a value-based
spreadsheet quotation for training at your site.


Government Buyers
TMA is a CCR Registered Veteran-Owned Small Business and accepts EFT.
 

 
 
 

 

2.  FEATURE ARTICLE
 

A Conceptual Foundation for the Formulation
of Business Predictive Analytics Projects

by  Thomas A. "Tony" Rathburn
Senior Consultant, The Modeling Agency

 
PREFACE
 
The author approaches building models using Predictive Analytics as a strategy for playing a ‘game’.  The game is defined by the business unit.  All rules, strategies, constraints, and score keeping are defined by those playing the game.  There is no ‘right’ answer.  There are only decisions and impacts.   

Predictive Analytics is an approach for developing quantitative models, based on historical data, for the purpose of making ‘better’ decisions in a business environment.

As we are dealing with human behavior, there is no perfect model to be developed.  Instead, we are challenged to define groups of people who have a probabilistic expectation of displaying/not-displaying a behavior.

This article explores the development of the major approaches to these types of problems as a foundation for conceptualizing Predictive Analytics problems effectively.  

 
FRAMING BUSINESS PREDICTIVE ANALYTICS AS A GAME
A game is played by two or more participants for the purpose of winning a prize.  That prize can be recognition, property, titles or any other thing of value to the players.  Most typically the prize is money. 

Games are played by a set of rules, within a set of boundaries, and have an established way of keeping score… all mutually agreed upon by the players.

Predictive Analytics is the goal-directed development of mathematical models, based on historical data, to support decision making.  As such, it is especially well suited to the discovery of enhanced strategies for playing the game we call “business”.

This discussion specifically addresses issues related to the formulation of projects involving decision making in a business environment.  None of the insights contained in this discussion are “right”.  Rather, the author describes a best-practices approach for utilizing predictive analytics in a business environment.  This approach can be generalized successfully to virtually all business projects.
 

MATHEMATICAL MODELS AND KEEPING SCORE
A mathematical model consists of a set of variables, associated weights and operators to describe the relationships between the variables.  Mathematical models are developed for the purpose of estimating the value of a variable of interest. 

In and of themselves, mathematical models have few interesting qualities.  It is only when we put them in context that they have the potential to have value.

Does the expression $y = a + b(w_1) - c(w_2)^2$ seem particularly exciting or provocative?  Probably not.  However, if I told you that this was an exceptionally effective model for managing an aspect of your personal finances, and explained the variables involved and the rules for utilizing the model, you would be much more likely to find this bit of math of interest.

It is only in the context of the game that we find mathematical models of interest and useful.  Their advantage comes from providing a reliable way of evaluating a situation.  By adding a set of rules for utilizing the model, we have a strategy for consistently making decisions.

That is not to say that we have a good way of making decisions… only a reliable and consistent way of making decisions.
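As a minimal sketch of the difference between a model and a strategy, the Python below evaluates an expression of the form y = a + b(w1) - c(w2)^2 and wraps it in a usage rule.  The coefficient values, the threshold and the function names are hypothetical, chosen only for illustration; the article does not specify them.

```python
def model(w1, w2, a=1.0, b=2.0, c=0.5):
    """Evaluate y = a + b*w1 - c*w2**2 (coefficients are illustrative assumptions)."""
    return a + b * w1 - c * w2 ** 2


def decide(w1, w2, threshold=0.0):
    """Hypothetical usage rule: act only when the model output exceeds the threshold."""
    return "act" if model(w1, w2) > threshold else "do not act"


# The same inputs always produce the same decision: consistent, not necessarily good.
print(decide(1.0, 2.0))
print(decide(0.0, 3.0))
```

Changing any coefficient, or the threshold in the rule, yields a different strategy with its own level of performance in the game.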

The usefulness of mathematical models must be evaluated in the context of the game we are playing.  Do they successfully fall within the constraints of the rules and boundaries we have defined for playing the game?  And, more importantly, do they provide a strategy for playing the game that gives us a higher level of performance -- or more success -- than we are currently achieving?

Every mathematical model, along with its associated usage rules, can be evaluated in this way.  It is important to note that every time we change the mix of our variables in our models, by either adding or deleting a variable, we have changed the level of performance we will achieve by using the model to play our game.   The same can be said for changing any of the weights or operators in our model, and for changing any rule associated with the implementation of the model.

Our attraction in predictive analytics, therefore, is not to any particular mathematical model, but to the process of searching for a combination of variables, weights, operators, and rules for usage that improves the decision making in the game we are playing.  That improvement is evidenced by an increase in the “score” we achieve by using the model versus using our existing decision making approach. 
 

USES OF MATHEMATICAL MODELS
In their simplest use, we describe a particular set of circumstances by providing a model with a set of values for each of its respective variables and complete the calculations necessary to compute the expected value for our target outcome. 

Unfortunately, this expected value rarely has any inherent value on its own.  Only when we combine a set of rules for the utilization of the model with this outcome do we now have a system capable of assisting us with decision making.

With the development of the model, we now have a general approach to evaluating a wide range of possibilities by varying the value of one or more of the model’s component variables.  By computing a reasonably large number of these scenarios, and connecting the derived points, we can see the visualization of a line in one-dimensional output space, a plane in two-dimensional space, or a hyper-plane in n-dimensional space.

These structures, whether simple or complex, have only two possible uses.  We can use the structure as a boundary between categories in a classification problem, or we can determine our location on the structure itself in a forecasting problem.

In our thinking about model development -- which is nothing more than the processes for determining the component variables, weights and operators -- we should be strongly influenced by how to conceptualize the way we intend to use the output from the model.   Is our intention to do classification, or forecasting?

In general, it is much easier to develop models for classification than it is for forecasting.  This is easily demonstrated simply by considering the precision requirements for each type of problem.

Consider any n-dimensional space that is populated with a pattern of X’s and O’s, and assume that our game is to build a model that defines a boundary between the X-subspace and the O-subspace.  Our score keeping metric is ‘percent correct classification’.

In general, our search is for a way of building that boundary that will improve our “score”.  However, it is important to note that there are a large number of model-generated boundaries that achieve the same score.  Each of these boundaries has its own combination of shape, slope and intercept.

No one of these models is better, or worse, than any of the others so long as it achieves the same long-term score.  The advantage of adopting a classification approach to a problem of interest is this inherent availability of multiple models.  There are no “right” models.  We simply need to find one of the many models that outperforms the model we are currently using.
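The point that many different boundaries can earn the same score can be illustrated with a small sketch.  The data points and the two linear boundaries below are invented for illustration; only the scoring metric, percent correct classification, comes from the discussion above.

```python
# Hypothetical two-dimensional data: each point is (x1, x2, true label).
points = [
    (1.0, 1.0, "O"), (1.5, 2.0, "O"), (2.0, 1.0, "O"), (4.5, 2.0, "O"),
    (4.0, 4.0, "X"), (5.0, 3.5, "X"), (4.5, 5.0, "X"), (2.0, 2.5, "X"),
]


def percent_correct(boundary):
    """Score a boundary: share of points whose predicted label matches the true one."""
    hits = sum(1 for x1, x2, label in points if boundary(x1, x2) == label)
    return 100.0 * hits / len(points)


# Two linear boundaries with different slopes and intercepts.
def boundary_a(x1, x2):
    return "X" if x1 + x2 > 6.0 else "O"


def boundary_b(x1, x2):
    return "X" if 2.0 * x1 + 0.5 * x2 > 7.0 else "O"


print(percent_correct(boundary_a), percent_correct(boundary_b))
```

On this toy data both boundaries misclassify the same two points, so they tie on score even though their weights differ.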

In contrast, if we approach our game with the strategy of generating a forecast, our search for a model with a higher level of performance becomes much more difficult.  In forecasting, our precision requirements are much higher.  We are seeking an accurate value… a location on a continuous plane.  This requires accurate construction of the surface of interest.  Then we must determine where we are on that surface.  This is always a much more difficult approach.  The question becomes whether that level of precision is necessary, or is it sufficient to use a classification approach to the problem.
 

PHYSICAL SYSTEMS VERSUS HUMAN BEHAVIORAL MODELING
The use of mathematics and computers has generated a mindset that leads us to expect precision and explainability in all aspects of our decision making. 

As addressed above, there are only two ways in which a model may be utilized: classification and forecasting.  Likewise, there are only two types of problems to which these models may be applied: physical systems and behavioral systems.

In the development of a model for physical systems, our focus is generally on finding a way to describe the way the system works: the process.  The key characteristic of this type of model development is that there is a right answer.  A physical system works in a particular way.  It may be simple, or very complex, but it is governed by a set of characteristics, laws and drivers that function in a consistent, reliable manner.

Human behavior, on the other hand, is inherently messy.  Individuals are inconsistent and unreliable in the patterns of behavior they display.  Two seemingly identical individuals, based on the values of the variables we have available, may display very different behaviors.  In fact, the same individual, in even slightly different time frames, is likely to display differing behavior patterns.

Recognition of this characteristic has a significant impact on the model development process.  In the development of a physical systems model, we are searching for a “right” answer.  In a behavioral model, the best we can hope to achieve is the development of an accurate probabilistic expectation that a behavior will be displayed by a defined group of people.

It is important to note that in a physical system, we often consider the variables in a correct model to be drivers -- the attributes that describe and control the process.  However, in a behavioral model, no such drivers exist.

There is no causality in the variables contained in a behavioral model.  Rather, the variables in a behavioral model are simply a set of attributes for describing a group.  By considering these attributes, and their relationship to each other, we have defined a group that displays a particular behavior of interest at a measurable rate.

In both the development of our behavioral models, and in their use, it is important to keep in mind that we cannot specifically anticipate the behavior of any specific member of the group.  Rather, we are limited to having an expected probability of seeing the behavior, based on the individual’s status as a member of the group.

Additionally, once a group has been defined for a particular behavior of interest, and the expected probability of the behavior of interest has been determined, our modeling efforts shift to determining a degree of belief as to whether or not an individual is a member of the group.

We may be interested in determining whether or not an individual is part of a group we have called ‘Respondents’.  From our analysis, we have determined that we can describe Respondents based on the relationship between a set of variables, weights and mathematical operators.  For the group described in this way, we have determined that this group displays their behavior of interest, responding, at a rate of 2%.

First, the variables in our model are not the attributes that cause any member of the group to display this behavior.  They are simply a way of describing the group.

Second, once the group is described, our attention shifts to determining whether or not an individual is a member of the group thus described.  It is important to keep in mind that our models are not determining whether or not the person will display the behavior of interest (responding in this case).  Rather, we are determining our belief that the individual is a member of the group.  If we do, in fact, decide that a specific individual is a member of the group, we may only anticipate the behavior of interest being displayed at the expected probability of the group.  In other words, our model is determining whether or not an individual is a member of the group that displays the behavior of responding at a 2% response rate.
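A short arithmetic sketch makes the consequence concrete.  The 2% rate is the article's; the number of individuals assigned to the group is a hypothetical figure.

```python
group_rate = 0.02          # response rate of the 'Respondents' group (from the article)
members_assigned = 10_000  # hypothetical number of individuals assigned to the group

# Membership only tells us the rate at which the behavior is expected to appear.
expected_responders = group_rate * members_assigned
expected_non_responders = members_assigned - expected_responders

print(expected_responders, expected_non_responders)
```

Even when membership is assigned correctly, the large majority of members are expected not to respond, which is why the model cannot be read as a prediction about any one individual.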
 

DISTINGUISHING BETWEEN STATISTICS AND PREDICTIVE ANALYTICS
Much has been said, and written, about the apparently competitive nature of statistics and predictive analytics.  This is unfortunate, since used appropriately, these are highly complementary fields of application.

Statistics and predictive analytics use many of the same techniques.  The distinction between the two fields is best considered not by the techniques utilized, but rather by the purpose for using them.

Statistics tends to focus on the description of a population.  Most often, this description is general.  It describes the central tendency, and a measure of spread.  The most common metrics employed are mean and standard deviation.

Before any work can be effectively completed in predictive analytics, we must understand the general description of the group we are working with.  In fact, our work in predictive analytics cannot come to a successful result without a reasonably accurate estimate of the general behavior of our population.

Predictive analytics is an extension of traditional statistics.  Our work lies in the belief that not all members of our group display their behavior of interest at the same rate.  Our belief is in the existence of definable sub-groups who display the behavior of interest at a rate different from the group as a whole.  And our work is focused on identifying the description of those sub-groups and assigning individuals appropriately.

In our efforts to achieve a higher score in the game we are playing, we typically look for those sub-groups who have the greatest impact on our performance.  This performance impact may be either positive, or negative.  Once successfully identified, these sub-groups will receive different treatment than is administered to the group in general.

While this may seem simplistic on the surface, this is the key strategy to the successful implementation of predictive analytics for behavior modeling in a business environment.

By completing our statistical analysis competently, we have achieved the ability to treat the group as a whole, in a particular way.  By accurately identifying sub-groups, deriving a way to assign at least some individuals to these sub-groups under a particular set of circumstances, and treating the individuals assigned to the sub-groups in a manner different from members of the general group, we have an approach that allows us to vary our game strategy and achieve improved performance.
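The relationship between the general description of the population and its sub-groups can be sketched as follows.  The sub-group names and counts are hypothetical, constructed so that the population as a whole displays the behavior at 2%.

```python
# Hypothetical counts: (sub-group name, individuals, individuals displaying the behavior)
segments = [("A", 2_000, 80), ("B", 6_000, 114), ("C", 2_000, 6)]

total_n = sum(n for _, n, _ in segments)
total_hits = sum(hits for _, _, hits in segments)

# Statistical description of the group as a whole.
overall_rate = total_hits / total_n

# Predictive analytics view: each sub-group displays the behavior at its own rate.
rates = {name: hits / n for name, n, hits in segments}

for name, rate in rates.items():
    print(name, rate, "above" if rate > overall_rate else "below", "the overall rate")
```

Sub-groups A and C sit in opposite tails relative to the overall rate, so each is a candidate for treatment different from that given to the group in general.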

The use of mathematical techniques for this purpose is simply a way of implementing the identification of the sub-groups.  Unfortunately, far too many practitioners approach predictive analytics with the perspective that the techniques themselves are what is critical.
 

EVALUATING PERFORMANCE IN PREDICTIVE ANALYTICS MODELS
From a practical perspective, it is not uncommon to spend close to 50% of your calendar time on a project developing and refining a project definition.  The performance of a predictive analytics model is evaluated on the basis of the user’s particular needs in the decision environment in which the model will be utilized.  There is no “proper” set of performance metrics.

You, as the decision maker -- your group, your organization -- are the only people qualified to determine what your priorities are.

I doubt anyone in a business environment has ever received a raise, bonus or promotion based on R² (the statistical “coefficient of determination”, or overall model accuracy).  In selecting which of your models is most appropriate, it is critical that they be evaluated primarily on the basis of your business objectives.

These metrics are based on enhancing benefits or reducing negative aspects of your process.  Often, they are expressed as increasing profit or reducing expenses.

Your model, and its associated rules for usage, must also take into consideration the assumptions and constraints of both your organization, and the regulatory environment in which you function.

It is critical to understand that these are the rules by which you are playing your game, and how you keep score.  Only those issues that relate to your true objectives should be used in evaluating your models.  All analytic issues are secondary.
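As a hedged illustration of keeping score by business objectives rather than analytic ones, the sketch below compares two candidate models.  Every number in it (the fit statistics, mailing volumes, response rates, revenue per sale and cost per piece) is a made-up assumption, as are the names.

```python
def campaign_profit(mailed, response_rate, revenue_per_sale=40.0, cost_per_piece=1.0):
    """Profit of a mailing; the revenue and cost figures are illustrative assumptions."""
    return mailed * response_rate * revenue_per_sale - mailed * cost_per_piece


# Hypothetical candidates: model_1 fits better, model_2 targets a smaller sub-group.
candidates = [
    {"name": "model_1", "r_squared": 0.62, "mailed": 10_000, "response_rate": 0.021},
    {"name": "model_2", "r_squared": 0.48, "mailed": 2_000, "response_rate": 0.040},
]

best_by_fit = max(candidates, key=lambda m: m["r_squared"])
best_by_profit = max(
    candidates, key=lambda m: campaign_profit(m["mailed"], m["response_rate"])
)

print("best by fit:", best_by_fit["name"])
print("best by profit:", best_by_profit["name"])
```

With these assumed figures the model with the weaker fit statistic is the one that wins the game as the business keeps score.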

It is also critical to understand that these rules and metrics must be specified in your project definition, prior to the beginning of your model development effort.  All work subsequent to your project definition is done for the purpose of enhancing your strategy for achieving higher scores within the constraints of the rules laid out at the inception of the game.

Failure to do this is the most common reason predictive analytics projects fail.  With today’s advanced modeling tools, practitioners develop very good models that either can’t be implemented in their business environment, or don’t perform well in the real world.

Take the time to develop your project definition in detail.  It should include your performance metrics, assumptions and constraints in which the model will operate in its live environment, current baseline levels of performance, and a listing of the resources and skill sets required to build, implement and use the model.

This is your blueprint for the work to be completed.  Just as we wouldn’t consider breaking ground on the construction of a new building without a set of architectural plans, you shouldn’t begin your predictive analytics project without a project definition that lays out in detail what it is you are doing, how you will do it, and what you are attempting to achieve.

It is just as important that everyone who will be impacted (the project sponsor, the final users, the functional area in which the decision makers work, and IT) is in agreement with the plan before it commences.  To do otherwise virtually guarantees that you will have significant issues as to the viability of the project at some point in the future, regardless of the resulting model’s accuracy.
 

THE ROLE OF MATHEMATICAL TECHNIQUES IN PREDICTIVE ANALYTICS 
There are no mathematical techniques that are better, or worse, than others in general.  Our mathematical tools are simply algorithms for determining the variables, weights and mathematical operators that comprise our model.

That is not to say that different model development techniques do not have their own characteristics, their own appropriate and inappropriate uses, and their own strengths and weaknesses in particular uses.  Just as a hammer, a saw, and a screwdriver are all useful tools with attributes and particular applications in physical construction, our mathematical toolbox is comprised of tools with attributes and particular uses in model construction.

Linear regression is a commonly known tool that assumes normally distributed data, linear relationships, stable means and standard deviations, and orthogonal inputs, and is best used for forecasting problems.  There is nothing inherently wrong with linear regression, any more than there is something inherently wrong with a hammer.  As with a hammer, and contrary to common usage, linear regression is not the best tool for every job.

Virtually none of our real-world projects fully meet the assumptions and constraints of linear regression.  In business projects, our data is almost never normally distributed.  Many of our relationships have a non-linear component.  Behaviors change over time.  Therefore, our means and standard deviations are not stable and adjust with the change in behavior.  While we can develop orthogonal input variables in our models, doing so requires that we disregard additional variables with additional information content that may allow us to build models that perform better on our problem.  And almost all of our problems can be constructed as classification problems rather than forecasting problems.

This does not make linear regression a bad tool: it’s a highly efficient method in the right applications.  It simply means that it does not match well with most real-world problems that we encounter in the business environment.  It also means that there may be other tools and techniques that are better suited to our needs.

Logistic regression is well suited to use in classification problems, but is not particularly effective in forecasting.  Neural networks are suitable in both linear and non-linear solutions, and make no assumptions about the distribution of the data -- but are difficult to use, computationally intensive and difficult to explain.

Just as a carpenter selects from the tools available based on the particular task at hand, it is important for the modeler to have a variety of tools available and to know when and how to use each.
 

THE ROLE OF DATA IN PREDICTIVE ANALYTICS
Data are our raw materials for the construction of our models.  Let’s assume that we have already developed a comprehensive project definition.  We then have a conceptual understanding of exactly what we are trying to achieve.  We have a well defined set of rules for playing our game.  We have a set of tools for manipulating our data to find relationships to allow us to define sub-groups who display a behavior of interest at a rate different from the population as a whole.  Our purpose is to use our model in a decision environment to allocate our resources more effectively, so that we increase performance based on the way we keep score.

Typically, it is appropriate to budget 75% to 80% of your time on a modeling project to data-related activities.  This work is comprised of a number of tasks including collection of variables with potential information content, cleaning your data, dealing with missing variables, selecting variables, and determining appropriate data representations and transformation schemes to maximize the extraction of information content.

It is beyond the scope of this article to address these areas in detail.  However, it is important to note, in the conceptualization of our model development plan, that performance enhancements come from extracting the information content from our data… not from using some new exotic tool or technique.

 
BASIC RESPONSE MODEL
As previously discussed, most business problems can be formulated as a classification problem where we have calculated the general propensity for a group to display a behavior of interest.  This is, far and away, the easiest practical approach for the construction of predictive analytics projects.

Your application area may be respondents to a marketing campaign, attrition modeling, fraud detection, risk modeling, credit analysis, or any other behavior.  On a basic level, we are concerned with identifying sub-groups of the population who display the behavior at a different rate than appears in the population as a whole, and treating them in such a way to improve the performance as measured by our business objectives.

For our purposes, we will refer to these models as Response Models.  That is, individuals who comprise these sub-groups respond to a particular set of circumstances by displaying a behavior of interest at a defined rate, stated as an expected probability.

 
ONE-TAIL SOLUTIONS
In practice, the easiest way to identify a sub-group of interest is to build a model that measures a relative propensity to display the behavior of interest among the individuals in our population.  We can then determine a reliable boundary for a sub-group that displays the behavior at a rate significantly different from the central tendency.  This allows us to classify future individuals as a member or non-member of the sub-group.

Marketing problems are a commonly used example of this type of solution.  We have determined that our population of potential customers responds to a direct mail campaign at a 2% response rate.  That is, they display our behavior of interest, purchasing, with an expected probability of 0.02. 

For the purpose of our example, we will assume that all purchases are of equivalent value, and that we are concerned only with individual mailings.  Our sole metric of performance is response rate.

Our modeling efforts are comprised of finding a mathematical model that will allow us to more effectively target which prospective customers to mail.

Our approach will consist of analyzing our historical data to develop a way of scoring the individuals in our data set based on their propensity to display the behavior of purchasing.

Our output variable is binary in form, where a 1 represents purchasers and a 0 represents non-purchasers.  This is a classification problem applied to human behavior.

Every combination of variables, weights and mathematical operators is a different model.  Each model represents a different way of scoring our individuals.  As such, each model derives its own level of performance based on our defined metric of response rate.

It is apparent that our modeling effort is geared toward finding a model that discriminates between purchasers and non-purchasers in such a way that our identified sub-group displays the behavior of purchasing at a rate significantly higher than the population’s general tendency of 2%.

Our efforts are comprised of two parts.  First building a model that consistently and reliably ranks our individuals based on their propensity to purchase.  And second, determining the cut-off score that acts as a boundary between the two groups.

Done well, this simple response model identifies a sub-group that displays the behavior of purchasing at a much higher rate than the group as a whole.

It is worth noting, that in most practical applications, the analysis of determining the cut-off score that acts as a boundary for membership in the group has a much more profound impact on performance than the raw ranking model.
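The two parts of the effort, ranking and then choosing a cut-off, can be sketched as below.  The scores and purchase flags are invented for illustration, and the scoring model itself is assumed to exist already; the sketch only shows how the response rate of the selected sub-group moves with the cut-off.

```python
# Hypothetical scored history: (model score between 0 and 1, purchased flag)
scored = [
    (0.91, 1), (0.85, 0), (0.80, 1), (0.62, 0), (0.55, 0),
    (0.47, 1), (0.40, 0), (0.33, 0), (0.21, 0), (0.10, 0),
]


def response_rate_at(cutoff):
    """Response rate among individuals whose score is at or above the cut-off."""
    selected = [bought for score, bought in scored if score >= cutoff]
    return sum(selected) / len(selected) if selected else 0.0


# Lowering the cut-off admits more individuals and changes the sub-group's rate.
for cutoff in (0.8, 0.5, 0.0):
    print(cutoff, response_rate_at(cutoff))
```

A cut-off of 0.0 selects everyone and reproduces the population rate, so any gain comes from where the boundary is placed on the ranking.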

Our business strategy is then based on allocating resources in such a way that we contact only those individuals who fall into the sub-group displaying the higher expected probability of purchasing.

While this example was based on marketing, the approach can be easily generalized into any type of behavior.  Additionally, viewed this way, a large variety of problems can now be conceptualized as identification of the sub-group, determination of boundaries, and appropriate allocation of resources to enhance performance.  Whether the impact of the variance in behavior is positive or negative is accounted for in the way resources are allocated.

 
TWO-TAILED SOLUTIONS
While a one-tailed solution often generates significant enhancements to performance, it is often incomplete.  Considering both tails of the distribution of propensity to display the behavior of interest generally provides an additional incremental improvement in performance.

From the above One-Tail Solution, let’s assume that we identified a sub-group that purchases at a 4% response rate.  For purposes of discussion, let’s assume that the sub-group consisted of the top two deciles of the individuals scored.

In a marketing example, based on new customers, we’d simply buy five times as many names, score the entire list, select those scoring in the top two deciles for mailing, and benefit from the enhanced selection technique.
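The arithmetic behind this can be sketched as follows.  The mailing budget of 10,000 pieces is a hypothetical figure; the 2% and 4% response rates and the top-two-deciles selection come from the example above.  The cost of buying the extra names is not modeled.

```python
def expected_buyers(pieces_mailed: int, response_rate_pct: int) -> int:
    """Expected purchasers for a mailing at a whole-number percentage rate."""
    return pieces_mailed * response_rate_pct // 100


mail_budget = 10_000  # hypothetical number of pieces we can afford to mail

# Baseline: mail the whole purchased list at the 2% population rate.
baseline = expected_buyers(mail_budget, 2)

# Targeted: buy five times as many names, score them all,
# and mail only the top two deciles (20% of the list).
names_purchased = mail_budget * 5
selected = names_purchased * 2 // 10
targeted = expected_buyers(selected, 4)

print(selected, baseline, targeted)  # 10000 200 400
```

The mailing volume is unchanged, but the expected number of purchasers doubles because every piece goes to the higher-propensity sub-group.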

If we modify our problem slightly, and consider a marketing campaign to existing customers, our one-tailed example isn’t completely practical.  It may not be possible to collect five times as many individuals from a finite pool.

In such a case, we may still use the same approach to ranking our existing customers and setting our boundary for those individuals with a significantly higher propensity to purchase.  We want to ensure that we allocate resources appropriately to achieve the enhanced benefits these individuals provide us.

We cannot afford to simply ignore the remaining individuals in our existing customer list.  Our goal becomes determining how to allocate our remaining resources most effectively.

A simple approach would be to move our boundary condition lower and lower on the ranking scale, contacting additional customers, until we have completely allocated available resources.
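A minimal sketch of this simple approach, assuming each customer already has a score and that resources are expressed as a number of contacts we can afford.  The function name, customer identifiers and scores are hypothetical.

```python
from typing import Dict, List, Tuple


def fill_to_capacity(scores: Dict[str, float],
                     capacity: int) -> Tuple[List[str], float]:
    """Contact customers in descending score order until capacity is used.

    Returns the contacted customer ids and the score of the last one
    contacted, which is the boundary this approach implies.
    """
    ranked = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    contacted = ranked[:capacity]
    implied_cutoff = scores[contacted[-1]] if contacted else float("inf")
    return contacted, implied_cutoff


# Hypothetical existing-customer scores
customer_scores = {"a": 0.91, "b": 0.15, "c": 0.62, "d": 0.48, "e": 0.77}

contacted, implied_cutoff = fill_to_capacity(customer_scores, capacity=3)
print(contacted, implied_cutoff)  # ['a', 'e', 'c'] 0.62
```

Note that the boundary here is set by the budget rather than by any analysis of where the model is reliable, so it can drift toward the middle of the ranking.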

While this approach is practical, and may lead to results that out-perform current methods, the practical reality of predictive analytics is that the best results occur in the tails of the distribution.  That is, the more you approach the central tendency, the less reliable your results are likely to be.

How, then, can we achieve the benefits of predictive analytics by focusing our attention in the tails of the distribution?  Generally, the easiest approach is to invert our initial logic.  In this case, we can identify a sub-group that is highly unlikely to purchase, and ensure that we do not allocate resources to those individuals.

This is simply the development of a separate one-tailed model, where the sub-group we are identifying as a 1 is non-purchasers.  Again, we complete our analysis by setting appropriate boundary conditions.

It is important to note that in our one-tailed solutions, we are scoring our individuals on a scale of zero to one, where a one is a strong likelihood of being a member of the set of interest.  The most common misunderstanding of this scoring is considering a zero to be a non-member of the group.  This is an inappropriate conclusion.

Remember, in human behavior modeling, we make no assumptions about causality in the variables in our model.  Our variables are simply one of many ways to describe a sub-group.  What is important is that the expected probability of the sub-group displaying the behavior is significantly different than for the group as a whole.

Individuals with a score close to zero display a low expectation of being a member of the sub-group, as we’ve defined it.  They do not necessarily display a low expected probability of displaying the behavior.  They simply are not included in this sub-group.

The implication of this is that, if we want to find a sub-group that has a significantly lower than normal propensity to display the behavior of interest, we must develop, test and implement a model designed to capture that behavior of interest.

This approach captures both tails of the distribution: the tail consisting of a sub-group that displays the behavior at a higher than normal rate, and the inverse tail that displays it at a significantly lower than normal rate.  It generally results in an improvement in performance greater than that achieved by considering only one tail of the behavior distribution.
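The two-tailed allocation described in this section can be sketched as below.  It assumes two separately developed one-tailed models, one scoring likely purchasers and one scoring likely non-purchasers, each with its own cut-off.  The function name, cut-off values and customer data are hypothetical, and sending conflicting cases to the middle group is a design choice of this sketch rather than part of the method.

```python
from typing import Dict, Tuple


def allocate(scores: Dict[str, Tuple[float, float]],
             buy_cutoff: float,
             no_buy_cutoff: float) -> Dict[str, str]:
    """Map each customer id to 'contact', 'suppress' or 'middle'.

    scores[cid] = (buy_score, no_buy_score), each on a zero-to-one scale
    and each produced by its own one-tailed model.
    """
    decisions: Dict[str, str] = {}
    for cid, (buy_score, no_buy_score) in scores.items():
        likely_buyer = buy_score >= buy_cutoff
        likely_non_buyer = no_buy_score >= no_buy_cutoff
        if likely_buyer and not likely_non_buyer:
            decisions[cid] = "contact"
        elif likely_non_buyer and not likely_buyer:
            decisions[cid] = "suppress"
        else:
            # In neither tail, or flagged by both models: no strong signal.
            decisions[cid] = "middle"
    return decisions


# Hypothetical customers: (purchaser-model score, non-purchaser-model score)
customers = {
    "a": (0.92, 0.05),
    "b": (0.10, 0.88),
    "c": (0.08, 0.20),
    "d": (0.55, 0.40),
}

decisions = allocate(customers, buy_cutoff=0.80, no_buy_cutoff=0.80)
print(decisions)
```

Customer "c" has a purchaser-model score near zero but is not suppressed, because a low score on that model only means the customer falls outside the purchaser sub-group; only the non-purchaser model can place someone in the low-propensity tail.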


CONCLUSION
The conceptualization of business problems as a way of playing a game, using well defined rules and methods of keeping score, is especially well suited to the utilization of predictive analytics.

The approach of treating our decision processes as a sorting mechanism, creating groups and sub-groups for a particular purpose, assigning individuals to group membership, and allocating resources to the groups in an appropriate manner, is highly generalizable to many business scenarios.  It is also consistent with the attributes of human behavior we are attempting to anticipate.

The ranking of sub-groups, based on their expected probability to display a behavior with business impact, allows managers to allocate resources in a way that has a controllable impact on performance and is customized to the business decision maker’s priorities.

Predictive analytics is not magic.  It is not based on rocket science, and not necessarily based on extremely complex mathematical concepts.  It is based on a different way of thinking about problems, knowing clearly what you want to achieve, and manipulating your data to discover a strategy for achieving enhanced performance.
 

AUTHOR BIOGRAPHY
Thomas A. “Tony” Rathburn is a senior consultant with The Modeling Agency. Tony has worked with commercial and government clients to develop data mining solutions for significant business applications since the mid-1980s.  Mr. Rathburn delivers custom workshops and consults on a wide range of commercial assignments -- many involving CRM applications.  He is the primary instructor of “Data Mining: Level I,” a vendor-neutral and best-practices approach to data mining.  He holds extensive data mining experience in the banking, insurance, and financial industries.   Tony may be reached at tony@the-modeling-agency.com

 
 

 

3.  ANNOUNCEMENT
 

 DM RADIO NEWS

Putting the Context Around Text Mining
TMA's Dean Abbott Contributes

Text mining can yield significant insight, but how does it work?

Tune in to DM Radio to hear the April 16th broadcast from several industry leaders.  DM Radio hosts Eric Kavanagh and Jim Ericson will interview: Barry DeVille of SAS Institute, Jeff Catlin of Lexalytics, and Dean Abbott of The Modeling Agency.

The segment covered a range of issues, from best practices to embrace and pitfalls to avoid.  The live event has been archived.

 
Published with permission from DM Review. Copyright © 2008.

 


4.  ANNOUNCEMENT

 

 TDWI WORLD CONFERENCE

Sheraton Chicago Hotel and Towers
May 11 - 16, 2008

 

THE PREMIER EVENT FOR BUSINESS INTELLIGENCE AND DATA WAREHOUSING EDUCATION
Join leading visionaries and practitioners as they convene at the premier educational event in the business intelligence and data warehousing industry this spring in Chicago, Illinois. Take advantage of more than 50 full-day, half-day, and evening classes taught by the experts, one-on-one guru sessions, peer networking, and a hassle-free exhibit hall!  Register Now and Save!
 

KEYNOTE SPEAKERS
David Wells – People First! Creating a Business Intelligence Culture
Wayne Eckerson – The Myth of Self-Service BI

 
CONFERENCE BENEFITS

  • Gain practical knowledge that you can apply immediately

  • Interact with the most knowledgeable and experienced instructors in the industry

  • Get product information with a minimum of hype and hassle

  • Network and share best practices with your peers

  
For more information or to download a complete copy of the brochure.  Register before April 11, and receive the early registration discount!  Priority Code: IN03

By entering priority code IN03, you also will be entered to win $200 in American Express gift checks.

   


 
5.  NEWSLETTER SUMMARY
 

The Modeling Agency newsletter is a quarterly publication which provides course announcements, training schedule updates and informative articles.  This newsletter may be shared in its entirety and subscriptions are free. For additional information on TMA's training, consulting services and solutions, follow corresponding links at the top of this page.

This newsletter is shared with those who have activated a subscription, or have supplied their email address to The Modeling Agency when requesting product information. If you wish not to receive future releases, simply send an empty email with "cancel" as the subject from the account with which you subscribed.

    address
One Oxford Centre
301 Grant St, Ste 4300
Pittsburgh, PA 15219 USA
 
phone: 281.667.4200
fax: 281.652.5721
training: 888.742.2454
Copyright © 2000 - 2008 The Modeling Agency. All rights reserved.