direct
281.667.4200

training
888.742.2454

fax
281.652.5721

 
The Modeling Agency Quarterly Newsletter
2007-Q4 Release
 

[ December 17, 2007  |  This Edition: ]

1.  Training Schedule Update:  Learn how experts mine data, and why building an internal predictive modeling practice is within your grasp.  Next up: Orlando, January 28 to February 1, 2008, and Las Vegas, April 7 to 11, 2008

2.  Feature Article:  "Taking the Temperature of Your Data" by Thomas A. "Tony" Rathburn, Senior Consultant, The Modeling Agency

3.  Announcement:  B-Eye-Network launches new Expert Channel for "Data Mining and Predictive Analytics" featured by The Modeling Agency's president, Eric A. King

4.  Announcement:  TDWI World Conference in Las Vegas, Nevada, February 17 to 22, 2008

5.  Newsletter Summary

 
 

1.  TRAINING SCHEDULE UPDATE 

 

  
LEARN HOW EXPERTS MINE DATA IN ORLANDO OR LAS VEGAS

The next offering of The Modeling Agency's vendor-neutral, application-oriented data mining courses is scheduled for January 28 to February 1 in Orlando and April 7 to 11 in Las Vegas.  Participants will enjoy a balanced, broad and non-promotional presentation of predictive modeling without restriction to a particular tool, method or product.

 

Attendees will learn about data mining capabilities, limitations, best practices, strategies, methods, tools, techniques and applications while enjoying all the entertainment and seasonal weather that Orlando and Las Vegas have to offer.  Those in attendance will leave with a comprehensive binder of notes, illustrations and references to valuable resources.  Don't leave a powerful competitive advantage untapped: harness the valuable information and profits hidden in your data. 

Previous offerings sold out months in advance and the Orlando offering is limited to just 18 seats.  Be sure to reserve your space early.  A current status of remaining space may be viewed at TMA's training schedule page.   If you're not yet ready to formalize your registration, you may submit an unofficial registration without obligation or penalty and reserve your seat today while your training request is processed.
 

CHOOSE THE TRAINING THAT'S RIGHT FOR YOU
The Modeling Agency offers three data mining courses with distinct objectives.  The courses are designed to be attended independently, or as a progressive series.  While the three levels are staged as a progression, they should not be viewed simply as "introductory, intermediate and advanced."  Refer to the table below to ensure that your experience, situation and objectives align properly with the intent, scope and depth of each offering:

Course | Focus | Scope | Geared To
Data Mining: Level I | Strategy | An intensive overview of strategy, best practices and case studies | Project leaders, Stakeholders, Functional Managers
Data Mining: Level II | Methods | A tactical drill-down of the data mining process, methods, techniques and resources | Business Analysts, Functional Analysts, IT Professionals
Data Mining: Level III | Application | A hands-on application workshop as an extension to Data Mining: Level II | Practitioners, Model-builders, Decision Support Developers
 
 
FULL COURSE DETAILS

The featured course schedule for this section is outdated.  For current course dates, locations, pricing and detailed outlines, please visit the main training page.

web
http://www.the-modeling-agency.com/training

email
training@the-modeling-agency.com

phone
888-742-2454 (toll free)
281-667-4200 (direct)
281-652-5721 (fax)
 

 

 
Courses May Be Delivered At Your Site

Call (888) 742-2454 or send an email inquiry to receive a value-based
spreadsheet quotation for training at your site.


Government Buyers
TMA is a CCR Registered Veteran-Owned Small Business and accepts EFT.
 

 
 
 

 

2.  FEATURE ARTICLE
 

Taking the Temperature of Your Data

by
Thomas A. "Tony" Rathburn
Senior Consultant

The Modeling Agency

 
INTRODUCTION
There is a large variety of quantitative techniques available to assist in the development of mathematical models.  But the seasoned practitioner understands that they all do basically the same thing: they help us search for a set of variables, weights and operators in the form of an equation.  When that equation is applied to a set of decision data, it enhances the performance of our decision making.

The algorithms behind our model development effort are seeking those variables that have information content relative to the goals we have defined.  Our data, and the information content it contains, is the source of enhanced performance.

Successful practitioners typically spend 75% to 80% of their overall modeling effort preparing data.  These efforts deal with issues such as understanding the context of the available data fields, handling of missing data, identifying and correcting data errors, identification and representation of interaction effects between variables, mathematical transformation of data to obtain different perspectives on the information content, and data representation schemes appropriate for the type of data being utilized.

Practitioners new to predictive analytics often overlook this last issue.  The physical representation of the data in their data set can have a significant impact on the information content presented to the modeling technique.  This article presents a brief discussion comparing two approaches: a common data representation, and an enhanced approach for certain types of data.
 

DATA TYPES
Just as quantitative techniques have strengths and weaknesses, so does our data.  When considering the context of our data, it is also important to understand the mathematical capabilities of our data.  It is trivial to point out that the mean and standard deviation of a variable such as zip code are meaningless at best.  However, many practitioners overlook more serious considerations and miss important data representation issues as a result.

Each variable in your data set should be clearly identified as being either quantitative or qualitative in nature.  The characteristic of importance here is ‘order’.  There is no inherent order in a qualitative variable.  Quantitative variables, on the other hand, have an underlying order.  It is beyond the scope of this article to consider the types of mathematics that are appropriate for the various types of quantitative variables (ordinal, interval and continuous).  Rather, we will focus on the implications of the characteristic of ‘order’, and data representation schemes that are of use to enhance the extraction of information content.
 

QUALITATIVE VARIABLES
A qualitative variable is typically a variable that describes a set of categories.  The variable will have two or more values, each representing a category meeting a particular set of conditions.  An example of a qualitative variable is marital_status.

For this discussion, let’s assume that marital_status has the following values:

Marital Status

Married

Single

Divorced

Widowed

Separated

Other

The values of the variable marital_status have no relative order.  We can easily rearrange them in any other order with no impact on the information content.

However, from a predictive analytics perspective, we still have many questions that need to be addressed for a field of this type.

  • Are the values exhaustive?  Have they captured all possible circumstances?  The value ‘other’ takes care of this for us.

  • Are the values mutually exclusive?  The fact that this is an individual’s current status, rather than any value that may ever have applied, suggests exclusivity.  However, the values as stated treat ‘Separated’ as mutually exclusive from ‘Married’.  Is that the context that is truly desired?

  • Should this variable be represented in our modeling data as one variable with six values, or are there other alternatives that should be considered?

  • Do we need six values for the variable marital_status?

 
COLLAPSING VALUES
For the variable marital_status, we have identified six values.  Is this the appropriate number of categories?  It is important to understand that there is no generally “right” answer to this question.  The answer is always contingent on the context of usage.  For some decision environments, six values will be the most appropriate representation.  For others, fewer categories may serve better:

  • Is it sufficient to use only the values ‘Married’ and ‘Other’?

  • Do we need to combine ‘Divorced’ and ‘Widowed’?  Is it sufficient to know only that the individual was at one time married, but no longer is?

  • Should we combine ‘Single’, ‘Divorced’ and ‘Widowed’?  Is it sufficient to know that the person is not currently married?

 
These are empirical questions.  They can only be answered in the context of the particular decision environment we are exploring.  How many values to use, and how to collapse the values, are best answered by testing each of the combinations and measuring the impact that the representation has on performance.
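As a rough illustration of the collapsing alternatives raised above, the sketch below applies three candidate schemes to a small, made-up list of marital_status values.  The sample records, the scheme names and the collapsed labels (‘Formerly Married’, ‘Not Married’) are assumptions introduced for this example only.  It shows how candidate representations might be generated; measuring their impact on performance still has to happen in your own decision environment.

```python
from collections import Counter

# Hypothetical sample of marital_status values (illustrative data only).
records = ["Married", "Single", "Divorced", "Widowed", "Separated", "Other", "Married"]

# Each scheme is (mapping, default). A default of None means values that are
# not in the mapping pass through unchanged. Label names are assumptions.
SCHEMES = {
    "married_vs_other": ({"Married": "Married"}, "Other"),
    "formerly_married": (
        {"Divorced": "Formerly Married", "Widowed": "Formerly Married"},
        None,
    ),
    "not_married": (
        {"Single": "Not Married", "Divorced": "Not Married", "Widowed": "Not Married"},
        None,
    ),
}


def collapse(value, mapping, default=None):
    """Return the collapsed category for a single value."""
    if value in mapping:
        return mapping[value]
    return value if default is None else default


# Build one candidate column per scheme so each can be tested in a model.
collapsed = {
    name: [collapse(value, mapping, default) for value in records]
    for name, (mapping, default) in SCHEMES.items()
}

for name, values in collapsed.items():
    print(name, dict(Counter(values)))
```

Each resulting column is a candidate representation to be compared against the original six-value field using whatever performance metric the project has defined.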
 

DATA REPRESENTATION ALTERNATIVES
We must also consider the impact of different data representation schemes.  In this case there are two alternatives:

  • A single variable with six values, as above, and

  • Six variables, one for each of the values, using binary representation for each: commonly referred to as a 1 of N representation.
     

 

Value | Married | Single | Divorced | Widowed | Separated | Other
Married | 1 | 0 | 0 | 0 | 0 | 0
Single | 0 | 1 | 0 | 0 | 0 | 0
Divorced | 0 | 0 | 1 | 0 | 0 | 0
Widowed | 0 | 0 | 0 | 1 | 0 | 0
Separated | 0 | 0 | 0 | 0 | 1 | 0
Other | 0 | 0 | 0 | 0 | 0 | 1

The 1 of N representation allows for more flexibility.  Some of our modeling techniques may identify relationships differently than others.  Some may focus only on one of the values.  Others may use more than one, but not all of the values.  Still others may use all six values.  This inherent flexibility makes the 1 of N representation appropriate for virtually all qualitative variables.
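A minimal sketch of the 1 of N representation for marital_status follows, written in plain Python.  The function name one_of_n is an assumption made for this example; most modeling tools offer an equivalent feature under names such as "one-hot" or "dummy" encoding.

```python
# The six category values from the marital_status example.
MARITAL_STATUS = ["Married", "Single", "Divorced", "Widowed", "Separated", "Other"]


def one_of_n(value, categories):
    """Return a list with a 1 in the position of `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


# Print one row per value: each row contains exactly one 1.
for status in MARITAL_STATUS:
    print(f"{status:<10}", one_of_n(status, MARITAL_STATUS))
```

Because every value becomes its own binary variable, a modeling technique is free to use one, several or all of the six columns.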
 

QUANTITATIVE DATA
Let’s explore another example: Education_Level.

Education_Level

< High School

High School

Some College

Bachelor’s Degree

> Bachelor’s Degree

Education_Level is an example of quantitative data.  While it isn’t represented by numeric values, ‘order’ is a significant characteristic.  This is, in fact, an ordinal variable.  It would be inappropriate to perform arithmetic on it, even if the data were represented numerically, since the intervals between the values are inconsistent.

Just as we considered collapsing the values in the variable Marital_Status, above, the same considerations apply here.  The number of values appropriate for Education_Level is determined purely by empirical testing in the decision environment in which we are working.

The data representation issues are also similar.  We can obtain a number of advantages by using a 1 of N representation for Education_Level.

 

Value | < High School | High School | Some College | Bachelor’s Degree | > Bachelor’s Degree
< High School | 1 | 0 | 0 | 0 | 0
High School | 0 | 1 | 0 | 0 | 0
Some College | 0 | 0 | 1 | 0 | 0
Bachelor’s Degree | 0 | 0 | 0 | 1 | 0
> Bachelor’s Degree | 0 | 0 | 0 | 0 | 1

While this 1 of N representation allows for the flexibility advantages discussed above, it does not capture the ‘order’ characteristics of the variable Education_Level.  If this representation were used as an output variable, for instance, your answers would either be correct or incorrect.  You would be unable to assess the degree of incorrectness, as the data representation scheme does not capture that information.
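The loss of ‘order’ can be seen by counting how many positions differ between two encoded values.  The sketch below is an illustration only: using a count of differing positions (a Hamming distance) as the measure of incorrectness is an assumption made for this example, not a metric prescribed by the article.

```python
# Ordered values of Education_Level, lowest to highest.
EDUCATION_LEVEL = [
    "< High School",
    "High School",
    "Some College",
    "Bachelor's Degree",
    "> Bachelor's Degree",
]


def one_of_n(value, categories):
    """Return a list with a 1 in the position of `value` and 0 everywhere else."""
    return [1 if category == value else 0 for category in categories]


def hamming(a, b):
    """Count the positions at which two equal-length encodings differ."""
    return sum(x != y for x, y in zip(a, b))


actual = one_of_n("< High School", EDUCATION_LEVEL)
near_miss = hamming(actual, one_of_n("High School", EDUCATION_LEVEL))
far_miss = hamming(actual, one_of_n("> Bachelor's Degree", EDUCATION_LEVEL))

# A miss by one level and a miss by four levels look identical.
print(near_miss, far_miss)  # 2 2
```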

On the other hand, consider a different representation scheme: a Thermometer Representation.

 

Value | < High School | High School | Some College | Bachelor’s Degree | > Bachelor’s Degree
< High School | 1 | 0 | 0 | 0 | 0
High School | 1 | 1 | 0 | 0 | 0
Some College | 1 | 1 | 1 | 0 | 0
Bachelor’s Degree | 1 | 1 | 1 | 1 | 0
> Bachelor’s Degree | 1 | 1 | 1 | 1 | 1

The logic of a Thermometer Representation is very straightforward.  An individual in the category High_School has all of the attributes of someone in the category <High_School... plus something else.  An individual in the category Some_College has all of the attributes of someone in the category High_School... plus something else.  And so on.

The Thermometer Representation allows us to capture ‘order’ in our values and, as a result, allows us to consider degree of incorrectness. 
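A minimal sketch of a Thermometer Representation in plain Python follows.  The function names and the use of a count of differing positions as the "degree of incorrectness" are assumptions made for this illustration.

```python
# Ordered values of Education_Level, lowest to highest.
EDUCATION_LEVEL = [
    "< High School",
    "High School",
    "Some College",
    "Bachelor's Degree",
    "> Bachelor's Degree",
]


def thermometer(value, ordered_levels):
    """Set a 1 for the value's own level and for every level below it."""
    rank = ordered_levels.index(value)  # raises ValueError for unknown values
    return [1 if position <= rank else 0 for position in range(len(ordered_levels))]


def hamming(a, b):
    """Count the positions at which two equal-length encodings differ."""
    return sum(x != y for x, y in zip(a, b))


for level in EDUCATION_LEVEL:
    print(f"{level:<20}", thermometer(level, EDUCATION_LEVEL))

actual = thermometer("< High School", EDUCATION_LEVEL)
near_miss = hamming(actual, thermometer("High School", EDUCATION_LEVEL))
far_miss = hamming(actual, thermometer("> Bachelor's Degree", EDUCATION_LEVEL))

# The distance now grows with the number of levels between the two values.
print(near_miss, far_miss)  # 1 4
```

Under this encoding, the number of differing positions equals the number of levels separating two values, which is the ‘order’ information that a 1 of N representation discards.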

While it would be physically possible to use a Thermometer Representation on the Marital_Status variable, discussed above, it would not make sense to do so.  A qualitative variable has no ‘order’.  On the other hand, restricting our data representation method for a quantitative variable to a 1 of N representation misses an important characteristic of the information content available.

It is worth noting that a Thermometer Representation also allows us to control the direction of error.  In the representation above, the logic reinforces the building of levels.  As a result, this representation scheme will have a tendency to underestimate the value.

Is this what we want?  Again, it depends.  If we are in a decision environment where we would prefer to have overestimation when we are incorrect, we simply need to invert the Thermometer Representation to achieve that result.

 

Value | < High School | High School | Some College | Bachelor’s Degree | > Bachelor’s Degree
< High School | 1 | 1 | 1 | 1 | 1
High School | 0 | 1 | 1 | 1 | 1
Some College | 0 | 0 | 1 | 1 | 1
Bachelor’s Degree | 0 | 0 | 0 | 1 | 1
> Bachelor’s Degree | 0 | 0 | 0 | 0 | 1
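The inverted scheme can be sketched the same way.  The function name inverted_thermometer is an assumption made for this example.  The code only reproduces the encoding shown above; whether it shifts errors toward overestimation in a given model is something to confirm by testing.

```python
# Ordered values of Education_Level, lowest to highest.
EDUCATION_LEVEL = [
    "< High School",
    "High School",
    "Some College",
    "Bachelor's Degree",
    "> Bachelor's Degree",
]


def inverted_thermometer(value, ordered_levels):
    """Set a 1 for the value's own level and for every level above it."""
    rank = ordered_levels.index(value)  # raises ValueError for unknown values
    return [1 if position >= rank else 0 for position in range(len(ordered_levels))]


for level in EDUCATION_LEVEL:
    print(f"{level:<20}", inverted_thermometer(level, EDUCATION_LEVEL))
```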

 
CONCLUSION
Take the time to carefully consider the attributes of your data fields. Creatively match data representation schemes with the characteristics of the variable in use.  This effort can have a dramatic impact on the performance of your models.   

Enhanced model performance comes from extracting as much information content as possible… relative to the specific performance metrics you are using to measure success.

 
ABOUT THE AUTHOR
THOMAS A. "TONY" RATHBURN is a Senior Consultant with The Modeling Agency.  Tony has worked with commercial and government clients to develop data mining solutions to significant business applications since the mid-1980s.  Mr. Rathburn delivers custom workshops, keynote presentations and consults on a wide range of commercial assignments -- many involving predictive CRM analytics.  He has extensive data mining experience in the banking, insurance, and financial industries.

Mr. Rathburn’s experience includes seven years teaching MIS and Statistics at both the graduate and undergraduate level as an instructor in the College of Business at Kent State University.  In addition to his teaching background, Tony has a broad range of practical experience.  His consulting expertise has been concentrated in the business utilization of advanced knowledge discovery techniques.  He served as Vice President of Applied Technologies for NeuralWare, Inc., a neural network tools and consulting company.  He was also the Research Coordinator for LakeShore Trading, Inc., a successful futures and options trading firm on the Chicago Board of Trade.  Tony may be reached at tony@the-modeling-agency.com.

All Rights Reserved by The Modeling Agency Copyright © 2007


 

 

3.  ANNOUNCEMENT

 
The Business Intelligence Network

Launches a New Expert Channel
for Data Mining and Predictive Analytics


featuring
Eric A. King
President
The Modeling Agency 


The Business Intelligence Network™ launched a new Expert Channel for Data Mining and Predictive Analytics featuring The Modeling Agency's president Eric A. King. The new Expert Channel covers the practical application of strategy, tactics and best practices for predictive modeling.  Topics and resources will focus on extending the value of business analytics with prospective intelligence derived through knowledge discovery and machine learning technology.

New content will be provided each month to guide business professionals in understanding how and when to get started in data mining, how to successfully maneuver a discovery process, interpret results, achieve actionable results and compare implementations across industries.  The channel contains articles, news, solution spotlights, events, links, channel resources and RSS article feeds.

SOLUTION SPOTLIGHT
In addition, Eric was interviewed by Ron Powell, Editorial Director of The Business Intelligence Network™, for a Solution Spotlight last month.  The Solution Spotlight offers a short but informative overview of The Modeling Agency's structure, focus, scope and experience, as well as its philosophy toward effective data mining implementation.  Be sure to tune in to Eric's presentation of The Modeling Agency and predictive modeling trends and tips at The Business Intelligence Network's Solution Spotlight.
 

ABOUT THE BUSINESS INTELLIGENCE NETWORK
The Business Intelligence Network™ delivers industry-based content hosted by domain experts and industry leaders. The Business Intelligence Network includes horizontal technology coverage from the most respected thought leaders in Business Intelligence, Business Performance Management, Data Warehousing and Data Quality. The Business Intelligence Network serves these communities with unparalleled industry coverage and resources.

 


4.  ANNOUNCEMENT
 

TDWI World Conference
The Premier Event for Business Intelligence
and Data Warehousing Education

February 17 - 22, 2008
Caesars Palace, Las Vegas

 
 

REGISTER NOW AND SAVE
Priority Code: IN03

Join leading visionaries and practitioners as they convene at the premier educational event in the business intelligence and data warehousing industry this winter in Las Vegas, Nevada. Take advantage of more than 50 full-day, half-day, and evening classes taught by the experts, a two-day business executive summit for BI directors and BI sponsors, peer networking, one-on-one guru sessions, and a hassle-free exhibit hall! 

KEYNOTE SPEAKERS

Larry P. English – Information Management in the Realized Information Age

Nancy Williams and Bob Paladino – Like Yin and Yang—BI and the Balanced Scorecard for Holistic Performance Management

THE MODELING AGENCY'S TRACK
The Modeling Agency's Senior Consultant, Dean Abbott, will present "Data Mining Techniques, Tools and Tactics" at the TDWI World Conference in Las Vegas.


CONFERENCE BENEFITS
 

  • Interact with the most knowledgeable and experienced instructors
    in the industry
  • Gain practical knowledge that you can apply immediately
  • Bridge the knowledge and communication gaps between business and IT
  • Network and share best practices with your peers

CONFERENCE REGISTRATION

For more information, and to register for the
TDWI World Conference in Las Vegas, please visit:

TDWI World Conference in Las Vegas
 Register before January 18, and receive the early registration discount.
Priority Code: IN03

Produced with permission from The Data Warehousing Institute Copyright © 2007
   


 
5.  NEWSLETTER SUMMARY
 

The Modeling Agency newsletter is a quarterly publication which provides course announcements, training schedule updates and informative articles.  This newsletter may be shared in its entirety and subscriptions are free. For additional information on TMA's training, consulting services and solutions, follow corresponding links at the top of this page.

This newsletter is shared with those who have activated a subscription, or have supplied their email address to The Modeling Agency when requesting product information.  If you wish not to receive future releases, simply send an empty email with cancel as the subject from the account with which you were subscribed.

    address
One Oxford Centre
301 Grant St, Ste 4300
Pittsburgh, PA 15219 USA
 
phone: 281.667.4200
fax: 281.652.5721
training: 888.742.2454
Copyright © 2000 - 2008 The Modeling Agency. All rights reserved.