direct
281.667.4200

training
888.742.2454

fax
281.652.5721

The Modeling Agency Quarterly Newsletter
2006-Q3 Release
 

[ August 15, 2006  |  This Edition: ]
 

1. Training Schedule Update: Learn How Experts Mine Data in
San Diego, September 25 - 29

2. Feature Article:  "Predictive Modeling at the Transaction Level: A Simple 'Policy Component'" by Terry Hipolito

3. Announcement:  The Data Warehousing Institute interviews TMA's president for a TDWI Radio News segment entitled "Delving Deeper into Data Mining"

4. Announcement:  "TDWI World Conference" in San Diego,
August 20 - 25, 2006

5. Newsletter Summary

 
 

1.  TRAINING SCHEDULE UPDATE 

 

  
COURSE SERIES ON DATA MINING STRATEGY, METHODS AND APPLICATION
Learn how experts mine data by attending The Modeling Agency's vendor-neutral, application-oriented data mining courses.  Participants will enjoy a balanced and broad presentation of predictive analytics that is not restricted to a particular tool or product.  Attendees will learn about data mining capabilities, limitations, methods, tools, strategies, techniques, applications, and costly pitfalls.  Those in attendance will leave with a comprehensive binder of notes, illustrations and references to valuable resources.

Don't leave a powerful competitive advantage untapped: harness the valuable information and profits hidden in your data.  Each offering is limited to just 18 seats.  A current status of remaining space may be viewed at TMA's main training page.   Submit an unofficial registration and reserve your seat today while your training request is processed.  

Since The Modeling Agency is not a tools vendor, participants enjoy a balanced, broad and
non-promotional perspective of predictive analytics at desirable venues throughout the USA.

CHOOSE THE TRAINING THAT'S RIGHT FOR YOU
The Modeling Agency offers three data mining courses with distinct objectives.  The courses are designed to be attended independently, or as a progressive series.  While the three levels are staged as a progression, they should not be viewed simply as "introductory, intermediate and advanced."  Refer to the table below to ensure that your experience, situation and objectives align properly with the intent, scope and depth of each offering:

Course | Focus | Scope | Geared To
Data Mining: Level I | Strategy | An intensive overview of strategy, best practices and case studies | Project leaders, Stakeholders, Functional Managers
Data Mining: Level II | Methods | A tactical drill-down of the data mining process, methods, techniques and resources | Business Analysts, Functional Analysts, IT Professionals
Data Mining: Level III | Application | A hands-on application workshop as an extension to Data Mining: Level II | Practitioners, Model-builders, Decision Support Developers

 

 

FULL COURSE DETAILS

The featured course schedule for this section is outdated.  For current course dates, locations, pricing and detailed outlines, please visit the main training page.

web
http://www.the-modeling-agency.com/training

email
training@the-modeling-agency.com

phone
888-742-2454 (toll free)
281-667-4200 (direct)
281-652-5721 (fax)
 

Courses May Be Delivered At Your Site

Call (888) 742-2454 or send an email inquiry to receive a value-based
spreadsheet quotation for training at your site.

 

 

2.  FEATURE ARTICLE

 

PREDICTIVE MODELING AT
THE TRANSACTION LEVEL

"A Simple Policy Component"

by
Terry Hipolito
 

THE CASE
This is a case study of a simple “policy component”: a small piece of plug-in software that implements institutional policy, here by automating and streamlining the application of predictive and risk models within the work flow.  The definition should become clearer as the case unfolds.

A large insurance company processes thousands of scanned documents in a day.  Operators are trained to work specific kinds of transactions; many of these go into several exception queues before they are either completed or discarded.  The work is partially automated, but there are many manual touchpoints, because of technical difficulties with the documents, typographical errors, and incomplete policy information.

The insurance company operates as a servicing agent between mortgage holders, mortgage banks, and insurance companies of record, all of whom issue transactions either from branch offices, or (a fourth element to the puzzle) independent agents.  The incoming transactions must of course have accurate and conclusive identification of all parties.  Operators are evaluated on volume and are expected to process over one hundred transactions an hour.

There are many places where this complex work flow can break down.  A common one, indeed the most common one, is that of the ‘payee code’.  The payee code refers to the office, or merely to the department within the office, from or to which the transaction flows.  Each mortgage bank refers to these offices with its own coding system; so the same office (for example a State Farm agent in Des Moines, Iowa) may be identified in numerous ways, depending on whether the mortgage banker is Washington Mutual, Wells Fargo, or Ameriquest.   Of course the office does not change, but the code itself may be entirely different depending on the banker.

The payee code is, therefore, a crucial but slippery bit of data required for a transaction to be completed.  And very often this code does not come in with the transaction, or if it does it may be incomplete or inaccurate.  If the office issues the transaction, it is unlikely to know what code the banker uses to identify it.  If the bank issues the transaction, it may simply neglect to enter the information or get it wrong.  The payee code, for whatever variety of reasons, ends up being a major problem in pushing transactions efficiently through the system.  The company could well spend several thousand dollars each week tracking down payee codes.

Most transactions come in to the office as physical paper and are scanned into images and interpreted into text.  The image is available to the operator, and the text is used by software to prepare the transaction even further.  For example, the borrower’s loan or policy number is frequently missing or incorrect.  The borrower’s name, address, and social security number are, however, far more likely to be correct and complete.  This information often makes it possible to query a database and fill in the loan and policy numbers.  A loan or policy number acquired in this way may need to be verified, and the transaction is flagged accordingly for the operator to intervene as necessary.
 

THE POLICY COMPONENT
Software preparation and intervention are, in other words, already part of the established process.  It turned out that similar help was possible for the payee code with a simple predictive model.  Policy numbers sometimes have specific formats which may be very informative.  For example, some carriers embed groups of letters in their policy numbers; the policy number 6132HP200809 is likely to be identifiable by its pattern of digits and letters.  Indeed, the format was quite plausibly designed so that it could be easily identified.

One problem in the bulk processing center is that so many such patterns are in circulation that only a very experienced and talented operator can keep many of them in mind while other pressures are also being exerted.  An expert operator might recognize the pattern, know the mortgage banker quite well, recall that this is not an EDI transaction, and realize that the missing payee code is therefore most likely 61074.  But relying on that expert knowledge being on hand is not especially efficient if something better is available.

And something better is available.  We have learned already that a policy number of four digits, followed by two letters, followed by six digits is for this mortgage banker either guaranteed or quite likely to be (specifically) Nationwide Mutual.  This sort of knowledge is similar to the knowledge that a particular borrower name and address imply a particular loan and policy.  The knowledge about the payee code, however, is shakier than the loan and policy number knowledge.  But logically there is very little difference.  Here is the response of a prototype installation for this policy number:

  2 scores for 6132HP200809

  Code   Office                    Score
  61074  NATIONWIDE MUT FI INS CO  486
  EDINA  NATIONWIDE MUT FI INS CO  340
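
A lookup of this kind can be sketched in a few lines of Python.  This is a minimal illustration, not the prototype's code: the names PATTERN_TABLE and score_policy are hypothetical, the regular expression is written from the prose description (four digits, two letters, six digits) and assumes uppercase letters, and the codes and scores are simply copied from the response above.

```python
import re

# Assumption: the pattern as described in the prose, with uppercase letters.
FOUR_DIGITS_TWO_LETTERS_SIX_DIGITS = re.compile(r"[0-9]{4}[A-Z]{2}[0-9]{6}")

# Hypothetical in-memory table: (payee code, office, score, compiled pattern).
# Codes and scores are copied from the prototype response in the article.
PATTERN_TABLE = [
    ("61074", "NATIONWIDE MUT FI INS CO", 486, FOUR_DIGITS_TWO_LETTERS_SIX_DIGITS),
    ("EDINA", "NATIONWIDE MUT FI INS CO", 340, FOUR_DIGITS_TWO_LETTERS_SIX_DIGITS),
]


def score_policy(policy_number):
    """Return (code, office, score) candidates for a policy number, best first."""
    hits = [
        (code, office, score)
        for code, office, score, pattern in PATTERN_TABLE
        if pattern.fullmatch(policy_number)
    ]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


if __name__ == "__main__":
    for code, office, score in score_policy("6132HP200809"):
        print(code, office, score)
```

A policy number that matches no stored pattern simply yields an empty list, which corresponds to a transaction for which no suggestion is offered.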

 
The same office is involved in this case, but the distinction between manual and electronic transfer (which the different payee codes indicate) is not built into the model.  This lack of certainty needed to be recognized and dealt with.  It might well be that a policy pattern applies to several payees.  It is also true that some patterns of policy numbers carry more ‘information’ about the payee than others do.  For example a nine digit policy number with no letters or special characters might apply to several carriers as well as numerous payee codes, as in this example:

  7 scores for 921228387

  Code   Office                  Score
  60923  FI INS EXCHANGE         413
  60917  FARMERS INS CO INC      388
  60702  ALLSTATE INS CO         361
  62772  ALLSTATE INS            336
  60705  ALLSTATE INDEMNITY CO   299
  61184  STATE FARM FI & CAS CO  133
  61178  STATE FARM FI & CAS CO  95

 
Here a purely numeric policy number leads to more ambiguous results, but it looks as though Farmers, Allstate, or State Farm are likely candidates.  The operator may find this sort of information helpful or confusing.  Sorting all of this out takes some effort and careful analysis.

It might at first seem daunting to come to grips with this sort of problem.  But it needed only a quite straightforward bit of data mining:

  1. Make a collection of problem transactions and arrange by policy number and payee code.

  2. Inspect the policy numbers of the high volume exceptions and look for patterns.

  3. Prepare some pattern matching software in prototype to catch the policy patterns.

  4. Attempt to predict the payee code on a set of randomly selected transactions.

  5. Quantify the success rate as a confidence score.

  6. Iterate for optimal effect, with examples selected at random.
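
The first steps above can be sketched as follows.  This is only an illustration under stated assumptions, not the method used in the prototype: it reduces each policy number to a "shape" (digits become 9, letters become A) so that high-volume shapes can be counted per payee code, and it measures a success rate on a sample.  The function names and the sample pairs are hypothetical; the policy numbers are borrowed from the examples in this article.

```python
from collections import Counter


def shape_of(policy_number):
    """Reduce a policy number to its shape: digits become 9, letters become A."""
    out = []
    for ch in policy_number:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep punctuation such as '-' unchanged
    return "".join(out)


def shape_counts(transactions):
    """Count (shape, payee_code) pairs over (policy_number, payee_code) samples."""
    return Counter((shape_of(policy), payee) for policy, payee in transactions)


def success_rate(predict, sample):
    """Fraction of (policy_number, payee_code) pairs that predict() gets right."""
    if not sample:
        return 0.0
    correct = sum(1 for policy, payee in sample if predict(policy) == payee)
    return correct / len(sample)


if __name__ == "__main__":
    # Illustrative pairs only; a real study would use the exception queue.
    sample = [("6132HP200809", "61074"), ("921228387", "60923")]
    for (shape, payee), count in shape_counts(sample).items():
        print(shape, payee, count)
```

Shapes that occur often for one payee and rarely elsewhere are the candidates worth turning into regular expressions in step 3.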

 
The resulting model employs simple statistics – percent of the pattern in the transaction population, percent of the payee in the transaction population, and percent of the pattern for the payee – to come up with a predictive score.  All of this goes to a database which is maintained in memory and is instantly available to any transaction.  Here are the two database records which provided the scores for the first example above:

Payee Code: 61074
Name: NATIONWIDE MUT FI INS CO
Score: 486
Regular Expression: [0-9]{4}[^a-z]{2}[0-9]{6}
Min. Length: 12
Max Length: 12
Pattern / Population: 0.021
Pattern / Payee: 0.8
Payee / Population: 0.017

Payee Code: EDINA
Name: NATIONWIDE MUT FI INS CO
Score: 340
Regular Expression: [0-9]{4}[^a-z]{2}[0-9]{6}
Min. Length: 12
Max Length: 12
Pattern / Population: 0.0298
Pattern / Payee: 1
Payee / Population: 0.0045
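
The two records can be represented and exercised as in the sketch below.  The class and method names are hypothetical.  The article does not give the prototype's scoring formula, and bayes_estimate is not that formula (it does not reproduce the scores 486 and 340); it only shows how the three stored ratios could be combined by Bayes' rule, assuming that "Pattern / Payee" means the share of the payee's transactions that match the pattern.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternRecord:
    payee_code: str
    name: str
    score: int
    regex: str
    min_length: int
    max_length: int
    pattern_in_population: float  # "Pattern / Population"
    pattern_in_payee: float       # "Pattern / Payee"
    payee_in_population: float    # "Payee / Population"

    def matches(self, policy_number):
        """Apply the length bounds, then the stored regular expression."""
        if not self.min_length <= len(policy_number) <= self.max_length:
            return False
        # As stored, [^a-z]{2} accepts any two characters that are not
        # lowercase letters, so it matches "HP" but would also match digits.
        return re.fullmatch(self.regex, policy_number) is not None

    def bayes_estimate(self):
        """Assumed reading: P(payee | pattern) =
        P(pattern | payee) * P(payee) / P(pattern).  Not the prototype's score."""
        return (self.pattern_in_payee * self.payee_in_population
                / self.pattern_in_population)


# Values copied from the two database records shown in the article.
RECORDS = [
    PatternRecord("61074", "NATIONWIDE MUT FI INS CO", 486,
                  r"[0-9]{4}[^a-z]{2}[0-9]{6}", 12, 12, 0.021, 0.8, 0.017),
    PatternRecord("EDINA", "NATIONWIDE MUT FI INS CO", 340,
                  r"[0-9]{4}[^a-z]{2}[0-9]{6}", 12, 12, 0.0298, 1.0, 0.0045),
]

if __name__ == "__main__":
    for record in RECORDS:
        print(record.payee_code, record.matches("6132HP200809"),
              round(record.bayes_estimate(), 3))
```

Under this reading the estimates come to roughly 0.65 for 61074 and 0.15 for EDINA, which preserves the ranking of the two stored scores.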

 
A policy component such as the payee code software is liable to require some handcrafting, but it was not in this case very labor-intensive, considering the potential payoff in improved transaction throughput and accuracy in a hectic and ambiguous work environment.  The prototype in the background of this discussion required about two weeks’ work for a single person.  Even if it had turned up exactly nothing, it might have been worth an investment of that size to discover where not to look further for relief.   

The results, however, were a bit better than that.  Whether they were worth production deployment was another question.  Fortunately the statistics which predict the effect of such a system are quite reliable.  Unknown, and unknowable from the perspective of the prototype, is the actual effect on the business, since that depends on variables outside of the data mining and prototyping.  Those have to do with volumes, training, and other environmental and economic factors.  Decisions concerning the purely business problems had to come from business managers, and the costliness of the payee code problem made them more than willing to consider those decisions.

This two-week effort, in other words, made it possible to create a reliable cost-benefit analysis of deploying the prototype into production.  After the prototyping effort, nearly all risk involved with the technology had been surmounted.  The deployment strategy dictated where and how to implement the new facility within the workflow.  But the risk of having incorrect or impractical software had essentially been eliminated with the conclusion of the prototype.
 

COST BENEFIT ANALYSIS
The cost benefit analysis for this project was not formal.  It was in fact a proverbial slam dunk.  Its numbers, however, are well worth reviewing here for two reasons: (1) readers of this description will not have the familiarity which the managers did, and (2) the consideration of ROI is a crucial piece of any policy component.

The plug-in nature of components is generally familiar as a matter of technology.  Now consider the nature of the component as part of a business environment.  On the one hand there is a legacy system and process for transactions; on the other is a functional prototype which might plug in to the legacy system and which would address a quantifiable bottleneck in it.  Those of you who have managed IT projects, either from the technical or business sides, know the difficulty of measuring scope, risk, and the impact on the overall process at the beginning of a six-month campaign of software development.

Contemplate the component, all fabricated apart from the system by one or two people, essentially without need for management.  All that is required with such a prototype in hand is to measure exactly what the component does and see how much that would help or hurt.  In the case of the payee code, the help was considerable but, alas, not all that might have been hoped for.  It was estimated that for a typical batch of transactions, the component could accurately predict the payee code approximately 8% of the time and offer useful suggestions 33% of the time.

These numbers suggested that the component might reduce the payee code exception queue by 10%, increase customer satisfaction and reduce error handling on 5,000 transactions a week.  That would speed up the flow, create good will, and increase competitive edge (all without quantification), while saving nearly 100 hours of labor (a hard number).

The following screen from the prototype shows the results of running 100 randomly selected policies against the patterns and evaluating the performance:

Scored 100 transactions in 16 milliseconds

Percent scored: 49.00%
Absolute percent accurate top score: 12.00%
Absolute percent accurate secondary score: 9.00%
Percent accurate top score of those processed: 24.49%
Percent accurate secondary score of those processed: 18.37%

Average correct top score: 426.67
Average correct secondary score: 226.67

  
In this particular case nearly 50% of the policies received some sort of score; 12% of all policies received an accurate top score and another 9% an accurate secondary score, which is plausibly helpful.  The average score of the accurate top predictions (426.67) is exactly 200 points higher than that of the accurate secondary predictions (226.67).  A series of such random selections offers the sort of quantification which should be helpful for making a confident decision.
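
The percentages on the evaluation screen follow directly from three counts, as the short sketch below shows.  The function name and dictionary keys are hypothetical; the counts (100 policies, 49 scored, 12 correct top scores, 9 correct secondary scores) are read off the screen above.

```python
def evaluation_summary(total, scored, top_correct, secondary_correct):
    """Recompute the percentages shown on the prototype's evaluation screen."""

    def pct(part, whole):
        # Guard against an empty batch or a batch in which nothing was scored.
        return round(100.0 * part / whole, 2) if whole else 0.0

    return {
        "percent_scored": pct(scored, total),
        "absolute_top": pct(top_correct, total),
        "absolute_secondary": pct(secondary_correct, total),
        "top_of_processed": pct(top_correct, scored),
        "secondary_of_processed": pct(secondary_correct, scored),
    }


if __name__ == "__main__":
    for name, value in evaluation_summary(100, 49, 12, 9).items():
        print(f"{name}: {value:.2f}%")
```

The "of those processed" figures divide by the 49 scored policies rather than by all 100, which is why 12 correct top scores appear both as 12.00% and as 24.49%.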

Meanwhile the component has been completed and technical risk eliminated.  Nevertheless coding is necessary to install the component into the legacy system, to make changes to workflow and to provide documentation and training.  Such things will vary of course and need not concern us, except to note that the estimates should be relatively accurate.  The remaining build-out resembles a plumbing project more closely than it does advanced technology.

 
DEPLOYMENT
These results have genuine meaning only when the deployment strategy has been thought out.  Other numbers might be more relevant to other deployment strategies.  The following seemed to be the most straightforward way to effect deployment in this transaction environment:

  1. Subject all transactions to pattern matching for payee code during preprocessing.  (The scoring process requires negligible computer resource, as is evident in the last screen shot, which averages slightly over 15 milliseconds to process 100 randomly selected policy numbers.)

  2. Write a database record for each prediction; there might be more than one per transaction, in which case they would be ordered by score.

  3. Predictions for each transaction, if they exist, are available to the operator.

  4. An icon or some unobtrusive signal is presented for selection if there are predictions.

  5. The operator may select the icon and receive a pop-up list which indicates the payee codes, the payee names, and their scores, sorted by highest score at the top.

  6. The operator considers these recommendations and chooses any one (or none) of them.

  7. Transaction logging reflects any of these choices in the history of the transaction for subsequent analysis and tuning.
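
The flow above can be sketched as follows.  This is a minimal illustration under stated assumptions, not the deployed design: the Transaction class, its fields, and the function names are hypothetical, and the scorer is any function that returns (code, name, score) tuples for a policy number.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    transaction_id: str
    policy_number: str
    predictions: list = field(default_factory=list)  # (code, name, score) tuples
    history: list = field(default_factory=list)      # kept for analysis and tuning


def preprocess(transaction, scorer):
    """Steps 1-2: score the policy number and store predictions, best first."""
    hits = scorer(transaction.policy_number)
    transaction.predictions = sorted(hits, key=lambda p: p[2], reverse=True)
    return transaction


def has_suggestions(transaction):
    """Steps 3-4: tells the user interface whether to show the icon."""
    return bool(transaction.predictions)


def record_choice(transaction, chosen_code):
    """Steps 6-7: log what was offered and what the operator chose (or None)."""
    offered = [code for code, _name, _score in transaction.predictions]
    if chosen_code is not None and chosen_code not in offered:
        raise ValueError("operator chose a code that was not offered")
    transaction.history.append({"offered": offered, "chosen": chosen_code})
```

Logging the full list that was offered, and not only the choice, is what makes the later tuning possible: it shows how often the top score was accepted, how often a secondary score was preferred, and how often every suggestion was rejected.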

 
This strategy might or might not be feasible or desirable in any given work environment.  The decision and strategy to deploy the prototype of a policy component requires a thorough review.  The strategy is available for review, correction, emendation, and the like.  A completely different strategy might require a different set of statistical results, which should not be difficult to produce.  In any event, concrete “policy” is now available for decision support.


MAINTENANCE
A final aspect of deployment involves maintenance.  In this example, payee codes are quite volatile; the mortgage bankers are liable to change, add, and delete them frequently.  When new mortgage bankers are added to the system, this subsystem needs to be maintained as well.  The prototype implements its scoring by loading a database table, with these columns:

  • a unique identifier for the mortgage banker,
  • the payee code,
  • the pattern as a regular expression,
  • a minimum length,
  • a maximum length,
  • the resultant score,
  • some statistical numbers for internal use, generated by the software.
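
A table with these columns, and the one-time load into memory, might look like the sketch below.  It uses SQLite only to keep the example self-contained; the table name, column names, the banker identifier "BANKER-1", and the choice to store the internal statistics as a single text column are all assumptions, not details of the prototype.

```python
import re
import sqlite3

SCHEMA = """
CREATE TABLE payee_pattern (
    banker_id   TEXT NOT NULL,     -- unique identifier for the mortgage banker
    payee_code  TEXT NOT NULL,
    pattern     TEXT NOT NULL,     -- the pattern as a regular expression
    min_length  INTEGER NOT NULL,
    max_length  INTEGER NOT NULL,
    score       INTEGER NOT NULL,  -- the resultant score
    stats       TEXT               -- internal statistics generated by the software
)
"""


def load_patterns(conn):
    """Read the table once and keep compiled patterns in memory, keyed by banker."""
    table = {}
    rows = conn.execute(
        "SELECT banker_id, payee_code, pattern, min_length, max_length, score "
        "FROM payee_pattern"
    )
    for banker_id, code, pattern, min_len, max_len, score in rows:
        table.setdefault(banker_id, []).append(
            (code, re.compile(pattern), min_len, max_len, score)
        )
    return table


def make_demo_connection():
    """Build an in-memory database holding one record from the article."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO payee_pattern VALUES (?, ?, ?, ?, ?, ?, ?)",
        ("BANKER-1", "61074", r"[0-9]{4}[^a-z]{2}[0-9]{6}", 12, 12, 486, None),
    )
    return conn


if __name__ == "__main__":
    print(load_patterns(make_demo_connection()))
```

With this layout, adding a mortgage banker or changing a payee code means editing rows and reloading the table, with no change to the scoring code.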

 
Maintenance is therefore quite simple to effect physically.  The ‘logical’ maintenance is somewhat more difficult, because it requires the statistical routines to be applied.  The prototype has these in place, so running them on a schedule or as the situation demands is simple enough, but it does require some time, planning, budgeting, and decisions.  The implemented system, however, need never require coding changes to keep pace with environmental changes.

 
POLICY COMPONENTS IN SUMMARY
This example was selected for its simplicity.  It starkly profiles a policy component; a policy component has the following features:

  1. Implements policy as predictive modeling at any place in the enterprise from point of sale to boardroom;

  2. Results from specific analysis, often through data mining and statistical modeling;

  3. Employs a RAD prototyping methodology;

  4. Has a predictable ROI after prototype and prior to full implementation;

  5. Is capable of nearly any statistical functionality;

  6. Is “pluggable,” “embeddable” and extremely efficient.

  
Although this example was selected primarily for its simplicity, policy components can support far greater complexity.  For example, a mortgage banker implemented a policy component, which was actually deployed more as a full subsystem, to evaluate its entire portfolio for the likelihood of each of its loans becoming 30, 60, or 90 days delinquent.  The statistical analysis for this component employed logistic and linear regression on eight variables, cluster analysis, and the transformation of the entire model into fuzzy sets.  The scoring (which included reading and writing text files and the “persisting” of objects) averaged 20 milliseconds per loan.  The prototyping phase for this project required six weeks for two people.  The implementation employed a team of several developers for nearly four months, but the predicted ROI of over $1,000,000 per annum was easily realized, and (more importantly) was very close to the prediction.

 
ABOUT THE AUTHOR
Terry Hipolito has several years’ experience with software development and architecture, statistical modeling, databases, and project management; his education includes a Ph.D. from UCLA. Terry is now an independent consultant who specializes in the design, development, and deployment of “policy components.”  He is writing a book on this subject, complete with methodology, statistical theory and full examples.  A subset of this content will soon be available on www.policybots.com.  Reach Terry via tahipolito@earthlink.net or fax (714) 993-3218.


 

All Rights Reserved by Terry Hipolito Copyright © 2006

 

 

3.  ANNOUNCEMENT
 
 
The Data Warehousing Institute interviews TMA's president

Download the TDWI Radio News segment entitled
"Delving Deeper into Data Mining"
 
 

ABOUT THE INTERVIEW
The Data Warehousing Institute's web editor, Eric Kavanagh, interviewed The Modeling Agency's president, Eric King, to gain insights on data mining definitions, misconceptions, trends, best practices, strategy, process and applications.

TDWI Radio News delivers close-up interviews with industry professionals in the growing field of information management. Listen as practitioners give their elevator pitch, then answer questions designed to elicit brass-tacks examples.

DATA MINING INTERVIEW TOPICS INCLUDE

  • How companies use data mining for competitive advantage

  • Common misconceptions about data mining

  • Professional tips for beginning a data mining initiative

  • Establishing a successful data mining project plan

INTERVIEW DOWNLOAD PAGE

Download a .wav or .mp3 file from TDWI's Radio News archive page.

Produced with permission from The Data Warehousing Institute Copyright © 2006
 


4.  ANNOUNCEMENT
 

TDWI World Conference
The Premier Event for Business Intelligence
and Data Warehousing Education

August 20 - 25, 2006
Manchester Grand Hyatt
San Diego, California

 
CONFERENCE HIGHLIGHTS
The TDWI World Conference in San Diego brings together leading industry visionaries to deliver a unique program of cutting-edge education, best practices, one-on-one consulting, peer networking, business intelligence certification, and product demos. From business intelligence fundamentals to business analytics, TDWI’s program of more than 50 full-day, half-day, and night school courses offers something for your entire team.

At the TDWI World Conference, The Modeling Agency's Tony Rathburn will present a full-day seminar on "Predictive Analytics" Wednesday and Dean Abbott will present "Data Mining" on Thursday.
 


HOT TOPICS

  • Bringing Business and IT Together

  • Measuring the Value of Information

  • Data Mining and Predictive Analytics

  • Getting BI Requirements Right

  • BI and Governance

  • Open Source Adoption

 

CONFERENCE REGISTRATION

View additional information, and register for the TDWI World Conference in San Diego.
 

Produced with permission from The Data Warehousing Institute Copyright © 2006
   
 


 
5.  NEWSLETTER SUMMARY
 

The Modeling Agency newsletter is a quarterly publication which provides course announcements, training schedule updates and informative articles.  This newsletter may be shared in its entirety and subscriptions are free. For additional information on TMA's training, consulting services and solutions, follow corresponding links at the top of this page.

This newsletter is shared with those who have activated a subscription, or have supplied their email address to The Modeling Agency when requesting product information.  If you do not wish to receive future releases, simply send an empty email with "cancel" as the subject from the account with which you subscribed.

    address
One Oxford Centre
301 Grant St, Ste 4300
Pittsburgh, PA 15219 USA
 
phone: 281.667.4200
fax: 281.652.5721
training: 888.742.2454
Copyright © 2000 - 2008 The Modeling Agency. All rights reserved.