August 15, 2006
This Edition:
1. Training Schedule Update: Learn How Experts Mine Data in San Diego, September 25 - 29
2. Feature Article: "Predictive Modeling at the Transaction Level: A Simple 'Policy Component'" by Terry Hipolito
3. Announcement: The Data Warehousing Institute interviews TMA's president for a TDWI Radio News segment entitled "Delving Deeper into Data Mining"
4. Announcement: "TDWI World Conference" in San Diego, August 20 - 25, 2006
5. Newsletter Summary
1.
TRAINING SCHEDULE UPDATE
COURSE SERIES ON DATA MINING STRATEGY, METHODS AND APPLICATION
Learn how experts
mine data by attending The Modeling Agency's vendor-neutral,
application-oriented data mining courses. Participants
will enjoy a balanced and broad presentation of predictive
analytics that is not tied to a particular tool or
product. Attendees will learn about data mining
capabilities, limitations, methods, tools, strategies,
techniques, applications, and costly pitfalls. Those in attendance
will leave with a comprehensive binder of notes,
illustrations and references to valuable resources.
Don't leave a
powerful competitive advantage untapped: harness the valuable information and
profits hidden in your data. Each offering is limited to just 18 seats. The current number of remaining seats may be
viewed at TMA's main training page.
Submit an
unofficial registration and reserve your seat today while
your training request is processed.
Since The Modeling Agency is not a
tools vendor, participants enjoy a balanced, broad and
non-promotional perspective of predictive analytics at desirable
venues throughout the USA.
CHOOSE THE TRAINING
THAT'S RIGHT FOR YOU
The Modeling Agency offers three data mining courses with
distinct objectives. The courses are designed to be
attended independently, or as a progressive series. While the
three levels are staged as a progression, they should not be viewed
simply as "introductory, intermediate and advanced." Refer to the table
below to ensure that your experience, situation and objectives align
properly with the intent, scope and depth of each offering:
Course | Focus | Scope | Geared To
Data Mining: Level I | Strategy | An intensive overview of strategy, best practices and case studies | Project leaders, Stakeholders, Functional Managers
Data Mining: Level II | Methods | A tactical drill-down of the data mining process, methods, techniques and resources | Business Analysts, Functional Analysts, IT Professionals
Data Mining: Level III | Application | A hands-on application workshop as an extension to Data Mining: Level II | Practitioners, Model-builders, Decision Support Developers
2.
FEATURE ARTICLE
PREDICTIVE MODELING AT
THE TRANSACTION LEVEL
"A Simple Policy Component"
by
Terry Hipolito
THE CASE
This is a
case study of a simple “policy component”: plug-in software that automates
and streamlines the application of predictive and risk models within the work
flow. A policy component is a small piece of plug-in software which
implements institutional policy. The definition should
become clear in the course of the case study.
A large insurance company
processes thousands of scanned documents in a day. Operators are trained to
work specific kinds of transactions; many of these go into several exception
queues before they are either completed or discarded. The work is partially
automated, but there are many manual touchpoints, because of technical
difficulties with the documents, typographical errors, and incomplete policy
information.
The insurance company
operates as a servicing agent between mortgage holders, mortgage banks, and
insurance companies of record, all of whom issue transactions either from
branch offices, or (a fourth element to the puzzle) independent agents. The
incoming transactions must of course have accurate and conclusive
identification of all parties. Operators are evaluated on volume and are
expected to process over one hundred transactions an hour.
There are many places where
this complex work flow can break down. A common one, indeed the most common
one, is that of the ‘payee code’. The payee code refers to the office, or
merely to the department within the office, from or to which the transaction
flows. Each mortgage bank refers to these offices with its own coding
system; so the same office (for example a State Farm agent in Des Moines,
Iowa) may be identified in numerous ways, depending on whether the mortgage
banker is Washington Mutual, Wells Fargo, or Ameriquest. Of course the
office does not change, but the code itself may be entirely different
depending on the banker.
The payee code is,
therefore, a crucial but slippery bit of data required for a transaction to
be completed. And very often this code does not come in with the
transaction, or if it does it may be incomplete or inaccurate. If the
office issues the transaction, it is unlikely to know what code the banker
uses to identify it. If the bank issues the transaction, it may simply
neglect to enter the information or get it wrong. The payee code, for
whatever variety of reasons, ends up being a major problem in pushing
transactions efficiently through the system. The company could well spend
several thousand dollars each week tracking down payee codes.
Most transactions come in to
the office as physical paper and are scanned into images and interpreted
into text. The image is available to the operator, and the text is used by
software to prepare the transaction even further. For example, the
borrower’s loan or policy number is frequently missing or incorrect.
The borrower’s name, address, and social security number
are, however, far more likely to be correct and complete. This information
often makes it possible to query a database and fill in the loan and policy
numbers. A loan or policy number acquired in this way may need to be
verified, and the transaction is flagged accordingly for the operator to
intervene as necessary.
THE POLICY COMPONENT
Software preparation
and intervention is, in other words, already part of the established
process. It turned out that similar help was possible for the payee code
with a simple predictive model. Policy numbers sometimes have specific
formats which may be very informative. For example, some carriers embed
groups of letters in their policy numbers, so that a policy number such as
6132HP200809 is likely to be identifiable by its pattern of digits and
letters. Indeed, the policy number was quite plausibly designed just so it
could be easily identified.
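As a minimal sketch of that idea, the layout of 6132HP200809 (four digits, two letters, six digits) can be tested with a regular expression. The pattern and the function name below are illustrative assumptions, not the carrier's published format or the prototype's code, and the second policy number is made up.

```python
import re

# Assumed pattern: four digits, two uppercase letters, six digits.
# This is inferred from the single example 6132HP200809.
POLICY_PATTERN = re.compile(r"[0-9]{4}[A-Z]{2}[0-9]{6}")


def matches_policy_pattern(policy_number: str) -> bool:
    """Return True when the entire policy number has the assumed layout."""
    return POLICY_PATTERN.fullmatch(policy_number) is not None


print(matches_policy_pattern("6132HP200809"))  # layout from the article
print(matches_policy_pattern("123456789"))     # made-up purely numeric number
```

Using `fullmatch` rather than `search` keeps a longer number that merely contains the layout from being counted as a hit.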
One problem in the bulk processing center is
that there are so many such patterns flying around that it takes a very
experienced and talented operator indeed to keep very many of them in mind,
as other pressures are also being exerted. An expert operator might
recognize the pattern, know the mortgage banker quite well, recall that this
is not an EDI transaction, and realize that therefore the missing payee code
is most likely 61074. But relying on an expert happening to recall all of this is
not especially efficient if something better is available.
And something better is
available. We have learned already that a policy number of four digits,
followed by two letters, followed by six digits is for this mortgage banker
either guaranteed or quite likely to be (specifically) Nationwide Mutual.
This sort of knowledge is similar to the knowledge that a particular
borrower name and address imply a particular loan and policy. The knowledge
about the payee code, however, is shakier than the loan and policy number
knowledge. But logically there is very little difference. Here is the
response of a prototype installation for this policy number:
2 scores for 6132HP200809
Code | Office | Score
61074 | NATIONWIDE MUT FI INS CO | 486
EDINA | NATIONWIDE MUT FI INS CO | 340
The same office is involved in this case, but the distinction between manual
and electronic transfer (which the different payee codes indicate) is not
built into the model. This lack of certainty needed to be recognized and
dealt with. It might well be that a policy pattern applies to several
payees. It is also true that some patterns of policy numbers carry more
‘information’ about the payee than others do. For example a nine digit
policy number with no letters or special characters might apply to several
carriers as well as numerous payee codes, as in this example:
7 scores for 921228387
Code | Office | Score
60923 | FI INS EXCHANGE | 413
60917 | FARMERS INS CO INC | 388
60702 | ALLSTATE INS CO | 361
62772 | ALLSTATE INS | 336
60705 | ALLSTATE INDEMNITY CO | 299
61184 | STATE FARM FI & CAS CO | 133
61178 | STATE FARM FI & CAS CO | 95
Here a purely numeric policy leads to more ambiguous results, but it looks
as though Farmers, Allstate, or State Farm are likely candidates. The
operator may find this sort of information helpful or confusing. Sorting
all of this out takes some effort and careful analysis.
It might at first seem
daunting to grapple with this sort of problem.
But it needed only a quite straightforward bit of data mining:
- Make a collection of problem transactions and arrange by policy number and payee code.
- Inspect the policy numbers of the high-volume exceptions and look for patterns.
- Prepare some pattern matching software in prototype to catch the policy patterns.
- Attempt to predict the payee code on a set of randomly selected transactions.
- Quantify the success rate as a confidence score.
- Iterate for optimal effect, with examples selected at random.
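The first two steps can be roughed out in a few lines. The sketch below is one hypothetical way to go about it, not the prototype's method: it collapses each policy number to a "shape" (digits become 9, letters become A) and counts which payee codes occur with each shape. The sample transactions are made up for illustration.

```python
from collections import Counter, defaultdict


def shape(policy_number: str) -> str:
    """Collapse a policy number to its shape: digit -> 9, letter -> A."""
    out = []
    for ch in policy_number:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep dashes and other special characters
    return "".join(out)


def shapes_by_payee(transactions):
    """Count how often each shape occurs together with each payee code."""
    counts = defaultdict(Counter)
    for policy_number, payee_code in transactions:
        counts[shape(policy_number)][payee_code] += 1
    return counts


# Made-up problem transactions: (policy number, payee code)
sample = [
    ("6132HP200809", "61074"),
    ("7710HP113344", "61074"),
    ("921228387", "60923"),
    ("921228388", "60702"),
]

for policy_shape, payees in shapes_by_payee(sample).items():
    print(policy_shape, dict(payees))
```

A shape that maps almost entirely to one payee is a candidate for a regular expression; a shape spread over many payees carries little information.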
The resulting model employs simple statistics – percent of the pattern in
the transaction population, percent of the payee in the transaction
population, and percent of the pattern for the payee – to come up with a
predictive score. All of this goes to a database which is maintained in
memory and is instantly available to any transaction. Here are the two
database records which provided the scores for the first example above:
Payee Code: 61074
Name: NATIONWIDE MUT FI INS CO
Score: 486
Regular Expression: [0-9]{4}[^a-z]{2}[0-9]{6}
Min. Length: 12
Max Length: 12
Pattern / Population: 0.021
Pattern / Payee: 0.8
Payee / Population: 0.017
Payee Code: EDINA
Name: NATIONWIDE MUT FI INS CO
Score: 340
Regular Expression: [0-9]{4}[^a-z]{2}[0-9]{6}
Min. Length: 12
Max Length: 12
Pattern / Population: 0.0298
Pattern / Payee: 1
Payee / Population: 0.0045
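The three statistics named in these records can be computed directly from a list of transactions. The sketch below assumes that "Pattern / Payee" means the share of a payee's transactions that match the pattern; that reading, the function name, and the sample data are assumptions. How the prototype combines the three numbers into a score is not stated here, so the sketch stops at the statistics.

```python
import re


def pattern_statistics(transactions, pattern, payee_code):
    """Compute the three ratios for one (pattern, payee code) pair.

    transactions: list of (policy_number, payee_code) tuples.
    """
    regex = re.compile(pattern)
    total = len(transactions)
    with_pattern = [t for t in transactions if regex.fullmatch(t[0])]
    with_payee = [t for t in transactions if t[1] == payee_code]
    both = [t for t in with_payee if regex.fullmatch(t[0])]
    return {
        "pattern_per_population": len(with_pattern) / total,
        "payee_per_population": len(with_payee) / total,
        # Assumed meaning: share of this payee's transactions with the pattern.
        "pattern_per_payee": len(both) / len(with_payee) if with_payee else 0.0,
    }


# Made-up transactions; the regular expression is copied from the records above.
sample = [
    ("6132HP200809", "61074"),
    ("7710HP113344", "61074"),
    ("5521HP998877", "EDINA"),
    ("921228387", "61074"),
    ("921228388", "60923"),
]
stats = pattern_statistics(sample, r"[0-9]{4}[^a-z]{2}[0-9]{6}", "61074")
print(stats)
```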
A policy component such as the payee code software is liable to require some
handcrafting, but it was not in this case very labor-intensive, considering
the potential payoff in improved transaction throughput and accuracy in a
hectic and ambiguous work environment. The prototype in the background of
this discussion required about two weeks’ work for a single person. Even if
it had turned up exactly nothing, it might have been worth an investment of
that size to discover where not to look further for relief.
The results, however, were a
bit better than that. Whether they were worth production deployment was
another question. Fortunately the statistics which predict the effect of
such a system are quite reliable. Unknown, and unknowable from the
perspective of the prototype, is the actual effect on the business, since
that depends on variables outside of the data mining and prototyping. Those
have to do with volumes, training, and other environmental and economic
factors. Decisions concerning the purely business problems had to be
supplied by business managers, and the costliness of the payee code problem made
them more than willing to take those decisions up.
This two-week effort, in
other words, made it possible to create a reliable cost benefit analysis of
deploying the prototype into production. After the prototyping effort,
nearly all risk involved with the technology had been surmounted. The
deployment strategy dictated where and how to implement the new facility
within the workflow. But the risk of having incorrect or impractical
software had essentially been eliminated with the conclusion of the prototype.
COST BENEFIT ANALYSIS
The cost benefit analysis for this project
was not formal. It was in fact a proverbial slam dunk. Its numbers,
however, are well worth reviewing here for two reasons: (1) readers of this
description will not have the familiarity which the managers did, and (2)
the consideration of ROI is a crucial piece of any policy component.
The plug-in nature of
components generally is familiar to one and all as bits of technology. Now
consider the nature of the component as part of a business environment. On
the one hand there is a legacy system and process for transactions; on the
other is a functional prototype which might plug in to the legacy and which
would address a quantifiable bottleneck in the existing system. Those of
you who have managed IT projects, either from the technical or business
sides, know the difficulty of measuring scope, risk, and the impact on the
overall process at the beginning of a six-month campaign of software
development.
Contemplate the component,
all fabricated apart from the system by one or two people, essentially
without need for management. All that is required with such a prototype in
hand is to measure what the component does exactly and see how much that
would help or hurt. In the case of the payee code, the help was
considerable but not, alas, all that might have been hoped for. It was
estimated that for a typical batch of transactions, the component could
accurately predict the payee code approximately 8% of the time and offer
useful suggestions 33% of the time.
These numbers suggested that
perhaps the component might reduce the payee code exception queue by 10%,
increase customer satisfaction and reduce error handling on 5,000
transactions a week, thereby speeding up the flow, creating good will,
increasing competitive edge (all without quantification), but (a hard
number) saving nearly 100 hours of labor.
The following screen from
the prototype shows the results of running 100 randomly selected policies
against the patterns and evaluating the performance:
Scored 100 transactions in 16 milliseconds
Percent scored: 49.00%
Absolute percent accurate top score: 12.00%
Absolute percent accurate secondary score: 9.00%
Percent accurate top score of those processed: 24.49%
Percent accurate secondary score of those processed: 18.37%
Average correct top score: 426.67
Average correct secondary score: 226.67
In this particular case nearly 50% of the policies received some sort of
score. Measured against all 100 transactions, the top score was accurate for
12% and the secondary score was accurate, and so plausibly helpful, for another 9%. The
average score of the accurate top predictions is exactly 200 points higher than
that of the accurate secondary ones. A series of such random selections offers
the sort of quantification which should be helpful for making a confident
decision.
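The percentages on this screen follow from simple counts, as the sketch below shows. It assumes that "secondary score" means the second-ranked prediction for a transaction; the function and the synthetic data, which were built to match the counts above (49 scored, 12 top hits, 9 secondary hits out of 100), are illustrative and not taken from the prototype.

```python
def evaluate(results):
    """Summarize prediction accuracy.

    results: list of (actual_code, ranked_predictions), where
    ranked_predictions is a list of payee codes, best first, and is
    empty when no pattern matched.
    """
    def pct(part, whole):
        return round(100.0 * part / whole, 2) if whole else 0.0

    total = len(results)
    scored = [(actual, ranked) for actual, ranked in results if ranked]
    top = sum(1 for actual, ranked in scored if ranked[0] == actual)
    second = sum(
        1 for actual, ranked in scored
        if len(ranked) > 1 and ranked[1] == actual
    )
    return {
        "percent_scored": pct(len(scored), total),
        "absolute_top": pct(top, total),
        "absolute_secondary": pct(second, total),
        "top_of_scored": pct(top, len(scored)),
        "secondary_of_scored": pct(second, len(scored)),
    }


# Synthetic data shaped to match the counts on the screen above.
results = (
    [("A", ["A", "B"])] * 12    # top prediction correct
    + [("B", ["A", "B"])] * 9   # second prediction correct
    + [("C", ["A", "B"])] * 28  # scored, but neither prediction correct
    + [("C", [])] * 51          # no pattern matched, so no score
)
report = evaluate(results)
print(report)
```

With these counts, 12 of 49 scored transactions gives 24.49% and 9 of 49 gives 18.37%, which agrees with the "of those processed" lines on the screen.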
Meanwhile the component has
been completed and technical risk eliminated. Nevertheless coding is
necessary to install the component into the legacy system, to make changes
to workflow and to provide documentation and training. Such things will
vary of course and need not concern us, except to note that the estimates
should be relatively accurate. The remaining build-out resembles a plumbing
project more closely than it does advanced technology.
DEPLOYMENT
These results have genuine meaning only when
the deployment strategy has been thought out. Other numbers might be
more relevant to other deployment strategies. The following seemed to be
the most straightforward way to effect deployment in this transaction
environment:
- Subject all transactions to pattern matching for payee code during preprocessing. (The scoring process requires negligible computer resource, as is evident in the last screen shot, which averages slightly over 15 milliseconds to process 100 randomly selected policy numbers.)
- Write a database record for each prediction; there might be more than one per transaction, in which case they would be ordered by score.
- Predictions for each transaction, if they exist, are available to the operator.
- An icon or some unobtrusive signal is presented for selection if there are predictions.
- The operator may select the icon and receive a pop-up list which indicates the payee codes, the payee names, and their scores, sorted by highest score at the top.
- The operator considers these recommendations and chooses any one (or none) of them.
- Transaction logging reflects any of these choices in the history of the transaction for subsequent analysis and tuning.
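The preprocessing and pop-up steps above might look something like the following sketch. The two records reuse the values shown earlier in the article for policy number 6132HP200809; the class, the function, and the in-memory list are assumptions standing in for whatever the production system would use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternRecord:
    payee_code: str
    name: str
    score: int
    regex: str
    min_length: int
    max_length: int


# Values copied from the two records shown earlier in the article.
RECORDS = [
    PatternRecord("EDINA", "NATIONWIDE MUT FI INS CO", 340,
                  r"[0-9]{4}[^a-z]{2}[0-9]{6}", 12, 12),
    PatternRecord("61074", "NATIONWIDE MUT FI INS CO", 486,
                  r"[0-9]{4}[^a-z]{2}[0-9]{6}", 12, 12),
]


def predictions_for(policy_number, records=RECORDS):
    """Return matching records, highest score first (empty if none match)."""
    hits = [
        r for r in records
        if r.min_length <= len(policy_number) <= r.max_length
        and re.fullmatch(r.regex, policy_number)
    ]
    return sorted(hits, key=lambda r: r.score, reverse=True)


# What the operator's pop-up list would contain for this transaction.
for record in predictions_for("6132HP200809"):
    print(record.payee_code, record.name, record.score)
```

An empty result means no icon is shown, so the operator's screen is unchanged for transactions the patterns cannot help with.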
This strategy might or might not be feasible or desirable in any given work
environment. The decision and strategy to deploy the prototype of a policy
component requires a thorough review. The strategy is available for review,
correction, emendation, and the like. A completely different strategy might
require a different set of statistical results, which should not be
difficult to produce. In any event, concrete “policy” is now available for
decision support.
MAINTENANCE
A final aspect of deployment involves
maintenance. In this example, payee codes are quite volatile; the mortgage
bankers are liable to change, add, and delete them frequently. When new
mortgage bankers are added to the system, this subsystem needs to be
maintained as well. The prototype implements its scoring by loading a
database table, with these columns:
- a unique identifier for the mortgage banker,
- the payee code,
- the pattern as a regular expression,
- a minimum length,
- a maximum length,
- the resultant score,
- some statistical numbers for internal use, generated by the software.
Maintenance is therefore quite simple to effect physically. The ‘logical’
maintenance is somewhat more difficult. That requires the statistical
routines to be applied. The prototype has these in place; so running them
on a schedule or as the situation demands is simple enough to effect, but
does require some time, planning, budgeting, and decisions.
The implemented system, however, need never require coding changes to keep
pace with environmental changes.
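To illustrate why maintenance can be a matter of data rather than code, the sketch below loads the pattern table from CSV text (a stand-in for the database table) and scores against whatever rows are present. The column names, the banker identifiers BANK_A and BANK_B, and the nine-digit pattern in the added row are all hypothetical.

```python
import csv
import io
import re

# Stand-in for the database table; banker identifiers are hypothetical.
TABLE_CSV = """banker_id,payee_code,regex,min_length,max_length,score
BANK_A,61074,[0-9]{4}[^a-z]{2}[0-9]{6},12,12,486
BANK_A,EDINA,[0-9]{4}[^a-z]{2}[0-9]{6},12,12,340
"""


def load_patterns(text):
    """Parse the table into rows with compiled patterns and integer fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "banker_id": row["banker_id"],
            "payee_code": row["payee_code"],
            "regex": re.compile(row["regex"]),
            "min_length": int(row["min_length"]),
            "max_length": int(row["max_length"]),
            "score": int(row["score"]),
        })
    return rows


def score(policy_number, banker_id, rows):
    """Return (payee_code, score) pairs for one banker, best score first."""
    hits = [
        (r["payee_code"], r["score"]) for r in rows
        if r["banker_id"] == banker_id
        and r["min_length"] <= len(policy_number) <= r["max_length"]
        and r["regex"].fullmatch(policy_number)
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


rows = load_patterns(TABLE_CSV)
print(score("6132HP200809", "BANK_A", rows))

# Adding a banker is a data change only: append a row and reload.
updated = TABLE_CSV + "BANK_B,60923,[0-9]{9},9,9,413\n"
print(score("921228387", "BANK_B", load_patterns(updated)))
```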
POLICY COMPONENTS IN SUMMARY
This example was
selected for its simplicity. It starkly profiles a policy component; a
policy component has the following features:
- Implements policy as predictive modeling at any place in the enterprise from point of sale to boardroom;
- Results from specific analysis, often through data mining and statistical modeling;
- Employs a RAD prototyping methodology;
- Has a predictable ROI after prototype and prior to full implementation;
- Is capable of nearly any statistical functionality;
- Is “pluggable,” “embeddable” and extremely efficient.
Although this example was selected primarily for its simplicity, policy
components can support far greater complexity. For example, a mortgage
banker implemented a policy component, which was actually deployed more as a
full subsystem, to evaluate its entire portfolio for the likelihood of each
of its loans becoming 30, 60, or 90 days delinquent. The statistical
analysis for this component employed logistic and linear regression on eight
variables, cluster analysis, and the transformation of the entire model into
fuzzy sets. The scoring (which included reading and writing text files and
the “persisting” of objects) averaged 20 milliseconds per loan. The
prototyping phase for this project required six weeks for two people. The
implementation employed a team of several developers for nearly four months,
but the predicted ROI of over $1,000,000 per annum was easily realized, and
(more importantly) was very close to the prediction.
ABOUT THE AUTHOR
Terry Hipolito has several years’ experience
with software development and architecture, statistical modeling, databases,
and project management; his education includes a Ph.D. from UCLA. Terry is
now an independent consultant who specializes in the design, development,
and deployment of “policy components.” He is writing a book on this
subject, complete with methodology, statistical theory and full examples. A
subset of this content will soon be available on
www.policybots.com. Reach Terry via
tahipolito@earthlink.net or fax (714)
993-3218.
All Rights Reserved by Terry
Hipolito.
Copyright ©
2006
3.
ANNOUNCEMENT
The Data Warehousing Institute interviews
TMA's president for a TDWI Radio News segment entitled
"Delving Deeper into Data Mining"
ABOUT THE INTERVIEW
The Data Warehousing Institute's web editor,
Eric Kavanagh, interviewed The Modeling Agency's president,
Eric King, to gain
insights on data mining definitions, misconceptions, trends, best
practices, strategy, process and applications.
TDWI Radio News delivers close-up interviews with industry
professionals in the growing field of information management. Listen
as practitioners give their elevator pitch, then answer questions
designed to elicit brass-tacks examples.
DATA MINING INTERVIEW TOPICS INCLUDE
- How companies use data mining for competitive advantage
- Common misconceptions about data mining
- Professional tips for beginning a data mining initiative
- Establishing a successful data mining project plan
Produced
with permission from
The Data Warehousing Institute.
Copyright ©
2006
4.
ANNOUNCEMENT
TDWI World Conference
The Premier Event for Business Intelligence
and Data Warehousing Education
August 20 - 25, 2006
Manchester Grand Hyatt
San Diego, California
CONFERENCE HIGHLIGHTS
The TDWI World Conference in San Diego brings together leading
industry visionaries to deliver a unique program of cutting-edge
education, best practices, one-on-one consulting, peer
networking, business intelligence certification, and product
demos. From business intelligence fundamentals to business
analytics, TDWI’s program of more than 50 full-day, half-day,
and night school courses offers something for your entire team.
At the TDWI World Conference, The Modeling Agency's
Tony Rathburn will present a full-day seminar on "Predictive
Analytics" on Wednesday, and Dean Abbott will present "Data
Mining" on Thursday.
- Bringing Business and IT Together
- Measuring the Value of Information
- Data Mining and Predictive Analytics
- Getting BI Requirements Right
- BI and Governance
- Open Source Adoption
Produced
with permission from
The Data Warehousing Institute.
Copyright ©
2006
5.
NEWSLETTER
SUMMARY
The Modeling Agency newsletter is a quarterly publication which
provides course announcements, training schedule updates and informative
articles. This newsletter may be shared in its entirety and subscriptions
are free. For additional information on TMA's training,
consulting services and solutions, follow the corresponding links at
the top of this page. This newsletter
is shared with those who have activated a subscription, or have
supplied their email address to The Modeling Agency when requesting
product information. If you wish not to receive future releases,
simply send an empty email
with "cancel" as the subject from the account with which you subscribed.