Five elements of effective data mining

There are five key considerations for the selection of a statistical analysis and data mining tool.
By Charl Barnard, GM of business intelligence at Knowledge Integration Dynamics
Johannesburg, 18 Aug 2004

Statistical analysis and data mining are targeted at information analysts, people who regularly perform correlation analysis, trend analysis and projections.

This advanced style of business intelligence (BI) is achieved by applying mathematical, financial and statistical functions against company data. The business insight derived is critical for every organisation. However, specialised data mining tools are difficult for anyone other than technically trained statisticians to use.

The requirement is for BI technology that is designed specifically to deliver much of the common functionality of data mining tools - and to deliver it in a way that is familiar and consistent with everyday BI usage.

Here are five key considerations for the selection of a statistical analysis and data mining tool:

1. Applying statistics and data mining against the entire database: With most BI software, business users depend on an administrator or a developer to apply the analytical function or model against their specific set of data. Users then frequently have to ask to have new functions applied to further clarify and analyse data. This process is slow and inefficient for both users and administrators.

The quality of the insights gained from a statistical analysis or data mining application is directly related to the quality and completeness of the underlying data. If the application cannot access all of the data needed to build a model, then even the most complex mathematical model will give misleading or incorrect results. The problem is compounded by database sizes that have grown exponentially over the last decade, burying valuable information in an ever-growing mass of data.

The ability to apply mathematical, OLAP, financial and statistical functions against the entire volume of data collected in the enterprise data warehouse is critical. Due to their inherent data limitations, cube-based BI architectures are incapable of providing a comprehensive picture of the inter-relationships of data across the organisation.

If users can perform analysis ranging from the simple to the advanced, they can better understand how customers spend their money by answering questions such as "What is the median spending of customers in each region?", "What is the standard deviation of all customers' spending in each region?" or "What is our market share growth this year by store, for those stores where actual sales exceeded target sales by more than 15%?"
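As a rough illustration of the payoff, the three questions above become one-line aggregations once the full customer-level data is reachable. The pandas sketch below uses made-up tables and column names; it stands in for whatever the warehouse actually holds.

```python
import pandas as pd

# Hypothetical customer-level extract from the data warehouse.
sales = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "region":      ["north", "north", "north", "south", "south", "south"],
    "spend":       [120.0, 80.0, 300.0, 40.0, 55.0, 500.0],
})

# "What is the median spending of customers in each region?"
print(sales.groupby("region")["spend"].median())

# "What is the standard deviation of all customers' spending in each region?"
print(sales.groupby("region")["spend"].std())

# Hypothetical store-level figures for the market-share question.
stores = pd.DataFrame({
    "store":        ["A", "B", "C"],
    "actual":       [240.0, 118.0, 90.0],
    "target":       [200.0, 100.0, 95.0],
    "share_growth": [0.04, 0.02, -0.01],
})

# "...for those stores where actual sales exceeded target sales by more than 15%."
print(stores.loc[stores["actual"] > 1.15 * stores["target"], ["store", "share_growth"]])
```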

2. Plug-and-play architecture for custom analytical functions: Many organisations need specific calculations because of the unique and proprietary character of their business model. Whether it's a proprietary measure of productivity, a correlation formula relating sales to promotions, heuristic price elasticity coefficients or a predictor of fraud probability, every company has business calculations that are not simply standard mathematical functions. Interestingly, these also tend to be among the most critical business questions that a BI system should be able to address. This is where custom analytic functions become very important.

An open and extensible architecture allows organisations to create their own custom analytic functions and embed them in any BI report or analysis, making it simple for people to create new analytics for the organisation or their workgroups. Users do not need to learn multiple tools to achieve their analysis goals.
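As a sketch of what "plug-and-play" might mean in practice, the Python below registers a custom price-elasticity calculation in a function registry that a report engine could resolve by name at run time. The registry and function names are illustrative, not any particular vendor's API.

```python
from typing import Callable, Dict, List, Sequence

# Illustrative registry: the BI engine would look analytic functions up here.
ANALYTIC_FUNCTIONS: Dict[str, Callable] = {}

def register_function(name: str) -> Callable:
    """Decorator that makes a custom calculation available to any report."""
    def wrap(fn: Callable) -> Callable:
        ANALYTIC_FUNCTIONS[name] = fn
        return fn
    return wrap

@register_function("price_elasticity")
def price_elasticity(pct_change_qty: Sequence[float],
                     pct_change_price: Sequence[float]) -> List[float]:
    # Elasticity = % change in quantity demanded / % change in price.
    return [q / p for q, p in zip(pct_change_qty, pct_change_price)]

# A report resolves the function by name, as it would any built-in metric.
fn = ANALYTIC_FUNCTIONS["price_elasticity"]
print(fn([-0.10, -0.30], [0.05, 0.20]))  # [-2.0, -1.5]
```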

3. Seamless integration with data mining tools: The main purpose of data mining is to discover patterns and algorithms in the existing data that can be used to predict future outcomes. This includes analysis techniques such as regression, segmentation, clustering and forecasting. Typically, whenever users need sophisticated predictive functionality, they must contact a highly skilled developer to assist with their request using a formal data mining tool.

Mathematicians, statisticians and administrators then "discover" the appropriate algorithms by "training" the data mining tool on a subset of the actual data. Unfortunately, a recurring problem with this basic model is that each analysis tends to be a once-off event. That is, the new algorithms are not made generally available to all users through their standard BI reporting or analysis system; instead, the separate data mining process delivers a series of one-time answers to the requester.

An architecture that supports customised analytic functions can easily incorporate the highly specialised data mining algorithms generated by best-of-breed data mining products.

Business users can then access data mining calculations for everyday business analysis, using these models to answer questions like "Which customers are likely to switch to the competition in the next six months?" without needing to learn the complexities of the models.
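The sketch below shows one way this round trip can look, with entirely made-up features: a statistician fits a churn model offline (here, a scikit-learn logistic regression), and the fitted scorer is then exposed as an ordinary analytic function that a report could call without any knowledge of the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Offline "training" on a historical sample (features and labels are invented:
# months of tenure, rate of complaints, and whether the customer churned).
X_train = np.array([[12, 0.2], [3, 0.9], [24, 0.1], [1, 0.8]])
y_train = np.array([0, 1, 0, 1])  # 1 = churned
model = LogisticRegression().fit(X_train, y_train)

def churn_probability(tenure_months: float, complaint_rate: float) -> float:
    """Analytic function a BI report can call like any built-in metric."""
    return float(model.predict_proba([[tenure_months, complaint_rate]])[0, 1])

# "Which customers are likely to switch in the next six months?"
print(churn_probability(2, 0.7))  # a high probability flags an at-risk customer
```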

4. Collaboration technology: A statistical analysis and data mining tool must operate collaboratively with the calculation engines embedded in the relational database management system. Not all databases have the same calculation capabilities: some important analytical functions are simply not supported by certain databases, and others cannot be executed quickly. To overcome this, an analytic engine must automatically compensate for variations in the calculation capabilities of the different database systems.

By collaborating with the database in an intelligent, proprietary multi-step interaction, the analytic engine can answer queries that the database itself cannot, such as user-defined groupings, custom groups, consolidations and other analytical functions the database does not support.

Automatic collaboration can overcome database limitations, easily constructing the analysis at the moment it is requested, with whatever variations are desired. An inquiry such as "For any car dealers that belong in the top three deciles in sales any month in the last year, return the contact information for those who fell out of the top three deciles last month" can be answered automatically.
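The sketch below illustrates the compensation idea with SQLite, which happens to lack a built-in MEDIAN function: the engine first tries to push the calculation down to the database and, when that fails, fetches the raw values and calculates the answer itself.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, spend REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100), ("north", 250), ("north", 90), ("south", 40)])

def median_spend(region: str) -> float:
    try:
        # First attempt: push the function down to the database. This succeeds
        # on engines that support MEDIAN, but SQLite raises an error here.
        row = conn.execute("SELECT MEDIAN(spend) FROM sales WHERE region = ?",
                           (region,)).fetchone()
        return row[0]
    except sqlite3.OperationalError:
        # Compensate: fetch the raw values and calculate in the analytic engine.
        rows = conn.execute("SELECT spend FROM sales WHERE region = ?",
                            (region,)).fetchall()
        return statistics.median(v for (v,) in rows)

print(median_spend("north"))  # 100
```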

5. Multi-pass SQL: Many normal business questions - those that any business user would likely ask if not constrained by the limits of a tool - simply cannot be answered with single-pass SQL.

Most people would agree that sophisticated questions, such as dynamic pricing decisions or logistics optimisation problems, imply complicated kinds of analysis requiring multi-pass SQL. What surprises many, however, is how many business questions are deceptively easy to ask, yet difficult or impossible to answer without multi-pass SQL.

Take, for example, ranking and contributions. There are two ways to answer these simple-to-ask queries: either by bringing all the data back to a cube database in an offline data aggregation process, or by generating multi-pass SQL that returns the results immediately to the user. The former method can have severe consequences in terms of query latency and network traffic; the latter is optimal. Generating multi-pass SQL breaks the user's query down into a number of simple queries that are processed separately by the database, and then automatically groups the results of the separate queries. This multi-step processing can even be performed in parallel, to the extent that the database is configured to optimise and parallelise the queries as they arrive.

Answers to a broad range of seemingly innocent queries can only be achieved with multi-pass SQL, where each pass represents a sub-query necessary to calculate the final result.
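As a small, made-up illustration of the mechanics, the contribution question "what share of total sales does each region represent?" decomposes into two single-pass queries whose results the engine then combines itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 300), ("north", 200), ("south", 500)])

# Pass 1: a simple aggregate per region.
per_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()

# Pass 2: another simple aggregate for the grand total.
grand_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# The engine groups the two result sets into the final answer.
for region, total in per_region:
    print(f"{region}: {total / grand_total:.0%} of total sales")
```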
