SPL adds value to warehousing and data mining

Johannesburg, 25 Mar 1998

If your data warehouse project involves data about customers, you must be able to cluster the diverse data from multiple sources as it enters the data warehouse. You must also be able to search the data warehouse continuously using customers' names and addresses. Furthermore, these activities must succeed despite the great variation and error that exist in this multi-sourced data.

In simple terms, building a data warehouse means extracting existing data from two or more existing systems and merging it into a new physical database. Data mining is about discovering new things from large volumes of data with specialised tools. If the data to be mined is stored in a data warehouse, the data mining process is likely to be easier than if it were not. And the more intelligent the warehousing process is at stabilising and merging information, the easier the job of data mining becomes.

Of course, data mining needs to access and process all types of data, but if your data mining application requires that data be grouped or processed by identities, then SSA-Name3 and the SSA Clustering Engine are very relevant. This is because once you try to combine name and address data from two or more systems, problems are likely to arise from the systems differing in one or more of the following: name and address formats; validation rules; quality of data; and completeness of data. SSA specialises in solving these problems.

Putting intelligence into the warehousing effort means solving the name and address matching and merging problem before the data is loaded into the warehouse. The traditional approach is to wheel in one of the many formatting and scrubbing tools, which must manipulate both the format and the data content before matching and merging the data into the warehouse. The intention is to achieve clean data, with each name and address component stored in its correct place and duplicate records removed.
This approach obviously has some merit, because it is being used in the marketplace and will generally give better results than an in-house developed approach. The problem is that the level of destruction this type of manipulation leaves behind, and the number of matches it misses, are not immediately apparent. Nothing you do to format and clean name and address data will remove the error and validation problem. Cleaning can reduce the problem (although this does not solve the matching problem), but more seriously it can introduce error.

The SSA Clustering Engine, a new product from Search Software America, is a non-destructive, non-sequence-dependent and highly intelligent method of grouping records from different systems before they are merged into a data warehouse. It handles multiple input formats but does not need to scrub the data, and it uses the SSA-Name3 and SSA Extensions software for its matching intelligence.

Whether or not identity records are matched and merged before they are warehoused, data mining can benefit greatly from the use of SSA-Name3 keys built on the warehoused data. Real-time searching and matching, in which the quality of the search criteria may vary greatly, can then make use of SSA-Name3 Search Strategies and Scoring to significantly improve the quality of the results and the performance of the application. And if no data warehouse exists, a data mining application can make direct use of the SSA Clustering Engine to discover the groupings and links in the diverse identity data. If data about an individual or company comes from multiple sources, and you cannot successfully group the data together, then you cannot mine it.

SPL's Enterprise Systems Division supplies and supports Search Software America's software throughout Africa. SSA-Name software solves quality and performance problems for on-line and batch name search and matching applications.
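
To make the idea of non-destructive grouping more concrete, the following is a minimal sketch only. It does not show SSA-Name3 or the SSA Clustering Engine, whose algorithms are not described here. It assumes hypothetical records from two made-up source systems ("billing" and "crm"), uses Python's standard difflib module as a stand-in for the matching intelligence, and picks an arbitrary similarity threshold of 0.85. The point it illustrates is that a comparison key can be derived for matching while the stored name is left exactly as it arrived.

```python
import re
from difflib import SequenceMatcher


def match_key(name: str) -> str:
    """Derive a comparison key that ignores case, punctuation and word order.

    The key is used only for comparison; the original name is never changed.
    """
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Return True when the two names' keys are close enough (assumed threshold)."""
    return SequenceMatcher(None, match_key(a), match_key(b)).ratio() >= threshold


def cluster(records, threshold: float = 0.85):
    """Greedy grouping: a record joins the first group that already holds a
    similar name, otherwise it starts a new group."""
    groups = []
    for rec in records:
        for group in groups:
            if any(similar(rec["name"], other["name"], threshold) for other in group):
                group.append(rec)
                break
        else:
            groups.append([rec])
    return groups


# Hypothetical records from two source systems with different name formats.
records = [
    {"source": "billing", "name": "SMITH, JOHN"},
    {"source": "crm", "name": "John Smith"},
    {"source": "crm", "name": "Jon Smith"},
    {"source": "billing", "name": "NAIDOO, PRIYA"},
]

groups = cluster(records)
for group in groups:
    print([f'{r["source"]}: {r["name"]}' for r in group])
```

In this sketch the three Smith records fall into one group even though their formats and spellings differ, and every record keeps its original name. A production matching engine would need far more than string similarity, for example handling of nicknames, titles, initials and address data.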
