At the SAS Global Forum currently being held in Seattle, Washington, local data visualisation expert David Logan presented a paper to delegates on how he has managed to improve ETL performance by up to 92% for clients using intelligent, dynamic and parallel capabilities.
Logan, who is a principal consultant with the PBT Group, has been working with business intelligence solutions for more than 17 years in a number of countries and across a variety of sectors, including retail, telecommunications, banking and insurance.
He is currently regarded as an expert in the field of data visualisation and process optimisation and has extensive experience working with databases with a billion+ rows of data.
His paper, entitled “Blistering ETL performance using the Intelligent, Dynamic and Parallel Capabilities of SAS”, was presented today to a packed audience at this year's SAS Global Forum.
“One of the biggest criticisms of getting to your data in order to perform advanced analytical tasks has always been speed,” states Logan. “So I began applying parallel processing, which is not in itself a new concept, to the ETL, or Extract, Transform and Load, process in data extraction, and the results have been phenomenal.”
During his presentation, Logan asked delegates to imagine building a house with one person, a job that takes 100 days; in theory, putting 10 people on the project should reduce it to 10 days. While seemingly simple, he pointed out that this would not be the case unless the tasks were intelligently split among all resources, as some tasks take longer than others and tasks have interdependencies. The ideal result would be for all resources to work 100% of the time and finish their tasks as simultaneously as possible; only then would the house be built in 10 days.
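The point of the analogy can be sketched in a few lines of Python (an illustration, not code from the paper): with parallel workers, total wall time is governed by the slowest worker, so an uneven split erodes the speedup even when the total amount of work is unchanged.

```python
def wall_time(chunks):
    """Each chunk is a list of task durations handed to one worker;
    the job finishes only when the most-loaded worker does."""
    return max(sum(chunk) for chunk in chunks)

tasks = [5, 5, 5, 5, 20, 20, 20, 20]  # hypothetical task durations (days)

balanced = [tasks[i::4] for i in range(4)]                 # each worker gets 5 + 20 = 25
lopsided = [tasks[:5], tasks[5:6], tasks[6:7], tasks[7:]]  # one overloaded worker

print(wall_time([tasks]))   # one worker does everything: 100 days
print(wall_time(balanced))  # intelligent split: 25 days
print(wall_time(lopsided))  # poor split: 40 days, far from the ideal 25
```

The total work is identical in all three cases; only the split changes, which is why the workload assessment matters as much as the parallelism itself.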
“The concept is not that different when trying to extract data from large data stores in order to provide a business with true intelligence. So what would happen if you then took this concept and applied it to very large ETL problems? We recently tried this at one of the largest mobile operators in South Africa - we needed to extract 200 million+ rows from an 18 billion+ row database in the shortest period of time,” he states.
“Our goal was to gather intelligence, dynamically spread the load evenly among a number of parallel processes, and then run parallel processes to accomplish the task so that they all finished around the same time - in short, make sure the builders were on the same page and the house was built in the given period of time.”
To achieve this, Logan told the audience his team ran 64 parallel processes on the source database to gather the minimum intelligence needed to help the business make intelligent decisions from its data (these were lightweight processes that ran extremely fast). Based on the results they then did a workload assessment and split the work across <n> parallel jobs, and finally they ran the processes in parallel across the data.
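The paper's implementation used SAS, but the three steps it describes can be sketched in Python (a hypothetical illustration; the partition sizes, helper names and job count here are invented): lightweight parallel queries size each partition, a greedy workload assessment assigns partitions so all jobs carry a similar load, and the extraction jobs then run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def partition_size(partition):
    # Stand-in for a lightweight COUNT(*)-style "intelligence" query.
    return len(partition)

def balance(sizes, n_jobs):
    """Greedy workload assessment: hand each partition (largest first)
    to the currently lightest job, so jobs finish at about the same time."""
    jobs = [[] for _ in range(n_jobs)]
    loads = [0] * n_jobs
    for idx in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        j = loads.index(min(loads))
        jobs[j].append(idx)
        loads[j] += sizes[idx]
    return jobs, loads

# Toy "database": 8 partitions of very different sizes.
partitions = [list(range(s)) for s in (90, 10, 40, 40, 5, 70, 25, 20)]

# Step 1: gather intelligence in parallel (64 processes in the paper; 8 here).
with ThreadPoolExecutor(max_workers=8) as pool:
    sizes = list(pool.map(partition_size, partitions))

# Step 2: workload assessment, splitting the work across n parallel jobs.
jobs, loads = balance(sizes, n_jobs=4)

def extract(partition_indices):
    # Stand-in for the real extract step run by one job.
    return sum(len(partitions[i]) for i in partition_indices)

# Step 3: run the extraction jobs in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    extracted = list(pool.map(extract, jobs))

print(loads)  # per-job workloads are near-even despite uneven partitions
```

Because step 2 reruns on fresh statistics each time, the same code naturally redistributes work as the data changes, which is the self-tuning behaviour the paper describes.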
“The process reduced the running time of the ETL task from 12 hours, for a sequential solution, to one hour, a staggering 92% improvement. In addition, the ETL process now effectively 'tunes' itself automatically by spreading the workload differently depending on the intelligence gathered.
“An added benefit has been a dramatic improvement in the quality of the data gathered. From a business perspective, the client had up-to-date information to work with from the first hour of the process, which has considerably improved productivity. And when applied to the billing database, which carries a higher workload, processes completed in a much shorter period of time, resulting in less contention and less room for error,” adds Logan.
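The quoted figure checks out as a straightforward reduction in running time (illustration only):

```python
# 12 hours sequential down to 1 hour in parallel.
improvement = (12 - 1) / 12
print(round(improvement * 100))  # prints 92, i.e. roughly a 12x speedup
```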
According to Logan, the process has been repeated a number of times to ensure it works across a variety of databases, with a number of technology tools, and the results have all been the same, namely: shorter times to intelligence, improvements in productivity, better data quality, and less room for error.
In conclusion, he asked the audience: “If you're not being intelligent, dynamic and parallel in your problem-solving, then the question is, why not?”
SAS
SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market. Through innovative solutions delivered within an integrated framework, SAS helps customers at more than 45 000 sites improve performance and deliver value by making better decisions faster. Since 1976, SAS has been giving customers around the world 'the power to know'. www.sas.com and www.sas.com/sa
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. (R) indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright (c) 2009 SAS Institute Inc. All rights reserved.