One of the most common technology mergers in current IT infrastructures is that of the Web server and the database, allowing true dynamic content to be delivered to the Internet.
However, many companies are taking this natural symbiosis one step further and integrating their data warehouses - once the monsters lurking alone in the dark and quiet corners of the IT department - with the Net. The resultant architecture, according to Ralph Kimball and Richard Merz, is termed the data Webhouse, hence the title of their new book, "The Data Webhouse Toolkit: Building the Web-Enabled Data Warehouse".
Data Webhousing serves two purposes: capturing clickstream information from the Internet, and delivering warehouse data to the Internet. The book deals with both in great detail, arguing that most enterprises will want to pursue both strategies.
Mining for Web data
Mining clickstream information involves capturing every user's surfing behaviour while they are on a Web site, analysing it in real time, and using that information to deliver dynamic content tailored to that user's surfing and purchasing habits. Other advantages gleaned from mining the data include the ability to streamline your site, target exit points, and promote up-selling and cross-selling.
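To make the idea concrete, here is a minimal sketch of the first step of clickstream capture: parsing Web server log lines and grouping each visitor's hits into sessions. It assumes the Apache "combined" log format and the common (but non-standard) 30-minute inactivity heuristic; the function names are illustrative and not taken from the book.

```python
import re
from datetime import datetime, timedelta

# Assumed: Apache "combined" log format; fields we don't need are skipped.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

SESSION_TIMEOUT = timedelta(minutes=30)  # common heuristic, not a standard

def parse_hit(line):
    """Extract host, timestamp, and requested path from one log line."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group("time").split()[0], "%d/%b/%Y:%H:%M:%S")
    return {"host": m.group("host"), "time": ts, "path": m.group("path")}

def sessionize(hits):
    """Group hits by host; start a new session after a 30-minute gap."""
    sessions = {}
    for hit in sorted(hits, key=lambda h: (h["host"], h["time"])):
        runs = sessions.setdefault(hit["host"], [])
        if runs and hit["time"] - runs[-1][-1]["time"] <= SESSION_TIMEOUT:
            runs[-1].append(hit)
        else:
            runs.append([hit])
    return sessions
```

In a real Webhouse these sessions would be loaded into a clickstream fact table for the kind of exit-point and cross-selling analysis the book describes.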
Capturing such a huge amount of data and analysing it online, while still responding to a user's requests within the very short time periods allotted by the browsing habits of the Internet, poses significant challenges in the design of such systems.
The book deals with the obstacles that such a system will have to overcome, and proposes a standard architecture that will allow you to deliver on what at first seems an impossible task.
Real-world issues
It doesn't restrict itself to a high-level overview, but also dives into real-world technology issues with a fair amount of granularity, discussing the difference that memory will make to a system, or how to resolve host names with the least amount of server overhead. It also offers advice on how to identify users despite the limitations of the HTTP protocol, and how to cater for their needs individually once they have been identified.
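Since HTTP itself is stateless, the usual workaround is a persistent cookie carrying an opaque visitor ID. The sketch below illustrates that approach under stated assumptions: the cookie name, function names, and the host-plus-User-Agent fallback fingerprint are all illustrative choices, not the book's specific method.

```python
import hashlib
import uuid

# Assumed cookie name; any opaque, non-identifying token would do.
VISITOR_COOKIE = "visitor_id"

def identify_visitor(cookies, host, user_agent):
    """Return (visitor_id, set_cookie_header_or_None) for one request."""
    vid = cookies.get(VISITOR_COOKIE)
    if vid:
        return vid, None  # returning visitor: reuse the stored ID
    # First visit (or cookie not yet set): mint a new ID and ask the
    # browser to keep it for a year. A weak fingerprint of host and
    # User-Agent is appended so hits can be loosely tied together
    # even before the cookie comes back.
    fingerprint = hashlib.sha1(f"{host}|{user_agent}".encode()).hexdigest()[:12]
    vid = f"{uuid.uuid4().hex}-{fingerprint}"
    header = f"Set-Cookie: {VISITOR_COOKIE}={vid}; Max-Age=31536000; Path=/"
    return vid, header
```

Once a stable visitor ID exists, the site can personalise content for that user, which is exactly the kind of individual catering the book discusses.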
Moving a data warehouse to the Web presents just as many complexities, including security issues, query execution times, and even how to design a site for maximum readability. The book also takes the reader through the process of basic data warehouse design, although the authors do assume a fair amount of knowledge regarding the structure and implementation of a traditional data warehouse.
Future technologies are also discussed, especially XML, which the authors believe will add an entirely new dimension to understanding how an end-user navigates through a site. Related technologies are not ignored either, and a great deal of discussion revolves around different data sources, cache engines, the HTTP protocol, and even browser limitations.
Wrapping up
This book is a must-read for all designers, project managers, and even developers who are interested in or involved in moving the data warehouse to the Web. High-level management may find the book a little too granular, but it does foster great ideas on how to use technology to get the most from any online venture, as well as arming you with a good arsenal of arguments as to why an existing data warehouse should be readied for the Web.