Debunking data virtualisation

By Julian Thomas, Principal consultant at PBT Group

Johannesburg, 07 Dec 2018

I have been thinking a lot about data virtualisation recently. The concept has been around for a while now, but it has taken some time to gain significant uptake in the South African market.

I believe this is changing, however, as more and more companies invest in this capability. Since it is becoming more widespread locally, I thought it was time to put my own thoughts on the topic in order and, in the spirit of "waste not, want not", share my views.

The first thing I have noticed when engaging in data virtualisation is a distinct difference between what was promised and the actual onsite experience.

The promise of data virtualisation

We can join any data together, at any time and on any platform. We can do this in such a way that we completely remove the need to store all of an organisation's data redundantly, in central or federated data stores.

In doing this, we will experience numerous performance improvements, such that we will never have to worry about poorly performing data queries or transformation and load routines.

With data virtualisation, we will be able to drastically scale down the number of people we require to load and manage the data, and we will ultimately be able to decommission all of our existing analytic data repositories.

Does this 'promise' sound familiar?

The reality is a bit different

I have discovered that the reality has sadly fallen far short of what was promised. A few key concepts need to be clearly understood, to put the above promises into perspective. In effect, there is a pinch of salt required for this dish, tasty as it may be.

Two of the most expensive things in the world, performance wise, are network traffic and data input/output access. The same access mechanisms, such as Open Database Connectivity (ODBC), Java Database Connectivity (JDBC) and native protocols, are still required to access the data and then send it across the network. Data virtualisation is not magic, and it cannot ignore these fundamental laws of physics, allowing you to cross data platform boundaries with no impact or cost.
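To make the point concrete, here is a minimal sketch in Python, assuming two separate in-memory SQLite databases stand in for two data platforms. The table names, sample rows and the `federated_join` helper are all hypothetical and do not represent any particular data virtualisation product. The sketch simply counts how many rows have to leave the sources before a cross-platform join can be answered, with and without pushing the filter down to the source.

```python
import sqlite3


def build_sources():
    """Create two hypothetical source platforms as separate in-memory databases."""
    customers_db = sqlite3.connect(":memory:")
    customers_db.execute("CREATE TABLE customer (id INTEGER, region TEXT)")
    customers_db.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(1, "GP"), (2, "WC"), (3, "GP"), (4, "KZN")],
    )
    orders_db = sqlite3.connect(":memory:")
    orders_db.execute("CREATE TABLE orders (customer_id INTEGER, amount INTEGER)")
    orders_db.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 100), (2, 50), (3, 25), (4, 10), (1, 5)],
    )
    return customers_db, orders_db


def federated_join(customers_db, orders_db, region, push_down):
    """Total order value for one region; returns (total, rows_moved)."""
    if push_down:
        # The filter runs on the source, so fewer rows cross the boundary.
        customers = customers_db.execute(
            "SELECT id FROM customer WHERE region = ?", (region,)
        ).fetchall()
        wanted = {cid for (cid,) in customers}
    else:
        # Every customer row is pulled across and filtered in the virtual layer.
        customers = customers_db.execute(
            "SELECT id, region FROM customer"
        ).fetchall()
        wanted = {cid for cid, reg in customers if reg == region}
    # The orders live on another platform, so they must be fetched either way.
    orders = orders_db.execute(
        "SELECT customer_id, amount FROM orders"
    ).fetchall()
    total = sum(amount for cid, amount in orders if cid in wanted)
    return total, len(customers) + len(orders)


if __name__ == "__main__":
    customers_db, orders_db = build_sources()
    print(federated_join(customers_db, orders_db, "GP", push_down=False))
    print(federated_join(customers_db, orders_db, "GP", push_down=True))
```

Both calls return the same total, but the rows still have to be read and moved in each case. Pushing the filter down reduces the traffic; it does not remove it.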

Tied to this is the fact that data virtualisation accesses the underlying data platforms directly, and in many cases these are live, production operational systems. The impact on those systems is on the same scale as that of the traditional data warehouse and extract, transform and load (ETL) options.

In addition to data extraction, legacy business intelligence (BI) solutions incorporate a massive amount of ETL processing to transform the data and create new data. Data virtualisation does not make this go away. A common mistake is underestimating the resources required for this. Sadly, when this happens, clients quickly discover it is not as simple as upgrading the underlying hardware. At this point, the dreaded licensing impact is finally understood. Upgrading inevitably increases licensing costs, which were not catered for in the budget.

Lastly, data virtualisation is incapable of showing what is not there. In other words, once the underlying data has changed, the previous view of the data is gone, and cannot be retrieved. A critical use case of a data warehouse is the ability to produce a 100% accurate view of the data at a point in time. This is often a deal breaker, as many use cases, such as data analytics, voice of customer, financial, legislative and compliance reporting, are heavily dependent on this point-in-time view of the data.
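A small sketch illustrates the difference, assuming an in-memory SQLite database in which a plain SQL view plays the role of the virtual layer and a dated copy plays the role of the warehouse. The table names, the view name and the date are hypothetical.

```python
import sqlite3


def demo_point_in_time():
    """Return (live_value, historical_value) after the source has changed."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Hypothetical operational source table.
    cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, segment TEXT)")
    cur.execute("INSERT INTO customer VALUES (1, 'bronze')")
    # Virtual layer: a view that always reads the live source.
    cur.execute("CREATE VIEW v_customer AS SELECT id, segment FROM customer")
    # Warehouse-style layer: a persisted copy stamped with an as-of date.
    cur.execute(
        "CREATE TABLE customer_history (as_of TEXT, id INTEGER, segment TEXT)"
    )
    cur.execute(
        "INSERT INTO customer_history "
        "SELECT '2018-11-30', id, segment FROM customer"
    )
    # The operational system overwrites the value in place.
    cur.execute("UPDATE customer SET segment = 'gold' WHERE id = 1")
    live = cur.execute(
        "SELECT segment FROM v_customer WHERE id = 1"
    ).fetchone()[0]
    past = cur.execute(
        "SELECT segment FROM customer_history "
        "WHERE id = 1 AND as_of = '2018-11-30'"
    ).fetchone()[0]
    conn.close()
    return live, past


if __name__ == "__main__":
    print(demo_point_in_time())
```

The view can only report the current value. The earlier value survives solely because it was copied and stored before the update.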

To resolve many of these challenges, the implementation teams are left with no option but to, ironically, persist the data. Most, if not all, data virtualisation tools include the capability to persist the models, either to disk or to memory, reducing overhead on the underlying systems and the data virtualisation platform itself.

Is it all doom and gloom?

Absolutely not. To be fair, many of the data virtualisation tools are good products, and they incorporate features that can minimise or even overcome some of these challenges. For example, in many cases the data and associated transformation rules can be cached in memory, reducing the impact on the underlying source systems and the data virtualisation platform.
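The caching idea can be sketched as follows. This is an illustration only, not how any specific tool implements it: the `CachedView` class, its time-to-live parameter and the `fetch` callback are assumptions introduced for the example.

```python
import time


class CachedView:
    """Serve a virtual view from memory, refreshing it only after a time-to-live."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # hypothetical callable that queries the source
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._loaded_at = None

    def read(self):
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            # Only an expired or empty cache touches the source system.
            self._value = self._fetch()
            self._loaded_at = now
        return self._value


if __name__ == "__main__":
    source_hits = {"count": 0}

    def fetch_from_source():
        source_hits["count"] += 1
        return [("customer", 1)]

    view = CachedView(fetch_from_source, ttl_seconds=60)
    for _ in range(5):
        view.read()
    print(source_hits["count"])  # the source is queried once within the TTL
```

The trade-off is freshness: within the time-to-live, readers see the cached copy rather than the live source, which is exactly the kind of persistence the original promise claimed to eliminate.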

What is clear, though, is that if the organisation is not careful, it can very easily travel full circle and end up right where it began.

How do we overcome this?

I believe it is important to set realistic expectations. A realistic assessment of capability and potential challenges needs to be made, and processing and performance requirements, with their associated costs, need to be identified.

Not all use cases are viable for data virtualisation, so reasonable use cases need to be identified. These can include real-time operational views, self-service and analytic reporting, self-service integration, and current performance dashboards. The data virtualisation implementation should focus on these.

Finally, a strategic partnership needs to be established between the operational systems, traditional BI and data warehousing tools and the data virtualisation platform. Like it or not, these traditional systems are here to stay, at least for the medium-term future.
