Six things to remember when preparing big data for advanced analytics and BI

By Prabith Kalathil – Regional Head, Africa, Nihilent; and Prashant Pawar – Head of Data and Cloud CoE, Nihilent

Johannesburg, 19 Feb 2021

“Information is the oil of the 21st century, and analytics is the combustion engine.” – Peter Sondergaard, Former Senior Vice-President, Gartner

The quote above highlights the importance of big data management and analytics. Many organisations have ambitious plans for analytics and machine learning (ML), and investment in big data and analytics increases every year. However, studies indicate that organisations are finding it difficult to realise the full potential of their data. Only about half of organisations think they can use data and analytics for competitive purposes. Even fewer consider themselves data-driven or say they are getting results from their data and AI investments. Fifty-five percent of the data collected by organisations is never used. Why is this happening?

On the other hand, we have come across organisations that have been very successful in using data for their business needs. When we looked at those examples, one thing that stood out was that technology was not the main barrier. Several open source and proprietary platforms are available today. Hadoop led the technology landscape in the last decade, and many other options then emerged in specific areas such as MPP databases, NoSQL databases, data lakes, advanced analytics, stream analytics, BI and ML. We observed that successful big data implementations have a high level of seamless integration between data assets. They are agile enough to respond to changing business needs, and they enable self-service BI and analytics for the end-user. These attributes of seamless integration, agility and self-service BI are achieved by focusing on six key data disciplines.

1: High-speed data acquisition and processing

There are many options for data acquisition, extraction and ingestion into the big data platform. The choice will largely depend on factors such as how frequently you want to capture incoming data, whether the data arrives in batches or in real time, and whether to employ a push or pull model. Irrespective of the tool or approach you choose, the speed of your data platform will ultimately decide whether users adopt it widely.

2: Metadata management and data catalogue

Metadata is information about your data, and it serves three purposes. First, it helps us create accurate and consistent reports: if we know the metadata well, we can check where the data in our reports is coming from. Second, it allows the end-user to find data. This is especially important in a big data platform, where we end up collecting lots of data: thousands of data sets are collected from data sources daily. Without good metadata, it becomes very difficult for a data scientist or data analyst to find the data they need in a large data lake or data warehouse. Third, combined with other elements, metadata helps you track data lineage.
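To make the "find data" and "where does it come from" uses concrete, here is a minimal sketch of a data catalogue in Python. It is an illustration only, not a description of any particular product: the class names, field names and sample data sets are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one data set (all field names are illustrative)."""
    name: str
    source_system: str
    owner: str
    tags: set = field(default_factory=set)


class DataCatalogue:
    """A tiny in-memory catalogue keyed by data set name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return sorted names of data sets whose name or tags contain the keyword."""
        kw = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if kw in e.name.lower() or any(kw in t.lower() for t in e.tags)
        )

    def source_of(self, name):
        """Tell a report author which system a data set came from."""
        return self._entries[name].source_system


# Hypothetical sample entries
catalogue = DataCatalogue()
catalogue.register(DatasetEntry("sales_daily", "pos_system", "retail_ops", {"sales", "revenue"}))
catalogue.register(DatasetEntry("customer_master", "crm", "marketing", {"customer"}))
```

With these entries, `catalogue.search("sales")` finds the sales data set by name or tag, and `catalogue.source_of("sales_daily")` answers the question of where a report's data originated.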

3: Ensuring data quality

We are all aware that if data quality is not maintained, the data platform soon becomes a ‘garbage in, garbage out’ scenario. Maintaining data quality is therefore paramount. There are different views about who is responsible for data quality. We tend to agree with the suggestion that it is the responsibility of the data owner. However, the data platform should have tools available to check the quality of data when it is integrated into the central platform.
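As an illustration of the kind of check a platform could run at integration time, the following Python sketch flags missing required values and duplicate keys in an incoming batch. The function name, the rules and the sample records are assumptions for this example; real platforms typically offer far richer rule sets.

```python
def check_quality(records, required_fields, key_field):
    """Return a list of (row_index, problem) tuples for a batch of dict records."""
    problems = []
    seen_keys = set()
    for i, row in enumerate(records):
        # Completeness: every required field must be present and non-empty
        for name in required_fields:
            if row.get(name) in (None, ""):
                problems.append((i, f"missing {name}"))
        # Uniqueness: the key must not repeat within the batch
        key = row.get(key_field)
        if key is not None and key in seen_keys:
            problems.append((i, f"duplicate {key_field}"))
        seen_keys.add(key)
    return problems


# Hypothetical incoming batch
batch = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": ""},
    {"customer_id": 1, "email": "b@example.com"},
]
issues = check_quality(batch, required_fields=["customer_id", "email"], key_field="customer_id")
```

The list of issues can then be reported back to the data owner, who remains responsible for fixing the data at source.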

4: Master data management (MDM)

The next important point is master data management (MDM). It is about maintaining master lists, whether of products, vendors, suppliers or customers. Organisations usually require standardised master lists to bring consistency to reporting and analytics. These master lists act as a single source of truth and are consistent across the data platform, which helps to create accurate analytics. You can build master lists through (1) matching and merging, (2) data standardisation, and (3) data consolidation.
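The three techniques can be sketched together in a few lines of Python. This is a simplified illustration under stated assumptions: records are matched on an exactly equal standardised name (real MDM tools usually use fuzzy matching), the survivorship rule "first non-empty value wins" is an arbitrary choice, and the supplier records are invented sample data.

```python
def standardise(name):
    """Data standardisation: lower-case, trimmed, single spaces."""
    return " ".join(name.lower().split())


def match_and_merge(records):
    """Match records on standardised name and consolidate them into master records.

    Survivorship rule (an assumption): the first non-empty value per field wins.
    """
    golden = {}
    for rec in records:
        key = standardise(rec["name"])
        master = golden.setdefault(key, {"name": key, "source_ids": []})
        master["source_ids"].append(rec["id"])
        for field_name, value in rec.items():
            if field_name in ("id", "name"):
                continue
            if value and not master.get(field_name):
                master[field_name] = value
    return list(golden.values())


# Hypothetical supplier records from two source systems
suppliers = [
    {"id": "ERP-7", "name": "Acme  Trading", "phone": "", "city": "Johannesburg"},
    {"id": "CRM-3", "name": "acme trading ", "phone": "011-555-0100", "city": ""},
    {"id": "ERP-9", "name": "Bolt Supplies", "phone": "", "city": "Durban"},
]
masters = match_and_merge(suppliers)
```

Here the two "Acme Trading" records collapse into one master record that keeps both source identifiers and combines the city from one system with the phone number from the other.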

5: Data security

The fifth discipline is data security and access control. When we bring data together in a central storage area, data owners demand the highest level of data security, especially when that data is personal or financial. Data owners will not hand over their data without a guarantee that it will be secured. It has to be protected from unauthorised access using measures such as user authentication, access control through role-based access control (RBAC) or item-level security, data encryption and network security. These security measures should be implemented in all components of the big data platform.
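To show the idea behind RBAC, here is a minimal Python sketch in which permissions are attached to roles and users receive permissions only through the roles they hold. The role names, user names and data set names are hypothetical, and a production platform would rely on its own security services rather than hand-written checks like this.

```python
# Role -> set of (data set, action) pairs the role may perform (illustrative values)
ROLE_PERMISSIONS = {
    "analyst": {("sales", "read")},
    "data_engineer": {("sales", "read"), ("sales", "write")},
    "finance_admin": {("finance", "read"), ("finance", "write")},
}

# User -> roles held by that user (illustrative values)
USER_ROLES = {
    "thandi": {"analyst"},
    "sipho": {"data_engineer", "finance_admin"},
}


def is_allowed(user, dataset, action):
    """True if any of the user's roles grants the action on the data set."""
    return any(
        (dataset, action) in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )
```

Unknown users and unknown roles fall through to a denial, which reflects the usual "deny by default" stance.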

6: Data lineage

The last important aspect is data lineage, which is about tracing data back to its origin. To trust the data used for BI and analytics, users will want to know the flow of their data: where it originated, who had access to it, which changes it underwent and when, and where it resided across the organisation’s multiple data systems.
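One simple way to picture lineage is as a log of "this data set was produced from those data sets, by this operation, by this actor, at this time". The Python sketch below records such steps and walks upstream to find where a data set originated. The class, method and data set names are assumptions for illustration, and the sketch assumes the lineage graph has no cycles.

```python
from datetime import datetime, timezone


class LineageLog:
    """Records which inputs each data set was derived from (illustrative sketch)."""

    def __init__(self):
        # output data set -> list of (input, operation, actor, timestamp)
        self._parents = {}

    def record(self, output, inputs, operation, actor):
        stamp = datetime.now(timezone.utc)
        self._parents.setdefault(output, []).extend(
            (src, operation, actor, stamp) for src in inputs
        )

    def origins(self, dataset):
        """Walk upstream and return the root data sets with no recorded inputs."""
        parents = self._parents.get(dataset)
        if not parents:
            return {dataset}
        roots = set()
        for src, _operation, _actor, _stamp in parents:
            roots |= self.origins(src)
        return roots


# Hypothetical pipeline steps
log = LineageLog()
log.record("sales_clean", ["pos_raw"], "deduplicate", "etl_job")
log.record("sales_report", ["sales_clean", "customer_master"], "join", "bi_team")
```

Calling `log.origins("sales_report")` traces the report back through the cleaning step to the raw sources it ultimately depends on.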

These are the six key areas for making your data platform scalable, agile, searchable, secure and traceable.
