Modernising Hadoop Data Platform: Hybrid cloud in financial services

Johannesburg, 30 Sep 2024

Bjorn Olsen, Head of Data Engineering at cloudandthings.io.

Overview

The cloudandthings.io team recently assisted a major banking client in modernising its on-premises Hadoop Data Platform. The team migrated the client's HDFS storage to S3-compatible storage while ensuring minimal disruption to production workloads. Furthermore, cloudandthings.io helped the client achieve its strategic goal of creating a low-touch, automated object replication tool that works across various environments, including on-premises, multicloud and platform-as-a-service (PaaS) solutions.

The problem

The customer’s Hadoop Data Platform was integral to numerous critical workloads across the organisation. However, over time, the hardware had become outdated and constrained. These limitations made it difficult to integrate modern, cloud-based technologies and advanced analytical tools.

A bottleneck was identified in the HDFS (Hadoop Distributed File System), especially due to the limited capacity of the NameNodes and an ageing storage infrastructure. This outdated architecture restricted the customer's ability to take advantage of more scalable, elastic storage solutions such as cloud-based storage.

In addition to the immediate challenge of migrating petabytes of data from HDFS, the customer required a seamless, automated method to manage data replication across different environments – whether on-premises, multicloud or even to/from potential PaaS solutions. The replication process had to be low-touch, automated, near real-time and simple to observe and maintain.

The solution

To address these challenges, cloudandthings.io developed and deployed a two-part solution: a data migration tool and an elastic, serverless replication service.

HDFS to S3-compatible storage migration

“The first step was to develop a robust tool for migrating the customer’s HDFS data to S3-compatible storage. This tool was designed to operate without relying on the Hadoop compute layer, instead leveraging cloud computing to handle the data transfers. The tool was seamlessly integrated into the customer’s customised Hadoop environment to ensure that data was transferred with precision and accuracy,” says Bjorn Olsen, Head of Data Engineering at cloudandthings.io.

During the migration, the tool switched Hive locations iteratively as each transfer was completed. It also included capabilities for recovery, observability and auditing, providing a clear view of what data was copied and to where, ensuring compliance and integrity throughout the process. Additionally, it had restore and revert functionalities in case of any errors.
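The iterative switch-over described above can be sketched roughly as follows. This is an illustration only, not the cloudandthings.io tool: the names migrate_tables, revert, copy_data and run_sql are hypothetical, and the copy and SQL execution steps are passed in as plain callables so the sketch runs without a Hadoop cluster. The only Hive-specific piece is the standard ALTER TABLE ... SET LOCATION statement.

```python
from dataclasses import dataclass


@dataclass
class AuditEntry:
    """One completed switch: which table moved from where to where."""
    table: str
    old_location: str
    new_location: str


def set_location_sql(table, location):
    # Standard HiveQL for repointing a table at a new storage location.
    return f"ALTER TABLE {table} SET LOCATION '{location}'"


def migrate_tables(tables, copy_data, run_sql, audit):
    """Copy each table's data, then switch its Hive location.

    `tables` yields (table, old_location, new_location) tuples.
    `copy_data` and `run_sql` are hypothetical callables supplied by the
    caller; `copy_data` is assumed to raise on failure, so a table is
    only switched once its transfer has completed.
    Completed switches are appended to `audit` as they happen, so the
    record survives a failure partway through.
    """
    for table, old, new in tables:
        copy_data(old, new)
        run_sql(set_location_sql(table, new))
        audit.append(AuditEntry(table, old, new))


def revert(audit, run_sql):
    """Point every switched table back at its original location."""
    for entry in reversed(audit):
        run_sql(set_location_sql(entry.table, entry.old_location))
```

Keeping an audit record per table is what makes the revert step possible. Note that in Hive, existing partitions keep their own locations, so a real tool would also have to repoint partitions individually; that is left out of this sketch.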

By the end of the project, cloudandthings.io had successfully migrated hundreds of millions of files, thousands of Hive tables and decades of historical data. Most importantly, the migration was accomplished with minimal disruption to the majority of consumer and producer workloads.

Automated, elastic replication tool

To support the customer's need for seamless data replication across on-premises, multicloud and PaaS environments, cloudandthings.io developed an automated, elastic, serverless replication tool. This tool was capable of replicating data across over 70 different storage engines, including AWS, Azure, local storage, SMB, HDFS and many more.

For S3 targets, the tool performed 10% to 50% better than AWS's native S3 sync functionality, with the added advantage of checksums for data integrity. Once data landed, the tool could begin replicating it in under a second, enabling near real-time copies that were immediately available for analytics and backup/restore operations.
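To illustrate what checksum-verified replication means in practice, here is a minimal sketch using local files and SHA-256. It is an assumption-laden example rather than the tool's actual method: the article does not say which checksum algorithm is used, and the function names below are hypothetical.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # read and write in 1 MiB chunks


def copy_with_checksum(src_path, dst_path):
    """Copy a file in chunks, hashing the bytes as they are read."""
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while chunk := src.read(CHUNK_SIZE):
            digest.update(chunk)
            dst.write(chunk)
    return digest.hexdigest()


def file_checksum(path):
    """Hash a file that already exists at its destination."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(src_path, dst_path):
    """Copy, then re-read the destination and compare checksums."""
    expected = copy_with_checksum(src_path, dst_path)
    actual = file_checksum(dst_path)
    if expected != actual:
        raise IOError(f"checksum mismatch for {dst_path}")
    return actual
```

Hashing while streaming avoids a second pass over the source; only the destination is re-read to confirm that what landed matches what was sent.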

The tool also enabled advanced features such as "time travel" against S3 versioned buckets. This allowed the customer to perform point-in-time restores, which are otherwise difficult to do reliably.
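The idea behind point-in-time restores on a versioned bucket can be sketched as follows: for each object key, pick the newest version written at or before the chosen moment, and treat a key as absent if that version is a delete marker. This is a simplified illustration, not the tool's implementation. The record fields (key, version_id, last_modified, is_delete_marker) are hypothetical and stand in for a merged listing of object versions and delete markers.

```python
from datetime import datetime, timezone


def snapshot_at(versions, as_of):
    """Return {key: version_id} describing the bucket as it was at `as_of`.

    `versions` is an iterable of dicts with the hypothetical fields
    key, version_id, last_modified (timezone-aware datetime) and
    is_delete_marker (bool).
    """
    latest = {}
    for v in versions:
        if v["last_modified"] > as_of:
            continue  # written after the restore point, so ignore it
        current = latest.get(v["key"])
        if current is None or v["last_modified"] > current["last_modified"]:
            latest[v["key"]] = v
    # A key whose newest entry is a delete marker did not exist at `as_of`.
    return {
        key: v["version_id"]
        for key, v in latest.items()
        if not v["is_delete_marker"]
    }
```

A restore would then copy each selected version to its target, which is why versioning has to be enabled on the bucket before the restore point of interest.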

Running this tool cost a small fraction of what similar solutions (such as AWS DataSync) cost, and it gives the customer a higher degree of flexibility and extensibility in future.

The impact

By migrating the customer’s HDFS data to S3-compatible storage and implementing an advanced data replication tool, cloudandthings.io enabled the customer to overcome the limitations of its HDFS infrastructure and integrate modern cloud technologies.

Key benefits included:

Moving to cloud-based storage eliminated the capacity constraints previously imposed by the ageing Hadoop NameNodes and legacy storage hardware.
The replication tool outperformed traditional AWS tools, reducing replication times, enabling near real-time data availability and maintaining low cost.
With automated replication, time-travel functionality and observability, the customer can now manage replication across a multicloud and on-premises environment with visibility and minimal manual intervention.
Throughout the migration and implementation, critical workloads remained largely unaffected, ensuring minimal disruption to business operations.

Conclusion

This project illustrates how modernising Hadoop infrastructure can open doors to more flexible, scalable and performant data solutions. By migrating data to S3-compatible storage and deploying a low-touch replication tool, cloudandthings.io helped this customer leverage cutting-edge cloud technologies while maintaining operational continuity and reducing the burden of manual data management.

This modernisation effort not only resolved immediate infrastructure challenges but also positioned the customer for future growth by enabling seamless integration with multicloud and hybrid cloud environments.

Contact the cloudandthings.io team to learn more about the company's data, cloud and software offerings: connect@cloudandthings.io.
