SKA, CERN ink massive big data deal

Johannesburg, 17 Jul 2017

SKA and CERN will be faced with the same problem of how to distribute the gigantic volumes of data they produce.

The Square Kilometre Array (SKA) Organisation and CERN, the European Laboratory for Particle Physics, have signed a big data deal to tackle the vast amounts of information the organisations will generate.

Signed last Friday, the agreement formalises the scientific organisations' growing collaboration in the area of extreme-scale computing.

In a joint statement, the organisations say the agreement establishes a framework for collaborative projects that addresses joint challenges in approaching exascale computing and data storage (an exabyte is one billion gigabytes). It comes as the Large Hadron Collider (LHC) is set to generate even more data in the coming decade and the SKA prepares to collect a vast amount of scientific data as well.

World greats

CERN is the world's largest physics lab and the place where the World Wide Web originated, while the SKA project is an international effort to build the world's largest radio telescope, to be hosted in SA and Australia. SA's Karoo desert in the Northern Cape will host the core of the mid-frequency dish array, which will ultimately extend across the African continent.

It will be built in two phases - SKA1 and SKA2 - starting in 2018. SKA1 will include two instruments - SKA1 MID (to be built in SA) and SKA1 LOW (to be built in Australia); they will observe the universe at different frequencies.

Burning topics

Professor Eckhard Elsen, CERN's director of research and computing, says the LHC's computing demands are tackled by the Worldwide LHC Computing Grid, which employs more than half a million computing cores around the globe, interconnected by a powerful network.

"As our demands increase with the planned intensity upgrade of the LHC, we want to expand this concept by using common ideas and infrastructure, into a scientific cloud. SKA will be an ideal partner in this endeavour," he says.

Dr Fabiola Gianotti, CERN's director general, and Professor Philip Diamond, SKA director general, signing the co-operation agreement.

CERN and SKA have identified the acquisition, storage, management, distribution and analysis of scientific data as particularly burning topics in meeting the technological challenges they face.

In the case of the SKA, it is expected that phase one of the project - representing approximately 10% of the whole SKA - will generate around 300PB (petabytes) of data products every year. This is 10 times more than today's biggest science experiments.

CERN has just surpassed the 200PB mark for raw data collected by the experiments at the LHC over the past seven years, a volume the High-Luminosity LHC is expected to exceed every year. A layered (tiered) system provides for data storage in remote centres.
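To put these volumes in perspective, the quoted figures can be turned into average rates. The sketch below is a rough back-of-envelope calculation, not an official estimate: it assumes decimal petabytes (10^15 bytes), a 365-day year and perfectly even data production, none of which the organisations specify. The function and variable names are illustrative.

```python
PB = 10**15                          # assumption: decimal petabyte, in bytes
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumption: 365-day year


def average_rate_gbit_per_s(petabytes_per_year: float) -> float:
    """Average sustained rate, in gigabits per second, for a yearly volume."""
    bytes_per_second = petabytes_per_year * PB / SECONDS_PER_YEAR
    return bytes_per_second * 8 / 1e9


ska1_pb_per_year = 300.0       # from the article: SKA phase one data products
lhc_pb_per_year = 200.0 / 7.0  # from the article: 200PB of raw data over seven years

ska1_rate = average_rate_gbit_per_s(ska1_pb_per_year)
lhc_rate = average_rate_gbit_per_s(lhc_pb_per_year)
ratio = ska1_pb_per_year / lhc_pb_per_year

print(f"SKA1: about {ska1_rate:.0f} Gbit/s sustained")
print(f"LHC to date: about {lhc_rate:.1f} Gbit/s sustained")
print(f"SKA1 / LHC yearly volume: about {ratio:.1f}x")
```

Under these assumptions, SKA phase one works out to roughly 76 Gbit/s sustained, and its yearly volume is about ten times the LHC's average to date. The two figures measure different things (data products versus raw data), so the comparison is indicative only.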

"This in itself will be a challenge for both CERN and SKA given the step change in the amounts of data we will have to handle in the next five to 10 years," explains Miles Deegan, high-performance computing specialist for the SKA.

"Transferring an average dataset will take days on the SKA's ultra-fast fibre-optic networks, which are 300 times faster than your average broadband connection; so storing or even downloading this data at home or even at your local university is clearly impractical."
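The transfer-time point can be illustrated with simple arithmetic. The sketch below is illustrative only: the article gives neither the size of an average dataset nor the speed of an average broadband connection, so the 10 Mbit/s broadband speed and the 100 TB dataset size are hypothetical placeholders, and the calculation ignores protocol overhead and network contention.

```python
def transfer_days(size_bytes: float, link_bits_per_s: float) -> float:
    """Days needed to move size_bytes over a link, ignoring all overhead."""
    seconds = size_bytes * 8 / link_bits_per_s
    return seconds / 86_400  # seconds per day


BROADBAND_BPS = 10e6    # hypothetical: 10 Mbit/s home broadband
SKA_SPEEDUP = 300       # from the article: "300 times faster"
DATASET_BYTES = 100e12  # hypothetical: 100 TB dataset

ska_days = transfer_days(DATASET_BYTES, BROADBAND_BPS * SKA_SPEEDUP)
home_days = transfer_days(DATASET_BYTES, BROADBAND_BPS)

print(f"SKA network: about {ska_days:.1f} days")
print(f"Home broadband: about {home_days:.0f} days")
```

With these placeholder values, the faster link needs about three days and the home connection 300 times as long. Different assumptions scale both results proportionally, but the gap between the two stays the same.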

Shared challenge

As is already the case at CERN, SKA data will also be analysed by scientific collaborations distributed across the planet. Both institutions and their respective researchers will have common computational and storage needs, with a shared challenge of taking this volume of data and turning it into science that can be published, understood, explained, reproduced, preserved and presented.

"Processing such volumes of complex data to extract useful science is an exciting challenge that we face," adds Antonio Chrysostomou, head of science operations planning for the SKA.

"Our aim is to provide that processing capability through an alliance of regional centres located across the world in SKA member countries. Using cloud-based solutions, our scientific community will have access to the equivalent of today's 35 biggest supercomputers to do the intensive processing needed to extract scientific results. In short, we need to fundamentally change how science is done."

"CERN has proposed the concept of the Federated Open Science Cloud with other EIROforum members. This agreement is an important step in this direction," says Ian Bird, who is responsible at CERN for the Worldwide LHC Computing Grid.

"Essentially, we will provide a giant cloud-based, Dropbox-like, facility to science users around the world, where they will be able to not only access incredibly large files, but will also be able to do extremely intensive processing on those files to extract the science."

As part of the agreement, CERN and SKA will hold regular meetings to monitor progress and discuss the strategic direction of their collaboration. They will organise collaborative workshops on specific technical areas of mutual interest and propose demonstrator projects or prototypes to investigate concepts for managing and analysing Exascale data sets in a globally distributed environment. The agreement includes the exchange of experts in the field of big data as well as joint publications.