Building a big data architecture
The biggest question businesses are asking themselves is whether to build a big data architecture now and fund it through the resulting turnover and savings, or to wait until they have a cost-effective use case before investing in the infrastructure.
"It's the classic case of the chicken and the egg, and it seems like most of us are holding our breath to see what works," says Paul Leroy, technical account manager at Slipstream Data, who will be presenting on 'Big Data in The Cloud - Virtual Machines and The Chicken and The Egg', during the ITWeb Business Intelligence Summit 2016 at The Forum in Bryanston on 1 and 2 March.
To give the discussion some context, he says, most infrastructure providers have been pushing virtual computing as a solution to all of our problems, yet the technology has not seen considerable uptake in SA.
"Data science, and the infrastructure investment that is required to make it work, is being held back by one key concern - bandwidth. This has been the biggest hurdle in SA, particularly for smaller businesses that are looking towards the likes of Google's and Amazon's capacity."
Asked how we can get around this, Leroy says big data offshore requires huge uploads of raw data and small downloads of results, which is the polar opposite of most companies' networking traffic.
"This is where the opportunity lies. This means we can be utilising our unused (but paid for) bandwidth to bolster our capacity, while using virtual machines to increase our capability."
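The asymmetry Leroy describes can be put in rough numbers. The sketch below is purely illustrative (the link speeds and data volumes are assumptions, not figures from the talk): it compares the time to push a large raw dataset upstream on an otherwise idle upload link against the time to pull a small result set back down.

```python
# Illustrative transfer-time arithmetic (all figures are assumptions):
# uploading raw data to an offshore cloud vs downloading the results.

def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Hours to move size_gb over a link of link_mbps (megabits/s)."""
    size_megabits = size_gb * 8 * 1000  # 1 GB is roughly 8000 megabits
    return size_megabits / link_mbps / 3600

# Hypothetical SME link: a 10 Mbps upload channel that sits unused
# overnight, and a 20 Mbps download channel busy during office hours.
raw_upload = transfer_hours(50, 10)        # 50 GB of raw data upstream
result_download = transfer_hours(0.1, 20)  # ~100 MB of results back

print(f"Upload:   {raw_upload:.1f} hours")    # ~11 hours, i.e. one night
print(f"Download: {result_download * 60:.1f} minutes")
```

The point of the sketch is that the expensive direction of the transfer can ride on upload capacity most businesses already pay for but rarely saturate.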
He says virtual machines reduce the direct costs of any operation when the machines are not on a dedicated cluster and are not running 24/7.
"When looking at Hadoop clusters, this can become a problem, as it requires all data nodes to be running in order to store the data. Traditionally this would mean that Hadoop does not translate to a good use case on virtual machines, but with new advancements in large dataset analytics, this is no longer a concern.
"The cost of deploying a cluster and moving the data has also been resolved. The large cloud providers are building access to their storage platforms from their virtual machine platforms which negates the need to move data around before analysing it."
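The economics behind this argument can be sketched with a back-of-envelope model. All prices and figures below are hypothetical assumptions for illustration only: it compares a dedicated cluster whose data nodes must stay up around the clock to hold the data, against on-demand VMs that read the same data directly from cloud object storage and are billed only while jobs run.

```python
# Back-of-envelope monthly cost model (all rates are hypothetical):
# always-on cluster vs on-demand VMs plus separate object storage.

HOURS_PER_MONTH = 730

def dedicated_cost(nodes: int, vm_rate: float) -> float:
    """Always-on cluster: every data node is billed 24/7."""
    return nodes * vm_rate * HOURS_PER_MONTH

def on_demand_cost(nodes: int, vm_rate: float,
                   job_hours: float, storage_gb: float,
                   storage_rate: float) -> float:
    """VMs billed only while jobs run; data lives in object storage."""
    return nodes * vm_rate * job_hours + storage_gb * storage_rate

# Assumed figures: 10 nodes at $0.20/hour, 40 job-hours a month,
# 2 TB kept in object storage at $0.02 per GB-month.
always_on = dedicated_cost(10, 0.20)
burst = on_demand_cost(10, 0.20, 40, 2000, 0.02)

print(f"Dedicated 24/7 cluster: ${always_on:.0f}/month")
print(f"On-demand VMs + object storage: ${burst:.0f}/month")
```

Under these assumed numbers the gap is roughly an order of magnitude, which is the shape of the saving Leroy is pointing at once storage is decoupled from the compute nodes.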
What we're left with, explains Leroy, is the cost of the human resources required to program these new cluster computers: Java, Scala and Python developers, who are fairly expensive.
"But what if the entire process can be simplified to a point where very little coding is required? There are a few projects currently working on addressing this, including Apache NiFi, which simplifies the complex processes needed to wrangle the data and alleviates the burden on traditional programming."
He says Slipstream Data will do a live demo at the summit of the new tools in BI and how users can prepare for the coming wave of big data as a service. Delegates can expect to learn about lowering storage costs, simplifying deployment and code-free analysis, as well as the benefits of data visualisation as an essential tool in the BI workflow.