Data scientists predict Springbok defeat
Although anything is possible in a rugby match, local data scientists at Principa, who were spot on when they predicted a four-point Springboks win over Wales last week, are singing a different tune this time.
Tomorrow, the Springboks take on their nemesis, the All Blacks of New Zealand, in a tough semi-final encounter set for Twickenham Stadium.
Using predictive analytics, the data scientists say the probability of a Springbok loss is high, based on historical results. However, variables like rain, referee calls, missed penalties and player strategy could make the difference between a win and a loss for SA.
New Zealand are playing in their seventh Rugby World Cup semi-final; no side has reached the last four on as many occasions. SA have the best tackle success rate at this tournament so far, completing 91% of their tackles.
The All Blacks are yet to lose a scrum at this tournament, winning 30 out of 30 on their own put-in. Another win for SA would see them become the first team to beat New Zealand three times in the competition.
Principa has two teams of data scientists - Nero and Trojan - on the sports prediction site Superbru.com.
"As much as we hate to be the bearer of bad news, both are predicting a win for New Zealand. Nero is predicting a 12-point margin, while Trojan is predicting a seven-point margin," says Robin Davies, head of data analytics at Principa.
He adds the bookie odds are currently pegged at a nine-point margin in favour of New Zealand.
Davies explains that the gap between the 12-point and seven-point margins is due to the different data sources used in the two predictive models and the different predictive algorithms applied by the two teams.
Both aspects are key to predictive analytics and can have a significant effect on the accuracy of the models, he points out.
"In Trojan's case, we are using purely performance statistics for the two teams - we've looked at historical statistics, and then added the most recent statistics from the last few matches played. However, in Nero's case, we've also added a bit of human sentiment, which we bring in by incorporating the bookie odds and the fantasy league values placed on all the players."
The challenge in predicting this score was the many disparate data sources from previous international rugby games that had to be sourced for use in the predictive models, Davies notes.
"All of the data obtained was made available for the predictive models. However, not all data is very predictive or worth bringing into the final model. In order to build a reliable predictive model, we only bring in the most predictive and stable fields."
He says examples of fields that gave the most "predictive power" were bookie odds, the monetary value obtained from a rugby fantasy league, number of tries scored in previous games, and world ranking, to name but a few.
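One simple way to gauge a field's "predictive power" is to check how strongly it correlates with past match margins. The sketch below is an illustration only: the article does not say how Principa measured predictive power, the field names are hypothetical, and the numbers are invented toy data rather than real match results. It also ignores the "stable" half of the selection criterion.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant field carries no signal
    return cov / math.sqrt(var_x * var_y)


def rank_fields(columns, target):
    """Sort field names by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data, invented for illustration: one value per past match.
margins = [12, -3, 7, 20, -10]
columns = {
    "bookie_margin": [10, -2, 5, 18, -8],   # tracks the result closely
    "kickoff_hour": [15, 15, 17, 15, 20],   # less closely related
    "stadium_id": [1, 1, 1, 1, 1],          # constant, so no signal
}

ranking = rank_fields(columns, margins)
```

In this toy set, `bookie_margin` ranks first and the constant `stadium_id` ranks last, so only the top fields would be carried into a final model.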
"The biggest challenge for us was not in the development of the predictive model itself, but rather in the sourcing and scrubbing of the various datasets, to get them in a state that could be used by the predictive algorithms."
He points out that most of the data scrubbing involved extracting data from the Internet and packaging this data in the correct format, joining the various datasets together, and carrying out validations on the data to ensure it was fit for purpose.
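The joining and validation steps described above might look something like the following minimal sketch. The record layout (`match_id`, `home_score`, `away_score`, `bookie_margin`) is an assumption made for illustration, and the rows are invented; the article does not describe Principa's actual datasets or checks.

```python
def join_by_match(results, odds):
    """Inner-join two lists of dicts on their shared 'match_id' key."""
    odds_by_id = {row["match_id"]: row for row in odds}
    joined = []
    for row in results:
        extra = odds_by_id.get(row["match_id"])
        if extra is not None:  # drop matches that have no odds record
            joined.append({**row, **extra})
    return joined


def is_valid(row):
    """Basic fitness-for-purpose checks on one joined record."""
    required = ("match_id", "home_score", "away_score", "bookie_margin")
    if any(row.get(field) is None for field in required):
        return False
    return row["home_score"] >= 0 and row["away_score"] >= 0


# Invented rows, for illustration only.
results = [
    {"match_id": 1, "home_score": 24, "away_score": 17},
    {"match_id": 2, "home_score": None, "away_score": 10},  # missing score
    {"match_id": 3, "home_score": 30, "away_score": 12},    # no odds row
]
odds = [
    {"match_id": 1, "bookie_margin": 3.5},
    {"match_id": 2, "bookie_margin": -6.0},
]

clean = [row for row in join_by_match(results, odds) if is_valid(row)]
```

Only match 1 survives here: match 2 fails validation because a score is missing, and match 3 is dropped by the join because it has no odds record.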
Once the data was in an acceptable format, the predictive algorithm could be applied with relative ease, says Davies.
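To show why this step can be short once the data is clean, here is a minimal sketch that fits a straight line by ordinary least squares and uses it to predict a margin. It is a stand-in, not Principa's algorithm, which the article does not name; the single input (`ranking_gap`) is hypothetical and the numbers are invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the xs are not all identical
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Invented (ranking_gap, points_margin) pairs from cleaned past games.
games = [(1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0)]

model = fit_line([x for x, _ in games], [y for _, y in games])
predicted_margin = predict(model, 5)
```

Refitting on a longer list of games is the same `fit_line` call with more rows, which is why most of the effort sits in preparing the data rather than in running the model.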
"The next challenge was then extracting the same data for the recently played games, retraining the models off the recent data and then applying the algorithms on the upcoming games to predict the outcomes. It was quite a process to set up, but once it is up and running, the updating process becomes less onerous."
The Principa data scientists have accurately predicted 91% of the games on Superbru.com so far.
"Considering that we did not know what to expect going into the tournament, we are very pleased with this outcome. However, we won't be disappointed if our prediction that New Zealand will win this weekend's game is incorrect. We may be data scientists, but our blood is and always will be green. Go Bokke!"