Hackers scrape Sapa database

By Staff Writer, ITWeb
Johannesburg, 14 May 2015
Sapa's archives should be available to all, argues the group that scraped its data.

An anonymous hacking group has made several years' worth of South African Press Association (Sapa) stories available online.

Sapa, which was owned by a collective of news publications, closed its doors at the end of March. Its archives and assets were bought by Sekunjalo Investment Holdings, one of the founding investors of the African News Agency (ANA).

ANA, the continent's first news syndication service, was launched on 1 March with an initial investment of $20 million.

The hacking group says the association was, for several years, the "cornerstone of broad news coverage in South Africa".

"No single news organisation could afford the number of journalists required to cover our broad country, and the elegant solution of pooling resources meant SAPA became the record-keeper of South African history."

After Sapa's closure, Sekunjalo initially required publications to delete all Sapa copyrighted content, including copyrighted AP, AFP and DPA content supplied through its news subscription service, from all subscriber media platforms, archives and storage facilities. It apparently reversed this requirement later.

The hacking group says, subsequent to Sapa's closure, it was "concerned about the archives of Sapa - essentially an historic record of some of South Africa's greatest (and smallest) moments - being unavailable to the media.

"Such a valuable resource should be available publicly, not just to the media." That, it says, is the reason behind its publication of what it calls SapaFiles.

"We believe that knowledge should be free, and that this particular archive is a nationally important trove of history that belongs in the public domain."

However, the hackers do not have a complete archive, as their collection only spans up to 2007 and this year, when Sapa closed its doors. "In the crazy few days before Sapa closed, we frantically scraped their Web site, which was not the fastest thing in the world. We even made it fall over a few times."

The hacking group notes that, from the digital archive, it retrieved about half of the four million articles Sapa published, and has turned these into a searchable archive of 1.8 million indexed documents. "If you have a more complete set, please consider sharing it with us so that we can include it in this archive."

In addition, say the hackers, Sapa was not easy to scrape, and so some of the articles are missing headlines. "And since there are so many of them, converting the HTML into something usable is a big task. Please don't send us reports of problems with single articles right now. We'll be refining the processor and rebuilding the index over time."

The hacking group also plans to have "some fun with this data", such as entity extraction, automatic and user tagging, and relationship mapping/networks. "We also managed to get some of the multimedia files, so expect those available soon too."

The group adds that the stories must be used at "own risk".
