Vespa, Yahoo's search code, released as open source

Johannesburg, 29 Nov 2017

Vespa, Yahoo's big data processing and serving engine, has been released as open source by Oath, the Verizon subsidiary that has owned Yahoo since June 2017. It is now available on GitHub.

Oath, which has over 1 billion users, currently uses Vespa across many of its brands, including Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini and Flickr, to process and serve billions of daily requests over billions of documents while responding to search queries, making recommendations, and providing personalised content and advertisements.

According to Jon Bratseth, an architect with Vespa, the engine processes and serves content and ads almost 90 000 times every second, with latencies in the tens of milliseconds. On Flickr alone, it performs keyword and image searches at a rate of a few hundred queries per second across tens of billions of images. Vespa also serves over 3 billion native ad requests per day via Yahoo Gemini, at a peak of 140 000 requests per second.

"Over the last couple of years, we have rewritten most of the engine from scratch to incorporate our experience onto a modern technology stack," he said. "Vespa is larger in scope and lines of code than any open source project we've ever released. Now that this has been battle-proven on Yahoo's largest and most critical systems, we are pleased to release it to the world."

Vespa's open source release is regarded as an unexpected boon for developers, because Vespa is loaded with potential that reaches far beyond search.

Bratseth said that building applications increasingly means dealing with huge amounts of data. While developers can use the Hadoop stack to store and batch-process big data, and Storm to process data streams, these technologies do not help with serving results to end users.

"Serving often involves more than looking up items by ID or computing a few numbers from a model. Many applications need to compute over large datasets at serving time. Two well-known examples are search and recommendation. To deliver a search result or a list of recommended articles to a user, you need to find all the items matching the query, determine how good each item is for the particular request using a relevance/recommendation model, organise the matches to remove duplicates, add navigation aids, and then return a response to the user," he explained.

"As these computations depend on features of the request, such as the user's query or interests, it won't do to compute the result upfront. It must be done at serving time, and since a user is waiting, it has to be done fast. Combining speedy completion of these operations with the ability to perform them over large amounts of data requires a lot of infrastructure - distributed algorithms, data distribution and management, efficient data structures and memory management, and more. This is what Vespa provides in a neatly-packaged and easy-to-use engine."
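
The serving-time steps Bratseth describes (find the matches, score each one for the particular request, remove duplicates, return a response) can be illustrated with a minimal Python sketch. This is not Vespa's API or implementation: the Document type, the toy relevance formula and the serve function are hypothetical names invented for illustration, and the sketch runs over a small in-memory list rather than a distributed data store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical document type, used only for this illustration."""
    doc_id: str
    title: str
    text: str


def matches(doc, query_terms):
    """A document matches if its text contains every query term."""
    words = set(doc.text.lower().split())
    return all(term in words for term in query_terms)


def relevance(doc, query_terms, user_interests):
    """Toy relevance model: query term frequency plus a small boost
    for each of the user's interests that appears in the text."""
    words = doc.text.lower().split()
    term_score = sum(words.count(term) for term in query_terms)
    interest_boost = sum(1 for interest in user_interests if interest in words)
    return term_score + 0.5 * interest_boost


def serve(corpus, query, user_interests, top_k=3):
    """Compute a response at request time: match, score, de-duplicate, return."""
    query_terms = query.lower().split()

    # 1. Find all the items matching the query.
    candidates = [doc for doc in corpus if matches(doc, query_terms)]

    # 2. Score each match for this particular request (query and user).
    ranked = sorted(
        candidates,
        key=lambda doc: relevance(doc, query_terms, user_interests),
        reverse=True,
    )

    # 3. Remove duplicates, keeping the best-scored item per title.
    seen_titles = set()
    results = []
    for doc in ranked:
        key = doc.title.lower()
        if key in seen_titles:
            continue
        seen_titles.add(key)
        results.append(doc)

    # 4. Return the response: the IDs of the top results.
    return [doc.doc_id for doc in results[:top_k]]


corpus = [
    Document("a", "Cup final", "football cup final tonight football"),
    Document("b", "Cup final", "football cup final"),
    Document("c", "Transfer news", "football transfer news finance"),
    Document("d", "Markets", "finance markets update"),
]

print(serve(corpus, "football", user_interests=["finance"]))  # ['a', 'c']
```

Because the score depends on both the query and the user's interests, none of this can be precomputed per document. That is the reason, in Bratseth's account, that the work has to happen at serving time.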

Now that it's open sourced, Vespa is set to become a major part of the open source toolbox, alongside the likes of Hadoop, Kubernetes, OpenStack, and even Linux.

"By releasing Vespa, we are making it easy for anyone to build applications that can compute responses to user requests, over large datasets, at real time and at Internet scale - capabilities that up until now have been within reach of only a few large companies," Bratseth concluded.