Yahoo Inc. has announced the public release of what it calls “the largest-ever machine learning dataset” to the academic research community.
The tech company says it aims to advance the field of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research.
“Many academic researchers and data scientists don’t have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies,” said Suju Rajan, director of research, Yahoo Labs.
According to Yahoo, “we are releasing this dataset for independent researchers because we value open and collaborative relationships with our academic colleagues, and are always looking to advance the state-of-the-art in machine learning and recommender systems.”
The Yahoo News Feed dataset is a collection based on a sample of anonymised user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset contains about 110 billion events (13.5TB uncompressed) of user-news item interaction data, collected by recording the interactions of about 20 million users from February 2015 to May 2015, the company says.
“Yahoo’s release of the Yahoo News Feed dataset is a significant contribution to the research community. Academic researchers everywhere will finally have access to realistic scale data to study how to automatically discover which news articles are of interest to which users, and will be able to compare their methods using this as a shared test case,” said Tom Mitchell, machine learning department chair, Carnegie Mellon University. “Here at CMU we’ll certainly be using it for our research.”
According to Yahoo, the dataset provides categorised demographic information (age range, gender, and generalised geographic data) for a subset of the anonymised users. On the item side, the title, summary and key phrases of each news article are included. Interaction data is timestamped with the user’s local time and contains partial information about the device used to access the news feeds.
“Access to datasets of this size is essential to design and develop machine learning algorithms and technology that scales to truly ‘big’ data,” said Gert Lanckriet, professor, Department of Electrical and Computer Engineering, University of California, San Diego. “At the Jacobs School of Engineering at UC San Diego, it will directly and significantly benefit the wide variety of ongoing research in machine learning, artificial intelligence, information retrieval, and big data applications.”
“At the UMass Amherst Center for Data Science we have broad interests in developing new methods for scalable analytics on a wide variety of big-data domains,” said Andrew McCallum, director of the Center and professor in the College of Information and Computer Sciences. “The release of this large Yahoo News Feed dataset will be a tremendous asset for the academic research community, and for us at UMass particularly, given our major research activities in natural language processing, information retrieval, databases and computational social science.”