Wikimedia Foundation Research Award of the Year

The Wikimedia Foundation Research team established the Wikimedia Foundation Research Award of the Year in 2021 to recognize recent research that has the potential to have significant impact on the Wikimedia projects or research in this space.


This year we are announcing two winners of the WMF-RAY: one in the general category and the other in the best student paper category.

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning (paper)

In this research, the authors develop a new image-text dataset (WIT) built on data from Wikimedia Commons and Wikipedia. WIT contains a curated set of 37.5 million image-text examples, which made it the largest multimodal dataset at the time of publication. The dataset is also highly multilingual, offering coverage in over 100 Wikipedia languages.

Today, more than half of Wikipedia articles are unillustrated, and more than half of the images available to us don’t have captions. Wikipedia needs more images that have captions in local languages to better support different learning and accessibility needs on the project. Captions are also important in improving the search experience of users on Wikipedia.

Manually adding captions to large numbers of images, particularly in under-resourced language communities, is an enormous, and in many instances practically impossible, task for Wikimedia volunteers. In recent years, models have been developed to automatically generate captions for images. However, these models are generally biased towards English and Western content for a variety of reasons, including biased training sets.

WIT has the potential to enable and significantly accelerate research and development for adding captions to images across more than 100 Wikipedia languages. Furthermore, given the way WIT is built, some of the representation biases that exist in other datasets traditionally used for training are addressed when using WIT. We are already seeing strong signals that the potential of the dataset is being realized.

WIT has been well received by the research community. The Wikimedia Foundation organized a public image/caption matching competition based on the dataset, which in turn has resulted in at least 4 open source solutions for automatically retrieving the text closest to an image on Wikipedia. A new community has also come to life with a focus on multimodal and multilingual machine learning research on Wikipedia; the first event of this community, the Wiki-M3L Workshop, took place as part of ICLR 2022.

[Best Student Paper] Assessing the Quality of Sources in Wikidata Across Languages: A Hybrid Approach (paper)

This research, whose first author was a PhD student at the time of publication, explores the quality of references in Wikidata, the free and open knowledge base, in an elaborate and systematic way.

Wikidata has become a critical project within the Wikimedia ecosystem, with significant impact on the Wikimedia projects as well as the broader ecosystem that Wikimedia operates in and serves. Across the world, people and businesses rely on the statements stored in Wikidata for a range of activities such as making new content available in languages where that content is missing, building smart assistants, training AI systems, and more.

Many of the statements in Wikidata come with references. According to Wikidata’s community policies, these references must meet three criteria: relevance, authoritativeness and ease of access. Wikidata’s quality and reliability, and its impact, depend on its references being generally perceived as high quality in all three of these senses.

This research evaluates the state of references in Wikidata in 6 languages. The authors do a significant amount of work that involves developing a creative set of methods to combine multiple rounds of automatic and manual assessment into a complex and multistage research project. They make their full data and code available and by doing so allow others to build on their learnings and reproduce the analyses. Furthermore, the authors provide a detailed report card for the Wikimedia community about the state of Wikidata references, making the result of their work more accessible for the Wikidata community members.


Content Growth and Attention Contagion in Information Networks: Addressing Information Poverty on Wikipedia (paper)

This innovative research demonstrates causal evidence of the relationship between increases in content quality in English Wikipedia articles and subsequent increases in attention. The researchers conduct a natural experiment using edits made on English Wikipedia via the Wiki Education Foundation program. The paper shows that English Wikipedia articles that were improved by students in the program gained more viewers than a group of otherwise similar articles. It also finds that this effect spills over into a range of articles linked to from the improved articles.

The Wikimedia Foundation's mission has two parts: (1) disseminating knowledge and (2) encouraging people to engage in the production of new knowledge. This work provides new evidence that links these goals in an exciting way. From the Wikimedia Foundation and Wikimedia movement’s perspective, this research provides strong evidence to support a range of content improvement efforts. Although it might seem that there is a tension between focusing resources on improving content that is poorly developed (but also currently unpopular) and putting efforts toward articles that already have larger audiences, this work suggests that content improvement efforts focused on content gaps and areas of information poverty can create new audiences for that content.

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages (paper) and the Masakhane Community (learn more)

This paper and the Masakhane community have attempted to fundamentally change how we approach the challenge of "low-resourced languages" in Africa. The research describes a novel approach for participatory research around machine translation for African languages. The authors show how this approach can overcome the challenges these languages face in joining the Web and in gaining access to some of the technologies other languages benefit from today.

The work of the authors and the community is an inspiring example of work towards Knowledge Equity, one of the two main pillars of the 2030 Wikimedia Movement Strategy, which states: "As a social movement, we will focus our efforts on the knowledge and communities that have been left out by structures of power and privilege. We will welcome people from every background to build strong and diverse communities. We will break down the social, political, and technical barriers preventing people from accessing and contributing to free knowledge."

We cannot think of a better or more inspiring example of a project from the last year seeking to achieve these goals. Additionally, we see the success of this project as something that will directly support a range of Wikimedia Foundation and Wikimedia Movement goals, including the newly announced Abstract Wikipedia, which will rely heavily on machine translation tools.