Wikimedia Foundation Research Award of the Year

The Wikimedia Foundation Research team established the Wikimedia Foundation Research Award of the Year in 2021 to recognize recent research that has the potential to have significant impact on the Wikimedia projects or research in this space.


After reviewing more than 180 candidate peer-reviewed publications from 2022, we are honored to announce the WMF-RAY 2023 winners. Watch the award ceremony or read below to learn more.

Controlled Analyses of Social Biases in Wikipedia Bios (paper)

for developing a novel methodology for studying bias in English Wikipedia biographies and for affirming and challenging past findings through their new approach.

One of the most discussed challenges faced by the Wikimedia Movement is that there is systemic bias in both what is covered in Wikimedia projects and in how it is covered, referred to as content gaps. Among the most discussed content gaps is the gender gap, especially as it relates to Wikipedia biographies.

Over the past decade, researchers have attempted to answer questions such as: How does the number of biographies of women compare to the number of biographies of men on Wikipedia? How do the biographies that exist differ systematically based on the gender of the subject? Are the differences between the numbers of biographies a result of systemic bias in Wikipedia? What are the barriers to including content about women and non-binary groups in Wikipedia?

One of the most important steps in conducting research to answer these types of questions is choosing a set of Wikipedia articles to compare. The choice of articles matters, and different choices can lead to different and even contradictory findings.

In this research, the authors developed a new methodology for studying bias in Wikipedia. Given a target corpus of biographies, their proposed method uses the Wikipedia category system to construct a comparison corpus that matches the target in as many attributes as possible. The research further offers an approach for evaluating bias in a more general way.
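The idea of building a comparison corpus from category overlap can be illustrated with a minimal sketch. This is not the authors' implementation: the Jaccard similarity score, the greedy one-to-one pairing, the function names, and the sample biographies below are all assumptions made for illustration. Categories tied to the attribute under study are excluded before matching, so that the two corpora differ mainly in that attribute.

```python
from typing import Dict, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two category sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_comparison_corpus(
    target: Dict[str, Set[str]],
    candidates: Dict[str, Set[str]],
    excluded: Set[str],
) -> Dict[str, Optional[str]]:
    """Greedily pair each target biography with its most similar unused candidate.

    `excluded` holds categories tied to the attribute being studied; they are
    ignored when scoring. A target with no overlapping candidate maps to None.
    """
    matches: Dict[str, Optional[str]] = {}
    used: Set[str] = set()
    for title, cats in target.items():
        cats = cats - excluded
        best, best_score = None, 0.0
        for cand, cand_cats in candidates.items():
            if cand in used:
                continue
            score = jaccard(cats, cand_cats - excluded)
            if score > best_score:
                best, best_score = cand, score
        matches[title] = best
        if best is not None:
            used.add(best)
    return matches


# Hypothetical biographies and categories, invented for this example.
target = {"Bio A": {"American physicists", "1950 births", "Women physicists"}}
candidates = {
    "Bio B": {"American physicists", "1950 births"},
    "Bio C": {"French painters", "1950 births"},
}
matches = build_comparison_corpus(target, candidates, excluded={"Women physicists"})
print(matches)  # {'Bio A': 'Bio B'}
```

In this toy example "Bio B" is chosen because it shares both remaining categories with "Bio A", while "Bio C" shares only the birth year.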

This research demonstrates the power of the methodology by revealing disparities in biographies that have been difficult to infer at scale in the past. These include disparities related to non-binary gender, race, and intersectional identities.

The Gender Divide in Wikipedia: Quantifying and Assessing the Impact of Two Feminist Interventions (paper)

for a thorough scientific evaluation of the Art+Feminism and 500 Women Scientists projects to address Wikipedia's gender gaps in content.

While characterizing content gaps is a critical piece of any attempt to close them, it is only the first step.

Members of the Wikimedia community have been working hard to close content gaps for many years. In the English Wikipedia community, the Art+Feminism project and the 500 Women Scientists project are among the leading projects to address gender gaps in content. These projects have contributed to thousands of Wikipedia biographies of women, primarily through edit-a-thons.

The authors of the paper utilized scientific methods to study and surface the effectiveness of the two projects as well as areas for improvement.

They collected data on thousands of biographies of women artists, scientists, athletes, and politicians edited by the two projects. They carefully constructed a dataset of otherwise similar biographies of men, which they used as a comparison set. They evaluated the results of the content improvement efforts by comparing the length, quality, and visibility of the articles worked on by the two interventions with those in the comparison set. They further evaluated the interventions in terms of the degree to which the articles are linked from other articles and the amount of material about the subjects that is put into infoboxes.

The researchers offer two key findings:

  • The projects have been successful in writing biographies of women that are longer, higher quality, and viewed more than the comparison articles;
  • Articles worked on by these projects lag behind the comparison group in several other respects that may limit the visibility of the articles (for example, they are less integrated into the intra-Wikipedia link network).

The findings from this research provide scientific insights for decision makers in these projects and offer valuable insights for editors and edit-a-thon organizers in other relevant initiatives across the Wikimedia Movement.


This year we are announcing two winners for WMF-RAY: one in the general category and the other in the best student paper category.

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning (paper)

In this research, the authors develop a new image-text dataset (WIT) built on data from Wikimedia Commons and Wikipedia. WIT contains a curated set of 37.5 million image-text examples, which made it the largest multimodal dataset at the time of publication. The dataset is also highly multilingual, offering coverage in over 100 Wikipedia languages.

Today, more than half of Wikipedia articles are unillustrated, and more than half of the images available to us don’t have captions. Wikipedia needs more images that have captions in local languages to better support different learning and accessibility needs on the project. Captions are also important in improving the search experience of users on Wikipedia.

Manually adding captions to large numbers of images, particularly in under-resourced language communities, is an enormous, and in many instances practically impossible, task for Wikimedia volunteers. In recent years, models have been developed to automatically generate captions for images. However, these models are generally biased towards English and Western content due to a variety of reasons including biased training sets.

WIT has the potential to enable and significantly accelerate research and development for adding captions to images across more than 100 Wikipedia languages. Furthermore, given the way WIT is built, some of the representation biases that exist in other datasets traditionally used for training are addressed when using WIT. We are already seeing strong signals that the potential of the dataset is being realized.

WIT has been well received by the research community. The Wikimedia Foundation organized a public image/caption matching competition based on the dataset, which in turn has resulted in at least 4 open source solutions for automatically retrieving the text closest to an image on Wikipedia. A new community has also come to life with a focus on multimodal and multilingual machine learning research on Wikipedia; the first event of this community, the Wiki-M3L Workshop, took place as part of ICLR 2022.

[Best Student Paper] Assessing the Quality of Sources in Wikidata Across Languages: A Hybrid Approach (paper)

This research, whose first author was a PhD student at the time of publication, explores the quality of references in Wikidata, the free and open knowledge base, in an elaborate and systematic way.

Wikidata has become a critical project within the Wikimedia ecosystem, with significant impact on the Wikimedia projects as well as the broader ecosystem that Wikimedia operates in and serves. Across the world, people and businesses rely on the statements stored in Wikidata for a range of activities such as making content available in languages in which it is missing, building smart assistants, training AI systems, and more.

Many of the statements in Wikidata come with references. According to Wikidata’s community policies, these references must meet three criteria: relevance, authoritativeness, and ease of access. Wikidata’s quality and reliability, and its impact, depend on its references being generally perceived as high quality in all three of these senses.

This research evaluates the state of references in Wikidata in 6 languages. The authors do a significant amount of work that involves developing a creative set of methods to combine multiple rounds of automatic and manual assessment into a complex and multistage research project. They make their full data and code available and by doing so allow others to build on their learnings and reproduce the analyses. Furthermore, the authors provide a detailed report card for the Wikimedia community about the state of Wikidata references, making the result of their work more accessible for the Wikidata community members.


Content Growth and Attention Contagion in Information Networks: Addressing Information Poverty on Wikipedia (paper)

This innovative research provides causal evidence of the relationship between increases in content quality in English Wikipedia articles and subsequent increases in attention. The researchers conduct a natural experiment using edits made on English Wikipedia via the Wiki Education Foundation program. The paper shows that English Wikipedia articles that were improved by students in the program gained more viewers than a group of otherwise similar articles. It also finds that this effect spills over into a range of articles linked to from the improved articles.
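One common way to compare improved articles against otherwise similar ones is a difference-in-differences estimate: the change in views for the improved group minus the change for the comparison group. The sketch below illustrates that general technique only. It is an assumption that it resembles the paper's analysis, and the function names and page-view numbers are invented for illustration.

```python
from statistics import mean
from typing import Sequence


def mean_change(before: Sequence[float], after: Sequence[float]) -> float:
    """Average per-article change in views between two periods."""
    return mean(a - b for b, a in zip(before, after))


def diff_in_diff(
    treated_before: Sequence[float],
    treated_after: Sequence[float],
    control_before: Sequence[float],
    control_after: Sequence[float],
) -> float:
    """Change in the improved group minus change in the comparison group."""
    return mean_change(treated_before, treated_after) - mean_change(
        control_before, control_after
    )


# Hypothetical monthly page views per article, invented for this example.
treated_before = [100, 40, 60]   # improved articles, before the edits
treated_after = [150, 70, 100]   # improved articles, after the edits
control_before = [110, 35, 55]   # similar articles that were not improved
control_after = [120, 40, 70]

effect = diff_in_diff(treated_before, treated_after, control_before, control_after)
print(effect)
```

Subtracting the comparison group's change removes growth that would have happened anyway, such as a site-wide rise in traffic, so the remainder is a rough estimate of the views attributable to the improvement.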

The Wikimedia Foundation's mission has two parts: (1) disseminating knowledge and (2) encouraging people to engage in the production of new knowledge. This work provides new evidence that links these goals in an exciting way. From the perspective of the Wikimedia Foundation and the Wikimedia movement, this research provides strong evidence to support a range of content improvement efforts. Although it might seem that there is a tension between focusing resources on improving content that is poorly developed (but also currently unpopular) and putting efforts toward articles that already have larger audiences, this work suggests that content improvement efforts focused on content gaps and areas of information poverty can create new audiences for that content.

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages (paper) and the Masakhane Community (learn more)

This paper and the Masakhane community have attempted to fundamentally change how we approach the challenge of "low-resourced languages" in Africa. The research describes a novel approach for participatory research around machine translation for African languages. The authors show how this approach can overcome the challenges these languages face to join the Web and some of the technologies other languages benefit from today.

The work of the authors and the community is an inspiring example of work towards Knowledge Equity, one of the two main pillars of the 2030 Wikimedia Movement Strategy. "As a social movement, we will focus our efforts on the knowledge and communities that have been left out by structures of power and privilege. We will welcome people from every background to build strong and diverse communities. We will break down the social, political, and technical barriers preventing people from accessing and contributing to free knowledge."

We cannot think of a better or more inspiring example of a project from the last year seeking to achieve these goals. Additionally, we see the success of this project as something that will directly support a range of Wikimedia Foundation and Wikimedia Movement goals including the newly-announced Abstract Wikipedia which will rely heavily on machine translation tools.