Research Report Nº 4
The fourth in a series of biannual reports from Wikimedia Research, published every June and December.
Welcome! We're the Wikimedia Foundation's Research team. We turn research questions into publicly shared knowledge. We design and test new technologies, produce empirical insights to support new products and programs, and publish research informing the Wikimedia Foundation's and the Movement's strategy. We help to build a strong and diverse community of Wikimedia researchers globally. This Research Report is an overview of our team's latest developments – an entry point that highlights existing and new work, introduces new staff, and details new collaborations and new considerations, including trends that we're watching.
Between January and June 2021, we've worked with other staff at Wikimedia Foundation, our formal collaborators, Wikimedia affiliates, and Wikimedia volunteers to address knowledge gaps on the Wikimedia projects, improve knowledge integrity, and conduct foundational work. Each success is crucial in breaking down the barriers that prevent people from freely sharing in the sum of all knowledge.
As you read about our work below, we invite you to learn more through links that lead to collaborative projects and other key programs, including events that we're organizing or attending. The work that the Research team does is a work in progress. That's why we're publishing these Research Reports: to keep you updated on what we think is important to know about our most recent work, and about the work that will happen in the following six months.
Addressing knowledge gaps
We aim to help Wikimedia projects engage more editors, increase the diversity of editors, identify and address missing content, and reach more readers across the globe. The Wikimedia projects are a form of discourse that we want everyone to share in. Who writes the knowledge matters. Who reads the knowledge matters. What’s missing from that knowledge matters. That’s our challenge as we help address our projects’ knowledge gaps.
A deeper understanding of Wikipedia use cases and readership gaps. We have found that across the globe: (1) women are underrepresented among readers of Wikipedia, (2) women view fewer pages per reading session than men, (3) men and women visit Wikipedia for similar reasons, and (4) men and women exhibit specific topical preferences. (Paper)
An improved readership experience through better new user onboarding. We continued our collaboration with the Growth Team to improve new editor retention through Structured Tasks, which aim to break down the editing process into steps that are easily understood and guided by algorithms. Inspired by earlier research, we developed an algorithm for the add-a-link structured task that automatically generates hyperlink recommendations which editors can add to articles. One challenge was to build a model that works equally well across different languages (and language families) to support small and medium-sized Wikipedias. The model has been successfully deployed as part of the newcomer tasks on Arabic, Vietnamese, Czech, and Bengali Wikipedias. (Learn more, Paper)
Metrics to measure knowledge gaps. In January 2020, we started the second phase of development of the knowledge gap index: the measurement phase. Our aim in this phase is to map each gap in the taxonomy to one or a few numbers (a "metric") reflecting the extent to which the gap is present in the Wikimedia projects, based on surveys and observations. As of August 2021, we have defined metrics for 75% of the gaps in the knowledge gaps taxonomy. (Learn more)
A deeper understanding of the role of visual knowledge in Wikimedia Projects. Despite the widespread usage of images on web platforms and the large volume of visual content on Wikipedia, little is known about the importance of images in free knowledge. In May 2020, we started a set of initiatives to study the role of images in Wikimedia projects. We worked on three main areas in the past six months:
- We developed a map of visual knowledge gaps to quantify the under-representation and over-representation of images in Wikipedia and Wikidata. The map aids Wikimedia organizers and developers in designing more targeted image contribution campaigns and new product developments. (Learn more)
- In March 2021, we performed the first large-scale analysis of interactions with images on English Wikipedia. We found that 1 in 30 pageviews results in a click on at least one image, one order of magnitude higher than interactions with other types of article content. We observed that clicks on images occur more often in shorter articles and articles about visual arts, transportation, and biographies of less well-known people. The findings deepen our understanding of the role of images on Wikipedia and provide a guide for enriching its visual content. (Learn more)
- In 2020, we started designing a large-scale qualitative experiment to test how images impact reading comprehension of Wikipedia articles. We manually curated a list of reading comprehension questions for 100+ Wikipedia articles related to visual and non-visual aspects of the article topics. We recently launched a large-scale study that uses these questions to test readers' comprehension skills on articles with and without images. (Learn more)
A model for image recommendation. On average, half of Wikipedia articles are missing an image to illustrate their content, despite the fact that Wikimedia Commons, Wikimedia's media repository, contains more than 75 million freely licensed images. In March 2021 we developed an algorithm that recommends relevant images to unillustrated articles. The algorithm was tested in April 2021 as part of the Train Image Algorithm feature on the Wikipedia Android app. Based on positive results in algorithm accuracy and user engagement, the Growth, Android and Structured Data teams deployed the algorithm in more Wikipedia products in July 2021.
A deeper understanding of reader navigations. More than 1.5 billion unique devices access almost 55 million articles in Wikipedia every month. The people behind these devices read more than 15 billion pages a month. The aim of this project is to better understand the ways in which readers explore the network of articles when learning about a given subject. By systematically characterizing reading sessions, we learned that they are generally short; there are strong differences with respect to the topic of interest, time of the day, and geographic region of the readers; external search engines such as Google play a major role in navigation, not only to bring the user to Wikipedia but also to help the user navigate once on Wikipedia. In fact, for almost 1 out of 3 consecutively viewed Wikipedia pages, readers don't use the available hyperlinks in the article to navigate. Instead they use external search engines to navigate to their next article on Wikipedia. (Learn more)
A unified framework for equitable article prioritization. How Wikipedia editors prioritize contributions towards missing content plays an important role in content equity. The desire to utilize recommender systems in the Wikimedia world creates an even stronger reason to pay particular attention to content equity and prioritization. We are developing frameworks and principles that can support the community in more equitable task prioritization. We studied how editors assess article importance and identified several core criteria as well as implications for the design of recommender systems that might seek to support these processes (Learn more). We also studied how existing recommender systems for editors operationalize article importance, with a focus on how these approaches affect the types of content that are edited. By analyzing the Suggested Edits recommender system, we showed that surfacing random articles (or images or Wikidata items) through the system supports the (biased) status quo, generating many more edits for content about men (as opposed to women or non-binary gender identities) and many more edits to content about regions like the United States for which there already exist many articles on Wikipedia (Learn more). The final component of the project has focused on the concept of misalignment for rankings in recommender systems, i.e., prioritizing content that is low quality but in high demand (as determined by pageviews). As a first step, we developed an API that can generate simple quality and demand scores for any article in any language of Wikipedia. Misalignment in these scores can be used to evaluate individual articles as well as aggregate topic areas, e.g., by comparing the misalignment of content about sports to content about medicine (Learn more).
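To make the idea of misalignment concrete, here is a minimal Python sketch. It is not the API mentioned above: the function names, the choice of a percentile rank for demand, the subtraction used as the misalignment score, and the example numbers are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def percentile_ranks(values):
    """Map each key to its rank in [0, 1] by value (0 = lowest, 1 = highest).

    Ties are broken by input order, which is good enough for a sketch.
    """
    ordered = sorted(values, key=values.get)
    if len(ordered) < 2:
        return {key: 0.0 for key in ordered}
    return {key: i / (len(ordered) - 1) for i, key in enumerate(ordered)}


def misalignment(articles):
    """articles: {title: (topic, quality in [0, 1], pageviews)}.

    Returns demand percentile minus quality for each article. Positive values
    mean the article is in higher demand than its quality would suggest.
    """
    demand = percentile_ranks({t: views for t, (_, _, views) in articles.items()})
    return {t: demand[t] - quality for t, (_, quality, _) in articles.items()}


def topic_misalignment(articles):
    """Average the per-article misalignment within each topic."""
    per_article = misalignment(articles)
    by_topic = defaultdict(list)
    for title, (topic, _, _) in articles.items():
        by_topic[topic].append(per_article[title])
    return {topic: mean(scores) for topic, scores in by_topic.items()}


# Hypothetical scores, invented for this example only.
EXAMPLE = {
    "Article A": ("sports", 0.8, 100),
    "Article B": ("sports", 0.5, 400),
    "Article C": ("medicine", 0.25, 300),
    "Article D": ("medicine", 0.5, 200),
}

if __name__ == "__main__":
    print(misalignment(EXAMPLE))
    print(topic_misalignment(EXAMPLE))
```

In this toy data, the medicine articles come out with a positive average misalignment and the sports articles with a negative one, which is the kind of topic-level comparison the paragraph describes.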
Improving knowledge integrity
We help Wikimedia communities assure the integrity of knowledge on projects by conducting research and developing and testing technologies that can help editors detect Wikipedia policy violations more effectively. Below is a summary of our ongoing projects.
A program focused on disinformation. In Research Report No. 1, we shared the launch of a research program focused on disinformation and misinformation. We continue to invest in research in this space and you can read more about some of our related ongoing projects below.
A model for understanding knowledge propagation across Wikimedia projects. We are continuing research towards creating a model to understand content propagation across Wikimedia projects. As part of this work we developed and published a dataset of inter-language knowledge propagation in Wikipedia across 309 Wikipedia language editions and 33 million articles. The dataset includes the full propagation history of Wikipedia articles and can accelerate research in understanding content propagation in Wikipedia. (Paper, Data, Learn more)
Machine learning models to align Wikipedia and Wikidata content. In our pursuit of supporting editors and patrollers in enforcing content policies on Wikipedia, we aim to build models that can automatically detect violations of core content policies. One of the ways we approach building such models is to align Wikipedia content across the different languages (by extensively using Wikidata) and identify possible sources of inconsistencies in content. We are currently exploring new methods to solve this task. (Learn more)
Wiki-Reliability: A large scale dataset for improved content reliability. Wikipedia content is patrolled by volunteer editors. Machine learning and information retrieval algorithms could help scale up editors' manual efforts to improve the reliability of Wikipedia content. However, there is a lack of large-scale datasets to support the development of research in this space. To fill this gap, we released Wiki-Reliability, the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues using the templates from WikiProject Reliability. (Paper, Data, Learn more)
A sockpuppet detection model. The Platform Engineering team productionized a model developed by our team earlier in the year. We are waiting to seek feedback on the model from the community of checkusers. (Learn more)
A model for better understanding of the effect of collaboration on article quality. We developed a model to understand the relationship between users' collaboration patterns, articles' characteristics and the quality of the final resulting articles. Specifically, we focused on articles about Biographies of Living People and developed a baseline model that shows a relation between content reliability issues, such as biographies written with a promotional tone, and the collaboration patterns behind that article. We expect to continue this line of research to develop more sophisticated models in the future. (Learn more)
A prototype for automatic fact-checking in Wikipedia. We are developing the first open automatic fact-checking API for Wikipedia through natural language inference. We apply state-of-the-art techniques to perform the fact checking task using English Wikipedia as ground truth. Currently, we are working on extending this service to other languages. (Learn more)
Conducting foundational work
Wikimedia projects are created and maintained by a vast network of individual contributors and organizations. We focus part of our efforts on expanding and strengthening this network.
Wiki Workshop. With EPFL Data Science Lab, we co-organized the 8th annual Wiki Workshop, which took place virtually on April 14, 2021 as part of the Web Conference 2021. The annual workshop serves as a platform for the researchers of Wikimedia projects to convene on an annual basis and present their ongoing and completed research projects, brainstorm and build collaborations, and connect with Wikimedia volunteers and Wikimedia Foundation staff. We had 23 accepted papers and more than 150 participants from 5 continents. Yolanda Gil, of the University of Southern California, gave the keynote talk entitled, "Crowdsourcing to Synthesize Scientific Knowledge." Robert West of EPFL facilitated a conversation with Catherine Adeya of the World Wide Web Foundation and Denny Vrandečić of the Wikimedia Foundation focused on "Towards a World Wide Wikipedia, One Step at a Time." (Videos, Papers)
Research Showcases. Since 2013, we have been organizing the Wikimedia Research Showcase, which occurs on the third Wednesday of every month, to showcase the research on Wikimedia projects. These events are broadcast live and recorded for offline view. During the past five showcases we invited speakers on the topics of AI model governance, censorship, curiosity, the value and importance of Wikipedia, and the macro-level analysis of peer-production communities. Our speakers included Danielle Bassett (University of Pennsylvania), Andy Craze (Wikimedia Foundation Machine Learning team), Tiziano Piccardi (EPFL), Margaret Roberts (University of California San Diego), Daniel Romero (University of Michigan), Aaron Shaw (Northwestern University), Nick Vincent (Northwestern University), and Haiyi Zhu (Carnegie Mellon University). (Videos)
Office Hour Series. Since 2020, we have been organizing public monthly research office hours with the goal of supporting the contributors of the Wikimedia projects with their research and data-related questions. Topics discussed have included: projects that participants have been working on (e.g., studying the relationship of Wikipedia and other online communities or assessing the impact of ML models trained on Wikipedia data), the projects and priorities of the Research team at the Wikimedia Foundation, entry points for contributing to research on the Wikimedia projects, and how to work with specific data (edit retention data, clickstream data, deleted article revisions, etc.). The office hours take place on the first Tuesday of each month (16:00-17:00 UTC) and are held via video-call. (Learn more)
Research Award of the Year. In 2021, we started the Wikimedia Foundation Research Award of the Year to recognize recent research that has the potential to have significant impact on the Wikimedia projects. The award was given by Jimmy Wales, the founder of Wikipedia, during Wiki Workshop 2021 to:
- Content Growth and Attention Contagion in Information Networks: Addressing Information Poverty on Wikipedia (Paper)
- Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages and the Masakhane Community (Paper)
Mentorship through Outreachy.
From December 2020 through March 2021, Jesse Amamgbu worked on a model to predict what countries are relevant to a given Wikipedia article. The model starts with country information from Wikidata and expands its labels by inspecting the links in an article. He helped to develop the model, test it, and build an interface for exploring the data and predictions. (Learn more)
In May 2021, Muniza A. began developing a tool for analyzing and visualizing reader navigation on Wikipedia. Existing resources, such as the clickstream dataset, are not easily accessible without substantial technical skills. Muniza has already released a first prototype of an interactive API that allows for exploration of the data, including comparisons across languages. In addition, the API provides additional context for the clickstream data by querying data from other Wikimedia-APIs; for example, it not only shows the absolute number of times a link is clicked but it also assesses the overall importance of a link for driving readers to a specific target-page. (Learn more)
In May 2021, Slavina S. began developing better tutorials for working with Wikimedia data and a new Python library for working with Wikimedia's SQL dumps. SQL dumps contain lots of valuable information about Wikipedia article page links, categories, and beyond. However, they are challenging to process as the data is buried in a lot of SQL syntax and prone to parsing errors. She released the initial version of the Python library, mwsql (akin to the popular mwxml package used for parsing XML dumps), and the tutorials are under development.
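As an illustration of the link-importance idea in the reader navigation tool described above, the following Python sketch computes what share of a target page's incoming traffic each source contributes. It assumes the tab-separated clickstream layout of `prev`, `curr`, `type`, `n` per row. The function name and the sample rows are hypothetical, and this is not the tool's actual implementation.

```python
from collections import defaultdict


def incoming_shares(tsv_text, target):
    """Return {source: share of recorded traffic into `target`}.

    Each non-empty line is assumed to be: prev <TAB> curr <TAB> type <TAB> n
    """
    totals = defaultdict(int)
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        prev, curr, _link_type, count = line.split("\t")
        if curr == target:
            totals[prev] += int(count)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {prev: count / grand_total for prev, count in totals.items()}


# Invented rows for illustration; real counts come from the public dataset.
SAMPLE = (
    "other-search\tJazz\texternal\t600\n"
    "Blues\tJazz\tlink\t300\n"
    "Miles_Davis\tJazz\tlink\t100\n"
    "Jazz\tBlues\tlink\t50\n"
)

if __name__ == "__main__":
    for source, share in incoming_shares(SAMPLE, "Jazz").items():
        print(f"{source}: {share:.0%}")
```

A link with a modest absolute click count can still account for a large share of a small page's traffic, which is why a relative measure like this complements the raw counts.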
Image competition. Since March 2021, we have been working on organizing a scientific competition around the topic of Wikimedia image caption retrieval. We started a collaboration with researchers at Google Research, Naver Labs and Hugging Face. We formulated the task as an image-to-text retrieval problem: competitors will build systems that, given an image, retrieve the closest text from a large pool of words and sentences. We identified Kaggle as the platform for running this task. The competition dataset is based on the WIT dataset, which was recently released by researchers at Google. As part of our efforts towards this competition, we have released image files at 300-px resolution and ResNet-50 embeddings for the images in the WIT dataset.
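The retrieval task can be sketched in a few lines of Python. This sketch assumes that images and candidate texts have already been embedded into a shared vector space by some model. The three-dimensional vectors and captions below are toy values invented for illustration, not the released ResNet-50 embeddings, and cosine similarity is just one plausible choice of closeness measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_captions(image_vec, caption_vecs):
    """Return candidate captions ordered from closest to farthest from the image."""
    return sorted(
        caption_vecs,
        key=lambda caption: cosine(image_vec, caption_vecs[caption]),
        reverse=True,
    )


# Hypothetical embeddings in a shared image-text space.
IMAGE = [1.0, 0.0, 0.2]
CAPTIONS = {
    "a red bridge": [0.9, 0.1, 0.1],
    "a bowl of soup": [0.0, 1.0, 0.0],
    "a city skyline": [0.3, 0.2, 0.9],
}

if __name__ == "__main__":
    print(rank_captions(IMAGE, CAPTIONS))
```

A real system would replace the exhaustive sort with an approximate nearest-neighbor index once the candidate pool grows large.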
TREC Fair Ranking Track. We have been collaborating with organizers from the Text Retrieval Conference (TREC) on a research challenge that is focused on developing models that can fairly rank articles that are relevant to a given English Wikipedia WikiProject. Such models can have various applications in Wikipedia. For example, using them, the WikiProject Jazz editors and organizers would be able to identify a fair-ranked list of relevant Wikipedia articles for their project where fairness can be measured as a function of the representation of different gender identities or geographies of the world. The datasets have been released and participants are working on designing their models. (Learn more)
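One way to read "fairness as a function of representation" is to compare how much attention each group receives in a ranking against a target distribution. The Python sketch below does this with a logarithmic position discount. The discount, the distance measure, the group labels, and the function names are assumptions chosen for illustration and are not necessarily the track's official metric.

```python
import math
from collections import defaultdict


def group_exposure(ranking, group_of):
    """Share of position-discounted exposure received by each group.

    The article at rank r (1-based) contributes 1 / log2(r + 1).
    """
    exposure = defaultdict(float)
    for position, article in enumerate(ranking, start=1):
        exposure[group_of[article]] += 1.0 / math.log2(position + 1)
    total = sum(exposure.values())
    return {group: value / total for group, value in exposure.items()}


def unfairness(ranking, group_of, target):
    """Total variation distance between exposure shares and target shares.

    0.0 means exposure matches the target exactly; larger is less fair.
    """
    shares = group_exposure(ranking, group_of)
    groups = set(shares) | set(target)
    return sum(abs(shares.get(g, 0.0) - target.get(g, 0.0)) for g in groups) / 2


# Hypothetical articles labeled with a geographic group.
GROUP_OF = {"a1": "Europe", "a2": "Europe", "b1": "Africa", "b2": "Africa"}
TARGET = {"Europe": 0.5, "Africa": 0.5}

if __name__ == "__main__":
    print(unfairness(["a1", "a2", "b1", "b2"], GROUP_OF, TARGET))
    print(unfairness(["a1", "b1", "a2", "b2"], GROUP_OF, TARGET))
```

With these toy labels, interleaving the two groups yields a lower score than listing one group entirely ahead of the other, even though both rankings contain the same articles.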
I Encuentro wikimedista sobre lucha contra la desinformación en Wikipedia en español. We participated in the first Wikimedia meeting on the fight against disinformation on Spanish Wikipedia. This event, organized by Wikimedia España, was held in conjunction with esLibre 2021, a Spanish conference for people interested in free open source technologies. As part of the event, we participated in a panel with representatives from Wikimedia Argentina, Wikimedia Chile, Wikimedia Mexico, and Wikimedia España, and presented a preliminary approach to knowledge integrity risk assessment in Wikipedia.
Other student initiatives. We supported the Berkman Klein Center for Internet & Society and Digital Asia Hub in running a Research Sprint with a group of 25 graduate-level students across different disciplines and countries. The sprint focused on the concept of "digital self-determination" -- i.e. how to maintain agency in an online world -- with a session focused on the availability of open knowledge and exercises involving Wikipedia and Wikiversity. (Learn more)
The people on the Research team
In February 2021, we hired Pablo Aragón as a research scientist. Pablo received a PhD in Information and Communication Technologies (AI and ML research group at UPF) and has extensive experience conducting research in projects that require multidisciplinary expertise and collaborations with communities. He has held a research engineer position at the Barcelona Media Foundation, a research scientist position at Eurecat - Technology Centre of Catalonia, and a visiting appointment at the Oxford Internet Institute. Pablo coordinated the Data Analysis for Citizen Participation project at Medialab Prado, and co-founded the Democratic Innovation Lab of Barcelona City Council. Pablo's focus within the Research team is the Improving Knowledge Integrity program and in particular research in the space of misinformation and disinformation.
In June 2021, we hired Emily Lescak as a senior research community officer. Emily earned a PhD in Fisheries from the University of Alaska Fairbanks College of Fisheries and Ocean Sciences and has extensive experience in research, data science, education, and community engagement. She was a National Science Foundation-funded postdoctoral fellow at the University of Alaska Anchorage, a fisheries geneticist at the Alaska Department of Fish and Game, and the program specialist for the Genetics Society of America's Peer Review Training Program. Most recently, she developed The Event Fund at Code for Science & Society, which provides financial and programmatic support to organizers of international open data science events. In our team Emily focuses on the research community building aspect of our work.
The Research team's other staff members include Research Scientists Martin Gerlach and Isaac Johnson, Senior Research Scientists Miriam Redi and Diego Sáez-Trumper, Senior Research Engineer Fabian Kaelin, and the Head of Research, Leila Zia. Our Research Fellow is Bob West.
The Research team's work has been made possible through the contributions of our past and present formal collaborators. With the 2030 Strategic Direction now in place, we expect to build more formal collaborations in the coming months and years to help achieve the direction set in our Research:2030 whitepapers. To this end, we've initiated the following formal collaborations:
- Andreas Vlachos is a senior lecturer in the NLIP group at the University of Cambridge. His research with our team will focus on learning from dispute templates. This project will help us understand how collaboration patterns among Wikipedia editors affect article quality.
- Christine De Kock is a PhD student in the NLIP group at the University of Cambridge. Her research interest is online conversations, specifically how to make disagreements constructive. She will collaborate on Learning from dispute templates.
- Mo Houtti is a PhD student at the University of Minnesota. He is working on Understanding Article Importance, a project aimed at understanding how content might be prioritized in the growing suite of recommender systems on Wikipedia.
- Loren Terveen is a professor of Computer Science and Engineering at the University of Minnesota with a long history of research in social computing and recommender systems. Loren conducts research on Understanding Article Importance.
Research Showcases. Every 3rd Wednesday of the month. Join us remotely for Wikimedia-related research presentations and discussions. The showcases are great entry points to the world of Wikimedia research and staying in touch with other Wikimedia researchers. (Read more)
Research office hours. Every 1st Tuesday of the month. Join us in the Research office hours to have your questions related to Wikimedia data and research answered. All are welcome! (Read more)
Wikimania. August 2021. The sixteenth edition of Wikimania, the largest Wikimedia conference of the year, was held virtually on August 13-17. (Read more)
TREC Fair Ranking competition. February - August 2021. Researchers are working to build information retrieval models that retrieve articles relevant to given WikiProjects and fairly rank them. (Read more)
Wikipedia image competition. September 2021 to December 2021. We are organizing a Kaggle playground competition for Wikipedia Image Caption Matching. Please spread the word and participate! (Read more)
We encourage you to keep in touch with us via one or more of the methods listed in the Keep in touch section to receive more information about these and other events.
Trends to watch
We're keeping an eye on significant trends that relate to the Wikimedia projects and the broader ecosystem in which Wikimedia operates:
Research Grants. We are supporting the Community Resources team in launching a Research Fund to support researchers working to address high priority questions related to the sustainability and accessibility of Wikimedia projects. We anticipate opening the call for proposals in fall 2021.
Webrequest logs or Clickstream data? We frequently receive requests for access to the webrequest logs from researchers who want to learn from or about Wikipedia readers' navigation. In the past, we have granted access when we found strict alignment between the direction of the Research team or the Wikimedia Foundation and the interests of researchers in academia or industry, and we have turned down many requests with a heavy heart. As part of building a deeper understanding of reader navigation, we learned that for many research purposes, the navigation of readers observed on Wikipedia can be sufficiently approximated from the clickstream dataset. We showed that differences between results using webrequest logs and the clickstream data, while statistically significant, are small in terms of effect size. This is a significant finding, as it allows us to more confidently point researchers interested in studying reader navigation on Wikipedia to the public clickstream dataset.
Keep in touch with us
The Wikimedia Foundation's Research team is part of a global network of researchers who study Wikimedia projects. We invite everyone who wants to stay in touch with this network to join the public wiki-research-l mailing list and follow @WikiResearch, which is the research community's Twitter handle.