Executive Summary

Welcome! We are the Wikimedia Foundation's Research team. We turn research questions into publicly shared knowledge. We design and test new technologies, produce empirical insights to support new products and programs, and publish research that informs the Wikimedia Foundation’s and the Movement’s strategy. We help to build a strong and diverse community of Wikimedia researchers globally. This Research Report is an overview of our team’s latest developments — an entry point that highlights existing and new work, and details new collaborations and considerations, including trends that we’re watching.

Between June and December 2022, we worked with other staff at the Wikimedia Foundation, our formal collaborators, Wikimedia affiliates, and Wikimedia volunteers to address knowledge gaps on the Wikimedia projects, improve knowledge integrity, and conduct foundational work. We initiated new research on key fronts, including gaining a deeper understanding of content contributions on the Wikimedia projects and developing machine learning models to support content patrolling. We developed new metrics for the Knowledge Gap Index and the Knowledge Integrity Risk Observatory. We continued to develop algorithms for copyedit detection in multiple languages, image recommendations for Wikipedia articles, and more. We launched the second Wikimedia Research Fund and have begun planning a special 10th edition of Wiki Workshop, our major research community event, to be held in 2023.

As you read about our work below, we invite you to learn more through links to collaborative projects and other key programs, including events that we are organizing or attending. The Research team's work is always in progress. That is why we publish these Research Reports: to keep you updated on what we think is important to know about our most recent work and the work planned for the next six months.

Projects

  • Addressing knowledge gaps

    We aim to help Wikimedia projects engage larger and more diverse groups of editors, identify and address missing content, and reach more readers across the globe. The Wikimedia projects are a form of discourse that we want everyone to share in. Who writes the knowledge matters. Who reads the knowledge matters. What’s missing from that knowledge matters. That’s our challenge as we help address Wikimedia projects’ knowledge gaps.

    [New] A deeper understanding of content contributions. Content creation is often viewed as synonymous with edits on Wikipedia. However, many (productive) edits do not add new facts, but instead aim to update existing information or curate articles through adding categories or other annotations. This project seeks to develop methods for assigning edits to Wikipedia articles to high-level categories of contribution types, which will enable better understanding of gaps and challenges to content maintenance. This builds on our work to develop a framework and Python package for identifying the specific actions taken within a given Wikipedia edit. (Learn more)
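The idea of mapping edits to coarse contribution types can be illustrated with a toy sketch. The category names, diff schema, and rules below are invented for illustration only; the actual framework and Python package analyze full wikitext diffs and use a far richer set of actions.

```python
# Toy sketch (hypothetical categories and rules): assign a Wikipedia edit
# to a coarse contribution type from a simplified diff summary.
# The real framework works on full wikitext diffs; this only shows the idea.

def classify_edit(diff):
    """diff: dict of counts of changed elements (hypothetical schema)."""
    if diff.get("categories_added", 0) or diff.get("templates_added", 0):
        return "curation"           # annotating/organizing, no new facts
    if diff.get("sentences_added", 0) > diff.get("sentences_changed", 0):
        return "content addition"   # new information added
    if diff.get("sentences_changed", 0):
        return "content update"     # existing facts revised
    return "other"

edits = [
    {"categories_added": 2},
    {"sentences_added": 3, "sentences_changed": 1},
    {"sentences_changed": 4},
]
print([classify_edit(e) for e in edits])
# ['curation', 'content addition', 'content update']
```

Grouping edits this way is what would let gap analyses distinguish maintenance work (curation, updates) from the addition of genuinely new content.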

    An improved readership experience through better new user onboarding. We continued our collaboration with the Growth team to improve new editor retention through Structured Tasks. We supported the Machine Learning team in the deployment of our link recommendation model for the add-a-link structured task. The task has been made available for 35 additional Wikipedia languages with ongoing work on an additional 17 languages. Editors have made 245k edits via the add-a-link task across 56 Wikipedia languages.

    We extended research on the copyedit structured task. We created a sample of copyedits for four pilot wikis (Arabic, Bengali, Czech, Spanish) for manual evaluation by ambassadors. Our results show that LanguageTool’s suggestions were judged to be correct in ~50% of cases, whereas the accuracy of the suggestions from spellcheckers was much lower. We are using this feedback to refine the copyedit suggestions for a second round of manual evaluation. (Learn more)

    A model to increase the visibility of articles. We extended the add-a-link structured task to the recommendation of links to orphan articles, i.e., pages without any incoming links. We evaluated our link translation-based model and predicted new incoming links to orphan articles in 183 Wikipedia languages. Our model generates substantially better suggestions than different baselines, such as tools readily available to editors (morelike or findlink) or state-of-the-art ML models (graph embeddings). The model performs particularly well in smaller Wikipedias (based on the number of articles) and when surfacing a small number of suggestions. (Learn more)
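As a minimal illustration of what makes an article an "orphan", the sketch below finds pages with no incoming links in a toy link graph. The page names and links are invented; the actual model goes much further, ranking candidate source articles from which new links could be added.

```python
# Minimal sketch: identify "orphan" articles (no incoming links) in a wiki
# link graph. Page names and links here are illustrative, not real data.

def find_orphans(pages, links):
    """links: iterable of (source, target) pairs within the wiki."""
    targets = {target for _, target in links}
    return sorted(p for p in pages if p not in targets)

pages = ["A", "B", "C", "D"]
links = [("A", "B"), ("B", "C"), ("C", "B")]
print(find_orphans(pages, links))  # ['A', 'D'] — no article links to them
```

Because no other article links to an orphan, readers can effectively only reach it through search, which is why surfacing incoming-link suggestions increases its visibility.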

    Metrics to measure knowledge gaps. We continued research and development to measure the extent of Wikimedia knowledge gaps.

    We refined the data pipelines for the productization of our content gap metrics (gender, sexual orientation, geography, and time gaps) and documented the corresponding output datasets. We developed and started the productization of a metric for the multimedia gap, which measures the number of Wikipedia articles that are illustrated. Some of these measurements have been adopted as part of the first efforts towards organization-level metrics that matter.

    We developed and evaluated a model to assign readability scores to articles in six languages using only language-agnostic features. The model is trained on articles with different readability levels in English and can be applied without fine-tuning to other languages (since for most languages, there is a lack of annotated ground truth data). We found that while our model is slightly less precise in English (compared to existing readability formulas such as Flesch reading ease), it generalizes much better to other languages. (Learn more)
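For context on the English-only baseline mentioned above, the Flesch reading ease score is computed as 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word); higher scores mean easier text. The sketch below implements it with a crude vowel-group syllable heuristic (the real formula assumes proper syllable counts); it is shown only to make concrete why such formulas do not transfer across languages, which is what motivates the language-agnostic model.

```python
import re

def flesch_reading_ease(text):
    """Flesch reading ease: an English-only readability formula.
    Syllables are approximated by counting vowel groups — a crude heuristic."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

easy = "The cat sat. The dog ran. It was fun."
hard = ("Institutional epistemology necessitates "
        "multidimensional interpretative frameworks.")
print(flesch_reading_ease(easy) > flesch_reading_ease(hard))  # True
```

The formula's constants were calibrated for English sentence and syllable structure, so applying it to, say, agglutinative languages gives misleading scores; a model built on language-agnostic features avoids that dependency.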

    We began to develop a metric for the structured data gap for Wikipedia. This metric would capture how much machine-readable content is available for a given article, which is an important aspect of making articles interpretable and discoverable by tools. Our initial focus is on Wikidata items, which serve as the main body of structured data associated with any given Wikipedia article. We are building the metric based on a prior initiative for evaluating Wikidata item quality. (Learn more)
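A crude sketch of what such a metric could look like is below. The "expected" property set and the scoring rule are invented for illustration (the property IDs themselves are real Wikidata conventions); the actual metric builds on the prior Wikidata item-quality work cited above.

```python
# Hypothetical sketch: a crude "structured data gap" score for an article,
# based on how many expected Wikidata-style properties its item fills in.
# The expected set and scoring rule are invented for illustration.

EXPECTED = {"P31", "P18", "P625", "P17"}  # instance of, image, coords, country

def structured_data_score(item_statements):
    """item_statements: dict of property ID -> value on the Wikidata item."""
    present = EXPECTED & set(item_statements)
    return len(present) / len(EXPECTED)

item = {"P31": "Q515", "P17": "Q183"}  # a city item lacking image/coordinates
print(structured_data_score(item))  # 0.5
```

A score like this makes the gap actionable: low-scoring items indicate articles whose content is hard for tools to interpret or surface, even if the article text itself is complete.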

    We are continuing to develop and improve upon a prototype tool to surface and explore knowledge gaps measurements, which we hope to make publicly available in the coming months. [Note: We experienced research engineering capacity reduction in our team during the period of this report and, as a result, the work on this front went more slowly than originally anticipated.]

    A deeper understanding of the role of visual knowledge in the Wikimedia projects. We found that the link click-through rate for sections with images is higher than for unillustrated sections. We are piloting a crowdsourcing experiment based on Wikispeedia that will help us understand whether images help readers navigate across articles. (Learn more)

    We found that the presence of images alongside article text helps with some visual learning tasks, especially when images are of high quality. (Learn more)

    A model for image recommendation. We obtained new results from our collaboration with the Growth team on the "add-an-image" structured task, powered by an image recommendation algorithm developed by our team. As of September 1st, 14,291 images were added through the task across eight wikis, with a revert rate of 9%. The algorithm has been used to generate recommendations for new product features based on image suggestions for experienced users, developed by the Structured Data Across Wikis team. We shifted our efforts towards developing an algorithm that uses our previous work on Section Alignment to discover relevant images for sections, giving recommendations to editors to visually enrich existing articles. (Learn More)

    A deeper understanding of reader navigation. Our study on how readers browse Wikipedia was accepted at ACM Transactions on the Web. We completed an analysis on temporal patterns in how Wikipedia articles are accessed by readers. We found that articles from different topics are read at different times throughout the day, which helps us understand the diversity of information needs of readers. (Learn more)

    A unified framework for equitable article prioritization. We began an online experiment focused on understanding the balance between personalization and equity in edit recommenders on Wikipedia. The experiment is being conducted with SuggestBot, a personalized edit recommendation system available on several language editions of Wikipedia. It builds on our previous offline analyses of SuggestBot (and other recommender systems), which indicated that aspects such as gender and geography can affect whether a user edits the content. The results will help inform the strategy for prioritizing articles taken by the growing number of recommender systems that help Wikimedia editors find tasks to complete. (Learn more)

    Models for content tagging. With the support of the Machine Learning team, we deployed our language-agnostic topic classification model to the Machine Learning platform. We are now working with the Growth team to connect the model to their Newcomer Tasks module so that new editors across all languages of Wikipedia can easily filter edit recommendations to articles about topics that match their interests. This new model will resolve geographic biases that arose from using topics derived only from English Wikipedia articles.

    We started exploring the main obstacles to our ability to build equitable content tagging models. We focused on the concept of data gaps: community processes that we would like to better support through our models but that lack the high-quality data needed to do so. We identified templates and logging on Wikimedia as two highly important forms of data for researchers that require extensive post-processing to extract useful labeled data for training models. This makes modeling less transferable across languages and poses a major barrier to supporting more Wikimedia communities. (Learn more)

    We are expanding our tooling for enabling researchers to use Wikimedia content more easily and equitably in multiple languages within natural language processing workflows. This project will focus on standardizing the pre-processing of Wikimedia data to better support sentence and word tokenization across Wikipedia languages. This will support the deployment of our models and content gap metrics, and execute on recommendations we made in a Wiki-M3L paper on considerations for multilingual Wikimedia Research. (Learn more)

  • Improving knowledge integrity

    We help Wikimedia communities assure the integrity of knowledge on projects by conducting research and developing and testing technologies that can help editors detect Wikipedia policy violations more effectively.

    [New] Enhanced models for content patrolling. In collaboration with the Machine Learning team, we are creating a new service to help patrollers detect revisions that might be reverted. Our goal is to create a model that scores revisions based on their revert risk in all Wikipedia languages. We developed the first version of a language-agnostic model, which has been deployed to production in our Machine Learning platform. The model shows an accuracy of 80% on balanced data, an improvement of 14% over the baseline. We are working on a multilingual language model for the same task, with early experiments showing good results on revert risk detection on unbalanced data. We will be integrating both models to have a single API endpoint for revert risk scoring. (Learn More)
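To make the idea of "language-agnostic revert risk" concrete, here is a toy scorer over features that do not depend on the language of the edit. The feature names, weights, and logistic form are all invented for illustration; the production model and its deployment on the Machine Learning platform are far more sophisticated.

```python
import math

# Illustrative only: a toy "language-agnostic" revert-risk scorer.
# Features avoid the text itself (they would work in any language);
# the names and weights below are invented, not the production model's.

WEIGHTS = {"is_anonymous": 1.2, "bytes_removed_frac": 2.0,
           "editor_tenure_log": -0.6}
BIAS = -1.0

def revert_risk(features):
    """Return a probability-like score that the revision gets reverted."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))  # logistic squashing to (0, 1)

newcomer_blanking = {"is_anonymous": 1.0, "bytes_removed_frac": 0.9,
                     "editor_tenure_log": 0.0}
veteran_addition = {"is_anonymous": 0.0, "bytes_removed_frac": 0.0,
                    "editor_tenure_log": 3.0}
print(revert_risk(newcomer_blanking) > revert_risk(veteran_addition))  # True
```

Because none of these signals require parsing the edited text, a single model of this shape can score revisions across all Wikipedia languages, which is precisely what makes the language-agnostic approach attractive for patrolling at scale.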

    A spambot detection model. We presented the results of our predictive models of spambot activity during the Monthly Stewards / WMF meeting in June 2022. Stewards concluded that the most efficient way to take advantage of this research would be to incorporate such models into existing machine learning services that support patrolling. As a result, the latest efforts of this project have focused on extending the spambot datasets with the full content of deleted revisions to make them available for building models as part of the new machine learning service to help patrollers currently under development. (Learn more)

    Wikipedia Knowledge Integrity Risk Observatory. We completed the first version of the multi-dimensional observatory by integrating metrics on article quality distribution for vital articles and on seniority distribution of editors. We started a collaboration with the Trust & Safety Disinformation team to build a Knowledge Integrity Risk Composite Index that will provide the team with actionable information to carry out preventive workflows. (Learn more)

    A project to help develop critical readers. We performed a study on understanding curiosity of readers using the knowledge networks constructed during information seeking. We focused on reproducing the findings from a previous study showing the existence of at least two types of knowledge networks associated with curiosity. While that study was restricted to only 149 participants, we find that similar knowledge networks can be found in the larger population of Wikipedia readers, allowing us to generalize the conceptual framework to characterize readers’ curiosity. (Learn more)

    A model for understanding knowledge propagation across Wikimedia projects. To understand how content quality in one project can impact quality in other languages, we developed a model to compute revision quality across languages, extending our previous work on article quality prediction. Our results show a general improvement in article quality over time across all languages; however, stubs are still the most frequent category. We are comparing our predicted scores with manually labeled quality scores. This analysis will help us understand the limitations of our model, and the latency of manual labels with respect to actual changes in article content. (Learn More)

    A better understanding of reference quality in English Wikipedia. We found that reference quality has improved during the past few years, with a positive impact of community efforts such as the Perennial Source List. Our results suggest that collaboration between different types of users is fundamental to improve overall citation quality. (Learn More)

    A sockpuppet detection model. The model service and user interface are complete, but deployment is on hold until there is sufficient technical resourcing to reliably support the service long-term.

  • Conducting foundational work

    Wikimedia projects are created and maintained by a vast network of individual contributors and organizations. We focus part of our efforts on expanding and strengthening this community.

    [New] A Wikimedia Research course. We are developing a course on the WikiLearn platform intended for early-stage researchers who are interested in contributing to the Wikimedia projects. We anticipate that the course will be released in the second half of 2023. (Learn more)

    [New] TREC 2023 AToMiC Track. To continue fostering multimedia research, in 2023 we will co-organize the AToMiC (Authoring Tools for Multimedia Content) Track at the 2023 Text Retrieval Conference (TREC), in collaboration with researchers from Google, Naver Labs, and the University of Waterloo. We created tasks for researchers to design scalable, diverse, and reproducible multimedia retrieval models based on English Wikipedia data. We are now working on the dataset release for the track.

    Wiki Workshop. We are in the early stages of planning our 10th Wiki Workshop! The 2023 edition will be a stand-alone, virtual event. We are working on the details and you can follow updates here.

    Research Showcases. Our July Showcase featured this year’s Research Award of the Year winners. We did not host showcases in August or September due to Wikimania and our team’s offsite. To celebrate Wikidata’s 10th birthday, our October Showcase featured a discussion focused on the past, future, and key insights from Wikidata. The November Showcase focused on libraries and Wikimedia knowledge. Our final showcase of 2022 will feature research and initiatives led by the WMF Research team that are focused on the research community. (Learn more)

    Office Hour Series. Office Hours have been on hold over the past six months as we focused our efforts on the Research Fund and the development of a research course on WikiLearn.

    Research Fund. We are excited about the work of our first cohort of grantees. The submission deadline for this year’s request for proposals is December 16, 2022. (Learn more)

    Research Award of the Year. Our July Showcase featured this year’s recipients of the Research Award of the Year. We invite you to nominate one or more scholarly research publications for next year’s award. The research must be about, use data from, and/or be of importance to Wikipedia, Wikidata, Wikisource, Wikimedia Commons, or other Wikimedia projects. Nominated publications must be available in English and have been published between January 1, 2022 and December 31, 2022. Submit your nominations by February 6, 2023.

    TREC Fair Ranking Track. We wrapped up our participation in TREC 2022 with presentations from the five teams who submitted results for the Fair Ranking Track, which focused on supporting WikiProjects by building diverse lists of relevant articles. This was the second year for this iteration of the track, for which we expanded from two fairness criteria (gender and geography) to nine. A final report will be published in a few months.

    Presentations and keynotes. We engaged with research audiences through the following presentations and keynotes during the past six months.

    In July, we gave a webinar to the Data Umbrella community focused on how data scientists can contribute to the Wikimedia projects. We shared examples of tools and datasets that contributors can access and how to find projects to contribute to. (Video)

    Also in July, we participated in the Wikidata Data Quality days by giving a presentation with Lydia Pintscher on "Controversies in Wikidata". (Slides)

    In September, we attended the Dagstuhl Seminar on Challenges and Opportunities of Democracy in the Digital Society. Dagstuhl Seminars are research events organized by the Leibniz Center for Informatics to bring together leading experts in specific fields of computer science. Through participation in this seminar, we were able to enrich existing narratives on disinformation with ideas from our research on knowledge integrity in Wikipedia.

    Also in September, we gave a keynote at the Content-based Multimedia Indexing (CBMI) conference, an international forum for researchers interested in the latest multimedia retrieval, classification and visualization technologies. We talked about challenges and opportunities for multimedia retrieval technologies in free knowledge ecosystems, and their importance for Wikipedia and its sister projects.

    In October, we participated in Decidim Fest, a conference in which researchers and practitioners discussed challenges at the intersection of technology and democracy. We shared learnings from our program on improving knowledge integrity in a panel on strategies against disinformation with academics from the University of Washington and the University of Barcelona. (Video)

    In November, we participated in Leading the Future of AI and Public Archives, a virtual workshop focused on the importance of ethical frameworks for artificial intelligence in cultural institutions. We shared our approach to incorporating algorithms into tools to support community processes and some of the principles that guide our work. (Slides)

    In December, we were invited to give a talk in the Responsible AI track at the Korean AI Summit 2022. We discussed the challenges and opportunities of using machine learning to support and strengthen Wikipedia communities. We were also invited by our formal collaborators from the Institute for Basic Science (IBS) to give a talk at the Data Science Lab.

    Mentorship through Outreachy. In August, Nazia Tasnim completed her internship. Her work culminated in the release of mwparserfromhtml, a Python library that simplifies the process of working with the new Wikipedia HTML dumps. It is a work in progress and we welcome contributions. You can visually explore how the library works in practice in this interface.

    Mentorship through internships.

    In August, Mo Houtti wrapped up his internship focused on the equitable article prioritization project. He was able to use the internship to launch a recommender system experiment. He extended the SuggestBot code base to support the experiment logic and identified articles on English Wikipedia that relate to different facets of equity so we can understand how they affect editors’ interactions with recommendations.

    In August, Shobha S V finished her internship. She wrote up case studies of community-based multilingual research, which build on preliminary recommendations of good research practices in this space. She also completed an analysis of the first round of the Research Fund, which informs our outreach strategy for this year’s request for proposals.

    Effectiveness of Wiki Loves Monuments campaigns. We continued our support for the Wiki Loves Monuments community to better understand the effectiveness of their banners in encouraging readers to participate. We found that fewer than 1% of readers click on the banner to visit the Wiki Loves Monuments landing page. Our results suggest that the design of the landing page influences the rate at which visitors continue to contribute. (Learn more)

The people on the Research team

In October 2022 we promoted Isaac Johnson and Martin Gerlach to Senior Research Scientist roles.

Isaac joined us in October 2018 as a research scientist and has been leading new areas of research and supporting cross-team projects. He led the development of multilingual content tagging models — classifiers that automatically categorize Wikipedia edits or articles by topic — in any language. By building user-facing tools and APIs, and writing extensive documentation and strategic directions, he brought this work closer to our communities and created a culture for researchers to implement similar machine learning technologies focused on knowledge equity. Isaac has also stepped up to contribute to the department's differential privacy efforts.

Martin joined us in September 2019 as a research scientist. Over the last three years, he has led the Research team's efforts around understanding readers through large-scale data analysis, which will have a long-term impact on how we measure readership. He also developed a link recommendation algorithm, which was integrated in the add-a-link structured task and allowed the scaling up of recommendations for links to be added to Wikipedia by newcomers.

In September 2022, Head of Research Leila Zia became an affiliate of the Berkman Klein Center for Internet and Society. In this new capacity, she looks forward to further exchange of ideas and collaborations that can help expand the Wikimedia research communities and advance our understanding of the Wikimedia projects.

The Research team’s other staff members include Research Scientist Pablo Aragón, Senior Research Scientist Diego Sáez-Trumper, Senior Research Engineer Fabian Kaelin, Senior Research Community Officer Emily Lescak, and Research Manager Miriam Redi. Our Research Fellow is Bob West.

Collaborations

The Research team’s work has been made possible through the contributions of our past and present formal collaborators. To inquire about becoming a formal collaborator, please tell us more about yourself and your interests.

Events

  1. Research Showcases

    Every 3rd Wednesday of the month (virtual)
    Join us for Wikimedia-related research presentations and discussions. The showcases are great entry points into the world of Wikimedia research and for connecting with other Wikimedia researchers. Upcoming topics include editor retention, the free knowledge ecosystem, and gender and equity. Learn more
  2. Research Office Hours

    In 2023, we will replace our traditional monthly office hours with two modes of engagement. Researchers can book 1:1 consultation sessions with a member of the research team to discuss their work, ask questions, or learn more about our community initiatives. In addition, we will host quarterly group office hours centered on a theme. Learn more
  3. Wiki Workshop

    More information about Wiki Workshop 2023 will be available in the next few months. Stay tuned!

We encourage you to keep in touch with us via one or more of the methods listed in the Keep in touch section to receive more information about these and other events.

Donors

Funds for the Research team are provided by donors who give to the Wikimedia Foundation, and by a grant from the Argosy Foundation. Thank you!

Keep in touch with us

The Wikimedia Foundation's Research team is part of a global network of researchers who study Wikimedia projects. We invite everyone who wants to stay in touch with this network to join the public wiki-research-l mailing list. You can follow us on Twitter and Mastodon.

Previous Report

Research Report Nº 6