Wikimedia Research — Research Report Nº 6

Research Report Nº 6

August 4, 2022

The sixth in a series of biannual reports from Wikimedia Research, published every June and December.

Executive Summary

Welcome! We're the Wikimedia Foundation's Research team. We turn research questions into publicly shared knowledge. We design and test new technologies, produce empirical insights to support new products and programs, and publish research that informs the Wikimedia Foundation's and the Movement's strategy. We help to build a strong and diverse community of Wikimedia researchers globally. This Research Report is an overview of our team's latest developments – an entry point that highlights existing and new work, and details new collaborations and considerations, including trends that we're watching.

Between January and June 2022, we worked with other staff at the Wikimedia Foundation, our formal collaborators, Wikimedia affiliates, and Wikimedia volunteers to address knowledge gaps on the Wikimedia projects, improve knowledge integrity, and continue building the foundation for a more diverse and global network of Wikimedia researchers. We published an updated roadmap for Addressing Knowledge Gaps. We co-organized two major research community events: Wiki Workshop 2022 and the Wiki-M3L workshop. We announced the Wikimedia Foundation Research Award of the Year winners. We funded 9 research applications through Wikimedia's Research Fund. And we continued developing models such as a section alignment model across more than 100 languages, language-agnostic readability scores for Wikipedia articles, and more.

As you read about our work below, we invite you to learn more through links to collaborative projects and other key programs, including events that we are organizing or attending. The work that the Research team does is a work in progress. That is why we are publishing these Research Reports: To keep you updated on what we think is important to know about our most recent work and the work that will happen in the following six months.

Projects

Addressing knowledge gaps
We aim to help Wikimedia projects engage more editors, increase the diversity of editors, identify and address missing content, and reach more readers across the globe. The Wikimedia projects are a form of discourse that we want everyone to share in. Who writes the knowledge matters. Who reads the knowledge matters. What's missing from that knowledge matters. That's our challenge as we help address Wikimedia projects' knowledge gaps.
An updated roadmap for addressing knowledge gaps. In February 2019, we published three white papers outlining the plans and priorities of the Research team in response to the 2030 Wikimedia Movement strategic direction. In October 2021, we revisited the white paper on Knowledge Gaps in light of what we had learned, discovered, and developed over the past three years. In May 2022, we released an updated research roadmap (paper, blog post) for the Addressing Knowledge Gaps program, where we introduced the guiding principles for our research, three main research directions (identify, measure, and bridge knowledge gaps), and ideas for future research. (Learn more)

An improved readership experience through better new user onboarding. We continued our collaboration with the Growth team to improve new editor retention through Structured Tasks. Today, the add-a-link structured task (based on the add-a-link model) is successfully deployed in 21 Wikipedia languages (Phabricator task) with ongoing work around the next set of 17 languages (Phabricator task). We also handed over the maintenance of the add-a-link model to the Machine Learning Team and supported improvements to the model requested by the community, such as avoiding links in certain sections (Phabricator task), avoiding links to first names (Phabricator task), or avoiding overlinking (Phabricator task). The hand-over of the model to the ML team is a major milestone for this work on our end, as it indicates the maturity of the model and of the overall infrastructure that supports research-to-product innovation.

We continued research on another structured task focused on copyediting. After performing a literature review on existing approaches for automatic spell- and grammar-checking, we identified LanguageTool as a candidate to surface copyedits to newcomers. We evaluated its performance in error detection, showing that LanguageTool can detect a high volume of genuine copyedit errors in Wikipedia articles. (Learn more)

A model to increase the visibility of articles. We continued research on extending the add-a-link structured task, with the aim of increasing the value of the added links by recommending links to orphan articles, i.e. pages without any incoming links. We found 8.4M orphan articles across all Wikipedias, corresponding to 14.6% of all 57M articles. We developed a model to recommend links based on link translation (i.e. links that already exist in another language version), which would allow us to de-orphanize 59% of all orphan articles. (Learn more)
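The idea of orphan detection and link translation can be illustrated with a minimal sketch. It is not the team's model: it assumes articles are identified by language-independent IDs (for example Wikidata items), and the function names, the toy data, and the in-memory link sets are all hypothetical.

```python
def find_orphans(articles, links):
    """Return the articles that no other article links to.

    `links` is an iterable of (source, target) pairs within one language.
    """
    linked = {dst for src, dst in links if src != dst}
    return set(articles) - linked


def recommend_by_link_translation(orphan, lang, articles_by_lang, links_by_lang):
    """Suggest source articles in `lang` that could link to `orphan`.

    A source is suggested when it links to the orphan in some other
    language version and also exists in `lang` without that link.
    """
    local_articles = articles_by_lang[lang]
    local_links = links_by_lang[lang]
    suggestions = set()
    for other_lang, links in links_by_lang.items():
        if other_lang == lang:
            continue
        for src, dst in links:
            if (dst == orphan and src != orphan
                    and src in local_articles
                    and (src, dst) not in local_links):
                suggestions.add(src)
    return sorted(suggestions)


# Toy data (hypothetical): IDs stand in for language-independent concepts.
articles_by_lang = {
    "en": {"Q1", "Q2", "Q3"},
    "de": {"Q1", "Q2", "Q3", "Q4"},
}
links_by_lang = {
    "en": {("Q1", "Q2")},
    "de": {("Q1", "Q2"), ("Q1", "Q3"), ("Q2", "Q3"), ("Q4", "Q3")},
}

orphans = find_orphans(articles_by_lang["en"], links_by_lang["en"])
suggestions = recommend_by_link_translation(
    "Q3", "en", articles_by_lang, links_by_lang
)
print(sorted(orphans), suggestions)
```

In this toy example Q4 is not suggested as a source because it has no English version, which is one reason why link translation alone cannot de-orphanize every article.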

Metrics to measure knowledge gaps. We continued research and development to measure the extent of Wikimedia knowledge gaps. We worked on productionizing four content gap metrics and released datasets with gender, sexual orientation, geography, and time gap measurements across all Wikipedia languages. In collaboration with the Product Design Strategy team and a data visualization firm, we have developed a prototype tool to surface and explore knowledge gap measurements. We intend to make the tool available to the public in the coming months.

Additionally, we continued our efforts to measure the readability of Wikipedia articles. While there is an abundance of readability formulae for English and of one-off methods for individual languages, our primary challenge is to develop an approach that can scale to the 300+ Wikipedia language editions. Inspired by previous research, we started to develop a model that assigns readability scores to articles relying exclusively on language-agnostic features, representing sentences as sequences of entities using the open-source entity-linking tool DBpedia Spotlight. (Learn more)

A deeper understanding of the role of visual knowledge in Wikimedia Projects. [Note: We experienced research scientist capacity reduction in our team during the period of this report and as a result, the research on this front went more slowly than originally anticipated.]

We published a position paper on the importance of images on Wikipedia in africa.com and ArtAfrica magazine.

To support better understanding of image usage and reuse across the Wikimedia projects, we developed a tool for large-scale image similarity detection. Given an image name or URL, the tool retrieves similar images from a dataset of around 7 million images from Wikimedia Commons. (Learn more)

A model for image recommendation. We have seen the first results of our collaboration with the Growth team for the "add-an-image" structured task, powered by an image recommendation algorithm developed by our team. As of May 1st, 5,750 images were added through the task across eight wikis (Spanish, Arabic, Persian, Turkish, French, Bengali, Czech, Portuguese), with a revert rate of less than 8.5%.

A deeper understanding of reader navigations. We published several papers on characterizing reader navigation. We presented the paper "Wikipedia Reader Navigation: When Synthetic Data Is Enough" at WSDM 2022 as well as another paper "Going down the Rabbit Hole: Characterizing the Long Tail of Wikipedia Reading Sessions" at Wiki Workshop 2022. We published a pre-print, "A Large-Scale Characterization of How Readers Browse Wikipedia", which is under review. We presented a summary of the results of the research in this space at Wikimedia Foundation's Monthly Staff meeting in April 2022. (Slides)

A unified framework for equitable article prioritization. We extended our analysis of how recommender systems impact content equity to include SuggestBot, a volunteer-run personalized edit recommender used predominantly by well-established Wikipedia editors. This analysis builds on our prior findings by identifying which aspects of an article topic appear to be most relevant to whether the editor actually takes the recommendation and edits the article. These findings can set the stage for experimentation that can inform the Wikimedia movement about how to best align editor interests with the movement's content equity goals. (Learn more)

A tool for automatic identification of edit actions. We published a Python package that can summarize the changes made by edits on Wikipedia – e.g., how many references were added or words changed. This package is undergoing enhancements but ready to be used. We encourage researchers and developers to try it out and leave feedback. (Learn more and try it out!)
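As a rough illustration of what summarizing an edit can mean, the sketch below compares two wikitext revisions and counts references and words added or removed. It is not the published package or its API: the function names and regular expressions are assumptions made for this example, and real wikitext needs a proper parser.

```python
import re
from collections import Counter

# Opening <ref> tags, including self-closing reuse such as <ref name="a" />.
# Does not match <references />.
REF_OPEN = re.compile(r"<ref(?:\s[^>]*)?/?>", re.IGNORECASE)
# Whole reference elements, removed before counting words in the prose.
REF_ELEMENT = re.compile(
    r"<ref(?:\s[^>]*)?/>|<ref(?:\s[^>]*)?>.*?</ref>",
    re.IGNORECASE | re.DOTALL,
)
WORD = re.compile(r"\w+")


def count_references(wikitext):
    """Count reference tags in a revision."""
    return len(REF_OPEN.findall(wikitext))


def count_words(wikitext):
    """Return a bag of lower-cased words, ignoring reference contents."""
    prose = REF_ELEMENT.sub(" ", wikitext)
    return Counter(WORD.findall(prose.lower()))


def summarize_edit(old_wikitext, new_wikitext):
    """Summarize an edit as counts of references and words changed."""
    old_refs = count_references(old_wikitext)
    new_refs = count_references(new_wikitext)
    old_words = count_words(old_wikitext)
    new_words = count_words(new_wikitext)
    return {
        "references_added": max(0, new_refs - old_refs),
        "references_removed": max(0, old_refs - new_refs),
        # Counter subtraction keeps only positive counts.
        "words_added": sum((new_words - old_words).values()),
        "words_removed": sum((old_words - new_words).values()),
    }


old = "Paris is the capital of France."
new = ("Paris is the capital and largest city of France."
       "<ref>Example source</ref>")
summary = summarize_edit(old, new)
print(summary)
```

A bag-of-words comparison like this ignores word order and moved text, so it should be read as an approximation of the kind of summary described above.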

Models for content tagging. We are developing a language-agnostic quality model for Wikipedia, which allows us to assign quality scores to any Wikipedia article and track how quality changes for different topical areas (a key challenge to measuring knowledge gaps). The model is based on features extracted directly from an article's wikitext, so it can be applied to any historical revision. (Learn more and try it out!)
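To show why wikitext-only features apply to any revision, here is a minimal sketch that derives a few structural counts from raw wikitext. The choice of features and the function name are assumptions made for illustration; the sketch does not reproduce the actual model's feature set or how it turns features into a score.

```python
import re

REF_OPEN = re.compile(r"<ref(?:\s[^>]*)?/?>", re.IGNORECASE)
HEADING = re.compile(r"^(={2,6})[ \t]*[^=\n]+?[ \t]*\1[ \t]*$", re.MULTILINE)
LINK_TARGET = re.compile(r"\[\[([^\[\]|]+)")


def extract_structural_features(wikitext):
    """Count structural elements that do not depend on the language."""
    targets = [t.strip() for t in LINK_TARGET.findall(wikitext)]
    return {
        "length": len(wikitext),
        "references": len(REF_OPEN.findall(wikitext)),
        "headings": len(HEADING.findall(wikitext)),
        # Targets without a namespace prefix are treated as article links.
        "article_links": sum(1 for t in targets if ":" not in t),
        # Prefixed targets cover media, categories, and similar pages.
        "namespaced_links": sum(1 for t in targets if ":" in t),
    }


revision = """'''Example''' is a [[city]] in [[Exampleland|a country]].<ref>Source A</ref>

== History ==
It was founded long ago.<ref name="b">Source B</ref>

[[Category:Examples]]
"""
features = extract_structural_features(revision)
print(features)
```

Because the function only needs the text of a revision, the same call works for the current version of an article and for any earlier one.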

A section alignment model for more productive translations. Based on our previous research on cross-lingual section alignment, we have developed a new pipeline to create section mappings across more than 100 languages. The output of this research and development is now included in the Content Translation Tool, and is helping the Wikipedia community to expand existing articles through the section translation tool. (Learn more)
Projects

Improving knowledge integrity
We help Wikimedia communities assure the integrity of knowledge on projects by conducting research and developing and testing technologies that can help editors detect Wikipedia policy violations more effectively.
A spambot detection model. [Note: We experienced capacity reduction in our team during the period of this report and as a result, the research on this front went more slowly than originally anticipated.] We built models that predict whether a given tuple (URL, revision, Wikimedia project) represents spambot activity. We learned that new user accounts that add a URL in their first edit are more likely to be spambots. We are currently discussing this and other learnings from the study with stewards and the Trust & Safety team to support the two groups in making a decision about next steps. (Learn more)

Wikipedia Knowledge Integrity Risk Observatory. We continued adding data and metrics to the multi-dimensional observatory of knowledge integrity risks. We integrated new data from the social media traffic report pilot to expand the set of indicators to those coming from sources outside of the Wikimedia ecosystem. We also initiated exploratory conversations with other researchers to study the usability of the Knowledge Integrity Risk Observatory for understanding disinformation risks in medium or small size Wikipedia language editions. (Learn more)

A project to help develop critical readers. We performed a first round of analysis to quantify reader engagement with talk pages and version history of Wikipedia articles. These pages are often mentioned in teaching materials on how to assess the quality and trustworthiness of information in the context of Wikipedia – e.g., Civic Online Reasoning Curriculum or the Reading Wikipedia in the Classroom guide. We find that readers, not just editors, engage with these non-encyclopedic features. The level is, in some cases, almost as high as engagement with citations, but strongly depends on the article (higher engagement for articles with quality and reliability issues) and the location of the corresponding button, which differs across devices and language versions. (Learn more)

A report of controversial Wikidata properties and claims. We learned that well-established controversies, such as territory disputes between countries, are well covered by the "statement disputed by" qualifier; that reverted revisions are associated more with user characteristics than with content-related ones; and that item talk pages are not commonly used on Wikidata. We also observed a correlation between the most edited items on Wikidata and ongoing events receiving attention on Wikipedia. We shared the findings of this study with the Wikidata team and community and concluded the work in this space. (Learn more)

A model for understanding knowledge propagation across Wikimedia projects. We found limited evidence of a relationship between the existence of reliability-related templates and the propagation of content across Wikipedia languages. We have now shifted our attention to understanding how content quality in one language can impact content quality in other languages. (Learn more)

A dataset to detect peacock behavior on English Wikipedia. We released WikiEvolve, a dataset to study peacock behavior in Wikipedia. It contains seven versions of the same article from Wikipedia, from different points in its revision history, one with promotional tone, and six without it. The dataset offers more precise training signals for building models to detect promotional tone on Wikipedia. (Learn more)

A better understanding of reference quality in English Wikipedia. We started a new project to assess reference quality in English Wikipedia with the aim of informing Wikimedia editors and the Wikimedia Foundation in initiatives that can expand and improve the quality of references on Wikipedia. (Learn more)

A sockpuppet detection model. The status of work for this model remains the same as reported in the previous report.
Projects

Conducting foundational work
Wikimedia projects are created and maintained by a vast network of individual contributors and organizations. We focus part of our efforts on expanding and strengthening this network.
Wiki Workshop. We co-organized the 9th annual Wiki Workshop, which took place virtually as part of the Web Conference 2022. We had more than 160 attendees, 25 accepted papers, and submissions from more than 20 countries. The workshop featured a keynote given by Lawrence Lessig (Harvard Law School). Erik Möller (Freedom of the Press Foundation) facilitated a panel discussion with Tiffiniy Cheng (Fight for the Future), Mishi Choudhary (Software Freedom Law Center), and Cory Doctorow (Electronic Frontier Foundation) on the occasion of the tenth anniversary of SOPA/PIPA, focused on the past and future of online protest. We also awarded the Wikimedia Research Award of the Year 2022. (Learn more from the recordings and the published papers!)

Wiki-M3L Workshop. We co-organized the first workshop on Wikipedia and Multi-Modal & Multi-Lingual Research, which took place virtually as part of ICLR 2022. We had 144 attendees, 13 invited speakers from industry, academia, and not-for-profit organizations, and four accepted papers. The workshop featured a session on the "Wikipedia Image/Caption Matching Competition", where the winners presented their solutions and shared more about the challenges and opportunities for future editions of the competition.

Research Showcases. Since 2013, we have been organizing the Wikimedia Research Showcase, which takes place on the third Wednesday of every month, to showcase research on the Wikimedia projects. These events are broadcast live and recorded for offline viewing. Since January, our showcases have featured external researchers conducting research on Wikipedia beyond the English edition, collective attention, article quality, gaps and biases, and the diversity of languages in Wikipedia.

Office Hour Series. We organized monthly Research Office Hours. Topics discussed included programming, such as internships and the Research Fund, as well as technical topics around linked data, interactive model building techniques, and measuring bias.

Research Fund. We received 35 applications from 22 countries as part of the Wikimedia Research Fund. Through a two-stage review and deliberation process involving technical reviewers as well as Regional Fund Committee members, and taking into account the feedback from the broader Wikimedia Movement, we selected nine applications to receive a total of 303,535.20 USD in research funds.

Strategy to support and grow the research community. We drafted our team's strategy for growing and supporting a global and diverse network of Wikimedia researchers informed by the feedback that we received from some of the existing Wikimedia research community members we consulted with in the first half of the year, and rooted in our learnings from an array of activities and events we have been part of or led over the years. We invite you to learn more about the strategy. You can use the Talk page to provide feedback or indicate your interest in contributing to our initiatives.

Research Award of the Year. We awarded WMF-RAY 2022 to "WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning" and WMF-RAY 2022 - Best Student Paper to "Assessing the Quality of Sources in Wikidata Across Languages: A Hybrid Approach" after carefully reviewing more than 230 research publications on or about Wikimedia projects in 2021. Watch the award ceremony and read about the papers!

TREC Fair Ranking Track. The summary of the 2021 Fair Ranking track has been published and instructions and data are released for the 2022 iteration. This iteration asks teams to build models for retrieving a well-balanced list of articles for a given WikiProject. It expands the number of fairness criteria to take into consideration from two (gender and geography) to nine in an effort to explore questions of intersectional fairness. Notably, this year's criteria include an evaluation of the geography of an article based upon its references and will involve the collection of additional labeled data about the geo-provenance of websites and publishers. (Learn more)

Wikimedia Hackathon. We hosted a community session at the Wikimedia Hackathon focused on bringing together the developer and researcher communities. We framed the session to focus on how developers help bring research to life, provided examples of collaborations between developers and researchers, explained ways that developers can get involved in research at WMF, and brainstormed how to improve communication between developers and researchers.

Presentations and keynotes. We engaged with research audiences through presentations and keynotes during the past six months. Below, you see an excerpt of our engagements.

In January, in collaboration with the Foundation's Trust & Safety team, we presented at the Creative Commons Open Journalism webinar series. Through this talk, we shared the technical perspective and community-led approach to addressing misinformation and disinformation campaigns on Wikipedia. (Video)

In March, we gave a talk entitled "Wikipedia and The Science of Knowledge Integrity" at the MisinfoCon @ MozFest 2022, where we shared our more recent findings, insights and tools in the field of Knowledge Integrity. We had the opportunity to discuss the role of Wikipedia in the landscape of (dis)information, and how we can collaborate with other open knowledge communities to fight against false information.

In March, we gave an invited talk titled "Research at the Wikimedia Foundation: The Science of Knowledge Equity" at the Cavendish QI Seminars at University of Cambridge. Hosted by the Cavendish Quantum Information Group and the Hitachi Cambridge Laboratory, the talk provided an opportunity to interact with a community of physicists working in Computer Science and share ongoing research and challenges in developing machine learning models in the Wikimedia ecosystem.

In March, we gave a keynote at IQIIR, a workshop on information quality measurements co-located with CHIIR, the international conference on Human Information Interaction and Retrieval. In the presentation, we shared our framework for identifying, measuring, and bridging knowledge gaps, as well as our thinking around a knowledge equity-based notion of content quality measurements.

In May, we gave an invited talk at the Helsinki Institute for Social Science and Humanities Seminar. This Seminar series is intended as a forum for interdisciplinary researchers to discuss cutting-edge research and methodological innovations. The talk provided an overview of existing work and challenges on classifying and quantifying the role of images in free knowledge ecosystems.

In June, we gave a keynote presentation at Web Science 2022 in Barcelona, Spain, on the topic "Research at the Service of Free Knowledge". The primary focus of the keynote was to elevate the importance of Wikimedia projects within the Web Science research community, surface the complexity of research on the Wikimedia projects, and raise awareness about the open research questions that Wikimedia projects can benefit from addressing.

Mentorship through Outreachy. In May, Nazia Tasnim began working on building a Python library to work with HTML dumps. When analyzing Wikipedia's content for a research project or training large language models, researchers typically use the publicly available Wikimedia database dumps and parse the wikitext using tools such as mwparserfromhell. However, it is often desirable to work with an HTML version of the dumps. For example, Mitrevski et al. found for English Wikipedia that of the 475M internal links in the HTML versions of the articles, only 171M (36%) were present in the wikitext. Fortunately, the Wikimedia Enterprise HTML dumps have very recently been introduced and made publicly available with regular monthly updates so that researchers may use them in their work. Thus, the aim of this project is to provide tools that lower the technical barriers to working with the HTML dumps and empower researchers and others to take advantage of this resource.
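The gap between wikitext and HTML links can be illustrated with a small sketch that uses only the Python standard library. It is not the library being developed in this project. It assumes that internal links in the rendered HTML are anchors with rel="mw:WikiLink" and an href of the form "./Title"; the template name and the HTML snippet are made up for the example.

```python
import re
from html.parser import HTMLParser

WIKITEXT_LINK = re.compile(r"\[\[([^\[\]|#]+)")


def normalize(title):
    """Use underscores so wikitext and HTML titles compare equal."""
    return title.strip().replace(" ", "_")


def links_in_wikitext(wikitext):
    """Collect link targets written explicitly as [[Target]] in wikitext."""
    return {normalize(t) for t in WIKITEXT_LINK.findall(wikitext)}


class InternalLinkCollector(HTMLParser):
    """Collect targets of anchors marked as internal wiki links."""

    def __init__(self):
        super().__init__()
        self.targets = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").split()
        if "mw:WikiLink" in rel:  # assumed marker for internal links
            href = attributes.get("href") or ""
            title = href[2:] if href.startswith("./") else href
            self.targets.add(normalize(title))


def links_in_html(html):
    collector = InternalLinkCollector()
    collector.feed(html)
    return collector.targets


# The template below is hypothetical; its links appear only after rendering.
wikitext = "[[Paris]] is the capital of [[France]]. {{Hypothetical navbox}}"
html = (
    '<p><a rel="mw:WikiLink" href="./Paris">Paris</a> is the capital of '
    '<a rel="mw:WikiLink" href="./France">France</a>.</p>'
    '<div><a rel="mw:WikiLink" href="./Lyon">Lyon</a>'
    '<a rel="mw:WikiLink" href="./Marseille">Marseille</a></div>'
)

wikitext_links = links_in_wikitext(wikitext)
html_links = links_in_html(html)
only_in_html = html_links - wikitext_links
share_in_wikitext = len(html_links & wikitext_links) / len(html_links)
print(sorted(only_in_html), share_in_wikitext)
```

Links that come from templates show up only in the rendered HTML, which is the kind of difference behind the 36% figure cited above.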

Mentorship through internships. In April, Paramita Das began working on cross-lingual article quality as part of our research on understanding knowledge propagation across Wikimedia projects. In this work, we study the evolution of content quality across languages, and we are planning to release a dataset with the full history of article quality across all Wikipedia languages.

In April, Aitolkyn Baigutanova started working on reference quality in English Wikipedia. In this project we are developing metrics to understand the evolution of content reliability in the English Wikipedia.

Mo Houtti is continuing his work on equitable article recommendation that he started two years ago as a Formal Collaborator with the goal of better understanding the trade-offs between personalization and content equity.

In May, Shobha S V began her internship focused on program evaluation and development of resources for researchers interested in pursuing multilingual research on different language editions of Wikipedia.

Effectiveness of Wiki Loves Monuments campaigns. We have been supporting the Wiki Loves Monuments community in conducting research to understand the effectiveness of the WLM image competition campaigns, which are advertised through CentralNotice banners. We hypothesize that studying how effectively CentralNotice banners activate edit contributions, as compared to donations (the most commonly studied use case of the banners), will expand our understanding of the banners' effectiveness. (Learn more)

The people on the Research team

The Research team's staff members include Research Scientists Pablo Aragón, Martin Gerlach, and Isaac Johnson, Senior Research Scientist Diego Sáez-Trumper, Senior Research Engineer Fabian Kaelin, Senior Research Community Officer Emily Lescak, Research Manager Miriam Redi, and the Head of Research, Leila Zia. Our Research Fellow is Bob West.

Collaborations

The Research team's work has been made possible through the contributions of our past and present formal collaborators. During the last six months, we established the following new collaborations:

Dani Bassett is the J. Peter Skirkanich Professor at the University of Pennsylvania, with appointments in the Departments of Bioengineering, Electrical & Systems Engineering, Physics & Astronomy, Neurology, and Psychiatry, as well as an external professor of the Santa Fe Institute. Their work focuses on studying biological, physical, and social systems by using and developing tools from network science and complex systems theory. They collaborate with us in the Understanding the Curiosity of Readers project.
Meeyoung Cha is an associate professor at KAIST in the School of Computing and a chief investigator in the Pioneer Research Center for Mathematical and Computational Sciences at the Institute for Basic Science. We started a new collaboration with Meeyoung to study Reference Quality in English Wikipedia.
Changwook Jung is a PhD student at KAIST collaborating with us on the Reference Quality in English Wikipedia project.
David Lydon-Staley is an Assistant Professor at the Annenberg School for Communication at the University of Pennsylvania. His research focuses on the unfolding of human behavior over short timescales (e.g., moment-to-moment, day-to-day) during the course of everyday life. He conducts research on Understanding the Curiosity of Readers.
Jaehyeon Myung is a master's student at KAIST collaborating with us on the Reference Quality in English Wikipedia project.
Shubhankar Patankar and Dale Zhou are PhD students at the University of Pennsylvania collaborating with us on the project on Understanding the Curiosity of Readers.
Perry Zurn is an Assistant Professor of Philosophy at American University, and affiliate faculty in the Department of Critical Race, Gender, and Culture Studies. His research interests are political philosophy, critical theory, and trans philosophy, with special expertise in feminist philosophy, philosophies of resistance, and network theory. He collaborates with us on the project on Understanding the Curiosity of Readers.

We also want to take this opportunity to thank Benjamin Mako Hill (University of Washington) for extensively supporting our team's efforts on two major fronts. Mako served as a Research Fund co-chair during the past six months (and before that) as well as the Award co-chair for the WMF Research Award of the Year.

To inquire about becoming a formal collaborator, please tell us more about yourself and your interests.

Events

Research Showcases

Every 3rd Wednesday of the month (virtual)
Join us for Wikimedia-related research presentations and discussions. The showcases are great entry points to the world of Wikimedia research and staying in touch with other Wikimedia researchers. Read more
Research Office Hours

Every 1st Tuesday of the month (virtual)
We will be pausing the Research Office Hours August - December 2022 so that we can focus our attention on the development of a course. We have been brainstorming new approaches to Office Hours to increase inclusivity and look forward to introducing them when we restart. Read more
Wikimania

August 2022
The seventeenth edition of Wikimania, the largest Wikimedia conference of the year, will take place August 11-14. Register
Research Fund

August 2022
The request for proposals for the next round of funding will be announced on Meta-Wiki in August. Read more
Wikidata 10th Birthday

October 2022
Wikidata will celebrate its 10th birthday in October with a series of events. We encourage researchers to contribute messages or other artifacts celebrating the importance of Wikidata to their work and communities. Read more

We encourage you to keep in touch with us via one or more of the methods listed in the Keep in touch section to receive more information about these and other events.

Trends to watch

We're keeping an eye on significant trends that relate to the Wikimedia projects and the broader ecosystem in which Wikimedia operates:

Data Governance for AI. There has been a growing focus on how researchers collect and curate data, and on how they respect the rights of those whose data is included in datasets used for training AI models. While most of the machine learning models that the Research team develops are derived from Wikimedia data, and thus follow the basic principles that govern Wikimedia content, many models developed within other industry organizations or academia use data whose provenance is far less clear – e.g., content scraped from web pages. We have been participating in efforts such as BigScience that have been seeking to design better approaches to data governance (read more). This was also an important topic within the Wiki-M3L workshop, where governance models for community ownership of data were discussed in a panel (watch the recording, starting at 04:28:09) with leaders from Indigenous AI and the Hugging Face community. We believe that the Wikimedia communities have much to contribute to these conversations given the heavy usage of Wikimedia data by AI practitioners and the rich history of self-governance within the Wikimedia projects. We encourage researchers and practitioners to reach out to the Wikimedia communities to include their knowledge and perspective as part of their conversations and learning journeys.

Donors

Funds for the Research team are provided by donors who give to the Wikimedia Foundation, by grants from the Siegel Family Endowment, and by a grant from the Argosy Foundation. Thank you!

Keep in touch with us

The Wikimedia Foundation's Research team is part of a global network of researchers who study Wikimedia projects. We invite everyone who wants to stay in touch with this network to join the public wiki-research-l mailing list and follow @WikiResearch, which is the research community's Twitter handle.

Previous Report

Research Report Nº 5
