Research Report Nº 5
The fifth in a series of biannual reports from Wikimedia Research, published every June and December.
Welcome! We’re the Wikimedia Foundation’s Research team. We turn research questions into publicly shared knowledge. We design and test new technologies, produce empirical insights to support new products and programs, and publish research informing the Wikimedia Foundation’s and the Movement’s strategy. We help to build a strong and diverse community of Wikimedia researchers globally. This Research Report is an overview of our team’s latest developments – an entry point that highlights existing and new work, and details new collaborations and considerations, including trends that we’re watching.
Between June and December 2021, we worked with other staff at the Wikimedia Foundation, our formal collaborators, Wikimedia affiliates, and Wikimedia volunteers to address knowledge gaps on the Wikimedia projects, improve knowledge integrity, and conduct foundational work. We initiated research on multiple key fronts, including building a risk observatory for the Wikimedia projects, gaining a deeper understanding of readers' curiosity and critical thinking process while on the Wikimedia projects, and developing models for automatic content tagging. We secured two major workshops for 2022, which give us the opportunity to bring established and emerging Wikimedia researchers together: Wiki Workshop and Wiki-M3L. We also co-launched the first-ever Wikimedia Research Fund.
As you read about our work below, we invite you to learn more through links to collaborative projects and other key programs, including events that we are organizing or attending. The work that the Research team does is a work in progress. That is why we are publishing these Research Reports: to keep you updated on what we think is important to know about our most recent work and the work that will happen in the following six months.
Addressing knowledge gaps
We aim to help Wikimedia projects engage more editors, increase the diversity of editors, identify and address missing content, and reach more readers across the globe. The Wikimedia projects are a form of discourse that we want everyone to share in. Who writes the knowledge matters. Who reads the knowledge matters. What’s missing from that knowledge matters. That’s our challenge as we help address Wikimedia projects' knowledge gaps.
An improved readership experience through better new user onboarding. We continued our collaboration with the Growth team to improve new editor retention through Structured Tasks. We presented the add-a-link structured task model at CIKM 2021 (paper). After success with four pilot wikis (Arabic, Bengali, Czech, and Vietnamese), the model has been deployed to six more languages (Persian, French, Hungarian, Polish, Romanian, and Russian) with more languages scheduled for deployment. We conducted exploratory research around recommending links that increase the visibility of articles (read more) and background research on copy editing as a structured task (read more).
Metrics to measure knowledge gaps. We continued research and development to measure the extent of Wikimedia knowledge gaps. Since June 2021, we have worked on productionizing some of the content gap metrics already developed. We also started research to measure the content readability gap. (Read more)
A deeper understanding of the role of visual knowledge in Wikimedia Projects. We continued the research on understanding reader engagement with images on English Wikipedia. More specifically, we started looking into how images help navigate the content in the articles and across articles. We expect further exploratory research on this front to continue in the coming months. (Read more)
We continued to gather data about readers' comprehension skills on Wikipedia articles with and without images. The data will be used to better understand how images impact reading comprehension on Wikipedia. (Read more and participate!)
A model for image recommendation. We continued our collaboration with the Growth team to build the "add-an-image" structured task. In this task, newcomers are presented with an unillustrated Wikipedia article and a candidate image match. If the match is correct, they are asked to add the image to the article and write an appropriate caption. The task is powered by an image recommendation algorithm developed by our team. As of November 30th, the feature is available on mobile devices for Arabic, Bengali, and Czech Wikipedias. As part of our efforts to scale up image recommendations using machine learning, we launched the Kaggle Image/Caption matching competition, where participants were asked to design models that can associate Wikipedia images with the closest piece of text, in 100+ languages.
A deeper understanding of reader navigation. We completed two studies that build on our initial characterizations of reading sessions.
In the first study, we investigate how readers of English Wikipedia reach and transition between articles and how these patterns form navigation paths. We find that Wikipedia navigation paths commonly mesh with external pages as part of a larger online ecosystem and that readers have a higher chance of stopping navigation when reaching low-quality pages. (Read more)
In the second study, we summarize our findings comparing navigation data from webrequest logs to the publicly available clickstream dataset in eight different languages, as shared in the previous report. Our paper has been accepted for presentation at WSDM 2022. (Read more)
A unified framework for equitable article prioritization. We continued research on understanding the impact of recommender systems on equity. More specifically, we analyzed the impact of the Newcomer Task recommender system on the gender and geographic distribution of content created through these recommendations. Through the Newcomer Task recommender system, editors from several languages (Arabic, Persian, French, and Russian) had access to topic filters such as Art or Sports. The editor-selected topics induced a slight balancing of the gender distribution of edited content, but they also introduced new biases into the geographic distribution of content. We observed some selection bias by editors around geographic content, suggesting that closing those gaps depends on attracting a diverse community of editors. (Read more)
A tool for automatic identification of edit actions. One challenge to studying trends in editing and building tools to support editors is building capabilities that help us understand what is happening in a given edit, at scale and across languages. This is a long-studied problem and there are many ways to approach it - e.g., the "what" like editing a link vs. the "why" like wikification. We are starting with the "what" and building a very basic taxonomy of edit types for Wikipedia articles. We intend to develop tools for extracting diffs for edits and labeling them with the associated edit types. (Read more and try it out!)
Models for content tagging. A large, and sometimes hidden, side of maintaining Wikipedia for volunteer editors is maintaining the annotations or metadata that exist alongside the Wikipedia articles. These categories - including WikiProject tags, quality ratings, importance ratings, Neutral Point of View violations, citations needed, and other relevant tags - help editors track and prioritize content and help researchers understand trends and build tools to improve Wikipedia. We are building language-agnostic predictive models that can help categorize and evaluate Wikipedia content to assist editors and researchers in their work. The models we are currently focusing on are quality and geography. (Read more)
An updated roadmap for addressing knowledge gaps. In fall 2021, we revisited our team's roadmap for addressing knowledge gaps that was first shared in 2019. We dedicated a week to brainstorming exercises and did a deep dive into the research conducted over the past few years and our findings. We then reflected on the Movement's strategic direction and updated the roadmap that we intend to use in the coming years to plan our work. We are now finalizing the documentation and hope to share the result in the next report.
Read more about our direction and vision for addressing knowledge gaps.
Improving knowledge integrity
We help Wikimedia communities assure the integrity of knowledge on projects by conducting research and developing and testing technologies that can help editors detect Wikipedia policy violations more effectively.
A Spambot detection model. We started a new project to help stewards detect spambots across the projects more efficiently. Wikimedia stewards are the group of volunteers with the most extensive rights and permissions on the Wikimedia projects and with cross-wiki responsibilities. Despite their critical role in content governance and moderation, their workflows hardly benefit from advanced tools or technologies. With the support of stewards and the Trust & Safety team, we are building a model to automatically identify revision and editor features of spambot URLs. We have created a novel dataset of URLs from visible and deleted revisions by spambots and revisions that hit spam-related AbuseFilter rules. We are currently examining relevant patterns.
Wikipedia Knowledge Integrity Risk Observatory. We started a new project with the goal of providing a multi-dimensional observatory (monitoring system) that can support the Wikipedia communities with tracking and taking action on knowledge integrity risks. Over the past months, we conducted a literature review and worked closely with the Moderator Tools team to create a taxonomy of knowledge integrity risks. We are currently building an exploratory and interactive dashboard that integrates different metrics based on data of interest. (Read more)
A project to help develop critical readers. The majority of the research and development to curb misinformation and disinformation on the platforms focuses on improving the content or processes on the content generation side. While we continue to invest on that front, we believe it is important to conduct research to learn how to equip readers to become more resilient to disinformation or misinformation. That is why we started a multi-year initiative to focus on curious and critical readers. Since July 2021, we have started to review relevant literature to identify two main research directions. First, we are interested in understanding how readers demonstrate curiosity when seeking information in Wikipedia. We aim to apply the framework of knowledge networks to capture traits of curiosity and how they relate to receptiveness to inaccurate information. Second, we are interested in quantifying the degree to which readers are critically engaging with information on Wikipedia. Our aim is to understand how much readers engage with additional non-content elements of an article, specifically talk pages, version history, or pages from other namespaces related to reliability (such as templates or policies), to critically assess articles’ reliability. (Read more)
A report of controversial Wikidata properties and claims. In July 2021, we started a project to develop a framework for detecting controversial content on Wikidata. We spent the past months defining and scoping controversiality on Wikidata, in collaboration with the Wikidata team. We have developed our first model for detecting controversial content on Wikidata and we expect to improve the model after feedback iterations in the upcoming months. (Read more)
A model for understanding knowledge propagation across Wikimedia projects. In our previous report we shared a dataset of inter-language knowledge propagation in Wikipedia across 309 language editions. Since July 2021, we have started utilizing the dataset by exploring how the presence of reliability-related templates impacts cross-lingual content propagation. (Read more)
Machine learning models to align Wikipedia and Wikidata content. We continued exploring different models for aligning Wikipedia and Wikidata content. The results so far have not been satisfactory, so we have put this project on hold until we can gather more insights about how this problem can be approached. (Read more)
A dataset to detect peacock behavior on English Wikipedia. Wikipedia articles should be written in a neutral way. However, some edits in biographies of people on Wikipedia promote the subject in a subjective manner. This, in English Wikipedia, is referred to as peacock behavior. Detecting content that signals peacock behavior is a hard task that is done primarily manually by editors today. We see opportunities for developing machine learning models to support editors in more effectively detecting such content. To facilitate the development of such models, we intend to develop and publish a dataset. We have started the work on this front and we expect to release the data in the coming months. (Read more)
A prototype for automatic fact-checking in Wikipedia. Our aim is to implement an open API that will automatically perform a fact validation process. In Natural Language Processing (NLP), this task is called Natural Language Inference (NLI): a claim is compared with a reference to determine whether it is correct, incorrect, or unrelated. The output of this project will be the first prototype of an Automatic Fact Checking API. (Read more and test the model)
A sockpuppet detection model. The Anti-Harassment Tools team has been designing an interface to surface the sockpuppet detection model that we developed. The next steps will be to gather feedback from checkusers via the interface. (Read more)
Read more about our direction and vision for improving knowledge integrity.
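To make the three-way outcome of the fact-checking task described above more concrete, here is a minimal sketch of its final step only. It assumes some NLI model has already compared a claim with a reference and returned scores for the three standard NLI labels (entailment, contradiction, neutral). The names Verdict, NLI_TO_VERDICT, and verdict_from_scores are hypothetical and chosen for illustration; they are not part of the prototype API mentioned in this report.

```python
from enum import Enum
from typing import Mapping


class Verdict(Enum):
    """Three outcomes of comparing a claim with a reference."""
    CORRECT = "correct"      # the reference supports the claim
    INCORRECT = "incorrect"  # the reference contradicts the claim
    UNRELATED = "unrelated"  # the reference neither supports nor contradicts it


# Assumed mapping from the standard NLI labels to the outcomes above.
NLI_TO_VERDICT = {
    "entailment": Verdict.CORRECT,
    "contradiction": Verdict.INCORRECT,
    "neutral": Verdict.UNRELATED,
}


def verdict_from_scores(scores: Mapping[str, float]) -> Verdict:
    """Pick the verdict whose NLI label has the highest score.

    `scores` is assumed to come from an NLI model that was given a
    (reference, claim) pair; this function does not run any model.
    """
    missing = NLI_TO_VERDICT.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing NLI scores for: {sorted(missing)}")
    best_label = max(NLI_TO_VERDICT, key=lambda label: scores[label])
    return NLI_TO_VERDICT[best_label]
```

A real system would likely also apply a confidence threshold before reporting a claim as correct or incorrect; that choice is left out of this sketch.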
Conducting foundational work
Wikimedia projects are created and maintained by a vast network of individual contributors and organizations. We focus part of our efforts on expanding and strengthening this network.
Wiki Workshop. We are co-organizing the 9th annual Wiki Workshop, which will take place virtually as part of the Web Conference 2022. The workshop serves as a platform for the researchers of Wikimedia projects to convene on an annual basis and present their ongoing and completed research projects, brainstorm and build collaborations, and connect with Wikimedia volunteers and Wikimedia Foundation staff. (Contribute!)
Image competition and workshop. The “Wikipedia Image/Caption Matching Competition” that we shared with you in the previous report has concluded. During the three-month-long competition, more than 100 teams submitted models! The winners will present their work at the Wiki-M3L workshop, which will be held virtually as part of the ICLR 2022 conference. Learn more about Wiki-M3L in the events section of this report.
Research Showcases. Since 2013, we have been organizing the Wikimedia Research Showcase, which occurs on the third Wednesday of every month, to showcase the research on Wikimedia projects. These events are broadcast live and recorded for offline view. Since July, our showcases have featured external researchers working on content moderation, socialization, content gaps, and online education landscapes. Our October Showcase featured updates from the Research team on their work to bridge knowledge gaps.
Office Hour Series. Since 2020, we have been organizing public monthly research office hours with the goal of supporting the contributors of the Wikimedia projects with their research and data-related questions. Topics discussed have included Wikidata queries, statistics on readers and editors, natural language processing, recommender systems, Apache Airflow, and content translation. The office hours take place on the first Tuesday of each month and are held via video-call. Sessions alternate each month between 12:00 UTC and 24:00 UTC. (Read more)
Research Fund. In November, we launched the Wikimedia Research Fund in collaboration with WMF’s Community Resources team! We will provide up to $50,000 USD to support research related to Wikipedia projects and communities. Applications are due by January 3!
Strategy to support and grow the research community. Since July 2021, we have been developing our team’s strategy for growing and supporting a global and diverse Wikimedia research community. We held conversations with Wikimedia researchers around the world to better understand their unmet needs and goals for growing and sustaining a global community of practice. We anticipate that the strategy will be ready for public review in early 2022.
Research Award of the Year. We are inviting you to submit nominations for the Wikimedia Foundation Research Award of the Year 2021. Your nominations of published research during 2021 will help us identify the most impactful research work(s) of the year. (Nominate!)
TREC Fair Ranking Track. We concluded the 2021 TREC Fair Ranking Track with four teams submitting fair recommendation models. Each team presented at the November conference about how they modeled the balance between content relevance and recommending a fair distribution of articles for WikiProjects. We documented some of our learnings from co-organizing the 2021 track. The track will continue in 2022, likely with some adjustments to include more fairness criteria and narrow the focus of content. (Learn more)
Presentations and keynotes. We engaged with research audiences through presentations and keynotes during the past six months. Below is a selection of our engagements.
We presented a preliminary approach to knowledge integrity risk assessment in Wikipedia projects at the MIS2 Workshop (Misinformation and Misbehavior Mining on the Web). The work that we shared is part of a larger effort to build a Wikipedia Knowledge Integrity Risk Observatory. (Slides)
We participated in the annual Conference for Truth and Trust Online that brings together practitioners, technologists, academics and platforms to share useful technical innovations to improve the truthfulness and trustworthiness of online communications. We presented an overview of our research on knowledge integrity in Wikimedia projects. (Slides, Video)
We gave a keynote speech as part of the Knowledge Capture 2021 (K-CAP) conference. The K-CAP research community focuses on research in knowledge capture and information extraction. We presented the work of the Research team in the space of Address Knowledge Gaps. (Slides, Video)
WikiIndaba Conference 2021. We co-organized and moderated a panel discussion on growing and supporting research communities in Africa. (Learn more)
Read more about our direction and vision for building a stronger foundation for research in Wikimedia projects.
The people on the Research team
In October 2021, we promoted Miriam Redi to become our team’s first Research Manager. Miriam joined us in September 2017 as a research scientist. You may know her through the many initiatives she has been involved with or led, including the development of image recommendation algorithms, understanding the role of visual knowledge on Wikipedia, the Knowledge Gap Index, and more. In her new capacity, Miriam will be leading the program to Address Knowledge Gaps, our team's largest multi-year program encompassing 12 projects as part of the current fiscal year. She will also be responsible for the growth and development of the research scientists who contribute to this program. Congratulations, Miriam!
The Research team’s other staff members include Research Scientists Pablo Aragón, Martin Gerlach, and Isaac Johnson, Senior Research Scientist Diego Sáez-Trumper, Senior Research Engineer Fabian Kaelin, Senior Research Community Officer Emily Lescak, and the Head of Research, Leila Zia. Our Research Fellow is Bob West.
The Research team’s work has been made possible through the contributions of our past and present formal collaborators. With the 2030 Strategic Direction now in place, we expect to build more formal collaborations in the coming months and years to help achieve the direction set in our Research:2030 whitepapers. While we have started conversations about new formal collaborations during the period of this report, we have not initiated new collaborations.
Wiki Workshop
April 25, 2022 (virtual)
We invite you to join us at the 9th annual Wiki Workshop to share your research and connect with other Wikimedia researchers. Submissions are due February 3 and March 10! Read more
Research Showcases
Every 3rd Wednesday of the month (virtual)
Join us for Wikimedia-related research presentations and discussions. The showcases are great entry points to the world of Wikimedia research and staying in touch with other Wikimedia researchers. Read more
Research Office Hours
Every 1st Tuesday of the month (virtual)
Join us in the Research office hours to have your questions related to Wikimedia data and research answered. All are welcome! Read more
We encourage you to keep in touch with us via one or more of the methods listed in the Keep in touch section to receive more information about these and other events.
Trends to watch
We’re keeping an eye on significant trends that relate to the Wikimedia projects and the broader ecosystem in which Wikimedia operates:
Differential Privacy. The Wikimedia Foundation is building capacity to apply differential privacy (a robust approach to anonymizing data through adding small amounts of noise) to new Wikimedia datasets. Differential privacy both offers strong guarantees of privacy and shows promise in making our datasets more equitable - e.g., more likely to include data about languages or regions of the world that are often left out due to privacy concerns. (Read more)
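To illustrate what "adding small amounts of noise" can look like in practice, here is a minimal sketch of the Laplace mechanism, a textbook way to release a differentially private count. It is not the Wikimedia Foundation's implementation. The function names, the page-view example, and the assumption that each person changes a count by at most 1 (the sensitivity) are all hypothetical choices for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random,
                  sensitivity: float = 1.0) -> float:
    """Return a noisy count using the Laplace mechanism.

    `epsilon` is the privacy budget: smaller values add more noise.
    `sensitivity` is the most a single person can change the count
    (assumed to be 1 here).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random()
    # Hypothetical page-view count for a small language edition.
    print(private_count(true_count=12, epsilon=0.5, rng=rng))
```

Because the noise is centered on zero, aggregates over many released counts stay close to the truth, while any single small count is masked. This is one reason the approach may allow data about smaller languages or regions to be published rather than withheld.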
Keep in touch with us
The Wikimedia Foundation's Research team is part of a global network of researchers who study Wikimedia projects. We invite everyone who wants to stay in touch with this network to join the public wiki-research-l mailing list and follow @WikiResearch, which is the research community's Twitter handle.