Research Report Nº 8
The eighth in a series of biannual reports from Wikimedia Research, published every June and December.
We usually include some highlights of the work that we have done over the past six months in this section. This time, I’m going to break the convention and share another type of executive summary with you.
As you will see in the report, over the past six months our team made significant progress on the three roadmaps we have been executing over the past few years. What makes this report particularly special for me is that the team did this work against a backdrop of significant change within the Wikimedia Foundation as well as within our team. As we close our fiscal year and I share this report with you, what stays most present for me is the commitment, persistence, humility, striving for excellence, openness, collaboration, and care that the Research team has exhibited throughout the period covered by this report and while doing the work that you will read more about below.
If you donate to the Wikimedia Foundation or volunteer for the Wikimedia projects, I hope this report gives you a glimpse of how thankful we are for your gift of contribution to the Wikimedia Foundation and the world.
Welcome to our 8th biannual Research Report!
– Leila Zia (Head of Research)
Addressing knowledge gaps
We aim to help Wikimedia projects engage larger and more diverse groups of editors, identify and address missing content, and reach more readers across the globe. The Wikimedia projects are a form of discourse that we want everyone to share in. Who writes the knowledge matters. Who reads the knowledge matters. What's missing from that knowledge matters. That's our challenge as we help address Wikimedia projects' knowledge gaps.
A deeper understanding of content contributions. We continued our investigation into the distribution of types of edits on Wikipedia. We partly focused our efforts on scaling edit type classification to process larger sets of historical edits in support of longitudinal analyses, which required overcoming some technical challenges (including fixing a memory leak in the widely-used Python package mwparserfromhell). We published some initial visualizations of data from French Wikipedia. (Learn more)
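Edit-type classification of this kind boils down to diffing two revisions of wikitext and asking which kinds of elements changed. The sketch below illustrates the idea with a few regexes; it is a toy stand-in, not the team's actual taxonomy or pipeline (which parses wikitext properly, e.g. with mwparserfromhell), and the pattern and label names are illustrative:

```python
import re

def classify_edit(old_text: str, new_text: str) -> list[str]:
    """Toy edit-type classifier: compare counts of a few wikitext
    elements between two revisions of a page. Regexes are a crude
    approximation; a real pipeline would parse the wikitext."""
    patterns = {
        "wikilink": r"\[\[[^\]]+\]\]",
        "template": r"\{\{[^}]+\}\}",
        "reference": r"<ref[^>]*>",
    }
    changes = []
    for name, pat in patterns.items():
        delta = len(re.findall(pat, new_text)) - len(re.findall(pat, old_text))
        if delta > 0:
            changes.append(f"insert-{name}")
        elif delta < 0:
            changes.append(f"remove-{name}")
    return changes or ["text-only change"]

old = "Berlin is the capital of Germany."
new = "Berlin is the capital of [[Germany]].<ref>CIA Factbook</ref>"
print(classify_edit(old, new))  # ['insert-wikilink', 'insert-reference']
```

Scaling this per-revision comparison to the full edit history is what makes longitudinal analyses expensive, and why parser performance issues such as the memory leak mattered.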
An improved readership experience through better new user onboarding. We continued our collaboration with the Growth Team to improve new editor retention through Structured Tasks by supporting the Machine Learning team in the deployment of our link recommendation model for the add-a-link structured task. The model has been trained for more than 300 Wikipedia languages and is currently being deployed to most of these languages.
We also continued research on extending structured tasks to copyediting Wikipedia articles. We developed a multilingual model that detects sentences needing copyediting with 70–80% accuracy across seven languages. In a complementary approach, we showed that lists of common misspellings yield high-precision copyediting candidates in Wikipedia. Based on these insights, we developed an automatic approach to curating lists of common misspellings across languages using Wiktionary. (Learn more)
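The misspelling-list approach is attractive because a dictionary lookup rarely produces false positives. A minimal sketch of the idea (the entries below are illustrative; the real lists come from on-wiki resources and Wiktionary, not this hard-coded dictionary):

```python
import re

# Hypothetical excerpt of a misspelling list; real lists are curated
# per language from Wiktionary and community-maintained pages.
COMMON_MISSPELLINGS = {
    "recieve": "receive",
    "occured": "occurred",
    "seperate": "separate",
}

def copyedit_candidates(text: str) -> list[tuple[str, str]]:
    """Return (misspelling, suggested correction) pairs found in text."""
    found = []
    for word in re.findall(r"[A-Za-z]+", text):
        fix = COMMON_MISSPELLINGS.get(word.lower())
        if fix:
            found.append((word, fix))
    return found

print(copyedit_candidates("She did not recieve the award."))
# [('recieve', 'receive')]
```

Precision is high because every hit is a known error, while recall depends entirely on how complete the list is, which is what the Wiktionary-based curation addresses.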
A model to increase the visibility of articles. We continued the research on orphan articles in Wikipedia. We established the existence of a causal relationship between the addition of new incoming links to orphan articles and an increase in their visibility in terms of the number of pageviews. We also demonstrated the need to develop automated tools to support editors in addressing this issue using cross-lingual approaches. We published these findings in a pre-print, which is currently under review. (Learn more)
Metrics to measure knowledge gaps. We continued research and development to measure the extent of Wikimedia knowledge gaps.
We released knowledge gap index datasets for five content gaps: gender, geography, sexual orientation, time, and multimedia. Several measurements are now publicly available for each category in a gap (e.g. the different regions in the geography gap) as monthly time series. These are the number of created articles, the pageviews, the average quality score, and the revision counts. (Data)
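To give a sense of the data's shape, here is a minimal sketch of working with such a monthly time series; the column names and numbers below are illustrative, not the published schema:

```python
import csv
import io

# Hypothetical rows shaped like a monthly knowledge gap time series:
# one row per (category, month), with the metrics as columns.
sample = io.StringIO(
    "category,month,created_articles,pageviews\n"
    "Northern Europe,2023-03,120,5000000\n"
    "Northern Europe,2023-04,150,5200000\n"
)

rows = list(csv.DictReader(sample))
prev = int(rows[0]["created_articles"])
curr = int(rows[1]["created_articles"])
growth = (curr - prev) / prev
print(f"Month-over-month article growth: {growth:.0%}")  # 25%
```

Because each metric is a monthly series per category, comparisons across regions or over time reduce to simple arithmetic like the above.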
We started developing an API for content gap metrics that can be utilized in other tools for surfacing the data.
On the readership metrics front, we improved the multilingual readability model to calculate readability scores for Wikipedia articles: we expanded the ground-truth data of Wikipedia articles with different readability levels (17 datasets across 14 languages); increased the number of supported languages (104) and the accuracy of the model to generate readability scores; and have been working on making the model (and its scores) publicly available through Wikimedia's Machine Learning Platform. We also started work on ensuring the external validity of the model by asking readers about their perception of readability of individual articles. As a first step, we launched a pilot survey. (Learn more)
We began work on a prototype model for assessing the completeness of a Wikidata item as a first step towards assessing the structured data gap surfaced as part of the Knowledge Gaps Taxonomy. We were able to build on past work, including an ORES model and community tools, and are currently working to develop a new evaluation dataset to assess its performance. (Learn more)
A deeper understanding of the role of visual knowledge in Wikimedia Projects. We concluded our studies on the role of images in Wikipedia navigation. We found, through a crowdsourcing experiment based on Wikispeedia, that participants find a path from a source article to a target article 19% faster in the presence of images. (Learn more)
We also finalized and improved our first study on the role of images in learning and submitted the work for peer review. (Learn more)
A model for image recommendation. To support Product teams with a new structured task for section-level image suggestions, we focused part of our efforts on developing an algorithm that uses our previous work on Section Alignment to discover relevant images for sections in multiple languages. The algorithm's output was manually evaluated and judged accurate enough to be used in a product feature. As of June 2023, a new task using this algorithm is available to editors of the Arabic, Bengali, Czech, English, and Spanish Wikipedias. (Learn More)
A unified framework for equitable article prioritization. We concluded our experiment exploring the balance between personalization and content equity within the SuggestBot edit recommender system. We found that diversifying recommendation sets for editors led to more diverse edits with no drop in follow-through to edit the page, even though the diversification strategy generally requires recommending less-relevant articles. This result is very promising: it suggests that editors may often be open to editing articles on a wider range of topics than one might presuppose from their edit history, if they receive support for discovering these articles and tasks. (Learn more)
Models for content tagging. We continued our work on extending natural-language processing tooling to support sentence and word tokenization in more Wikipedia languages. We released a Python package called mwtokenizer that contains approaches for segmenting sentences while controlling for many abbreviations and doing word tokenization in both whitespace-delimited and non-whitespace-delimited languages. (Learn more)
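The core difficulty in multilingual sentence segmentation is that a period does not always end a sentence. The sketch below shows the idea with a tiny abbreviation list; it is not mwtokenizer's API, just an illustration of the problem the package handles across many languages, including non-whitespace-delimited ones:

```python
import re

# Tiny illustrative abbreviation list; real tokenizers maintain much
# larger, per-language lists.
ABBREVIATIONS = {"dr", "mr", "prof", "e.g", "etc"}

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter that refuses to split after known
    abbreviations. Illustrates the problem, not mwtokenizer's API."""
    sentences, start = [], 0
    for m in re.finditer(r"\.\s+", text):
        words = text[start:m.start()].rsplit(maxsplit=1)
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # this period ends an abbreviation, not a sentence
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith edits Wikipedia. She fixes typos."))
# ['Dr. Smith edits Wikipedia.', 'She fixes typos.']
```

Word tokenization in languages without whitespace delimiters (e.g. Japanese or Thai) cannot rely on splitting at spaces at all, which is why the package ships separate approaches for the two families of languages.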
Improving knowledge integrity
We help Wikimedia communities assure the integrity of knowledge on projects by conducting research and developing and testing technologies that can help editors detect Wikipedia policy violations more effectively.
Enhanced models for content patrolling. In collaboration with the Machine Learning Platform team, we developed an API for the Language-Agnostic Revert Risk Model. We additionally developed a Multilingual Revision Risk model aimed specifically at IP edits. This model currently supports 47 languages and complements our language-agnostic approach. It is our first tool to use LLMs, one of the more recent technologies for natural-language tasks. Using this cutting-edge technology allows us to handle complex types of vandalism, enhancing the performance of our models. (Paper, API)
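For developers, the sketch below shows one way to score a revision against such a model over HTTP using only the standard library. The endpoint path follows the publicly documented LiftWing convention but may change, so treat it as illustrative rather than authoritative:

```python
import json
import urllib.request

# LiftWing endpoint for the language-agnostic revert-risk model
# (documented on api.wikimedia.org; the exact path may change).
ENDPOINT = ("https://api.wikimedia.org/service/lw/inference/v1/"
            "models/revertrisk-language-agnostic:predict")

def build_revert_risk_request(lang: str, rev_id: int) -> urllib.request.Request:
    """Build (but do not send) a scoring request for one revision."""
    payload = json.dumps({"lang": lang, "rev_id": rev_id}).encode()
    return urllib.request.Request(
        ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_revert_risk_request("en", 1234567890)
print(req.data.decode())  # {"lang": "en", "rev_id": 1234567890}
# Sending it with urllib.request.urlopen(req) returns a JSON response
# containing the predicted probability that the revision is reverted.
```

Keeping the request construction separate from the network call makes it easy to test patrolling tools offline before pointing them at the live API.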
Wikipedia Knowledge Integrity Risk Observatory. We built an updated version of the multi-dimensional observatory. This new dashboard, to be used by the Trust & Safety Disinformation team, simplifies the complexity of the large set of indicators from the previous version by serving data derived from the enhanced models for content patrolling. In particular, the updated observatory provides monthly knowledge integrity risk data for each language version of Wikipedia considering the ratio of high-risk revisions, defined as revisions with a high probability of being reverted, and their corresponding revert rates. (Learn more)
A project to help develop critical readers. We continued our analysis of knowledge networks of Wikipedia readers in order to capture and operationalize curiosity in self-motivated information seeking. We extended the comparison of knowledge networks from lab-based studies with those of Wikipedia readers to include different time periods, countries, and language versions, further generalizing the framework positing at least two types of curiosity. Our analysis also provides quantitative evidence for additional types of curiosity beyond the previously proposed taxonomy. We are in the process of finalizing a publication. (Learn more)
A model for understanding knowledge propagation across Wikimedia projects. Due to other priorities, this project was paused during the period of this report.
A better understanding of reference quality in English Wikipedia. We learned that in English Wikipedia less than 1% of the references are from non-authoritative sources and the percentage of statements in need of citation has dropped by 20% over the past years. Both of these observations highlight the importance of the work of the English Wikipedia community in improving the integrity of knowledge on Wikipedia. This research is now concluded. (Paper)
Conducting foundational work
Wikimedia projects are created and maintained by a vast network of individual contributors and organizations. We focus part of our efforts on expanding and strengthening this community.
A Wikimedia Research course. We continued developing the course modules, inviting instructors for different modules, and arriving at a first draft of learning goals for each module. While we made some progress, some of which is publicly documented, progress was slower than anticipated due to unforeseen circumstances.
Wiki Workshop. We celebrated the 10th edition of Wiki Workshop on May 11th, 2023 with the first-ever stand-alone Wiki Workshop event. More than 250 participants joined us in this year's edition. We received a record number of 69 submissions and the participants were deeply engaged with the event. If you missed the event, worry not! You can review the schedule, check out the accepted extended abstracts, and even watch the recorded sessions.
Research Showcases. Our January showcase focused on editor retention with a presentation on predicting the departure dynamics of Wikidata editors. We explored the Free Knowledge Ecosystem beyond Wikimedia in our February showcase with talks covering research on OpenStreetMap and data reuse. In March, the showcase featured a study of events on gender bias on Wikipedia and a feminist critical discourse analysis of the #VisibleWikiWomen campaign in honor of International Women's Day and Women's History Month. We focused on Images on Wikipedia in our April showcase with research talks on reader interactions with images and on visual gender biases. The June showcase celebrated LGBTIQA+ Pride month, featuring studies of portrayals of LGBT people in Wikipedia and of non-binary gender representation in Wikidata. For both the March and June showcases, expert Wikimedia volunteers were invited to the discussions in order to celebrate the work that community members have been doing over the years to address gender and sexual orientation gaps. (Learn more)
Office Hour Series. After a six-month pause, we resumed our office hours in January with a new format of one-on-one conversations. (Learn more)
Research Fund. We received 108 applications as part of the Research Fund 2023 grant cycle. From those, 43 were desk-rejected. The remaining 65 applications went through the complete first-stage review process, and 12 were invited to Stage II. We accepted 10 of the 12 Stage II applications and are distributing a total of 351,999 USD among them. The applicants of the 10 accepted proposals come from the following countries: Argentina, Australia, Chile, Nigeria, Portugal, Serbia, UK, and USA.
Research Award of the Year. We awarded WMF-RAY 2023 to Controlled Analyses of Social Biases in Wikipedia Bios and The Gender Divide in Wikipedia: Quantifying and Assessing the Impact of Two Feminist Interventions after carefully reviewing more than 180 research publications on or about Wikimedia projects in 2022. Watch the award ceremony and read about the papers!
TREC Fair Ranking Track. We concluded our support of the TREC Fair Ranking Track. The highlight of this year's track was the expansion of the number of fairness metrics against which recommender systems were evaluated from two to eight. This stretched the limits of feasible computation for common fairness metrics and showed that high performance on one fairness metric does not guarantee high performance on others. Five different teams submitted 24 approaches. More details can be found in the final report.
Wikimedia Hackathon. We participated in the Wikimedia Hackathon, an annual event that brings together the global Wikimedia technical community to improve the technological infrastructure and software that powers and benefits the Wikimedia projects. We were excited to connect with developers and discuss how models from research can improve their work, such as the recently released package for parsing the HTML dumps. We followed up with attendees of this year's Wiki Workshop, which took place the week before and featured a dedicated developer track. We also organized and contributed to running several sessions on the use of AI and ML in Wikimedia projects: a demo session on WikiGPT, a plug-in for ChatGPT using Wikipedia as a knowledge base; a presentation on self-hosting ML models on Cloud Services; and a discussion session on potential opportunities and risks of using large language models in Wikimedia projects.
Presentations and keynotes. We engaged with research audiences through the following presentations and keynotes during the past six months.
In February, we participated at FOSDEM, an annual event for developers of free and open source software from all over the world to meet, share ideas, and collaborate. In our presentation, we gave an overview of our recent efforts building tools to support research on Wikimedia projects. (Video)
In March, we participated in a panel on the role of research and technology development in shaping the future of knowledge creation and consumption as part of the second Wikimedia Technology Summit.
In March, we participated in a panel at MozFest in which we discussed how to build an AI ecosystem dedicated to openness while not further concentrating power in a few platforms.
In March, we participated in a Webinar run by the International Science Council and discussed Managing Knowledge Integrity on Information Platforms. (Video)
In April, we gave an invited talk at the King's College Women in Science series, where we shared the latest updates on our research to address knowledge gaps on the Wikimedia projects.
In May, we held a workshop at Queering Wikipedia 2023 where we shared the latest updates and insights from our knowledge gaps datasets and gathered input and requests from the community about potential usage of this data. (Slides)
In June, we gave an invited talk as part of the Computational Social Science Seminar at Centre Marc Bloch in Berlin. We shared learnings from our research on understanding information seeking of readers in Wikipedia and discussed implications in the context of the larger online ecosystem. (Slides)
In June, we participated in a panel discussion on Applied Computational Social Sciences in the 18th International Conference on Internet, Law and Politics in Barcelona organized by Universitat Oberta de Catalunya to share findings of our research program on knowledge integrity.
Mentorship through Outreachy. If you have read our past reports, you likely know that we are proud to have been mentors in the Outreachy program (for a total of 10 mentorships so far!). We have learned that Outreachy hit the milestone of 1000 interns mentored. Congratulations to the team behind Outreachy and to the Wikimedia Foundation's Developer Advocacy team, which has led Wikimedia's participation in Outreachy since 2013!
In February, we published a blogpost with Nazia Tasnim summarizing the result of her internship in 2022 on building a Python package to easily work with Wikipedia's HTML dumps.
In March, Sheila Karuku completed her internship. Her work culminated with a prototype for a web app for patrolling based on the new ML-based service to predict reverts. Sheila's work is already contributing to other streams of work including improvements in our patrolling models.
Mentorship through internships. In February, Nicholas Ifeajika began working on building API endpoints to make our knowledge gaps datasets available for public usage, including usage by tools and bots, which can inform further decision making. Nicholas' work involved designing and implementing an API that allows users to easily query large amounts of metrics data. The endpoints and the documentation on how to use them are available on the project page.
[New] Ethical AI. We have renewed efforts around formalizing our approach to ethical development and deployment of AI technologies on the Wikimedia projects. This builds on many years of experience in this area (see the recommendations from 2019 on ethical ML processes or more recent set of principles guiding our work on Addressing Knowledge Gaps) but with the additional context of the recent explosion of generative AI models and associated opportunities and challenges. Our work has taken several forms as described below.
We have been supporting the Wikimedia Android team in collaborating with researchers from EPFL to pilot a model that the EPFL team developed for recommending Wikidata article descriptions to add to Wikipedia articles across up to 25 languages. The model generates text but relies heavily on existing content in the lead paragraphs on Wikipedia and existing article descriptions in other languages. This makes the task a well-constrained example of generative AI using large language models and thus an important early test case in what new harms we need to be aware of when deploying these technologies and what guardrails need to be available to ensure that these new technologies are beneficial. (Learn more)
We have begun to work with the Human Rights Team to design a checklist to aid in the evaluation of potential AI services as they relate to Wikimedia Foundation's Human Rights Policy. The checklist will consist of a series of questions that guide the evaluation of how a given AI service might contribute to harms against human rights such as the Right to Non-Discrimination or Effective Remedy and prompts to design mitigations that can reduce these harms.
We have been supporting the development of a plugin for incorporating Wikipedia content into chat-based agents. The initial use-case will be with ChatGPT and will help us advance our understanding of how readers might use generative models to interact with Wikimedia content. It also provides an additional context for improving our approaches to assessing potential harms of these generative technologies. (Code)
The people on the Research team
Emily Lescak Senior Research Community Officer
In April 2023, Emily Lescak, Senior Research Community Officer, left the Wikimedia Foundation and joined the Pathways to Enable Open-Source Ecosystems Training Program at the Center for Scientific Collaboration and Community Engagement as a Project and Community Manager. Emily was the first person we hired in the Research team with a dedicated focus on the Wikimedia volunteer research community. She represented the needs of this community within our team and worked tirelessly to enhance our understanding of the opportunities and challenges we have in serving this audience. During her tenure, Emily developed our team's strategy with regard to the research community. She also contributed significantly to new and existing initiatives such as the Wikimedia Research Fund and Wiki Workshop. We will miss Emily on our team: her friendship, her commitment, and her ability to bring organization and order to complex circumstances. We wish her all the best in the path ahead and look forward to when our paths cross again.
Kinneret Gordon joined us as our Senior Research Community Officer. Kinneret earned her undergraduate degree in Cognitive Science (UCLA) and is expected to finish her MBA degree (Bar-Ilan) this summer. Kinneret joined the Partnership team at the Wikimedia Foundation in 2021 as a Senior Research Partnership Specialist. During her time in the Partnership team she was involved in several projects including the partnership with ReadCoop as part of the Wikisource Loves Manuscripts initiative and the Sowt podcast collaboration. She was also part of a cross-functional team that focused on regional learning and evaluation of Wikimedia grantee reports. In the Research team, Kinneret will focus on the research community aspect of the work of our team, namely, growing and strengthening the Wikimedia research communities.
Yu-Ming Liou joined us as a Lead Strategist. Yu-Ming is a political scientist by training (ABD, Department of Government, Georgetown University); his research focused on the political economy of trade and natural resource extraction and was published in both academic journals and public-facing outlets. He has extensive experience in quantitative social science, field experiments, randomized controlled trials, survey research, and impact evaluation. Before joining the Wikimedia Foundation, Yu-Ming worked as a researcher and data scientist in international development and US politics, including at the Analyst Institute and TargetSmart. In 2021 he joined the Global Data and Insights team at the Foundation as a Lead Strategist, where he led learning and evaluation projects such as the Organizer Lab. Yu-Ming has started his work on our team with a primary focus on the Address Knowledge Gaps program.
Caroline Myrick joined us as a Senior Analyst. Caroline holds a PhD in Sociology (Department of Sociology and Anthropology, North Carolina State University) and an MA in English Linguistics (Department of English, North Carolina State University). She has experience in quantitative and qualitative social science, field research, data science, and public engagement. During graduate school, Caroline worked in the Linguistics Lab and was involved in public outreach via the Language and Life Project. She has published research in scholarly and popular publications. Prior to joining the Foundation, Caroline worked as a university instructor at North Carolina State University, a data scientist at an education nonprofit, and a data analyst in the public sector. She joined the Global Data and Insights team at the Foundation as a Senior Analyst in 2022 and has provided support for multiple projects including the Organizer Lab and the Developer Satisfaction Survey. Caroline has started her work on our team with a primary focus on the Address Knowledge Gaps program.
The Research team's other staff members include Research Scientist Pablo Aragón, Senior Research Scientists Martin Gerlach, Isaac Johnson, and Diego Sáez-Trumper, Senior Research Engineer Fabian Kaelin, Research Manager Miriam Redi, and the Head of Research Leila Zia. Our Research Fellow is Bob West.
- Katrin Weller is a group leader at GESIS - Leibniz Institute for the Social Sciences in the department for Computational Social Sciences. Her work focuses on social media, new types of research data and data preservation, scholarly communication and altmetrics, web users and communication structures. She collaborates with us in the Understanding perception of readability in Wikipedia project.
- Mareike Wieland is a Post-doc at GESIS - Leibniz Institute for the Social Sciences in the department for Computational Social Sciences. Her research focuses on the use, processing, and effects of (political) information in automated media environments and the unlocking of new types of data through smartphones. She collaborates with us in the Understanding perception of readability in Wikipedia project.
- Indira Sen is a PhD student at GESIS - Leibniz Institute for the Social Sciences in the department for Computational Social Sciences. Her work focuses on developing theory-based and generalizable computational models for measuring social constructs from digital trace data. She collaborates with us in the Understanding perception of readability in Wikipedia project.
We also want to take this opportunity to thank Benjamin Mako Hill (University of Washington) for extensively supporting our team's efforts on two major fronts. Mako served as a Research Fund co-chair during the past six months as well as the Award co-chair for the WMF Research Award of the Year.
Research Showcases
Every 3rd Wednesday of the month (virtual)
Join us for Wikimedia-related research presentations and discussions. The showcases are great entry points into the world of Wikimedia research and for connecting with other Wikimedia researchers. Upcoming topics include editor retention, the free knowledge ecosystem, and gender and equity. Learn more
Research Office Hours
Throughout the month (virtual)
You can book a 1:1 consultation session with a member of the Research team to seek advice on your data- or research-related questions. All are welcome! Book a session
TREC 2023 AToMiC Track
July 2023
The TREC AToMiC Track is calling for multimedia retrieval systems that can associate freely licensed images with article sections on English Wikipedia. The submission deadline for this track is July 24th, 2023. Submissions will be evaluated during August, and participants will receive their evaluation scores by the end of September. Learn more
We encourage you to keep in touch with us via one or more of the methods listed in the Keep in touch section to receive more information about these and other events.
Trends to watch
We're keeping an eye on significant trends that relate to the Wikimedia projects and the broader ecosystem in which Wikimedia operates:
AI and the Wikimedia projects. There is no doubt that the latest advancements in AI, particularly the immediate access millions of people around the world now have to the output of some Large Language Models, have affected the Wikimedia and Free Knowledge ecosystem. Here are some things we would like to highlight in this section:
Some of the Wikimedia communities started engaging with large language models as early as December 2022. It is important to highlight that the English Wikipedia editor community is actively developing a policy for the usage of LLMs on the project.
Open-source models have played a large role in experimentation with the newer generative AI models, which is heartening to see. There are still many questions about how openness can help democratize and bring greater security to the space of AI. For example: whether to support ethical use restrictions for AI licenses, how to balance the importance of open-source model weights with open-source training data, and how to promote models that are easier to train and deploy without access to extremely expensive (and closed-source) GPUs.
If you are interested in learning more about some of our team's perspectives on this space, we encourage you to watch the relevant Wiki Workshop panel.
There are many places where AI models can support the Wikimedia projects. A common theme that has arisen across these use cases is summarizing content – e.g., articles, talk pages, task-tracking software. We are always working to collect more use cases and welcome ideas, which help us understand what role would be most useful for our team to play in supporting beneficial uses of AI on the Wikimedia projects.
Differential privacy. In Research Report Nº 5 we shared that the Wikimedia Foundation had started investing in differential privacy. We are happy to share that the Security Team has released the first differentially private dataset (daily pageviews by project, article, and country). The Security Team is now focusing on additional datasets covering Search and editor geography. (Learn More)
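At its core, a differentially private count release adds carefully calibrated noise to each true count before publication. Below is a minimal sketch of the classic Laplace mechanism; the released pageview datasets come from the Security Team's production pipeline, which involves considerably more than this function:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count under epsilon-differential privacy, assuming one
    reader changes the count by at most 1 (sensitivity 1). A simplified
    sketch, not the production release mechanism."""
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
print(dp_count(5000, epsilon=1.0))  # true count plus a small random offset
```

Smaller epsilon means more noise and stronger privacy; the published datasets choose parameters so that per-article daily counts stay useful while individual readers remain protected.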
Funds for the Research team are provided by donors who give to the Wikimedia Foundation. Thank you!
Keep in touch with us
The Wikimedia Foundation's Research team is part of a global network of researchers who advance our understanding of the Wikimedia projects. We invite everyone who wants to stay in touch with this network to join the public wiki-research-l mailing list. You can follow us on Twitter and Mastodon.