Research Report Nº 1
The first in a series of biannual reports from Wikimedia Research, published every June and December.
Welcome! We’re the Wikimedia Foundation’s Research team, a team of scientists and engineers who turn research questions into publicly shared knowledge. We design and test new technologies, produce empirical insights to support new products and programs, and publish research informing the Wikimedia Foundation’s and the Movement’s strategy. This Research Report is an overview of the team’s latest developments – an entry point that highlights existing and new work, introduces new staff, and details new collaborations and new considerations, including trends that we’re watching.
In the past six months, we’ve worked with other staff at the Wikimedia Foundation, our Formal Collaborators, Wikimedia affiliates, and Wikimedia volunteers to address knowledge gaps on the Wikimedia projects, improve knowledge integrity, and conduct foundational work. Each success is crucial in breaking down the barriers that prevent people from freely sharing in the sum of all knowledge. We’ve developed a better understanding of Wikipedia readers and the gaps in readership. We’ve conducted research to understand how patrolling works on Wikimedia projects and where the risks and opportunities lie for addressing disinformation on the projects. And we’ve contributed to building a stronger network of researchers for Wikimedia projects.
As you read about our work below, we invite you to learn more – through links that go to other pages, links that lead to collaborative projects, and links that lead to other key programs, including events that we’re organizing or attending. The work that the Research team does is a work in progress. That’s why we’re publishing these new Research Reports: To keep you updated on what we think is important to know – about our most recent work, and about the work that will happen in the next six months.
– Leila Zia, with extensive support from Jonathan Curiel and the Research team
Addressing knowledge gaps
In this program we aim to help Wikimedia projects engage more editors, increase the diversity of editors, identify and address missing content, and reach more readers across the globe. The Wikimedia projects are a form of discourse that we want everyone to share in. Who writes the knowledge matters. Who reads the knowledge matters. What’s missing from that knowledge matters. That’s our challenge as we help address our projects’ knowledge gaps.
A deeper understanding of Wikipedia use cases and readership gaps. Since 2016, we have conducted a series of studies to build a taxonomy of readership for Wikipedia and to characterize Wikipedia readership and usage in a variety of Wikipedia languages across the globe. Through these studies we learned about the needs and motivations of Wikipedia readers, and we observed signals that the Human Development Index (HDI) of the country from which a reader accesses Wikipedia correlates with the way they use Wikipedia. For example, we observed that readers in countries with low HDI tend to read articles in more depth, while those in countries with high HDI do more quick look-ups. As a follow-up to this study, in 2019, with the help of numerous Wikipedia volunteers, we launched a series of surveys in 14 Wikipedia languages to learn about readers’ motivations and information needs as well as their demographic information (age, gender, education, locale, native language). Almost 70,000 readers participated in these surveys. This new research highlights the diversity of readers across these Wikipedias and surfaces some of the challenges and opportunities different Wikipedia languages face. For example, we see that almost half of the readers of English and French Wikipedias are not native speakers of these languages, signaling a potential need for simpler language in articles in these languages. (Results, Video, Page)
A series of hypotheses for gender gaps observed in readership. One learning from the study of readers and their demographics is that a gender gap exists in Wikipedia readership across the languages the surveys ran in, with the exception of Romanian Wikipedia. In the past months, we focused some of our attention on putting together a series of hypotheses that may explain the observed gaps. We expect to continue the research in this space in the coming months and will share what we learn in future reports.
An improved readership experience through better new user onboarding. Something fun that happened during the past few months: We found an alignment between our interests to improve Wikipedia usage for readers and the work that the Wikimedia Foundation’s Growth Team has been doing to improve editor retention in medium-size Wikipedias. Through this ongoing collaboration we are putting into use a hyperlink recommendation model to help onboard new editors by asking them to create the suggested link. Such a model can also help improve readership and usage of Wikipedia by better connecting the article graph of Wikipedia and providing links for serendipitous discovery of content.
A taxonomy of knowledge gaps. In October 2019, we started working on building a taxonomy of knowledge gaps for Wikimedia projects with an overarching goal of developing a knowledge gap index. The index can provide a unified framework to capture the interdependence of content, readers, and contributors. Such an index will also provide a consistent and relevant metric for measuring the progress and impact of the different initiatives in our team and the rest of the organization. To develop the index, we are following the framework developed in our white paper. Between now and June 2020, we expect to identify and organize different types of knowledge gaps into a taxonomy and to define metrics that allow for a consistent measurement across different gaps. Starting July 2020, we expect to start the work on the development of the knowledge gap index.
Improving knowledge integrity
The knowledge integrity program helps Wikimedia communities assure the integrity of knowledge on Wikimedia projects. We do this by conducting research and developing and testing technologies that can help editors detect Wikipedia policy violations more effectively. Our current focus is on the following policies: Verifiability and Sockpuppetry. Below is a summary of our ongoing projects.
A model for detecting Wikipedia statements in need of citations. The Wikipedia Verifiability policy states that content on Wikipedia should be supported by reliable sources. There are a variety of ways through which editors and patrollers assure the implementation of this policy. You may have seen sentences in Wikipedia with a “citation needed” template. Mostly manually placed, these templates are signals to readers to exercise caution when reading a statement, since not all facts may be supported. They also act as signals to editors to improve the articles. Earlier in 2019, we published the results of research on building a taxonomy of Wikipedia verifiability and an algorithm that can automatically identify Wikipedia statements in need of citation. In the past six months, we conducted a user study to recommend how to surface the output of the algorithm to the community of editors and developers, and we learned that lower-cost data dump options are the way to move forward in surfacing the service. Aiko Chou is now working with us as an Outreachy intern to help expose the data service.
A reader trust framework and the role of citations. While editors’ efforts continue to improve Wikipedia in terms of statement verifiability, we need to better understand how readers use the citations on Wikipedia and how they assess the trustworthiness of the content they read on the platform. To that end, we are conducting research to understand the role of citations in readers’ trust in Wikipedia. Through surveys, we have learned that most Wikipedia readers place a great deal of trust in the articles they read, and that the level of trust people report varies by country, by article quality, and by their current information needs. We are currently conducting follow-up interview studies with a selection of the people who took the surveys, to learn more about the personal, contextual, and article factors that influence their trust in Wikipedia articles.
A program focused on disinformation. In July 2019, we started a new initiative with a focus on disinformation. Since then, we have conducted an extensive review of the literature on disinformation, which allowed us to arrive at a clearer definition of disinformation (non-accidental misleading information that is likely to create false beliefs) and to learn what has already been done in this space. We also did a qualitative study of patrolling on Wikipedia and wrote a report about how Wikipedia patrolling works and what the current threat models to patrolling look like. In parallel, we gathered an initial list of potential collaborators to work with in this space and gave talks, participated on panels, and organized small events to receive input from the Wikimedia editor, developer, and research communities about which projects are the most important to work on. For example, during Wikimania 2019 we co-organized a meet-up with the Policy team at the Wikimedia Foundation where we discussed the state of disinformation with the editor community. (Notes from the meeting) We are currently working with other teams at the Wikimedia Foundation to finalize a list of projects and initiatives based on our learnings, and we hope to share more with you in future reports.
A Sockpuppet detection model. There are different mechanisms for the spread of disinformation and misinformation on Wikipedia, and while not all are known or well understood, we know that creating and using malicious sockpuppet accounts is one method. Detecting such accounts on Wikimedia projects, however, is a labor-intensive activity that takes substantial time away from the editor community. We are interested in using machine learning to reduce the burden of detecting and verifying sockpuppet accounts. Over the past months we have been working to incorporate a sockpuppet detection model developed by our formal collaborators. In essence, the model extracts features from all user edits and identifies pairs of users that are likely to be similar. Refreshing and improving the model requires additional iterations with Wikimedia checkusers and stewards to gather feedback about the model’s quality and output. The tuned model can then be deployed and used as a recommender to provide input into the decision-making process that determines whether two accounts are sockpuppets or not.
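To make the "extract features, then find similar pairs" idea concrete, here is a minimal Python sketch. It is not the collaborators' model: this report does not describe the actual features or similarity measure, so the features (pages edited, words in edit comments), the cosine similarity, the 0.8 threshold, and all function and account names below are assumptions chosen purely for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def extract_features(edits):
    """Build a sparse feature vector from one user's edits.

    Hypothetical features: which pages were edited and which words
    appear in the edit comments. The real model's features are not
    described in this report.
    """
    features = Counter()
    for edit in edits:
        features["page:" + edit["page"]] += 1
        for word in edit["comment"].lower().split():
            features["word:" + word] += 1
    return features


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(edits_by_user, threshold=0.8):
    """Return (user, user, score) for every pair at or above the threshold."""
    vectors = {user: extract_features(edits) for user, edits in edits_by_user.items()}
    pairs = []
    for u, v in combinations(sorted(vectors), 2):
        score = cosine_similarity(vectors[u], vectors[v])
        if score >= threshold:
            pairs.append((u, v, score))
    # Highest-scoring pairs first, so reviewers see the strongest candidates.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Made-up accounts and edits, for illustration only.
edits_by_user = {
    "UserA": [{"page": "Foo", "comment": "fix typo"}, {"page": "Bar", "comment": "fix typo"}],
    "UserB": [{"page": "Foo", "comment": "fix typo"}, {"page": "Bar", "comment": "fix typo"}],
    "UserC": [{"page": "Baz", "comment": "add reference"}],
}
pairs = similar_pairs(edits_by_user)
```

In this toy data only UserA and UserB are returned as a candidate pair. In keeping with the recommender role described above, such output would be input for human reviewers, not a verdict.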
Conducting foundational work
Wikimedia projects are created and maintained by a vast network of individual contributors and organizations. We focus part of our efforts on expanding and strengthening this network.
Wiki Workshop. With EPFL Data Science Lab, we are co-organizing the 7th annual Wiki Workshop, which will take place in Taipei, Taiwan, on April 21, 2020 as part of the Web Conference 2020. The workshop serves as a platform for the researchers of Wikimedia projects to convene on an annual basis and to present their ongoing and completed research projects, brainstorm and build collaborations, and connect with Wikimedia volunteers and Wikimedia Foundation staff who attend the event. (Review call for contributions and submit your research)
Research track at Wikimania 2019. With Benjamin Mako Hill from the University of Washington, we co-organized a 2.5-day Research Track during Wikimania 2019. The track brought together more than 30 speakers and a room packed with Wikimedia editors, developers, and organizers to share and learn together.
Research Showcases. Since 2013 we have been organizing the Wikimedia Research Showcase, which occurs on the third Wednesday of every month, to showcase the research on Wikimedia projects. These events are broadcast live and recorded for offline view. Research Showcases provide a platform for Wikimedia researchers and developers to come to one place on a monthly basis and learn from each other.
Office hour series. Starting January 2020, we are introducing an experimental monthly office-hours series that will take place on the fourth Wednesday of every month. The series is co-organized by the Wikimedia Foundation’s Research and Analytics teams. We encourage researchers, developers, Movement organizers, and data users from other organizations to attend if they have research or data-related questions for us. You are not required to have a research background. All are welcome! The first office hour is scheduled for January 22, 2020, 17:00-18:00 UTC. (Learn more)
Mentorship through Outreachy. From May through August 2019, Doris Zhou worked on a project to better understand how to evaluate the success of articles generated through the Content Translation tool. She used a mixed-methods approach, studying translations between English, French, and Chinese. She conducted extensive qualitative coding of translated articles and also wrote code to quantify what sections are included and how they are rearranged in translations. She presented her findings at WikiConference North America in Cambridge, Massachusetts this November. (Learn more)
In December 2019, we started working with Aiko Chou as an Outreachy intern. Aiko is working on building a framework to create periodic data dumps exposing sentences that may need a reference on Wikipedia. The system will run our Citation Needed classifiers on a large number of articles in English Wikipedia and export the results to a SQL database. The data export can support a variety of use cases: bots, web applications, and existing tools such as Citation Hunt.
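As a rough illustration of the "score sentences, then export to SQL" flow described above, here is a minimal Python sketch. It makes several assumptions: the scoring function is a toy keyword heuristic standing in for the Citation Needed classifiers (it is not the real model), and the table name, column names, threshold, and in-memory SQLite database are hypothetical choices that are not taken from the actual system.

```python
import sqlite3


def citation_needed_score(sentence):
    """Placeholder scorer, NOT the Citation Needed model.

    A toy heuristic used only so the sketch runs end to end.
    """
    cues = ("according to", "studies show", "reported", "estimated")
    text = sentence.lower()
    return 0.9 if any(cue in text for cue in cues) else 0.1


def export_flagged_sentences(articles, conn, threshold=0.5):
    """Score every sentence and store those at or above the threshold.

    The table and column names are hypothetical.
    Returns the number of rows written.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation_needed ("
        "article TEXT, sentence TEXT, score REAL)"
    )
    rows = []
    for title, sentences in articles.items():
        for sentence in sentences:
            score = citation_needed_score(sentence)
            if score >= threshold:
                rows.append((title, sentence, score))
    conn.executemany("INSERT INTO citation_needed VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up article content, for illustration only.
articles = {
    "Example article": [
        "Studies show that the river floods every spring.",
        "The river is in the north of the country.",
    ],
}
conn = sqlite3.connect(":memory:")
inserted = export_flagged_sentences(articles, conn)
flagged = conn.execute("SELECT article, sentence FROM citation_needed").fetchall()
```

A consumer such as a bot or web application could then query the table instead of running the classifier itself, which is the appeal of a periodic dump over a live service.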
Other student initiatives. From September through December, we mentored capstone projects for two teams of four master’s students each at the Center for Data Science at New York University. We introduced these students to Wikimedia data, resources, and some of the important challenges that we are facing. Through their projects, the students researched the best approaches for scaling some of our machine learning services to more languages.
The people on the Research team
In September 2019, we hired Martin Gerlach as a research scientist. With a doctorate in Physics (Max Planck Institute for the Physics of Complex Systems) and three years of experience as a postdoctoral researcher at Northwestern University’s Amaral Lab, Martin brings strong expertise in multidisciplinary approaches to analyzing and modeling data from human activity in order to understand the dynamics of complex social systems. Martin has co-authored studies that attempted to quantify the spread of new words over the past two centuries and to identify reasons for biases and gaps in the production of scientific knowledge. He further led the development of new computational methods for the organization of large collections of texts and the detection of personality types in large web-based questionnaires in psychometrics.
Also, in September 2019, we hired Djellel Difallah as a research scientist. Djellel did his PhD at the eXascale Infolab at the University of Fribourg, with a thesis titled "Quality of Service in Crowd-Powered Systems," which investigated novel crowdsourcing platform designs. After that, he was a Postdoctoral Fellow at NYU’s Center for Data Science, where he applied data science methods to study crowdsourcing-based environments. In addition, Djellel has a broad interest in knowledge graphs, as he co-authored multiple papers on the topics of entity linking, disambiguation, and completeness in Wikipedia and Wikidata.
The Research team's other staff members include Senior Design Researcher Jonathan Morgan, Research Scientist Isaac Johnson, Research Scientist Miriam Redi, Research Scientist Diego Saez-Trumper, and the Head of Research, Leila Zia. Our Research Fellow is Bob West.
Wiki Workshop. April 21, Taiwan. We invite you to join us in the 7th annual Wiki Workshop to share your Wikimedia related research with others or connect to other Wikimedia researchers.
Research Showcases. Every 3rd Wednesday of the month. Join us remotely for Wikimedia related research presentations and discussions. The showcases are great entry points to the world of Wikimedia research and staying in touch with other Wikimedia researchers.
Research office hours. Every 4th Wednesday of the month. Join us in the Research office hours to have your questions related to Wikimedia data and research answered. All are welcome!
Wikimedia Hackathon. May 9-11, Albania. If you’re interested in becoming a Wikimedia volunteer developer or have ideas that you’d like to implement with other developers, consider attending Wikimedia Hackathon.
We encourage you to keep in touch with us via one or more of the methods listed in the Keep in touch section to receive more information about these and other events.
Trends to watch
We’re keeping an eye on significant trends that relate to the Wikimedia projects and the broader ecosystem in which Wikimedia operates. We highlight four of them here.
Knowledge Equity. The Wikimedia 2030 strategic direction calls out knowledge equity as one of two directions for the Wikimedia Movement in the coming decade. We encourage and expect more activities, research, and development in this space as the Wikimedia Movement attempts to open more doors to those who have been left out by the structures of power. This can mean the inclusion of more languages, more cultures, more editors, and more forms of knowledge.
Disinformation. The challenges of disinformation we face today demand more research and knowledge sharing among researchers, platforms and organizations – all while respecting the privacy of the platforms’ users. There is an increasing need for research on Wikimedia projects (and other platforms) to better understand the role of Wikipedia in democracy, to characterize Wikimedia projects at risk, to study both quantitatively and qualitatively the mechanisms through which disinformation can spread in Wikimedia projects, to strengthen the reliability of the content hosted on other platforms that Wikipedia relies on, and to build services that support the incredible work of Wikimedia volunteers in patrolling content on Wikimedia projects.
Editor growth and retention. As mentioned earlier in the report, the Growth team at Wikimedia Foundation is developing software solutions for the editor growth challenges in medium-size Wikimedia projects. If you are interested in the topic of editor growth and retention or task recommendations and personalization, we recommend you keep an eye on the work of the team and explore their documentation pages. For instance, you can start by learning about the Personalized First Day initiative.
Reuse. We have begun research with the goal of better understanding the degree to which readers consume Wikimedia content outside of the Wikimedia ecosystem, for example, through voice assistants, search engines, or third-party apps. While this research is still in its early days, the hope is to provide a better understanding of how Wikimedia content is consumed outside of Wikimedia and what effects that has on outcomes such as Wikipedia's ability to attract contributors or to make knowledge available from a neutral point of view. More details will be added to our public documentation page as we learn more. If you are interested in the topic of reuse, we’d also like you to know that we are interested in studies that can help us better understand the economic value of Wikipedia.
The Research team's work has been made possible through the contributions of our past and present formal collaborators. With the 2030 Strategic Direction now in place, we expect to build more formal collaborations in the coming months to help achieve the direction set in our Research:2030 white papers. To this end, we’ve recently initiated the following formal collaborations:
- Akhil Arora joined us recently as a formal collaborator. Akhil is a PhD student at EPFL and he will be working with our research fellow, Bob West, to build a model to detect where a hyperlink can be added in an article page. This is a follow-up on the research to improve Wikipedia's hyperlink structure. The overarching line of research in this space is to understand and improve Wikipedia usage by understanding how the Wikipedia graph is being used by readers today.
- Giovanni Colavizza, assistant professor of digital humanities at the University of Amsterdam, joined us in an ongoing project to understand how readers use citations in Wikipedia. Giovanni brings expertise in science studies, specifically text mining and understanding citations and their usage.
Keep in touch with us
The Wikimedia Foundation’s Research team is part of a global network of researchers who study Wikimedia projects. We invite everyone who wants to stay in touch with this network to join the public wiki-research-l mailing list and follow @WikiResearch, which is the research community’s Twitter handle.