By: Reshama Shaikh (Data Umbrella Lead) and Angela Okune (Event Fund Program Manager)
Supported by the Code for Science & Society Event Fund (Gordon and Betty Moore Foundation Grant GBMF8449), Data Umbrella has been hosting regular data sprints to grow the skills and confidence of first-time and regular contributors to open source software communities such as scikit-learn, a widely used machine learning library for Python. This post is part of an ongoing series on the organizing processes, tools, and practices used to run successful Event Fund-sponsored Open Science events. In it, we learn about some of the changes that Data Umbrella made when shifting from in-person data sprints to an online format. Through this blog series, the Event Fund seeks to highlight the processes and practices that undergird more inclusive and accessible data science event programming.
Prior to 2020, most data sprints were held in person as intensive 8-hour days. Data Umbrella founder Reshama Shaikh, for example, led several in-person sprints in New York (2017, 2018, 2019), Nairobi (2019) and San Francisco (2019). Data Umbrella had always been interested in developing online resources and exploring ways to enable virtual participation, but this did not become a priority until 2020, when the pandemic forced everything online, including data sprints. It was clear that an 8-hour in-person event could not simply become an 8-hour online event, so the move to online data sprints required the team to rethink the format and mechanisms of the event.
Data Umbrella split the synchronous time into multiple shorter events: pre-sprint office hours, a four-hour sprint, and post-sprint office hours. To enable participants to get to know each other in the virtual setting, their photos and locations were shared on the sprint websites on an opt-in basis. A greater emphasis was put on pair programming, which respondents to the feedback survey frequently cited as one of the most appreciated aspects of the online data sprints.
A pilot sprint was run in June 2020, after the pandemic began, to explore what running a data sprint online might look like. After a successful pilot attended by almost 40 participants from around the world, Data Umbrella, with support from the Event Fund, was able to organize three online open source sprints in 2021 (February, June, and October), with over 100 participants attending from around the world.
Running the online data sprints successfully depended on both visible infrastructure (technology and software) and less visible infrastructure (norms and working styles).
Data Umbrella began in 2019 to create a community for underrepresented persons in data science, expanding from a focus on increasing representation of women in open source communities towards more intersectional understandings of the diversity of open source software contributors. Figure 2 is a world map showing the locations of Data Umbrella’s 2021 sprint participants, who joined from about 30 different countries. It stands in stark contrast to Figure 1, the map of the founding team members of the software.
In running these sprints, Data Umbrella learned that written documentation about how to contribute can often be intimidating or inaccessible, and that other media (videos, transcripts, translations) often help to make contributing more accessible. The Data Umbrella organizers also realized that these sprints are important not only for contributors: they also give the core development team an opportunity to receive feedback on the documentation and the contributing process, and to make improvements.
As a result of the data sprints, several exciting impact stories have begun to emerge. A returning sprint participant, Amanda D’Souza of India, worked on an issue related to a dataset which used ethically questionable data. She contributed to the numerous pull requests which moved this issue forward and removed the dataset from the library.
There are 1 million users of the scikit-learn library around the world. The core developer team is approximately 20 people and there are only 6 people on triage. A sprint participant from Latin America, Juan Martín Loyola, contributed extensively towards the scikit-learn library in 2021 as a result of the Data Umbrella sprints. He was recently invited to become a Triage Team member. Receiving this invitation is a significant milestone and one that Data Umbrella is very happy to have supported. Read more in “Interview with Juan Martín Loyola.”
Another contributor, Maren Westerman, was an attendee at all four Data Umbrella events and has now become an organizer for a meet-up group in her city in Germany. Maren is applying ideas she learned from the Data Umbrella sprints to organize her own meet-up and hackathon community.
A number of sprint participants wrote blog posts about their sprint experience.
Data Umbrella event organizers would like to see growth in the community of contributors, especially with involvement from people outside of traditional metropolitan tech hubs in the US and Western Europe. In January and February 2022, Data Umbrella collaborated with PyMC (a Bayesian Python library) to create a series of videos and tutorials for the community on contributing to PyMC, which culminated in an online sprint. Data Umbrella is organizing an open source sprint in June 2022 for the Python libraries NumPy and SciPy, with a focus on Africa and the Middle East. The group is looking for further funding to create resources that will enable others to run data sprints, moving towards a “train-the-trainer” model.
Data Umbrella would like to thank the scikit-learn maintainers for their contributions to the sprints, from curating issues to patiently reviewing pull requests, updating the contributing documentation based on feedback and creating video resources.