Behind the Scenes: What It Takes to Run Data Umbrella’s scikit-learn Open Source Sprints

Supported by the Code for Science & Society Event Fund (Gordon and Betty Moore Foundation Grant GBMF8449), Data Umbrella has been hosting regular data sprints to grow the skills and confidence of first-time and regular contributors to open source software communities such as scikit-learn, a widely used machine learning library for Python. In this post, part of an ongoing series on the organizing processes, tools, and practices used to run successful Event Fund-sponsored Open Science events, we learn about some of the changes that Data Umbrella made when shifting from in-person data sprints to an online format. Through this blog series, Event Fund seeks to highlight the processes and practices that undergird more inclusive and accessible data science event programming.

Prior to 2020, most data sprints were held in person as intensive eight-hour days. Data Umbrella founder Reshama Shaikh, for example, led several in-person sprints in New York (2017, 2018, 2019), Nairobi (2019) and San Francisco (2019). Data Umbrella had always been interested in developing online resources and exploring ways to enable virtual participation, but this did not become a priority until 2020, when the pandemic forced everything online, including data sprints. It was clear that an eight-hour in-person event could not simply become an eight-hour online event, so the move to online data sprints required the team to rethink the format and mechanics of the event.

Data Umbrella split the synchronous time into several shorter events: pre-sprint office hours, a four-hour sprint, and post-sprint office hours. To help participants get to know each other in the virtual setting, participants' photos and locations were shared on the sprint websites on an opt-in basis. Greater emphasis was placed on pair programming, which the feedback survey frequently cited as one of the most appreciated aspects of the online data sprints.

A pilot sprint was run after the pandemic began, in June 2020, to explore what running a data sprint online might look like. After a successful pilot attended by almost 40 participants from around the world, Data Umbrella, with support from the Event Fund, was able to organize three online open source sprints in 2021 (February 2021; June 2021; October 2021) with over 100 participants attending from around the world.

Summary of Online Sprint Events

Event	Website / Report	Event Dates ([a] pre-sprint office hours, [b] sprint, [c] post-sprint office hours)	Participants (per event)
Global Sprint (Pilot online sprint)	Data Umbrella Global Online Sprint Report	[a] None [b] 06-Jun-2020 [c] 20-Jun-2020	[a] N/A [b] 42 [c] 6
Africa and Middle East Sprint (AFME)	afme2021.dataumbrella.org Report	[a] 30-Jan-2021 [b] 06-Feb-2021 [c] 20-Feb-2021	[a] 15 [b] 31 [c] 7
Latin American Sprint (LATAM)	latam2021.dataumbrella.org Report	[a] 19-Jun-2021 [b] 26-Jun-2021 [c] 10-Jul-2021	[a] 29 [b] 40 [c] 12
Africa & Middle East Sprint (AFME2)	afme2021rc.dataumbrella.org Report	[a] 16-Oct-2021 [b] 23-Oct-2021 [c] 06-Nov-2021	[a] 9 [b] 40 [c] 7

Sociotechnical infrastructure

Running the online data sprints successfully depended on both visible infrastructure (technology and software) and less visible infrastructure (norms and working styles). For example:

Pre- and Post-Sprint Office Hours: To help participants start building relationships, Data Umbrella held a one-hour office hours session before and after each sprint. These meetings were an informal way for participants to ask questions before the “official” event and helped set expectations around pair programming and what sort of “issues” in the GitHub repository they would work on. The office hours were hosted by Data Umbrella and attended by the open source library maintainers, so community contributors and maintainers were able to get to know each other in a small group setting, share their screens and walk through pull requests together. Such relaxed discussion spaces helped to strengthen the interpersonal relationships necessary for open source libraries to be maintained.
Sprint Preparation Checklist: Data Umbrella provided a Sprint Prep Checklist with a list of resources for contributors to read and review at their convenience before the online sprint. This gave contributors flexibility in doing the preparation work and let them make the most of the sprint time with the library maintainers and their pair programming partners.
Video Content with Transcripts: Data Umbrella organizers collaborated with the open source library maintainers to create videos (15-30 minutes in duration) which sprint participants could view at their convenience prior to the sprint. Transcripts were also created so that the video content was easier to reference and more accessible.
Translations: Translations of the video transcripts and sprint website were done by volunteer community members for the Latin America sprint. The content was translated into Spanish and Portuguese. Translators were available at all three events (pre-sprint, sprint, post-sprint) to facilitate communication.
Pair Programming: Sprint participants were assigned pair programming partners. The Data Umbrella organizing team matched as best they could based on time zones and experience (for example, matching first-time contributors with returning contributors).
Repeat Sprints: For first-time contributors, the sprints are a valuable and impactful way to get started with open source. The learning curve for contributing is still steep, particularly for a library such as scikit-learn. Repeated sprint events helped build confidence, experience, coding skills and new mentors, to name a few favorable outcomes.
Curated “Issues” on GitHub: One of the barriers for new or continuing contributors in open source is a lack of issues on GitHub with well-defined background information and steps for contributing. Data Umbrella has worked with the open source library maintainers to improve the documentation on issues so that contributors have coding issues they can continue to work on after the sprint.
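The pair-matching step described above (matching by time zone and pairing first-time contributors with returning ones) can be illustrated with a small sketch. This is not Data Umbrella's actual process, which the post describes as a best-effort match by the organizing team. The `Participant` fields, the `match_pairs` helper, the sample names, and the three-hour time zone tolerance are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    name: str
    utc_offset: float  # hours from UTC (hypothetical field)
    returning: bool    # True if they have attended a sprint before


def match_pairs(participants, max_offset_gap=3.0):
    """Greedily pair each first-time contributor with the closest
    remaining returning contributor by time zone.

    Returns (pairs, unmatched). Illustrative helper only; the
    3-hour tolerance is an assumed value, not taken from the post.
    """
    newcomers = sorted((p for p in participants if not p.returning),
                       key=lambda p: p.utc_offset)
    mentors = sorted((p for p in participants if p.returning),
                     key=lambda p: p.utc_offset)
    pairs, unmatched = [], []
    for newcomer in newcomers:
        # Closest remaining returning contributor, or None if none are left.
        best = min(mentors,
                   key=lambda m: abs(m.utc_offset - newcomer.utc_offset),
                   default=None)
        if best is not None and abs(best.utc_offset - newcomer.utc_offset) <= max_offset_gap:
            pairs.append((newcomer, best))
            mentors.remove(best)
        else:
            unmatched.append(newcomer)
    # Returning contributors who were not needed are left for manual pairing.
    unmatched.extend(mentors)
    return pairs, unmatched


# Made-up sample data for illustration.
sample = [
    Participant("Ana", -3, returning=False),
    Participant("Bo", -5, returning=True),
    Participant("Chen", 8, returning=False),
    Participant("Dee", 1, returning=True),
]
pairs, unmatched = match_pairs(sample)
```

With this sample, Ana and Bo are paired because their offsets differ by two hours, while Chen and Dee are returned as unmatched so that an organizer can place them by hand. A greedy pass like this is simple to reason about but does not guarantee the best overall assignment, which is one reason a human review step remains useful.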

Reaching new audiences

Figure 1. While the Python machine learning library scikit-learn has an estimated one million monthly users around the world, its core developers are located primarily in the US and Europe. Half of the core developers are located in France, where the founders of the library reside. This map shows the locations of the core developers of scikit-learn. Source: https://scikit-learn.org/dev/about.html#people

Data Umbrella began in 2019 to create a community for underrepresented persons in data science, expanding from a focus on increasing the representation of women in open source communities towards a more intersectional understanding of the diversity of open source software contributors. Figure 2 is a world map showing the locations of Data Umbrella’s 2021 sprint participants, who joined from about 30 different countries. This map contrasts starkly with Figure 1, the map of the software’s core developers.

In running these sprints, Data Umbrella learned that written documentation about how to contribute can often be intimidating or inaccessible, and that other media (videos, transcripts, translations) often help to make contributing more accessible. The Data Umbrella organizers also realized that these sprints are not only important for contributors; they also give the core development team an opportunity to receive feedback on the documentation and the contributing process and to make improvements.

As a result of the data sprints, several exciting impact stories have begun to emerge. A returning sprint participant, Amanda D’Souza of India, worked on an issue related to a dataset which used ethically questionable data. She contributed to the numerous pull requests which moved this issue forward and removed the dataset from the library.

There are 1 million users of the scikit-learn library around the world. The core developer team is approximately 20 people and there are only 6 people on triage. A sprint participant from Latin America, Juan Martín Loyola, contributed extensively towards the scikit-learn library in 2021 as a result of the Data Umbrella sprints. He was recently invited to become a Triage Team member. Receiving this invitation is a significant milestone and one that Data Umbrella is very happy to have supported. Read more in “Interview with Juan Martín Loyola.”

Another contributor, Maren Westerman, was an attendee at all four Data Umbrella events and has now become an organizer for a meet-up group in her city in Germany. Maren is applying ideas she learned from the Data Umbrella sprints to organize her own meet-up and hackathon community.

A number of sprint participants wrote blog posts about their sprint experience.

What’s next?

Data Umbrella event organizers would like to see growth in the community of contributors, especially with involvement from people outside of traditional metropolitan tech hubs in the US and Western Europe. In January and February 2022, Data Umbrella collaborated with PyMC (a Bayesian Python library) to create a series of videos and tutorials for the community on contributing to PyMC, which culminated in an online sprint. Data Umbrella is organizing an open source sprint in June 2022 for the Python libraries NumPy and SciPy, with a focus on Africa and the Middle East. The group is looking for further funding to create resources that will enable others to run data sprints, moving towards a “train-the-trainer” model.

Acknowledgements

Data Umbrella would like to thank the scikit-learn maintainers for their contributions to the sprints, from curating issues to patiently reviewing pull requests, updating the contributing documentation based on feedback and creating video resources.

Featured Image by David Clode on Unsplash