I have become convinced over the past few months that Public Lab can improve the way we produce software. Open source software developed by Public Lab - MapKnitter, Infragram, and SpectralWorkbench, to name a few of the most widely-used - powers much of the grassroots scientific research that is central to Public Lab's mission. Despite this, contributions to Public Lab software come overwhelmingly from a tiny core cadre of developers.
For example, as of now (December 2014), while the main Public Lab mailing list has over 3,100 members and the plots-dev mailing list has about 70 members,
- MapKnitter has five contributors
- SpectralWorkbench has five contributors
- The website publiclab.org has seven contributors
- Infragram has four contributors.
(counting only contributors to the master branch). Many of these (few) contributors commit to multiple Public Lab projects.
More seriously, code quality is not consistent. Code is seldom tested, follows no standard conventions, and (in some cases) relies on legacy libraries that are not maintained. In short, much of Public Lab's codebase flouts software development best practices, which makes joining the projects and maintaining the code more difficult than need be.
How can we improve?
Develop Guidelines for Contributors
I'm starting by putting together a wiki page for Public Lab software contributors.
There is already a wiki page covering Developing Source Tools with Public Lab - but this page is more about the philosophy of open-source and the advantages of developing open-source tools as part of a community. We could use a page that provides a technical guide to working with Public Lab. Contributing to Public Lab Software is that page.
Compiling a list of development guidelines and best practices is critical to the growth of the Public Lab software development community. First and most importantly, it will improve the quality of our code. Prominently published contributor guidelines following standard conventions - and a codebase conforming to those guidelines - are signals to developers of the integrity of an open-source project. Guidelines are also important for newcomers - folks just getting started with coding - because they provide structure and instruction.
Integrate discussion on plots-dev and GitHub Issues
Another thing we might do is to create an IFTTT recipe to send an email to the plots-dev list whenever someone opens a bug report for a Public Lab software project on GitHub. Right now, for the most part, users with issues either send a message to the plots-dev mailing list reporting a bug, or they open an issue on GitHub - but not both.
Sending a message to the plots-dev mailing list is great. Really! Sometimes people don't want to report a bug - they just have questions about how to use the software - and the mailing list is a great place for those questions. The mailing list is great because everyone who is interested in Public Lab software is on it, so sending a message to the plots-dev list gives that message a wide audience.
But GitHub is where most of the actual work gets done. New features and bugfixes are pushed to GitHub, and pull requests are discussed on GitHub. Furthermore, GitHub provides powerful, simple, and expressive tools for engaging with code, including GitHub Issues, diff comments, and todo lists. The disadvantage of GitHub (at the moment) is that the audience for GitHub issues is vastly smaller than the audience on plots-dev.
The status quo serves to maintain the gulf between technical and non-technical Public Lab contributors. Issues often start on plots-dev, then migrate over to GitHub Issues once work begins on bugfixes; splitting the discussion across multiple forums like this makes it hard to follow the trajectory of a bug. It is my hope that integrating plots-dev and GitHub Issues with an IFTTT recipe would make the bugfix process more efficient and get non-technical or non-contributing technical folks involved with the codebase on GitHub in new ways.
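As a rough illustration of the integration described above, here is a minimal Python sketch of how a GitHub "issues" webhook payload could be turned into an email for the list. This is not an IFTTT recipe and not something Public Lab runs; the addresses and the function name `issue_to_email` are hypothetical. The payload fields it reads (`action`, `issue.number`, `issue.title`, `issue.body`, `issue.html_url`, `issue.user.login`, `repository.full_name`) are the ones GitHub sends with issue events.

```python
from email.message import EmailMessage
from typing import Optional

# Hypothetical addresses, for illustration only.
LIST_ADDRESS = "plots-dev@example.org"
SENDER_ADDRESS = "issues-bot@example.org"


def issue_to_email(payload: dict) -> Optional[EmailMessage]:
    """Build a notification email from a GitHub "issues" webhook payload.

    Returns None for anything other than a newly opened issue, so that
    edits, labels, and closes do not generate extra list traffic.
    """
    if payload.get("action") != "opened":
        return None

    issue = payload["issue"]
    repo = payload["repository"]["full_name"]

    message = EmailMessage()
    message["From"] = SENDER_ADDRESS
    message["To"] = LIST_ADDRESS
    message["Subject"] = f"[{repo}] New issue #{issue['number']}: {issue['title']}"
    message.set_content(
        f"{issue['user']['login']} opened a new issue on {repo}:\n\n"
        f"{issue.get('body') or '(no description)'}\n\n"
        f"Discuss on GitHub: {issue['html_url']}\n"
    )
    return message


if __name__ == "__main__":
    # A made-up payload containing only the fields the sketch reads.
    example = {
        "action": "opened",
        "issue": {
            "number": 7,
            "title": "Export fails",
            "body": "Exporting a map produces an empty file.",
            "html_url": "https://github.com/example-org/example-repo/issues/7",
            "user": {"login": "alice"},
        },
        "repository": {"full_name": "example-org/example-repo"},
    }
    print(issue_to_email(example))
```

Actually delivering the message (for example with `smtplib`) is left out. Forwarding only `opened` events is one possible way to keep the list readable; the link back to GitHub keeps the discussion itself in one place.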
Goals
- Make it easier for technical folks to make high-quality contributions to the Public Lab codebase
- Make it easier for maintainers to accept contributions from the community
- Engage non-technical community members with Public Lab code
- Encourage software development best practices that will make Public Lab software more reliable, easier to maintain, less buggy, and longer-lived
Questions
As I've worked on the Contributors guide, I've realized that I'm unsure of the proper audience for the guide. Is it possible to assemble a guide for contributors that is appropriate for non-technical and technical community members alike? Is it too early to add non-technical content to the Contributors guide (e.g., What is git? What is GitHub?)? This raises a few questions to which I've provided some partial answers:
What forms might non-technical engagement take?
- Get non-technical folks to create a GitHub account and post bug reports on GitHub Issues rather than sending them to the plots-dev list (see above).
- In addition to kite-mapping and spectrometry, we might host workshops every now and then focused on getting people set up with GitHub and git and thinking about open-source software.
How might we reach outside of the Public Lab community to engage technical folks?
- Release general-purpose libraries.
- Libraries that are used by the open-source community at large will raise Public Lab's profile in the open-source software community and pique the interest of serious developers who are looking to get involved.
- Leaflet.Illustrate, Leaflet.DistortableImage, and Leaflet.Toolbar are examples of general-purpose libraries (all JavaScript libraries) that have been developed in service of Public Lab products.
- Are there features in other Public Lab software projects - Infragram, SpectralWorkbench - that might be drawn out and developed as a standalone library supporting the Public Lab app?
I envision Public Lab growing to include a community of engineers and software developers who are excited about building open-source software tools and teaching members of the community how to contribute. What do you think? Let's get a discussion started!
6 Comments
Justin, this is a great step you've taken.
From my own experience with companies that pay me for open source software, there tends to be this disconnect between in house development (which is paid for) and open source development (which requires practices like what you suggest).
Growing the number of developers is an important step. PL only has one paid developer and that's @warren. If you analyze the commits across GitHub for PL projects, you'll find probably 90% of all commits are his.
From my experience, there are three major factors for getting open source assistance.
I tend to address #3 by making portable virtual machines which carry the development environment, using deployment scripts to bootstrap systems (and VMs), and as you already mentioned, documentation. Documentation is helpful, but even the best instructions suck when the user has to spend 4 hours getting the software bootstrapped. Automation is key for complex setups. Almost all web service platforms I've worked with have complex setups.
For #2, we often see that open source projects that help programmers and software engineers tend to get a lot more attention; look at GCC, LLVM, and such. They are making dog food because they want to eat that dog food, they can make that dog food, and can't sit around waiting for someone else to make the dog food they want. The Public Laboratory community itself tends to attract a community with a very small percentage of people with programming skills. Even if they want tools and resources, they aren't well equipped to help on the software side. Some non-technical folk in the community have been learning to program and learning to use tools like GitHub. I applaud that effort!
For #1, that's just marketing. It is highly related to #2 at present. Most people who learn about Public Lab enough to pay attention are the kind of people who would be passionate about Public Lab in the first place. Those folks come with a low developer ratio to other skill sets. The GSoC projects market to developers who need experience and get us exactly that. The problem is those developers are not passionate about the project (#2), so there is little follow through after the GSoC program ends. There are, of course, some exceptions :)
As far as GitHub and plots-dev integration goes, adding the noise from every GitHub project to the plots-dev list might dissuade people from being part of the list, choosing instead to subscribe only to the projects they care about on GitHub. That would seem to indicate that the list would serve no desirable function. Rather than use IFTTT to merge the two channels, it might be wise to specify why they are differentiated. I don't see any obvious differentiation right now, and I personally default to using GitHub exclusively unless I'm responding to someone else on plots-dev. The only advantage I see to plots-dev is that it is not project specific, so it could be used to discuss developer organizational issues exactly like this one.
Normal users don't interact with plots-dev at all. It is strictly for folks interested in development. When troubles come up, normal users email web@publiclab.org or post issues to GitHub. The good news is that we do see fresh GitHub accounts created to post issues to our projects, so some lay users are indeed creating GitHub accounts. Very cool!
It might be worth redirecting users to email plots-dev directly. Unfortunately plots-dev would start to get a lot of spam that is not actionable by developers, for example fixing this account or that research note. Many user requests require IT management (and access to the production systems) rather than software development. Subscribers to web@publiclab.org serve as curators who can and will create issues on GitHub where appropriate or deal with the IT problems as appropriate. I'm curious if this is something we should think about changing though. I like curation, but it's clear to me that how things work is not being made transparent.
To address one of your bullet points with respect to plots2, I think we need to drop a lot of custom code and replace it with external libraries that serve the same function. Our wiki features are an example of something mostly hand written that I suspect could be replaced by more comprehensive gems. I could be quite wrong though. I'm not familiar enough with the RoR space to decisively say what I'm saying. I know in other platforms, many of the features and functions we have already exist as external libraries with almost no integration work required. I do not get that sense for plots2.
I think it is very important to improve the state of the software paradigms raised in this note. Keep in mind, however, that the development community might not be small because of the software organization itself, but rather because the people with strong software skills just don't care enough about the cause. Regardless, lowering the barrier to entry reduces the need for that passion (although it is unlikely to breed any sort of long term engagement).
Justin! I am SOO excited about this post! These are great reflections and ideas. Let me know if the community development team (@liz or myself) can help in any way. Keep us posted.
I just wanted to +1 this post -- and although as @btbonval points out, I am the only paid developer on these projects, I can only devote about 10-20% of my time towards them, not counting support and sysadmin work -- making these topics even more urgent. All of the goals @justinmanley outlines are ones I share, and in particular the last one:
This often falls by the wayside since we have such a use-driven development program -- that is to say, as Bryan points out, there are so many users of the software, and so relatively few developers. It's harder to explain or justify the need to take a step back and rearchitect a system, or to spend time writing tests, vs. adding a new feature folks really need. Also, new developers are rarely as interested in writing tests or refactoring code as they are in new features -- one reason we are very lucky to have involvement from folks like @justinmanley.
Finally, because of the very entrenched and opinionated user base (which as developers, we're very lucky to have), there is a very high emphasis on continuity. It's important to continue to support legacy interfaces and tools longer than would be ideal in other circumstances, because people have worked very hard to create and edit datasets which support their environmental monitoring needs.
+1 also for modularity; this was one of the biggest reasons to tackle the Leaflet.DistortableImage project. MapKnitter also represents the "worst case scenario" -- as a project which was originally intended to do literally no part of its current feature set, and much of which was written during a relative crisis -- the BP oil spill. Until now, we've never had time to go back and rearchitect it to "do things right", which makes this a very exciting time for the tool.
I think it's also fair to say that SpectralWorkbench and especially Infragram.org are in need of a lot of this kind of work as well; publiclab.org too, but to a lesser degree, as it does actually have a barebones test suite.
Anyhow, I'm very glad to see @justinmanley and others thinking about the PL community and long-term development sustainability; this is a discussion and process which we very much need to embark upon.
@btbonval, @warren - thanks both for your comments!
@btbonval - I think you're probably right about the distinction between GitHub issues and plots-dev. Merging them with IFTTT might indeed be more of a burden than a benefit. That said, it's been useful for me to think about the difference between plots-dev and gh-issues; I think that we should make our best effort to keep plots-dev for general development discussion, and direct all issue complaints, etc. to GitHub.
I think it's great that you already find people creating GitHub accounts just to report issues! Perhaps it might help with this if there is a PL wiki page / article on the internet that we can direct non-technical folks to that will tell them what GitHub is and how to set up a GitHub account, as well as instructions on how to file a good GitHub issue. Again, I come back to the issue of audience - I'm very aware that the PL community is largely non-technical, and so we need to be very careful to think about GitHub, etc. from the perspective of the non-technical user.
Re: your comment about reducing application-specific plots2 code - sounds good to me! The first step may be flagging some areas in the plots2 codebase that are ripe for this kind of refactoring. I would suggest opening a GitHub issue on the plots2 repo when you get a chance. That issue can then serve as a place for folks to note potential areas for refactoring.
Personally, I don't think that the issue is that software folks don't care about Public Lab's cause. I think it's that the word hasn't gotten out in the technical community b/c PL programming, etc. has been mostly focused towards the environmental community, not the technical community (e.g. sponsoring hackathons, etc.).
I had a quick look and noticed we don't maintain wikis for any of the PL repositories. Any opinion for or against using GitHub wikis?
A README is great, but most times it's only just enough, while a wiki can provide a clear point of entry for users as well as developers. GitHub wikis are powered by Gollum (I think), which means they're git repositories themselves and thus provide a way for non-devs to learn git if they're so inclined.
I mention this because I'm going to clone infragram-js and build it. Naturally I'll create a document for myself and anyone else interested. Maybe that's a starting point for an infragram wiki?
Another helpful document for on-boarding folks would be something that covers 'best practices' for working collaboratively with git. (Correction: there's this and this courtesy of Justin).
Snowplow is a good example of what you can do with a GitHub wiki. Overall I think it's an effective way to draw others in and I believe there's no cost so long as it's open source. Access can be controlled in a similar way as for any pull request.
Thanks to Justin for the heads-up.
Hi, @geraldmc - I guess I have a slight preference for using PublicLab.org wikis, but I'm relatively agnostic. I tend to think that putting more documentation and assets in our own ecosystem will encourage people to live in it, and also consolidate our open sourcingness somewhat. Fewer platforms can be nice. But I don't feel super strongly about it. Maybe @liz has some thoughts?