🎈 Public Lab: Research Note Workflows

by donblair | August 11, 2014 19:34 | #11035

What I want to do

Jeff Walker and I have discussed coming up with a research note generation workflow that would fit nicely with our particular needs. The approach described here isn't going to be the right 'research note creation' approach for everyone, or even most people; but we think it'll be a very useful approach for folks who are writing up long, complex research notes that involve intricate, special formatting. And this approach also has some other nice features (described below). The basic approach is:

Write up a research note on your local computer, in markdown, using your favorite text editor.
Use a tool that renders an html version of the research note, based on this markdown file.
Put these files (the markdown, the html, and any associated data / images) in a git repository, hosted on github.com.
Use 'github pages' to serve your html file on the web.
Create a research note on publiclab.org, and, in the body of the text, embed the resultant 'github page' URL as an "iframe" object in the research note. You can then tag or otherwise edit the research note, as per usual.
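
As a rough illustration of the last two steps, here is a minimal Python sketch that builds the GitHub Pages URL for a repository and wraps it in an iframe tag that could be pasted into a research note. The helper names, the repository name, and the default width and height are assumptions made for this example; the URL pattern is the usual one for GitHub project pages.

```python
from html import escape


def github_pages_url(user: str, repo: str, page: str = "index.html") -> str:
    """Return the conventional GitHub Pages URL for a project site."""
    return f"https://{user}.github.io/{repo}/{page}"


def iframe_embed(url: str, width: str = "100%", height: int = 800) -> str:
    """Wrap a URL in an iframe tag; width and height are illustrative defaults."""
    safe_url = escape(url, quote=True)
    return (
        f'<iframe src="{safe_url}" width="{width}" '
        f'height="{height}" frameborder="0"></iframe>'
    )


# "kayak-note" is a hypothetical repository name used only for illustration.
print(iframe_embed(github_pages_url("walkerjeffd", "kayak-note")))
```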

First, I'll just jot down what motivated us to follow this approach; then you can see our Google Hangout On Air, where we recorded a little run-through of the process -- which generated the following research note: http://publiclab.org/notes/walkerjeffd/08-11-2014/kayak-deployment-on-8-7-2014

Motivation

Local editing of research notes. Writing and editing a long, complex document -- journal articles, computer programs, or particularly in-depth research notes -- is hard. While much of our writing output these days is 'published' online -- via emails, html documents, in shared document folders, or as code that runs on a server -- most of us still choose to do the bulk of our writing on a local computer. Writing and editing locally has several advantages:

no stable internet connection is needed
one usually has access to more powerful text-editing capabilities, and can take advantage of special formatting techniques that might be too esoteric to include in publiclab.org's rendering engine
it's easier to manipulate and organize the relevant files.

For some of the longer and more complex research notes I've been writing on publiclab.org, I've begun to crave these 'local' tools.

Collaboration and version control. Further, the 'version control' tools used in the software industry have begun to infiltrate document production, and scientific research generally. It's very useful, in projects that involve collaboration and/or many drafts of the same piece of writing, to use software that can keep track of changes.

Literate programming, and replicability. Finally, ongoing and increasing concerns about the replicability of scientific research have led some practitioners to imagine new forms of scientific discourse that embed as much of the relevant data and analysis tools in the publication as possible. For example: if you're writing a research note on the geographic distribution of oil refineries in Louisiana, your research might have involved using a program to scrape some online databases for the locations and names of these refineries; then you might upload this data into a Google Map. So that others can follow in your footsteps, it'd be great if you could write up a research note that not only displays the outcomes of this procedure, but embeds the actual, executable code necessary to generate those outcomes: a script that fetches the data online, and formats it for Google Maps; another script that uploads the data to Google Maps. This way, someone who wanted to replicate your data (or generate similar data, for another location), could simply 'fork' your research note immediately and repeat your process. The procedure we're describing here allows for this.

My attempt and results

We'll be writing up the video below into a more concise recipe for folks to follow, but it hopefully gives a basic overview of the workflow that Jeff has come up with, and the one we'll be following when writing up more complex research notes:

Feel free to ask any clarifying questions / suggest improvements to the process in the comments below!

43 Comments

iframes can cause a lot of problems with SSL, which is in the works, along with some other cross site dependency issues. It's generally a bad thing to use if you can avoid it.

We could "upload" a research note by scraping the contents of the supplied page and putting that into the database, but it could lead to a synchronization loss with the upstream version on github. I don't have any recommendations other than the "edit" button could allow a rescrape of the URL.

Were you planning on exposing the markdown on github pages, or only the rendered HTML? If both are exposed, it should be easy enough for the scraper to look for the markdown text and import it directly.

There are going to be problems with embedding images if they are hosted elsewhere. Again cross-site issues. The internet is getting more paranoid, and as it does so, the web clients are getting more finicky about what they'll accept and from where.

Reply to this comment...

Ah, great points. Our initial idea was around just this sort of 'markdown import' you're describing, and it does seem like it has some nice advantages (and avoids the security hiccups you're describing). In fact, we started to write up the 'markdown import' idea here: https://github.com/p-v-o-s/publiclab-research-workflow

You'll see in that README that Jeff Walker had created a mockup of what this sort of feature might look like: a place to insert a URL that points to a markdown file in a github repo; a script then translates the github relative directory references in the markdown into URLs that publiclab.org can use to render the data in the research note.
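
To make the 'translate relative references' step concrete, here is a small Python sketch. It is not the script from the mockup, only one way such a translation might work: it rewrites inline markdown image targets against a base URL and leaves absolute URLs alone. The function name, the regular expression, and the base URL layout are assumptions for this example, and only the inline `![alt](target)` image syntax is handled.

```python
import re
from urllib.parse import urljoin

# Matches the opening of an inline markdown image, then captures its target.
IMAGE_LINK = re.compile(r'(!\[[^\]]*\]\()([^)\s]+)')


def absolutize_images(markdown: str, base_url: str) -> str:
    """Rewrite relative image targets so they resolve against base_url."""
    def fix(match):
        prefix, target = match.group(1), match.group(2)
        # urljoin leaves targets that are already absolute URLs unchanged.
        return prefix + urljoin(base_url, target)

    return IMAGE_LINK.sub(fix, markdown)


# Assumed base URL layout for raw files in a GitHub repo (user/repo are placeholders).
base = "https://raw.githubusercontent.com/user/repo/master/"
print(absolutize_images("![plot](figures/plot.png)", base))
```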

Could there be, maybe, a 'link to github repo' button next to a Research note, akin to an 'image import' button? Then, whenever it was activated with the associated URL, the content of the research note would be updated with the remote github repo's markdown (and associated files)? Yes, the markdown would also be exposed on github -- would it be possible ...

Reply to this comment...

One would expect updates to be more rare than reads by a high margin.

In terms of server workload, checking upstream on github for fresh data would be more sensible when done at the original author's request, reflecting when the author has actually made changes. Looking for updates on each page read is hugely inefficient given the assumed ratio of reads to updates, and could bog the server down needlessly.

Reply to this comment...

Ah, yes, exactly -- by 'activated' above I meant to indicate: "research note author edits the note, and clicks the 'link to github repo' or 'refresh' button underneath the 'github repo URL' text field." There could be a checkbox for 'refresh whenever remote repo changes', etc.
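
The 'refresh on request, not on every read' idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name and the injected `fetch` callable are hypothetical, and a real implementation would live in the site's own backend. Reads are served from the stored copy, and the upstream source is contacted only when `refresh()` is called, for example when the author clicks the button.

```python
from typing import Callable, Optional


class MirroredNote:
    """Keeps a local copy of an upstream document; refetches only on request."""

    def __init__(self, source_url: str, fetch: Callable[[str], str]):
        self.source_url = source_url
        self._fetch = fetch          # injected so no network access is needed here
        self._body: Optional[str] = None
        self.fetch_count = 0         # how many times upstream was contacted

    def refresh(self) -> None:
        """Author-triggered: pull the current upstream content."""
        self._body = self._fetch(self.source_url)
        self.fetch_count += 1

    def read(self) -> str:
        """Reader-triggered: serve the stored copy, fetching only the first time."""
        if self._body is None:
            self.refresh()
        return self._body
```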

If we'd be avoiding iframes, the only other modifications on the publiclab.org side necessary (in addition to this 'link to github repo' button idea) might be: inclusion of mathjax script on the rendering page (for the LaTeX equations) and making sure that the code block formatting works the same way (there seems to be some issue with the 'back tick' syntax?)

Reply to this comment...

Now you've hit a bit of a problem on the head there: by scraping externally generated content, the web team is now bound to the needs of the external content makers as they come up.

I'm not saying we shouldn't support mathjax. We probably should.

However, externally linked content could open a can of worms that suddenly becomes unwieldy from the perspective of publiclab, with requirements driven by whatever external folks are doing.

Reply to this comment...

Yeah -- as soon as we wanted to include mathjax, we started to think -- oh, gosh, now any author with their own particular tools will want to have them built into the publiclab.org rendering engine ... and right, that can quickly become a mess ....

Part of the idea here is that it'd be neat for authors of complex research notes to be able to consider publiclab.org to be a 'publishing platform' akin to traditional science publications, but with all of the modern awesomeness that the open web makes possible: the author writes a document which they can version-control and edit however they like, and this content is displayed, tagged, curated, on publiclab.org -- it seems like a really wonderful model for 'publishing', going forward. The trouble with this vision, as you say, is that every author might have their own peculiar formatting requests / needs.

This led us to the "iframe" approach: using pre-generated HTML, the author is responsible for translating their formatting needs into standard HTML, which publiclab.org can simply display. Is there a way of following this approach -- embedding HTML from some external site -- that doesn't raise security issues?

Hmm ... maybe instead of generating HTML, we could use pandoc externally to generate a PDF of the document, and then 'display' this in a scrolling PDF display window in a publiclab.org research note?

Reply to this comment...

Just wanted to chime in on a few of these topics:

iframes: i normally never use them because they're frowned upon, but in this case it seemed to make sense. i'm not familiar with the issues when using SSL but I can imagine there being some. Though, I wonder whether github pages would work if the iframe is done through ssl also. Not sure, but when you roll out SSL we can see what happens.

md document: with RStudio, I am indeed generating a markdown document as well as an html document (example that has index.Rmd (source code that I write), index.md (which is generated by Rstudio), index.html (which is generated from index.md using Pandoc)).

images: the first iteration of this process involved pulling the markdown directly from github and rendering in the client: see it here which was just a simple test to see if you could do the client-side rendering using the marked.js library (which PL uses for previewing notes and comments). The issue with that was the images. With markdown, the images and figures are referenced to a png file in the repo, so I had to change any img src to use the correct repo. BUT with the html-generated file, the images are actually embedded in the html document in base 64, so there are no external dependencies (I think this may be done by pandoc). Take a look at the source for that file and you'll see the images embedded near the top. So anyway, dealing with images when rendering markdown in the client would be a potentially significant issue (which is why iframe seemed so awesome).
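
For readers unfamiliar with embedded images, here is a minimal Python sketch of the general technique: encoding image bytes as a base64 'data URI' that can be used directly as an img src, so the HTML file has no external image dependencies. This is not a description of what any particular tool does internally; the function name is hypothetical.

```python
import base64
import mimetypes


def to_data_uri(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URI usable in an <img src="..."> attribute."""
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"  # fallback when the type is unknown
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Typical use: read a figure from disk and inline it into the generated HTML.
# with open("figure.png", "rb") as f:
#     src = to_data_uri(f.read(), "figure.png")
```

The trade-off is file size: base64 encoding makes the image data roughly a third larger than the original bytes.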

Mathjax: yes, please. i think it's universal enough to include in the PL templates. if you were concerned about avoiding unnecessary scripts (e.g. a research note that does not have equations should not require mathjax to be loaded) then i was thinking maybe you could have a checkbox for a note to specify that mathjax be included, and then include it when rendering the template on the server if necessary. would add another field to your data model, but would allow people to only include mathjax if necessary.

PDF: that's possible, but I don't really like that idea (sorry don).

And if you didn't see it, I also tried embedding an interactive shiny app using an iframe: http://publiclab.org/notes/walkerjeffd/08-11-2014/embedded-shiny-app . That is just a demo without any explanation of what it is, but I think it really opens the door to a lot of interesting types of research notes. I made that app just for me and Don to better understand the relationship between a 555 timer frequency and the conductivity of water (and then do some curve fitting to calibrate). but i think being able to include interactive visualizations and apps would be a nice addition to the PL ecosystem.

So it seems to me that there are two main questions: 1) are there any security vulnerabilities in the PL system by allowing authors to embed external pages using an iframe, 2) would iframe work with SSL.

If the answers are 1) no, 2) yes then this seems like a pretty good approach to me. Trying to scrape and render markdown from github has all kinds of complications (images, js dependencies, etc.), so this seems much simpler and just works.

Reply to this comment...

@walkerjeffd,

Another issue with iframes is spam detection.

Right now there is a small cadre of folks who get every. single. research. note. ever. Including myself. It comes into the inbox and we can check it out, be all like "yeah, looks good" or "no, we don't need you selling watches on our site."

iframes would obfuscate that such that spam moderators would need to visit each and every research note to check on the iframe contents, as emailing a full render (including iframe contents) is not really going to happen.

mathjax: submit an issue ;) I think there is already an issue for LaTeX math somewhere, but I believe it got relegated to "so many choices to render, deal with it later" or something? That might be my imagination.

I think @donblair brings up an interesting point about open data and open research. Rather than importing content into Public Lab, I think it would make far more sense to export data out of publiclab. The public lab research note interface is a tool for generating notes in the first place. Once the note is created, in theory, there's no reason it ought to be hosted on publiclab or anywhere else.

If we want to discuss a federated content delivery network and protocols for doing so, I think that would be a fantastically fun thing to do. If we came up with a portable research note that was something like OpenID, then PublicLab could produce that content (using its tools) and potentially (with a whitelist or something) consume such content for display. What Don is talking about is separating tools, content, and publication. Larger questions are how to make portable documents maintain authorship and so much more. Huge huge problem space. You can look up FreeNet, Diaspora, Mozilla OBI, OpenID, OAuth, and some other federated protocols to start seeing different ways bits and pieces of the problem have been addressed.

Reply to this comment...

iframe does work with SSL so long as both the site and the iframe site have proper SSL certs. iframes have had a number of security vulnerabilities in the past. For now, I'm not aware of any. Web browsers are getting less and less tolerant of cross-site content. XSS (cross-site scripting) is a common vulnerability that browsers are learning to avoid. In some innocent cases, the browser chooses to block sites as "suspicious". I suspect cross-site content is not far away from getting the same treatment. Linked content isn't necessarily just DOM content; JS can be tossed in anywhere. Once javascript appears in a linked iframe, the web browser developers are in a heaping world of worms from a can. In these cases, the browser developers try to support conservative, best-safety measures.

Ultimately iframes are a hack to support what you want. They don't address the issue that is being raised, which is a trusted content distribution platform.

Reply to this comment...

@btbonval,

mathjax: yup, we filed an issue on that already. Don called it 'latex equations' but we meant mathjax support.

iframes: yeah i certainly recognize all the cross-site vulnerabilities and issues. i tend to look on that as being the browser's responsibility, but we'll see how things develop.

spam: i can see that as being an issue. but it also seems like that approach is not too scalable. if PL grew to the point of having hundreds of notes per day (i'm not sure how many you get now), then that would get burdensome. maybe you could add a "Flag as inappropriate/spam" button to notes to alert the admins of potential spam? and let the community take on more of a moderator role?

CDN/open research: i think we have a differing view here. i would much rather push notes into public lab, than write them in public lab to pull out (hence the whole rmarkdown-based workflow where i effectively write the note in RStudio and then bring the resulting html into PL's system). for some things that do not involve coding, writing a note in PL makes sense. But as soon as you're running code to generate figures, having to copy and paste into the PL editor is a rather significant burden (especially if I want to update something, which I would probably do about 10-20 times for each analysis). Some of my recent analyses that have not been posted yet on PL are rather lengthy and having to copy and paste the code and add the figures would be somewhat of a deterrent. I see the value of the PL system as being in its community, and I want to be able to share my analyses with the community. But for notes involving code, I pretty much have to generate that code outside of PL and then push it in somehow. Copy and paste is basically not an option for what I'm doing, but there are certainly other ways we can come up with that don't involve iframes.

One thought I had is if PL had a REST API for the notes. So I could create/update a note from outside PL. Then I could use a git hook to automatically update the markdown for a specific note every time I commit. That would solve the server-load issue of updating the content (only when I do so via the API instead of on every read). But there would still be an issue with external images and how to deal with that.
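
A git hook along these lines might look like the following Python sketch. Everything about the API here is an assumption: the endpoint URL, the JSON payload shape, and the bearer-token authentication are hypothetical, since no such PL API is described in this thread. The sketch only builds the HTTP request; a hook script would send it with `urllib.request.urlopen`.

```python
import json
from urllib.request import Request

# Hypothetical endpoint: not a documented Public Lab API.
API_BASE = "https://publiclab.org/api/notes"


def build_update_request(note_id: int, markdown: str, token: str) -> Request:
    """Build (but do not send) a PUT request that replaces a note's markdown body."""
    payload = json.dumps({"body": markdown}).encode("utf-8")
    return Request(
        url=f"{API_BASE}/{note_id}",
        data=payload,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


# In a post-commit hook, one might read index.md and then call:
# urllib.request.urlopen(build_update_request(11035, text, token))
```

This covers the markdown text only; images referenced by relative paths would still need separate handling.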

Reply to this comment...

@walkerjeffd

So what can convince you not to use iframes? Just curious. It seems like none of the general practices or advice are good enough?

My point about browsers was exactly that it is their responsibility, and outside of our control. Well-meaning sites and things get broken when the browsers make decisions like "start blocking all this content." Have you ever tried linking an HTTP image from an HTTPS site? It's well meaning, but the browser says no, and you see no image. I see iframes falling into that category faster and faster. You think it works, and it does, then one day, it does not. It isn't a long term approach.
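
The HTTP-image-on-an-HTTPS-page case can be checked for mechanically. Here is a small Python sketch, using only the standard library, that lists img, iframe, and script tags whose src uses plain http://, which is the kind of reference a browser may block on an HTTPS page. The class and function names are hypothetical, and real browser rules are more detailed than this simple check.

```python
from html.parser import HTMLParser


class MixedContentFinder(HTMLParser):
    """Collects (tag, url) pairs for embedded resources loaded over plain HTTP."""

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        if tag in ("img", "iframe", "script"):
            for name, value in attrs:
                if name == "src" and value and value.lower().startswith("http://"):
                    self.insecure.append((tag, value))


def find_mixed_content(html: str):
    """Return the insecure embedded resources found in an HTML string."""
    finder = MixedContentFinder()
    finder.feed(html)
    return finder.insecure
```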

You missed the separation between content, publication, and tools. Right now, PublicLab is all three. People use PL tools to create PL content and publish it on PL. What you want is to use other tools and other content published on PL. What we need to do is split these interests up. The first and easiest task would be using PL tools to create PL content, and making that PL content extremely portable. Importing it is a can of worms, as this entire thread has noted. If PL is to become a content publisher for content that is not PL hosted: can of worms. The answer is not iframe. The answer is an appropriate federated content publication network, of which public lab becomes one such node, and through which you may publish your non-PL content. Look at the federated services I noted and read them over, please. There is a lot of good stuff in there. What you want is not what Public Lab currently offers, and in fact, NO web service currently offers. We'd have to create something completely new.

Reply to this comment...

What you want to accomplish would be like making Public Lab the AppStore or Kindle eBook Store for research notes, but without the pay. It's a distribution channel, and as such, it would need a ton of new considerations which do not presently exist. It isn't an easy problem.

iframe sounds like an easy solution, until you consider the analogy of AppStore or Kindle eBook Store. They could never merely link through to the content from authors who publish. Both are heavily curated, come with all kinds of legal concerns, etc. You are proposing PL become the AppStore of open science, a publication medium. That's very different from what it is now, and will take a lot lot lot of consideration. Not something that can be done between three people shooting the stuff in a research note comment block.

Reply to this comment...

Additionally, research notes do have spam buttons. You might need to be a moderator on the site to see them, though. We have a ticket about spam voting somewhere.

The few of us act as a first line of defense because sometimes so much spam shows up from a single user that it takes over the "recent notes" section, etc. It would be daunting to users unfamiliar with our site (we have new users every day) or to some of the less technically advanced users (of which we have many). We needed a stop-gap to prevent that from happening. I believe we get emailed 30 minutes or something before the notes are released on PL proper so that we don't run into those kinds of situations. It works quite well. We might miss some things in the long run, but that's where the button comes in.

As an aside, Jeff is the only developer, outside the GSoC interns and a nice fellow from New York whom none of us have ever met. Jeff is, of course, doing all kinds of other things for PL as well. There are a lot of ideas for PL software and websites but little power to get them done. PL needs to be conservative about the features it chooses to embrace, and those which will end up resulting in more work in the future than it could handle. Yes, we could do iframes. Jeff might even add it just to end this conversation. But in the long run, it creates so many more problems than it solves. And then the small team has to deal with those as well.

Reply to this comment...

interesting

Reply to this comment...

Wow, thriving conversation here! Looking forward to trying to thrash this through ... maybe we could even do a Hangout to chat about the various approaches ...

You missed the separation between content, publication, and tools. Right now, PublicLab is all three. People use PL tools to create PL content and publish it on PL. What you want is to use other tools and other content published on PL.

@btbonval:

I do see your point here: it's much simpler to develop and manage a 'closed ecosystem' -- using the PL editor to create content that is hosted on PL. I see how this makes dealing with security issues, spam, formatting issues, and legal issues re: content ownership much easier. And I see how opening up this ecosystem to allow e.g. content generated using external tools to be published on PL does start to create problems.

(Aside: I love the idea of the federated content network you're describing, and I'd love to push towards that vision, going forward.)

For now, though, I guess the issue we were running into was that certain types of research note content seemed easier to produce, edit, and manage using tools that are external to Public Lab -- in this case, research notes that had reached a certain level of complexity.

I see this proposed 'embedded content' idea we're floating as quite similar to (and perhaps, in its use of 'iframes', technically identical to?) the standard practice of embedding a Youtube video in a research note. Certainly, embedding external content in this manner has its risks (as you point out, if Youtube goes away, or the URL changes, then the content is no longer available on publiclab.org; embedded video could easily be a source of very-hard-to-screen spam content injection; there are licensing issues) -- but those risks seem so far to have been outweighed by the benefit of being able to easily incorporate video on publiclab.org.

For the time being, does embedding HTML (using 'iframe') or sourcing markdown from github in research notes seem much riskier / more complex to you than this practice of embedding Youtube videos in research notes (using 'iframe')? It seems to me that if it's useful for now, and if it requires no extra effort on the part of the development team (for now -- though I guess spam could change that), and if the risks are quite low, then there's not much harm in trying it, for now?

If, however, the embedded / import solutions do seem too risky / onerous, then I guess we could always simply just take the route of writing a brief research note 'wrapper' around the 'longer article' that we're hosting on github, and simply linking to this external article in the research note ...

Reply to this comment...

@donblair Thanks for understanding. I was getting frustrated that my words had no meaning and I could not think how else to elucidate them.

I see what you are saying about embedding YouTube, but I do not believe it is proper to compare that to embedding HTML. For one, consider the ontological consequences. HTML can implement HTML recursively to an infinite degree and HTML can implement YouTube or PornHub equally. Contrast with YouTube media link, which cannot embed anything additional except for the video and player in question. This is not meant to be a practical example of a problem, but express how these two elements are radically different and thus come with completely separate considerations. Apples and oranges is what the humans would say.

I think iframes are extremely risky at a high level. At a low level, it accomplishes what you want. But shooting dictators would appear to do what you want, until you learn about blowback. Nothing is as simple as it first appears.

With respect to a wrapper, this is getting back to the idea of a portable research note format. Imagine if the wrapper was just a document format. A portable research note format, once designed and constrained properly, could certainly be a great approach to importing, linking, or exporting notes into/out of Public Lab. Of course, the into Public Lab piece is still fraught with peril.

Here's the reason I suggest federation: maybe public lab isn't actually the best place to host, find, and read research notes. Maybe it's a great place to create research notes for the average person, and publish them for the average person. But you guys, with your whats-its and do-hickeys, might want to come up with a new way to display the research notes. By separating content, tools, and publishers, we allow other experts to exceed our capabilities with e.g. publication while allowing the open standard format (hypothetical research note wrapper as a document format) to move the content freely between such publishers as Public Lab or Don-and-Jeff-Bonanza.

Reply to this comment...

If there were interest in generalizing the research note mechanism a bit, here is an interesting idea that popped into my head as I was reading this:

Currently a research note is simply a Markdown file and some attachments with editing taking place through the web interface. I envision the backend being reworked to keep each research note in a standalone version control (e.g. git) repository. The primary text would be in a note.mkd file with other attachments in a directory. The web frontend would not change although edits would produce commits in the backend.

However, in addition to the web frontend one could also clone the repository (e.g. git@publiclab.org:notes/donblair/research-note-workflows) for local editing. This would allow "research notes" to become closer to "research units", allowing one to include scripts, small data, notes, and results when appropriate in the repository for others' reference. You might even include a JSON file containing the note's metadata. You'd probably want some push hooks to ensure that the integrity of the note (e.g. the existence and well-formedness of the Markdown) is preserved in pushed commits.
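
As a sketch of what such a push hook might check, here is a small Python function that takes a mapping of file names to contents (which a server-side hook would assemble from the pushed commit) and reports integrity problems. The `note.mkd` name comes from the proposal above; the `metadata.json` name and the specific checks are assumptions made for this example.

```python
import json
from typing import Dict, List


def check_note_files(files: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the push looks acceptable."""
    problems = []

    # The primary text must exist and must not be blank.
    if "note.mkd" not in files:
        problems.append("note.mkd is missing")
    elif not files["note.mkd"].strip():
        problems.append("note.mkd is empty")

    # Metadata is optional here, but must parse if present (hypothetical filename).
    if "metadata.json" in files:
        try:
            json.loads(files["metadata.json"])
        except json.JSONDecodeError as err:
            problems.append(f"metadata.json is not valid JSON: {err.msg}")

    return problems
```

A hook would reject the push whenever the returned list is non-empty, printing the problems back to the author.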

While this proposal doesn't really address the problems of melding code and other associated content into notes, it might bring us closer to being able to view research notes as packages of content instead of just text (which is where I think the current research publishing system falls short). Also, the proposal admittedly comes a bit close to turning PL into a Github-like service, complete with the high storage requirements and fairly non-trivial backend logic that this entails, as @btbonval has mentioned.

That being said, I like the idea of building on top of existing tools when possible and it does seem like research notes deserve real version control.

Reply to this comment...

@bgamari Generally speaking, we actually do track edits on research notes, but it doesn't use diff-match-patch style updates. It maintains complete copies. Using something a bit more streamlined (diff tracking, which is what git and code repos do) would be a huge improvement to the internal data store.
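
To illustrate the difference between keeping complete copies and keeping only the changes, here is a minimal Python sketch using the standard library's difflib. It is not how Public Lab or git actually stores revisions (git's own storage is more involved); it only shows that a small edit yields a small diff, whereas a full copy grows with the length of the note. The function name and revision labels are hypothetical.

```python
import difflib


def revision_diff(old: str, new: str) -> str:
    """Return a unified diff describing how to get from old to new."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile="revision-1",
            tofile="revision-2",
        )
    )


old = "Intro\nWe deployed the sensor.\nResults follow.\n"
new = "Intro\nWe deployed the sensor from a kayak.\nResults follow.\n"
print(revision_diff(old, new))
```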

I actually like the idea of using git as a publication format, except for one huge caveat. Images. git and code repo engines are notoriously bad at dealing with images. Additionally, the export format wouldn't be as simple as exporting markdown diff trees (e.g. a git repo) due to the images.

However, people do indeed cram images into git repos (failing better options), and there would be nothing stopping us from doing the same. I also like this idea a lot because of the import consequences. In effect, research note pages would work very similar to GitHub Pages. You could push or pull.

To do this well, I'd recommend running git or mercurial repos on top of GridFS on top of MongoDB. The website would then extract images and markdown content directly from its local repo copy, which advanced users could update via git push (and only once keys are shared etc).

That last paragraph notes "advanced users." So here we're talking about using a code repository as the medium for publication, in effect. But what we fail to address is the tool to create the content. We'd have to come up with a way for the current editor to do all this code repo magic in the backend so that less advanced users can still participate.

Reply to this comment...

Just another advisory though: Public Lab has an extremely small developer team considering the number of users and the desires of the userbase. There are exactly zero full time developers, although Jeff comes close. ;)

Reply to this comment...

There might be an argument for some file-based format which may be easier to generate, instead of relying solely on git. Public Lab's current Rails code would require a major overhaul, assuming Rails even has the right plugins (gems) to ease the process. Node.js might make rendering these things from a git repo simple as dirt, but we need some way to export the notes out of the current Public Lab database/website. This is where a simple document format might shine.

Reply to this comment...

@bgamari

I like the idea of git as a publication tool.

BUT if the backend is going to be wholly converted to git, why not just use GitHub as the backend? A site that already does something like this (uses GitHub as a backend) is http://bl.ocks.org/.

Obviously performance will be much better if PL is hosting the backend ourselves - but it seems to me that starting by using GitHub as a backend would be good for a proof-of-concept.

Reply to this comment...

"Open science document" is a thread on the Public Lab mailing list. I think the thread is highly relevant to this discussion given the nature of import/export of open data. Check out that link for more on the conceptual discussions of curating content. https://groups.google.com/forum/#!topic/publiclaboratory/_v4Vcw97ZME

Reply to this comment...

writing a brief research note 'wrapper' around the 'longer article' that we're hosting on github, and simply linking to this external article in the research note ...

Yea. Do that.

Your Kayak note was one of the more interesting and appropriate notes ever at Public Lab. But the three people who care about all that R code are three of the people who are quite capable of following a link to Github to find the code and other tedious details. Don and Jeff have some important ideas about what publishing at Public Lab could become. But today Public Lab research notes are mostly a few paragraphs and pictures that explain research progress. The explanation part is crucial. A couple of paragraphs that allow more people to understand what you did, what is novel, why it matters, and how they can participate could add great value to the Kayak note. The code blocks really got in the way of my understanding what your point was. But the experiment was totally worth it to see Bryan get all riled up.

So in addition to the stuff you guys are talking about, the other issue is: if you could have the perfect Public Lab publishing system that did all the gitty backend stuff, would Public Lab research notes be any more useful to the current audience?

On the other hand, the ability to put Shiny apps in research notes will be a huge crowd pleaser. I’m all in for Shiny.

Reply to this comment...

Lots of interesting stuff here... A couple thoughts:

iframes aren't searchable and do not archive the content in the PL database, and if the host site disappeared...
I think we may actually have a REST API, since Rails often autogenerates one. If we don't, it's only a few lines of code.

I'm not sure what the size of the audience is who are going to want to use git to write research notes, but if we have a REST way to post & update, people could pretty easily code up scripts to integrate with a git repo.
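For anyone curious what such a script might look like, here is a minimal sketch in Python. The endpoint URL and the JSON field names (`title`, `body`, `tags`) are assumptions made up for illustration; nothing in this thread confirms what a Public Lab API would actually accept. The request is only built, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint: a placeholder, not a real Public Lab URL.
API_URL = "https://example.org/api/notes"


def build_note_request(title, markdown_body, tags, api_url=API_URL):
    """Build (but do not send) a POST request carrying a note as JSON."""
    # Field names are assumed for illustration only.
    payload = {"title": title, "body": markdown_body, "tags": list(tags)}
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_note_request("Kayak deployment", "# Results\n...", ["riffle"])
    print(req.get_method(), req.full_url)
    # Actually sending it would be urllib.request.urlopen(req), which is
    # left out here because the endpoint is invented.
```

A git hook or a small command-line wrapper could call something like this with the contents of a markdown file from the repo whenever a note is updated.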

There are some great longer research notes out there of course, but one reason I like shorter ones especially is that they encourage people to break up their work into more digestible pieces, which makes for more regular sharing of work instead of saving up work for days or weeks before posting. That in turn creates more openings for others to get involved, critique, suggest changes or improvements.

Reply to this comment...

Wow, ok didn't realize this would ruffle so many feathers. I'll try to briefly explain more why I wanted to do this.

reproducibility - for the riffle, we wanted to take an open science approach where all the data, code, software, hardware, everything, is available. using github seemed to make the most sense (though we need a different way to store the raw data eventually, and probably figures too). we thought this would be useful for helping others learn how to program the riffle and analyze the data. but this doesn't require using iframes, we can just link to the repo from the note. but demonstrating how to integrate an rmarkdown or an ipython notebook into a research note is something we thought others could benefit from.
collaboration - using github as the central backend allows Don and I to work on these analyses collaboratively (we haven't yet, but that's the idea). I currently can't edit any of Don's notes. maybe that's a feature you're working on though.
ownership - this is maybe a sticky issue, but here's my perspective. i'm starting to put a lot of work into the development of the riffle, and am going to be generating a lot of these more technical analyses. if i only wrote these up on PL, then they'll only exist in the PL system (as far as i know, there's no way to download an archive of all my PL content). but if PL loses funding or for some reason shuts down, then I'll lose all that work. Thinking long term, will PL be around in 10, 20, 30 years? In other words, I want to keep my own local copies of these analyses which I could maybe do by copying and pasting the note source code out of the PL editor, but that's not a robust solution.

@cfastie - good points. When I wrote the original document I wasn't thinking about putting it on public lab so I left the code in there and didn't write it in a public lab style with as much background as I should. But I just updated that kayak note so the code is no longer shown and added a note at the top that links to the source code for those who are interested. I also added a little introduction in the public lab note (outside the rmarkdown file). I will say that I think there is some value to the code being there. I've had a couple of people tell me they like seeing the code because it made them realize that maybe R isn't so hard and they can use this code to learn how to do things like create maps, etc. My actual motive with this analysis was to get @donblair to start learning R, which is best done by example where he can read the code. And that seems to have worked. But I think linking to the source code is probably good enough. And I'm not sure what you're asking about regarding the usefulness of research notes to the current audience. Which audience? Like I said before, the whole reason to try and get these analyses into PL is to engage with the community (more on this below).

@warren - yeah, I realized the issue with searchability, but I thought maybe by adding an 'abstract' to the top, using tags and having the title would be good enough for supporting searching. REST API seems like the most logical solution here to me (though would take more work on both your end and my end). The issue is figures, but maybe I could use S3 or something to do that before uploading a note via the API. As for long vs short notes, I see your point for sure. But for developing the riffle, there will be quite a few long notes like the kayak deployment. There's just too much interesting stuff to be done with all the data we collect, and maybe it doesn't all need to go on PL, but Don and I want to put all of our work out in the open (successes and failures). But maybe PL just isn't the right place for all that (more on this below).

@btbonval "Yes, we could do iframes. Jeff might even add it just to end this conversation. But in the long run, it creates so many more problems than it solves. And then the small team has to deal with those as well." - well you already support iframes as we've demonstrated with zero work on your part. so it's more a question of whether you'll prevent iframes. in looking through blog posts and stack overflow Q/As on the evilness of iframes, it seems there's just a lot of differing opinion. but i get the impression they're generally accepted for certain uses along the lines of what we've done here (embedding external documents, maybe the shiny app was a stretch). but in any case i think we can argue all day about iframes, but I don't think that's the central issue here.

Summary

Don and I are realizing that maybe what we're envisioning here for the riffle is not really in line with PL's current vision or community. We're going to be generating a lot of these more long-form analyses and maybe PL is not the right place for those. As a scientist/engineer, I've been really excited about using PL as a place where I can publish research through a more accessible platform and for a more general audience than I normally write for. That makes this a really fun project for me, and something I'm excited about. But also as a scientist/engineer, reproducibility is really important to me, and things like Rmarkdown and IPython notebook and other literate programming frameworks as well as version control systems like git/github are extremely powerful in this regard. I think they really have the potential to revolutionize how science is done in the modern age, and it's something I'm trying to pursue using Riffle as a way of demonstrating this.

So Don and I thought a lot about how we would use PL for this project, and what we came up with is in conflict with the purpose and vision of PL (and it's certainly not our place to try and change that). The iframe hack seemed like a viable way to combine our workflow with the PL ecosystem, but maybe that content doesn't belong here in the first place.

Moving forward, I think what we're going to do is keep most of the riffle development out of the PL system, and build our own github-based system for achieving what we want to do in terms of publishing reproducible analyses and research notes. We'll then write short research notes on PL that link to this external system, and maybe keep an archive of everything on a PL wiki. So effectively we'll use PL as a newsletter to alert the community about new analyses and developments (maybe like a weekly digest), with most of the technical details kept in github. My original impression was that PL was intended more as a place to post your experiments and results (basically like a lab notebook), but I'm now getting the impression that maybe our work is going to be too technical and the workflow we need to produce this research won't fit in the PL ecosystem. It's simply too burdensome to generate all these analyses with the online markdown editor. And we realize that to build in the features needed to do this new kind of science would put far too much strain on the PL developers and just opens up too many cans of worms. Iframes seemed like a viable way to do this without requiring anything from the PL developers, so we went with it. But if these types of analyses don't really belong here anyway, then we can move them offsite into our own system.

Reply to this comment...

There's a lot to talk about here, but first, just to clarify - iframes are widely accepted as a way to embed a specific piece of media or content -- like a video, an interactive data display, or even a Github code snippet (gist). I think the reluctance we have is towards using iframes to post entire texts, where the research note wouldn't actually have any content in it besides an iframe. Also, a REST API would make it pretty straightforward to download all your research notes in JSON, XML, or whatever format. That seems like a reasonable thing to add.
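As a rough illustration of the "download all your research notes" idea, the sketch below takes a JSON export and writes one markdown file per note, which would give authors the local archive asked for earlier in the thread. The export shape (a list of objects with `title` and `body`) is an assumption; no such API response is defined anywhere in this discussion.

```python
import json
import re
import tempfile
from pathlib import Path


def slugify(title):
    """Turn a note title into a safe file name stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def save_notes(notes_json, out_dir):
    """Write each note in a (hypothetical) JSON export to its own .md file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for note in json.loads(notes_json):
        path = out_dir / f"{slugify(note['title'])}.md"
        path.write_text(f"# {note['title']}\n\n{note['body']}\n", encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    sample = json.dumps([{"title": "Kayak deployment on 8/7/2014", "body": "Summary text."}])
    with tempfile.TemporaryDirectory() as tmp:
        for p in save_notes(sample, tmp):
            print(p.name)
```

The resulting folder could be committed to a git repo, so a personal archive would not depend on the site staying online.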

Reply to this comment...

@warren yup, totally get that. REST API makes the most sense to me as well (for putting notes in and getting them out), and I'd totally be open to using that. in the interim though, i've got about 5 more riffle analyses at the moment (and will probably be generating 2-3 per week if not more), so I just don't know how long it would take to implement the API (weeks, months?). and so I'm not sure what I can do in the meantime to get those into the PL system besides iframes or copy and paste (which is not something I really have time to do and goes against this whole reproducibility principle).

Reply to this comment...

Wohooo! Okay, so I think this is one of those examples of an online conversation that is really rich in content, and has generated a lot of great ideas to work through ... and which is best continued via a 'live' conversation (Skype, Hangout, in-person) to brainstorm further.

Maybe we could reserve an 'open hour' to talk about some of these ideas? It could range from the philosophical / long-term "changing the way that scientific publishing is done" (whee!), to the nitty-gritty, short-term "how best to do highly technical collaborative development through publiclab.org".

And just to quickly elaborate on @walkerjeffd 's last (really thoughtfully written) comment -- what he and I started out discussing over the last few days is how best to publish the particularly technical development work we've been doing on publiclab.org -- really, just optimizing the workflow for pushing that development out to the Public Lab community. (Sorry -- we then also got giddy about changing the way that scientific research is communicated, in general -- but that's really a larger, longer-term conversation to have over beers, probably).

And it sounds to me that we're starting to hone in (through the course of these comments, with great input from everyone) on a nice approach, here -- as @cfastie had suggested, for some of the more long-form notes or 'reports' -- especially ones that include working R code examples, special formatting, and etc -- the best workflow might be to simply post a summary explanation of this document as a research note on publiclab.org, with a link to the document on github.

It occurs to me now: this is really just the same arrangement as what's already being implemented in any of the highly technical projects in the Public Lab community: for spectralworkbench.org, mapknitter.org, or infragram.org, there's an associated github repo that tracks changes, and allows for project management (via issues). Git is a system that has emerged as an incredibly useful tool for these things, and Public Lab is using github.com because it would be silly to try to recreate all of that functionality separately. So what's happening with the water quality-related development for us right now, I guess, is that we're realizing that e.g. analyzing hydrology data in a consistent, reproducible, scalable way is going to require a similar level of technical infrastructure as e.g. analyzing spectra, knitting and geo-referencing aerial imagery, or manipulating near-infrared imagery on a pixel-by-pixel level. It's going to be hard to fit some of that into a 'research note' format -- so it might be better to do it on github. Just as we wouldn't want to cram all infragram.org new code commits, feature requests, and development into research notes: it's not something that a general reader of publiclab.org is going to want to wade through.

One idea that just came up, which I really like: a weekly or bi-weekly summary overview of Riffle development (and Open Water - related development, generally) -- a few paragraphs, with pictures, which describes the most recent developments, provides context, and has links to all of the relevant development and technical documents that have been posted on github. (This would actually be a neat thing to do for infragram.org, spectralworkbench.org -- and really all of the Public Lab projects. And it's already being done nicely, in some form / in the newsletter. We could just sort of model that, more or less). That way, someone can read that research note, dig into the relevant documents as they like, and develop the conversation about those documents on publiclab.org ... ?

Anyhoo, more fodder for an in-person convo ... let's schedule one!

Reply to this comment...

Oh, quickly: love the REST API idea! Maybe for these really long technical documents that we host on github, we could add an "abstract.md" or "researchnote.md" file to the repo, which would have a standard format, and would be written as a nice, concise, accessible summary of the work. As long as we're consistent with that format and approach going forward, we can take our time to think about building out the functionality to "push" that markdown file as a research note to Public Lab. In the meantime, we can just enter the markdown 'manually' on publiclab.org, on whatever timeframe works best (maybe as a weekly or bi-weekly 'open water / riffle dev update' ritual) ...
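To make the "standard format" idea concrete, here is one possible shape for a `researchnote.md` file and a small parser for it. The layout (a few `Key: value` lines, a blank line, then the markdown summary) is purely an assumption for illustration; the thread does not settle on any format.

```python
def parse_research_note(text):
    """Split a summary file into metadata and markdown body.

    Assumed (hypothetical) layout: 'Key: value' header lines,
    one blank line, then the markdown body.
    """
    header, _, body = text.partition("\n\n")
    meta = {}
    for line in header.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore header lines without a colon
            meta[key.strip().lower()] = value.strip()
    tags = [t.strip() for t in meta.get("tags", "").split(",") if t.strip()]
    return {"title": meta.get("title", ""), "tags": tags, "body": body.strip()}


if __name__ == "__main__":
    example = (
        "Title: Kayak deployment summary\n"
        "Tags: riffle, water-quality\n"
        "\n"
        "We towed a Riffle behind a kayak and compared it to the USGS gage.\n"
    )
    print(parse_research_note(example))
```

Keeping the header this simple means the same file can be pasted by hand into the editor today and read by a "push" script later.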

Reply to this comment...

@WalkerJeffD On a technical note, co-editing of research notes is in-development. Co-authoring was just implemented with a with:@name tag, and I think co-editing will be implemented Very Soon Now.

There are a few ways to think about "open." In your process, you seem to be focusing most heavily on reducing the interface friction between an expert user and viewing/versioning/editing a complete record of all the collection/analysis systems' code for a specific experiment. "Openness" is defined by the speed of replication. This is a laudable goal, one that needs more work and could truly change the way science is done. But I think this approach will be more robust if you maintain an engagement with a broader participant pool.

Public Lab has generally taken the position that broad public participation at every stage of the process is "open science" and open-source/accessible/well-versioned code is an instrumental piece of that openness. The reasoning being that those who do data analysis are not necessarily domain experts on the environment they are examining, and that the assessments of a place done by their current residents, regardless of scientific training, are central to an accurate understanding of environmental science.

For me, I want to see Riffle data as something portable that can be added to Mapknitter maps as annotations easily, or even embedded and fed in live. Sensor data next to qualitative markers, georeferenced media, and other place-based information is very powerful, especially in a platform that is broadly accessible. @justinmanley might have some thoughts on that...

Reply to this comment...

Yes - I think it'd be ideal if writing up your work weren't a burdensome obligation, but if we thought about posting in an interactive, open forum like the PL website as an opportunity to get input and critique, suggestions and requests for clarification on our work. To re-emphasize why I like shorter, regular posts, it's not just to make the information 'more digestible', or to disseminate it "to the public" -- but also to invite others to respond, add value, challenge your assumptions, and to change your beliefs and approach through discussion. It's both give and get, to riff on Mathew's great point about domain expertise. If publishing is limited to only those folks who know how to use Github, aren't we missing out on key insights from a broad swath of potential collaborators?

Reply to this comment...

Very well put, @mathew and @warren! It's so important that we make sure, when doing research that in some cases will involve a very technical and involved procedure, not to fall into a "let me teach y'all how to do science!" approach, but rather that we work hard to develop modes of doing science that are fully collaborative and participatory.

I'd also like to put a 'context wrapper' on the ideas that we were proposing in the above note -- I know it's onerous at this point (the discussion here has really blossomed!) -- but I think it's important to read the entire post, watch the video, and read the entire comment thread above in order to understand the context for the ideas that Jeff Walker and I were floating here.

But, as a quick short cut, here's what I think is a useful analogy: the kind of data analysis workflow that Jeff Walker is experimenting with right now in analyzing water quality data is, perhaps, as technically involved (and yet amenable to automation) as the workflow that is required to e.g. analyze Public Lab's spectral data or infragram data -- if you include the software development around spectralworkbench.org and infragram.org as part of that workflow. Developing, tweaking, version controlling, and commenting on this water quality data analysis workflow (which involves using R, a software system with several powerful, but not necessarily user-friendly software packages) seems like a great fit for github, in the same way that e.g. spectralworkbench.org development is best done there. In particular, JDW and I have envisioned setting up an "executable document" of the sort that RStudio can generate -- a combination of problem statement, data, and executable analysis code -- so that not only is the entire analysis in one place, but folks who want to repeat the analysis can simply 'fork' this document, plug in their own dataset, and repeat the same analysis themselves.

Accordingly,

If publishing is limited to only those folks who know how to use Github, aren't we missing out on key insights from a broad swath of potential collaborators?

Yes -- great point -- this is a huge problem! And I'd say that this is also true for e.g. spectralworkbench.org development, in much the same way. Collaborating with the publiclab.org community on spectralworkbench.org development right now is heavily biased against people who don't know how to use github; by choosing to develop this tool on the github platform, we're alienating a lot of people who are highly motivated to contribute to the project -- ranging from people who are simply concerned about their tap water, to a lot of talented coders and chemists who possess complementary expertise.

What we're trying to figure out here is: how can we aim for the kind of collaborative, fully participatory, peer-to-peer science and technology development we want -- which sometimes is as technically demanding as the coding required for spectralworkbench.org -- and make collaboration and participation as accessible as possible for as many people as possible? What I think JDW has come up with above, in the context of analyzing water quality data, is sort of the equivalent, for spectralworkbench, of a workflow that would regularly post a publiclab.org research summary note about the latest spectralworkbench update ... and the research note would also link to an in-depth, self-contained document, on github, that leads any developer through the process of setting up, running, and adding features to spectralworkbench.org on their home computer.

Hope the above makes sense -- I may have just passed the safe caffeine threshold :)

(WOW, what an awesome discussion! Thank you, Public Lab-bers!)

Reply to this comment...

P.s.

For me, I want to see Riffle data as something portable that can be added to Mapknitter maps as annotations easily, or even embedded and fed in live. Sensor data next to qualitative markers, georeferenced media, and other place-based information is very powerful, especially in a platform that is broadly accessible. @justinmanley might have some thoughts on that..

Just wanted to add a +100 to @mathew 's idea here!

Oh goodness -- this comment thread -- someone stop me .... Bryan, can you shut me off so I'm blocked from commenting on this thread?

Reply to this comment...

@donblair the logout button helps add a few more steps in the way of contributing to the conversation.

I think the concept of a spoiler might suit some of the needs being explored here. Spoilers, by default, are collapsed with a simple indicator that there is "more than you might want to know" if you click on it.

The question of how to embed one document inside another is a bit harder if all documents are living. For example, tools were referenced like spectralworkbench. At present, the output is primarily an image. So you drag and drop that image, done. Same with infragram. But let's go back to this idea of research notes which embed research notes.

Perhaps you write research notes with all the crazy details. In the summary research note that @warren prefers, you could hypertext link the other notes and that solves the problem today. Or maybe you use clever tagging, and the user who wishes to explore the topic further finds everything needed by clicking on a tag for the note.

But maybe tomorrow, we want to explore embedding the detailed research notes inside the summaries, and hiding the content using a spoiler. This is a very complex operation, because both the summary and the detailed note can be independently modified, yet the summary includes the detailed note. So what happens if a user comments on the detailed note? Is that conveyed in the summary note in any way, or does one have to explore the detailed note more than just by way of expanding a spoiler? If someone likes the detailed note, does that mean it should apply to the summary note by proxy? What if the embedded detailed note is removed from the summary? I find myself very perplexed by a lot of these implications when considering how users and the community interact with notes.

What features are you folks looking for that hyperlinks don't provide?

One possibility is the hover-preview that you see with a lot of advertisements. You put your cursor over a hyperlink (usually some innocuous word) and get a small panel that opens up and previews the spam link content you might see should you click that link. I've only seen it done in the spam context, but a hover-preview-link might help get closer to what you seek with some small form of embedding via preview while still keeping the logical pieces of research notes separate for community engagement.

Reply to this comment...

Btw it sounds like, for the moment, we have moved past the idea of portable documents which are worked on outside of Public Lab but may be imported into Public Lab.

I think this is a worthy goal, although I don't think "RESTful API!" is all it would take. Research notes include community engagement, it's more than just the markdown (and images). Comments, authorship (including multiple authorship), and the markdown extensions @warren has written must be at least considered when exporting the note.

Once a note is exported, it can be copied and pasted ad nauseam, changed a little here and a little there. Maybe one copy, independently modified, is submitted back to PL, then another copy, also independently modified, is submitted back to PL. By merely allowing export and import, we create conceptual branching and merging, which gets into the hairy territory which is solved by code repos (which @justinmanley recommended). But if a research note is version controlled, then we must ask how does that get imported and exported in a friendly way through a RESTful API? Github does support HTTPS for push/pull, so we might look to them for an example.

Yes, we could hand wave all of these problems. "oh just export strictly the markdown." "oh just replace all the contained markdown on upload." I feel these are hacks which do not sufficiently address the problem created. Maybe I'm incapable of taking baby steps, but I see cans of worms with even these simple hacks which should be dealt with using techniques which more properly address the issues.
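One way to picture an export that carries more than "strictly the markdown" is a record that bundles authorship, comments, and a revision counter along with the body. The sketch below is only an illustration of that idea: the class and field names are invented here, and a bare revision number is a much weaker stand-in for the branching and merging that a real code repo handles.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Comment:
    author: str
    body: str


@dataclass
class ExportedNote:
    """Hypothetical export record; field names are assumptions."""
    title: str
    body: str                  # markdown source of the note
    authors: List[str]         # allows multiple authorship
    revision: int = 1          # bumped on each edit so copies can be compared
    comments: List[Comment] = field(default_factory=list)


def export_note(note):
    """Serialize the whole record, comments included, to JSON."""
    return json.dumps(asdict(note), indent=2)


def is_stale(incoming, current):
    """A copy based on an older revision needs a merge, not a blind overwrite."""
    return incoming.revision < current.revision


if __name__ == "__main__":
    note = ExportedNote("Kayak note", "# Results", ["walkerjeffd", "donblair"],
                        revision=3, comments=[Comment("warren", "Nice!")])
    print(export_note(note))
```

Rejecting stale uploads at least makes the "replace all the markdown on upload" hack detectable, though it does not resolve two independently modified copies.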

Reply to this comment...

It's a good article. I like this approach when the goods are placed beside the videos.

Reply to this comment...

The TL;DR here is I really like the idea of a 'collapse' tag extension to markdown so that code and really tech/replication heavy sections of a note can say "click to expand."

JWD's github notes are stellar-- a lab notebook laid bare and ready to follow step-by-step. too often in reading research, I get to a point of asking 'what exactly did you do?' I have experience with getting that woefully wrong, guessing at the actual procedures. (sometimes to positive effect! replication errors FTW-- ask me about balloon making).

I think the question is how to pack and unpack that info so that it is approachable at different interest levels, while not burdensome on the note taker. I hear a workflow concern of JWD's being, 'do I have to re-write everything? Can't we automate this?' Reposting and re-editing one's own notes is a drag. we should be able to figure out how to avoid this. JWD's github posts are an embarrassment of riches-- it's usually like pulling teeth to get someone to document their process so well and here I am complaining about it. But it actually gets hard to read. I think if we elucidate the aesthetics of the notes, their use to different readers, we can move towards a technical solution.

Looking at JWD's github posts there are several types of information interesting to me at different stages of engagement, presented all at once. I want to know that the Riffle tracks to USGS data-- but I'm not an R user-- and seeing notes on the merits of long/short data table formatting and packages for doing it is distracting. I want to setup my own Riffle -- but it's not in front of me-- I don't really care what the setup is because I'm not working on it.

These interwoven types of information actually make it hard for me to read the things I want to read, and to catch the notes I need to catch to understand the process. My eyes go into a "scanning" mode. Switching my level of engagement while scrolling through the article is problematic.

I would love to see the "look I got this data dragging a Riffle behind my Kayak and it has some anomalies and stuff lets talk about them" note, minus these setup pieces, so I can get an overview before diving into EXACTLY what happened. As mentioned above, a 'collapse' tag, pushing the really tech/replication heavy sections of a note into a "click to expand" piece, would solve this issue. I understand the technical elegance of using a traditional HTML multiple pages approach, but I don't think that makes a complex process like experimental procedure, quantitative, visual, and qualitative analysis more presentable. It would be tough to split them up. They aren't really small separate notes.

Reply to this comment...

@mathew great comments and feedback, thanks. I totally agree, having the code in there is useful for some (probably few) but quite distracting for others (probably most). These notes were originally intended for just don and I (and me trying to teach don how to use R :). it wasn't until later that we started thinking about how to get them on PL for a larger audience.

But moving forward, it will definitely take some work on my part to figure out this balance between code and other distracting details and the real meaning of the note (e.g. how well does the riffle track the USGS gage).

I'm going to add a little button to my notes that will show/hide the code blocks for users who are interested (see this example with the show/hide button on the right of the page). I think I could do a similar type of thing where each section is collapse-able (code, text, figures, and all). So I think that will get us closer to finding the right balance here.

Reply to this comment...

I love the "spoiler" idea - although i guess we'd want to rename it for this context, like a "get the details" or we could call it an "accordion" or i dunno... we could imagine a Markdown extension for that sort of thing. Right now it can be done in HTML, but it's a little cumbersome. Maybe some kind of ###> Section title syntax would make everything until the next title collapsible?
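A quick sketch of how that `###> Section title` idea could work as a preprocessing step before the Markdown is rendered. Everything here is an assumption: the `###> ` marker is just the syntax floated above, and wrapping sections in HTML `<details>`/`<summary>` is one possible output, chosen because browsers that support it collapse it without extra JavaScript. It is not how the site actually handles this.

```python
import html

MARKER = "###> "  # the proposed syntax; not an existing Markdown feature


def expand_collapsibles(markdown_text):
    """Wrap each '###> Title' section in <details>, up to the next heading.

    Caveat: any line starting with '#' counts as a heading, so '#' comment
    lines inside code blocks would end a section early in this simple version.
    """
    out, open_section = [], False
    for line in markdown_text.splitlines():
        if line.startswith("#") and open_section:
            out.append("</details>")  # the next title closes the section
            open_section = False
        if line.startswith(MARKER):
            title = html.escape(line[len(MARKER):].strip())
            out.append(f"<details><summary>{title}</summary>")
            open_section = True
        else:
            out.append(line)
    if open_section:
        out.append("</details>")  # close a section that runs to the end
    return "\n".join(out)


if __name__ == "__main__":
    print(expand_collapsibles("# Note\n###> Code\nx = 1\n## Results\nfine"))
```

Whether Markdown inside the wrapped section still renders depends on the Markdown engine in use, so that would need checking against the real renderer.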

Reply to this comment...

Ok well looks like this is already supported by writing the raw html in the note that is used then picked up by bootstrap. I just tried it on a research note and it seems to work, I can create sections that are collapsed by default and others that are open by default: http://publiclab.org/notes/walkerjeffd/08-13-2014/creating-collapse-able-sections-in-a-research-note

Would be cool to create a markup syntax that would generate the "accordion" code automatically, but I could also just write the html myself, not a big deal.

Reply to this comment...

Hi there, I'm yet another pirateship person. Don pointed this thread out about rendering research notes from flatfiles maybe on github. I'm very excited! I'm working on similar rendering of markup to html and I thought I might mention some things I'd come across.

I have a book on github, The Rime of the Ancient Mariner. This was a Project Gutenberg book, so there was a text file with the entire text of Rime. I converted the file to asciidoc (very-similar-to markdown). You'll notice that when you view the book's markup file on github that it is automatically rendered to html by github.

I rendered Rime's markup file into html on my computer, and pushed to a github branch named gh-pages. This turned on github pages and renders the html file at this url.

Extending the plots2 repo (that runs this site) to poll github for updates and render a page (with caching) shouldn't be too hard. I'd be happy to lend a hand.
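The poll-and-cache idea could be sketched roughly as below, in Python rather than the Rails code that plots2 actually uses, and with the network calls left as injected functions. The names `fetch_revision`, `fetch_source`, and `render` are hypothetical placeholders (for example, "get the latest commit SHA", "get the markup file", "convert markup to html"); no real GitHub API call is shown.

```python
import time


class CachedPage:
    """Re-render a page only when the upstream revision changes."""

    def __init__(self, fetch_revision, fetch_source, render, min_interval=300):
        self.fetch_revision = fetch_revision  # returns e.g. a commit SHA
        self.fetch_source = fetch_source      # returns the markup text
        self.render = render                  # markup -> html
        self.min_interval = min_interval      # seconds between upstream checks
        self._revision = None
        self._html = None
        self._checked_at = None

    def get(self, now=None):
        """Return cached html, checking upstream at most once per interval."""
        now = time.monotonic() if now is None else now
        due = self._checked_at is None or now - self._checked_at >= self.min_interval
        if due:
            self._checked_at = now
            revision = self.fetch_revision()
            if revision != self._revision:  # only re-render on a real change
                self._html = self.render(self.fetch_source())
                self._revision = revision
        return self._html


if __name__ == "__main__":
    page = CachedPage(lambda: "abc123", lambda: "= Rime", lambda s: f"<h1>{s}</h1>")
    print(page.get())
```

Checking a cheap revision marker first keeps the expensive fetch-and-render step off the request path for most page views.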

Reply to this comment...

Seth,

Fantastic!! Thanks so much for this -- super exciting! Maybe you and Bryan and Jeff Walker and I can schedule a Hangout or something sometime soon just to chat through the various neat options / ideas?

Bryan and Jeff Warren have already been sketching out an API for submitting research notes, I think ...
I'm learning that a lot of researchers who do supercool environmental stuff seem to be using R, and RStudio, so it'd be great to chat about how markdown / asciidoc play nicely with that setup ...
Another cadre of researchers are into iPython, and use it in a similar fashion to RStudio ...
J-Walk pointed me to http://jupyter.org/, which aims to extend both iPython and RStudio into some awesome hybrid ...
My as-yet-clumsy reading on all of this is that the approach you're taking is flexible enough to handle a bunch of formats and generate research notes as flat files on github that can easily be sucked into plots2 somehow (if that's a cool idea vis a vis Bryan, J-War, and the other devs) ...
Clearly, good fun is to be had, here.
Woohoo!

Reply to this comment...

There probably isn't any reason you can't use markdown in the same fashion. There is a package for RStudio to do Asciidoc called knitr.

Reply to this comment...
