Public Lab Research note


GSoC 2021 Proposal: Geographic Features Refinement

by barun1024 | March 24, 2021 22:28 | #26021

About Me

Name : Barun Acharya

Github : https://github.com/daemon1024

Project Title : Geographic Features Refinements

Gitter : daemon1024

Affiliation : Jaypee Institute of Information Technology, India

Location : Noida, India

Student year: Sophomore (second-year student)

Field of study: Computer Science and Engineering, B.Tech.

Portfolio : https://daemon1024.github.io/

Project Description

Abstract/Summary (<20 Words)

Smooth, expand and refine the geographic features of Public Lab's website. This includes standardizing the "Spreadsheet to Layer" process, adding new Leaflet environmental layers, increasing the content shown on https://publiclab.org/map, and various other refinements.

Problems

Standardize "Spreadsheet to Layer"

We have an awesome utility to add a Leaflet environmental layer using data sourced from a spreadsheet; more details are at https://github.com/publiclab/leaflet-environmental-layers#spreadsheet-based-layers. Currently, however, it is not feasible for non-programmers to use this utility, or to work out how to source the data on which the layer is based.

Workflow

[Image: workflow diagram]

The above workflow needs no server, as it is automated using the GitHub API and GitHub Actions, and in case of any failure it still provides a manual alternative.

The intended workflow starts with a volunteer submitting the form, which automatically opens an issue. The issue in turn triggers the GitHub Action, the script runs in the action environment, and finally a pull request with the new layer is generated.

The proposed workflow has a fallback for every step: if any step fails, there will be a provision to complete it manually.

If every automation fails, the fallback and most basic workflow is to give the volunteer instructions for opening an issue. Once a Public Lab contributor picks it up, they manually run the script to generate the new layer and make a pull request linking back to the issue.

I understand that the primary audience may not all be developers. I will try to maintain a smooth workflow for them, documenting every fallback and showing progress at each step, so that the whole process is as accessible as possible.
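
As a minimal sketch of the first automated step (a form submission opening an issue), the form data could be turned into a request body for the GitHub REST API's "create an issue" endpoint. The form field names (`layerName`, `sheetUrl`, `description`) and the `new-layer` label are assumptions made for illustration; they are not part of the existing codebase.

```javascript
// Hypothetical helper: turn submitted form fields into the JSON body for
// POST /repos/{owner}/{repo}/issues (GitHub REST API).
// Field names and the "new-layer" label are illustrative assumptions.
function buildIssuePayload(form) {
  for (const field of ["layerName", "sheetUrl"]) {
    if (!form[field] || String(form[field]).trim() === "") {
      throw new Error(`Missing required form field: ${field}`);
    }
  }
  const layerName = String(form.layerName).trim();
  const description = String(form.description || "").trim() || "(none provided)";
  const body = [
    `Layer name: ${layerName}`,
    `Spreadsheet URL: ${String(form.sheetUrl).trim()}`,
    `Description: ${description}`,
  ].join("\n");
  return {
    title: `New spreadsheet layer request: ${layerName}`,
    body,
    labels: ["new-layer"],
  };
}

const examplePayload = buildIssuePayload({
  layerName: "Air Quality Sensors",
  sheetUrl: "https://example.com/sheet.csv",
});
console.log(examplePayload.title);
```

If the API call fails, the same `body` text could be shown to the volunteer to paste into a manually opened issue, which matches the fallback path described above.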

Details about spreadsheet2layer.js:
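
One possible core step of such a script, sketched under assumptions: converting spreadsheet rows exported as CSV into a GeoJSON FeatureCollection that Leaflet can render. The function name `csvToGeoJSON` and the column names `latitude` and `longitude` are hypothetical, and the parser is deliberately naive.

```javascript
// Sketch only: convert CSV text into a GeoJSON FeatureCollection.
// Column names are assumptions; quoted fields containing commas are NOT handled.
function csvToGeoJSON(csvText, latColumn = "latitude", lngColumn = "longitude") {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return { type: "FeatureCollection", features: [] };

  const header = lines[0].split(",").map((h) => h.trim());
  const features = [];
  for (const line of lines.slice(1)) {
    const cells = line.split(",").map((c) => c.trim());
    const row = Object.fromEntries(header.map((h, i) => [h, cells[i] ?? ""]));
    const lat = Number.parseFloat(row[latColumn]);
    const lng = Number.parseFloat(row[lngColumn]);
    // Skip rows that have no usable coordinates.
    if (!Number.isFinite(lat) || !Number.isFinite(lng)) continue;
    // Everything except the coordinate columns becomes a feature property.
    const { [latColumn]: _lat, [lngColumn]: _lng, ...properties } = row;
    features.push({
      type: "Feature",
      // GeoJSON coordinate order is [longitude, latitude].
      geometry: { type: "Point", coordinates: [lng, lat] },
      properties,
    });
  }
  return { type: "FeatureCollection", features };
}
```

In the browser, the result could be handed to Leaflet's `L.geoJSON(...)` to draw the layer; a real script would more likely use a proper CSV parser.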

Some additional points to note:

  • We can provide an "Open Gitpod to try out new layer" option in the generated pull request.
  • We can provide an independent file/code which anyone can include in, or copy into, their own codebase to add the generated layer. There are some examples at https://github.com/publiclab/leaflet-environmental-layers#adding-layers-individually, but they still require the LEL dependency. We can refactor some of the internal dependencies so that anyone can include them.
    [Image: independent layer]
  • We can generate sample previews using these independent components and deploy them to static pages using GitHub Pages or alternatives.

Add and refine environmental layers as part of the Leaflet Environmental Layers library

Since we would have a new standardized process for adding spreadsheet-based layers, it will open up ample opportunity to add various layers.

There are also many open issues regarding layer requests, broken layers, and needed optimizations at publiclab/leaflet-environmental-layers/issues. I will try to address them.

Increase amount of content shown per page-view at /map

Currently, https://publiclab.org/map shows very few users/pages on initial load, and the amount increases only gradually.

One possible solution is to develop a microservice which scrapes the Public Lab API. This microservice is intended to replace the current direct API access.

Details about the microservice:

  • Cache data from the API to reduce server load?
  • Provide data more in line with how we want to display content
  • Provide a bbox-queryable GeoJSON service
  • Ensure the content is ranked by filters such as recency, views, comments and likes, and add an option to provide custom filters
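
A minimal sketch of how the bbox query and ranking could fit together in such a service, assuming GeoJSON point features held in memory. The property names `views` and `updatedAt`, and the function name `queryFeatures`, are hypothetical; bounding boxes that cross the antimeridian are not handled.

```javascript
// Sketch only: select point features inside a bounding box, then rank them.
// bbox uses the same corner names as the existing API query: nwlat, nwlng, selat, selng.
function queryFeatures(features, bbox, { sortBy = "recency", limit = 100 } = {}) {
  const { nwlat, nwlng, selat, selng } = bbox;
  const inside = features.filter(({ geometry }) => {
    const [lng, lat] = geometry.coordinates; // GeoJSON order: [lng, lat]
    return lat <= nwlat && lat >= selat && lng >= nwlng && lng <= selng;
  });

  // Built-in rankers; property names are assumptions.
  const rankers = {
    recency: (a, b) => new Date(b.properties.updatedAt) - new Date(a.properties.updatedAt),
    views: (a, b) => (b.properties.views || 0) - (a.properties.views || 0),
  };
  // A custom filter can be supplied as a comparator function.
  const compare = typeof sortBy === "function" ? sortBy : rankers[sortBy];
  if (!compare) throw new Error(`Unknown sort filter: ${sortBy}`);

  // `filter` returned a new array, so sorting does not mutate the input.
  return inside.sort(compare).slice(0, limit);
}
```

A production service would replace the linear `filter` with a spatial index, but the request and response shape could stay the same.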

The current API at /api/nearbyPeople returns data in the form:

{
  "items": [
    {
      "doc_id": ...,
      "doc_type": ...,
      "doc_url": ...,
      "doc_title": ...,
      "latitude": "42",
      "longitude": "-71",
      "blurred": true,
      ...
    },
    ...
  ]
}

Since the microservice would be a GeoJSON service, the proposed form of data returned would be:

{
  "type": "PlacesCollection",
  "places": [
    {
      "type": "place",
      "geometry": {
        "type": "Point",
        "coordinates": [102.0, 0.5]
      },
      "properties": {
        "blurred": "",
        "id": "",
        "name": "",
        ...
      }
    },
    ...
  ],
  "type": "PeopleCollection",
  "People": [
    {
      "type": "Person",
      "geometry": {
        "type": "Point",
        "coordinates": [102.0, 0.5]
      },
      "properties": {
        "profile": "",
        "id": "",
        "name": "",
        ...
      }
    },
    ...
  ]
}

Since GeoJSON is a standardized data format, it would help integrate PL data with other applications as well. For legacy support, we can choose to provide both formats and respond with GeoJSON only when the request's Accept header is application/geo+json.
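
A hedged sketch of the conversion and the legacy fallback. Note that it emits a standard GeoJSON FeatureCollection (RFC 7946) rather than the PlacesCollection/PeopleCollection naming sketched above, and the function names are hypothetical. It relies only on the `items`, `latitude` and `longitude` fields shown in the current API response.

```javascript
// Sketch only: convert items shaped like the current API response into a
// standard GeoJSON FeatureCollection. All other item fields become properties.
function itemsToFeatureCollection(items) {
  return {
    type: "FeatureCollection",
    features: items.map(({ latitude, longitude, ...properties }) => ({
      type: "Feature",
      geometry: {
        type: "Point",
        // The current API returns coordinates as strings; GeoJSON needs numbers
        // in [longitude, latitude] order.
        coordinates: [Number(longitude), Number(latitude)],
      },
      properties,
    })),
  };
}

// True when the Accept header lists the GeoJSON media type.
function wantsGeoJSON(acceptHeader = "") {
  return acceptHeader
    .split(",")
    .map((part) => part.split(";")[0].trim().toLowerCase())
    .includes("application/geo+json");
}

// Legacy clients keep receiving the original payload unchanged.
function respond(apiResponse, acceptHeader) {
  return wantsGeoJSON(acceptHeader)
    ? itemsToFeatureCollection(apiResponse.items)
    : apiResponse;
}
```

Keeping the decision in one small function means the existing response path stays untouched for clients that do not ask for GeoJSON.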

Improving Load Times

We currently load individual nodes while monitoring loads. I propose that we pre-cluster nodes by zoom level and create requests dynamically based on the current bounding box. The simplest way is to use a plugin such as Marker Clusterer. Clustering greatly helps rendering on the client side, because the client computer doesn't have to draw hundreds or thousands of points; it draws just 10-40.

https://leaflet.github.io/Leaflet.markercluster/example/marker-clustering-realworld.388.html
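
To illustrate the pre-clustering idea independently of any plugin, here is a minimal grid-based sketch: points are grouped into cells whose size halves with each zoom level, and each cell is reduced to one centroid with a count. The function name `clusterByZoom` and the base cell size (360 degrees at zoom 0) are assumptions; this is not how Leaflet.markercluster works internally.

```javascript
// Sketch only: grid-based pre-clustering of { lat, lng } points for one zoom level.
function clusterByZoom(points, zoom) {
  const cellSize = 360 / Math.pow(2, zoom); // degrees per grid cell (assumed base size)
  const cells = new Map();
  for (const { lat, lng } of points) {
    const key = `${Math.floor(lat / cellSize)}:${Math.floor(lng / cellSize)}`;
    const cell = cells.get(key) || { count: 0, latSum: 0, lngSum: 0 };
    cell.count += 1;
    cell.latSum += lat;
    cell.lngSum += lng;
    cells.set(key, cell);
  }
  // One output marker per occupied cell, placed at the centroid of its points.
  return [...cells.values()].map(({ count, latSum, lngSum }) => ({
    count,
    lat: latSum / count,
    lng: lngSum / count,
  }));
}
```

Results could be computed once per zoom level and cached, so a bounding box request at a low zoom level returns a handful of clusters instead of every node.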

Ensure empty people grid squares don't generate pop-ups and ensure grid squares with blurred locations are tinted

[Image: pop-up generated on an empty grid square]

Currently every empty grid square generates this pop-up. The pop-up should be generated only on grid squares which are actually blurred, in line with https://github.com/publiclab/leaflet-blurred-location. These grid squares should also be tinted so that they are visibly different from other grid elements.
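
The intended rule can be sketched as a small pure function. The shape of `cell` (south/north/west/east bounds) and the function name `describeGridSquare` are assumptions; the `latitude`, `longitude` and `blurred` fields follow the current API response shown earlier. This does not use the leaflet-blurred-location API.

```javascript
// Sketch only: decide how one grid square should be rendered.
// A square gets a pop-up and a tint only if it contains a blurred location.
function describeGridSquare(cell, people) {
  const blurredInside = people.filter((person) => {
    const lat = Number(person.latitude); // the API returns coordinates as strings
    const lng = Number(person.longitude);
    return (
      person.blurred === true &&
      lat >= cell.south && lat < cell.north &&
      lng >= cell.west && lng < cell.east
    );
  });
  const hasBlurred = blurredInside.length > 0;
  return {
    showPopup: hasBlurred, // empty squares generate no pop-up
    tinted: hasBlurred, // blurred squares are visibly different
    blurredCount: blurredInside.length,
  };
}
```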

Timeline

Time period | Tasks

Community Bonding Period
  • Discuss any changes in my project or timeline with the mentors
  • Create GitHub projects/kanban boards to have structured goals and milestones
  • Go through the project, code-base and issues, set priorities for critical bugs and features

Coding Period Starts

Week 1 (June 7, 2021 - June 14, 2021)
  • Prototype spreadsheet to layer abstraction
  • Have a working spreadsheet2layer.js script which can generate leaflet layers

Week 2 (June 14, 2021 - June 21, 2021)
  • Add test harnesses to auto generate tests for layers
  • Refine and finalize various elements of the script
  • Establish the manual workflow for the standardized process

Week 3 (June 21, 2021 - June 28, 2021)
  • Set up webpage/element which contains the form
  • Set up GitHub Actions and Environment to run the script
  • Integrate the entire workflow

Week 4 (June 28, 2021 - July 5, 2021)
  • Inspect the process and write tests
  • Add further enhancements including gitpod/static previews of new layers and independent code content generation
  • Add more leaflet environmental layers using the new process and help Public Lab’s community to contribute new layers through this new workflow

Week 5 (July 5, 2021 - July 12, 2021)
  • Write documentation and blogs to make community members as well as a wider audience acquainted with the new standardized workflow for adding new map layers
  • Benchmark contents on PL People API
  • Plan out the needs of the service so as to increase the amount of content shown on map

Week 6 (July 12, 2021 - July 19, 2021)
Evaluations (July 12, 2021 - July 16, 2021)
  • Prototype the microservice which returns appropriate set of contents to be displayed on /map
  • Set up scripts for scraping content

Week 7 (July 19, 2021 - July 26, 2021)
  • Continue working on the microservice
  • Migrate the current integration to newly developed microservice

Week 8 (July 26, 2021 - August 2, 2021)
  • Write system and integration tests for the microservice
  • Develop solutions for ensuring appropriate pop-ups and tinted boxes on the map for blurred locations

Week 9 (August 2, 2021 - August 9, 2021)
  • Improve load times by bbox and tbox optimisations
  • Set up map clusters to further improve client side loads
  • Continuously benchmark and improve load times

Week 10 (August 9, 2021 - August 16, 2021)
  • Buffer period for winding up any remaining tasks
  • Add finishing touches like style improvements and comments to my code
  • Do intensive bug testing and see if there are more test cases to be written

Week 11 (August 16, 2021 - August 23, 2021)
  • Code Submission and Final Evaluations

    I have listed only the priority and focus tasks for each week in the above timeline. In addition, my work each week will be a blend of the following:

    • Breaking down issues into smaller modular tasks
    • Interacting and getting feedback from mentors
    • Team meetings
    • Helping out other contributors
    • Testing

      Testing is a very important aspect of software development. It leads to more accurate, consistent and reliable results. I plan to add unit, functional, integration and system tests, since we need all of them for all 'units' to work cohesively. I will add unit tests after completing each unit/function. For integration tests, I prefer the top-down approach: testing each unit integration step by step and expanding. Functional tests will cover edge cases and other behavioral checks.

    • Outreach
      • Writing blog posts sharing my progress and GSoC experience along the way
      • Opening up FTOs for new contributors and helping review pull requests
      • Improving documentation and making it beginner-friendly, so that new contributors feel better guided

    Contributions

    I have been an active contributor throughout the Public Lab ecosystem of repositories, with significant contributions to Public Lab's Image Sequencer and Plots2. Beyond contributing code and pull requests, I have been an active member by reviewing PRs, helping fellow contributors, and engaging in discussions. I am also part of the Image Sequencer CLI maintainers team and am working on the new test suite for the sequencer CLI.

    Comments

    Issues

    Pull Requests

    Experience

    I feel very passionate about developing software and automating things. I feel very happy when my software helps people out and makes their lives easier and more fun!

    I primarily program in JavaScript and Go, but I am not limited to them. I have worked on projects in various languages including Python, Ruby, Rust and C++.

    Some of my notable projects are:

    • Needzo - https://devpost.com/software/needzo | Auth0 Hackathon (Fifth-Community)
      Needzo lets those under self-isolation and other vulnerable groups request deliveries and allows local volunteers to sort through, claim, and complete those deliveries with a gamified approach.
    • EndangAR - https://devfolio.co/submissions/endangar-bfce | Bitbox Winner - EchoAR Showcase
      We wanted to create more empathy for, and spread knowledge about, the environment and wildlife. EndangAR is our project for spreading awareness of endangered flora and fauna through an interactive map interface and by rendering plant and animal models in 3D as well as Augmented Reality.
    • Bikeshed - https://github.com/daemon1024/bikeshed
      A friendly neighborhood GitHub bot which automates trivial tasks for an organization and aims to provide a welcoming environment for newcomers.
    • Bolt262 - https://github.com/daemon1024/bolt262
      It is a CLI utility to run test262 tests for various hosts (currently tested with node). It aims to be as fast as possible, with optimizations wherever possible.
    • TabuRei - https://github.com/daemon1024/TabuRei
      TabuRei is an all-in-one browser extension to manage and store your tab clutter and restore them on the fly. It is fully Open-Source, and loaded with features. Try it at https://addons.mozilla.org/en-US/firefox/addon/taburei/

    I give workshops and talks to bridge the gap between newcomers and development, and to spread awareness of FOSS, at the Open Source Developers Community (OSDC). Some of the projects I have collaborated on at OSDC:

    Other open source contributions:

    Teamwork

    I have participated in various hackathons, collaboratively building solutions to various societal problems in teams of three or four, and have accomplishments in many of them. I volunteer at OSDC, where we conduct meetups and events and work on projects with the entire community, so I know how to work in a community. I am a firm believer in collaborative development and working with communities, which is visible in most of my open source contributions.

    Passion

    I have been involved with Public Lab for quite a while now. I started contributing because of how caring and progressive the community was, and I am really passionate about giving back to the community and about continuing and improving this welcoming environment in the future.

    Audience

    This project aims to ease the process of adding interactive statistical map layers for volunteers and scientists, which in turn will help them display their research in a much more intuitive manner. It will also attract developers and contributors who can leverage the LEL ecosystem. With additional optimized content on the /map page, it will improve the discoverability of research notes and community projects, bringing the Public Lab community closer together and making it more interactive.

    Commitment

    There is no conflict of interest during the coding phase as of now, and I don't plan to have any other commitments during the period. If there are any sudden changes in my university schedule, I will discuss them with the mentors.

    I'd like to mention that I will, as I have done previously, continue to actively interact with the newer contributors and provide insight and any help that I can regarding their PRs and issues during (and after) my GSoC period. Hence, I firmly believe that I will deliver my assignments with commitment and promptness.

    I look forward to working with the awesome community of Public Lab for a long time!

    Needs

    Help and guidance of my mentors as well as coordination from other contributors is all I need.


    13 Comments

    Hey @warren, @cess and all community members, Here is my draft proposal for the Geographic Features Refinement project. I am still working and detailing out the implementations as I gain deeper understanding of the project. All of your feedback and review are most welcome. Thank You 😃

    Hi @barun1024, thanks so much for sharing your ideas. I like the extra explanations you have added for each item. I have a couple of questions/feedback:

    1. You have a section called other refinements since they are part of the project, I would recommend expounding further on them as you have in the other features.

    2. I did not see these two features mentioned in the proposal: Develop ideas and prototypes for how to ensure “important” or expected content is surfaced and Ranking by recency, views - brainstorm with community reps

    3. What are some of the tests you will include in the project? Unit, functional, system.

    4. Are there any accessibility features you can think about in the project?

    5. Any plans to make First Timer Issues as you work on your project? We appreciate issues that invite newcomers to Public Lab.

    Thanks again for your contributions to Public Lab and it's amazing to see the work you have done in the Open Source community.

    Thanks @ruthnwaiganjo for the feedback 😃

    • You have a section called other refinements since they are part of the project, I would recommend expounding further on them as you have in the other features.

    I have added some details to them while I am working on expanding each of them further :)

    • I did not see these two features mentioned in the proposal: Develop ideas and prototypes for how to ensure “important” or expected content is surfaced and Ranking by recency, views - brainstorm with community reps

    I believe they are part of the same feature set, i.e. ranking important/expected content based on various filters like views and recency. I have currently included this under the details of the microservice. I believe it would be something similar to how sorting works in the current search API (Ref). Let me know if I didn't get it right and point me in the correct direction :)

    • What are some of the tests you will include in the project? Unit, functional, system.

    I plan to add unit, functional and integration tests, since we need all of them for all 'units' to work cohesively. I will add unit tests after completing each unit/function. For integration tests, I prefer the top-down approach: testing each unit integration step by step and expanding. Functional tests will cover edge cases and other behavioral checks. I am open to other recommendations.

    • Are there any accessibility features you can think about in the project?

    I will try to remain conformant with https://www.w3.org/WAI/test-evaluate/. Any further pointers appreciated.

    • Any plans to make First Timer Issues as you work on your project? We appreciate issues that invite newcomers to Public Lab.

    Definitely. This initiative really helped me get started with open source and with Public Lab, so I am very eager to continue it.

    Hi Barun! Thanks so much for this proposal, it looks great. Some thoughts I wanted to share:

    • I'm thinking about the "fallback" or most basic pathway in your excellent flow diagram - for example, an activity page on PublicLab.org might say "copy paste this text into an issue" (linking to a template text file) but we might also have an issue template set up. Thinking on how people who don't have a github account, or for whom GitHub is an unfamiliar place, might most smoothly navigate this, and if there are things we want to do to improve the process for them.
    • I love the idea that GitHub Actions can do a lot here. I also think GitPod could help - that is, once GitPod builds, we can say something like "click the 'open in gitpod' button to try out the layer" maybe?
    • I also wonder how independent the code changes will be. That is, what if someone used our system to just make their own map, quite apart from LEL entirely? Would some level of compartmentalization make this possible? Like, could there be a line that says "if you just download the code from this PR branch, you can just run it yourself" or "copy paste the contents of this file to add just this new layer into any Leaflet map?" These arent required at all, but just brainstorming -- if this functionality had broader usage it might attract a broader base of contributors to keep maintaining it!

    Regarding the micro-service idea, i'm curious what shortcuts we can take. I can imagine that this is relatively simple for a micro service. But, it's good to think about -- are there microservice templates or infrastructures we could build on to reduce the amount of code or infrastructure we'd need to build ourselves? Are there even websites out there that can be set up to scrape for us? Or, would it be possible to write the scraping script, but to have it dump into an existing geoJSON service and rely on their code's query optimization? I think there could be many different answers to this and I don't think we have to immediately know the best answer, but perhaps it's worth researching a bit to see what's out there and what the possibilities are.

    Thanks so much Barun!!! 🎉

    Hi @warren,

    Thinking on how people who don't have a github account, or for whom GitHub is an unfamiliar place, might most smoothly navigate this

    So the primary path of this whole workflow is fully automated. When you click the submit button, it calls the GitHub API and opens an issue for us based on the form data. The community volunteer wouldn't need to do anything, just observe some ticks and maybe animations going 👀. We may need a bot account, say publiclab-bot, for this. In case this fails, we can display your suggestion

    copy paste this text into an issue using some template

    this sounds great.

    Update:

    I have attached an updated workflow which may better explain it.

    workflow.png

    P.s. Thanks a lot for the review, glad you liked it :)

    Adding a Gitpod integration to check out the layer sounds great. Or we may choose to deploy it using GitHub Pages like this https://publiclab.github.io/leaflet-environmental-layers/example and share that link instead. (We may need to look into how to serve it with the build files from the PR.)


    Generating independent files/code for anyone to add a specific layer to any map sounds so cool.

    if this functionality had broader usage it might attract a broader base of contributors to keep maintaining it!

    agreed.

    Has any existing layer, not necessarily spreadsheet-based, been served as an independent file? If not, I will experiment with some existing layers and see exactly what we would need to generate independent code. I will open a discussion thread on the LEL repo for this if that sounds right to you!

    Update: I have included some details about this in my proposal now

    Are there even websites out there that can be set up to scrape for us? Or, would it be possible to write the scraping script, but to have it dump into an existing geoJSON service?

    One way would be to schedule cron jobs for scraping and dumping data and leverage a microservice to serve that data. We can use github-action to schedule this too. Ref

    I am not exactly sure what we want to scrape, since we already have an API for people and posts; hence I used the term cache, so as to reduce the load on the API. Do we plan to use some external website which we would need to scrape for data?

    I think there are a few options on this. But the main thing is that we are making arbitrary bbox (bounding box) queries to an API using 2 latitudes and 2 longitudes, so a URL-based caching system probably won't work well, since people will zoom and drag the map in unpredictable ways.

    However, the results are JSON, so we might think about how to store JSON objects in a geographic index for fast retrieval. Our database is actually not set up for that, as we use the more general purpose tags "lat:" and "lon:" and so we can't do as efficient database indexing on those.

    https://publiclab.org/api/srch/taglocations?nwlat=47.15984001304432&selat=34.19817309627726&nwlng=-80.79345703125001&selng=-61.19384765625001

    this is the query we get when we go to https://publiclab.org/map#6/41/-71, for example. Learn about this kind of API query here: https://publiclab.org/wiki/api
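
For illustration, a query like the one above can be assembled from the four corner values. This is a sketch only: the helper name `buildTagLocationsUrl` is hypothetical, while the path and the `nwlat`, `selat`, `nwlng`, `selng` parameter names are taken from the example URL.

```javascript
// Sketch only: build a taglocations bounding box query from corner coordinates.
function buildTagLocationsUrl({ nwlat, nwlng, selat, selng }, base = "https://publiclab.org") {
  const params = new URLSearchParams({
    nwlat: String(nwlat),
    selat: String(selat),
    nwlng: String(nwlng),
    selng: String(selng),
  });
  return `${base}/api/srch/taglocations?${params.toString()}`;
}

console.log(buildTagLocationsUrl({ nwlat: 47.5, selat: 34.25, nwlng: -80.5, selng: -61.25 }));
```

In a Leaflet map, the corner values could come from `map.getBounds().getNorthWest()` and `map.getBounds().getSouthEast()`. Because users pan and zoom freely, almost every such URL is unique, which is why URL-keyed caching is unlikely to help here.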

    So I think we can think about 2 components - the "storage/retrieval" using a query /like/ the one above, and the mechanism to "scrape and store" from the PL API into that storage system.

    For storage/retrieval, there seem to be many good options. I'm thinking node.js for lightweight and fast, but we can be flexible: https://duckduckgo.com/?q=efficient+nodejs+geojson+database&atb=v121-6&ia=web

    For the scraping, I'm not sure. Maybe we could think about whether we are constantly scanning by region, or if accessing the microservice triggers a deferred re-fetch from the original API? These are really good design/architecture questions to weigh pros/cons on.

    Thanks @barun1024!!!

    That clarifies things a lot.

    I understand the main issue is that we don't have an efficient database setup for geographic data.

    I came across PostGIS, which is a spatial database extender for PostgreSQL. Alternatively, we can extend our existing database, since MySQL supports spatial data types (https://dev.mysql.com/doc/refman/8.0/en/spatial-types.html). We will have to weigh the pros and cons here.

    Using a specialized database solves many of our problems, such as:

    • Indexing: most of these databases already have efficient spatial indexing in place (https://postgis.net/workshops/postgis-intro/indexing.html)
    • Bounding box queries: https://postgis.net/docs/ST_Within.html and https://postgis.net/docs/ST_Contains.html are specialized geometry-based queries (bounding box in our case) which use indexing automatically (unless we specify otherwise)

    As you mentioned, Node.js can be used to provide the interface on top of the database, and I am quite comfortable with the stack.

    As for scraping and storing the data, we can provide an endpoint like /create-map-data, which provides an interface for storing data directly into this database whenever a relevant node is created. As you already mentioned, we will have to weigh the pros/cons here.

    cc @warren


    Hi Barun - thank you, that's good research. I'm quite in favor of an external stand-alone solution which doesn't add complexity to the plots2 database, esp because many spatial extensions are database specific, and wouldn't work with mysql, sqlite, etc -- I've definitely been down that road before and it is complicated. I am not 100% against it, but my preference is to look at options which read in data via a standard access format like JSON (which is already available in plots2) and then create and maintain a self-contained efficient geodata store with premade indexing. I also like that this could involve a ready-to-use solution for storage, and that we might simply write the import script and run it periodically as a job.

    My sense of cons on database extensions is that they place a lot of narrower requirements on our database choices in plots2, and the extra processing of populating and maintaining the tables is something we could bypass by externalizing. Does this make sense? Thank you!!!!

    Thanks for addressing the feedback @barun1024, I would suggest updating your proposal using the feedback points shared to make it more comprehensive.

    Thanks @ruthnwaiganjo, I have updated my proposal to incorporate the feedback and improve upon it. I would really appreciate further feedback and suggestions :) Thanks again 🎉

