Public Lab Research note


Outreachy proposal: Improve statistics system for publiclab

by radhikadua | October 30, 2018 00:59 | #17427


About me

I am a 4th-year Information Technology student at Panjab University, Chandigarh, India. I have been an open source enthusiast for a couple of years and have submitted small patches to the GNOME project in the past. I encourage everyone in my university to use Linux and open source software in general. We celebrate Software Freedom Day every year at our university, where we introduce open source technologies to freshmen and sophomores. I heard about Public Lab through Outreachy itself, and what I love most about the Public Lab community is how active and helpful its people are.

Affiliation: Panjab University

Location: Chandigarh, India


Project description

Extend community collaboration statistics and visualization system for publiclab.org


Abstract/summary (<20 words):

Improve UX as well as backend for community statistics data.


Problem

plots2 is a collaborative knowledge-exchange platform built in Rails. It offers an easy-to-use platform on which people can create notes and wikis, and on which like-minded people can explore and discuss things with each other. It has a metrics system that shows statistics and insights into community trends and activity. It's quite powerful, but because new features have been added over time, a lot of pieces are scattered here and there, which makes for a poor user experience. It also has some performance concerns that need to be addressed properly. So, I plan to improve the whole statistics system.


Timeline/milestones

Community Bonding Period

This time will be utilized for:

  • Discussing and perfecting the UI as well as architecture of the Statistics system
  • Working on other issues which are unrelated to this proposal
  • Brushing up my Ruby on Rails skills
  • Getting familiar with the project's codebase
  • Writing scripts to go back through posts and tag first-timer posts with the first-time-poster tag

December

  • Week 2: Add new routes and set up the basic file structure and layout for new stats page development.
  • Week 3: Port "General Stats" as is from the old layout.
  • Week 4: Create backend GET APIs using Grape for requesting JSON/CSV data for all graphs, along with unit tests.

January

  • Week 1: Continue work on last week's backend APIs.
  • Week 2: Add support for range and tags specification in those APIs.
  • Week 3: Implement caching for "All time" data and set up read-only db connections.
  • Week 4: Select js chart library and add "No. of subscribers per tag" chart.

February

  • Week 1: Add all the range based charts but without "range" and "tags" feature.
  • Week 2: Implement tags feature on frontend side and add "download" button for charts.
  • Week 3: Implement range feature on frontend side.
  • Week 4: Implement on-bar-click functionality for charts.

March

  • Week 1: Deprecate old views and redirect them to new one with smooth transition.

User Interface

You can check out the UI mockups that I have created at https://app.moqups.com/radhikadua/9nqXCzGzQK/view. The first page shows the general stats, which are not based on a time range. These stats include the total number of contributors, users, answers, comments, questions etc. On that page, you can click on "Range based stats", where the user is presented with all the tags and the time range. Based on these two parameters, the page generates charts below those widgets. You can click on the "blog" tag to see charts for that tag. I have used sample data for these mockups.

I couldn't display some features in the mockups due to technical limitations:

  1. Clicking on the date picker will let the user pick a date range.
  2. You can click on the icon to download the raw stats data. After clicking on this icon, you will be shown an option to select the data format (csv, json).
  3. You can click on an individual bar in a chart, which will display a popup showing the nodes from that time range.

I'm attaching the screenshots here for reference.




Implementation

I'll start the development on new routes at /stats/new. According to my UI implementation details, I'll add just one page where all the stats are shown. So, other than API calls, there will be just one view. In that view, all the data for the general statistics will be filled in by Rails views, and for the "Range based" stats, all data will be fetched using AJAX calls.

All the old links like https://publiclab.org/stats/range/10-05-2015/10-05-2016 or https://publiclab.org/stats?time=September%2026,%202017%2022:43 will have permanent redirect to the new views.
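As a rough illustration of the redirect idea, the mapping from an old range path to the new page could look like the plain-Ruby sketch below. The helper name `new_stats_path`, the regular expression, and the choice to carry the old dates over unchanged as `start_date` and `end_date` query parameters are all assumptions made for this example. In the app itself, this logic would more likely sit in a Rails route using the `redirect` helper.

```ruby
require 'uri'

# Hypothetical pattern for the old range URLs, e.g.
# /stats/range/10-05-2015/10-05-2016
OLD_RANGE_PATH = %r{\A/stats/range/([^/]+)/([^/]+)\z}

# Hypothetical helper: returns the path on the new stats page that an old
# path should permanently redirect to. Unknown paths fall back to /stats.
def new_stats_path(old_path)
  match = OLD_RANGE_PATH.match(old_path)
  return '/stats' unless match

  # The old date strings are passed through as-is (an assumption).
  query = URI.encode_www_form(start_date: match[1], end_date: match[2])
  "/stats?#{query}"
end

puts new_stats_path('/stats/range/10-05-2015/10-05-2016')
```

Keeping the mapping in one small function makes it easy to unit test before wiring it into the routes.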

To reduce the load on the site, I propose the following two ideas:

  1. I'm not sure about the production configuration right now, but we can add a slave db to which we only create read-only connections. We should use data from that db to serve requests for stats; otherwise the main db would carry a lot of load and overall performance would degrade.
  2. We can cache the data we want to serve. I found that caching views in Ruby on Rails is quite easy, but users can also select the date range for which they want to see stats, and caching that data isn't practical since date ranges have a large number of combinations. I will discuss this and try to come up with a more complex caching technique if required.

If time allows, I'll implement a plugin for the GrimoireLab project and submit it to them so that data can be fetched using the perceval tool.


Request flow:

  1. User opens /stats: A GET request goes to the backend, which responds with a cached view, so there is no hit to the db. In that cached view, all the non-range-based stats are already filled in. Also, in the "No. of subscribers per tag" chart, the user will only see the count of subscribers and not who is subscribed, which preserves the privacy of the users.
  2. User clicks on the "Range based charts" (needs better name?) option: The UI sends n AJAX requests to the backend, where n is the number of charts, and the backend responds with JSON data for each request. I can think about combining the data into one single request in case it improves performance. This JSON data will also be cached in the backend.
  3. User clicks on the "Download" button: The UI sends a request to the same backend endpoint it used earlier to receive JSON data. That endpoint can also take an optional query parameter, format, which describes the format in which the data has to be downloaded. Again, the downloaded data won't contain any user-specific information, to protect users' privacy; only aggregated stats will be available for download. This way we don't have to limit this functionality to moderators and administrators and can make it available to end users too.
  4. User clicks on a specific "Tag" or specifies a time range in the calendar widget: A GET request is sent to the same endpoint with the query parameters tag, start_date and end_date. The backend has to query the database because it's not possible to cache all the possible combinations. So, it queries the db and returns results in the same format. For the database, I can use a slave/read-only db (if we have one configured; otherwise I can learn about setting up a new one if required), so that it doesn't increase production load.
  5. User clicks on a specific "bar" in the charts: Depending on the bar clicked in the graph, we send a request to the backend, which returns data containing node_id, title, link etc., depending on the node type. The response for this API endpoint will be much smaller, so we won't need to cache it.

All the selections related to tags and range would be saved in the browser URL. So, one can also share or bookmark the URL.

Routes for the above flow would look like this:


get 'stats' => 'stats#index'

(Served by rails views)


get 'stats/:type' => 'stats#type'

(Used by AJAX calls. Returns json or csv.)

Response: [ {x: 1, y: 2}, {x: 2, y: 10} ]

Query parameters for above routes:

  • download: If this parameter is passed, then backend will respond with response header related to downloading this file.
  • format: csv or json, response would be in given format.
  • tag: Nodes which have this tag specified, would be considered for returned data.
  • start_date: Default value would be date on which first node was created.
  • end_date: Default as well as max value for end_date would be yesterday's date.
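The defaults described for start_date and end_date can be sketched in plain Ruby. This is only an illustration: the helper name `normalize_range`, the ISO 8601 input format, and passing `first_node_date` in as an argument are assumptions and not part of the existing codebase.

```ruby
require 'date'

# Hypothetical helper (not in plots2): applies the proposed defaults to the
# start_date / end_date query parameters. Dates are assumed to arrive as
# ISO 8601 strings such as "2018-10-02".
def normalize_range(params, first_node_date:, today: Date.today)
  yesterday = today - 1

  # Default start: the date on which the first node was created.
  start_date = params[:start_date] ? Date.iso8601(params[:start_date]) : first_node_date
  # Default end: yesterday.
  end_date = params[:end_date] ? Date.iso8601(params[:end_date]) : yesterday

  # end_date can never be later than yesterday.
  end_date = yesterday if end_date > yesterday
  raise ArgumentError, 'start_date is after end_date' if start_date > end_date

  [start_date, end_date]
end
```

Passing `today` as an argument keeps the helper deterministic, which makes the yesterday-based default straightforward to cover in unit tests.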

get 'stats/:type/data' => 'stats#type_data'

(Used by AJAX calls. Returns json or csv.)

Response: [ {node_id: 1, title: "Note 1", link: "/notes/link"}, {node_id: 2, title: "Note 2", link: "/notes/link2"} ]

Query parameters for above routes:

  • tag: Nodes which have this tag specified, would be considered for returned data.
  • date: Date on which user clicked in the bar charts.

So, this way, above API would be generic to support all the operations. And all the statistics related UI would be at one place.

Most probably I'll be using the Chart.js library. Here's some sample code describing the way I'll be implementing things.

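<canvas id="myChart" width="400" height="400"></canvas>
<script>
var ctx = document.getElementById("myChart").getContext('2d');

// Fetch the points for one chart. Returns the jQuery promise, which
// resolves with: [ {x: 1, y: 2}, {x: 2, y: 10} ]
function getGraphData(type, tag, start_date, end_date) {
    return $.ajax({
        url: "/stats/" + type,
        type: "get",
        data: {
            format: "json",
            tag: tag,
            start_date: start_date,
            end_date: end_date
        }
    });
}

// type, tag, start_date and end_date come from the current selection on the page.
getGraphData(type, tag, start_date, end_date).then(function (points) {
    var myBarChart = new Chart(ctx, {
        type: 'bar',
        data: {
            labels: points.map(function (p) { return p.x; }),
            datasets: [{ label: type, data: points.map(function (p) { return p.y; }) }]
        },
        options: {
            onClick: function (event, activeElements) {
                if (!activeElements.length) { return; }
                // Single element in case of bar chart
                var element = activeElements[0];
                // Chart.js 2.x exposes the bar position as _index, later versions as index
                var index = element.index !== undefined ? element.index : element._index;
                // Fetch data from server
                $.ajax({
                    url: "/stats/" + type + "/data",
                    type: "get",
                    data: {
                        format: "json",
                        tag: tag,
                        // Assumes x carries the date of the clicked bar
                        date: points[index].x
                    },
                    success: function (response) {
                        // response: [
                        //    {node_id: 1, title: "An awesome note", link: "/notes/link"},
                        //    {node_id: 2, title: "An awesome note 2", link: "/notes/link2"},
                        // ]
                        // Show modal to user using above data
                        // ...
                    }
                });
            }
        }
    });
});
</script>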

Code distribution:

app/
  assets/
    stats.js - It will include charts library util functions which can be called from index.html.erb
  controllers/
    stats_controller.rb - It will have `index` renderer which will render index.html.erb and other util functions used in the index.html view.
  api/
    stats.rb - It will include GET APIs which will specify formatter and be called through AJAX. All the business logic related functions will be called inside the API functions.
  /lib/
    stats.rb - Module which contains business logic and will be imported in the api/stats.rb
  views/
    stats/
      index.html.erb - It will contain all the initial view related code when user opens /stats
  models/ - There doesn't exist any model for stats right now

Contributions

Contributions I have made to this project are as follows:

  1. https://github.com/publiclab/plots2/pull/3759 (Merged)
  2. https://github.com/publiclab/plots2/pull/3775 (Merged)
  3. https://github.com/publiclab/plots2/pull/3801 (Merged)
  4. https://github.com/publiclab/plots2/issues/3798 (Deferred to new contributors)
  5. https://github.com/publiclab/plots2/issues/3524 (In review)
  6. https://github.com/publiclab/plots2/issues/3869 (In review)
  7. https://github.com/publiclab/plots2/issues/3870 (In review)
  8. https://github.com/publiclab/plots2/issues/3384 (In review)
  9. https://github.com/publiclab/plots2/issues/3628 (In review)

Other obligations

I have my semester exams in the first two weeks of December (Dec 01 - Dec 11 tentatively), but I have adjusted my schedule accordingly. Also, I'll account for the delay if and when it arises by working extra during the coding period.


Experience

The first programming language I learned was C++, and even now it's my favourite. I learned basic data structures and algorithms and implemented them in C++. Reading books to get the basic concepts helped me a lot, and implementing the code then helped clear up all my doubts. Once I had completed the basics in C++, it became easy to move on to any other programming language. I practiced coding for a few months, which later helped me solve bugs and write optimized, efficient code. I contributed to open source, which helped me learn to deal with big codebases. I participated in and won a few hackathons, where I learned about new technologies used by other teams in their projects. I attended some conferences where I got to learn many fascinating new things, which made me more excited to learn emerging technologies.

Some projects:


Teamwork

I was a summer intern at the Samsung IOT lab at IIT Delhi for the Celestini project, where I worked on air pollution prediction using IOT analytics based on estimators. In this project, we were a team of two members with awesome mentors to help us throughout the project. I was also a summer intern at variance.ai, where I worked on a few projects, including tracking breathing rate with a camera using the Eulerian Video Magnification algorithm. In this project, we were a team of 5 members; we all had healthy discussions on topics and helped each other throughout the internship. I have also worked in teams of 4-5 members on many college projects.


Passion

I love contributing to open source projects and feel very happy when my work is used by many people and when I am part of a project that many people will use. What I love most about the Public Lab community is how active and helpful its people are. In my personal experience, the difficulties I faced and the hurdles I had to cross to get my first patch in were much lower in this community than in others. I have learned a lot in this project and am looking forward to learning much more from the experienced mentors.

Audience

This project will help in the analysis of trends using contributor statistics, giving insight into research posts, comments, questions and answers posted by contributors. The analysis and insights will help the organization make decisions and plan properly in line with the trends observed.


18 Comments

@bansal_sidharth2996 @warren @sagarpreet @gauravano

Hello everyone, please review my proposal. If anything is missing or needs any change, please let me know. I'll fix it asap. Thanks!



Thanks so much for your proposal! We'll read it over and get feedback to you ASAP. 🎉



One industry program for statistics (or sadistics, as a nickname) is six sigma. This is a relatively expensive program that includes (not kidding) black belts.
You might want to investigate this program.



Hi @Ag8n,

I tried looking at six sigma. It seems to be a certification program in which one learns and improves their analytical skills. I'm not sure how it will help in improving our statistical system's architecture. Any suggestions?

Thanks!



There are a number of parts to six sigma. The different parts go from statistical analysis to design of experiment (DOE). While there are some businesses that require the certification, I don't think that would be beneficial for Public Lab. Many of the basic processes they teach would be good for Public Lab.

One part that seemed helpful was looking at manufacturing processes and having techs come up with possible causes that could lead to the errors. The D.O.E. was helpful in a pharmaceutical environment, as were the statistics. How you would implement them for your work, I don't know.



@Ag8n Ah, I see. Yeah, I think that's quite advanced and more towards mathematical statistics stuff. In this project, we need very basic knowledge of statistics. But still, thanks for sharing! :)



Hi @radhikadua, your proposal is really nice, mockups really look awesome. If I am not wrong, you also added some explanation with routes in some issue related to backend project, feel free to add that here if that's relevant. Also, how about filling the remaining sections, to complete your proposal. Thanks!



Hi @gauravano,

Thanks a lot for the review! :)

I have added the missing sections and added more explanation about the routes, code flow etc. In case, more detail is required in any section, please let me know. I'll elaborate them even further. Thanks!



@radhikadua your proposals are so detailed and amazing, and you are so cool!



This is a great proposal, thanks!

I appreciate your attention to detail on things like redirects from old paths -- great thought!

for Week 4: Select js chart library and add "No. of subscribers per tag" chart. - I do wonder if this could get the project bogged down a bit, since there have been several chart lib changes and there might be a lot of refactoring. I am slightly in favor of leaving this until later in the project, and making do with simpler charts that are possible with the current libraries, until the relatively complex back-end work is more developed. But I could be persuaded! :-)

Week 1 and 2: Create backend GET APIs for requesting JSON/CSV data for all graphs along with unit tests.

This sounds great, esp. the unit tests. I did want to ask if you thought the API (grape/swagger) was the right place to implement this -- i think it may be, just want to think through it! Would that make integration with, for example, GrimoireLab easier, because Grape/Swagger is so standardized? Does Grape/Swagger have standard ways to generate CSV? Do you prefer this to making additional formats available in the stats_controller, and why? This is an interesting point so I just wanted to dig into it a bit and hear your thoughts, thanks!

Your idea for a secondary read-only helper database (let's avoid the term slave as other software projects have done, happily) is pretty interesting -- do you have any links or resources related to doing this in Rails? Thanks!

I like all the thinking you're putting into optimization. I think actually a lot of optimization can come from what the UI defaults are -- for example, if we offer monthly stats, and they're for specific calendar months instead of just 30-day ranges, people will tend to use those over and over, making caching more effective. We don't have to forbid other types of queries, but the defaults will be much more heavily used. But all these approaches will be useful, no doubt. Great details here!

Thanks a lot for both this detailed proposal and for your excellent energy in taking on features and problems across the codebase! And for helping other newcomers by breaking out and documenting issues.

The only thing I'd finish with is to say that you should be sure to pace yourself -- don't exhaust yourself, and find a pacing which you can sustain, and have fun! Thanks again!



@thayshi Thanks a lot! It really feels good to hear this from a fellow applicant. If you need any help with your proposal or some issue you are working on, just reach out to me on GitHub or on gitter (@Radhikadua123) and I'll try my best to help you out (though I'm also a beginner)! :)



@radhikadua Maybe you can give me some advice on how to improve my proposal? I did it at the last moment because I was a little confused and didn't know how to do it.



@thayshi Sure, I'll contact you on gitter in a few mins. :)



@warren Thanks a lot for the detailed review of the proposal!

for Week 4: Select js chart library and add "No. of subscribers per tag" chart. - I do wonder if this could get the project bogged down a bit, since there have been several chart lib changes and there might be a lot of refactoring. I am slightly in favor of leaving this until later in the project, and making do with simpler charts that are possible with the current libraries, until the relatively complex back-end work is more developed. But I could be persuaded! :-)

I understand your concerns and yeah, I can work on backend stuff before working on frontend.

This sounds great, esp. the unit tests. I did want to ask if you thought the API (grape/swagger) was the right place to implement this -- i think it may be, just want to think through it! Would that make integration with, for example, GrimoireLab easier, because Grape/Swagger is so standardized? Does Grape/Swagger have standard ways to generate CSV?

Oh, I didn't know about grape/swagger. Thanks for pointing it out. It seems that with Grape we can specify the content type, and it's up to the developer how she wants to format the content. It provides a custom formatter option in which we can specify how to format our content. It also allows specifying a download response header for an endpoint. So, I think it will be perfect for creating API endpoints. I'll research it more to get to know it better.

About the GrimoireLab, I think I misunderstood some things about it. I have created a comment on https://github.com/publiclab/plots2/issues/3498#issuecomment-435880361 to clear some things. Though for the GrimoireLab's perceval tool, it would work fine both ways - with and without using Grape/Swagger.

Do you prefer this to making additional formats available in the stats_controller, and why? This is an interesting point so I just wanted to dig into it a bit and hear your thoughts, thanks!

Yes, earlier I planned to add them into the controller but after you told me about grape, I think they will go into the api instead of controller.

I'll be adding the option of additional formats in the API. In my implementation of these range-based charts, I'll just add GET APIs which won't render any view. They will respond directly with data in some format. Since json works best for the browser/js, I'll respond with json by default. Also, with this approach, the exact same API can be used for requesting data for download in either json or csv format.

Let me describe it a bit by code to make things more clear:

stats_module:

def get_some_foo_stats(type)
  stats = []
  # Fetch stats from db and append it to array
  stats
end

stats_api:

class NotesServer < Grape::API
  default_format :json
  formatter :csv, CSVFormatter

  resource :stats do
    desc 'Return stats for publiclab'
    params do
      requires :type, type: String, desc: 'Stats type.'
    end
    route_param :type do
      get do
        # Get data from Stats module
        Stats.get_some_foo_stats(params[:type])
      end
    end
  end
end

stats_controller: Empty for range based APIs

stats_view: Empty for range based APIs
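The `CSVFormatter` referenced in the API class is not spelled out here. A minimal sketch of what it might look like follows, assuming the stats come back as an array of flat hashes (like the `{x: 1, y: 2}` responses in the proposal) and that the formatter is an object responding to `call(object, env)`. The header handling is an assumption made for this example.

```ruby
require 'csv'

# Hypothetical formatter: turns an array of flat hashes into CSV text.
# The keys of the first row are used as the header line (an assumption).
module CSVFormatter
  def self.call(object, _env = nil)
    rows = object.to_a
    return '' if rows.empty?

    headers = rows.first.keys
    CSV.generate do |csv|
      csv << headers
      # Emit each row's values in header order so columns stay aligned.
      rows.each { |row| csv << headers.map { |key| row[key] } }
    end
  end
end

puts CSVFormatter.call([{ x: 1, y: 2 }, { x: 2, y: 10 }])
```

Because the formatter only depends on the returned data, the same stats module output can serve both the json and csv formats.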

I'll add more details to the proposal about where I'll be adding these parts of the code.

Your idea for a secondary read-only helper database (let's avoid the term slave as other software projects have done, happily) is pretty interesting -- do you have any links or resources related to doing this in Rails? Thanks!

Here's the project which we can use : https://github.com/thiagopradi/octopus

We only have to use the replication feature of this project. The sharding feature is more complex and we don't need it as of now.

I like all the thinking you're putting into optimization. I think actually a lot of optimization can come from what the UI defaults are -- for example, if we offer monthly stats, and they're for specific calendar months instead of just 30-day ranges, people will tend to use those over and over, making caching more effective. We don't have to forbid other types of queries, but the defaults will be much more heavily used. But all these approaches will be useful, no doubt. Great details here!

Yes, you are right. We will first cache the results of the UI defaults. For the range-based queries, the caching logic would be quite complex if we want to make it generic. A read-only db can be helpful in that case.

I can try to explain caching logic in case of range based queries:

Steps:

  1. We will create a script which caches all the results weekly. On first deployment, we will run it over all the weeks till now. The cache then holds stats saved per week.

  2. Every weekend, a cronjob will run on the server and cache the stats of the last week.

  3. When a user selects a range like 2-10-2018 to 29-10-2018, we will get the weekly stats from the cache for all the weeks that include the given dates. The starting week and ending week, which contain the start date and end date respectively, have some extra days that have to be excluded from the stats. So, we will get the stats for those extra days by querying the db and subtract them from the previously computed weekly stats.

This is just a basic idea for which we need to think about the generic solution, edge cases etc. I don't plan to implement this technique because implementing it won't be easy given the time constraints.
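To make the arithmetic in step 3 concrete, here is a small plain-Ruby sketch. Everything in it is hypothetical: a hash of per-day counts stands in for the database, weeks are assumed to start on Monday, and the cache is an in-memory hash keyed by week start. It only shows the "sum cached weeks, then subtract the extra days" idea and does not handle an uncached current week.

```ruby
require 'date'

# Monday of the week containing the given date (weeks assumed to start on Monday).
def week_start(date)
  date - ((date.wday + 6) % 7)
end

# Stand-in for a database query: counts items between two dates, inclusive.
def count_from_db(daily_counts, from, to)
  (from..to).sum { |day| daily_counts.fetch(day, 0) }
end

# Steps 1 and 2: precompute one total per week.
def build_weekly_cache(daily_counts, first_day, last_day)
  cache = Hash.new(0)
  (first_day..last_day).each { |day| cache[week_start(day)] += daily_counts.fetch(day, 0) }
  cache
end

# Step 3: sum every cached week touching the range, then subtract the days
# of the first and last week that fall outside the requested range.
def range_total(weekly_cache, daily_counts, start_date, end_date)
  first_week = week_start(start_date)
  last_week = week_start(end_date)

  total = first_week.step(last_week, 7).sum { |week| weekly_cache.fetch(week, 0) }
  total -= count_from_db(daily_counts, first_week, start_date - 1)
  total -= count_from_db(daily_counts, end_date + 1, last_week + 6)
  total
end

# Usage with one item per day as sample data.
days = (Date.new(2018, 9, 24)..Date.new(2018, 11, 11)).to_a
daily_counts = days.to_h { |day| [day, 1] }
cache = build_weekly_cache(daily_counts, days.first, days.last)
puts range_total(cache, daily_counts, Date.new(2018, 10, 2), Date.new(2018, 10, 29))
```

With this scheme the db is queried for at most 12 extra days per request (up to 6 in each boundary week), regardless of how long the selected range is.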

The only thing I'd finish with is to say that you should be sure to pace yourself -- don't exhaust yourself, and find a pacing which you can sustain, and have fun! Thanks again!

Haha, it's just that I'm quite happy as well as excited to work on these problems. So, I might be overdoing things a bit.



@radhikadua I really liked the way you have explained the implementation details. Also, you have completed many contributions towards plots2. I really appreciate that. I saw your projects. They are great.



It will be great if you can take some time in the timeline for engaging other contributors to get involved in the project by creating first-timer-only issues.



Thanks! @bansal_sidharth2996

Yeah sure. I can create first-timer-only issues for the tasks which I plan to do in later weeks, so that I can keep up with the timeline without blocking myself and along with that engage new contributors with tasks which I plan to do later on.



Hi @shubhamsangamnekar9, please do your testing in a separate test note, as every comment you post here sends a notification to all the people on the thread. Thanks!
