Recently I have been asked a few questions about how the tag graph was implemented. So I figured ...
1) If two tags belong to the same node, do they have an edge between them?
The tags don't belong to nodes; the nodes are actually the tags themselves. Two "tag nodes" have an edge between them when they occur on the same page on the plots website. I believe that goes for any page, be it a research note or a wiki page. Take the following page (whoa, meta) for example: Here you see the tags are:
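As a sketch of that edge rule (with hypothetical page/tag data; the real site would build this from its database), every pair of tags appearing on the same page gets an edge:

```ruby
# Hypothetical pages-to-tags data; on the real site this comes from the database.
pages = {
  "page-1" => ["balloon-mapping", "kite-mapping", "aerial"],
  "page-2" => ["balloon-mapping", "aerial"]
}

# Build the edge set: one edge per pair of tags that co-occur on a page.
# Sorting each pair makes [a, b] and [b, a] count as the same edge.
edges = pages.values
             .flat_map { |tags| tags.combination(2).map(&:sort) }
             .uniq
# => [["balloon-mapping", "kite-mapping"],
#     ["aerial", "balloon-mapping"],
#     ["aerial", "kite-mapping"]]
```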
Follow up: @sagarpreet, in exploring export options for a different project, I just discovered that you can extract the visual attributes (i.e. color, size, positions) by exporting to one of the files that supports this (see matrix image here). Open the .gephi file, then go to File --> Export --> Graph File. When you choose a supported file format, the "Options" button should become "clickable". Click on it, and make sure to check off the boxes for any attributes you'd like.
Personally, since I'm interested in web visualizations but not in implementing algorithms for things like community detection and calculating node sizes, I think this is a great way to quickly translate the static visualizations from Gephi into something dynamic. The other plus is that I'd prefer to see the changes I make update in real time without having to re-run the program.
Again, for all others, we'll get some files up real soon!
@bsugar added some really excellent research on finding associated tags that we can incorporate into the API planning. Just copying it in here to keep as a reference as this moves forward. I'll also link to the long issue where this has been worked on: https://github.com/publiclab/plots2/issues/1502
Okay, so translating the co-occurrences to Ruby will require a bit more code than I thought it would (see second paragraph). Unfortunately, I can't find any co-occurrence libraries in Ruby (you might have better luck).
However, there are a bunch of recommendation engines available in Ruby, which I guess makes sense given Ruby's popularity as a back end for websites, which inevitably include e-commerce. Recommendify made sense to me right away: instead of users purchasing products, you could treat each post as a user who "purchases" tags. There's also Recommendable, but it was more opaque to me.
If you want to use the method that was employed to make the graph above, minus the making of the graph, you'd have to translate this code into the Ruby equivalents of these commands. You don't need the export2graphml function.
Essentially, you want a file that looks like this. It won't be hard to calculate the counts of the tags individually, but the trickier part is counting the co-occurrences. Once you have those figures, calculating the observed-to-expected ratio is easy and is detailed here
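A minimal sketch of that counting step in plain Ruby (with hypothetical post/tag data; a real implementation would query the database instead):

```ruby
# Hypothetical data: each post maps to its list of tags.
posts = [
  ["spectrometer", "diy"],
  ["spectrometer", "diy"],
  ["spectrometer", "calibration"],
  ["diy", "calibration"]
]

# Individual tag counts (the easy part).
tag_counts = Hash.new(0)
posts.each { |tags| tags.each { |t| tag_counts[t] += 1 } }

# Co-occurrence counts (the trickier part): how many posts contain each pair.
pair_counts = Hash.new(0)
posts.each do |tags|
  tags.combination(2).map(&:sort).each { |pair| pair_counts[pair] += 1 }
end

# Observed-to-expected ratio for a pair of tags A and B:
#   oe_ratio = (total_posts * count_AB) / (count_A * count_B)
def oe_ratio(total_posts, pair_counts, tag_counts, a, b)
  (total_posts.to_f * pair_counts[[a, b].sort]) /
    (tag_counts[a] * tag_counts[b])
end
```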
@bsugar i had a quick question -- would it be OK to only collect /some/ of the related tags of each tag, i.e. limit the number of edges that we look up for each tag record? As of recently we have an optimized Tag.related(tagname) method that efficiently returns the 5 tags most often used with the given tag.
In this code, I was able to collect the 5 most-related tags for each of the top 250 tags, and it runs in about 8-11 seconds on the production site:
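For readers without access to that code, the shape of the lookup is roughly like this (a hypothetical in-memory stand-in; the real `Tag.related` in plots2 is a database query):

```ruby
# Hypothetical stand-in for Tag.related(tagname):
# return the n tags that co-occur most often with the given tag.
def related(tagname, pair_counts, n = 5)
  pair_counts
    .select { |pair, _count| pair.include?(tagname) }   # pairs involving the tag
    .sort_by { |_pair, count| -count }                  # most frequent first
    .first(n)
    .map { |pair, _count| (pair - [tagname]).first }    # keep the other tag
end

pair_counts = {
  ["a", "b"] => 10, ["a", "c"] => 7, ["a", "d"] => 3,
  ["b", "c"] => 5
}

related("a", pair_counts, 2)  # => ["b", "c"]
```

Capping each tag at its top n partners keeps the edge lookup cheap; the trade-off is discussed in the next comment.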
Sorry @warren! Looks like this may have been addressed in the github conversation. However, for those that come after: given the goals, which are probably satisfied by an approximation, I don't see why it wouldn't suffice.
I think the downside is the one that I mentioned in the comment. The edge weights are created using something called the observed-vs.-expected odds ratio:
oe_ratio = (all_questions_count * tag_count_AB) / (tag_count_A * tag_count_B)
Pulling from the github comment
This method is one way to handle the case where an edge or node may be important despite low usage. For example, at a store, 100 people might have an 85% probability of buying coffee and cream, but five of those people always purchase coffee, cream, and eggs. So I'd definitely want to keep five cartons of eggs in stock.
So, will five work? I think so. Yes. But what will technically happen is that you won't always know to keep "the eggs" (a specific tag) in "stock" (on the graph), as it were, since you've presumed that you only want the top five associated "products" (tags).
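To make the ratio concrete with made-up numbers: suppose there are 1000 posts in total, tag A appears on 50 of them, tag B on 40, and they co-occur on 10.

```ruby
all_questions_count = 1000  # total posts (hypothetical)
tag_count_A  = 50           # posts tagged A
tag_count_B  = 40           # posts tagged B
tag_count_AB = 10           # posts tagged with both A and B

oe_ratio = (all_questions_count.to_f * tag_count_AB) /
           (tag_count_A * tag_count_B)
# => 5.0
```

By chance alone you'd expect A and B to co-occur on about 50 * 40 / 1000 = 2 posts; observing 10 gives a ratio of 5, so the pair is far more associated than its raw counts suggest. That's exactly how a low-usage "eggs" tag can still earn a strong edge.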