The Python file process_stats.py is what I ran to generate the clustering, so if there's a conflict between what the code says and what I wrote here, the site is doing whatever that script says.
I first used a PageRank-inspired formula to calculate affinities between repositories. A common way to interpret the numerical value that the PageRank algorithm outputs is to imagine a random web surfer: the surfer starts at an arbitrary page; with 85% probability, they click a random link on that page, and with 15% probability, they jump to a completely random page. If the surfer ever gets to a page with no outbound links, they jump to a completely random page. For any given web page, the PageRank is the probability that, at any given instant, the surfer will be visiting that page.
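For reference, here's a minimal sketch of that random-surfer model; the link graph is made-up example data, not anything from my crawl:

```python
import random

# Toy link graph: page -> list of pages it links to (made-up example data).
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": [],  # a page with no outbound links
}
pages = list(links)

def simulate_pagerank(steps=100_000, damping=0.85):
    """Estimate each page's PageRank by simulating the random surfer for `steps` clicks."""
    visits = {page: 0 for page in pages}
    current = random.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if links[current] and random.random() < damping:
            current = random.choice(links[current])  # follow a random link on the page
        else:
            current = random.choice(pages)           # jump to a completely random page
    return {page: count / steps for page, count in visits.items()}
```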
In order to apply that to GitHub, I imagine a random GitHub surfer (Steve). Steve starts at an arbitrary project, and then chooses a contributor (Courtney) to that project with a probability that's a function of how many of the contributions Courtney is responsible for; the probability is log(1 + numContributions) / sum(log(1 + numContributions)). Steve then chooses a new repository to move to as follows: 1/3rd of the time, he jumps to a randomly chosen repository that Courtney has contributed to; 2/3rds of the time, he jumps to a randomly chosen repository that she starred. There was nothing magic about the choice of 1/3 - it was an arbitrary number that produced fairly reasonable results early on, and I didn't try tweaking it.
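Here's a rough sketch of a single move of Steve's walk; the dicts (contributions_by_repo, contributed_repos, starred_repos) are illustrative stand-ins for the crawled data, not the actual structures in process_stats.py:

```python
import math
import random

def pick_contributor(contributions):
    """contributions maps contributor -> number of contributions to the current repo.
    Each contributor is chosen with probability log(1 + n) / sum(log(1 + n))."""
    names = list(contributions)
    weights = [math.log1p(contributions[name]) for name in names]
    return random.choices(names, weights=weights)[0]

def next_repo(repo, contributions_by_repo, contributed_repos, starred_repos):
    """One move of Steve's walk starting from `repo`."""
    courtney = pick_contributor(contributions_by_repo[repo])
    if random.random() < 1 / 3:
        # 1/3 of the time: a random repo Courtney has contributed to
        return random.choice(contributed_repos[courtney])
    # 2/3 of the time: a random repo Courtney has starred
    return random.choice(starred_repos[courtney])
```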
The affinity of repo A for repo B is the probability that Steve starts at repo A and ends up at repo B. To be precise, this would be true if I had crawled all of GitHub - since I didn't, the total outgoing probabilities for most of the repos are less than 1, either because I didn't crawl all of the possible target repos, or because I didn't crawl all of the intermediate contributors.
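If I read that as the probability of reaching B from A in a single move of the walk above (repo, then contributor, then repo), it can be computed directly rather than estimated by simulation. Here's a sketch under that reading, using the same illustrative dicts as before; probability mass simply goes missing for contributors or target repos that weren't crawled:

```python
import math
from collections import defaultdict

def affinities_from(repo, contributions_by_repo, contributed_repos, starred_repos):
    """Probability of landing on each repo B after one move of Steve's walk from `repo`.
    The values can sum to less than 1 when some contributors or targets weren't crawled."""
    result = defaultdict(float)
    contributions = contributions_by_repo[repo]
    total_weight = sum(math.log1p(n) for n in contributions.values())
    for courtney, n in contributions.items():
        p_courtney = math.log1p(n) / total_weight
        contributed = contributed_repos.get(courtney, [])
        starred = starred_repos.get(courtney, [])
        for target in contributed:
            result[target] += p_courtney * (1 / 3) / len(contributed)
        for target in starred:
            result[target] += p_courtney * (2 / 3) / len(starred)
    return dict(result)
```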
Once I had the affinities, I followed the paper on affinity propagation pretty closely, with two minor modifications. First, instead of using a constant number for the damping factor, I did something with a simulated-annealing flavor: I started with a factor close to 1 (so at each step, it paid very little attention to the message sent in the previous step) and kept multiplying by that number at each iteration, so the damping factor was 0.95 at the first step, 0.95^2 at the next, then 0.95^3, and so on. Second, I didn't bother to check for convergence; I just stopped after a set number of iterations (less code, and when I was testing, it had generally converged before the number of steps I set).
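This isn't the code from process_stats.py, just a compact sketch of the standard responsibility/availability updates from the paper, combined with the decaying damping schedule described above. Following that description, the factor here weights the freshly computed message (so a factor near 1 pays little attention to the previous step):

```python
import numpy as np

def affinity_propagation(S, iterations=200, factor=0.95):
    """Affinity propagation on an (n, n) similarity matrix S (preferences on the diagonal).
    Returns, for each input, the index of its chosen exemplar."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities
    A = np.zeros((n, n))  # availabilities
    for t in range(iterations):
        d = factor ** (t + 1)  # damping schedule: 0.95, 0.95**2, 0.95**3, ...

        # Responsibility update: r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second_max = AS.max(axis=1)
        new_R = S - first_max[:, None]
        new_R[np.arange(n), idx] = S[np.arange(n), idx] - second_max
        R = d * new_R + (1 - d) * R

        # Availability update: a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k})
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))
        col_sums = Rp.sum(axis=0)
        new_A = np.minimum(0, col_sums[None, :] - Rp)
        np.fill_diagonal(new_A, col_sums - np.diag(Rp))  # a(k,k) = sum of positive r(i',k), i' != k
        A = d * new_A + (1 - d) * A

    return np.argmax(A + R, axis=1)
```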
For a given set of inputs, affinity propagation produces a subset of the inputs known as the exemplars, and chooses an exemplar for every input. In the code I refer to all of the inputs with a given exemplar as the children of that exemplar. (The exemplar is its own child.) Once the algorithm had produced a set of exemplars, I re-ran the algorithm on only those exemplars, in order to produce a hierarchical structure. I ran the algorithm four times in total, including the first, so the leaves can be up to five levels deep. To understand how this might work, or if you're confused about what's being displayed, imagine that we started with the following eight repos:
Imagine that the first round produced the following set of exemplars:
And then the next round produced:
And then the third round produced:
For these results, the app would produce two top-level circles. Their tooltips would say:
Zooming in on either the jquery + bootstrap circle or the golang + docker circle would produce two more circles, each with two leaf nodes; the tooltips for the four intermediate-level circles would say:
In other words, represented as a tree, the structure is:
I excluded invalid-email-address and the four most obvious bot accounts I found.
GitHub's API doesn't return the contributor list for linux because it's too large, so I assigned a 100% probability of surfing from linux to Linus. If you have a manual fix for this, please feel free to submit a pull request.
As I mentioned earlier, this is supposed to be fun and hopefully useful - although I think it did a reasonable job overall, it's not supposed to represent my opinion about where any particular project belongs. It's the result of a fairly simple algorithm. Please enjoy it and don't take it too seriously!