I’m always on the lookout for new datasets that we can use to show off the power of the CHAOSSEARCH platform. So I was incredibly excited when I learned about the GHTorrent project — which is an effort to build an offline version of all the data available in the GitHub APIs. It’s a very cool project that you should check out, and even consider donating one of your GitHub API keys to.
There are many ways to access the GHTorrent data. The project does a great job of making the data available for download, either as CSV files that can be restored into a MySQL database or as full MongoDB dumps of all the objects.
They also make the data available on Google BigQuery for free, which is how I got access to the datasets. BigQuery makes it easy to export data directly into Google’s object storage, so I was quickly able to download a full corpus of data including GitHub Users and Projects.
All the data was already in NDJSON, one of the many data formats that CHAOSSEARCH can natively analyze. After uploading it to Amazon S3, I was able to get the data indexed with the platform in just a few minutes. The great thing about the CHAOSSEARCH platform is that we don’t require users to set up index schemas or define mappings for their data. We can quickly and easily discover all the fields within your datasets and determine which are strings, integers, and so on.
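To make the idea of schema discovery concrete, here is a minimal sketch of inferring field types from NDJSON records. It is not how CHAOSSEARCH does it internally; it only illustrates the concept. The function name `infer_field_types` and the sample records (including the `login`, `id`, and `country_code` fields) are made up for illustration.

```python
import json
from collections import defaultdict


def infer_field_types(ndjson_text):
    """Map each top-level field to the sorted list of Python types seen for it."""
    types = defaultdict(set)
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # NDJSON is one JSON object per line; skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            types[key].add(type(value).__name__)
    return {key: sorted(seen) for key, seen in types.items()}


# Hypothetical sample records, not taken from the real dataset.
sample = "\n".join([
    '{"login": "alice", "id": 1, "country_code": "us"}',
    '{"login": "bob", "id": 2, "country_code": null}',
])
schema = infer_field_types(sample)
```

A field that shows up with more than one type, such as a string in one record and null in another, is exactly the kind of thing that hand-written mappings tend to miss.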
Once my data had finished indexing, I was ready to dive in and start asking questions about GitHub users and the kinds of projects they are building. The first thing I wanted to visualize was the growth of user creation across the entire dataset. The dataset available on Google BigQuery seems to cover roughly 2008 to 2016.
Checking the rate of user creation, we can see fairly even growth, with a few days of large spikes in new users. What’s most interesting about this graph is that because we are aggregating across years of data, we can see the seasonality of user creation: a small but noticeable dip towards the end of each year, likely coinciding with the holiday season.
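For readers who want to reproduce this kind of chart outside of Kibana, the bucketing behind it can be sketched in a few lines. This is a rough approximation of what a date histogram computes, assuming each user record carries a creation timestamp formatted as `YYYY-MM-DD HH:MM:SS`. The function name and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def signups_per_month(created_at_values):
    """Count user creations per calendar month, keyed by 'YYYY-MM'."""
    counts = Counter()
    for timestamp in created_at_values:
        created = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        counts[created.strftime("%Y-%m")] += 1
    # Sort by month so the result reads like a time series.
    return dict(sorted(counts.items()))


# Hypothetical timestamps, not taken from the real dataset.
sample_timestamps = [
    "2012-11-03 10:15:00",
    "2012-11-21 18:40:12",
    "2012-12-24 09:00:00",
    "2013-01-02 14:30:45",
]
monthly = signups_per_month(sample_timestamps)
```

Comparing the same month across several years is what makes a seasonal dip visible.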
From here I wanted to get some high-level counts to see where all the GitHub users live. One significant caveat about this dataset is that it’s all user-generated data. Users decide whether to enter their country, state, and city in their profile, and for many of the following questions, a majority of users didn’t bother to enter ANY details about their location. So the questions I’m asking apply only to the group of GitHub users that chose to fill out their user profile completely.
Given the caveat of user-generated profile data, let's find out which countries have the most GitHub users.
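The counting itself is simple, and the caveat about missing locations can be made explicit in code. The sketch below assumes each user record has a `country_code` field that is empty or null when the user left their location blank. That field name, the function name, and the sample records are all hypothetical.

```python
from collections import Counter


def top_countries(users, n=10):
    """Return the n most common countries and the share of users with a location."""
    located = [u["country_code"] for u in users if u.get("country_code")]
    coverage = len(located) / len(users) if users else 0.0
    return Counter(located).most_common(n), coverage


# Hypothetical records, not taken from the real dataset.
sample_users = [
    {"login": "a", "country_code": "us"},
    {"login": "b", "country_code": "us"},
    {"login": "c", "country_code": "in"},
    {"login": "d", "country_code": None},
    {"login": "e"},
]
ranking, coverage = top_countries(sample_users, n=2)
```

Reporting the coverage next to the ranking keeps the caveat in view: the ranking only describes users who filled in a location.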
It’s no real surprise that the United States is at the top of the user count, given that GitHub was founded in the United States and initially marketed within this region. I’m also not surprised to see that India is number 2 on the list. However, China ranking so high was very interesting and a surprise to me. I suppose that given the sheer number of people who live in China, even a small percentage being software engineers means a potentially large number of GitHub users there.
Just for fun — I decided to graph the growth of users over time for each of these countries. You can see that in China and India user creation started growing around 2011 - 2012.
Diving into just the users that are in the United States, I’m always curious to see which are the most popular states for GitHub users.
When doing a Kibana aggregation for all the users in the US, I was stunned to see that my current home state of Massachusetts ranked so far down on the list, and that we fell behind both Texas and New York (we have quite the rivalry with NY up here). I reached out to a few friends to discuss this anomaly, and our current hypothesis is that it reflects how many big enterprises there are in Massachusetts and the Greater Boston Area. These big companies, in industries such as healthcare, biotechnology, and manufacturing, don’t have a ton of modern software development going on. They would also have a tough time recruiting the types of folks that would be using GitHub in the first place.
Additionally, this dataset only lists where users CURRENTLY reside. So, while Boston has some excellent universities, it's even more likely that a large number of these software engineers leave for places like San Francisco, Seattle, and Austin, which further skews the numbers.
Given that analysis, let's dive in further and see what cities are the most popular for GitHub users.
There is a lack of standardization in the results because it’s user-generated data, but the results again show that the top cities for GitHub users are ALSO top cities for technology companies and technology startups in general. Those types of companies are likely to be doing modern software development and thus using GitHub to manage their source code repositories.
Finally, I wanted to do a quick search to see if I could track down MY user in GitHub, but I have TWO users that I created. Years ago when I created a GitHub account, I created a user “pcheslock” but then realized that everywhere else online my username was “petecheslock” (such as my Twitter @petecheslock). So let’s do a quick prefix-wildcard search for “*cheslock”.
What is incredible about CHAOSSEARCH is that due to the way we structure the data indexes in your Amazon S3 account, both prefix and postfix wildcard queries return at the same speed. No need to deal with fancy tokenizers like you might have to with Elasticsearch; we can handle these queries natively. Within seconds I’m able to search and find BOTH of my user accounts within this dataset!
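To show what the “*cheslock” pattern actually matches, here is a small sketch using Python's standard `fnmatch` module. It only demonstrates the matching semantics with a linear scan over a list; it says nothing about how CHAOSSEARCH indexes or executes the query. Apart from the two usernames mentioned above, the logins in the list are made up.

```python
from fnmatch import fnmatchcase


def wildcard_search(logins, pattern):
    """Return the logins that match a shell-style wildcard pattern."""
    return [login for login in logins if fnmatchcase(login, pattern)]


# "cheslockfan" and "octocat" are hypothetical logins added for contrast.
logins = ["pcheslock", "petecheslock", "cheslockfan", "octocat"]

# Wildcard at the front: anything ending in "cheslock".
leading = wildcard_search(logins, "*cheslock")

# Wildcard at the end: anything starting with "cheslock".
trailing = wildcard_search(logins, "cheslock*")
```

A wildcard at the front of the pattern is the case that many search systems handle slowly, because a term index sorted by prefix cannot be used to narrow the candidates.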
In my next post, I’m going to continue diving into this excellent dataset to learn more about the different projects that these users are creating in GitHub. We’ll find out which programming languages are the most popular, and which open source software licenses are the most and least used across these projects.