One of the highlights of working at a data search and analytics company — CHAOSSEARCH — is that I get to dive into some complex datasets and run queries in search of answers to really intriguing questions. Recently, I noticed that Blue Bikes, Boston’s great bike share program, has an incredibly rich dataset that is free to download and query.
Since the data is public on Amazon S3, I can use the AWS CLI to download the data and then upload it to my own Amazon S3 bucket.
aws s3 sync s3://hubway-data /path/to/my/hubway/
Processing the data in the CHAOSSEARCH platform happens in minutes. In my case, it actually took me longer to upload the data to my S3 bucket from my laptop than it did to have CHAOSSEARCH do the discovery and indexing.
How does it work? CHAOSSEARCH discovers the schema of the data at index time. Because of that, we don’t need to do any sort of parsing or configuration on the platform side. Simply index and start asking questions.
Initially, I wanted to start with broad questions: simple counts of Subscribers and Customers on the network. In the Blue Bikes world, “Customers” are considered the non-subscribers to the network. These are folks who don’t have a monthly plan and are either doing single trips or daily passes. “Subscribers” are the ones who subscribe to the monthly or annual plans.
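If you want to sanity-check a count like this outside the platform, here is a minimal sketch in plain Python. It assumes the trip CSVs carry a `usertype` column whose values are `Subscriber` or `Customer`; that column name, the other column names, and the sample rows are assumptions for illustration, not the query I ran in CHAOSSEARCH.

```python
import csv
import io
from collections import Counter

# Tiny inline sample standing in for a trip CSV.
# Column names and rows are assumed for illustration only.
SAMPLE_CSV = """starttime,start station name,end station name,usertype
2018-07-01 08:00:00,MIT at Mass Ave,Beacon St and Mass Ave,Subscriber
2018-07-01 09:30:00,South Station,Copley Square,Customer
2018-07-02 17:45:00,Beacon St and Mass Ave,MIT at Mass Ave,Subscriber
"""


def count_user_types(csv_file):
    """Count trips per rider type from a file-like object of trip rows."""
    reader = csv.DictReader(csv_file)
    return Counter(row["usertype"] for row in reader)


counts = count_user_types(io.StringIO(SAMPLE_CSV))
for user_type, trips in counts.most_common():
    print(user_type, trips)
```

To run this against a downloaded file, pass an open file handle instead of the `io.StringIO` sample.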
Now, let’s generate some date histograms showing the rate of rides over time. In this case, we’re looking at a per-day count of rides from 2015 through 2018. The multi-year seasonality is easy to see: as winter descends on Boston, only the most dedicated cyclists hop on a Blue Bike to get where they need to go.
Let’s roll up this data on a month-by-month basis and see if we can visualize how usage differs between Subscribers and non-subscribers.
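The daily histogram and the monthly rollup are the same operation with a different bucket size. Here is a rough sketch of that idea in plain Python. The timestamp format and the sample trips are assumptions made up for this example; the real charts came from the CHAOSSEARCH interface, not from this code.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample trips: (start time, rider type).
# Real rows would come from the trip CSVs; the timestamp format is assumed.
TRIPS = [
    ("2018-01-15 08:05:00", "Subscriber"),
    ("2018-01-15 18:20:00", "Subscriber"),
    ("2018-07-04 11:00:00", "Customer"),
    ("2018-07-04 12:30:00", "Subscriber"),
    ("2018-07-19 15:10:00", "Customer"),
]


def histogram(trips, bucket_format):
    """Count trips per (time bucket, rider type).

    bucket_format is a strftime pattern that defines the bucket size,
    for example "%Y-%m-%d" for days or "%Y-%m" for months.
    """
    counts = Counter()
    for start_time, user_type in trips:
        started = datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S")
        counts[(started.strftime(bucket_format), user_type)] += 1
    return counts


daily = histogram(TRIPS, "%Y-%m-%d")  # per-day buckets
monthly = histogram(TRIPS, "%Y-%m")   # month-by-month rollup

for (bucket, user_type), trips in sorted(monthly.items()):
    print(bucket, user_type, trips)
```

Keying each bucket on both the period and the rider type is what lets Subscribers and non-subscribers be plotted as separate series.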
Non-subscriber usage of Blue Bikes has held steady over the last four years, with no significant growth. Subscriber usage, on the other hand, has seen fairly consistent 20% growth over the last few years. We can also see that non-subscriber usage really drops off during the winter months. This tells us that if you don’t have a subscription to the Blue Bikes service, you probably aren’t going to try it out in January.
This dataset has most user-identifiable information removed, but it does include gender information for users who choose to enter those details. When we search for the number of men and women who ride Blue Bikes, we can see a clear imbalance.
Men outnumber women about 3 to 1. Because this data is self-reported and optional, it may be biased: if men feel more comfortable reporting their gender, for example, that would skew the results. Even so, it is still interesting to visualize this representation over time.
So where are people going? What are the most popular stations for the users of the system? The most popular starting station for all Customers and Subscribers is the “MIT at Mass Ave” station, with 156,123 trips started from there over the last four years. It is also the most popular end station in the entire system, with a nearly equal number of trips ending there.
But where do the rides that end there begin? It turns out that a majority of the trips that end at MIT begin at the Beacon St and Mass Ave station. So a good segment of the folks heading to the MIT station cross the Mass Ave bridge on a Blue Bike.
Those are the stats for the entire network — for both normal Customers as well as Subscribers. What are the most popular hubs for the non-commuters of the network? These are basically the more casual riders and tourists visiting Boston.
When we change our query to display the most popular start stations for Customers only, the results change in an interesting way. The top start and end locations are no longer all MIT Blue Bikes stations; Harvard takes the top spot.
The same MIT station follows close behind, simply due to the sheer volume of rides to and from that Blue Bikes station. But the next few stations now skew towards South Station and the three stations around the very popular tourist destinations of Copley Square, Back Bay, and the Boston Public Library.
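The "top stations" questions above all boil down to counting trips per start station, with an optional filter on rider type. Here is a minimal sketch in plain Python. The trips are invented sample data (the station names are borrowed from this post, but the counts are not real), and `top_start_stations` is a hypothetical helper, not part of any platform API.

```python
from collections import Counter

# Hypothetical trips: (start station, end station, rider type).
# These rows are made up and do not reflect the real ranking.
TRIPS = [
    ("MIT at Mass Ave", "Beacon St and Mass Ave", "Subscriber"),
    ("MIT at Mass Ave", "South Station", "Subscriber"),
    ("Beacon St and Mass Ave", "MIT at Mass Ave", "Subscriber"),
    ("Copley Square", "Copley Square", "Customer"),
    ("Copley Square", "South Station", "Customer"),
    ("MIT at Mass Ave", "Copley Square", "Customer"),
]


def top_start_stations(trips, user_type=None, n=3):
    """Return the n busiest start stations as (station, trip count) pairs.

    If user_type is given, only trips by that rider type are counted.
    """
    counts = Counter(
        start
        for start, _end, kind in trips
        if user_type is None or kind == user_type
    )
    return counts.most_common(n)


print(top_start_stations(TRIPS))                        # whole network
print(top_start_stations(TRIPS, user_type="Customer"))  # casual riders only
```

Swapping `start` for the end station in the generator gives the matching ranking of end stations.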
This got me wondering: what’s the LEAST popular station in the entire Blue Bikes network? Where did almost no one start or end a ride? If we cut out the noise from various testing stations that can’t be matched to the official Blue Bikes station map, we find a station at “Belgrade Ave at Walworth St” where only 9 rides started. (As a note, this is a new station in the network; it opened in November 2018.)
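Finding the least popular station is the same count turned upside down, once the test stations are filtered out. A small sketch of that follows, in plain Python. The sample values and the `EXCLUDED` set are hypothetical; how test stations are actually labeled in the dataset is an assumption here.

```python
from collections import Counter

# Hypothetical start-station values, one entry per trip.
START_STATIONS = [
    "MIT at Mass Ave", "MIT at Mass Ave", "MIT at Mass Ave",
    "South Station", "South Station",
    "Belgrade Ave at Walworth St",
    "Test Station (hypothetical)",
]

# Station names to ignore. The real test-station labels are not known here,
# so this placeholder name is an assumption.
EXCLUDED = {"Test Station (hypothetical)"}


def least_used_station(start_stations, excluded=frozenset()):
    """Return (station, trip count) for the station with the fewest starts."""
    counts = Counter(s for s in start_stations if s not in excluded)
    return min(counts.items(), key=lambda item: item[1])


station, trips = least_used_station(START_STATIONS, EXCLUDED)
print(station, trips)
```

One caveat with this approach: a station with zero trips never appears in the trip records, so it cannot show up in this count at all.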
So where do these users ride to?
Interestingly, many of those trips ended at different stations from one another. And none of the rides ended back at the station where they started, a round-trip pattern we’ve seen more often when trips leave from areas frequented by tourists.
We can learn more to reinforce our hypothesis by inverting our search and trying to find all the start stations where “Belgrade Ave at Walworth St” is the end station.
This now shows many of the same stations that we found in our last query, strongly suggesting that this station is being used not for local neighborhood trips but for round trips to and from a small group of destinations.
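The two queries are mirror images: one filters on the start station and counts end stations, the other filters on the end station and counts start stations. Here is a rough sketch in plain Python of how the overlap could be checked. The trips and the "Station A/B/C" names are placeholders invented for the example, and both helper functions are hypothetical.

```python
from collections import Counter

STATION = "Belgrade Ave at Walworth St"

# Hypothetical trips: (start station, end station).
# "Station A/B/C" are placeholder names, not real Blue Bikes stations.
TRIPS = [
    (STATION, "Station A"),
    (STATION, "Station B"),
    (STATION, "Station A"),
    ("Station A", STATION),
    ("Station B", STATION),
    ("Station C", "Station A"),
]


def destinations_from(trips, station):
    """Count end stations for trips that start at the given station."""
    return Counter(end for start, end in trips if start == station)


def origins_into(trips, station):
    """Count start stations for trips that end at the given station."""
    return Counter(start for start, end in trips if end == station)


outbound = destinations_from(TRIPS, STATION)
inbound = origins_into(TRIPS, STATION)

# Stations that appear in both directions hint at there-and-back travel.
shared = sorted(set(outbound) & set(inbound))
# Trips that start and end at the same station (0 in this sample).
same_station_trips = outbound[STATION]

print("outbound:", dict(outbound))
print("inbound:", dict(inbound))
print("shared:", shared)
```

A large overlap between the two sets is consistent with the there-and-back pattern described above, though with only a handful of rides it is a hint rather than proof.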
Stay tuned for Part 2 of our Blue Bikes data dive next week where we investigate more details about the ridership across the system, including where we find a Blue Bike that took a nearly year-long trip around Boston!
Do you wish you could also quickly and easily get insight and answers into your data? All without having to move that data out of your Amazon S3 account? Sign up today for a free trial of CHAOSSEARCH and start getting answers.