For those of you who don’t know, Strava is a website and mobile app used to track athletic activity via GPS. I have been using Strava religiously for tracking all my cycling activities since April 2012 which has obviously left quite a large footprint in terms of data points. The website itself serves as a fantastic interface to track and analyse all your activities, however being the data junkie that I am, I was after having all my historical data stored in a local database so I could easily query my own data using SQL and build some dashboards using Tableau.

The script I wrote to retrieve all my historic data can be found here: Data Fetcher Script
To run the script, follow the instructions shown in the README.md

In a nutshell, the script grabs all my historical data using the stravalib library which allows interaction with v3 of the Strava API, carries out some transformations on this data and finally inserts it into a local PostgresSQL database table strava_strava (version 9.5 or higher is required).

The script finishes off by printing out a summary of some key metrics every cyclist likes to know. In my case I got the following output:

Hello Aaron Olszewski. You have 199 followers on Strava

        You have recorded 2,911 activities between 2012-04-10 and 2017-08-08

        Cycled 56,224 miles

        Climbed 2,171,571 feet

        Burned 2,210,160 calories

Each “activity” or bike ride in my case represents a row in my strava_strava table which looks like this:

        COLUMN         |       TYPE       | Modifiers
-----------------------+------------------+-----------
 activity_id           | INTEGER          | NOT NULL
 name                  | text             | NOT NULL
 _date                 | DATE             | NOT NULL
 distance_miles        | NUMERIC(10,4)    |
 avg_power             | NUMERIC(10,4)    |
 moving_time_seconds   | NUMERIC(18,4)    |
 elapsed_time_seconds  | NUMERIC(18,4)    |
 kudos_count           | INTEGER          |
 elevation_feet        | NUMERIC(18,4)    |
 kilojoules            | NUMERIC(10,4)    |
 city                  | text             |
 country               | text             |
 latitude              | DOUBLE PRECISION |
 longitude             | text             |
 is_stationary_trainer | BOOLEAN          | NOT NULL

We can now write some relatively simple SQL queries to get some high level statistics on all my cycling data.

Below are some averages for all my activities to date:

SELECT
SUM(distance_miles) / COUNT(CASE WHEN is_stationary_trainer = FALSE THEN activity_id ELSE NULL END) AS average_miles_per_activity,
SUM(kilojoules) / COUNT(activity_id) AS average_calories_per_activity,
SUM(elevation_feet) /  COUNT(CASE WHEN is_stationary_trainer = FALSE THEN activity_id ELSE NULL END) AS average_feet_climbed_per_activity

FROM strava_strava;

average_miles_per_activity | average_calories_per_activity | average_feet_climbed_per_activity
----------------------------+-------------------------------+---------------------------------
        20.4816248270840917 |          759.5428914835164835 |              791.7968756825627958

Whilst I was working in London (and using Strava) my commute to and from work ranged between 10 and 15 miles (depending on where I was living and where I was working). My commuting therefore made up a lot of my yearly mileage. We can look at my most popular distances traveled to confirm this.

SELECT
distance_miles::INT,
COUNT(1)::FLOAT / (SELECT COUNT(1) FROM strava_strava WHERE is_stationary_trainer = FALSE) * 100 percentage_of_total

FROM strava_strava
WHERE is_stationary_trainer = FALSE
GROUP BY 1
ORDER BY 2 DESC LIMIT 10;

 distance_miles | percentage_of_total
----------------+---------------------
             13 |     15.728296403923
             14 |    11.7326552851435
             12 |    7.55539411551035
              9 |    6.97420995277879
             15 |    5.73919360697421
             11 |    4.50417726116963
             16 |    2.97856883399927
             20 |    2.68797675263349
             10 |    2.47003269160915
             21 |    1.77987649836542
(10 ROWS)

We can clearly see from the above query that just over 52% of ALL my activities were done where the distance traveled was between 10-15 miles which was the spread in distances between my flat at the time and my job in Central London.

It comes as no surprise that Fridays are the weekday I have covered the least amount of mileage on. This is most probably due to going out in the evening / gearing up for a big ride on the weekend. The results speak for themselves.

SELECT
-- force the beginning of the week to be Monday
CASE WHEN EXTRACT(dow FROM _date) = 0 THEN 7 ELSE EXTRACT(dow FROM _date) END day_of_week_index,
to_char(_date, 'day') day_of_week,
SUM(distance_miles)

FROM strava_strava

GROUP BY 1, 2
ORDER BY 1 ASC

 day_of_week_index | day_of_week |    SUM    
-------------------+-------------+-----------
                 1 | monday      | 8254.7693
                 2 | tuesday     | 9068.1937
                 3 | wednesday   | 7942.8680
                 4 | thursday    | 7468.0378
                 5 | friday      | 6575.2385
                 6 | saturday    | 8624.4097
                 7 | sunday      | 8511.0013
(7 ROWS)

One calculation which I always like to monitor is how I compare to this time last year or this time three years ago. To calculate this we need a YTD calculation for the current year and all previous years.

elect
dates._year,
COUNT(CASE WHEN _date::DATE BETWEEN dates.first_of_year AND dates.relative_current_date THEN activity_id ELSE NULL END) ytd_activities,
SUM(CASE WHEN _date::DATE BETWEEN dates.first_of_year AND dates.relative_current_date THEN distance_miles ELSE 0 END) ytd_miles,
SUM(CASE WHEN _date::DATE BETWEEN dates.first_of_year AND dates.relative_current_date THEN elevation_feet ELSE 0 END) ytd_feet,
SUM(CASE WHEN _date::DATE BETWEEN dates.first_of_year AND dates.relative_current_date THEN kudos_count ELSE 0 END) ytd_kudos_count,
SUM(CASE WHEN _date::DATE BETWEEN dates.first_of_year AND dates.relative_current_date THEN photo_count ELSE 0 END) ytd_photo_count,
SUM(CASE WHEN _date::DATE BETWEEN dates.first_of_year AND dates.relative_current_date THEN kilojoules ELSE 0 END) ytd_calories

FROM  

-- sub query to get year, first day of year and a relative current date to the year in question

(SELECT
EXTRACT(YEAR FROM _date) _year,
to_char(_date,'YYYY-01-01')::DATE first_of_year,
(CURRENT_DATE - INTERVAL '1 year' * (EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM _date)))::DATE relative_current_date

FROM strava_strava
GROUP BY 1, 2) dates, strava_strava

GROUP BY 1
ORDER BY 1 ASC;


YEAR | ytd_activities | ytd_miles |  ytd_feet   | ytd_kudos_count | ytd_photo_count | ytd_calories
-------+----------------+-----------+-------------+-----------------+-----------------+--------------
  2012 |            207 | 4374.5080 | 146327.1048 |              71 |               0 |  142649.8000
  2013 |            450 | 7518.2320 | 292851.3890 |             545 |               0 |  317055.6000
  2014 |            320 | 6828.2657 | 224237.5397 |             519 |               0 |  274539.8000
  2015 |            346 | 7330.2409 | 284370.4167 |            1147 |               0 |  321416.0000
  2016 |            251 | 4601.6460 | 182319.8876 |             718 |              20 |  189761.2000
  2017 |            196 | 5341.2694 | 393057.7551 |             942 |              81 |  216410.2000
(6 ROWS)

As you can see, so far in 2017 I have pretty much climbed double the amount I did in 2016. This comes as no surprise as I moved to San Francisco at the beginning of 2017 and there is no hiding away from hills in the Marin Headlands.

You can really start to analyze your data with ease once you have all your data-points in a database somewhere.

Finally, a Tableau Report I created using the above data extract: Strava Tableau Dashboard

Leave a Reply

Your e-mail address will not be published. Required fields are marked *