Implementing DAGs with Python

Consider the ERD (Entity Relationship Diagram) in the picture above. How does an ETL Developer work out which order to load the tables in so that no errors are thrown due to foreign key constraints? This is where DAGs come in very handy. DAG stands for Directed Acyclic Graph. Let’s start off with the G, Graph: a collection of nodes and edges (which connect the nodes). For most ETL services, a specific job / task would represent a node and a dependency between any two jobs / tasks would represent an edge. The D, Directed, tells us that an […]
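As a rough illustration of the load-order question, here is a minimal sketch that uses the standard library's graphlib module (Python 3.9+). The table names and their foreign key dependencies are hypothetical and are not taken from the ERD; each table is a node and each foreign key dependency is an edge.

```python
from graphlib import TopologicalSorter

# Hypothetical tables: each key maps to the set of tables it references
# through foreign keys, i.e. the tables that must be loaded before it.
dependencies = {
    "customers": set(),
    "products": set(),
    "orders": {"customers"},
    "order_items": {"orders", "products"},
}

# static_order() yields every node after all of its predecessors, so
# loading the tables in this order satisfies the foreign key constraints.
load_order = list(TopologicalSorter(dependencies).static_order())

for position, table in enumerate(load_order, start=1):
    print(position, table)
```

If the dependencies contained a cycle, static_order() would raise graphlib.CycleError, which is one reason the graph has to be acyclic.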

Exploring Cycling Data with Python, Strava and PostgreSQL

For those of you who don’t know, Strava is a website and mobile app used to track athletic activity via GPS. I have been using Strava religiously to track all my cycling activities since April 2012, which has obviously left quite a large footprint in terms of data points. The website itself serves as a fantastic interface to track and analyse all your activities. However, being the data junkie that I am, I wanted all my historical data stored in a local database so I could easily query my own data using SQL and build some dashboards using […]

The psycopg2 library and execution speed

The psycopg2 library is by far the most popular PostgreSQL adapter to use with Python. I have personally used it extensively to build a number of ETL frameworks across many organisations and have found it extremely easy to use and very versatile. One area where I have found the library to struggle is inserting/updating large amounts of data in database tables. The library comes with (what I thought at the time) a nice and performant method, executemany(query, params), which executes a SQL query once for each set of parameters in a list. I assumed that the executemany() method would be the fastest way […]
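For reference, a call to executemany() might look like the sketch below. It assumes conn is an open psycopg2 connection created elsewhere; the activities table, its columns, and the insert_activities helper are hypothetical names used only for illustration.

```python
# Hypothetical target table: activities(id, name, distance_km).
INSERT_SQL = "INSERT INTO activities (id, name, distance_km) VALUES (%s, %s, %s)"


def insert_activities(conn, rows):
    """Insert rows (a list of 3-tuples) using cursor.executemany().

    conn is assumed to be an open psycopg2 connection.
    """
    with conn.cursor() as cur:
        # The query is executed once for each tuple in rows.
        cur.executemany(INSERT_SQL, rows)
    conn.commit()
    return len(rows)


# Example usage, assuming a reachable database and valid credentials:
#   import psycopg2
#   conn = psycopg2.connect("dbname=strava")  # hypothetical database name
#   insert_activities(conn, [(1, "Morning ride", 42.5), (2, "Commute", 12.0)])
```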