The psycopg2 library is by far the most popular PostgresSQL adapter to use with Python. I have personally used it extensively to build a number of ETL frameworks across many organisations and have found it extremely easy to use and very versatile.

One area which I have found the library to struggle with is inserting/updating large amounts of data into database tables.

The library comes with (what I thought at the time) a nice and performant method executemany(query, params) which executes a SQL query against a list of vars.

I assumed that the executemany() method would be the fastest way to insert large amounts of data. I also assumed it would outperform looping over a list of vars and calling the execute(query, vars=None) method against each var (or tuple in this case).

After a bit of digging around, I read a few blog posts which confirmed my suspicions around the executemany() method being extremely slow. I also came across a post which claimed that building your insert statement as a long string (after argument binding) could be 60 times faster versus using the executemany() method… 60 TIMES FASTER!

I decided to put these theories to the test.

Below is a script which measures the time it takes to insert 100,000 / 1,000,000 / 10,000,000 rows into a local table using 4 different methods.

Method 1: Loops over our data-set and calls the execute() method row by row.
Method 2: Calls the executemany() method against a list of params.
Method 3: Builds a long string using the mogrify() method (to bind our args) which is then called once using the execute() method.
Method 4: Same as Method 3 without needing to use the mogrify() method.

The timeit() decorator first truncates the table we are using so that we have a empty table to work with for each method call. We start the timer once this is done, call our desired method and commit the rows and stop the timer once it has completed.

import time
import psycopg2

ROWS = [100000, 1000000, 10000000]

TABLE_NAME = 'test_table'


def get_data(rows):
    """
    Builds a list of tuples containing data to insert into our table
    """

    values = []
    for i in range(rows):
        values.append((i+1, 'foo'))
    return values


def timeit(method):
    """
    Decorator which first truncates our table and then measures the time it takes for a method call.
    In this case, the method call is inserting data into a table.
    """

    def timed(*args, **kwargs):
        # truncate the table before every method call
        kwargs.get('cursor').execute("truncate table {table} restart identity".format(table=TABLE_NAME))
        connection.commit()
        ts = time.time()
        method(*args, **kwargs)
        connection.commit()
        te = time.time()
        print "{method} took {secs} seconds inserting {rows} rows".format(
            method=method.__name__, secs=round(te - ts, 5), rows="{:,}".format(len(kwargs.get('data'))))
    return timed


@timeit
def execute_method(cursor, data):
    """
    Loops over our data-set and calls the execute method row by row
    """

    for row in data:
        cursor.execute("insert into {table} values (%s, %s)".format(table=TABLE_NAME), row)


@timeit
def execute_many_method(cursor, data):
    """
    Calls the executemany method with a list
    """

    cursor.executemany("insert into {table} values (%s, %s)".format(table=TABLE_NAME), data)


@timeit
def execute_mogrify_method(cursor, data):
    """
    Builds a string from our data-set using the mogrify method which is then called once using the execute method
    """

    args_str = ','.join(cursor.mogrify("(%s, %s)", row) for row in data)
    cursor.execute("INSERT INTO {table} VALUES".format(table=TABLE_NAME) + args_str)


@timeit
def execute_method_custom_args_to_string(cursor, data):
    """
    Builds a string from our data-set using the mogrify method which is then called once using the execute method
    """

    args_str = ",".join("('%s', '%s')" % (x, y) for (x, y) in data)
    cursor.execute("insert into {table} values".format(table=TABLE_NAME) + args_str)


def main():
    """
    Entry point for running all 4 methods for each number of rows
    """

    cursor.execute("create table if not exists {table} (id int primary key, name text)".format(table=TABLE_NAME))
    cursor.connection.commit()
    for row in ROWS:
        print "{:,} row section:".format(row)
        print "-------------"
        data = get_data(rows=row)
        execute_method(cursor=cursor, data=data)
        execute_many_method(cursor=cursor, data=data)
        execute_mogrify_method(cursor=cursor, data=data)
        execute_method_custom_args_to_string(cursor=cursor, data=data)
        print "\n"


with psycopg2.connect(host='127.0.0.1', user='postgres', database='postgres', port=5432) as connection:
    with connection.cursor() as cursor:
        main()

Below you can see the output of the script and the duration for each method call when inserting each number of rows.

The results are pretty drastic to say the least.

100,000 row section:
-------------
execute_method took 5.44358 seconds inserting 100,000 rows
execute_many_method took 5.23796 seconds inserting 100,000 rows
execute_mogrify_method took 0.48862 seconds inserting 100,000 rows
execute_method_custom_args_to_string took 0.36966 seconds inserting 100,000 rows


1,000,000 row section:
-------------
execute_method took 60.37871 seconds inserting 1,000,000 rows
execute_many_method took 57.15132 seconds inserting 1,000,000 rows
execute_mogrify_method took 5.76778 seconds inserting 1,000,000 rows
execute_method_custom_args_to_string took 4.10175 seconds inserting 1,000,000 rows


10,000,000 row section:
-------------
execute_method took 578.74676 seconds inserting 10,000,000 rows
execute_many_method took 547.02219 seconds inserting 10,000,000 rows
execute_mogrify_method took 53.10888 seconds inserting 10,000,000 rows
execute_method_custom_args_to_string took 42.77445 seconds inserting 10,000,000 rows

Conclusion:

– As expected, the slowest out of all 4 methods was the execute_method() given that execute() gets called once for every tuple in our get_data() sequence.

– The execute_many_method() method took roughly the same time as the execute_method() did. This suggests that the executemany() method is essentially looping over a given list of variables and executing each time.

– The biggest surprise to me was the sheer difference in duration between the execute_mogrify_method() and the execute_many_method(). Going from 57 seconds to 5 seconds (for 1,00,000 rows) and 547 seconds to 53 seconds (for 10,000,000 rows) is a HUGE saving in time.

– Avoid using the executemany() method at all costs if you want fast insert times when dealing with large amounts of data.

Leave a Reply

Your e-mail address will not be published. Required fields are marked *