#

You don’t know till you know

I find that whenever I’m in a rush to get something working, I have a bad habit of just getting into writing code and getting the thing done. More often than not, this results in messy code, and/or slow code. And then when I need to do something similar again, I just re-use the original code and end up with a potential compounded problem.

One of the more common tasks that I repeatedly do is take a bunch of files and insert the data into a Postgres database. Up until recently, I haven’t had to deal with large enough datasets, so my poorly written code was still acceptable in terms of execution time. Usually this means executing the insert script locally on a test database first, and then on production. I’m using Python 3 and the Psycopg2 postgres driver.

Psycopg2 execute and execute_values

The original code looked something like:

def insert_data(filename, date):

    sql = """
    INSERT INTO test (a, b, c, d)
    VALUES (%s, %s, %s, %s)
    """

    with open(filename) as csvfile, get_cursor() as c:
        reader = csv.reader(csvfile)
        header = next(reader)

        for row in reader:
            n = row[0][2:]
            values = (row[1], date, n, row[2])
            c.execute(sql, values)

This looks like a pretty innocent looking insert function, it takes the file loops over every row and inserts it into the table.

The refactored function looks like:

def insert_data(filename, date):

    sql = """
    INSERT INTO test (a, b, c, d)
    VALUES %s
    """

    with open(filename) as csvfile, get_cursor() as c:
        reader = csv.reader(csvfile)
        header = next(reader)
        values_list = []

        for row in reader:
            n = row[0][2:]
            values = (row[1], date, n, row[2])
            values_list.append(values)

        execute_values(c, sql, values_list)

The difference between these two functions is the execute and execute_values. Each time you use execute, psycopg2 does a complete return trip from the database to your computer, so this means it will execute the row INSERT to the database server, and then return. A functionality with Postgres is that you can insert multiple rows at a time, and this is what execute_values does.

Instead of inserting a single row query, the refactored version creates a query with multiple rows to insert. This reduces the number of round trips to the database server drastically, and results in much faster performance.

Run times

I ran the two different functions on a small subset of data being 288478 rows, which happens to be 3% of the files I was inserting.

execute code:

real    0m54.761s
user    0m12.752s
sys     0m4.876s

execute_values code:

real    0m7.545s
user    0m2.092s
sys     0m0.544s

Well that’s a 700% increase on just a small number of rows. I didn’t bother to compare what the times would be like for the complete dataset, but running the refactored version took about 25 minutes, so the original version would’ve taken hours!

Lesson learned

Would I have saved time if I had looked more in depth at the docs before writing my code? Most likely. I think lessons like this is what separates senior and junior level programmers, those who are able to grasp the scope of a problem and it’s solutions before they even begin writing code. Meanwhile, the juniors have to take time to write bad code in order to learn.

Update, another test

So my friend pointed me to another Python package called dataset saying it’s what he uses because he’s a lazy Python user who is allergic to SQL. He also said that Python > SQL, so I decided to prove him wrong, and also because I didn’t believe that another package that uses SQLAlchemy would be faster than just using Psycopg2. (SQLAlchemy is built on Psycopg2)

db = dataset.connect('postgresql://test@127.0.0.1:5432/testdb')
def insert_demand_data(filename, date):
    with open(filename) as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)
        values_list = []
        for row in reader:
            n = row[0][2:]
            values = dict(node=row[1], date=date, n=n, r=row[2])
            values_list.append(values)
        table.insert_many(values_list)

And what do you know.

real    0m53.392s
user    0m17.512s
sys     0m3.112s

testdb=# select count(*) from test ;
 count
--------
 288478
(1 row)

comments powered by Disqus