You don’t know till you know
I find that whenever I’m in a rush to get something working, I have a bad habit of diving straight into writing code and just getting the thing done. More often than not, this results in messy and/or slow code. And then when I need to do something similar again, I re-use the original code and end up compounding the problem.
One of the more common tasks I repeatedly do is take a bunch of files and insert their data into a Postgres database. Up until recently, I hadn’t had to deal with large enough datasets, so my poorly written code was still acceptable in terms of execution time. Usually this means executing the insert script locally on a test database first, and then on production. I’m using Python 3 and the Psycopg2 Postgres driver.
Psycopg2 execute and execute_values
The original code looked something like:
import csv

def insert_data(filename, date):
    sql = """
        INSERT INTO test (a, b, c, d)
        VALUES (%s, %s, %s, %s)
    """
    with open(filename) as csvfile, get_cursor() as c:
        reader = csv.reader(csvfile)
        header = next(reader)
        for row in reader:
            n = row[2:]
            values = (row[0], date, n, row[1])
            c.execute(sql, values)
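The get_cursor() helper isn’t part of psycopg2; it’s just a small helper I won’t go into here. A minimal sketch of what a context manager like it could look like, assuming a plain connection that commits when the block finishes:

import contextlib
import psycopg2

@contextlib.contextmanager
def get_cursor():
    # Placeholder connection details; commit only if the block succeeds
    conn = psycopg2.connect(dbname='testdb')
    try:
        with conn.cursor() as cur:
            yield cur
        conn.commit()
    finally:
        conn.close()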
This looks like a fairly innocent insert function: it takes the file, loops over every row, and inserts each one into the table.
The refactored function looks like:
from psycopg2.extras import execute_values

def insert_data(filename, date):
    sql = """
        INSERT INTO test (a, b, c, d)
        VALUES %s
    """
    with open(filename) as csvfile, get_cursor() as c:
        reader = csv.reader(csvfile)
        header = next(reader)
        values_list = []
        for row in reader:
            n = row[2:]
            values = (row[0], date, n, row[1])
            values_list.append(values)
        execute_values(c, sql, values_list)
The difference between these two functions is the use of execute_values. Each time you call execute, psycopg2 does a complete round trip from your computer to the database: it sends the single-row INSERT to the server, waits for it to run, and then returns. One feature of Postgres is that you can insert multiple rows in a single statement, and this is what execute_values takes advantage of. Instead of issuing one query per row, the refactored version builds a query that inserts multiple rows at once. This drastically reduces the number of round trips to the database server and results in much faster performance.
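To make that concrete: with execute the server sees one single-row INSERT per call, while execute_values folds the values list into multi-row statements, page_size rows at a time (psycopg2’s default is 100). A rough sketch of calling it directly, reusing the values_list built above:

from psycopg2.extras import execute_values

# Single-row version: one statement (and one round trip) per row
#   INSERT INTO test (a, b, c, d) VALUES (%s, %s, %s, %s)
# Multi-row version: one statement per batch of page_size rows
#   INSERT INTO test (a, b, c, d) VALUES (...), (...), (...), ...

with get_cursor() as c:
    execute_values(
        c,
        "INSERT INTO test (a, b, c, d) VALUES %s",
        values_list,     # the list of (a, b, c, d) tuples built above
        page_size=1000,  # rows per statement; the default is 100
    )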
I ran the two different functions on a small subset of the data, 288478 rows, which happens to be about 3% of the files I was inserting.
Original (execute):

real    0m54.761s
user    0m12.752s
sys     0m4.876s
Refactored (execute_values):

real    0m7.545s
user    0m2.092s
sys     0m0.544s
Well, that’s roughly a 7× speed-up on just a small number of rows. I didn’t bother to compare what the times would be like for the complete dataset, but running the refactored version took about 25 minutes, so the original version would’ve taken hours!
Would I have saved time if I had looked more closely at the docs before writing my code? Most likely. I think lessons like this are what separate senior from junior programmers: seniors are able to grasp the scope of a problem and its solutions before they even begin writing code, while juniors have to take the time to write bad code in order to learn.
Update: another test
So my friend pointed me to another Python package called dataset, saying it’s what he uses because he’s a lazy Python user who is allergic to SQL. He also said that Python > SQL, so I decided to prove him wrong, partly because I didn’t believe that a package built on SQLAlchemy would be faster than using Psycopg2 directly. (SQLAlchemy talks to Postgres through Psycopg2.)
import dataset

db = dataset.connect('postgresql://[email protected]:5432/testdb')
table = db['test']

def insert_demand_data(filename, date):
    with open(filename) as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)
        values_list = []
        for row in reader:
            n = row[2:]
            values = dict(node=row[0], date=date, n=n, r=row[1])
            values_list.append(values)
        table.insert_many(values_list)
And what do you know.
real    0m53.392s
user    0m17.512s
sys     0m3.112s

testdb=# select count(*) from test;
 count
--------
 288478
(1 row)