Improving Multiple Inserts with Psycopg2
You don't know till you know #
I find that whenever I'm in a rush to get something working, I have a bad
habit of diving straight into writing code just to get the thing done. More
often than not, this results in messy and/or slow code. Then, when I need to
do something similar again, I re-use the original code and end up compounding
the problem.
One of the more common tasks I repeatedly do is take a bunch of files and
insert the data into a Postgres database. Up until recently, I hadn't had to
deal with datasets large enough for my poorly written code to matter in terms
of execution time. Usually this means executing the insert script locally on a
test database first, and then on production. I'm using Python 3 and the
Psycopg2 postgres driver.
Psycopg2 execute and execute_values #
The original code looked something like:
import csv

def insert_data(filename, date):
    sql = """
        INSERT INTO test (a, b, c, d)
        VALUES (%s, %s, %s, %s)
    """
    with open(filename) as csvfile, get_cursor() as c:
        reader = csv.reader(csvfile)
        header = next(reader)  # skip the header row
        for row in reader:
            n = row[2:]
            values = (row[0], date, n, row[1])  # column indices assumed
            c.execute(sql, values)
This is a pretty innocent-looking insert function: it takes the file, loops
over every row, and inserts each row into the table.
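One thing these snippets gloss over is get_cursor(), a helper that isn't shown here. A minimal sketch of one, assuming a plain psycopg2 connection with placeholder connection details, could look like:

import contextlib
import psycopg2

@contextlib.contextmanager
def get_cursor():
    # hypothetical helper, not from the original snippets: opens a
    # connection, yields a cursor, and commits if no exception occurred
    conn = psycopg2.connect('postgresql://user@localhost:5432/testdb')
    try:
        with conn.cursor() as cur:
            yield cur
        conn.commit()
    finally:
        conn.close()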
The refactored function looks like:
from psycopg2.extras import execute_values

def insert_data(filename, date):
    sql = """
        INSERT INTO test (a, b, c, d)
        VALUES %s
    """
    with open(filename) as csvfile, get_cursor() as c:
        reader = csv.reader(csvfile)
        header = next(reader)  # skip the header row
        values_list = []
        for row in reader:
            n = row[2:]
            values = (row[0], date, n, row[1])  # column indices assumed
            values_list.append(values)
        execute_values(c, sql, values_list)
The difference between these two functions is execute_values. Each time you
call execute, psycopg2 makes a complete round trip between your computer and
the database: it sends the single-row INSERT to the server, waits for it to
run, and then returns. Postgres can insert multiple rows in a single
statement, and this is what execute_values takes advantage of.
Instead of issuing one single-row query per row, the refactored version builds
a query that inserts many rows at a time. This drastically reduces the number
of round trips to the database server and results in much faster performance.
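Under the hood, execute_values interpolates the rows into the VALUES clause in batches; by default it sends 100 rows per statement, which can be tuned with its page_size argument. For example (the 1000 here is just an illustrative value):

# larger pages mean fewer round trips, at the cost of bigger statements
execute_values(c, sql, values_list, page_size=1000)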
Run times #
I ran the two different functions on a small subset of the data, 288478 rows,
which happens to be 3% of the files I was inserting.
Original version:
real    0m54.761s
user    0m12.752s
sys     0m4.876s

Refactored version:
real    0m7.545s
user    0m2.092s
sys     0m0.544s
Well, that's roughly a 7x speedup on just a small number of rows. I didn't
bother comparing the run times for the complete dataset, but the refactored
version took about 25 minutes, so the original version would've taken
somewhere around three hours.
Lesson learned #
Would I have saved time if I had looked more closely at the docs before
writing my code? Most likely. I think lessons like this are what separate
senior from junior programmers: seniors are able to grasp the scope of a
problem and its solutions before they even begin writing code, while juniors
have to take the time to write bad code in order to learn.
Update, another test #
So my friend pointed me to another Python package called dataset, saying
it's what he uses because he's a lazy Python user who is allergic to SQL.
He also said that Python > SQL, so I decided to prove him wrong, and also
because I didn't believe that a package built on SQLAlchemy would be faster
than using Psycopg2 directly. (For Postgres, SQLAlchemy uses Psycopg2 as its
underlying driver.)
import csv
import dataset

db = dataset.connect('postgresql://firstname.lastname@example.org:5432/testdb')
table = db['test']  # table name assumed; the original only shows db

def insert_demand_data(filename, date):
    with open(filename) as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)  # skip the header row
        values_list = []
        for row in reader:
            n = row[2:]
            values = dict(node=row[0], date=date, n=n, r=row[1])  # column indices assumed
            values_list.append(values)
        table.insert_many(values_list)
And what do you know.
real    0m53.392s
user    0m17.512s
sys     0m3.112s

testdb=# select count(*) from test;
 count
--------
 288478
(1 row)