Improving Multiple Inserts with Psycopg2
You don't know till you know #
I find that whenever I'm in a rush to get something working, I have a bad
habit of diving straight into writing code just to get the thing done. More
often than not, this results in messy and/or slow code. And then when I need
to do something similar again, I just re-use the original code and end up
compounding the problem.
One of the more common tasks I repeatedly do is take a bunch of files and
insert the data into a Postgres database. Up until recently, I hadn't had to
deal with datasets large enough for my poorly written code to matter, so it
was still acceptable in terms of execution time. Usually this means executing
the insert script locally on a test database first, and then on production.
I'm using Python 3 and the Psycopg2 Postgres driver.
Psycopg2 execute and execute_values #
The original code looked something like:
def insert_data(filename, date):
    sql = """
        INSERT INTO test (a, b, c, d)
        VALUES (%s, %s, %s, %s)
    """
    with open(filename) as csvfile, get_cursor() as c:
        reader = csv.reader(csvfile)
        header = next(reader)
        for row in reader:
            n = row[0][2:]
            values = (row[1], date, n, row[2])
            c.execute(sql, values)
This looks like a pretty innocent insert function: it opens the file, skips
the header, loops over every row, and inserts each one into the table.
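As an aside, get_cursor() isn't part of psycopg2; it's a small helper that
isn't shown here. A minimal sketch of what such a helper might look like,
assuming it just wraps a psycopg2 connection (the connection string below is
a placeholder):

import contextlib
import psycopg2

@contextlib.contextmanager
def get_cursor():
    # Placeholder DSN; point this at your own database.
    conn = psycopg2.connect('postgresql://test@127.0.0.1:5432/testdb')
    try:
        # "with conn" commits on success and rolls back on error,
        # and the cursor is closed when the block exits.
        with conn, conn.cursor() as cur:
            yield cur
    finally:
        conn.close()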
The refactored function looks like:
from psycopg2.extras import execute_values

def insert_data(filename, date):
    sql = """
        INSERT INTO test (a, b, c, d)
        VALUES %s
    """
    with open(filename) as csvfile, get_cursor() as c:
        reader = csv.reader(csvfile)
        header = next(reader)
        values_list = []
        for row in reader:
            n = row[0][2:]
            values = (row[1], date, n, row[2])
            values_list.append(values)
        # One call sends all the accumulated rows in batched INSERTs.
        execute_values(c, sql, values_list)
The difference between these two functions is execute versus execute_values
(the latter comes from psycopg2.extras). Each time you call execute, psycopg2
does a complete round trip between your computer and the database: it sends
the single-row INSERT to the server, waits for it to run, and then returns.
Postgres, however, lets you insert multiple rows in a single statement, and
that is what execute_values takes advantage of.
Instead of issuing one query per row, the refactored version builds queries
that insert many rows at a time. This drastically reduces the number of round
trips to the database server and results in much faster performance.
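You don't have to send the whole list in one giant statement either:
execute_values takes a page_size argument (100 rows per statement by default)
that controls how it batches the rows. For example:

# Same call as in the refactored function, but with bigger batches: each
# generated INSERT will contain up to 1000 rows, so fewer statements
# hit the server.
execute_values(c, sql, values_list, page_size=1000)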
Run times #
I ran the two different functions on a small subset of the data, 288,478
rows, which happens to be 3% of the files I was inserting.
execute code:
real 0m54.761s
user 0m12.752s
sys 0m4.876s
execute_values code:
real 0m7.545s
user 0m2.092s
sys 0m0.544s
Well, that's roughly a 7x speedup on just a small number of rows. I didn't
bother to compare what the times would be like for the complete dataset, but
running the refactored version took about 25 minutes, so the original version
would've taken hours!
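For reference, those real/user/sys numbers are straight from the shell's time
command. If you'd rather compare the two functions from inside Python, a
crude timer like this does the job (the filename and date are placeholders,
and insert_data_old / insert_data_new stand in for the two versions above):

import time

def timed(fn, *args):
    # Crude wall-clock timing; good enough to compare the two approaches.
    start = time.perf_counter()
    fn(*args)
    print(f'{fn.__name__}: {time.perf_counter() - start:.1f}s')

timed(insert_data_old, 'subset.csv', '2018-01-01')
timed(insert_data_new, 'subset.csv', '2018-01-01')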
Lesson learned #
Would I have saved time if I had looked more in depth at the docs before
writing my code? Most likely. I think lessons like this are what separate
senior and junior programmers: seniors are able to grasp the scope of a
problem and its solutions before they even begin writing code, while the
juniors have to take time writing bad code in order to learn.
Update, another test #
So my friend pointed me to another Python package called dataset, saying
it's what he uses because he's a lazy Python user who is allergic to SQL.
He also said that Python > SQL, so I decided to prove him wrong, and also
because I didn't believe that a package built on SQLAlchemy would be
faster than using Psycopg2 directly. (For Postgres, SQLAlchemy itself sits
on top of Psycopg2.)
import dataset

db = dataset.connect('postgresql://test@127.0.0.1:5432/testdb')
table = db['test']  # the table used by insert_many below

def insert_demand_data(filename, date):
    with open(filename) as csvfile:
        reader = csv.reader(csvfile)
        header = next(reader)
        values_list = []
        for row in reader:
            n = row[0][2:]
            values = dict(node=row[1], date=date, n=n, r=row[2])
            values_list.append(values)
        table.insert_many(values_list)
And what do you know.
real 0m53.392s
user 0m17.512s
sys 0m3.112s
testdb=# select count(*) from test;
 count
--------
 288478
(1 row)