Consuming APIs and slow loops
Slow code is okay sometimes #
Whenever I’m testing out new features, or don’t have much time to properly think through a problem, I often end up writing slow, but functional code. This is where having years of experience helps out in writing code, because you’ve seen the problem before so you know how to avoid certain traps. Recently we deployed a new internal API to do some financial reporting (meaning there were calculations to be done). I’ll note here now that I didn’t write the original code and mostly did the deploying and plugged in some external services.
It’s okay to loop in Python #
By default, loops aren’t a terrible choice in Python. The problem arises when you loop over non-optimal structures, ones that built-in libraries can’t handle very well. There’s a lot written on the internet about Python 2 vs 3, and using built-in functions.
Within the main calculation loop we’re dealing with there is a method that goes to fetch a single value from a data object that is already in memory. The data object in mention is a JSON response from an external API, and in the raw form it is a list of dictionaries.
Original method:
def get_value_for_time(timestamp, json_response):
return next(
(
dict for dict in json_response
if dict['nzdatetimestamp'] == timestamp.isoformat()
),
None
)['value']
and the json response:
[
{'nzdatetimestamp': 2019-01-01T00:00:00+13:00, 'value': 500},
{'nzdatetimestamp': 2019-01-01T00:30:00+13:00, 'value': 100},
{'nzdatetimestamp': 2019-01-01T01:00:00+13:00, 'value': 50},
...
]
Of course I replaced the variable names with something that doesn’t show off too much but essentially this is what the original code looked like. Now remember that this is within the main loop, the one that might have up to 20000 cycles, and the json_response could have length 20000 as well.
Looping within loops #
The problem with this is that we’re formatting the JSON response, a list of dicts, every loop. This means that the generator dict for dict in json_response
loops over every item in the list every single loop. Generators themselves are lazy, meaning they don’t take up much memory until they need to, but for this purpose of getting a single value, it’s not needed. The functionality we want is to pass in a timestamp, and get the value associated to it.
Solution #
The QA tester (my boss) came to me saying “it’s taking over 10 minutes per request. Can you spend an hour or two having a look at obvious/easy improvements?”
Since I didn’t write the code I added some log statements that timed methods to see where I could improve speed, and the above method was the slowest one. After understanding what the method does, and the data structures that were being used, it became apparent to me that looping over a list of dicts was inefficient.
Outside of the loop, I reformatted the json response (list of dicts) into something more efficiently used.
value_dict = {
dict['nzdatetimestamp']: dict['value']
for dict in json_response
}
This is a one time operation that creates a new dict, with keys being a timestamp, and the value being the wanted value.
Then the resulting method looks like:
def get_value_for_time(timestamp, value_dict):
try:
return value_dict[timestamp.isoformat()]
except:
return None
Calculation times #
Before:
Avg loop time: 0.031869984712902005s
total calculation time 558.7511730194092s
After:
Avg loop time: 4.202004644896958e-06s
total calculation time 2.774477005004883s
As usual, the lesson is to avoid unnecessary loops when possible. And to do that, look at your data structures and how you’re using them.
As an aside, make sure you remove log statements that you don’t need. There is a noticeable overhead in having too many log statements when you’re trying to shave seconds off.