Friday, March 17, 2017

The Importance of Simple Query Optimization

After my customary, inexplicable year off from posting, I decided that the best way to get back on the horse is with a quick note about how minor tweaks to SQL queries can make a huge difference in performance. This issue came up while I was doing some preliminary tinkering with a potential recommendation-engine-style project, and I thought it was a really good example, especially for people who are new to SQL or are starting to write more complex queries against relational databases, but haven't had to worry much about execution time yet, either because the database is small, or it isn't code that will run in production, or what have you.

Below you will see screenshots of two nearly identical queries, which identify the most common video series viewed by users who have watched one particular series. You could simplify the concept as 'Users who watched X also watched Y' (and we only want the top five results).

Version 1:
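(The screenshot isn't reproduced here, but a sketch of what Version 1 might look like follows. The table name `views` and the column names `user_id`, `series_id`, and `viewed_at` are assumptions for illustration; the series ID '1688' and the [date] placeholder come from the discussion below.)

```sql
-- Version 1 (hypothetical sketch): no GROUP BY in the middle sub-query,
-- so _other_views contains one row per individual view.
SELECT _other_views.series_id,
       COUNT(DISTINCT _other_views.user_id) AS viewers
FROM (
      -- users who watched series '1688' since [date]
      SELECT DISTINCT user_id
      FROM views
      WHERE series_id = '1688'
        AND viewed_at >= [date]
     ) AS _series_viewers
JOIN (
      -- every individual view of every OTHER series since [date]
      SELECT user_id, series_id
      FROM views
      WHERE series_id <> '1688'
        AND viewed_at >= [date]
     ) AS _other_views
  ON _other_views.user_id = _series_viewers.user_id
GROUP BY _other_views.series_id
ORDER BY viewers DESC
LIMIT 5;
```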


Version 2:
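(Again, the screenshot isn't reproduced here; this is a hypothetical sketch under the same assumed table and column names as Version 1. The only change is the GROUP BY added to the middle sub-query.)

```sql
-- Version 2 (hypothetical sketch): the middle sub-query collapses
-- repeat views, yielding one row per user/series combination.
SELECT _other_views.series_id,
       COUNT(DISTINCT _other_views.user_id) AS viewers
FROM (
      SELECT DISTINCT user_id
      FROM views
      WHERE series_id = '1688'
        AND viewed_at >= [date]
     ) AS _series_viewers
JOIN (
      SELECT user_id, series_id
      FROM views
      WHERE series_id <> '1688'
        AND viewed_at >= [date]
      GROUP BY user_id, series_id   -- the one-line difference
     ) AS _other_views
  ON _other_views.user_id = _series_viewers.user_id
GROUP BY _other_views.series_id
ORDER BY viewers DESC
LIMIT 5;
```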


Now, if you look in the lower right corner of each of these, you will see how long the query took to execute, and notice immediately that the Version 1 query took 5x as long as the Version 2 query.

(Note: 105 seconds is brutal, and 21 seconds is still too slow for most applications, but we are focused on the relative difference for now. This was querying a PostgreSQL database; running the same query against an Amazon Redshift db took less than 5 seconds!)

The results are the same for both, so what's the difference here? If you look at the middle sub-query, aliased here as '_other_views,' you will notice that Version 2 has a GROUP BY clause that is missing from the first version. Why does that make the query so much faster? Think about what this query is trying to accomplish, segment by segment.

First segment - identify all users who watched series with an ID of '1688' since [date]

Second segment - of those users, identify every OTHER series that any of them has watched since [date]

Third segment - get a count of how many users have watched each of those other series, order them from most viewers to fewest, and take the top five

That second segment is the tough one. Over a longer time period, with potentially thousands of users watching thousands of series dozens of times each, we are still performing a search over millions of rows in the database. In the first version of the query, however, we create a row in the temporary table _other_views for every single instance of a user watching a series, when in reality we only care about the first instance of each user viewing each series (this is also a place where we could consider using an 'EXISTS' statement).
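For reference, here is a hedged sketch of what an EXISTS-based formulation might look like, using the same assumed table name (`views`) and column names (`user_id`, `series_id`, `viewed_at`); whether it actually beats the GROUP BY version depends on the planner and the data:

```sql
-- EXISTS variant (hypothetical sketch): for each view of another series,
-- check that the same user also watched series '1688'.
SELECT v.series_id,
       COUNT(DISTINCT v.user_id) AS viewers
FROM views v
WHERE v.series_id <> '1688'
  AND v.viewed_at >= [date]
  AND EXISTS (
      SELECT 1
      FROM views w
      WHERE w.user_id = v.user_id
        AND w.series_id = '1688'
        AND w.viewed_at >= [date]
  )
GROUP BY v.series_id
ORDER BY viewers DESC
LIMIT 5;
```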

Instead of:
Person 1 :: Series 1 [:: View 1]
Person 1 :: Series 1 [:: View 2]
Person 1 :: Series 1 [:: View 3]
Person 1 :: Series 2 [:: View 1]
Person 1 :: Series 2 [:: View 2]
etc.

We just want:
Person 1 :: Series 1 [:: View 1]
Person 1 :: Series 2 [:: View 1]
etc.

And that's what the GROUP BY in that sub-query achieves. It ensures that we don't create duplicate rows for each User/Series combination in our temp table, rows that the final SELECT statement would otherwise have to churn through in order to get the count of distinct viewers for each series.

Another way to demonstrate the difference is to run only the first two sub-queries, to see how large a result set your final SELECT is forced to sift through:
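One way to do that, under the same assumed schema, is to wrap the join of the first two sub-queries in a simple count:

```sql
-- Count the rows the final SELECT would have to process
-- (hypothetical sketch; table and column names are assumptions).
SELECT COUNT(*)
FROM (SELECT DISTINCT user_id
      FROM views
      WHERE series_id = '1688'
        AND viewed_at >= [date]) AS _series_viewers
JOIN (SELECT user_id, series_id
      FROM views
      WHERE series_id <> '1688'
        AND viewed_at >= [date]) AS _other_views
  ON _other_views.user_id = _series_viewers.user_id;
```

Adding `GROUP BY user_id, series_id` to the second sub-query gives you the Version 2 count for comparison.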

Version 1 (without GROUP BY) - 3,628,502 records returned

Version 2 (with GROUP BY) - 11,903 records returned

So just that one little change in our middle sub-query yields an immense reduction in the size of the result set, with over 300x fewer records to count and order in the final step. Both ways are technically 'correct' in that they will execute fully and return the same output, but this is a great example of how a little attention to what's actually going on under the hood can pay huge dividends in performance.