Performance Tuning SQL Statements: The Where Clause
October 03, 2008 in SQL by Michael Chrisman 0 Comments
In the last entry in this series, I talked about how to performance tune your SQL statement by using the FROM clause. The From clause is where you will see the biggest gain in performance and you only need to know the database and data model. However, this is not the only place you can improve performance. In this entry, we will look at the WHERE clause. Gains here will be very small (like factions of a second per record). So if you are only dealing with a couple of hundred records, performance tuning here will probably not do anything for you. However, if you are processing millions and millions of records, then those fractions really begin to add up.
All of the tuning tips here require that you have an in depth understanding of the data. Not just the data model, but the actual data in those records.
WHERE CLAUSE ORDER
The biggest boost here is understanding that when the database processes your where clause, it stops processing a record as soon as an "AND" clause is found false. As such, you want to order your where clause so that the lines that eliminate the most records are first (remember the purpose of the WHERE clause is to eliminate records from the records returned by the FROM clause.
Let's look at an example. Let's say we have a sales history database with 1,000,000 records. We want all the records from California for the sales person Bob. Our query would look like this:
1: SELECT * FROM salesHistory2: WHERE state = 'CA'3: AND salesperson = 'BOB'
In order to determine which to put first (state or salesperson), we have to examine the data. As we look, we see that there are 500,000 records for the state of California while there are only 250,000 records for Bob. We also find that Bob had 100,000 record in California. The way we have it currently written, the database would drop 500,000 records from the state = 'CA' check. The other 500,000 would then be check against salesperson = 'BOB', dropping another 400,000. This means that 400,000 dropped records had to be processed by two conditions to be dropped.
Now, if we write the query like this:
1: SELECT * FROM salesHistory2: WHERE salesperson = 'BOB'3: AND state = 'CA'
Now the first check (salesperson = 'BOB' would drop 750,000 records. The remaining 250,000 records would then be checked against State = 'CA', dropping the remaining 150,000. Now only 150,000 records had to do both checks before being dropped. (Remember, I did say the performance gains here were very small.)
This type of where clause tuning take a lot of data research to determine the best approach.
IN Statement
Along the same lines, when a database processes an "IN" statement:
1: SELECT ….2: WHERE category in ('A','B','C');
The database stops check as soon as it finds a match and it checks from left to right. As such, you want to put the most common one first, followed by the next most common, and so forth. In our example, if there are 100,000 category 'A', 250,000 category 'B' and 50,000 category 'C' then we should make our statement look like this:
1: SELECT ….2: WHERE category in ('B','A','C');
IN verses EXISTS
When you have to check to see if a value is in a sub select:
1: SELECT ….2: WHERE category IN (SELECT category FROM categories)
In this case, the database will build a full records set of categories and then check to see if category is in it. But if we use and EXISTS:
1: SELECT ….2: WHERE exists (SELECT 1 FROM categories where categories.category = salesHistory.category)
Now the database will return just one record and only cares if it exists. Plus the sub-query can be performance tuned.
OR Statement
Again, back to how the database processes, we want to put the part of the OR the will pass (not drop) the most records. If the first part of the OR pass a record, the database will not perform the second part.
Avoidance
Avoid putting stupid comparisons in your where clause. I have seen a lot of
1: WHERE 1=1 …
This is typically done when the developer is dynamically building the select statement and is too lazy to deal with weather to put and "AND" in front or not. (If you always start your SQL statement with "WHERE 1=1" then all additions to the statement will begin with "AND". But really, it's not that hard of logic to add to determine to add an "AND" or not. When you put things like that, the data has to make that comparison for record record it processes.
INDEXES
If you can create an index, then you will want to create an index that includes the fields in your WHERE clause. Sometimes, though, this becomes problematic. Remember, anytime there is an insert/update/delete the table, each index has to be updated. As such, calculated fields (like taking the YEAR(salesdate)) in an index can bring the performance of the main application down to its knees. This is why DBA do not like them much. Also, fields that have data that changes a lot make bad candidates for indexes for the same reason that calculated fields don't.
I am sure there are other little things you can do in the WHERE clause that I have missed, but as long as you remember how the database processes the SQL statement and understand your data, you can make gains in performance by tweaking your WHERE clause.