It’s been awhile since my last post. Been busy with a server relocation. Here’s my latest.
The SELECT statement is the most often used statment in any relational database. It is therefore necessary to optimize any such statement so that data can be returned as quickly as possible. SELECT statements can block, cause deadlocks, memory pressure and CPU pressure depending on how they’re written. In this post we’ll discuss how to write a SELECT statement to mitigate some of these issues.
Here is a list of DOs and DON’Ts when writing a SELECT statement:
- DO return only the columns that you need.
- DO return only the rows that you need.
- DO use only the appropriate index(es) in your statement.
- DO use the appropriate locking granularity.
- DO cache often used datasets into a table variable or temp table.
- DO use the appropiate JOIN hint.
- DON’T use SELECT INTO if possible.
- DON’T use FUNCTIONs if possible.
- DON’T use CASE statements if possible.
- DON’T put SELECT statements inside a TRANSACTION.
- DON’T use temporary tables, use table variables if possible.
Now let’s discuss each one of these items in more detail.
1. DO return only the columns that you need.
I often see code like this:
select * from dbo.someTable
There’s nothing wrong with code like this if the client really needs all the columns from the table. More often than not though, the client is only interested in a subset of the columns. It’s a waste to return the additional columns if the client doesn’t need to consume them.
2. DO return only the rows that you need.
Again, we’ll use the code example in item #1. Notice how it returns all rows in the table? Same rationale applies here: It’s a waste to return rows that the client may not need. Rather, the SELECT statement should be written to return only the rows that the client is interested in like so:
select col1, col2, col3
from dbo.someTable
where
col4 = 'xyz'
3. DO use only the appropriate index(es) in your statement.
If the table we’re selecting from has an appropriate index, we can help MSSQL to select the right index to use by using an index hint or by specifying a criteria in the WHERE clause that would help MSSQL pick that index.
Assume that dbo.someTable has an index on col4. The code example in item #2 would use this index. What if the index on dbo.someTable was defined such that the index is on col4, col5? MSSQL can still use this index because the WHERE clause partially matches the index.
MSSQL would not use the index if it was defined using col5, col4. In otherwords, MSSQL can still use an index if it’s a leading index. That is, an index in which the columns that are indexed partially matches the WHERE clause starting from LEFT to RIGHT.
Sometimes however, no matter how you write your WHERE clause, MSSQL may not use the index you want. In such a case, you could use an index hint. An index hint is request to MSSQL that it should use a given index when executing your SELECT statement. You would use it like so:
select col1, col2, col3
from dbo.someTable with (index(IX_COL4))
where
col4 = 'xyz'
Assuming that the index IX_COL4 is indexed such that COL4 is the first column indexed, it instructs MSSQL to use this index.
4. DO use the appropriate locking granularity.
When selecting data, you need to use the right locking level for your environment. If you can get away with using a dirty read then use the NOLOCK hint like so:
select col1, col2, col3
from dbo.someTable with (nolock, index(IX_COL4))
where
col4 = 'xyz'
If you can’t then use an UPDATE lock in conjunction with either a ROWLOCK or PAGLOCK. A ROWLOCK will lock all the rows that qualify in your WHERE clause, whereas an UPDLOCK will lock all the pages that contain the rows that qualify in your WHERE clause.
Having said that, it may seem that using a ROWLOCK is the way to go. While ROWLOCK will give you high concurrency, it does make MSSQL work harder. For every row that’s locked, it must maintain information about that lock. The same goes with a PAGLOCK. So what’s the difference?
Well, assume that your query finds 10000 rows that qualify. If you’re using a ROWLOCK, MSSQL has created 10000 lock records internally. Now assume that you’re using a PAGLOCK and that the same 10000 rows were found on 1000 pages. MSSQL would have created only 1000 lock records internally.
That’s a lot less resources that MSSQL has to maintain and free up once your query is done.
5. DO cache often used datasets into a table variable or temp table.
You’ll often find that you need to re-use a particular dataset within a given stored procedure or script. Instead of retrieving the same data for each statement you should cache this dataset for later retrieval. For example, suppose you want to get a list of all the customers that have brought you more than $5000 in revenue since 1/1/2008. Furthermore, suppose that you want to get a list of the invoices. One way to do this is to use the code below:
-- Get customers with revenue for year that is > $5000.
select
c.customerId
,c.balance
,c.customerCreateDate
,c.customerName
from dbo.tblCustomer c with (updlock, paglock)
join dbo.tblInvoiceHdr h with (nolock)
on h.customerId = c.customerId
where
h.tranDate between '1/1/2008' and getdate()
and c.balance > 5000
-- Get the invoice info.
select
h.invoiceNumber
,h.invoiceSysId
,h.tranDate
,h.invoiceTotal
from dbo.tblInvoiceHdr h
join
(
select
c.customerId
,c.balance
,c.customerCreateDate
,c.customerName
from dbo.tblCustomer c with (updlock, paglock)
join dbo.tblInvoiceHdr h with (nolock)
on h.customerId = c.customerId
where
h.tranDate between '1/1/2008' and getdate()
and c.balance > 5000
) custBal5K
on custBal5K.customerId = h.customerId
Notice how we re-use the SELECT statement from the first statement to generate the invoice list? We could cache this output into a table to get better performance like so:
-- Create temp table.
declare @t table
(
customerId int not null
,balance money not null
,customerCreateDate datetime not null
,customerName varchar(100) not null
)
-- Get customers with revenue for year that is > $5000.
insert into @t
(
customerId
,balance
,customerCreateDate
,customerName
)
select
c.customerId
,c.balance
,c.customerCreateDate
,c.customerName
from dbo.tblCustomer c with (updlock, paglock)
join dbo.tblInvoiceHdr h with (nolock)
on h.customerId = c.customerId
where
h.tranDate between '1/1/2008' and getdate()
and c.balance > 5000
-- Echo customer list.
select * from @t
-- Echo invoices.
select
h.invoiceNumber
,h.invoiceSysId
,h.tranDate
,h.invoiceTotal
from @t t
join dbo.tblInvoiceHdr h
on h.customerId = t.customerId
The code above caches the customer list into a temp table. The reason why this query performs much better is it doesn’t have to fetch the data from dbo.tblCustomer again as in the first example. Obviously, this is a trivial example. However, if the base table that you’re fetching from has a lot of rows, caching the data in this manner can be more efficient.
6. DO use the appropiate JOIN hint.
Most of the time, SQL Server is smart enough to use the most efficient type of JOIN to retrieve the data it needs. There are times however where you may want to suggest to SQL the type of join to use. SQL uses three types of JOINs. They are as follows:
LOOP JOIN
HASH JOIN
MERGE JOIN
A loop join is one in which the records in one table are scanned for matches using every record in another table. This type of join is called a loop join because it’s akin to searching for a match by looping over a set of values to do the comparison. This type of join is CPU-intensive and is the slowest of the three types of joins.
A hash join on the other hand uses a hash table instead of a loop to find the matches, hence its name. In this method, a hash table is built to hold the values of the smaller table. This is called the build phase. Once the hash table is built, the rows in the larger table are used as inputs to scan the hash table for matches. This is termed the probe phase. Any matches are then sent back to the client. This type of join is faster than the LOOP JOIN but it is memory-intensive.
A merge join is used when the columns to be compared have a unique index in both tables where the sort order is the same. In other words, the inputs being compared are already sorted in exactly the same order. A row by row comparison is made between the sorted inputs. If a match is found it is sent to the client. If the two rows don’t match, the row that has the lesser value is discarded. SQL then advances to the next row in both tables and repeats this process until there is no more rows to compare. This type of join is the fastest method since the comparision is being done on rows that are identically sorted.
So how do you specify a JOIN hint in a SELECT statement? Take a look at the code examples below:
LOOP JOIN EXAMPLE
select
a.col1, b.col1
from tableA a
inner loop join tableB b
on b.col1 = a.col1
HASH JOIN EXAMPLE
select
a.col1, b.col1
from tableA a
inner hash join tableB b
on b.col1 = a.col1
MERGE JOIN EXAMPLE
select
a.col1, b.col1
from tableA a
inner merge join tableB b
on b.col1 = a.col1
7. DON’T use SELECT INTO if possible.
Often times it’s necessary to cache data into a temporary table. This is true if you intend to re-use the data often in your script or stored procedure. Some DBAs (including myself) often take the lazy way out and write code to store the data using this form:
select col1, col2, col3 into #t from dbo.someTable
While there’s nothing wrong with using this form, there is a drawback with using this method. The SELECT…INTO statement is doing two operations here. It is creating a temporary table (#t) and then inserting the data from dbo.someTable into #t.
Keep in mind that every MSSQL statement that is not a SELECT statement creates an implicit transaction. It makes sense when you think about it. If a given statement fails for any reason, MSSQL has to rollback any changes as if the statement never executed. Therefore, the first part of the SELECT…INTO operation, the one that is creating the temporary table #t, is also wrapped in a transaction until the INSERT portion is completed.
When an object is being modified, MSSQL places a schema modification lock on the object to ensure that no process can access the object until the modification has completed. That is also true of temp tables. However, MSSQL also places an additional schema lock on tempdb itself. This lock on tempdb can block other processes that need to create temp tables in as well. To get around this issue it is better to create the temp table first and then do an explicit insert as the code below demonstrates:
----- Create temp table. -----
-- While the temp table is being created, a schema lock on tempdb is enforced.
create table #t
(
col1 int not null
,col2 varchar(50) not null
,col3 numeric(10,2) not null
)
----- Now do the INSERT. -----
-- Schema lock on tempdb is no longer in affect here!
insert into #t
(
col1
,col2
,col3
)
select col1, col2, col3 from dbo.someTable
The code above accomplishes the same task but the schema lock placed on tempdb is so short-lived that other processes won’t be affected.
8. DON’T use FUNCTIONs if possible.
Let’s face it, as programmers and DBAs, we use functions all the time. For the most part, using functions is not a big issue; unless it’s part of a large data-retrieval. You see, functions by nature are executed one row at a time, one expression at a time. Therefore, if you have a function that is part of a SELECT statement, the function is executed for each row of data that’s returned. Consider the following code:
----- Get data from table. -----
select
customerId
,customerName
,isnull(onCreditHold, 0) as onCreditHold
from dbo.someTable (nolock)
In the select statement above, the function ISNULL( ) is being called for each row that is returned. That’s not an issue if we’re talking about a small number of rows. If however, you’re returning a large amount of rows, it could impact performance since the column ONCREDITHOLD has to be evaluated for each row.
Instead, you may want to create a DEFAULT constraint so that ONCREDITHOLD is 0 by default for inserts. That way you wouldn’t have to cast a NULL value to a 0.
9. DON’T use CASE statements if possible.
CASE statements are like FUNCTIONs in that they are examine for each row. Avoid using them if at all possible. If not keep them as simple as possible. Forgo using nested CASE statements.
10. DON’T put SELECT statements inside a TRANSACTION.
It doesn’t make sense to me but I see a lot of T-SQL code where SELECT statements are wrapped by a transaction. You should use transactions when inserting, deleting or updating data. Transactions help you to rollback any changes that might have occurred in the event of an error. Since a SELECT query doesn’t change or affect data it’s a waste and poor coding to have them inside a transaction.
11. DON’T use temporary tables, use table variables if possible.
Both local and global temporary tables are more resource intensive then their table variable counterparts. Use table variables in lue of temporary tables.
As always I welcome any feedback, corrections or insights regarding this post.
Minh