Database Normalization – 2NF HOW-TO

November 8, 2009

Welcome back everyone. In this post we continue our discussion on the data normalization process and in particular second-normal-form, otherwise known as 2NF. Data normalization is an additive process. That is, if your data is 2NF compliant, it is also said to be 1NF compliant. Therefore, if you haven’t read my post on 1NF, please read it before reading this post, otherwise read on.

So what is 2NF? Well, 2NF is an extension of 1NF. Whereas 1NF’s intent was to eliminate repeating groups, 2NF’s intent is to eliminate redundant data. A table is said to be in 2NF if all redundant data has been eliminated. Suppose that for each student in our table, we also maintain their mailing address. Let’s assume that the student table currently looks like this:

Student Table

StudentId LastName FirstName Addr1 Addr2 City State Zip
1 Smith John 1123 Westchester Blvd. Covina CA 91311
2 Ericson Robert 655 5th Street Apt. 2 Marina Del Rey CA 91365
3 Edwards Abby 944 Main Street P.O. Box 11 Beverly Hills CA 91202
4 Richmon Stanley 3900 Arbor Drive Apt. 7 West Hills CA 91374

This table is in 1NF but has redundant data, namely, the STATE column. Now, some of you may argue that ADDRESS1 and ADDRESS2 are repeating groups. Afterall, I said in my last post that for the most part, any time you see a numeral behind a column name, it’s usually indicative of a repeating group. That’s still the case. However, with regard to address information there is a lot of debate as to wether ADDRESS1 and ADDRESS2 would be considered a repeating group. After all, both columns contain address information right?

Keep in mind that while ADDRESS1 represents the main (street) address, ADDRESS2 may often represent a sub-unit within the building. While it’s true that both addresses represent a location, they are in fact different location types. From a relational standpoint, types often map to attributes which map to columns in a table. Therefore, since ADDRESS1 and ADDRESS2 represent different types, many DBAs (including myself) do not consider ADDRESS2 to be part of a repeating group.

You can see that ADDRESS2 could hold P.O. boxes, apartment units, departments, etc. It’s just easier to label the column ADDRESS2. Now back to our discussion. As mentioned earlier, this table is in 1NF but not 2NF because of the redundant data in the STATE column. Although the CITY and ZIP columns are not redundant, they have the potential to be. Therefore, they need to be in their own tables as dictated by 2NF.

This table also violates 2NF because the non-key columns ADDRESS1, ADDRESS2, CITY, STATE and ZIP are not fully dependant on the primary key (StudentId).  Let me put this as a question.  Given a student’s last name and first name; would you also need to know the values of ADDRESS1, ADDRESS2, CITY, STATE and ZIP in order to determine the StudentId?  The answer is obviously NO.  You would only need to know the student’s last and first name.  Given this answer, it’s clear that these non-key columns are not dependent on the primary key.  Therefore these columns need to be moved to their own table.  Below is the address table broken out:

Student Table

StudentId LastName FirstName
1 Smith John
2 Ericson Robert
3 Edwards Abby
4 Richmon Stanley

Address Table

StudentId AddressId Addr1 Addr2 City State Zip
1 1 1123 Westchester Blvd. Covina CA 91311
2 2 655 5th Street Apt. 2 Marina Del Rey CA 91365
3 3 944 Main Street P.O. Box 11 Beverly Hills CA 91202
4 4 3900 Arbor Drive Apt. 7 West Hills CA 91374

Closer but this still violates 2NF.  The columns CITY, STATE and ZIP can still be redundant.  Therefore they need to be moved into their own tables as well.  Below is a new CITY table:

City Table

CityId City
1 Covina
2 Marina Del Rey
3 Beverly Hills
4 West Hills

Now for the STATE table:

State Table

StateId State
1 CA

Now for the ZIP table:

Zip Table

ZipId Zip
1 91311
2 91365
3 91202
4 91374

Each of the lookup tables above are also in 2NF because the lookup description is dependent on the primary keys. That is, you must know the ID for the given record in the given table to find the description and vise-versa.

Here is the revised ADDRESS table. It now uses foreign keys to reference the lookup tables that we just created above.

Address Table

StudentId AddressId Addr1 Addr2 City State Zip
1 1 1123 Westchester Blvd. 1 1 1
2 2 655 5th Street Apt. 2 2 1 2
3 3 944 Main Street P.O. Box 11 3 1 3
4 4 3900 Arbor Drive Apt. 7 4 1 4

That’s it for 2NF. In the next post we will take this another step and cover 3NF. See you then.

Minh


Database Normalization – 1NF HOW-TO

July 9, 2009

In this post we are going to tackle the concept of database normalization and how it can be applied. Before we get into that however, let’s look at the intent of data normalization. In short, normalization, as it relates to the field of relational database design, is an effort to reduce or eliminate redundant data. This is done by breaking the data down into well-defined constructs and then associating them to each other through the use of defined relationships.

So what is redundant data? By definition, it is the unnecessary duplication of data. Why should we strive to eliminate it? First, it will reduce the overall storage requirement of the data. Secondly, it will make querying the data much easier. Thirdly, it makes it easy to change data. We’ll see an example of these benefits later in this series on normalization.

Notice that I did not say that normalization will guarantee data integrity. This is also important, however, strictly speaking, normalizaiton promotes data integrity but does not necessarily govern it. This is, in my experience, left to the DBA to define and implement.

So what is an example of redundant data? Take a look at the table below. It’s depicts a fictional table used to store student registration data. Can you see the redundant data?

Student Class1 Instructor1 Class2 Instructor2
Smith, John Biology Darwin, Charles Math Pascal, Blaise
Ericson, Robert Biology Darwin, Charles Chemistry Bohr, Niels
Edwards, Abby Physics Newton, Isaac Biology Darwin, Charles
Richmon, Stanley Computer Science Boyer, Robert Archaeology Howard, Carter

This table design violates first-normal-form, also referred to as 1NF. 1NF dictates that a table must not have any repeating groups. A repeating group occurs when a given instance, in this case STUDENT, has an attribute, in this case CLASS and INSTRUCTOR that contains multiple values. In this example, the columns CLASS1 and CLASS2 are representitive of a repeating group. They both convey an attribute type of CLASS with multiple values for each STUDENT. The values are stored as CLASS1 and CLASS2. The same is true of the INSTRUCTOR attribute. More than likely, when you see any column with a numeral such as this example, it’s a safe bet that it’s a repeating group. A novice DBA might redesign the table to look like this in an effort to acheive 1NF:

Student Class Instructor
Smith, John Biology | Math Darwin, Charles | Pascal, Blaise
Ericson, Robert Biology | Chemistry Darwin, Charles | Bohr, Niels
Edwards, Abby Physics | Biology Newton, Isaac | Darwin, Charles
Richmon, Stanley Computer Science | Archaeology Boyer, Robert | Howard, Carter

Close but no cigar. 1NF also dictates that a column (attribute) must be atomic. That is, it can contain only one value. This second example violates that because the STUDENT, CLASS and INSTRUCTOR columns contain multiple values. Below is the same data in 1NF:

Student Table

StudentId LastName FirstName
1 Smith John
2 Ericson Robert
3 Edwards Abby
4 Richmon Stanley

Class Table

ClassId ClassName
1 Archaeology
2 Biology
3 Chemistry
4 Computer Science
5 Math
6 Physics

Instructor Table

InstructorId LastName FirstName
1 Bohr Niels
2 Boyer Robert
3 Carter Howard
4 Darwin Charles
5 Newton Isaac
6 Pascal Blaise

Notice how we’ve broken down our original table into three, well-defined tables? Each table contains the same data as our original table in terms of the entity or instance they represent. The repeating group is gone and each column in the three tables are atomic. What’s missing though is some construct (table) that relates the three to each other. For that, we use what is known as a relation table. A relation table is one which relates an entity to another entity. Below is the relation table that would be used to associate the three entities to each other:

Class Relation Table

StudentId ClassId InstructorId
1 2 4
1 5 6
2 2 4
2 3 1
3 6 5
3 2 4
4 4 2
4 1 3

As you can see, the table above relates each of the three entities to each other. We can tell that Smith, John (StudentId = 1) is enrolled in Biology (ClassId = 2) and Math (ClassId = 5). We know that Biology (ClassId = 2) is taught by Charles Darwin (InstructorId = 4) and that Math (ClassId = 5) is taught by Blaise Pascal (InstructorId = 6).

You will find that normalized data have many tables but few columns. This is a by-product of the normalization process. For those of you who are astute readers, you may be pondering the fact that the relation table (Class Relation Table) seems to have some duplicative data. You are correct. The StudentId, ClassId and InstructorId do indeed occur multiple times in this table. That’s fine though. Remember that redundant data is defined as the unnecessary duplication of data. In the case of the relation table, the duplication is necessary (and perfectly legal) to relate the three entities to each other.

Come back in a few days. We’ll continue our discussion of the normalization process by exploring the second level of normalization, namely 2NF. Feel free to comment on this post with your thoughts, suggestions or critiques.

Minh


SQL – Optimizing SELECT Statement for Speed

July 2, 2009

It’s been awhile since my last post. Been busy with a server relocation. Here’s my latest.

The SELECT statement is the most often used statment in any relational database.  It is therefore necessary to optimize any such statement so that data can be returned as quickly as possible.  SELECT statements can block, cause deadlocks, memory pressure and CPU pressure depending on how they’re written.  In this post we’ll discuss how to write a SELECT statement to mitigate some of these issues.

Here is a list of DOs and DON’Ts when writing a SELECT statement:

  1. DO return only the columns that you need.
  2. DO return only the rows that you need.
  3. DO use only the appropriate index(es) in your statement.
  4. DO use the appropriate locking granularity.
  5. DO cache often used datasets into a table variable or temp table.
  6. DO use the appropiate JOIN hint.
  7. DON’T use SELECT INTO if possible.
  8. DON’T use FUNCTIONs if possible.
  9. DON’T use CASE statements if possible.
  10. DON’T put SELECT statements inside a TRANSACTION.
  11. DON’T use temporary tables, use table variables if possible.

Now let’s discuss each one of these items in more detail.

1.  DO return only the columns that you need.

I often see code like this:

select * from dbo.someTable

There’s nothing wrong with code like this if the client really needs all the columns from the table. More often than not though, the client is only interested in a subset of the columns. It’s a waste to return the additional columns if the client doesn’t need to consume them.

2. DO return only the rows that you need.

Again, we’ll use the code example in item #1. Notice how it returns all rows in the table? Same rationale applies here: It’s a waste to return rows that the client may not need. Rather, the SELECT statement should be written to return only the rows that the client is interested in like so:

select col1, col2, col3
from dbo.someTable
where
   col4 = 'xyz'

3. DO use only the appropriate index(es) in your statement.

If the table we’re selecting from has an appropriate index, we can help MSSQL to select the right index to use by using an index hint or by specifying a criteria in the WHERE clause that would help MSSQL pick that index.

Assume that dbo.someTable has an index on col4. The code example in item #2 would use this index. What if the index on dbo.someTable was defined such that the index is on col4, col5? MSSQL can still use this index because the WHERE clause partially matches the index.

MSSQL would not use the index if it was defined using col5, col4. In otherwords, MSSQL can still use an index if it’s a leading index. That is, an index in which the columns that are indexed partially matches the WHERE clause starting from LEFT to RIGHT.

Sometimes however, no matter how you write your WHERE clause, MSSQL may not use the index you want. In such a case, you could use an index hint. An index hint is request to MSSQL that it should use a given index when executing your SELECT statement. You would use it like so:

select col1, col2, col3
from dbo.someTable with (index(IX_COL4))
where
   col4 = 'xyz'

Assuming that the index IX_COL4 is indexed such that COL4 is the first column indexed, it instructs MSSQL to use this index.

4. DO use the appropriate locking granularity.

When selecting data, you need to use the right locking level for your environment. If you can get away with using a dirty read then use the NOLOCK hint like so:

select col1, col2, col3
from dbo.someTable with (nolock, index(IX_COL4))
where
   col4 = 'xyz'

If you can’t then use an UPDATE lock in conjunction with either a ROWLOCK or PAGLOCK. A ROWLOCK will lock all the rows that qualify in your WHERE clause, whereas an UPDLOCK will lock all the pages that contain the rows that qualify in your WHERE clause.

Having said that, it may seem that using a ROWLOCK is the way to go. While ROWLOCK will give you high concurrency, it does make MSSQL work harder. For every row that’s locked, it must maintain information about that lock. The same goes with a PAGLOCK. So what’s the difference?

Well, assume that your query finds 10000 rows that qualify. If you’re using a ROWLOCK, MSSQL has created 10000 lock records internally. Now assume that you’re using a PAGLOCK and that the same 10000 rows were found on 1000 pages. MSSQL would have created only 1000 lock records internally.

That’s a lot less resources that MSSQL has to maintain and free up once your query is done.

5. DO cache often used datasets into a table variable or temp table.

You’ll often find that you need to re-use a particular dataset within a given stored procedure or script. Instead of retrieving the same data for each statement you should cache this dataset for later retrieval. For example, suppose you want to get a list of all the customers that have brought you more than $5000 in revenue since 1/1/2008. Furthermore, suppose that you want to get a list of the invoices. One way to do this is to use the code below:

-- Get customers with revenue for year that is > $5000.
select
  c.customerId
  ,c.balance
  ,c.customerCreateDate
  ,c.customerName
from dbo.tblCustomer c with (updlock, paglock)
join dbo.tblInvoiceHdr h with (nolock)
on h.customerId = c.customerId
where
   h.tranDate between '1/1/2008' and getdate()
   and c.balance > 5000

-- Get the invoice info.
select
   h.invoiceNumber
   ,h.invoiceSysId
   ,h.tranDate
   ,h.invoiceTotal
from dbo.tblInvoiceHdr h
join
(
   select
     c.customerId
     ,c.balance
     ,c.customerCreateDate
     ,c.customerName
   from dbo.tblCustomer c with (updlock, paglock)
   join dbo.tblInvoiceHdr h with (nolock)
   on h.customerId = c.customerId
   where
      h.tranDate between '1/1/2008' and getdate()
      and c.balance > 5000
) custBal5K
on custBal5K.customerId = h.customerId

Notice how we re-use the SELECT statement from the first statement to generate the invoice list? We could cache this output into a table to get better performance like so:

-- Create temp table.
declare @t table
(
   customerId int not null
   ,balance money not null
   ,customerCreateDate datetime not null
   ,customerName varchar(100) not null
)

-- Get customers with revenue for year that is > $5000.
insert into @t
(
   customerId
   ,balance
   ,customerCreateDate
   ,customerName
)
select
  c.customerId
  ,c.balance
  ,c.customerCreateDate
  ,c.customerName
from dbo.tblCustomer c with (updlock, paglock)
join dbo.tblInvoiceHdr h with (nolock)
on h.customerId = c.customerId
where
   h.tranDate between '1/1/2008' and getdate()
   and c.balance > 5000

-- Echo customer list.
select * from @t

-- Echo invoices.
select
   h.invoiceNumber
   ,h.invoiceSysId
   ,h.tranDate
   ,h.invoiceTotal
from @t t
join dbo.tblInvoiceHdr h
on h.customerId = t.customerId

The code above caches the customer list into a temp table. The reason why this query performs much better is it doesn’t have to fetch the data from dbo.tblCustomer again as in the first example. Obviously, this is a trivial example. However, if the base table that you’re fetching from has a lot of rows, caching the data in this manner can be more efficient.

6. DO use the appropiate JOIN hint.

Most of the time, SQL Server is smart enough to use the most efficient type of JOIN to retrieve the data it needs. There are times however where you may want to suggest to SQL the type of join to use. SQL uses three types of JOINs. They are as follows:

  • LOOP JOIN
  • HASH JOIN
  • MERGE JOIN
  • A loop join is one in which the records in one table are scanned for matches using every record in another table. This type of join is called a loop join because it’s akin to searching for a match by looping over a set of values to do the comparison. This type of join is CPU-intensive and is the slowest of the three types of joins.

    A hash join on the other hand uses a hash table instead of a loop to find the matches, hence its name. In this method, a hash table is built to hold the values of the smaller table. This is called the build phase. Once the hash table is built, the rows in the larger table are used as inputs to scan the hash table for matches. This is termed the probe phase. Any matches are then sent back to the client. This type of join is faster than the LOOP JOIN but it is memory-intensive.

    A merge join is used when the columns to be compared have a unique index in both tables where the sort order is the same. In other words, the inputs being compared are already sorted in exactly the same order. A row by row comparison is made between the sorted inputs. If a match is found it is sent to the client. If the two rows don’t match, the row that has the lesser value is discarded. SQL then advances to the next row in both tables and repeats this process until there is no more rows to compare. This type of join is the fastest method since the comparision is being done on rows that are identically sorted.

    So how do you specify a JOIN hint in a SELECT statement? Take a look at the code examples below:

    LOOP JOIN EXAMPLE

    select
       a.col1, b.col1
    from tableA a
    inner loop join tableB b
    on b.col1 = a.col1
    

    HASH JOIN EXAMPLE

    select
       a.col1, b.col1
    from tableA a
    inner hash join tableB b
    on b.col1 = a.col1
    

    MERGE JOIN EXAMPLE

    select
       a.col1, b.col1
    from tableA a
    inner merge join tableB b
    on b.col1 = a.col1
    

    7. DON’T use SELECT INTO if possible.

    Often times it’s necessary to cache data into a temporary table. This is true if you intend to re-use the data often in your script or stored procedure. Some DBAs (including myself) often take the lazy way out and write code to store the data using this form:

    select col1, col2, col3 into #t from dbo.someTable
    

    While there’s nothing wrong with using this form, there is a drawback with using this method. The SELECT…INTO statement is doing two operations here. It is creating a temporary table (#t) and then inserting the data from dbo.someTable into #t.

    Keep in mind that every MSSQL statement that is not a SELECT statement creates an implicit transaction. It makes sense when you think about it. If a given statement fails for any reason, MSSQL has to rollback any changes as if the statement never executed. Therefore, the first part of the SELECT…INTO operation, the one that is creating the temporary table #t, is also wrapped in a transaction until the INSERT portion is completed.

    When an object is being modified, MSSQL places a schema modification lock on the object to ensure that no process can access the object until the modification has completed. That is also true of temp tables. However, MSSQL also places an additional schema lock on tempdb itself. This lock on tempdb can block other processes that need to create temp tables in as well. To get around this issue it is better to create the temp table first and then do an explicit insert as the code below demonstrates:

    ----- Create temp table. -----
    -- While the temp table is being created, a schema lock on tempdb is enforced.
    create table #t
    (
       col1 int not null
       ,col2 varchar(50) not null
       ,col3 numeric(10,2) not null
    )
    
    ----- Now do the INSERT. -----
    -- Schema lock on tempdb is no longer in affect here!
    insert into #t
    (
       col1
       ,col2
       ,col3
    )
    select col1, col2, col3 from dbo.someTable
    

    The code above accomplishes the same task but the schema lock placed on tempdb is so short-lived that other processes won’t be affected.

    8. DON’T use FUNCTIONs if possible.

    Let’s face it, as programmers and DBAs, we use functions all the time. For the most part, using functions is not a big issue; unless it’s part of a large data-retrieval. You see, functions by nature are executed one row at a time, one expression at a time. Therefore, if you have a function that is part of a SELECT statement, the function is executed for each row of data that’s returned. Consider the following code:

    ----- Get data from table. -----
    select
       customerId
       ,customerName
       ,isnull(onCreditHold, 0) as onCreditHold
    from dbo.someTable (nolock)
    

    In the select statement above, the function ISNULL( ) is being called for each row that is returned. That’s not an issue if we’re talking about a small number of rows. If however, you’re returning a large amount of rows, it could impact performance since the column ONCREDITHOLD has to be evaluated for each row.

    Instead, you may want to create a DEFAULT constraint so that ONCREDITHOLD is 0 by default for inserts. That way you wouldn’t have to cast a NULL value to a 0.

    9. DON’T use CASE statements if possible.

    CASE statements are like FUNCTIONs in that they are examine for each row. Avoid using them if at all possible. If not keep them as simple as possible. Forgo using nested CASE statements.

    10. DON’T put SELECT statements inside a TRANSACTION.

    It doesn’t make sense to me but I see a lot of T-SQL code where SELECT statements are wrapped by a transaction. You should use transactions when inserting, deleting or updating data. Transactions help you to rollback any changes that might have occurred in the event of an error. Since a SELECT query doesn’t change or affect data it’s a waste and poor coding to have them inside a transaction.

    11. DON’T use temporary tables, use table variables if possible.

    Both local and global temporary tables are more resource intensive then their table variable counterparts. Use table variables in lue of temporary tables.

    As always I welcome any feedback, corrections or insights regarding this post.
    Minh


    SQL – Stored Procedures

    May 7, 2009

    In this post we take aim at the stored procedure.  What is it?  In a nutshell, it is compiled code that resides in a given database.

    Why would you want to have code in a database and not in a traditional application in some cases?  Well, for one, if you have to manipulate data that resides in a database, it is often faster to manipulate the data at the server and then return it to the client.

    Secondly, if the logic is in the database, you can make modifications to the code in the database without having to re-deploy a new client to each and every user machine.

    Thirdly, stored procedures perform much better than a pass-through statement.  We’ll cover that in a later post.  So just how do we create a stored procedure?  Let’s look at a basic one that’s used to return a scalar value.  Take a look at the code below:

    Code

    create procedure dbo.uspTest
    as
    set nocount on
    select 1
    set nocount off
    return 0

    There are three things to take note of here. Notice the name of the procedure has the prefix “dbo.” in front of it. This tells SQL that the procedure is “owned” by the dbo user. “dbo” stands for “database owner”. In affect we’re saying that this procedure is owned by the account that has been defined as the owner of the database in which this procedure resides. By default the database owner is the “sa” account in MSSQL.

    The second thing to note is the use of the command “set nocount on”. This command instructs MSSQL to forgo keeping track of how many records were modified, deleted, inserted or returned to the client. This small addition can increase performance quite a bit so I recommend that you use it as much as possible.

    The third thing is the return code or exit code. Notice how we’re returning the value zero (0). A stored procedure is not required to have a return code but it is good practice to do so. Traditionally, a return code of zero indicates success, any other value indicates failure. That is because the return code typically indicates the error number. Therefore, returning a value of zero indicates that there was no error.

    Let’s now look at how to create a stored procedure that takes arguments and returns a dataset. Take a look at the code below:

    Code

    create procedure dbo.uspTest
    (
       @custid int
    )
    as
    set nocount on
    select * from dbo.tbCustomer where customerId = @custid
    set nocount off

    The procedure above takes one argument of type int[eger]. The name of the parameter is @custid. In MSSQL parameters are prefaced with the @ sign. The parameter type is defined as an int[eger]. The parameter name and it’s type must be seperated by at least one or more spaces.

    Look at the SELECT statement. See how it’s constructed? In this example, we’re retrieving any row from dbo.tblCustomer where the column customerId contains the value that @custid evaluates to. To use this procedure to retrieve customer records where the @custid value is 50289, we would execute the procedure using one of the following forms:

    Example 1

    exec dbo.uspTest 50289

    Example 2

    exec dbo.uspTest @custid = 50289

    Example 3

    declare @custIdValue int
    set @custIdValue = 50289
    exec dbo.uspTest @custid = @custIdValue

    Example 4

    declare @custId int
    set @custId = 50289
    exec dbo.uspTest @custid = @custId

    All of the above execute statements will return the same dataset. However, example 2 uses a named parameter. Named parameters allow you to pass in the arguments in any order so long as you indicate the parameter that the argument is mapped to. In example 2, we’re indicating that the value 50289 maps to parameter @custid. I favor named parameters since you don’t have to worry about passing in the arguments in the exact position required by the procedure.

    Example 3 also uses a named parameter but passes in the value as a variable.

    Example 4 deserves special attention. Notice that in this example, we’ve declared a variable that has the same name as the parameter. Believe it or not, this will execute without error. The reason is that MSSQL is smart enough to know that the token to the LEFT of the equal sign is the parameter and that the token to the RIGHT of the equal sign is the argument.

    Default Parameter Values

    Often times, it’s convenient to supply a default to a parameter. To specify a default value for a parameter you simply add the equal sign to the right of the parameter name with a default value. Below is an example:

    Code

    create procedure dbo.uspTest
    (
       @custid int
       ,@creditHold bit = 0
    )
    as
    set nocount on
    
    select *
    from dbo.tbCustomer
    where
       customerId = @custid
       and onCreditHold = @creditHold
    
    set nocount off

    In this example, we’ve added a second parameter named @creditHold. This flag indicates wether we should return customer records where the customer is on credit hold with us. The default value is zero (0). This indicates that if no value was specified for this parameter, the default value would be zero.

    To return customers that are not on credit hold, we could execute the procedure in one of the following forms:

    exec dbo.uspTest @custid = 50289
    
    exec dbo.uspTest @custid = 50289, @creditHold = 0
    

    Output Parameters

    A stored procedure can also be defined with an output parameter. This type of parameter is one in which any changes made to the argument passed in will be retained once control has been returned to the caller. It is in essence MSSQL way of specifying a parameter that is passed by reference.

    Parameters in MSSQL stored procedures are by default input only. That is, any changes made to the argument by the procedure is not retained when control passes back to the client. The example below shows how to define an output parameter for our procedure:

    Output Parameter Example

    create procedure dbo.uspTest
    (
       @custid int
       ,@creditHold bit = 0
       ,@accountBalance numeric(19,2) output
    )
    as
    set nocount on
    select * from dbo.tbCustomer where customerId = @custid and onCreditHold = @creditHold
    select @accountBalance = amountOwed from dbo.tbCustomer where customerId = @custid
    set nocount off

    If we were to call the procedure like so:

    declare @amountOwed numeric(19,2)
    set @amountOwed = 0
    
    exec dbo.uspTest
       @custid = 50289
       ,@creditHold = 0
       ,@accountBalance = @amountOwed output
    
    select @amountOwed
    

    Assuming that the given customer has a balance on the account, you would see that the value of @amountOwed is not zero (0), which is what we initialized it to, but rather, what the customers remaining balance is.

    Well, that’s it for this post. There is so much more to stored procedures that I didn’t cover. I hope however that I’ve been able to shed some light on what a stored procedure is and how to use them. I recommend that you do further reading on them using MSSQL’s on-line help.

    Sincerely,
    Minh


    SQL – Using Transactions

    April 29, 2009

    In this post we look at the concept of a transaction and how to use it to maintain data integrity.  So exactly what is a transaction?  A transaction is a mechanism used by DBA/developers to treat 1 or more units of work as one.  What does that mean?  Stay with me.  Let’s look at the classic real-world example of a transaction.

    Every time you go to the bank you do one of three things:

    1. Deposit money into your account.
    2. Withdraw money from your account.
    3. Transfer funds from one account to another.

    Each of these activities to the average consumer would constitute a single unit of work.  In reality, it is made up of more than one activity.  For example,  when you are transferring funds from one account to another there are multiple steps (activities) that take place.  Let’s focus on the core activities involved in this example.

    Funds are withdrawn from the first account (account 1) then those funds are deposited into the second account (account 2).  Notice that these are two units of work.  However, they must be treated as one unit of work to guarantee a stable transaction.  What do I mean by that?

    Well, suppose that these two activities are not treated as one unit of work, that is, each step can fail and it wouldn’t affect the other step.  That’s bad.  What happens if the withdraw from account 1 succeeded but the deposit to account 2 failed?  The bank would show that you took out, say $100 dollars from account 1 but you would not see the $100 dollars in account 2.  You just lost money.

    Let’s look at the flip side.  The withdraw from account 1 fails but the deposit to account 2 succeeded.  You just gained money and the bank just lost money.  Not good for them.

    This is where the transaction comes in.  The bank’s system would wrap these two steps into a transaction.  It forces both steps to succeed in order for it to be considered a valid transaction.  If either step fails, the transaction is rolled back as if neither one ever took place.  So how would we use a transaction?  Take a look at the code below.  It’s a simple example of moving a record from one database to another.  This is typical when you want to archive a record.

    Code

    declare @emsg nvarchar(2048), @e int
    
    set @emsg = ''
    set @e = 0
    
    begin try
       begin transaction trnArchive
    
       -- Copy row in tbCust to tbCustArchive
       insert into dbArchive..tbCustArchive
       (
          custid
          ,custname
          ,address1
          ,address2
          ,city
          ,state
          ,zip
          ,phone
          ,fax
       )
       select
          custid
          ,custname
          ,address1
          ,address2
          ,city
          ,state
          ,zip
          ,phone
          ,fax
       from dbCurrent..tbCust with (updlock, paglock)
       where
          custid = 12345
    
       -- Now delete the row in tbCust.
       delete t
       from dbCurrent..tbCust t with (updlock, paglock)
       where
          t.custid = 12345
    
       -- commit transaction.
       commit trnArchive
    
    end try
    begin catch
    
       -- capture error info.
       set @emsg = error_message()
       set @e = error_number()
    
       -- roll back transaction.
       rollback transaction trnArchive
    
       -- raise error to caller.
       raiserror('Error: %s.  Code: %d.', 16, 1, @emsg, @e)
    end catch
    

    The code above wraps both the INSERT and DELETE operation inside a BEGIN TRY and END TRY block. This allows us to detect exceptions when and if they occur. In addition, since both operations are wrapped inside a TRANSACTION block, we can then ROLLBACK the transaction inside the BEGIN CATCH block.

    This ensures that we don’t have a copy of the record in both tables should the INSERT succeed but the DELETE fails. Now that you know how to use transactions, I must point out one important fact. KEEP YOUR TRANSACTIONS AS SHORT AS POSSIBLE. In other words, don’t try to process a large amount of records in one transaction.

    In the code example above, if we were to try and move 200000 customer records to archive in a high I/O environment, blocking might occur since we’re issuing a UPDLOCK and PAGLOCK when doing the INSSERT and DELETE. Process smaller amounts of record for each transaction to minimize blocking.

    Minh


    SQL – Using Dirty Reads

    April 28, 2009

    In this post we are going to look at dirty reads. Specifically, we’re going to look at why and when we might want to implement a dirty read and if so, what are the pitfalls of using them and how can we address them.

    Remember that a dirty read is one in which we are requesting data that may not have been committed yet.  What we’re saying to SQL is “Please return the data I asked for.  I don’t care if it’s the very latest snapshot at the time you pull it.”.

    So why would we ever want to implement a dirty read?  The reasons vary depending on your environment, application architecture and business model.  A good example is when you are running a report against a database that has moderate to high updates.  Remember that the default setting in MSSQL for reads is to use a shared-read lock.

    This means that if another SPID has been granted an update lock on a page that you need to read from, your SPID will be blocked until the UPDATE statement from the other SPID has been completed or rolled back.

    In this case, it may be more desirable to have the report return quickly with the possiblity that the data is dirty than to have it block (you probably won’t know how long it will be blocked) and keep the user waiting.

    Yet another example is when you need to return data to an application that will then load the data into some type of grid for the user to select from.  Again, the task here is to show the data.  The user may not even select any record to process or view.  This is often the case when an application needs to initalize some sort of data grid.

    What happens however when the user does want to edit the record based on what he/she sees in the grid?  They’re making a decision based on data that might have changed already.  We would need a mechanism to determine if the row they selected has really changed.  If it has then we need to notify them and refresh the row.  Afterall, the refresh may indicate to them that they don’t need to make the edit.

    So how can we detect that data has changed if we’re using dirty reads?  Well, there are several methods to detect change in data.  The first method and least desirable is to use a timestamp.  A timestamp is a type of column that will get updated with a unique binary number every time a record is updated.  It’s often used as a means to “version-stamp” one or more rows in a table.

    So assuming that a table has a timestamp column called current_stamp, here’s the code to fetch the data along with the timestamp value. We’ll need it to compare against the current timestamp of the row when the UPDATE request is sent.

    Code

    -- Get row data along with timestamp.
    select
       custid
       ,custname
       ,address1
       ,address2
       ,city
       ,state
       ,zip
       ,phone
       ,fax
       ,current_stamp
    from dbo.someTable (nolock)
    where
      custname = 'acme, inc.'
    

    Now assume that the application stores the custid and current_stamp values that were returned. Further, assume that the user has made some changes to the data via the application. The application then makes a call to a stored procedure to save the change.

    We could implement a version-check inside the proc to see if the data being saved is out of date. Below is the code:

    Code

    create procedure dbo.uspSave
    (
       @custid int
       ,@custname varchar(30)
       ,@address1 varchar(50)
       ,@address2 varchar(50)
       ,@city varchar(50)
       ,@state varchar(2)
       ,@zip varchar(9)
       ,@phone varchar(20)
       ,@fax varchar(20)
       ,@last_stamp timestamp
    )
    as
    set nocount on
    declare @currentStamp timestamp, @e int
    set @e = 0
    
    select
       @currentStamp = a.current_stamp
    from dbo.someTable a with (updlock, paglock)
    where
       a.custid = @custid
    
    if (@currentStamp = @last_stamp)
    begin
       -- No change in data.  Allow update.
       update a
       set
          a.custname = @custname
          ,a.address1 = @address1
          ,a.address2 = @address2
          ,a.city = @city
          ,a.state = @state
          ,a.zip = @zip
          ,a.phone = @phone
          ,a.fax = @fax
       from dbo.someTable a with (updlock, paglock)
       where
          a.custid = @custid
    end
    else
    begin
       raiserror('Error.', 16, 1)
       set @e = @@error
    end
    return @e
    

    The key is the IF statment where we compare the timestamp at the time of the read to the timestamp of the record as it exists before the UPDATE statement is run. Timestamps are a nice feature because the value is updated every time the record is updated. If the timestamp is the same, we know that the data hasn’t changed, otherwise we toss an error.

    Timestamps are a nice feature but it means that we’d have to create a TIMESTAMP column in the table. The drawback is when you’re dealing with a very large table.

    If your table has 20M rows, there would be a unique timestamp for each record. This would be true even if the record is never updated after the initial INSERT. That’s a lot of data overhead to have in order to detect change.

    Another method is to use the CHECKSUM() function. Using this method, we would do a checksum of the entire record (or a subset of columns depending on your needs) during the read, then do another checksum of the record prior to the update. We then compare the values of the two checksums to see if they are the same.

    For example, using the SELECT statement and stored procedure above, we could modify them like so:

    Code

    -- Get row data along with value of checksum.
    select *, checksum(*)
    from dbo.someTable (nolock)
    where
      custname = 'acme, inc.'
    

    create procedure dbo.uspSave
    (
       @custid int
       ,@custname varchar(30)
       ,@address1 varchar(50)
       ,@address2 varchar(50)
       ,@city varchar(50)
       ,@state varchar(2)
       ,@zip varchar(9)
       ,@phone varchar(20)
       ,@fax varchar(20)
       ,@last_checksum int)
    as
    set nocount on
    declare @currentChecksum int, @e int
    set @e = 0
    
    select
       @currentChecksum = checksum(a.*)
    from dbo.someTable a with (updlock, paglock)
    where
       a.custid = @custid
    
    if (@currentChecksum = @last_checksum)
    begin
       -- No change in data.  Allow update.
       update a
       set
          a.custname = @custname
          ,a.address1 = @address1
          ,a.address2 = @address2
          ,a.city = @city
          ,a.state = @state
          ,a.zip = @zip
          ,a.phone = @phone
          ,a.fax = @fax
       from dbo.someTable a with (updlock, paglock)
       where
          a.custid = @custid
    end
    else
    begin
       raiserror('Error.', 16, 1)
       set @e = @@error
    end
    return @e
    

    Yet another method would be to create a table that would function as your “lock” table. As users access data, a “lock” record could be created. Once the “lock” is released, the “lock” record could be deleted from your table. If the “lock” record to be created already exists in your table, you would raise an error or return an error code indicating that someone is currently viewing the record.

    Since the user can’t view the record unless the “lock” can be created, there’s little chance that they would be looking at a “dirty” record.

    IMHO, if your model is one that needs to support high concurrency with minimal blocking and you or your staff control the code, it’s worth the effort to implement provided you have the time and resource.

    Some of you might have surmised (correctly so) that it’s a lot of work to implement dirty reads into your application architecture or business model. You need to weigh the non-blocking nature of dirty reads and the cost of implementing them. I have only touched on some of the techniques that can be used. Each has it’s own strengths and weaknesses. You’ll need to investigate and choose the one that works best for your environment.

    As always, I welcome any counter-point(s) or comments regarding my posts.
    Minh


    SQL – Loosing Your Identity

    April 25, 2009

    In this post we’ll discuss the use of IDENTITY columns in MSSQL. For those who haven’t used an identity column before, it’s simply a column that contains system generated sequential numbers. It is akin to Microsoft Access’ AutoNumber column.

    First let’s look at how you would define an identity column. Take a look at the DDL statement below:

    Code

    create table dbo.tbUser
    (
       username varchar(10) not null default ('')
       ,rowid int identity(1,1) not null
    )
    

    Notice the column named rowid. It’s defined as an integer column (int) that will contain auto-generating values. The generation of these numbers will be taken cared of for you by SQL. Note that although I’ve used the int data type for my ordinal values, SQL allows you to use any ordinal type. Therefore, you could use decimal as the type for an IDENTITY column as well.

    The tokens identity(1,1) informs SQL that the column will be used as an identity column and that the seed value for the first record in this table will be 1 and that every subsequent number generated will be 1 greater than the last. SQL stores this information internally so you don’t have to worry about it.

    Let’s do some inserts into the table above and see what we get. Run the code below as many times as you like:

    Code

    insert into dbo.tbUser
    (
       username
    )
    values
    (
       'jsmith'
    )
    

    Now do a SELECT against this table by running the following:

    Code

    select * from dbo.tbUser (nolock)
    

    Notice how SQL has generated a unique value for the rowid column? You might be asking “How can I retrieve the value that SQL generated after the INSERT?”. The answer is to use one of two built-in functions supplied by MSSQL. The first is the @@IDENTITY function. The second is the scope_identity() function.

    Let’s talk about the @@IDENTITY function first. This function allows you to retrieve the last value that was inserted into a table. So to get the value of the rowid column for a record that was just inserted into the table above, we would write the following code:

    Code

    declare @newRowId int
    set @newRowId = 0
    
    insert into dbo.tbUser
    (
       username
    )
    values
    (
       'jsmith'
    )
    
    set @newRowId = @@identity
    select @newRowId
    

    That all seems good. However, the problem with the @@IDENTITY function is that it returns the last value used from the last insert made to ANY table. Let me elaborate. Suppose you had a TRIGGER on the dbo.tbUser table such that every time an INSERT was done, the TRIGGER would do an insert into another table (call this table dbo.tbLoginAudit) and this table also had an identity column. What do you think would happen when you make a call to @@IDENTITY? Well, let’s find out. First let’s define the additional dbo.tbLoginAudit table. Run the following code to create the table:

    Code

    create table dbo.tbLoginAudit
    (
       username varchar(10) not null
       ,logindate datetime not null default (getdate())
       ,rowid int identity(1,1) not null
    )
    go
    

    This table will be used by the trigger below to record when a user logs into some system.

    Code – Create the TRIGGER

    create trigger dbo.trgIns on dbo.tbUser for insert
    as
    set nocount on
    insert into dbo.tbLoginAudit
    (
       username
    )
    select
      i.username
    from inserted i
    set nocount off
    go
    

    The trigger above will now automatically insert a record into the dbo.tbLoginAudit table every time a new row is inserted into the dbo.tbUser table. Keep in mind that the dbo.tbLoginAudit table also has an IDENTITY column. Now before we do any more inserts into the dbo.tbUser table, let’s run a SELECT statement to find out what the current value for the rowid column is for this table. Run the code below and take note of the highest value returned:

    Code

    select * from dbo.tbUser (nolock) order by rowid
    

    Now run the code below. What you would normally expect is that the value returned by the code below would be the next value. So if your current value for the column is 22, you’d expect the value returned by the code below to be 23. Let’s try it out.

    Code

    declare @rowid int
    set @rowid = 0
    
    insert into dbo.tbUser
    (
       username
    )
    values
    (
       'jsmith'
    )
    
    set @rowid = @@identity
    select @rowid
    

    Did you get the expected value? My bet is you didn’t. The reason is that the trigger we defined earlier on the dbo.tbUser table fires AFTER the insert statement. The trigger ALSO does an INSERT into the dbo.tbLoginAudit table, which ALSO has an IDENTITY column.

    Remember, I said that the problem with the @@IDENTITY function is that it returns the LAST identity value used. Well, since the TRIGGER fires after the INSERT statement, it returns the IDENTITY value that was used for the dbo.tblLoginAudit table NOT the dbo.tbUser table.

    So how might we get around this issue? The solution is to use the other function I mentioned at the beginning of this post. Remember the scope_identity() function? This function doesn’t have the same issue as @@IDENTITY. That is because as its name implies, it is scope aware. This function will return the last IDENTITY value used for any INSERT that has the same scope as the executing statement.

    A trigger executes in a different scope from the executing statement. The fact that @@IDENTITY returned the value from the trigger implies that it’s scope is global. scope_identity() is local, therefore, it will return the IDENTITY value that was inserted in the INSERT statement’s scope. Try out the code below to see for yourself:

    Code

    declare @rowid int
    set @rowid = 0
    
    insert into dbo.tbUser
    (
     username
    )
    values
    (
     'jsmith'
    )
    
    set @rowid = scope_identity()
    select @rowid
    
    select * from dbo.tbUser (nolock)
    

    See how the IDENTITY value that is returned is NOT the one that was used in the dbo.tbLoginAudit table, but rather, the dbo.tbUser table?

    Have fun with SQL!
    Minh


    Detecting Orphaned Rows

    April 24, 2009

    This post is a continuation of the post on SQL joins. Hopefully you have a better understanding of the types of joins available to you. Now it’s time to see how we can use them to detect orphaned rows.

    An orphaned row is a child row that has no parent row. Going back to our example of the invoice metaphor, an invoice is usally broken down to two or more tables. The header table contains rows that carry information about the invoice as a whole. It would contain columns that usually have the following type of data:

    • CUSTOMER_ID
    • INVOICE_ID
    • INVOICE_NUMBER
    • INVOICE_DATE
    • INVOICE_REF

    The invoice detail table usually contains information specific to an invoice line item. An example of some columns that can be found in this type of table are:

    • INVOICE_LINE_ID
    • INVOICE_ID
    • LINE_NUMBER
    • ITEM_NUMBER
    • QTY
    • UNIT_PRICE
    • LINE_TOTAL

    In the example above it is often said that the invoice header is the parent of the invoice detail and that the invoice detail is the child of the invoice header. Normally, this relationship is easy to enforce. Suppose however that the relationship hasn’t been enforced and you suspect that you have some orphaned detail records. How could you go about finding them?

    This is where the OUTER JOINS that we covered in our previous post comes in handy. Again, using our previous test data that was defined in the previous post let’s write a query that finds any invoice detail line that does not have a invoice header associated to it. Run the query below:

    Code

    select
       det.*
    from dbo.invHdr hdr (nolock)
    right join dbo.invDet det (nolock)
    on det.invnum = hdr.invnum
    where
       hdr.invnum is null
    

    Notice that it returns one row. If you were to visually inspect the data in the detail table you would find that this is the row that is missing an invoice header. Notice that we are doing a RIGHT JOIN here. Remember that a RIGHT JOIN returns all matching rows and all non-matching rows from the right. Keep in mind that the header table is the “left” table and that the detail table is the “right” table. How is it then that the query above returned only one row?

    The answer lies in the WHERE clause that was specified. It filters out the data after the join is made and tells SQL to return only the rows from the detail table where there is no matching header row. This is an example of an orphaned record.

    To detect header rows that have no detail (these aren’t considered orphaned rows because the term applies to child tables) we can run the query below:

    Code

    select
       hdr.*
    from dbo.invHdr hdr (nolock)
    left join dbo.invDet det (nolock)
    on det.invnum = hdr.invnum
    where
       det.invnum is null
    

    Now let’s assume we want to find any rows in the invoice header or invoice detail table that is missing it’s associated parent or child row. We could write the query like this:

    Code

    select
       hdr.*, det.*
    from dbo.invHdr hdr (nolock)
    full outer join dbo.invDet det (nolock)
    on det.invnum = hdr.invnum
    where
       hdr.invnum is null
       or det.invnum is null
    

    Here we have instructed SQL (via the WHERE clause) to include any row where the invoice number is not in the header table or the detail table. This in affect returns all rows where either a child is missing it’s parent or where the parent has no detail.

    I hope you find this post and it’s companion post useful. Please feel free to make any comments or suggestions regarding this post.

    Minh


    SQL Joins

    April 23, 2009

    What is a JOIN?  Simply put, it is an operation that combines records from two or more tables.  There are three types of joins in SQL.  They are as follows:

    • INNER JOIN (most common type)
    • OUTER JOIN
    • CROSS JOIN

    Let’s look at an INNER JOIN.  What does it do?  An inner join is an operation that combines two or more tables into one resultset based on matches found in one or more columns in each of the participating tables.  For example, let’s say you have an INVOICE_HEADER table and an INVOICE_DETAIL table.  Assume that the two tables are defined as follows:

    Create Tables

    create table dbo.invHdr
    (
      invnum varchar(5) not null
      ,customer varchar(30) not null
      ,invdate datetime not null default (getdate())
    )
    go
    
    create table dbo.invDet
    (
      invnum varchar(5) not null
      ,trandate datetime not null default (getdate())
      ,itemnum varchar(10) not null
      ,itemdesc varchar(100) not null
      ,qty numeric(10, 2) not null default (0)
      ,unitprice numeric(10, 2) not null
    )
    go

     

    Now let’s create some test data for our examples:

    Populate Tables

    ----- Insert header records. -----
    insert into dbo.invHdr
    (
      invnum, customer
    )
    values
    (
      '00001', 'CUSTA'
    )
    
    insert into dbo.invHdr
    (
      invnum, customer
    )
    values
    (
      '00002', 'CUSTA'
    )
    
    insert into dbo.invHdr
    (
      invnum, customer
    )
    values
    (
      '00003', 'CUSTB'
    )
    
    ----- Insert detail records. -----
    insert into dbo.invDet
    (
      invnum
      ,trandate
      ,itemnum
      ,itemdesc
      ,qty
      ,unitprice
    )
    values
    (
      '00001'
      ,'4/1/2008'
      ,'I-BEV1'
      ,'COKE, 6PACK'
      ,3
      ,5
    )
    
    insert into dbo.invDet
    (
      invnum
      ,trandate
      ,itemnum
      ,itemdesc
      ,qty
      ,unitprice
    )
    values
    (
      '00005'
      ,'4/10/2008'
      ,'I-BEV2'
      ,'COKE, 12PACK'
      ,1
      ,9
    )

    INNER JOIN

    Now that we’ve got some test data, let’s see how to do each type of join and look at the data it returns. First up is the INNER JOIN. Run this select statement:

    ----- Inner join. -----
    select
      a.*, b.*
    from dbo.invHdr a (nolock)
    join dbo.invDet b (nolock)
    on b.invnum = a.invnum

    INNER JOIN example

    INNER JOIN example

    Notice that the SELECT statement has returned a single row where a matching invoice number (invnum) was found in both the header and detail table.

    LEFT JOIN

    Now let’s look at what a LEFT JOIN would return. Run this statement:

    -- Left join.
    select
      a.*, b.*
    from dbo.invHdr a (nolock)
    left join dbo.invDet b (nolock)
    on b.invnum = a.invnum

    LEFT JOIN example

    LEFT JOIN example

    Notice how the LEFT JOIN statement returns all matching rows from both tables and all rows from the “left” table regardless of whether there are rows in the “right” table. The image below depicts this graphically.

    LEFT JOIN depiction

    LEFT JOIN depiction

    The area in the middle are the matching rows. The area on the left (blue) are all the rows from the “left” table where no matches were found in the “right” table. However, since the type of JOIN is a LEFT JOIN, the rows from the left table are returned as well. The area on the right (white) are the rows from the “right” table. Only the colored areas are returned as part of the result.

    You might be asking “How do I know which is the left table versus the right table?” The answer is simple. When you do any type of join, each table that is added to the join is considered to be the right table in relation to the table before it. For example, suppose we were to JOIN tables X, Y, and Z like so:

    X –> Y –> Z

    Here, X is considered to be the “left” table in relation to Y. Likewise, the table Y is considered to be the “left” table in relation to Z. It follows therefore that X is considered to be the “left” table in relation to Z. Some DBAs are under the misconception that the equal sign determines which table is “left” or “right”. Not so. For example, the two SELECT statements below would return identical results:

    SELECT statement 1

    select a*, b.*
    from dbo.invHdr a (nolock)
    left join dbo.invDet b (nolock)
    on a.invnum = b.invnum

    SELECT statement 2

    select a.*, b.*
    from dbo.invHdr a (nolock)
    left join dbo.invDet b (nolock)
    on b.invnum = a.invnum

    As you can see, the same results are returned. The ON statement simply states the relationship between the two tables. That is, that they are related via the invnum column.

    RIGHT JOIN

    This type of join is simply the inverse of the LEFT JOIN. A right join returns all rows where matches were found in both tables as well as all rows from the right table. Here is a graphic that shows this concept:

    RIGHT JOIN depiction

    RIGHT JOIN depiction

    If we were to re-write the SELECT statement used in the LEFT JOIN example to be a RIGHT JOIN, it would look like this:

    select a.*, b.*
    from dbo.invHdr a (nolock)
    right join dbo.invDet b (nolock)
    on b.invnum = a.invnum
    

    Notice that the only change that was made was to change the word LEFT to RIGHT. If you were to run this statement, you would see a result set similar to the LEFT JOIN example graphic. The only difference would be that for some rows, the columns for the LEFT table would be NULL.

    Keep in mind that is because we are asking for all rows that match between the two tables and ALL rows from the right table regardless if a match was found on the left table.

    FULL OUTER JOIN

    Now let’s look at a full outer join. This type of join returns all matching rows and all non-matching rows. Below is a depiction of this concept. As you can see, all the colored sections indicate the rows that would be returned.

    FULL OUTER JOIN depiction

    FULL OUTER JOIN depiction

    Run the following statement below to see the results:

    Code

    select
       a.*, b.*
    from dbo.invHdr a (nolock)
    full outer join dbo.invDet b (nolock)
    on b.invnum = a.invnum
    

    As you can see, the rows that are returned include both the matching rows and the non-matching rows.

    FULL OUTER JOIN example

    FULL OUTER JOIN example

    CROSS JOIN

    A CROSS JOIN is one in which all possible combinations of the two tables are returned. Run the code below to see the output:

    select
       a.*, b.*
    from dbo.invHdr a (nolock)
    cross join dbo.invDet b (nolock)
    

    As you can see from the graphic below, all the combinations of the records from both tables are returned.

    CROSS JOIN example

    CROSS JOIN example

    In my next post I will show how you can use the LEFT JOIN, RIGHT JOIN or FULL OUTER JOIN to help you fined orphaned records in your tables. Orphaned records are a violation of database normalization rules and should be addressed if and when they are found.

    Check back in a couple of days. Any feedback and/or comments are welcomed.
    Minh

    My post on detecting orphaned records is available here


    SQL – Unions

    April 15, 2009

    This post is in response to Mike’s request regarding UNIONs. Thanks Mike for the suggestion.

    What is a union?  A union is a SQL command that combines two resultsets together in one heterogenous view.  Let me elaborate.  In many database designs, it is common to partition data horizontally.  A classic example is dealing with sales data.  As a database accumulates sales data over several years, a DBA might split the data horizontally so that the current year’s data is in one table (we’ll call this table CURRENT_SALES) and all other sales data that is older than 1 year is in an archive table (HISTORICAL_SALES).

    Let’s say that you have a need to provide to the executive team sales data that encompasses the current year and sales data going back to 3 years prior to current year.  Of course this needs to be on the same report.  One way you might want to do this is to create a temporary table and insert the data from the CURRENT_SALES table then do an insert using data from the HISTORICAL_SALES table.  Finally, you would return the data in the form of a SELECT. Here’s an example:

    Assume that your current sales table has the following structure:

    create table dbo.current_sales
    (
      customer_id int not null
      ,sale_date datetime not null default (getdate())
      ,revenue numeric(19, 5) not null
    )
    

    Further assume that your historical sales table has the following structure:

    create table dbo.historical_sales
    (
      customer_id int not null
      ,sale_date datetime not null
      ,revenue numeric(19, 5) not null
      ,archive_date datetime not null default (getdate())
    )
    

    One way to combine the data for the report might be something like this:

    -- Create a temporary table.
    create table #tSalesData
    (
       customer_id int not null
       ,sale_date datetime not null
       ,revenue numeric(19, 5) not null
       ,row_display_order int identity(1,1) not null
    )
    go
    
    -- Get sales data from current year.
    insert into #tSalesData
    (
       customer_id
       ,sale_date
       ,revenue
    )
    select
       customer_id
       ,sale_date
       ,revenue
    from dbo.current_sales
    order by
       sale_date
    
    -- Get sales data from historical table.
    insert into #tSalesData
    (
       customer_id
       ,sale_date
       ,revenue
    )
    select
       customer_id
       ,sale_date
       ,revenue
    from dbo.historical_sales
    where
       sale_date between dateadd(yy, -4, getdate()) and getdate()
    order by
       sale_date
    
    -- Return data.
    select * from #tSalesData
    

    Keep in mind that this is a trivial example and one that I would not implement in a production environment. It is for illustrative purposes only. Even so, this example suffers from two main issues.

    First, it uses a temporary table to store data that is fetched from the current sales table and historical sales table. While there is nothing functionally wrong with using a temporary table, it does come at a cost.  Any time you use a temporary table there are I/O costs inherent in any INSERT, UPDATE or DELETE operation performed on the table.

    Secondly, it uses an ORDER BY statement to order the data by sale_date.  In the above code example, we have four occurrences of I/O cost:

    • The SELECT statement to fetch data for current year.
    • The SELECT statement to fetch data from history.
    • The INSERT statement to store the data.
    • The SELECT statement to return the data.

    While this may be acceptable for a relatively small dataset, it’s not feasible when returning larger datasets.  The example below shows how to accomplish the same result using a UNION:

    select
       customer_id
       ,sale_date
       ,revenue
    from dbo.current_sales
    UNION
    select
       customer_id
       ,sale_date
       ,revenue
    from historical_sales
    where
       sale_date between dateadd(yy, -4, getdate()) and getdate()
    order by sale_date
    

    The example above would incur considerably less I/O since there are only two read operations.  The previous example had four I/O operations. Furthermore, recall that in the initial example two ORDER BY clauses were used to order the data. In the example above, note that only one ORDER BY clause is used.

    Note that in the example above duplicates would be removed since we did not use the ALL argument. Using the ALL argument preserves duplicates.

    As a final note, be aware that when using the UNION statement, it is required that all datasets that participate in the UNION must have the same column names and similiar data types.

    Some of you may have come to the conclusion that a UNION is similiar to a JOIN in that both allow you to fetch and combine data from two or more tables. I will have a post on JOINs and their usage next week and finish up with a post on their differences.

    As always, I welcome any comments, insights and/or corrections.

    Minh