When you submit a query to SQL Server – before the query is executed – SQL Server tries to optimize the way data is retrieved from the tables and indexes to create an execution plan that is as effective as possible. One of the decisions the optimizer in SQL Server has to make is which of the join methods (if any) to use.
There are three join methods: Nested Loops Join, Hash Join and Merge Join.
They all have their relative strengths and weaknesses, which will be covered in this blog post.
Nested Loops Join
Before SQL Server 7.0, this was the only join method used by SQL Server. Generally speaking, this is the preferred join method in OLTP systems. As the name implies, SQL Server uses one of the data sets to do an iteration, and for each row a search is done in the other data set to see if the key matches. If you look at the execution plan, the iterating data set is displayed as the upper, and the data set that the search is done on is the lower data set.
Obviously, for this join method to be effective, there must be an index on the table that is searched. Unless there are very few rows. A typical example is a join between a primary and a foreign key. The primary key is indexed by default, but the foreign key is not. So if the foreign key needs to be searched a lot, the benefits from an index can be huge. This join method is preferred by SQL Server if the data sets involved are small (for instance, using the primary key as a filter in the query), so a missing index may be noticed by the fact that another join method is used.
Hash Join
This join method is often used by SQL Server when there are indexes missing on the joining columns, or when the data sets are large and unsorted. In replacement of a useful index, hash tables (with a hash key for the join columns) are created on both data sets. The smaller of the data sets is used as input, and a hash key is created on all rows. On the remaining data set, the hash key is calculated on each row, and a seek is done in the input to see if there are matching rows.
Because of the large amounts of data that can be involved in hash joins, tempdb can be used by SQL Server if the memory available is not enough.
In a data warehouse solution, hash joins may not be a problem if large amount of data is returned to the client. In an OLTP solution however, hash join may be an indicator of a missing index.
Merge Join
A merge join requires both data sets to be sorted on the joining columns. The data sets are compared, and matching rows are returned.
If necessary, SQL Server may sort the data before the merge join operation. The sort operation can be very costly, so the ideal situation for a merge join is a sorted index. Merge joins can be very effective however, so if you see a sort operator in the execution plan, it is not necessarily a bad thing.
Tempdb may be used for sorting operations.