Add missing lookup in SSIS

Lookup

Before a fact table is loaded, a common pattern is to do a lookup in SSIS before the loading. The lookup can be some key from a dimension for instance, or maybe a repository of primary keys in a MDM solution. But what if the lookup doesn’t return a matching value? Sometimes when this happens, we want to insert new values to the lookup table (to get the key) before the loading of the fact table starts. Here I’ll show one way to handle this scenario.

A common pattern for handling late arriving lookups is to replace the lookup value with a marker, for instance -1, and simply continue the loading of the fact table. After the loading finishes, a special routine picks up unmatched rows in the fact table and replace the markers with new and correct keys from the lookup table.

Another approach is to use an OLE DB Command in the data flow to add the rows to the dimension. This is the method I’ll describe in this blog post. In this simple demonstration, I have a source table with customers. I also have a lookup table that contains the primary key that I want to use in the data flow as key in the destination table. Lastly, there is a destination table. (Imagine there are lots of data sources in this example, otherwise it doesn’t make sense of course.)

Late Arriving Lookup Example Database

Late Arriving Lookup Example Database

In the source, there will be two customers. One with the name “Tomas Lind”, and another customer with the name “A new customer!”. The customer names are used for the lookup. In the lookup table, Tomas Lind will already exist so there will be a lookup match. The customer “A new customer!” will not exist in the lookup table. Finally, the destination table will be empty. But since this solution is made to handle late arriving lookups, when the data flow finishes, both customers will be in the destination table.

To create the demo database, run the following T-SQL script:

CREATE DATABASE LAL
GO

USE LAL
GO

CREATE TABLE dbo.SourceFact
	(
		CustomerName VARCHAR(100)
	)
GO

INSERT dbo.SourceFact (CustomerName) VALUES ('Tomas Lind'), ('A new customer!')
GO

CREATE TABLE dbo.CustomerLookup
	(
		CustomerId INT IDENTITY(1,1) NOT NULL CONSTRAINT CustomerLookup_CustomerId_PK PRIMARY KEY CLUSTERED
		,CustomerName VARCHAR(100)
	)
GO

INSERT dbo.CustomerLookup (CustomerName) VALUES ('Tomas Lind')
GO

CREATE TABLE dbo.DestinationFact
	(
		CustomerId INT NOT NULL CONSTRAINT DestinationFact_CustomerId_PK PRIMARY KEY CLUSTERED
		,CustomerName VARCHAR(100)
	)
GO

The first component we need in SSIS is a dataflow. The starting point in the dataflow is simply to read the content in the source table. The next step is to add the CustomerId (primary key) column to the data flow with the use of a Derived Column:

Data Source And Derived Column

Data Source And Derived Columnn

Derived Column

Derived Column

The CustomerId could have been added in the source instead, but the point is that it initially is set to 0. The next step is to add the lookup component:

SSIS Lookup

SSIS Lookup

SSIS Lookup General Details

SSIS Lookup General Details

SSIS Lookup Column Details

SSIS Lookup Column Details

As you can see in the screen shots above, the settings are: not to use cache (of no use here), to redirect unmatching entries, and to replace the CustomerId with the key found in the lookup table. Also note that the customer name is used as the lookup value and the CustomerId is the result from the lookup.

The lookup component will now return two data flows. One with matching/existing customers, and another with unmatched/new customers. For the unmatching rows we will use an OLE DB Command to insert the new customers, and for the already existing customers, we don’t need to do anything yet. In the OLE DB Command we can use a T-SQL statement, but in this demonstration I will use a stored procedure that adds new rows to the CustomerLookup table:

CREATE PROCEDURE dbo.AddCustomerLookup
	(
		@CustomerName VARCHAR(100)
		,@CustomerId INT OUTPUT
	)

AS

SET NOCOUNT ON

INSERT dbo.CustomerLookup (CustomerName) VALUES (@CustomerName)

SELECT @CustomerId = SCOPE_IDENTITY()
GO

On the Component Properties page in the OLE DB Command, the stored procedure is called and the CustomerId is returned as an OUTPUT parameter:

OLE DB Command SQL Command

OLE DB Command SQL Command

SSIS OLE DB Command Column Mappings

SSIS OLE DB Command Column Mappings

Now we have CustomerId’s for both the old and the new customer. Both these data flows needs to be merged with a Union all component before the insert to the destination:

SSIS Union All And Destination

SSIS Union All And Destination

That’s it! If we run the package there will be one old customer in the “left” data flow, and one new customer in the other data flow:

Late Arriving Lookup SSIS Run Results

Late Arriving Lookup SSIS Run Results

The destination table is now populated with both rows:

Destination Table

Destination Table

6 comments » Write a comment

  1. Good approach. But one question. Let’s say you have an incoming file for which the value “A new customer!” appears in 50 records in a row. SSIS shoves data through the pipe in packets. So I would expect that you could see all 50 records sent to the No Match lookup output. If the stored procedure doesn’t test for pre-existence, I would expect the new value to be added 50 times.
    Also, there’s a tradeoff… you can set caching off, in which case each row would generate a database call for the lookup, but you’ll immediately find any row that has been added. Alternatively, you can turn on caching, in which case all 50 records in a row are guaranteed to not be found by the lookup (as well as any other instances of that value later in the input stream). But in that case, the lookup for all existing records are cached, but the new values are a one by one re-lookup via stored procedure.

    Just a thought. Regardless, this post put me on the right track. Thank you!

    • Hi, thanks for the feedback! Perhaps some kind of de-duplication routine that runs before the lookup could be used as well.

      Regards
      Tomas

      • As “de-duplication” you can use aggregation, but then you will not be able to return them to the main flow.
        You could split before that and one group can be aggregated and inserted, but then you would have to do another lookup. Problem is that this lookup might occur before records are inserted.

  2. I am new to SSIS. How is the CustomerName passed from the ‘Lookup No Match Output’ to the ‘AddCustomerLookup’ step? Mine always has a NULL value, causing my insert to fail.

    • Hi, in the demonstration there is always a name. What’s missing is the key for that name.

      Perhaps you can try to put a data viewer in the data flow before or after the lookup to view the records where name is missing?

      • Yes, the data viewer revealed that my Excel data source had some empty rows. I used a conditional split to remove those. Thank you.

Leave a Reply

Required fields are marked *.