Using XQuery to insert a XML hierarchy into parent / child tables

A XML document is hierarchical. A value gets meaning from its position in the document. Nodes can be nested so that a parent node can contain children nodes. This way the child nodes inherits all the values from the parent.

In a relational database the same thing can be achieved, but by using a different technology. Here we use key values in separate tables instead. Information from the parent or child can be retrieved by using relations between tables.

XML is commonly used in applications to store information about objects, and a relational database is often used to persist that information.

So how can a hierarchical XML document be inserted to relational tables? In this blog post I will show one possible technique, using only T-SQL and XQuery. In the example the tables will have their own internal primary keys defined which will add to the complexity, but it is a rather common scenario.

Consider the following XML document as the source:

<Invoices>
	<Invoice>
		<InvoiceNumber>991001</InvoiceNumber>
		<InvoiceDate>2013-12-19</InvoiceDate>
			<InvoiceRows>
				<InvoiceRow>
					<Product>PROD1234</Product>
					<Amount>1000</Amount>
				</InvoiceRow>
				<InvoiceRow>
					<Product>PROD5678</Product>
					<Amount>100</Amount>
				</InvoiceRow>
			</InvoiceRows>
	</Invoice>
	<Invoice>
		<InvoiceNumber>991002</InvoiceNumber>
		<InvoiceDate>2013-12-20</InvoiceDate>
			<InvoiceRows>
				<InvoiceRow>
					<Product>PROD1234</Product>
					<Amount>2000</Amount>
				</InvoiceRow>
			</InvoiceRows>
	</Invoice>
</Invoices>

The XML document contains invoices (two in the example, but there could be one or more). Each invoice has header information (InvoiceNumber and InvoiceDate), and each invoice has one or more invoice rows. Each invoice row has a product and an amount. So if we look at one of the invoice rows, we know which invoice it belong to, not by an reference key, but by its location as an nested element under the invoice. There is an InvoiceNumber in the example, but what if that value isn’t mandatory (for instance when an invoice has not been created yet). Or what if the value is not unique? If we want to keep the positions of the nodes when inserting to the database, we must calculate the position of each node.

Our goal is to insert the information in the above XML document into the following tables:

XML Invoices Table Design

XML Invoices Table Design

The sql script that creates the database can be found here. As you can see, the database uses internal keys to relate invoice rows to their invoice (InvoiceId).

To get the XML document into a table format, I use XQuery and the .nodes function. The function accepts a node location as parameter and returns all of those nodes in the document as a table. To get the actual values from the XML document, I use the .value function. The value function returns a value from the current node in the data type specified. To get the position of each invoice in the XML document, the T-SQL DENSE_RANK function is used. The DENSE_RANK function is ordered by the InvoiceNodes. To get all invoices, the following XQuery can be used (the first step is to put the XML document into a variable):

DECLARE @INVOICES XML

SET @INVOICES = '<Invoices>
	<Invoice>
		<InvoiceNumber>991001</InvoiceNumber>
		<InvoiceDate>2013-12-19</InvoiceDate>
			<InvoiceRows>
				<InvoiceRow>
					<Product>PROD1234</Product>
					<Amount>1000</Amount>
				</InvoiceRow>
				<InvoiceRow>
					<Product>PROD5678</Product>
					<Amount>100</Amount>
				</InvoiceRow>
			</InvoiceRows>
	</Invoice>
	<Invoice>
		<InvoiceNumber>991002</InvoiceNumber>
		<InvoiceDate>2013-12-20</InvoiceDate>
			<InvoiceRows>
				<InvoiceRow>
					<Product>PROD1234</Product>
					<Amount>2000</Amount>
				</InvoiceRow>
			</InvoiceRows>
	</Invoice>
</Invoices>'

SELECT
	Invoice.InvoiceNode.value('InvoiceNumber[1]', 'INT') AS InvoiceNumber
	,Invoice.InvoiceNode.value('InvoiceDate[1]', 'DATE') AS InvoiceDate
	,DENSE_RANK() OVER (ORDER BY InvoiceNode) AS Position
FROM
    @INVOICES.nodes('/Invoices/Invoice') AS Invoice(InvoiceNode)

This returns the following:

XQuery With Rank Position

XQuery With Rank Position

The next XQuery statement is used to get all the invoice rows. I still retrieve all the “/Invoices/Invoice” for the position, but I have added a join to all the “/Invoices/Invoice/InvoiceRows/InvoiceRow” nodes:

DECLARE @INVOICES XML

SET @INVOICES = '<Invoices>
	<Invoice>
		<InvoiceNumber>991001</InvoiceNumber>
		<InvoiceDate>2013-12-19</InvoiceDate>
			<InvoiceRows>
				<InvoiceRow>
					<Product>PROD1234</Product>
					<Amount>1000</Amount>
				</InvoiceRow>
				<InvoiceRow>
					<Product>PROD5678</Product>
					<Amount>100</Amount>
				</InvoiceRow>
			</InvoiceRows>
	</Invoice>
	<Invoice>
		<InvoiceNumber>991002</InvoiceNumber>
		<InvoiceDate>2013-12-20</InvoiceDate>
			<InvoiceRows>
				<InvoiceRow>
					<Product>PROD1234</Product>
					<Amount>2000</Amount>
				</InvoiceRow>
			</InvoiceRows>
	</Invoice>
</Invoices>'

SELECT
	DENSE_RANK() OVER (ORDER BY InvoiceNode) AS Position
    ,InvoiceRow.InvoiceRowNode.value('Product[1]', 'VARCHAR(10)') AS Product
    ,InvoiceRow.InvoiceRowNode.value('Amount[1]', 'INT') AS Amount
FROM
    @INVOICES.nodes('/Invoices/Invoice') AS Invoice(InvoiceNode)
CROSS APPLY
	Invoice.InvoiceNode.nodes('./InvoiceRows/InvoiceRow') AS InvoiceRow(InvoiceRowNode)

This returns the invoice rows and the invoice parent’s position in the XML document:

XQuery With Parent Rank Position

XQuery With Parent Rank Position

To be honest, I had some trouble getting the DENSE_RANK function to work as I wanted. It wasn’t until I found this blog post by Adam Machanic that I could get it to work. Thanks Adam! In that blog post there are some alternative solutions as well.

So. If we put it all together and insert the results from the above queries, we get the following script. I added a couple of invoices with InvoiceNumber = 0, just to prove to myself that it worked and the solution can tell them apart without the InvoiceNumber.

DECLARE @INVOICES XML

SET @INVOICES = '<Invoices>
	<Invoice>
		<InvoiceNumber>991001</InvoiceNumber>
		<InvoiceDate>2013-12-19</InvoiceDate>
			<InvoiceRows>
				<InvoiceRow>
					<Product>PROD1234</Product>
					<Amount>1000</Amount>
				</InvoiceRow>
				<InvoiceRow>
					<Product>PROD5678</Product>
					<Amount>100</Amount>
				</InvoiceRow>
			</InvoiceRows>
	</Invoice>
	<Invoice>
		<InvoiceNumber>991002</InvoiceNumber>
		<InvoiceDate>2013-12-20</InvoiceDate>
			<InvoiceRows>
				<InvoiceRow>
					<Product>PROD1234</Product>
					<Amount>2000</Amount>
				</InvoiceRow>
			</InvoiceRows>
	</Invoice>
	<Invoice>
		<InvoiceNumber>0</InvoiceNumber>
		<InvoiceDate>2013-12-01</InvoiceDate>
			<InvoiceRows>
				<InvoiceRow>
					<Product>PRODXYX</Product>
					<Amount>1000</Amount>
				</InvoiceRow>
			</InvoiceRows>
	</Invoice>
	<Invoice>
		<InvoiceNumber>0</InvoiceNumber>
		<InvoiceDate>2013-12-01</InvoiceDate>
			<InvoiceRows>
				<InvoiceRow>
					<Product>PRODXYX2</Product>
					<Amount>2000</Amount>
				</InvoiceRow>
			</InvoiceRows>
	</Invoice>
</Invoices>'

DECLARE @TMP_INVOICES TABLE
	(
		COL_InvoiceId INT NOT NULL
		,COL_InvoiceNumber INT NOT NULL
		,COL_InvoiceDate DATE NOT NULL
		,COL_XMLPosition INT NOT NULL
	)

INSERT dbo.Invoices
	(
		InvoiceNumber
		,InvoiceDate
		,XMLPosition
	)
OUTPUT
	INSERTED.*
INTO
	@TMP_INVOICES
SELECT
	Invoice.InvoiceNode.value('InvoiceNumber[1]', 'INT') AS InvoiceNumber
	,Invoice.InvoiceNode.value('InvoiceDate[1]', 'DATE') AS InvoiceDate
	,DENSE_RANK() OVER (ORDER BY InvoiceNode) AS Position
FROM
    @INVOICES.nodes('/Invoices/Invoice') AS Invoice(InvoiceNode)

INSERT dbo.InvoiceRows
	(
		InvoiceId
		,Product
		,Amount
	)
SELECT
	T1.COL_InvoiceId
	,Product
	,Amount
FROM
	(
		SELECT
			DENSE_RANK() OVER (ORDER BY InvoiceNode) AS Position
			,InvoiceRow.InvoiceRowNode.value('Product[1]', 'VARCHAR(10)') AS Product
			,InvoiceRow.InvoiceRowNode.value('Amount[1]', 'INT') AS Amount
		FROM
			@INVOICES.nodes('/Invoices/Invoice') AS Invoice(InvoiceNode)
		CROSS APPLY
			Invoice.InvoiceNode.nodes('./InvoiceRows/InvoiceRow') AS InvoiceRow(InvoiceRowNode)
	) AS T0
JOIN
	@TMP_INVOICES T1
ON
	T0.Position = T1.COL_XMLPosition

Basically, what is added is a table variable (@TBL_INVOICES) to keep track of the resulting InvoiceId primary keys after the insert to the Invoices table. I need those Id’s to insert as foreign keys in the InvoiceRows table. The Position (calculated by the DENSE_RANK function) is used to join the invoice rows with its parents.

Leave a Reply

Required fields are marked *.