A XML document is hierarchical. A value gets meaning from its position in the document. Nodes can be nested so that a parent node can contain children nodes. This way the child nodes inherits all the values from the parent.
In a relational database the same thing can be achieved, but by using a different technology. Here we use key values in separate tables instead. Information from the parent or child can be retrieved by using relations between tables.
XML is commonly used in applications to store information about objects, and a relational database is often used to persist that information.
So how can a hierarchical XML document be inserted to relational tables? In this blog post I will show one possible technique, using only T-SQL and XQuery. In the example the tables will have their own internal primary keys defined which will add to the complexity, but it is a rather common scenario.
Consider the following XML document as the source:
<Invoices> <Invoice> <InvoiceNumber>991001</InvoiceNumber> <InvoiceDate>2013-12-19</InvoiceDate> <InvoiceRows> <InvoiceRow> <Product>PROD1234</Product> <Amount>1000</Amount> </InvoiceRow> <InvoiceRow> <Product>PROD5678</Product> <Amount>100</Amount> </InvoiceRow> </InvoiceRows> </Invoice> <Invoice> <InvoiceNumber>991002</InvoiceNumber> <InvoiceDate>2013-12-20</InvoiceDate> <InvoiceRows> <InvoiceRow> <Product>PROD1234</Product> <Amount>2000</Amount> </InvoiceRow> </InvoiceRows> </Invoice> </Invoices>
The XML document contains invoices (two in the example, but there could be one or more). Each invoice has header information (InvoiceNumber and InvoiceDate), and each invoice has one or more invoice rows. Each invoice row has a product and an amount. So if we look at one of the invoice rows, we know which invoice it belong to, not by an reference key, but by its location as an nested element under the invoice. There is an InvoiceNumber in the example, but what if that value isn’t mandatory (for instance when an invoice has not been created yet). Or what if the value is not unique? If we want to keep the positions of the nodes when inserting to the database, we must calculate the position of each node.
Our goal is to insert the information in the above XML document into the following tables:
The sql script that creates the database can be found here. As you can see, the database uses internal keys to relate invoice rows to their invoice (InvoiceId).
To get the XML document into a table format, I use XQuery and the .nodes function. The function accepts a node location as parameter and returns all of those nodes in the document as a table. To get the actual values from the XML document, I use the .value function. The value function returns a value from the current node in the data type specified. To get the position of each invoice in the XML document, the T-SQL DENSE_RANK function is used. The DENSE_RANK function is ordered by the InvoiceNodes. To get all invoices, the following XQuery can be used (the first step is to put the XML document into a variable):
DECLARE @INVOICES XML SET @INVOICES = '<Invoices> <Invoice> <InvoiceNumber>991001</InvoiceNumber> <InvoiceDate>2013-12-19</InvoiceDate> <InvoiceRows> <InvoiceRow> <Product>PROD1234</Product> <Amount>1000</Amount> </InvoiceRow> <InvoiceRow> <Product>PROD5678</Product> <Amount>100</Amount> </InvoiceRow> </InvoiceRows> </Invoice> <Invoice> <InvoiceNumber>991002</InvoiceNumber> <InvoiceDate>2013-12-20</InvoiceDate> <InvoiceRows> <InvoiceRow> <Product>PROD1234</Product> <Amount>2000</Amount> </InvoiceRow> </InvoiceRows> </Invoice> </Invoices>' SELECT Invoice.InvoiceNode.value('InvoiceNumber[1]', 'INT') AS InvoiceNumber ,Invoice.InvoiceNode.value('InvoiceDate[1]', 'DATE') AS InvoiceDate ,DENSE_RANK() OVER (ORDER BY InvoiceNode) AS Position FROM @INVOICES.nodes('/Invoices/Invoice') AS Invoice(InvoiceNode)
This returns the following:
The next XQuery statement is used to get all the invoice rows. I still retrieve all the “/Invoices/Invoice” for the position, but I have added a join to all the “/Invoices/Invoice/InvoiceRows/InvoiceRow” nodes:
DECLARE @INVOICES XML SET @INVOICES = '<Invoices> <Invoice> <InvoiceNumber>991001</InvoiceNumber> <InvoiceDate>2013-12-19</InvoiceDate> <InvoiceRows> <InvoiceRow> <Product>PROD1234</Product> <Amount>1000</Amount> </InvoiceRow> <InvoiceRow> <Product>PROD5678</Product> <Amount>100</Amount> </InvoiceRow> </InvoiceRows> </Invoice> <Invoice> <InvoiceNumber>991002</InvoiceNumber> <InvoiceDate>2013-12-20</InvoiceDate> <InvoiceRows> <InvoiceRow> <Product>PROD1234</Product> <Amount>2000</Amount> </InvoiceRow> </InvoiceRows> </Invoice> </Invoices>' SELECT DENSE_RANK() OVER (ORDER BY InvoiceNode) AS Position ,InvoiceRow.InvoiceRowNode.value('Product[1]', 'VARCHAR(10)') AS Product ,InvoiceRow.InvoiceRowNode.value('Amount[1]', 'INT') AS Amount FROM @INVOICES.nodes('/Invoices/Invoice') AS Invoice(InvoiceNode) CROSS APPLY Invoice.InvoiceNode.nodes('./InvoiceRows/InvoiceRow') AS InvoiceRow(InvoiceRowNode)
This returns the invoice rows and the invoice parent’s position in the XML document:
To be honest, I had some trouble getting the DENSE_RANK function to work as I wanted. It wasn’t until I found this blog post by Adam Machanic that I could get it to work. Thanks Adam! In that blog post there are some alternative solutions as well.
So. If we put it all together and insert the results from the above queries, we get the following script. I added a couple of invoices with InvoiceNumber = 0, just to prove to myself that it worked and the solution can tell them apart without the InvoiceNumber.
DECLARE @INVOICES XML SET @INVOICES = '<Invoices> <Invoice> <InvoiceNumber>991001</InvoiceNumber> <InvoiceDate>2013-12-19</InvoiceDate> <InvoiceRows> <InvoiceRow> <Product>PROD1234</Product> <Amount>1000</Amount> </InvoiceRow> <InvoiceRow> <Product>PROD5678</Product> <Amount>100</Amount> </InvoiceRow> </InvoiceRows> </Invoice> <Invoice> <InvoiceNumber>991002</InvoiceNumber> <InvoiceDate>2013-12-20</InvoiceDate> <InvoiceRows> <InvoiceRow> <Product>PROD1234</Product> <Amount>2000</Amount> </InvoiceRow> </InvoiceRows> </Invoice> <Invoice> <InvoiceNumber>0</InvoiceNumber> <InvoiceDate>2013-12-01</InvoiceDate> <InvoiceRows> <InvoiceRow> <Product>PRODXYX</Product> <Amount>1000</Amount> </InvoiceRow> </InvoiceRows> </Invoice> <Invoice> <InvoiceNumber>0</InvoiceNumber> <InvoiceDate>2013-12-01</InvoiceDate> <InvoiceRows> <InvoiceRow> <Product>PRODXYX2</Product> <Amount>2000</Amount> </InvoiceRow> </InvoiceRows> </Invoice> </Invoices>' DECLARE @TMP_INVOICES TABLE ( COL_InvoiceId INT NOT NULL ,COL_InvoiceNumber INT NOT NULL ,COL_InvoiceDate DATE NOT NULL ,COL_XMLPosition INT NOT NULL ) INSERT dbo.Invoices ( InvoiceNumber ,InvoiceDate ,XMLPosition ) OUTPUT INSERTED.* INTO @TMP_INVOICES SELECT Invoice.InvoiceNode.value('InvoiceNumber[1]', 'INT') AS InvoiceNumber ,Invoice.InvoiceNode.value('InvoiceDate[1]', 'DATE') AS InvoiceDate ,DENSE_RANK() OVER (ORDER BY InvoiceNode) AS Position FROM @INVOICES.nodes('/Invoices/Invoice') AS Invoice(InvoiceNode) INSERT dbo.InvoiceRows ( InvoiceId ,Product ,Amount ) SELECT T1.COL_InvoiceId ,Product ,Amount FROM ( SELECT DENSE_RANK() OVER (ORDER BY InvoiceNode) AS Position ,InvoiceRow.InvoiceRowNode.value('Product[1]', 'VARCHAR(10)') AS Product ,InvoiceRow.InvoiceRowNode.value('Amount[1]', 'INT') AS Amount FROM @INVOICES.nodes('/Invoices/Invoice') AS Invoice(InvoiceNode) CROSS APPLY Invoice.InvoiceNode.nodes('./InvoiceRows/InvoiceRow') AS InvoiceRow(InvoiceRowNode) ) AS T0 JOIN @TMP_INVOICES T1 ON T0.Position = T1.COL_XMLPosition
Basically, what is added is a table variable (@TBL_INVOICES) to keep track of the resulting InvoiceId primary keys after the insert to the Invoices table. I need those Id’s to insert as foreign keys in the InvoiceRows table. The Position (calculated by the DENSE_RANK function) is used to join the invoice rows with its parents.
above query is not working if child node is not exist in some parent nodes. for example:
991001
2013-12-19
991002
2013-12-20
PROD1234
2000
0
2013-12-01
PRODXYX
1000
0
2013-12-01
PRODXYX2
2000
above query is not working if child node is not exist in some parent nodes. for example:
991001
2013-12-19
991002
2013-12-20
PROD1234
2000
0
2013-12-01
PRODXYX
1000
0
2013-12-01
PRODXYX2
2000
use this xml for above query