In some cases it is necessary to have an unbroken time series, for instance to have all days during a year. But what if some days are missing from the data? How can we fill in gaps in time series?
By using a Numbers function in T-SQL, we can fill in the gaps in a serie of values. A common example is a time series of days, but any incrementing series of values can use the method I’ll describe in this blog post.
Let’s start by creating some sample data. This script will create a database with a table that will contain average temperatures over a year:
CREATE DATABASE FillInGaps GO USE FillInGaps GO CREATE TABLE dbo.Temperatures ( TempDay DATE NOT NULL ,Temperature NUMERIC(4,2) NOT NULL ) GO
To populate the table with some sample data, we’ll use a numbers function. The numbers function simply accepts the parameter @N and returns that many rows.
CREATE FUNCTION dbo.GetNums2(@N AS BIGINT) RETURNS TABLE AS RETURN SELECT TOP (@N) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS I FROM SYS.OBJECTS S1 CROSS JOIN SYS.OBJECTS S2 CROSS JOIN SYS.OBJECTS S3 CROSS JOIN SYS.OBJECTS S4 CROSS JOIN SYS.OBJECTS S5 GO
By using a T-SQL random function, we can generate some sample data. The following script will generate and insert all days 2015 and a random temperature between 0 and 30 (never mind the wildly fluctuating temperatures 🙂 )
DECLARE @START_DATE DATE SET @START_DATE = '2014-12-31' INSERT dbo.Temperatures SELECT DATEADD(DAY, I, @START_DATE) ,ABS(CHECKSUM(NewId())) % 30 AS TMP FROM dbo.GetNums2(365) SELECT * FROM dbo.Temperatures
But since the table contains all days for 2015, this is a unbroken time series. So let’s remove some of the days:
DELETE FROM dbo.Temperatures WHERE TempDay IN ('2015-01-13', '2015-01-14', '2015-01-15', '2015-03-14', '2015-04-23', '2015-07-26', '2015-09-19')
Note that I created a three day gap in January.
To create the unbroken time series for 2015, we once again use the numbers function together with a left outer join:
DECLARE @START_DATE DATE SET @START_DATE = '2014-12-31' SELECT DATEADD(DAY, I, @START_DATE) AS TempDay ,Temperature ,( SELECT Temperature FROM dbo.Temperatures WHERE TempDay = (SELECT MAX(TempDay) FROM dbo.Temperatures T_PREV WHERE T_PREV.TempDay <= DATEADD(DAY, I, @START_DATE)) ) AS TemperatureLastKnown FROM dbo.GetNums2(365) LEFT OUTER JOIN dbo.Temperatures T ON DATEADD(DAY, I, @START_DATE) = T.TempDay ORDER BY TempDay
This T-SQL gives us a result like the following:
The first column simply returns the date. Note that I use the function instead of the column TempDay, since that column misses some values.
The second column, Temperature, returns the temperature. Note that the value is NULL if there is no measurement for that day.
The last column, TemperatureLastKnown, shows a solution where I use the “last known” value for the temperature. This method can be useful if you want to avoid NULLs or if you want a smoothing effect.
Thanks Thomas, very useful article.
I have to do this kind of filling gaps but have to do it gradually.
I should not only take into account the previous value but also with the next one.
There are gaps from 1 to a maximum of 9 days. Longer gaps may not be filled in my case.
The most simple situation is a gap of 1 day where the filled data is the average of the previous and next value.
Taking your example (where the previous value is 8 en the next known value = 0), the value of
2015-01-13 would become 0.75*8 + 0.25*0 = 6
2015-01-14 would become 0.50*8 + 0.50*0 = 4
2015-01-15 would become 0.25*8 + 0.75*0 = 2
It is becoming increasingly difficult as the gaps get bigger.
Is there a function to gradually fill in the gaps to smooth the results.
Hi! Not sure about your requirements, but perhaps you could use a moving averge?