How To Calculate Length of Overlapping Ranges in PostgreSQL
The Problem
PostgreSQL has many interesting features for building custom logic inside the database to ensure the data is correct. It is also possible to build logic for simpler processing data.
In this post I will show how to use PostgreSQL to create an easy to use custom aggregate to calculate the length of a collection of overlapping ranges.
An Example
Consider storing information about events in a database. Each event has its start and end time. It should also have some information about the user and type of the event, maybe some resource id. To make the examples simpler, let’s consider only the user id and the event name.
When storing ranges in a relational database we have a couple of solutions. One most commonly used is to have two columns start
and stop
like this:
create table events (
id int,
start timestamp,
stop timestamp
);
Then the length of the event can be calculated with stop - start
, but for this, we need to be sure that we won’t get negative values.
Of course, the most important thing in the database is the correct data. For this example, to make sense, I need to ensure that start <= stop
.
create table events (
id int,
start timestamp not null,
stop timestamp not null,
check (start <= stop)
);
I have also added NOT NULL
to make the model correct.
The whole table looks like this:
create table events (
id int primary key,
user_id int not null,
event text not null,
start timestamp not null,
stop timestamp not null,
check (start <= stop)
);
Another solution is to use the PostgreSQL range types. Then the table can look like this:
create table events (
id int primary key,
user_id int not null,
event text not null,
timerange tsrange not null
);
I still have access to the start
and stop
values with lower(timerange)
and upper(timerange)
but I also have lots of functions and operators simplifying writing queries. The stuff I’d need to write on my own, which is already there and is well tested.
We don’t even need to add the check, as trying to create a range with lower bound bigger than the upper one ends with an error:
$ select tsrange('2010-02-02 10:10:10', '2010-02-02 09:10:10');
ERROR: range lower bound must be less than
or equal to range upper bound
Insert Some Data
Let’s insert some data for further calculations.
insert into events(
id, user_id, event, timerange)
values
(1, 1, 'aaa',
tsrange('2010-01-01 10:00:00', '2010-01-01 10:10:00')),
(2, 1, 'aaa',
tsrange('2010-01-01 10:20:00', '2010-01-01 11:00:00')),
(3, 1, 'aaa',
tsrange('2010-01-01 10:30:00', '2010-01-01 12:00:00')),
-- another event with exactly the same time
(4, 1, 'bbb',
tsrange('2010-01-01 10:00:00', '2010-01-01 10:10:00')),
(5, 1, 'bbb',
tsrange('2010-01-01 10:00:00', '2010-01-01 10:10:00')),
-- another day
(6, 1, 'aaa',
tsrange('2010-01-02 10:00:00', '2010-02-01 10:10:00')),
(7, 1, 'aaa',
tsrange('2010-01-02 10:00:00', '2010-02-01 11:00:00')),
(8, 1, 'aaa',
tsrange('2010-01-02 12:00:00', '2010-02-01 12:10:00')),
(9, 1, 'aaa',
tsrange('2010-01-02 14:00:00', '2010-02-01 14:00:00')),
-- another user
(10, 2, 'aaa',
tsrange('2010-01-01 10:00:00', '2010-01-01 10:10:00')),
(11, 2, 'aaa',
tsrange('2010-01-01 10:00:00', '2010-01-01 10:10:00')),
(12, 2, 'aaa',
tsrange('2010-01-01 11:00:00', '2010-01-01 11:10:00')),
(13, 2, 'bbb',
tsrange('2010-01-01 11:00:00', '2010-01-01 11:40:00')),
(14, 2, 'bbb',
tsrange('2010-01-01 11:00:00', '2010-01-01 12:00:00'));
The data now is:
$ select * from events;
id | user_id | event | timerange
----+---------+-------+-----------------------------------------------
1 | 1 | aaa | ["2010-01-01 10:00:00","2010-01-01 10:10:00")
2 | 1 | aaa | ["2010-01-01 10:20:00","2010-01-01 11:00:00")
3 | 1 | aaa | ["2010-01-01 10:30:00","2010-01-01 12:00:00")
4 | 1 | bbb | ["2010-01-01 10:00:00","2010-01-01 10:10:00")
5 | 1 | bbb | ["2010-01-01 10:00:00","2010-01-01 10:10:00")
6 | 1 | aaa | ["2010-01-02 10:00:00","2010-02-01 10:10:00")
7 | 1 | aaa | ["2010-01-02 10:00:00","2010-02-01 11:00:00")
8 | 1 | aaa | ["2010-01-02 12:00:00","2010-02-01 12:10:00")
9 | 1 | aaa | ["2010-01-02 14:00:00","2010-02-01 14:00:00")
10 | 2 | aaa | ["2010-01-01 10:00:00","2010-01-01 10:10:00")
11 | 2 | aaa | ["2010-01-01 10:00:00","2010-01-01 10:10:00")
12 | 2 | aaa | ["2010-01-01 11:00:00","2010-01-01 11:10:00")
13 | 2 | bbb | ["2010-01-01 11:00:00","2010-01-01 11:40:00")
14 | 2 | bbb | ["2010-01-01 11:00:00","2010-01-01 12:00:00")
(14 rows)
The First Idea
Now let’s get the total length of the events. To make things more interesting, let’s consider that we want to see the total time for each user, for each day, for each event.
First, let’s check the first 3 rows, they are for the same user, event, resource and day:
$ select * from events where id in (1,2,3);
id | user_id | event | timerange
----+---------+-------+-----------------------------------------------
1 | 1 | aaa | ["2010-01-01 10:00:00","2010-01-01 10:10:00")
2 | 1 | aaa | ["2010-01-01 10:20:00","2010-01-01 11:00:00")
3 | 1 | aaa | ["2010-01-01 10:30:00","2010-01-01 12:00:00")
(3 rows)
The length of the event can be calculated with:
select
id, timerange,
upper(timerange) - lower(timerange) length
from events where id in (1,2,3);
id | timerange | length
----+-----------------------------------------------+----------
1 | ["2010-01-01 10:00:00","2010-01-01 10:10:00") | 00:10:00
2 | ["2010-01-01 10:20:00","2010-01-01 11:00:00") | 00:40:00
3 | ["2010-01-01 10:30:00","2010-01-01 12:00:00") | 01:30:00
(3 rows)
Does it mean that the user was doing something for 00:10:00 + 00:40:00 + 01:30:00 = 02:20:00
? No, because the ranges overlap. The user started at 10:00:00
and finished at 10:20:00
. This can be one session. Another session started at 10:20:00
and finished at 12:00:00
.
The ranges in the rows 2 and 3 overlap. There can be some problems with the reporting software, but this is the data and we need to do something with the overlapping stuff.
Checking Ranges Overlapping
PostgreSQL has an operator to check of two ranges overlap &&
:
select
tsrange('2010-02-02 10:10:10', '2010-02-02 11:10:10')
&& tsrange('2010-02-02 11:00:10', '2010-02-02 11:10:10');
?column?
----------
t
And another one to check if the ranges are adjacent to each other -|-
:
select
tsrange('2010-02-02 10:10:10', '2010-02-02 11:10:10')
-|- tsrange('2010-02-02 11:10:10', '2010-02-02 12:10:10');
?column?
----------
t
(1 row)
Grouping Rows
Another thing, PostgreSQL has, is the ability to group things. Let’s make a list of all ranges for the user, resource, and day:
select
user_id,
event,
lower(timerange)::date as day,
array_agg(timerange)
from events
group by
user_id,
event,
lower(timerange)::date;
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
user_id | 1
event | bbb
day | 2010-01-01
array_agg |
{"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")",
"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")"}
-[ RECORD 2 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
user_id | 2
event | aaa
day | 2010-01-01
array_agg |
{"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")",
"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")",
"[\"2010-01-01 11:00:00\",\"2010-01-01 11:10:00\")"}
-[ RECORD 3 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
user_id | 2
event | bbb
day | 2010-01-01
array_agg |
{"[\"2010-01-01 11:00:00\",\"2010-01-01 11:40:00\")",
"[\"2010-01-01 11:00:00\",\"2010-01-01 12:00:00\")"}
-[ RECORD 4 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
user_id | 1
event | aaa
day | 2010-01-02
array_agg |
{"[\"2010-01-02 10:00:00\",\"2010-02-01 10:10:00\")",
"[\"2010-01-02 10:00:00\",\"2010-02-01 11:00:00\")",
"[\"2010-01-02 12:00:00\",\"2010-02-01 12:10:00\")",
"[\"2010-01-02 14:00:00\",\"2010-02-01 14:00:00\")"}
-[ RECORD 5 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
user_id | 1
event | aaa
day | 2010-01-01
array_agg |
{"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")",
"[\"2010-01-01 10:20:00\",\"2010-01-01 11:00:00\")",
"[\"2010-01-01 10:30:00\",\"2010-01-01 12:00:00\")"}
As you can see, for each of the items we group by (user, type, day), we get an array of the time ranges. The ranges overlap and we need to calculate the length of the sessions. It would be nice to merge the ranges some way.
A Custom Aggregate Function
PostgreSQL allows for creating a custom aggregate function. The simplest aggregate is a combination of two functions:
- sfunc(state, new-value) -> adds the new-value to the internal state
- ffunc(state) -> calculates the final output out of the internal state
The custom aggregate with these two functions looks like a perfect solution:
- the new-value will be a tsrange value from the current row
- internally it will store an array of tsrange values (starting with an empty one)
- when adding a new value, it will check if it overlaps with any values in the list, if it does, then the merged range will be stored
- the last function will calculate the total length as the sum of the lenghts of the ranges in the list
The Implementation
The basic aggregate syntax is:
create aggreagate overlapping_range_length(tsrange)
(
sfunc = add_overlapping_tsrange_to_array,
stype = tsrange[],
initcond = '{}'
finalfunc = tsrange_array_length
);
where:
- sfunc - will add a new value to the internal aggregator state
- stype - the type of the internal state
- initcond - the initial value of the internal state
- finalfunc - the function to calculate the final value from the internal state. If this function is omitted, then the aggregate will return the internal state variable.
Calculating The Length of The Ranges
The final function must take tsrange[]
as the argument and must return time
. The simplest way to process array values is to convert it to a table with the unnest
function:
$ select unnest(array[1,2,3]);
unnest
--------
1
2
3
(3 rows)
There will be time ranges of type tsrange
, I need to get their lengths with upper(value) - lower(value)
and the sum all of them.
The final function looks like this:
create function tsrange_array_length(tsrange[])
returns interval
language sql
as $_$
select
coalesce(
sum(upper(value)-lower(value)),
'00:00:00')
from (select unnest($1) as value) as _;
$_$;
Merging The Ranges in an Array
Now the more tricky part, the function for storing the internal state.
For each new value I need to go through all the values, I already have in an array, and check if any of them overlaps or is adjacent to the new one. What’s more, it’s possible that the value overlaps with two already existing values, so I need to do even more checks.
This function must have tsrange[] (the current internal state)
and tsrange (the new value)
as the arguments and should return tsrange[]
, which the aggregate will store in the internal state.
create function add_overlapping_tsrange_to_array(
data tsrange[], value tsrange)
returns tsrange[]
language plpgsql
as $_$
declare
-- array of values to return
new_data tsrange[];
-- value merged with values from the internal state
new_value tsrange;
-- current element from loop
element tsrange;
begin
-- we don't want to process null value
if value is null then
return new_data;
end if;
-- this will be merged with overlapping values
-- and will be added to the returned array
-- at the end
new_value := value;
-- loop over the internal state values
for i in 1..coalesce(array_upper(data, 1), 0) loop
element = data[i];
-- if the current element is adjacent or overlaps the new_value
-- then merged it and store in new_value
if (element && new_value or element -|- new_value) then
new_value := new_value + element;
else
-- if the current element is not adjacent and doesn't overlap
-- then just add it to the returned list
new_data := array_append(new_data, element);
end if;
end loop;
-- here we have the all the ranges merged with the value one
new_data := array_append(new_data, new_value);
return new_data;
end;
$_$;
The Aggregates
I can create two aggregates out of the above function:
- tsrange_agg - will return an array of the merged ranges
- tsrange_length_agg - will return the length of the merged ranges
create aggregate tsrange_agg(tsrange) (
sfunc = add_overlapping_tsrange_to_array,
stype = tsrange[],
initcond = '{}'
);
create aggregate tsrange_length_agg(tsrange) (
sfunc = add_overlapping_tsrange_to_array,
stype = tsrange[],
initcond = '{}',
finalfunc = tsrange_array_length
);
The Final Query
The previously used query, which generated overlapping ranges, was:
select
user_id,
event,
lower(timerange)::date as day,
array_agg(timerange)
from events
group by
user_id,
event,
lower(timerange)::date;
Here is a similar one, but using the aggregates:
select
user_id,
event,
lower(timerange)::date as day,
tsrange_agg(timerange) merged_range,
tsrange_length_agg(timerange) length
from events
group by
user_id,
event,
lower(timerange)::date;
And the results:
-[ RECORD 1 ]+----------------------------------------------------------------------------------------------------------
user_id | 1
event | bbb
day | 2010-01-01
merged_range |
{"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")"}
length | 00:10:00
-[ RECORD 2 ]+----------------------------------------------------------------------------------------------------------
user_id | 2
event | aaa
day | 2010-01-01
merged_range |
{"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")",
"[\"2010-01-01 11:00:00\",\"2010-01-01 11:10:00\")"}
length | 00:20:00
-[ RECORD 3 ]+----------------------------------------------------------------------------------------------------------
user_id | 2
event | bbb
day | 2010-01-01
merged_range |
{"[\"2010-01-01 11:00:00\",\"2010-01-01 12:00:00\")"}
length | 01:00:00
-[ RECORD 4 ]+----------------------------------------------------------------------------------------------------------
user_id | 1
event | aaa
day | 2010-01-02
merged_range |
{"[\"2010-01-02 10:00:00\",\"2010-02-01 14:00:00\")"}
length | 30 days 04:00:00
-[ RECORD 5 ]+----------------------------------------------------------------------------------------------------------
user_id | 1
event | aaa
day | 2010-01-01
merged_range |
{"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")",
"[\"2010-01-01 10:20:00\",\"2010-01-01 12:00:00\")"}
length | 01:50:00
Summary
These aggregates work pretty nicely and simplify the queries. There can be made many others, for other data types and with different logic. Don’t process all your data only in applications, too often it’s too slow.