How To Calculate Length of Overlapping Ranges in PostgreSQL

Fri, Mar 1, 2019

by Szymon Lipiński

tags: database postgresql

The Problem

PostgreSQL has many interesting features for building custom logic inside the database to ensure the data is correct. It is also possible to build logic for simpler processing data.

In this post I will show how to use PostgreSQL to create an easy to use custom aggregate to calculate the length of a collection of overlapping ranges.

An Example

Consider storing information about events in a database. Each event has its start and end time. It should also have some information about the user and type of the event, maybe some resource id. To make the examples simpler, let’s consider only the user id and the event name.

When storing ranges in a relational database we have a couple of solutions. One most commonly used is to have two columns start and stop like this:

create table events (
  id int, 
  start timestamp, 
  stop timestamp
);

Then the length of the event can be calculated with stop - start, but for this, we need to be sure that we won’t get negative values.

Of course, the most important thing in the database is the correct data. For this example, to make sense, I need to ensure that start <= stop.

create table events (
  id int, 
  start timestamp not null, 
  stop timestamp not null,
  check (start <= stop)
);

I have also added NOT NULL to make the model correct.

The whole table looks like this:

create table events (
  id int primary key, 
  user_id int not null,
  event text not null,
  start timestamp not null, 
  stop timestamp not null,
  check (start <= stop)
);

Another solution is to use the PostgreSQL range types. Then the table can look like this:

create table events (
  id int primary key, 
  user_id int not null,
  event text not null,
  timerange tsrange not null
);

I still have access to the start and stop values with lower(timerange) and upper(timerange) but I also have lots of functions and operators simplifying writing queries. The stuff I’d need to write on my own, which is already there and is well tested.

We don’t even need to add the check, as trying to create a range with lower bound bigger than the upper one ends with an error:

$ select tsrange('2010-02-02 10:10:10', '2010-02-02 09:10:10');
ERROR:  range lower bound must be less than 
or equal to range upper bound

Insert Some Data

Let’s insert some data for further calculations.

insert into events(
    id, user_id, event, timerange)
values 
(1, 1, 'aaa', 
 tsrange('2010-01-01 10:00:00', '2010-01-01 10:10:00')),
(2, 1, 'aaa',
 tsrange('2010-01-01 10:20:00', '2010-01-01 11:00:00')),
(3, 1, 'aaa',
 tsrange('2010-01-01 10:30:00', '2010-01-01 12:00:00')),
-- another event with exactly the same time
(4, 1, 'bbb',
 tsrange('2010-01-01 10:00:00', '2010-01-01 10:10:00')),
(5, 1, 'bbb', 
 tsrange('2010-01-01 10:00:00', '2010-01-01 10:10:00')),
-- another day
(6, 1, 'aaa', 
 tsrange('2010-01-02 10:00:00', '2010-02-01 10:10:00')),
(7, 1, 'aaa', 
 tsrange('2010-01-02 10:00:00', '2010-02-01 11:00:00')),
(8, 1, 'aaa', 
 tsrange('2010-01-02 12:00:00', '2010-02-01 12:10:00')),
(9, 1, 'aaa', 
 tsrange('2010-01-02 14:00:00', '2010-02-01 14:00:00')),
-- another user
(10, 2, 'aaa', 
 tsrange('2010-01-01 10:00:00', '2010-01-01 10:10:00')),
(11, 2, 'aaa', 
 tsrange('2010-01-01 10:00:00', '2010-01-01 10:10:00')),
(12, 2, 'aaa', 
 tsrange('2010-01-01 11:00:00', '2010-01-01 11:10:00')),
(13, 2, 'bbb', 
 tsrange('2010-01-01 11:00:00', '2010-01-01 11:40:00')),
(14, 2, 'bbb', 
 tsrange('2010-01-01 11:00:00', '2010-01-01 12:00:00'));

The data now is:

$ select * from events;
 id | user_id | event |                   timerange                   
----+---------+-------+-----------------------------------------------
  1 |       1 | aaa   | ["2010-01-01 10:00:00","2010-01-01 10:10:00")
  2 |       1 | aaa   | ["2010-01-01 10:20:00","2010-01-01 11:00:00")
  3 |       1 | aaa   | ["2010-01-01 10:30:00","2010-01-01 12:00:00")
  4 |       1 | bbb   | ["2010-01-01 10:00:00","2010-01-01 10:10:00")
  5 |       1 | bbb   | ["2010-01-01 10:00:00","2010-01-01 10:10:00")
  6 |       1 | aaa   | ["2010-01-02 10:00:00","2010-02-01 10:10:00")
  7 |       1 | aaa   | ["2010-01-02 10:00:00","2010-02-01 11:00:00")
  8 |       1 | aaa   | ["2010-01-02 12:00:00","2010-02-01 12:10:00")
  9 |       1 | aaa   | ["2010-01-02 14:00:00","2010-02-01 14:00:00")
 10 |       2 | aaa   | ["2010-01-01 10:00:00","2010-01-01 10:10:00")
 11 |       2 | aaa   | ["2010-01-01 10:00:00","2010-01-01 10:10:00")
 12 |       2 | aaa   | ["2010-01-01 11:00:00","2010-01-01 11:10:00")
 13 |       2 | bbb   | ["2010-01-01 11:00:00","2010-01-01 11:40:00")
 14 |       2 | bbb   | ["2010-01-01 11:00:00","2010-01-01 12:00:00")
(14 rows)

The First Idea

Now let’s get the total length of the events. To make things more interesting, let’s consider that we want to see the total time for each user, for each day, for each event.

First, let’s check the first 3 rows, they are for the same user, event, resource and day:

$ select * from events where id in (1,2,3);
 id | user_id | event |                   timerange                   
----+---------+-------+-----------------------------------------------
  1 |       1 | aaa   | ["2010-01-01 10:00:00","2010-01-01 10:10:00")
  2 |       1 | aaa   | ["2010-01-01 10:20:00","2010-01-01 11:00:00")
  3 |       1 | aaa   | ["2010-01-01 10:30:00","2010-01-01 12:00:00")
(3 rows)

The length of the event can be calculated with:

select 
  id, timerange, 
  upper(timerange) - lower(timerange) length
from events where id in (1,2,3);

 id |                   timerange                   |  length  
----+-----------------------------------------------+----------
  1 | ["2010-01-01 10:00:00","2010-01-01 10:10:00") | 00:10:00
  2 | ["2010-01-01 10:20:00","2010-01-01 11:00:00") | 00:40:00
  3 | ["2010-01-01 10:30:00","2010-01-01 12:00:00") | 01:30:00
(3 rows)

Does it mean that the user was doing something for 00:10:00 + 00:40:00 + 01:30:00 = 02:20:00? No, because the ranges overlap. The user started at 10:00:00 and finished at 10:20:00. This can be one session. Another session started at 10:20:00 and finished at 12:00:00.

The ranges in the rows 2 and 3 overlap. There can be some problems with the reporting software, but this is the data and we need to do something with the overlapping stuff.

Checking Ranges Overlapping

PostgreSQL has an operator to check of two ranges overlap &&:

select 
  tsrange('2010-02-02 10:10:10', '2010-02-02 11:10:10') 
  && tsrange('2010-02-02 11:00:10', '2010-02-02 11:10:10');
 ?column? 
----------
 t

And another one to check if the ranges are adjacent to each other -|-:

select 
  tsrange('2010-02-02 10:10:10', '2010-02-02 11:10:10') 
  -|- tsrange('2010-02-02 11:10:10', '2010-02-02 12:10:10');
 ?column? 
----------
 t
(1 row)

Grouping Rows

Another thing, PostgreSQL has, is the ability to group things. Let’s make a list of all ranges for the user, resource, and day:

select 
  user_id, 
  event, 
  lower(timerange)::date as day, 
  array_agg(timerange) 
from events 
group by 
  user_id, 
  event, 
  lower(timerange)::date;

-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
user_id   | 1
event     | bbb
day       | 2010-01-01
array_agg | 
  {"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")",
   "[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")"}
-[ RECORD 2 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
user_id   | 2
event     | aaa
day       | 2010-01-01
array_agg | 
  {"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")",
   "[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")",
   "[\"2010-01-01 11:00:00\",\"2010-01-01 11:10:00\")"}
-[ RECORD 3 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
user_id   | 2
event     | bbb
day       | 2010-01-01
array_agg | 
  {"[\"2010-01-01 11:00:00\",\"2010-01-01 11:40:00\")",
   "[\"2010-01-01 11:00:00\",\"2010-01-01 12:00:00\")"}
-[ RECORD 4 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
user_id   | 1
event     | aaa
day       | 2010-01-02
array_agg | 
  {"[\"2010-01-02 10:00:00\",\"2010-02-01 10:10:00\")",
   "[\"2010-01-02 10:00:00\",\"2010-02-01 11:00:00\")",
   "[\"2010-01-02 12:00:00\",\"2010-02-01 12:10:00\")",
   "[\"2010-01-02 14:00:00\",\"2010-02-01 14:00:00\")"}
-[ RECORD 5 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
user_id   | 1
event     | aaa
day       | 2010-01-01
array_agg | 
  {"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")",
   "[\"2010-01-01 10:20:00\",\"2010-01-01 11:00:00\")",
   "[\"2010-01-01 10:30:00\",\"2010-01-01 12:00:00\")"}

As you can see, for each of the items we group by (user, type, day), we get an array of the time ranges. The ranges overlap and we need to calculate the length of the sessions. It would be nice to merge the ranges some way.

A Custom Aggregate Function

PostgreSQL allows for creating a custom aggregate function. The simplest aggregate is a combination of two functions:

sfunc(state, new-value) -> adds the new-value to the internal state
ffunc(state) -> calculates the final output out of the internal state

The custom aggregate with these two functions looks like a perfect solution:

the new-value will be a tsrange value from the current row
internally it will store an array of tsrange values (starting with an empty one)
when adding a new value, it will check if it overlaps with any values in the list, if it does, then the merged range will be stored
the last function will calculate the total length as the sum of the lenghts of the ranges in the list

The Implementation

The basic aggregate syntax is:

create aggreagate overlapping_range_length(tsrange)
(
	sfunc = add_overlapping_tsrange_to_array,
	stype = tsrange[],
	initcond = '{}'
	finalfunc = tsrange_array_length
);

where:

sfunc - will add a new value to the internal aggregator state
stype - the type of the internal state
initcond - the initial value of the internal state
finalfunc - the function to calculate the final value from the internal state. If this function is omitted, then the aggregate will return the internal state variable.

Calculating The Length of The Ranges

The final function must take tsrange[] as the argument and must return time. The simplest way to process array values is to convert it to a table with the unnest function:

$ select unnest(array[1,2,3]);
 unnest 
--------
      1
      2
      3
(3 rows)

There will be time ranges of type tsrange, I need to get their lengths with upper(value) - lower(value) and the sum all of them.

The final function looks like this:

create function tsrange_array_length(tsrange[])
returns interval
language sql
as $_$
  select 
    coalesce(
      sum(upper(value)-lower(value)),
      '00:00:00') 
  from (select unnest($1) as value) as _;
$_$;

Merging The Ranges in an Array

Now the more tricky part, the function for storing the internal state.

For each new value I need to go through all the values, I already have in an array, and check if any of them overlaps or is adjacent to the new one. What’s more, it’s possible that the value overlaps with two already existing values, so I need to do even more checks.

This function must have tsrange[] (the current internal state) and tsrange (the new value) as the arguments and should return tsrange[], which the aggregate will store in the internal state.

create function add_overlapping_tsrange_to_array(
  data tsrange[], value tsrange)
returns tsrange[]
language plpgsql
as $_$
declare
  -- array of values to return
  new_data tsrange[];
  -- value merged with values from the internal state
  new_value tsrange;
  -- current element from loop
  element tsrange;
begin

  -- we don't want to process null value
  if value is null then
    return new_data;
  end if;

  -- this will be merged with overlapping values
  -- and will be added to the returned array
  -- at the end
  new_value := value;

  -- loop over the internal state values
  for i in 1..coalesce(array_upper(data, 1), 0) loop
    element = data[i];
    -- if the current element is adjacent or overlaps the new_value
    -- then merged it and store in new_value
    if (element && new_value or element -|- new_value) then
      new_value := new_value + element;
    else
      -- if the current element is not adjacent and doesn't overlap
      -- then just add it to the returned list
      new_data := array_append(new_data, element);
    end if;
  end loop;
  
  -- here we have the all the ranges merged with the value one
  new_data := array_append(new_data, new_value);
  
  return new_data;

end;
$_$;

The Aggregates

I can create two aggregates out of the above function:

tsrange_agg - will return an array of the merged ranges
tsrange_length_agg - will return the length of the merged ranges


create aggregate tsrange_agg(tsrange) (
  sfunc = add_overlapping_tsrange_to_array,
  stype = tsrange[],
  initcond = '{}'
);

create aggregate tsrange_length_agg(tsrange) (
  sfunc = add_overlapping_tsrange_to_array,
  stype = tsrange[],
  initcond = '{}',
  finalfunc = tsrange_array_length
);

The Final Query

The previously used query, which generated overlapping ranges, was:

select 
  user_id, 
  event, 
  lower(timerange)::date as day, 
  array_agg(timerange) 
from events 
group by 
  user_id, 
  event, 
  lower(timerange)::date;

Here is a similar one, but using the aggregates:

select 
  user_id, 
  event, 
  lower(timerange)::date as day, 
  tsrange_agg(timerange) merged_range,
  tsrange_length_agg(timerange) length
from events 
group by 
  user_id, 
  event, 
  lower(timerange)::date;

And the results:

-[ RECORD 1 ]+----------------------------------------------------------------------------------------------------------
user_id      | 1
event        | bbb
day          | 2010-01-01
merged_range | 
  {"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")"}
length       | 00:10:00
-[ RECORD 2 ]+----------------------------------------------------------------------------------------------------------
user_id      | 2
event        | aaa
day          | 2010-01-01
merged_range | 
  {"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")",
   "[\"2010-01-01 11:00:00\",\"2010-01-01 11:10:00\")"}
length       | 00:20:00
-[ RECORD 3 ]+----------------------------------------------------------------------------------------------------------
user_id      | 2
event        | bbb
day          | 2010-01-01
merged_range | 
  {"[\"2010-01-01 11:00:00\",\"2010-01-01 12:00:00\")"}
length       | 01:00:00
-[ RECORD 4 ]+----------------------------------------------------------------------------------------------------------
user_id      | 1
event        | aaa
day          | 2010-01-02
merged_range | 
  {"[\"2010-01-02 10:00:00\",\"2010-02-01 14:00:00\")"}
length       | 30 days 04:00:00
-[ RECORD 5 ]+----------------------------------------------------------------------------------------------------------
user_id      | 1
event        | aaa
day          | 2010-01-01
merged_range | 
  {"[\"2010-01-01 10:00:00\",\"2010-01-01 10:10:00\")",
   "[\"2010-01-01 10:20:00\",\"2010-01-01 12:00:00\")"}
length       | 01:50:00

Summary

These aggregates work pretty nicely and simplify the queries. There can be made many others, for other data types and with different logic. Don’t process all your data only in applications, too often it’s too slow.