PostGIS and Small Number of Huge Geometries

by Szymon LipiƄski

A couple of weeks ago I ran into quite a strange situation: a PostgreSQL table with a small number of huge geometries. A program had to query this table really often, so speed was crucial.

The table was really small, but the spatial query, using PostGIS, was very slow.

Replicate the Situation

Small Number of Big Geometries

First of all, let’s create a table for a couple of huge geometries:

CREATE TABLE big_geometries (
  id SERIAL PRIMARY KEY,
  geometry geometry
);

and insert some data:

INSERT INTO big_geometries (geometry)
SELECT
    ST_Buffer(
      GeomFromEWKT('SRID=4326;POINT(' || x || ' ' || y || ')'),
      20,
      'quad_segs=10000'
    )
FROM
    generate_series(1,50,2) x,
    generate_series(10, 50, 5) y;

CREATE INDEX i_big_geometries_geometry
  ON big_geometries USING GIST (geometry);

ANALYZE big_geometries;

How many geometries do we have there?

SELECT COUNT(*) FROM big_geometries;

 count
-------
  225
(1 row)

And how many points does each of them have?

SELECT min(ST_NPoints(geometry)),
       avg(ST_NPoints(geometry)),
       max(ST_NPoints(geometry))
FROM big_geometries;

  min  |        avg         |  max
-------+--------------------+-------
 40001 | 40001.000000000000 | 40001
(1 row)

Huge Number of Small Geometries

Let’s create another table, full of small geometries, just for comparison.

CREATE TABLE small_geometries (
  id SERIAL PRIMARY KEY,
  geometry geometry
);

INSERT INTO small_geometries (geometry)
SELECT
    ST_Buffer(
      GeomFromEWKT('SRID=4326;POINT(' || x || ' ' || y || ')'),
      1,
      'quad_segs=8'
    )
FROM
    generate_series(1,80,2) x, generate_series(1,80,2) y;

CREATE INDEX i_small_geometries_geometry
  ON small_geometries USING GIST (geometry);

SELECT COUNT(*) FROM small_geometries;
 count
-------
 1600
(1 row)

SELECT min(ST_NPoints(geometry)),
       avg(ST_NPoints(geometry)),
       max(ST_NPoints(geometry))
FROM small_geometries;

 min |         avg         | max
-----+---------------------+-----
  33 | 33.0000000000000000 |  33
(1 row)
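
To get a feeling for how different those two tables really are on disk, we can also compare their total sizes. This is just a quick sanity check, not part of the original comparison, using standard PostgreSQL size functions:

-- total on-disk size of each table, including TOAST and indexes
SELECT pg_size_pretty(pg_total_relation_size('big_geometries'))   AS big_size,
       pg_size_pretty(pg_total_relation_size('small_geometries')) AS small_size;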

Meta Info

To simulate the real situation, I need a couple more tables, one with some meta information for each set of geometries:

CREATE TABLE big_info(
  id SERIAL PRIMARY KEY,
  geometry_id INTEGER REFERENCES big_geometries,
  info TEXT NOT NULL,
  version INTEGER NOT NULL,
  CHECK(version >= 0),
  UNIQUE(geometry_id, version)
);

CREATE TABLE small_info(
  id SERIAL PRIMARY KEY,
  geometry_id INTEGER REFERENCES small_geometries,
  info TEXT NOT NULL,
  version INTEGER NOT NULL,
  CHECK(version >= 0),
  UNIQUE(geometry_id, version)
);

Add some semi-random data:

INSERT INTO big_info(geometry_id, info, version)
  SELECT id, 'big_'||id, 0 FROM big_geometries LIMIT 25;

INSERT INTO big_info(geometry_id, info, version)
  SELECT id, 'big_'||id, 1 FROM big_geometries LIMIT 40;

INSERT INTO big_info(geometry_id, info, version)
  SELECT id, 'big_'||id, 3 FROM big_geometries LIMIT 300;

INSERT INTO small_info(geometry_id, info, version)
  SELECT id, 'small_'||id, 0 FROM small_geometries LIMIT 25;

INSERT INTO small_info(geometry_id, info, version)
  SELECT id, 'small_'||id, 1 FROM small_geometries LIMIT 40;

INSERT INTO small_info(geometry_id, info, version)
  SELECT id, 'small_'||id, 3 FROM small_geometries LIMIT 300;

CREATE INDEX i_big_info_geometry_id ON big_info (geometry_id);
CREATE INDEX i_small_info_geometry_id ON small_info (geometry_id);

And finally, vacuum and analyze all the tables:

VACUUM ANALYZE small_geometries;
VACUUM ANALYZE big_geometries;
VACUUM ANALYZE small_info;
VACUUM ANALYZE big_info;

All Tables

OK, so now I have two tables with geometries:

  table name         no. of geometries    no. of points in each geometry
  ------------------ -------------------- ---------------------------------
  big_geometries      225                  40k
  small_geometries    1.6k                 33

And two tables with meta info: small_info with 365 rows and big_info with 290 (the last INSERT into big_info could only take the 225 geometries that exist).
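
Those numbers are easy to verify with a quick count:

SELECT (SELECT count(*) FROM big_info)   AS big_info_rows,
       (SELECT count(*) FROM small_info) AS small_info_rows;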

Let’s Query Now

Get Simple Data

The thing I need to get is the latest version of the meta information for a geometry which intersects a given geometry and has a given info text.

I will check both tables (big and small) using similar queries.
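
One small note: the "Time:" lines below come from psql’s built-in timing. Assuming you are following along in psql, it can be switched on with:

\timing on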

SELECT i.id, i.info
FROM small_geometries g, small_info i
WHERE
      ST_Intersects(
        geometry,
        GeomFromEWKT('SRID=4326;LINESTRING(1 1, 22 22)')
      )
  AND i.info = 'small_1'
  AND g.id = i.geometry_id
  AND i.version = (
                    SELECT max(ii.version) 
                    FROM small_info ii
                    WHERE ii.geometry_id = i.geometry_id
                  );

 id |  info
----+---------
 66 | small_1
(1 row)

Time: 3.302 ms

SELECT i.id, i.info
FROM big_geometries g, big_info i
WHERE
      ST_Intersects(
        geometry,
        GeomFromEWKT('SRID=4326;LINESTRING(1 1, 22 22)')
      )
  AND i.info = 'big_10'
  AND g.id = i.geometry_id
  AND i.version = (
                    SELECT max(ii.version)
                    FROM big_info ii
                    WHERE ii.geometry_id = i.geometry_id
                  );

 id |  info
----+--------
 75 | big_10
(1 row)

Time: 971.588 ms

The time difference between those queries is really significant. In my real-life case I’ve got the table with big geometries, and almost one second per query is way too much for me.

In real life I don’t query with such a big geometry as 'LINESTRING(1 1, 22 22)'; I use it here just to have a nicer example.

Let’s try to do something about that.

Check Execution Plans

Let’s compare the execution plans for both:

EXPLAIN ANALYZE
SELECT i.id, i.info
FROM small_geometries g, small_info i
WHERE
      ST_Intersects(
        geometry,
        GeomFromEWKT('SRID=4326;LINESTRING(1 1, 22 22)')
      )
  AND i.info = 'small_1'
  AND g.id = i.geometry_id
  AND i.version = (
                    SELECT max(ii.version)
                    FROM small_info ii
                    WHERE ii.geometry_id = i.geometry_id
                  );

                                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=5.11..479.96 rows=1 width=13) (actual time=0.435..3.529 rows=1 loops=1)
   ->  Bitmap Heap Scan on small_geometries g  (cost=5.11..161.18 rows=38 width=4) (actual time=0.228..3.249 rows=11 loops=1)
         Recheck Cond: (geometry && 'gggg'::geometry)
         Filter: _st_intersects(geometry, 'gggg'::geometry)
         ->  Bitmap Index Scan on i_small_geometries_geometry  (cost=0.00..5.10 rows=113 width=0) (actual time=0.064..0.064 rows=144 loops=1)
               Index Cond: (geometry && 'gggg'::geometry)
   ->  Index Scan using i_small_info_geometry_id on small_info i  (cost=0.00..8.38 rows=1 width=21) (actual time=0.022..0.022 rows=0 loops=11)
         Index Cond: (i.geometry_id = g.id)
         Filter: ((i.info = 'small_1'::text) AND (i.version = (SubPlan 1)))
         SubPlan 1
           ->  Aggregate  (cost=7.57..7.58 rows=1 width=4) (actual time=0.059..0.060 rows=1 loops=3)
                 ->  Seq Scan on small_info ii  (cost=0.00..7.56 rows=1 width=4) (actual time=0.002..0.051 rows=3 loops=3)
                       Filter: (geometry_id = $0)
 Total runtime: 3.649 ms
(14 rows)

I’ve edited this explain output a little bit (the other plans below are edited the same way) to make it more readable. The plans contain some geometries in WKB format; I’ve replaced the geometry below with the string 'gggg'.

0102000020E610000002000000000000000000F03F000000000000F03F00000000000036400000000000003640

This geometry is exactly:

SELECT ST_AsText('0102000020E610000002000000000000000000F03F000000000000F03F00000000000036400000000000003640');
       st_astext
-----------------------
 LINESTRING(1 1,22 22)
(1 row)

The plan for the query using the big_geometries table:

EXPLAIN ANALYZE
SELECT i.id, i.info
FROM big_geometries g, big_info i
WHERE
      ST_Intersects(
        geometry, 
        GeomFromEWKT('SRID=4326;LINESTRING(1 1, 22 22)')
      )
  AND i.info = 'big_10'
  AND g.id = i.geometry_id
  AND i.version = (
                    SELECT max(ii.version) 
                    FROM big_info ii 
                    WHERE ii.geometry_id = i.geometry_id
                  );

                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=4.81..190.36 rows=1 width=11) (actual time=61.797..959.664 rows=1 loops=1)
   ->  Bitmap Heap Scan on big_geometries g  (cost=4.81..26.24 rows=25 width=4) (actual time=10.314..955.373 rows=124 loops=1)
         Recheck Cond: (geometry && 'gggg'::geometry)
         Filter: _st_intersects(geometry, 'gggg'::geometry)
         ->  Bitmap Index Scan on i_big_geometries_geometry  (cost=0.00..4.81 rows=74 width=0) (actual time=0.044..0.044 rows=147 loops=1)
               Index Cond: (geometry && 'gggg'::geometry)
   ->  Index Scan using i_big_info_geometry_id on big_info i  (cost=0.00..6.55 rows=1 width=19) (actual time=0.026..0.026 rows=0 loops=124)
         Index Cond: (i.geometry_id = g.id)
         Filter: ((i.info = 'big_10'::text) AND (i.version = (SubPlan 1)))
         SubPlan 1
           ->  Aggregate  (cost=5.63..5.64 rows=1 width=4) (actual time=0.053..0.054 rows=1 loops=3)
                 ->  Seq Scan on big_info ii  (cost=0.00..5.62 rows=1 width=4) (actual time=0.005..0.043 rows=3 loops=3)
                       Filter: (geometry_id = $0)
 Total runtime: 959.806 ms
(14 rows)

Analysis

Both plans have exactly the same shape: a nested loop which first scans the geometry table through the spatial index (a bitmap index scan, then a recheck with _st_intersects on every candidate row) and only after that looks up the matching rows in the info table.

The only difference between those two queries is that one table has a small number of big geometries, while the other has a huge number of small geometries.

After a couple of checks it turned out that the expensive part is the PostGIS function ST_Intersects, or more precisely the internal _st_intersects it calls for the exact check. The definition of that function looks like this:

CREATE OR REPLACE FUNCTION _st_intersects(geometry, geometry)
  RETURNS boolean AS
  '$lib/postgis-1.5', 'intersects'
  LANGUAGE c IMMUTABLE STRICT
  COST 100;

The last parameter, COST, is the estimated per-call execution cost that the planner uses when it calculates the costs of different execution plans. What if I changed it to some extremely high value?
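
Before changing anything, the current setting can be checked directly in the system catalog; a quick sketch:

-- the planner cost currently assigned to the internal intersects function
SELECT proname, procost
FROM pg_proc
WHERE proname = '_st_intersects';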

The Solution

What about changing the order of those steps: first find the metadata, and only then check the geometry for each metadata row found? Raising the cost of _st_intersects should push the planner toward exactly that plan:

CREATE OR REPLACE FUNCTION _st_intersects(geometry, geometry)
  RETURNS boolean AS
  '$lib/postgis-1.5', 'intersects'
  LANGUAGE c IMMUTABLE STRICT
  COST 10000;

This should make PostgreSQL avoid it at all costs.
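
The same change can also be made without re-creating the function, which is a bit less invasive; a sketch using plain ALTER FUNCTION:

-- equivalent to the CREATE OR REPLACE above, only the cost changes
ALTER FUNCTION _st_intersects(geometry, geometry) COST 10000;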

Execution Plans After The Change

Let’s check the execution plans once again:

EXPLAIN ANALYZE
SELECT i.id, i.info
FROM big_geometries g, big_info i
WHERE
      ST_Intersects(
        geometry,
        GeomFromEWKT('SRID=4326;LINESTRING(1 1, 22 22)')
      )
  AND i.info = 'big_10'
  AND g.id = i.geometry_id
  AND i.version = (
                    SELECT max(ii.version) 
                    FROM big_info ii 
                    WHERE ii.geometry_id = i.geometry_id
                  );
                                                                                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..1688.06 rows=1 width=11) (actual time=18.538..18.637 rows=1 loops=1)
   ->  Index Scan using i_big_info_geometry_id on big_info i  (cost=0.00..1654.77 rows=1 width=19) (actual time=0.417..0.504 rows=1 loops=1)
         Filter: ((info = 'big_10'::text) AND (version = (SubPlan 1)))
         SubPlan 1
           ->  Aggregate  (cost=5.63..5.64 rows=1 width=4) (actual time=0.115..0.116 rows=1 loops=3)
                 ->  Seq Scan on big_info ii  (cost=0.00..5.62 rows=1 width=4) (actual time=0.008..0.104 rows=3 loops=3)
                       Filter: (geometry_id = $0)
   ->  Index Scan using big_geometries_pkey on big_geometries g  (cost=0.00..33.27 rows=1 width=4) (actual time=18.114..18.123 rows=1 loops=1)
         Index Cond: (g.id = i.geometry_id)
         Filter: ((g.geometry && 'gggg'::geometry) AND _st_intersects(g.geometry, 'gggg'::geometry))
 Total runtime: 18.736 ms
(11 rows)

It really helped.

One more thing left to do: check that it hasn’t broken anything for the small_geometries table:

EXPLAIN ANALYZE
SELECT i.id, i.info
FROM small_geometries g, small_info i
WHERE
      ST_Intersects(
        geometry, 
        GeomFromEWKT('SRID=4326;LINESTRING(1 1, 22 22)')
      )
  AND i.info = 'small_1'
  AND g.id = i.geometry_id
  AND i.version = (
                    SELECT max(ii.version) 
                    FROM small_info ii 
                    WHERE ii.geometry_id = i.geometry_id
                  );
                                                                                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..2820.75 rows=1 width=13) (actual time=0.374..0.477 rows=1 loops=1)
   ->  Index Scan using i_small_info_geometry_id on small_info i  (cost=0.00..2787.46 rows=1 width=21) (actual time=0.206..0.303 rows=1 loops=1)
         Filter: ((info = 'small_1'::text) AND (version = (SubPlan 1)))
         SubPlan 1
           ->  Aggregate  (cost=7.57..7.58 rows=1 width=4) (actual time=0.060..0.061 rows=1 loops=3)
                 ->  Seq Scan on small_info ii  (cost=0.00..7.56 rows=1 width=4) (actual time=0.002..0.052 rows=3 loops=3)
                       Filter: (geometry_id = $0)
   ->  Index Scan using small_geometries_pkey on small_geometries g  (cost=0.00..33.27 rows=1 width=4) (actual time=0.163..0.166 rows=1 loops=1)
         Index Cond: (g.id = i.geometry_id)
         Filter: ((g.geometry && 'gggg'::geometry) AND _st_intersects(g.geometry, 'gggg'::geometry))
 Total runtime: 0.520 ms
(11 rows)

The plan looks the same as for big_geometries. The time is much better now.

Summary

1) I changed only the cost of the function _st_intersects(geometry, geometry); that is quite a nice effect for adjusting nothing more than one library function’s cost setting.

  Geometries                Time Before [ms]       Time After [ms]
  ------------------------- ---------------------- -----------------------
  BIG                       971.588                10.328
  SMALL                     3.302                  1.462

2) I think it is worth considering whether the default cost of this function (and maybe of other spatial functions) in the PostGIS installation should be changed to some higher value.
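
If you want to experiment with that on your own installation, the same ALTER FUNCTION trick works for the other internal spatial predicates too. A sketch, assuming the PostGIS 1.5 function names; the value 10000 is just the one that happened to work well here, not a general recommendation:

-- raise the planner cost of a few internal spatial predicates;
-- adjust the list and the value to your own data and workload
ALTER FUNCTION _st_intersects(geometry, geometry) COST 10000;
ALTER FUNCTION _st_contains(geometry, geometry)   COST 10000;
ALTER FUNCTION _st_within(geometry, geometry)     COST 10000;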