An Error in A Query Removed All Data

Tue, Nov 28, 2017

by Szymon Lipiński

tags: database postgresql sql

There Was an Error in a Good Query

After applying a data migration we had some duplicated rows in a table. I wrote a simple query for finding the duplicates and removing them from the table. All went fine. Or at least I thought so.

This kind of operations should always be done in a transaction, so the changes can be rolled back on error. The general workflow I use looks like this:

BEGIN;

DELETE FROM x...

SELECT * FROM x...

...

This leaves an open transaction, so I can check all the data without modifying them permanently in the database.

An Example of The Problem

I cannot show you the production code. This is an artificial example, but the problem is the same.

The Data Structure

There is just one table. The column some_id is a foreign key. For this example code I don’t want to add a new table, so it’s just an integer here.

CREATE TABLE test(
  id SERIAL PRIMARY KEY,
  some_id INTEGER NOT NULL,
  word TEXT NOT NULL
);

And the table looks like this:


                             Table "public.test"
 Column  |  Type   | Nullable |             Default              
---------+---------+----------+----------------------------------
 id      | integer | not null | nextval('test_id_seq'::regclass)
 some_id | integer | not null | 
 word    | text    | not null | 
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

Let’s insert some data:


INSERT INTO test(some_id, word)
VALUES
(1, 'aaa'),
(1, 'bbb'),
(1, 'ccc'),
(1, 'ddd'),
(1, 'aaa'),
(2, 'bbb'),
(2, 'bbb'),
(1, 'ccc'),
(2, 'ccc'),
(3, 'ccc')
;

And the data is:

SELECT * FROM test ORDER BY some_id, word;

 id | some_id | word 
----+---------+------
  5 |       1 | aaa
  1 |       1 | aaa
  2 |       1 | bbb
  8 |       1 | ccc
  3 |       1 | ccc
  4 |       1 | ddd
  6 |       2 | bbb
  7 |       2 | bbb
  9 |       2 | ccc
 10 |       3 | ccc
(10 rows)

The duplicates can be hard to see here, so let’s find them:


SELECT some_id, array_agg(word ORDER BY word)
FROM test
GROUP BY some_id
ORDER BY some_id;

 some_id |         array_agg         
---------+---------------------------
       1 | {aaa,aaa,bbb,ccc,ccc,ddd}
       2 | {bbb,bbb,ccc}
       3 | {ccc}
(3 rows)

The Constraint

According to the business logic, there can be only one row with a given word for each some_id. To ensure that the data is consistent, we need a database constraint:

ALTER TABLE test
ADD CONSTRAINT unique_some_id_word 
UNIQUE(some_id, word);

However trying to create this constraint on the table test with the current data ends with:

=> ALTER TABLE test ADD CONSTRAINT unique_some_id_word UNIQUE(some_id, word);
ERROR:  could not create unique index "unique_some_id_word"
DETAIL:  Key (some_id, word)=(2, bbb) is duplicated.

Find The Duplicates

The simplest way would be to use the rank windowing function.

SELECT
  id,
  RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
FROM test;

The above query divides the whole table data into small windows. Each of them will have the same values of the pair (some_id, word). The rows inside each window are sorted by the id (which doesn’t really matter, it’s just to have consistent results). Then each row in the window gets a unique increasing number, which is stored in the r column.

 id | r 
----+---
  1 | 1
  5 | 2
  2 | 1
  3 | 1
  8 | 2
  4 | 1
  6 | 1
  7 | 2
  9 | 1
 10 | 1
(10 rows)

If any of the rows has r value bigger than 1 then it’s a duplicate.

SELECT id
FROM (
  SELECT
    id,
    RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
  FROM test
) x
WHERE x.r > 1;

The rows we need to delete are:

 id 
----
  5
  8
  7
(3 rows)

Removing The Duplicates

In the previous query we got all the ids to remove. I thought that deleting the rows would be easy:

DELETE FROM test
WHERE id IN (
  SELECT id
  FROM (
    SELECT
      id,
      RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
    FROM test
  ) x
  WHERE x.r > 1
);
DELETE 3

And after that I got:

SELECT some_id, array_agg(word ORDER BY word)
FROM test
GROUP BY some_id
ORDER BY some_id;
 some_id |     array_agg     
---------+-------------------
       1 | {aaa,bbb,ccc,ddd}
       2 | {bbb,ccc}
       3 | {ccc}
(3 rows)

So it’s what I wanted.

The Terrible Mistake

In the original query I made a mistake. I used word instead of id.

DELETE FROM test
WHERE id IN (
  SELECT id
  FROM (
    SELECT
      word,
      RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
    FROM test
  ) x
  WHERE x.r > 1
);

And it was a little bit devastating:

DELETE 10

=> SELECT * FROM test;
 id | some_id | word 
----+---------+------
(0 rows)

Yey, so I just REMOVED ALL THE DATA.

The Problem

This looks weird, but let’s go from the beginning. The internal query is fine and returns:

SELECT
  word,
  RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
FROM test;

 word | r 
------+---
 aaa  | 1
 aaa  | 2
 bbb  | 1
 ccc  | 1
 ccc  | 2
 ddd  | 1
 bbb  | 1
 bbb  | 2
 ccc  | 1
 ccc  | 1
(10 rows)

However the next query has a huge error inside:

SELECT id
  FROM (
    SELECT
      word,
      RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
    FROM test
  ) x
  WHERE x.r > 1
);

ERROR:  column "id" does not exist
LINE 1: SELECT id
               ^

How does the external query work? Well, it’s possible to use the external table columns in a subquery. This way a query like this works:

UPDATE x SET a = (SELECT y.a FROM y WHERE y.some_id = a.id);

This is what happened here:

DELETE FROM test
WHERE id IN (  -- <= this is test.id
  SELECT id    -- <= this is also test.id
  FROM (
    SELECT
      word,
      RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
    FROM test
  ) x
  WHERE x.r > 1
);

Let’s just list all the ids we have in the test table:

SELECT id
FROM test
ORDER BY id;

 id 
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

Let’s also check what we really get when trying to delete the rows:

SELECT id
FROM test
WHERE id IN (  -- <= this is test.id
  SELECT id    -- <= this is also test.id
  FROM (
    SELECT
      word,
      RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
    FROM test
  ) x
  WHERE x.r > 1
);

 id 
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

In both queries we get the same thing.

The Summary

I don’t think it’s a bug. It’s a valid query, which has tremendously terrific side effect. It removes all the data, just because of an error in a subquery, which then was silently replaced with something else.

Another reason for testing code before running that on production.