An Error in A Query Removed All Data

Author: Szymon Lipiński
Published at: 2017-11-28

An Error in a Query

After a data migration we had some duplicated rows in a table. I wrote a simple query for finding the duplicates, and then removing the duplicated rows. All went fine.

You should always do that in a transaction, so you can rollback all changes. The general work flow I use looks like this:

BEGIN;

DELETE FROM x...

SELECT * FROM x...

...

This leaves an open transaction, so I can check all the data without changing that.

Example of The Problem

I cannot show you the production code. This is an artificial example, but the problem is the same.

The Data Structure

There is just one table. The column some_id is a foreign key, for this example code I don’t want to add a new table here, so it’s just an integer here.

CREATE TABLE test(
  id SERIAL PRIMARY KEY,
  some_id INTEGER NOT NULL,
  word TEXT NOT NULL
);

And the table looks like this:

                             Table "public.test"
 Column  |  Type   | Nullable |             Default              
---------+---------+----------+----------------------------------
 id      | integer | not null | nextval('test_id_seq'::regclass)
 some_id | integer | not null | 
 word    | text    | not null | 
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)

Let’s insert some data:

INSERT INTO test(some_id, word)
VALUES
(1, 'aaa'),
(1, 'bbb'),
(1, 'ccc'),
(1, 'ddd'),
(1, 'aaa'),
(2, 'bbb'),
(2, 'bbb'),
(1, 'ccc'),
(2, 'ccc'),
(3, 'ccc')
;

And the data is:

SELECT * FROM test ORDER BY some_id, word;
 id | some_id | word 
----+---------+------
  5 |       1 | aaa
  1 |       1 | aaa
  2 |       1 | bbb
  8 |       1 | ccc
  3 |       1 | ccc
  4 |       1 | ddd
  6 |       2 | bbb
  7 |       2 | bbb
  9 |       2 | ccc
 10 |       3 | ccc
(10 rows)

The duplicates can be hard to see here, so let’s check this:

SELECT some_id, array_agg(word ORDER BY word)
FROM test
GROUP BY some_id
ORDER BY some_id;

 some_id |         array_agg         
---------+---------------------------
       1 | {aaa,aaa,bbb,ccc,ccc,ddd}
       2 | {bbb,bbb,ccc}
       3 | {ccc}
(3 rows)

The Constraint

The business logic for that table is that there can be just one word for the some_id. To ensure that the data is consistent, we need a database constraint:

ALTER TABLE test
ADD CONSTRAINT unique_some_id_word 
UNIQUE(some_id, word);

However trying to create this constraint on the table test with the current data ends with:

=> ALTER TABLE test ADD CONSTRAINT unique_some_id_word UNIQUE(some_id, word);
ERROR:  could not create unique index "unique_some_id_word"
DETAIL:  Key (some_id, word)=(2, bbb) is duplicated.

Find The Duplicates

The simplest way would be to use the rank windowing function.

SELECT
  id,
  RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
FROM test;

The above query divides the whole table data into small windows, each of them will have the same values (some_id, word), then inside the window, the rows are sorted by id (which doesn’t really matter, it’s just to have consistent results). Then each row in the window gets a number, which is stored in the r column.

 id | r 
----+---
  1 | 1
  5 | 2
  2 | 1
  3 | 1
  8 | 2
  4 | 1
  6 | 1
  7 | 2
  9 | 1
 10 | 1
(10 rows)

And now if any of the rows has r value bigger than 1 then it’s a duplicate.

SELECT id
FROM (
  SELECT
    id,
    RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
  FROM test
) x
WHERE x.r > 1;

And this returns:

 id 
----
  5
  8
  7
(3 rows)

Removing The Duplicates

In the previous query we got all the ids to remove, so now it should be as easy as:

DELETE FROM test
WHERE id IN (
  SELECT id
  FROM (
    SELECT
      id,
      RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
    FROM test
  ) x
  WHERE x.r > 1
);
DELETE 3

And after that I get:

SELECT some_id, array_agg(word ORDER BY word)
FROM test
GROUP BY some_id
ORDER BY some_id;
 some_id |     array_agg     
---------+-------------------
       1 | {aaa,bbb,ccc,ddd}
       2 | {bbb,ccc}
       3 | {ccc}
(3 rows)

So it’s what I wanted.

The Terrible Mistake

In the original query I made a mistake, and I used word instead of id. Which generally looked like this:

DELETE FROM test
WHERE id IN (
  SELECT id
  FROM (
    SELECT
      word,
      RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
    FROM test
  ) x
  WHERE x.r > 1
);

And this is a little bit devastating:

DELETE 10

=> SELECT * FROM test;
 id | some_id | word 
----+---------+------
(0 rows)

Yey, so just REMOVED ALL THE DATA.

The Problem

This looks wierd, but let’s go from the beginning.

The internal query is fine, and returns:

SELECT
  word,
  RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
FROM test;

 word | r 
------+---
 aaa  | 1
 aaa  | 2
 bbb  | 1
 ccc  | 1
 ccc  | 2
 ddd  | 1
 bbb  | 1
 bbb  | 2
 ccc  | 1
 ccc  | 1
(10 rows)

However the next query has a huge error inside:

SELECT id
  FROM (
    SELECT
      word,
      RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
    FROM test
  ) x
  WHERE x.r > 1
);

ERROR:  column "id" does not exist
LINE 1: SELECT id
               ^

So how does the external query work? Well, it’s possible to use the external table columns in a subquery. This way a query like this works:

UPDATE x SET a = (SELECT y.a FROM y WHERE y.some_id = a.id);

And this is what happened here:

DELETE FROM test
WHERE id IN (  -- <= this is test.id
  SELECT id    -- <= this is also test.id
  FROM (
    SELECT
      word,
      RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
    FROM test
  ) x
  WHERE x.r > 1
);

Let’s just list all the ids we have in the test` table:

SELECT id
FROM test
ORDER BY id;

 id 
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

and let’s check what we really get when trying to delete the rows:

SELECT id
FROM test
WHERE id IN (  -- <= this is test.id
  SELECT id    -- <= this is also test.id
  FROM (
    SELECT
      word,
      RANK() OVER (PARTITION BY some_id, word ORDER BY id) r
    FROM test
  ) x
  WHERE x.r > 1
);

 id 
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

So generally we get the same thing.

The Summary

I don’t think it’s a bug. It’s a valid query, which has tremendously terrific side effect. It just removes all the data, just because of an error in a subquery, which then was silently replaced with something else.

Another reason for testing code before running that on production.

The comments are disabled. If you want to write something to me, you can use e.g. Twitter.