Garbage Collection of Unused PostgreSQL Tables and Columns

Over the last five years, our database at Trustly have kept growing in number of tables, columns and functions, where some of the tables and columns aren’t being used by any database functions any longer. Getting rid of them is important, as otherwise people working with the database will be confused and annoyed. Database developers should always be able to rely on the data model being relevant and up to date.

In our system, no applications access the database tables directly, instead everything goes through stored procedures.

This means, if a table or column name is not present anywhere in any function’s source code, it’s very likely the table/column is not being used by anything. The only exception is if you have dynamically crafted queries executed using EXECUTE, where the table/column names are constructed from different parts. In our system, we thankfully only have a few such cases.

SELECT
Tables.TableName
FROM (
SELECT DISTINCT
regexp_replace(pg_catalog.pg_class.relname,'s$','')
FROM pg_catalog.pg_class
INNER JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
WHERE pg_catalog.pg_class.relkind = 'r'
AND pg_catalog.pg_namespace.nspname NOT IN ('information_schema','pg_catalog')
) Tables(TableName)
WHERE NOT EXISTS (
SELECT 1 FROM pg_catalog.pg_proc
WHERE pg_catalog.pg_proc.prosrc ~* Tables.TableName
)
ORDER BY Tables.TableName

This query returned quite a lot of table names with about half of them being false positives,
but still a managable list to go through manually.

50 minutes of manual work later:

92 files changed, 1114 deletions(-)
DROP TABLE: 16
DROP VIEW: 6
DROP SEQUENCES: 7
DROP FK_CONSTRAINTS: 5
DROP CONSTRAINTS: 17
False positives: 14

Then I moved on to the task of finding unused table columns.
The query below excludes any false positives found in the previous query.

SELECT DISTINCT
    Columns.ColumnName
FROM (
    SELECT
        regexp_replace(pg_catalog.pg_attribute.attname,'id$','')
    FROM pg_catalog.pg_class
    INNER JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
    INNER JOIN pg_catalog.pg_attribute ON pg_catalog.pg_attribute.attrelid = pg_catalog.pg_class.oid
    WHERE pg_catalog.pg_class.relkind = 'r'
    AND pg_catalog.pg_attribute.attnum > 0
    AND NOT pg_catalog.pg_attribute.attisdropped
    AND pg_catalog.pg_namespace.nspname NOT IN ('information_schema','pg_catalog')
    -- Exclude columns in tables we know are unused by the stored procedures, but we want to keep around anyway:
    AND pg_catalog.pg_class.relname !~* '^(alexacountrie|bankersworldonline|bindgroup|clearinghousecurrencie|dailystat|geoiporganization|hourlystat|usercommitment|polishbank|polishbanknumber|swedishpostalcode|testbasedata|testperlmodule|useraccesslogarchive)'
) Columns(ColumnName)
WHERE NOT EXISTS (
    SELECT 1 FROM pg_catalog.pg_proc
    WHERE pg_catalog.pg_proc.prosrc ~* Columns.ColumnName
)
ORDER BY Columns.ColumnName

It took me two hours to go through all code. It was an interesting journey in time with lots of memories.

50 files changed, 5396 insertions(+), 5917 deletions(-)
ALTER TABLE DROP COLUMN: 30
False positives: 87

The reason why there were so many insertions and deletions, was because the dropped columns affected some of the base tables with reference data, which had to be regenerated, thus affecting all lines in those files.

In summary, the ROI on those three hours of time invested is enormous. Developers can now feel confident all tables and columns fulfill a purpose in the system. This exercise will of course need to be repeated in the future though.

Uncategorized

2014-04-30

5 Comments

alexk

I think dynamic queries are not the only thing one should worry about when dropping columns. An easy example to break things would be a function written in C (or other language that does’t expose the source in procsrc column). A more realistic one is the code that does SELECT * (or INSERT/UPDATE/DELETE RETURNING *). Of course, any good book on SQL will suggest listing the columns explicitly, but not every developer reads good books on SQL…

Also, if the column name is not unique, the statement above might not show some unused columns that have the same name as the used ones. Another point for having sane SQL coding conventions.

Permalink, Reply

2014-04-30 15:42

Son

It’s hard to find your articles in google. I found it on 17 spot,
you should build quality backlinks , it will help you to
rank to google top 10. I know how to help you, just type in google – k2 seo tricks

2014-07-12 15:57

Johnc64

Good writeup, I am normal visitor of ones blog, maintain up the excellent operate, and It’s going to be a regular visitor for a lengthy time. fbdbdkcbedfa

2014-09-01 14:05

EtsukoDutton

Hi admin, i see your site needs fresh posts. Daily updates will rank your
site in google higher, content is king nowadays. If you are to
lazy to write unique posts everyday you should search in google for:

Ightsero’s Essential Tool

2015-06-30 03:11

Joel on SQL