Over the last five years, our database at Trustly has kept growing in the number of tables, columns and functions, and some of those tables and columns are no longer used by any database function. Getting rid of them is important: otherwise, people working with the database get confused and annoyed. Database developers should always be able to rely on the data model being relevant and up to date.
In our system, no applications access the database tables directly; instead, everything goes through stored procedures.
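As a rough sketch of that pattern (the table, column and function names here are invented for illustration), an application never queries a table itself but calls a function that wraps the access:

```sql
-- Hypothetical example: the application never runs "SELECT ... FROM users"
-- itself; it calls a stored procedure that encapsulates the table access.
CREATE FUNCTION get_user_email(_userid integer) RETURNS text AS $$
    SELECT email FROM users WHERE userid = _userid;
$$ LANGUAGE sql STABLE;

-- The application side only ever sees:
SELECT get_user_email(42);
```

This is what makes the function source code in pg_proc a reliable inventory of which tables and columns are actually in use.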
This means that if a table or column name is not present anywhere in any function's source code, the table or column is very likely not used by anything. The only exception is dynamically crafted queries executed using EXECUTE, where the table or column names are constructed from different parts. In our system, we thankfully have only a few such cases.
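To show why EXECUTE is a blind spot, here is a hypothetical sketch (the function and table names are invented): the full table name never appears verbatim in the function body, so a regex search over pg_proc.prosrc cannot find it.

```sql
-- Hypothetical example: the table "useraccesslog" is referenced, but the
-- string "useraccesslog" never appears in the source, so searching
-- pg_proc.prosrc for it finds nothing.
CREATE FUNCTION purge_old_rows(_suffix text) RETURNS void AS $$
BEGIN
    EXECUTE format('DELETE FROM %I WHERE datestamp < now() - interval ''1 year''',
                   'useraccess' || _suffix);
END;
$$ LANGUAGE plpgsql;
```

Any tables touched only through such constructed names have to be checked by hand.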
SELECT Tables.TableName
FROM (
    SELECT DISTINCT regexp_replace(pg_catalog.pg_class.relname, 's$', '')
    FROM pg_catalog.pg_class
    INNER JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
    WHERE pg_catalog.pg_class.relkind = 'r'
    AND pg_catalog.pg_namespace.nspname NOT IN ('information_schema','pg_catalog')
) Tables(TableName)
WHERE NOT EXISTS (
    SELECT 1 FROM pg_catalog.pg_proc
    WHERE pg_catalog.pg_proc.prosrc ~* Tables.TableName
)
ORDER BY Tables.TableName
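The regexp_replace strips a trailing "s" so that a plural table name still matches singular references in function bodies; since ~* does a substring match, the shortened name matches both forms. A small illustration (the table name "orders" is invented):

```sql
SELECT regexp_replace('orders', 's$', '');   -- 'order'
SELECT 'INSERT INTO orders ...' ~* 'order';  -- true: 'order' matches inside 'orders'
```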
This query returned quite a lot of table names, about half of them false positives, but still a manageable list to go through manually.
50 minutes of manual work later:
92 files changed, 1114 deletions(-)
DROP TABLE: 16
DROP VIEW: 6
DROP SEQUENCE: 7
DROP FK CONSTRAINT: 5
DROP CONSTRAINT: 17
False positives: 14
Then I moved on to the task of finding unused table columns.
The query below excludes the tables identified in the previous step that we know are unused by the stored procedures but want to keep around anyway.
SELECT DISTINCT Columns.ColumnName
FROM (
    SELECT regexp_replace(pg_catalog.pg_attribute.attname, 'id$', '')
    FROM pg_catalog.pg_class
    INNER JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
    INNER JOIN pg_catalog.pg_attribute ON pg_catalog.pg_attribute.attrelid = pg_catalog.pg_class.oid
    WHERE pg_catalog.pg_class.relkind = 'r'
    AND pg_catalog.pg_attribute.attnum > 0
    AND NOT pg_catalog.pg_attribute.attisdropped
    AND pg_catalog.pg_namespace.nspname NOT IN ('information_schema','pg_catalog')
    -- Exclude columns in tables we know are unused by the stored procedures, but we want to keep around anyway:
    AND pg_catalog.pg_class.relname !~* '^(alexacountrie|bankersworldonline|bindgroup|clearinghousecurrencie|dailystat|geoiporganization|hourlystat|usercommitment|polishbank|polishbanknumber|swedishpostalcode|testbasedata|testperlmodule|useraccesslogarchive)'
) Columns(ColumnName)
WHERE NOT EXISTS (
    SELECT 1 FROM pg_catalog.pg_proc
    WHERE pg_catalog.pg_proc.prosrc ~* Columns.ColumnName
)
ORDER BY Columns.ColumnName
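Short column names are the main source of false positives here, because ~* matches substrings: a column is considered "used" if its name merely occurs inside some longer word in any function body. A hypothetical illustration (the column and table names are invented):

```sql
-- 'state' occurs inside 'statementdate', so a column named "state" would
-- be reported as used even by a function that never touches it:
SELECT 'UPDATE orders SET statementdate = now()' ~* 'state';  -- true
```

This is why each candidate column still had to be verified manually before dropping it.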
It took me two hours to go through all the code. It was an interesting journey back in time, with lots of memories.
50 files changed, 5396 insertions(+), 5917 deletions(-)
ALTER TABLE DROP COLUMN: 30
False positives: 87
The reason there were so many insertions and deletions is that the dropped columns affected some of the base tables containing reference data, which had to be regenerated, touching every line in those files.
In summary, the ROI on those three hours of time invested is enormous. Developers can now feel confident that every table and column fulfills a purpose in the system. Of course, this exercise will need to be repeated in the future.