Garbage Collection of Unused PostgreSQL Tables and Columns

Over the last five years, our database at Trustly have kept growing in number of tables, columns and functions, where some of the tables and columns aren’t being used by any database functions any longer. Getting rid of them is important, as otherwise people working with the database will be confused and annoyed. Database developers should always be able to rely on the data model being relevant and up to date.

In our system, no applications access the database tables directly, instead everything goes through stored procedures.

This means, if a table or column name is not present anywhere in any function’s source code, it’s very likely the table/column is not being used by anything. The only exception is if you have dynamically crafted queries executed using EXECUTE, where the table/column names are constructed from different parts. In our system, we thankfully only have a few such cases.

FROM pg_catalog.pg_class
INNER JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
WHERE pg_catalog.pg_class.relkind = 'r'
AND pg_catalog.pg_namespace.nspname NOT IN ('information_schema','pg_catalog')
) Tables(TableName)
SELECT 1 FROM pg_catalog.pg_proc
WHERE pg_catalog.pg_proc.prosrc ~* Tables.TableName
ORDER BY Tables.TableName

This query returned quite a lot of table names with about half of them being false positives,
but still a managable list to go through manually.

50 minutes of manual work later:

92 files changed, 1114 deletions(-)
False positives: 14

Then I moved on to the task of finding unused table columns.
The query below excludes any false positives found in the previous query.

    FROM pg_catalog.pg_class
    INNER JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
    INNER JOIN pg_catalog.pg_attribute ON pg_catalog.pg_attribute.attrelid = pg_catalog.pg_class.oid
    WHERE pg_catalog.pg_class.relkind = 'r'
    AND pg_catalog.pg_attribute.attnum > 0
    AND NOT pg_catalog.pg_attribute.attisdropped
    AND pg_catalog.pg_namespace.nspname NOT IN ('information_schema','pg_catalog')
    -- Exclude columns in tables we know are unused by the stored procedures, but we want to keep around anyway:
    AND pg_catalog.pg_class.relname !~* '^(alexacountrie|bankersworldonline|bindgroup|clearinghousecurrencie|dailystat|geoiporganization|hourlystat|usercommitment|polishbank|polishbanknumber|swedishpostalcode|testbasedata|testperlmodule|useraccesslogarchive)'
) Columns(ColumnName)
    SELECT 1 FROM pg_catalog.pg_proc
    WHERE pg_catalog.pg_proc.prosrc ~* Columns.ColumnName
ORDER BY Columns.ColumnName

It took me two hours to go through all code. It was an interesting journey in time with lots of memories.

50 files changed, 5396 insertions(+), 5917 deletions(-)
False positives: 87

The reason why there were so many insertions and deletions, was because the dropped columns affected some of the base tables with reference data, which had to be regenerated, thus affecting all lines in those files.

In summary, the ROI on those three hours of time invested is enormous. Developers can now feel confident all tables and columns fulfill a purpose in the system. This exercise will of course need to be repeated in the future though.

SET search_path for all SECURITY DEFINER functions

As pointed out by Peter Eisentraut in a blog post named
Schema Search Paths Considered Pain in the Butt, you need to make sure the search_path is explicitly set for all SECURITY DEFINER functions in PostgreSQL.

Fixing this manually for, in my case, 2106 functions, is, indeed a “pain in the butt”, so I crafted a little query to automate the job:

\pset format unaligned
\o /tmp/fix_search_path_for_security_definer_functions.sql
            -- inject SET search_path in-between LANGUAGE and SECURITY DEFINER in the declaration
                E'(LANGUAGE [a-z]+)\\s+(SECURITY DEFINER)',
                E'\\1\n SET search_path TO public, pg_temp\n \\2'
from pg_proc
where prosecdef is true -- SECURITY DEFINER functions
-- don't include functions for which we have already specified a search_path
and not (coalesce(array_to_string(proconfig,''),'') like '%search_path%')
-- public schema
and pronamespace = 2200
\i /tmp/fix_search_path_for_security_definer_functions.sql
-- If all goes well you should see a lot of CREATE FUNCTION being spammed on the screen

Fun with pg_catalog.pg_depend

Learning PostgreSQL and SQL in general probably begins with the concept of TABLES. Then probably VIEWS, INDEXES and maybe TRIGGERS.
Some users might not ever go any further, which is sad, because there is so much more to explore!

I thought it would be cool to automatically generate a graph showing the dependencies between objects and their types.
This shows the order in which the different types of objects can be created,
perhaps mentally useful to think in terms of “after we have created a TABLE we can create an INDEX”.

Something like this would be nice to include in the PostgreSQL documentation online.
I think it would be helpful when learning about PostgreSQL different object types.

The graph below was produced using GraphViz dot command and live-data from pg_catalog.pg_depend:


As we can see, before anything else, we need a SCHEMA, which is the root node.
Once we have a SCHEMA, we can create TABLES, TYPES, VIEWS, SEQUENCES and FUNCTIONS.
Some users might not even know about SCHEMAs, as the schema “public” is pre-installed.
To create an INDEX, we first need a TABLE.
Etc, etc, etc…

You might be surprised FUNCTION and LANGUAGE have arrows pointing in both directions.
Turns out you need some functions before you can create a language like plperl, such as plperl_call_handler.
The self-referencing arrow from/to FUNCTION is less surprising as some functions can of course call other functions.

(Not all object types are included in this graph as I’m not using them all in my system.)

::xml madness

Nobody likes XML, except masochists and Microsoft consultants who charge by the hour.
Sometimes you are forced to deal with XML anyway, such as parsing the response from external APIs.
Parsing XML is per se a nasty business to begin with, but in this particular example, the ugliness set new records.

The only “solution” I came up with is too ugly for production use, but the alternatives were even uglier, so I had no option.

I hope there is someone out there reading this who can present a proper solution to the problem.

Let’s say you have this XML:

<testxml xmlns:ns1="" xmlns:ns2="">

Using xpath() you extract the content of testxml:

SELECT xpath(
    '<testxml xmlns:ns1="" xmlns:ns2=""><ns1:foo><ns2:bar>baz</ns2:bar></ns1:foo></testxml>'::xml,
-- Result: <ns1:foo><ns2:bar>baz</ns2:bar></ns1:foo>

The returned XML is not valid since its missing the xmlns definitions,
but the PostgreSQL XML data type doesn’t complain, which is OK I guess,
a bit of quirks mode perhaps?

Because of the missing xmlns, it’s impossible to make use of this XML fragment returned.
You cannot extract any subsequent sub-elements in it using XPath.

For instance, this won’t work:

SELECT xpath(
        '<testxml xmlns:ns1="" xmlns:ns2=""><ns1:foo><ns2:bar>baz</ns2:bar></ns1:foo></testxml>'::xml,

-- Error:
-- ERROR:  could not create XPath object
-- DETAIL:  namespace error : Namespace prefix ns1 on foo is not defined
-- <ns1:foo>
--         ^
-- namespace error : Namespace prefix ns2 on bar is not defined
--   <ns2:bar>baz</ns2:bar>
--           ^
-- Undefined namespace prefix
-- xmlXPathCompiledEval: evaluation failed

Even if you pass it the NSArray in the outer xpath() call, you don’t get ‘baz’ but nothing at all.

SELECT xpath(
        '<testxml xmlns:ns1="" xmlns:ns2=""><ns1:foo><ns2:bar>baz</ns2:bar></ns1:foo></testxml>'::xml,

-- Returns:
--  xpath 
-- -------
--  {}
-- (1 row)
-- And NOT 'baz' which is what we want.

Therefore, given you have an XML “fragment” where the tags have ns:___ but without the xmlns:___=”URI” part,
is there any way to extract sub-elements, given you know the namespaces?

For instance, by wrapping the XML “fragment” inside an outer XML tag,
and specifying bogous xmlns elements for the namespaces,
I managed to hack together a work-around, probably with hundreds of flaws,
and this cannot possibly be the best way to approach this problem.

SELECT xpath_fragment('/ns1:foo/ns2:bar/text()','<ns1:foo><ns2:bar>baz</ns2:bar></ns1:foo>',ARRAY['ns1','ns2']);

(1 row)

Source code of insanely ugly “solution” to the “problem”:

CREATE OR REPLACE FUNCTION xpath_fragment(_XPath text, _XML xml, _NSNames text[]) RETURNS XML[] AS $BODY$
_ text;
_WrappedXML xml;
_NSArray text[][] := ARRAY[]::text[][];

SELECT ('<xml ' || array_to_string(array_agg('xmlns:' || unnest || '=" "'),' ') || '>' || _XML::text || '</xml>')::text INTO _WrappedXML FROM unnest(_NSNames);

FOR _ IN SELECT unnest(_NSNames)
    _NSArray := _NSArray || ARRAY[[_, ' ']];

RETURN xpath('/xml' || _XPath, _WrappedXML, _NSArray);

Ideas, anyone?

PL/pgSQL #1 annoyance

I cannot count the number of times I’ve written the same stupid code,
just to do something as simple as making sure the UPDATE
updates exactly one row.

Last time I complained about this in public was at our Christmas party at the office.
Normally nobody cares to listen to me when I’ve had a few beers and start talking about SQL stuff.
Next day I noticed a colleague had spent the hang-over-morning hacking on a patch.
What an awesome Christmas party I thought!

That was quite some time ago and my code still looks ugly, so let’s complain some more and hope it helps.

Number of ugly statements in our codebase:

$ grep -r -E "(IF NOT FOUND THEN)|(RETURNING (1|TRUE) INTO STRICT)" . | wc -l

In a typical OLTP application, each UPDATE typically affects exactly one row.
If more than one row or no row at all would match, you want to throw an exception, return false,
or what ever the proper behaviour might be in the situation.

This sounds like a trivial task, but it turns out there is no nice way of doing this in plpgsql.

Here are three different work-arounds to ensure the update went through:


Even if FOUND could mean more than one row was affected,
it’s safe to assume exactly one row was updated if you are updating
using a primary key or any other unique column.

UPDATE Orders SET Processed = 1 WHERE OrderID = _OrderID AND Processed = 0;
    RAISE no_data_found;


If you are a bit more paranoid, you can check the exact row count,
and throw different exceptions based on if no rows were found
or if more than one was found.

UPDATE Orders SET Processed = 1 WHERE OrderID = _OrderID AND Processed = 0;
IF _RowCount = 0 THEN
    RAISE no_data_found;
ELSIF _RowCount > 1 THEN
    RAISE too_many_rows;


This is the one I prefer. It guarantees only one row was affected,
and is at the same time a one-liner, or two lines if you could
the DECLARE of the boolean variable.

_OK boolean;
UPDATE Orders SET Processed = 1 WHERE OrderID = _OrderID AND Processed = 0 RETURNING TRUE INTO STRICT _OK;

However, none of them are pretty. Something as simple and frequent should be in the syntax of a language.

The proposed syntax in johto’s patch was:

UPDATE STRICT Orders SET Processed = 1 WHERE OrderID = _OrderID AND Processed = 0;

Tom Lane suggested to put the STRICT keyword in the beginning instead,
avoiding conflicts with table names, which sounds like a good idea to me.

STRICT UPDATE Orders SET Processed = 1 WHERE OrderID = _OrderID AND Processed = 0;

I really hope the community will find a way to fix this syntax bug,
without causing to much pain in terms of backwards compatibility issues for users.

I remember when we upgraded from 8.4 to 9.1, I struggled changing all code where
IN/OUT parameters were in conflict with column names,
as the behaviour changed to raise an error.

Users who didn’t want to invest time fixing their code could simply set the setting plpgsql.variable_conflict in postgresql.conf to get back the old behaviour.

I think this was a very user-friendly and nice way to deal with the compatibility issue.

If the STRICT syntax isn’t possible because of hypothetical compatibility issues,
perhaps it could be up to the user to turn it on or off, using a similar setting?

I would even like to go a bit further and suggest making the update of exactly one row the default in PL/pgSQL.
In the typically fewer cases where you want to update more than one row, could explicitly specify it with some other keyword,
such as ALL UPDATE or whatever. This would make the code much cleaner, as you wouldn’t need to type STRICT
all over your code.

Securing PostgreSQL using hostssl cert clientcert=1

This is not the first time I’ve struggled getting SSL certificate validation to work,
so I thought this time I better write down how I did to avoid future time-waste.

For security and convenience reasons, I want to do the signing of client certificates
on a separate dedicated machine, also known as certificate authority (CA).

This allows us to grant new clients access without having to login to the PostgreSQL
server signing certs or modifying pg_hba.conf.

We will create a special database group called sslcertusers.
All users in this group will be able to connect provided they have a
client certificate signed by the CA.

In the example below, replace “trustly” with the name of your company/organisation.

The instructions are for Ubuntu Linux 12.04 LTS.

Setup CA

The CA should be an offline computer locked in a safe.

Generate CA private key

sudo openssl genrsa -des3 -out /etc/ssl/private/trustly-ca.key 2048
sudo chown root:ssl-cert /etc/ssl/private/trustly-ca.key
sudo chmod 640 /etc/ssl/private/trustly-ca.key

Generate CA public certificate

sudo openssl req -new -x509 -days 3650 \
-subj '/C=SE/ST=Stockholm/L=Stockholm/O=Trustly/CN=trustly' \
-key /etc/ssl/private/trustly-ca.key \
-out /usr/local/share/ca-certificates/trustly-ca.crt
sudo update-ca-certificates

Configure PostgreSQL-server

Generate PostgreSQL-server private key

# Remove default snakeoil certs
sudo rm /var/lib/postgresql/9.1/main/server.key
sudo rm /var/lib/postgresql/9.1/main/server.crt
# Enter a passphrase
sudo -u postgres openssl genrsa -des3 -out /var/lib/postgresql/9.1/main/server.key 2048
# Remove the passphrase
sudo -u postgres openssl rsa -in /var/lib/postgresql/9.1/main/server.key -out /var/lib/postgresql/9.1/main/server.key
sudo -u postgres chmod 400 /var/lib/postgresql/9.1/main/server.key

Request CA to sign PostgreSQL-server key

sudo -u postgres openssl req -new -nodes -key /var/lib/postgresql/9.1/main/server.key -days 3650 -out /tmp/server.csr -subj '/C=SE/ST=Stockholm/L=Stockholm/O=Trustly/CN=postgres'

Sign PostgreSQL-server key with CA private key

sudo openssl x509 -days 3650 \
-req -in /tmp/server.csr \
-CA /usr/local/share/ca-certificates/trustly-ca.crt \
-CAkey /etc/ssl/private/trustly-ca.key -CAcreateserial \
-out /var/lib/postgresql/9.1/main/server.crt
sudo chown postgres:postgres /var/lib/postgresql/9.1/main/server.crt

Create root cert = PostgreSQL-server cert + CA cert

sudo -u postgres sh -c 'cat /var/lib/postgresql/9.1/main/server.crt /etc/ssl/certs/trustly-ca.pem &gt; /var/lib/postgresql/9.1/main/root.crt'
sudo cp /var/lib/postgresql/9.1/main/root.crt /usr/local/share/ca-certificates/trustly-postgresql.crt
sudo update-ca-certificates

Grant access

CREATE GROUP sslcertusers;
ALTER GROUP sslcertusers ADD USER joel;
# /etc/postgresql/9.1/main/pg_hba.conf:
hostssl nameofdatabase +sslcertusers cert clientcert=1

Restart PostgreSQL

sudo service postgresql restart


Copy root cert from PostgreSQL-server

mkdir ~/.postgresql
cp /etc/ssl/certs/trustly-postgresql.pem ~/.postgresql/root.crt

Generate PostgreSQL-client private key

openssl genrsa -des3 -out ~/.postgresql/postgresql.key 1024

# If this is a server, remove the passphrase:
openssl rsa -in ~/.postgresql/postgresql.key -out ~/.postgresql/postgresql.key

Request CA to sign PostgreSQL-client key

# Replace "joel" with username:
openssl req -new -key ~/.postgresql/postgresql.key -out ~/.postgresql/postgresql.csr -subj '/C=SE/ST=Stockholm/L=Stockholm/O=Trustly/CN=joel'
sudo openssl x509 -days 3650 -req -in ~/.postgresql/postgresql.csr -CA /etc/ssl/certs/trustly-ca.pem -CAkey /etc/ssl/private/trustly-ca.key -out ~/.postgresql/postgresql.crt -CAcreateserial
sudo chown joel:joel -R ~/.postgresql
sudo chmod 400 -R ~/.postgresql/postgresql.key


The following files are created/modififed on each machine:


/etc/ssl/certs/trustly-ca.pem -&gt; /usr/local/share/ca-certificates/trustly-ca.crt


/etc/ssl/certs/trustly-ca.pem -&gt; /usr/local/share/ca-certificates/trustly-ca.crt
/etc/ssl/certs/trustly-postgresql.pem -&gt; /usr/local/share/ca-certificates/trustly-postgresql.crt



Secure deployment of PostgreSQL functions

Two people might be simultaneously modifying the same database function. To protect against this scenario, I created a little tool to stage the source code of all functions in pg_proc and the definition of all views. You then deploy your changes, and afterwards you call fdiff() to show a diff of the changes made, by comparing the new content of pg_proc and the definitions of all views. If the changes are what you expected, you go ahead and commit.

We’ve been using this methodology every day in both my current company and my previous one, and it has prevented us from doing mistakes.

Source code and installation instructions:

This is how to use it:

  1. Start a new transaction
  2. BEGIN;

  3. Temp store state of all views/functions
  4. SELECT fstage();

  5. Deploy changes to schema

  7. Show a diff of changes made
  8. SELECT fdiff();

  9. If the changes are expected, go ahead and commit
  10. COMMIT;

Example on the output from fdiff():

| Updated or replaced functions |

Schema................: public
Name..................: process_order
Argument data types...: _orderid bigint
Result data type......: boolean
Language..............: plpgsql
Type..................: normal
Volatility............: STABLE
Owner.................: amazon
20 c OR Orders.ShippingDate IS NOT NULL
20 c OR Orders.ShippingDate > now() - interval '2 month'

Hope you’ll like it!