Pgcronjob: Cron daemon to run user defined functions

I find myself quite often having to split a large write operation such as updating millions of rows in a heavily used table. This usually boils down to writing a small Perl script that runs a small one-off function that performs the task by updating a limited numbers of rows per run, and then committing in between to avoid a long running transaction.

This time I needed to do this I decided to not write yet another Perl script and to instead do something to improve the situation. I’ve not been able to find any cron-ish tools at all to run database functions, except for pgAgent, but that looks more like a client-side daemon, I wanted something bare-minimum that can run server-side and piggy-back on the OS cron.

A few hours later and 382 lines of code later, the project was finished and pushed to Github:

createuser pgcronjob
psql -f install.sql
crontab pgcronjob.crontab

CREATE OR REPLACE FUNCTION public.CronJob_Function_Template_Skeleton()
RETURNS batchjobstate
LANGUAGE plpgsql
SET search_path TO public, pg_temp
RAISE NOTICE 'Hello world!';
PERFORM pg_sleep(random());
RAISE NOTICE 'Slept for a while.';
IF random() < 0.5 THEN
 -- Tell CronJob() we have more work to do and we want it to run us again in due time
 RAISE NOTICE 'See you again!';
ELSIF random() < 0.5 THEN
 -- Throw error to CronJob() to test errors
 RAISE EXCEPTION 'Simulate error in CronJob function';
 -- Tell CronJob() we're done and we don't want it to run us ever again
 RAISE NOTICE 'Bye world!';

GRANT EXECUTE ON FUNCTION public.CronJob_Function_Template_Skeleton() TO pgcronjob;

SELECT CronJob_Register('public','cronjob_function_template_skeleton');


Hopefully this will be useful for others as well. It would be fun to get some feedback.

The initial commit even comes with a nice README and install and uninstall scripts.

Extract from the README:

The settings are conditions that must all be TRUE for the cronjob to run, i.e. they are AND'd together.

Always NOT NULL:
- Enabled boolean NOT NULL DEFAULT TRUE: Controls whether the cronjob is enabled or not.
- RunEvenIfOthersAreWaiting boolean NOT NULL DEFAULT FALSE: Controls whether to run the cronjob or not if there are other waiting db txns (pg_stat_activity.waiting).
- RetryOnError boolean NOT NULL DEFAULT FALSE: Controls whether to run the cronjob ever again if the user-defined function would throw an error.

Can be NULL (which means setting is ignored):
- RunAfterTimestamp timestamptz: Run only after the specified timestamp.
- RunUntilTimestamp timestamptz: Run only until the specified timestamp.
- RunAfterTime time: Run only after the specified time of the day.
- RunBeforeTime time: Run only until the specified time of the day.
- RunInterval interval: Run only after having slept for this interval after the last run started.
- SleepInterval interval: Run only after having slept for this interval after the last run finished.

Finding missing foreign keys

By coincidence I stumbled upon a table where one of its columns didn’t have a foreign key. I found this strange, since our convention is to always add foreign keys when appropriate.

But humans are humans, and will eventually forget to add a foreign key, so we better have both belt and suspenders and find a way to deal with this inevitable problem in an efficient way.

It would be a tedious job to manually look for missing foreign keys in all tables and columns.
But if you’re lucky enough to have the pleasure of working with a system where all tables and columns have been given their names by following a strict naming convention, it might be possible to fully automate the task.

This is the namning convention we use:

  1. Table names are always in plural form with a tailing “s”, e.g. users
  2. Primary key column names are always equal to the table name in singular form with a tailing id, e.g. userid.
  3. Foreign key columns are always equal to the primary key they are referecning, e.g. transactions.userid -> users.userid

This means you always know based on the column name alone, what table if any that column might be referencing.

Example: If table transactions have a column named userid, and there is a table where userid is also the primary key, but if there isn’t any foreign key on transactions.userid, then it’s a missing foreign key, or else someone has not followed the namning convention.

Thanks to PostgreSQL’s fantastic pg_catalog system tables, we can write a query which uses the rules from the naming convention and returns all the columns which appears to be missing foreign keys. It doesn’t support multi-column keys, but we don’t have many of those, so it’s not a problem in my case.

Thanks to the view below, I automatically found three more missing foreign keys of the same type, which saves me many hours of boring work today.

pg1:joel=#* SELECT * FROM view_missing_foreign_keys;
 nspname |      relname    | attname 
 public  | transactions    | userid
 public  | someohtertable1 | userid
 public  | someothertable2 | userid
(5 rows)

I’m posting the view I wrote here in hope it might be useful for others with a similar naming convention, and/or to inspire others to include the table name in their column names used as primary/foreign keys.

CREATE OR REPLACE VIEW view_missing_foreign_keys AS
FROM pg_catalog.pg_namespace
INNER JOIN pg_catalog.pg_class ON (pg_catalog.pg_class.relnamespace = pg_catalog.pg_namespace.oid)
INNER JOIN pg_catalog.pg_attribute ON (pg_catalog.pg_attribute.attrelid = pg_catalog.pg_class.oid)
WHERE pg_catalog.pg_class.relkind = 'r'
AND pg_catalog.pg_attribute.attnum > 0
AND NOT pg_catalog.pg_attribute.attisdropped
AND pg_catalog.pg_namespace.nspname NOT IN ('pg_toast','information_schema','pg_catalog')
AND pg_catalog.pg_attribute.attname LIKE '%id'
    -- The column is PRIMARY KEY in some table
    SELECT 1 FROM pg_catalog.pg_constraint
    WHERE pg_catalog.pg_constraint.contype = 'p'
    AND pg_catalog.pg_get_constraintdef(pg_catalog.pg_constraint.oid) = format('PRIMARY KEY (%s)',pg_catalog.pg_attribute.attname)
    -- There is no FOREIGN KEY on this column
    SELECT 1 FROM pg_catalog.pg_constraint
    WHERE pg_catalog.pg_constraint.contype = 'f'
    AND pg_catalog.pg_constraint.conrelid = pg_catalog.pg_class.oid
    AND pg_catalog.pg_get_constraintdef(pg_catalog.pg_constraint.oid) LIKE (format('FOREIGN KEY (%s)',pg_catalog.pg_attribute.attname) || '%')
    -- This column is not the PRIMARY KEY of it's own table,
    -- since if it was, we wouldn't require a FOREIGN KEY on it
    SELECT 1 FROM pg_catalog.pg_constraint
    WHERE pg_catalog.pg_constraint.contype = 'p'
    AND pg_catalog.pg_constraint.conrelid = pg_catalog.pg_class.oid
    AND pg_catalog.pg_get_constraintdef(pg_catalog.pg_constraint.oid) = format('PRIMARY KEY (%s)',pg_catalog.pg_attribute.attname)

“How we use PostgreSQL at Trustly” at PGConf.EU

Slides from my talk, How we use PostgreSQL at Trustly, are now available.

psql \watch 1400000000 epoch time countdown counter

\pset fieldsep ' '
    (('epoch'::timestamptz + 14*10^8 * '1 s'::interval)-now())::interval(0),
    (14*10^8-extract(epoch from now()))::int,
    extract(epoch from now())::int
\watch 1

09:18:28 33508 1399966492
09:18:27 33507 1399966493
09:18:26 33506 1399966494
09:18:25 33505 1399966495
09:18:24 33504 1399966496
09:18:23 33503 1399966497

Garbage Collection of Unused PostgreSQL Tables and Columns

Over the last five years, our database at Trustly have kept growing in number of tables, columns and functions, where some of the tables and columns aren’t being used by any database functions any longer. Getting rid of them is important, as otherwise people working with the database will be confused and annoyed. Database developers should always be able to rely on the data model being relevant and up to date.

In our system, no applications access the database tables directly, instead everything goes through stored procedures.

This means, if a table or column name is not present anywhere in any function’s source code, it’s very likely the table/column is not being used by anything. The only exception is if you have dynamically crafted queries executed using EXECUTE, where the table/column names are constructed from different parts. In our system, we thankfully only have a few such cases.

FROM pg_catalog.pg_class
INNER JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
WHERE pg_catalog.pg_class.relkind = 'r'
AND pg_catalog.pg_namespace.nspname NOT IN ('information_schema','pg_catalog')
) Tables(TableName)
SELECT 1 FROM pg_catalog.pg_proc
WHERE pg_catalog.pg_proc.prosrc ~* Tables.TableName
ORDER BY Tables.TableName

This query returned quite a lot of table names with about half of them being false positives,
but still a managable list to go through manually.

50 minutes of manual work later:

92 files changed, 1114 deletions(-)
False positives: 14

Then I moved on to the task of finding unused table columns.
The query below excludes any false positives found in the previous query.

    FROM pg_catalog.pg_class
    INNER JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
    INNER JOIN pg_catalog.pg_attribute ON pg_catalog.pg_attribute.attrelid = pg_catalog.pg_class.oid
    WHERE pg_catalog.pg_class.relkind = 'r'
    AND pg_catalog.pg_attribute.attnum > 0
    AND NOT pg_catalog.pg_attribute.attisdropped
    AND pg_catalog.pg_namespace.nspname NOT IN ('information_schema','pg_catalog')
    -- Exclude columns in tables we know are unused by the stored procedures, but we want to keep around anyway:
    AND pg_catalog.pg_class.relname !~* '^(alexacountrie|bankersworldonline|bindgroup|clearinghousecurrencie|dailystat|geoiporganization|hourlystat|usercommitment|polishbank|polishbanknumber|swedishpostalcode|testbasedata|testperlmodule|useraccesslogarchive)'
) Columns(ColumnName)
    SELECT 1 FROM pg_catalog.pg_proc
    WHERE pg_catalog.pg_proc.prosrc ~* Columns.ColumnName
ORDER BY Columns.ColumnName

It took me two hours to go through all code. It was an interesting journey in time with lots of memories.

50 files changed, 5396 insertions(+), 5917 deletions(-)
False positives: 87

The reason why there were so many insertions and deletions, was because the dropped columns affected some of the base tables with reference data, which had to be regenerated, thus affecting all lines in those files.

In summary, the ROI on those three hours of time invested is enormous. Developers can now feel confident all tables and columns fulfill a purpose in the system. This exercise will of course need to be repeated in the future though.

SET search_path for all SECURITY DEFINER functions

As pointed out by Peter Eisentraut in a blog post named
Schema Search Paths Considered Pain in the Butt, you need to make sure the search_path is explicitly set for all SECURITY DEFINER functions in PostgreSQL.

Fixing this manually for, in my case, 2106 functions, is, indeed a “pain in the butt”, so I crafted a little query to automate the job:

\pset format unaligned
\o /tmp/fix_search_path_for_security_definer_functions.sql
            -- inject SET search_path in-between LANGUAGE and SECURITY DEFINER in the declaration
                E'(LANGUAGE [a-z]+)\\s+(SECURITY DEFINER)',
                E'\\1\n SET search_path TO public, pg_temp\n \\2'
from pg_proc
where prosecdef is true -- SECURITY DEFINER functions
-- don't include functions for which we have already specified a search_path
and not (coalesce(array_to_string(proconfig,''),'') like '%search_path%')
-- public schema
and pronamespace = 2200
\i /tmp/fix_search_path_for_security_definer_functions.sql
-- If all goes well you should see a lot of CREATE FUNCTION being spammed on the screen

Fun with pg_catalog.pg_depend

Learning PostgreSQL and SQL in general probably begins with the concept of TABLES. Then probably VIEWS, INDEXES and maybe TRIGGERS.
Some users might not ever go any further, which is sad, because there is so much more to explore!

I thought it would be cool to automatically generate a graph showing the dependencies between objects and their types.
This shows the order in which the different types of objects can be created,
perhaps mentally useful to think in terms of “after we have created a TABLE we can create an INDEX”.

Something like this would be nice to include in the PostgreSQL documentation online.
I think it would be helpful when learning about PostgreSQL different object types.

The graph below was produced using GraphViz dot command and live-data from pg_catalog.pg_depend:


As we can see, before anything else, we need a SCHEMA, which is the root node.
Once we have a SCHEMA, we can create TABLES, TYPES, VIEWS, SEQUENCES and FUNCTIONS.
Some users might not even know about SCHEMAs, as the schema “public” is pre-installed.
To create an INDEX, we first need a TABLE.
Etc, etc, etc…

You might be surprised FUNCTION and LANGUAGE have arrows pointing in both directions.
Turns out you need some functions before you can create a language like plperl, such as plperl_call_handler.
The self-referencing arrow from/to FUNCTION is less surprising as some functions can of course call other functions.

(Not all object types are included in this graph as I’m not using them all in my system.)