SET search_path for all SECURITY DEFINER functions

As pointed out by Peter Eisentraut in a blog post named
Schema Search Paths Considered Pain in the Butt, you need to make sure the search_path is explicitly set for all SECURITY DEFINER functions in PostgreSQL.

Fixing this manually for, in my case, 2106 functions is indeed a “pain in the butt”, so I crafted a little query to automate the job:

\t
\pset format unaligned
\o /tmp/fix_search_path_for_security_definer_functions.sql
select
    array_to_string(
        array_agg(
            -- inject SET search_path in-between LANGUAGE and SECURITY DEFINER in the declaration
            regexp_replace(
                pg_get_functiondef(oid),
                E'(LANGUAGE [a-z]+)\\s+(SECURITY DEFINER)',
                E'\\1\n SET search_path TO public, pg_temp\n \\2'
            )
        ),
        ';'
    )
from pg_proc
where prosecdef is true -- SECURITY DEFINER functions
-- don't include functions for which we have already specified a search_path
and not (coalesce(array_to_string(proconfig,''),'') like '%search_path%')
-- public schema
and pronamespace = 2200
;
\t
\o
\i /tmp/fix_search_path_for_security_definer_functions.sql
-- If all goes well you should see a lot of CREATE FUNCTION being spammed on the screen
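
To illustrate what the generated script produces, here is roughly what a patched definition ends up looking like (my_secdef_func and its body are made up for the example; the only thing the script changes is the injected SET line between LANGUAGE and SECURITY DEFINER):

CREATE OR REPLACE FUNCTION public.my_secdef_func(_orderid bigint)
 RETURNS boolean
 LANGUAGE plpgsql
 SET search_path TO public, pg_temp
 SECURITY DEFINER
AS $function$
BEGIN
    -- unqualified names now resolve against public and pg_temp only
    RETURN true;
END;
$function$;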

Fun with pg_catalog.pg_depend

Learning PostgreSQL and SQL in general probably begins with the concept of TABLES. Then probably VIEWS, INDEXES and maybe TRIGGERS.
Some users might not ever go any further, which is sad, because there is so much more to explore!

I thought it would be cool to automatically generate a graph showing the dependencies between objects and their types.
This shows the order in which the different types of objects can be created,
which is perhaps a useful mental model: “after we have created a TABLE we can create an INDEX”.

Something like this would be nice to include in the PostgreSQL documentation online.
I think it would be helpful when learning about PostgreSQL's different object types.

The graph below was produced using the GraphViz dot command and live data from pg_catalog.pg_depend:

[Graph (pg_depend_dot): dependencies between PostgreSQL object types, generated from pg_catalog.pg_depend]

As we can see, before anything else, we need a SCHEMA, which is the root node.
Once we have a SCHEMA, we can create TABLES, TYPES, VIEWS, SEQUENCES and FUNCTIONS.
Some users might not even know about SCHEMAs, as the schema “public” is pre-installed.
To create an INDEX, we first need a TABLE.
Etc, etc, etc…

You might be surprised FUNCTION and LANGUAGE have arrows pointing in both directions.
It turns out you need some functions, such as plperl_call_handler, before you can create a language like plperl.
The self-referencing arrow from/to FUNCTION is less surprising as some functions can of course call other functions.

(Not all object types are included in this graph as I’m not using them all in my system.)
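
This is not the exact query behind the graph, but a rough sketch of how the raw material can be produced. It is simplified: it draws edges between the system catalogs the objects belong to (pg_class, pg_proc, pg_namespace, …) rather than between fine-grained object types; distinguishing e.g. TABLE from INDEX would additionally require looking at pg_class.relkind. File names are arbitrary:

\t
\pset format unaligned
\o /tmp/pg_depend_edges.txt
select distinct
    format('    "%s" -> "%s";', refclassid::regclass, classid::regclass)
from pg_depend
where deptype = 'n' -- normal dependencies only
;
\o
\t
-- Wrap the output in "digraph pg_depend { ... }" and render it with:
--   dot -Tpng pg_depend.dot -o pg_depend.png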

::xml madness

Nobody likes XML, except masochists and Microsoft consultants who charge by the hour.
Sometimes you are forced to deal with XML anyway, such as parsing the response from external APIs.
Parsing XML is a nasty business to begin with, but in this particular example the ugliness set new records.

The only “solution” I came up with is too ugly for production use, but the alternatives were even uglier, so I had no other option.

I hope there is someone out there reading this who can present a proper solution to the problem.

Let’s say you have this XML:

<testxml xmlns:ns1="http://www.example.com" xmlns:ns2="http://www.example2.com">
    <ns1:foo>
        <ns2:bar>baz</ns2:bar>
    </ns1:foo>
</testxml>

Using xpath() you extract the content of testxml:

SELECT xpath(
    '/testxml/ns1:foo',
    '<testxml xmlns:ns1="http://www.example.com" xmlns:ns2="http://www.example2.com"><ns1:foo><ns2:bar>baz</ns2:bar></ns1:foo></testxml>'::xml,
    ARRAY[
        ['ns1','http://www.example.com'],
        ['ns2','http://www.example2.com']
    ]
);
-- Result: <ns1:foo><ns2:bar>baz</ns2:bar></ns1:foo>

The returned XML is not valid since it's missing the xmlns definitions,
but the PostgreSQL XML data type doesn’t complain, which is OK I guess,
a bit of quirks mode perhaps?

Because of the missing xmlns declarations, it's impossible to make use of the returned XML fragment.
You cannot extract any subsequent sub-elements in it using XPath.

For instance, this won’t work:

SELECT xpath(
    '/ns1:foo/ns2:bar/text()',
    (xpath(
        '/testxml/ns1:foo',
        '<testxml xmlns:ns1="http://www.example.com" xmlns:ns2="http://www.example2.com"><ns1:foo><ns2:bar>baz</ns2:bar></ns1:foo></testxml>'::xml,
        ARRAY[
            ['ns1','http://www.example.com'],
            ['ns2','http://www.example2.com']
        ]
    ))[1]
);

-- Error:
-- ERROR:  could not create XPath object
-- DETAIL:  namespace error : Namespace prefix ns1 on foo is not defined
-- <ns1:foo>
--         ^
-- namespace error : Namespace prefix ns2 on bar is not defined
--   <ns2:bar>baz</ns2:bar>
--           ^
-- Undefined namespace prefix
-- xmlXPathCompiledEval: evaluation failed

Even if you pass the namespace array to the outer xpath() call as well, you don't get ‘baz’ but an empty result:

SELECT xpath(
    '/ns1:foo/ns2:bar/text()',
    (xpath(
        '/testxml/ns1:foo',
        '<testxml xmlns:ns1="http://www.example.com" xmlns:ns2="http://www.example2.com"><ns1:foo><ns2:bar>baz</ns2:bar></ns1:foo></testxml>'::xml,
        ARRAY[
            ['ns1','http://www.example.com'],
            ['ns2','http://www.example2.com']
        ]
    ))[1],
    ARRAY[
        ['ns1','http://www.example.com'],
        ['ns2','http://www.example2.com']
    ]
);

-- Returns:
--  xpath 
-- -------
--  {}
-- (1 row)
-- And NOT 'baz' which is what we want.

Therefore, given an XML “fragment” whose tags carry ns:___ prefixes but lack the xmlns:___="URI" declarations,
is there any way to extract its sub-elements, provided you know the namespaces?

By wrapping the XML “fragment” inside an outer XML tag
and specifying bogus xmlns attributes for the namespaces,
I managed to hack together a work-around, probably with hundreds of flaws,
and this cannot possibly be the best way to approach the problem.

SELECT xpath_fragment('/ns1:foo/ns2:bar/text()','<ns1:foo><ns2:bar>baz</ns2:bar></ns1:foo>',ARRAY['ns1','ns2']);

 xpath_fragment 
----------------
 {baz}
(1 row)

Source code of insanely ugly “solution” to the “problem”:

CREATE OR REPLACE FUNCTION xpath_fragment(_XPath text, _XML xml, _NSNames text[]) RETURNS XML[] AS $BODY$
DECLARE
_ text;
_WrappedXML xml;
_NSArray text[][] := ARRAY[]::text[][];
BEGIN

-- Wrap the fragment in an outer <xml> element that declares every namespace
-- prefix with a bogus (blank) URI, so the prefixes are accepted.
SELECT ('<xml ' || array_to_string(array_agg('xmlns:' || unnest || '=" "'),' ') || '>' || _XML::text || '</xml>')::text
INTO _WrappedXML
FROM unnest(_NSNames);

-- Build the namespace array expected by xpath(), mapping each prefix to the
-- same bogus URI used above.
FOR _ IN SELECT unnest(_NSNames)
LOOP
    _NSArray := _NSArray || ARRAY[[_, ' ']];
END LOOP;

-- Prepend /xml to the caller's XPath to account for the wrapper element.
RETURN xpath('/xml' || _XPath, _WrappedXML, _NSArray);
END;
$BODY$ LANGUAGE plpgsql IMMUTABLE;
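
To tie the work-around back to the original problem, the fragment returned by the first xpath() call can be fed straight into xpath_fragment(). Assuming the function behaves as in the example above, this should also return {baz}:

SELECT xpath_fragment(
    '/ns1:foo/ns2:bar/text()',
    (xpath(
        '/testxml/ns1:foo',
        '<testxml xmlns:ns1="http://www.example.com" xmlns:ns2="http://www.example2.com"><ns1:foo><ns2:bar>baz</ns2:bar></ns1:foo></testxml>'::xml,
        ARRAY[
            ['ns1','http://www.example.com'],
            ['ns2','http://www.example2.com']
        ]
    ))[1],
    ARRAY['ns1','ns2']
);
-- Should return {baz}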

Ideas, anyone?

PL/pgSQL #1 annoyance

I cannot count the number of times I’ve written the same stupid code,
just to do something as simple as making sure the UPDATE
updates exactly one row.

Last time I complained about this in public was at our Christmas party at the office.
Normally nobody cares to listen to me when I’ve had a few beers and start talking about SQL stuff.
Next day I noticed a colleague had spent the hang-over-morning hacking on a patch.
What an awesome Christmas party I thought!

That was quite some time ago and my code still looks ugly, so let’s complain some more and hope it helps.

Number of ugly statements in our codebase:

$ grep -r -E "(IF NOT FOUND THEN)|(RETURNING (1|TRUE) INTO STRICT)" . | wc -l
701

In a typical OLTP application, each UPDATE affects exactly one row.
If more than one row or no row at all would match, you want to throw an exception, return false,
or whatever the proper behaviour might be in the situation.

This sounds like a trivial task, but it turns out there is no nice way of doing this in plpgsql.

Here are three different work-arounds to ensure the update went through:

IF NOT FOUND THEN

Even though FOUND would also be true if more than one row was affected,
it's safe to assume exactly one row was updated if you are filtering
on a primary key or any other unique column.

UPDATE Orders SET Processed = 1 WHERE OrderID = _OrderID AND Processed = 0;
IF NOT FOUND THEN
    RAISE no_data_found;
END IF;

GET DIAGNOSTICS integer_var = ROW_COUNT;

If you are a bit more paranoid, you can check the exact row count,
and throw different exceptions depending on whether no rows were found
or more than one was found.

UPDATE Orders SET Processed = 1 WHERE OrderID = _OrderID AND Processed = 0;
GET DIAGNOSTICS _RowCount = ROW_COUNT;
IF _RowCount = 0 THEN
    RAISE no_data_found;
ELSIF _RowCount > 1 THEN
    RAISE too_many_rows;
END IF;

RETURNING TRUE INTO STRICT _OK

This is the one I prefer. It guarantees only one row was affected,
and is at the same time a one-liner, or two lines if you count
the DECLARE of the boolean variable.

DECLARE
_OK boolean;
...
UPDATE Orders SET Processed = 1 WHERE OrderID = _OrderID AND Processed = 0 RETURNING TRUE INTO STRICT _OK;
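
For context, a complete function using this pattern might look like the sketch below (set_order_processed is a hypothetical name; the table and columns are the ones from the earlier examples). The STRICT keyword makes the INTO raise no_data_found if zero rows were updated and too_many_rows if more than one was:

CREATE OR REPLACE FUNCTION set_order_processed(_OrderID bigint) RETURNS boolean AS $BODY$
DECLARE
_OK boolean;
BEGIN
-- Exactly one row must be updated, otherwise STRICT raises an exception
UPDATE Orders SET Processed = 1 WHERE OrderID = _OrderID AND Processed = 0 RETURNING TRUE INTO STRICT _OK;
RETURN _OK;
END;
$BODY$ LANGUAGE plpgsql;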

However, none of them is pretty. Something this simple and frequent should be supported directly by the language's syntax.

The proposed syntax in johto’s patch was:

UPDATE STRICT Orders SET Processed = 1 WHERE OrderID = _OrderID AND Processed = 0;

Tom Lane suggested putting the STRICT keyword at the beginning instead,
avoiding conflicts with table names, which sounds like a good idea to me.

STRICT UPDATE Orders SET Processed = 1 WHERE OrderID = _OrderID AND Processed = 0;

I really hope the community will find a way to fix this syntax bug,
without causing too much pain in terms of backwards compatibility issues for users.

I remember when we upgraded from 8.4 to 9.1: I struggled to change all the code where
IN/OUT parameters conflicted with column names,
as the behaviour had changed to raise an error.

Users who didn’t want to invest time fixing their code could simply set the setting plpgsql.variable_conflict in postgresql.conf to get back the old behaviour.
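
For reference, the setting looks roughly like this; use_variable restores the pre-9.0 behaviour where the variable wins, error has been the default since 9.0, and use_column is the third option. The same thing can also be set per function with a #variable_conflict directive:

# postgresql.conf
plpgsql.variable_conflict = use_variable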

I think this was a very user-friendly and nice way to deal with the compatibility issue.

If the STRICT syntax isn’t possible because of hypothetical compatibility issues,
perhaps it could be up to the user to turn it on or off, using a similar setting?

I would even like to go a bit further and suggest making the update of exactly one row the default in PL/pgSQL.
In the less common cases where you want to update more than one row, you could explicitly specify it with some other keyword,
such as ALL UPDATE or whatever. This would make the code much cleaner, as you wouldn't need to type STRICT
all over your code.
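
To illustrate the idea with purely hypothetical syntax (nothing like this exists in PostgreSQL today):

-- exactly one row expected, the proposed default:
UPDATE Orders SET Processed = 1 WHERE OrderID = _OrderID AND Processed = 0;
-- explicitly allow any number of rows:
ALL UPDATE Orders SET Processed = 1 WHERE Processed = 0;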

Securing PostgreSQL using hostssl cert clientcert=1

This is not the first time I've struggled to get SSL certificate validation to work,
so I thought that this time I had better write down how I did it, to avoid wasting time in the future.

For security and convenience reasons, I want to do the signing of client certificates
on a separate, dedicated machine, also known as a certificate authority (CA).

This allows us to grant new clients access without having to log in to the PostgreSQL
server to sign certs or modify pg_hba.conf.

We will create a special database group called sslcertusers.
All users in this group will be able to connect provided they have a
client certificate signed by the CA.

In the example below, replace “trustly” with the name of your company/organisation.

The instructions are for Ubuntu Linux 12.04 LTS.

Setup CA

The CA should be an offline computer locked in a safe.

Generate CA private key

sudo openssl genrsa -des3 -out /etc/ssl/private/trustly-ca.key 2048
sudo chown root:ssl-cert /etc/ssl/private/trustly-ca.key
sudo chmod 640 /etc/ssl/private/trustly-ca.key

Generate CA public certificate

sudo openssl req -new -x509 -days 3650 \
-subj '/C=SE/ST=Stockholm/L=Stockholm/O=Trustly/CN=trustly' \
-key /etc/ssl/private/trustly-ca.key \
-out /usr/local/share/ca-certificates/trustly-ca.crt
sudo update-ca-certificates

Configure PostgreSQL-server

Generate PostgreSQL-server private key

# Remove default snakeoil certs
sudo rm /var/lib/postgresql/9.1/main/server.key
sudo rm /var/lib/postgresql/9.1/main/server.crt
# Enter a passphrase
sudo -u postgres openssl genrsa -des3 -out /var/lib/postgresql/9.1/main/server.key 2048
# Remove the passphrase
sudo -u postgres openssl rsa -in /var/lib/postgresql/9.1/main/server.key -out /var/lib/postgresql/9.1/main/server.key
sudo -u postgres chmod 400 /var/lib/postgresql/9.1/main/server.key

Request CA to sign PostgreSQL-server key

sudo -u postgres openssl req -new -nodes -key /var/lib/postgresql/9.1/main/server.key -days 3650 -out /tmp/server.csr -subj '/C=SE/ST=Stockholm/L=Stockholm/O=Trustly/CN=postgres'

Sign PostgreSQL-server key with CA private key

sudo openssl req -x509 \
-key /etc/ssl/private/trustly-ca.key \
-in /tmp/server.csr \
-out /var/lib/postgresql/9.1/main/server.crt
sudo chown postgres:postgres /var/lib/postgresql/9.1/main/server.crt

Create root cert = PostgreSQL-server cert + CA cert

sudo -u postgres sh -c 'cat /var/lib/postgresql/9.1/main/server.crt /etc/ssl/certs/trustly-ca.pem > /var/lib/postgresql/9.1/main/root.crt'
sudo cp /var/lib/postgresql/9.1/main/root.crt /usr/local/share/ca-certificates/trustly-postgresql.crt
sudo update-ca-certificates

Grant access

CREATE GROUP sslcertusers;
ALTER GROUP sslcertusers ADD USER joel;
# /etc/postgresql/9.1/main/pg_hba.conf:
hostssl nameofdatabase +sslcertusers 192.168.1.0/24 cert clientcert=1

Restart PostgreSQL

sudo service postgresql restart
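
Once the server is back up, a quick sanity check from any psql session is to confirm that SSL is actually enabled (this assumes ssl = on in postgresql.conf, which the Ubuntu packages set by default but which is not covered by the steps above):

SHOW ssl;
-- should return "on"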

PostgreSQL-client(s)

Copy root cert from PostgreSQL-server

mkdir ~/.postgresql
cp /etc/ssl/certs/trustly-postgresql.pem ~/.postgresql/root.crt

Generate PostgreSQL-client private key

openssl genrsa -des3 -out ~/.postgresql/postgresql.key 1024

# If this is a server, remove the passphrase:
openssl rsa -in ~/.postgresql/postgresql.key -out ~/.postgresql/postgresql.key

Request CA to sign PostgreSQL-client key

# Replace "joel" with username:
openssl req -new -key ~/.postgresql/postgresql.key -out ~/.postgresql/postgresql.csr -subj '/C=SE/ST=Stockholm/L=Stockholm/O=Trustly/CN=joel'
sudo openssl x509 -req -in ~/.postgresql/postgresql.csr -CA /etc/ssl/certs/trustly-ca.pem -CAkey /etc/ssl/private/trustly-ca.key -out ~/.postgresql/postgresql.crt -CAcreateserial
sudo chown joel:joel -R ~/.postgresql
sudo chmod 400 -R ~/.postgresql/postgresql.key

Files

The following files are created/modified on each machine:

CA

/etc/ssl/private/trustly-ca.key
/usr/local/share/ca-certificates/trustly-ca.crt
/etc/ssl/certs/trustly-ca.pem -> /usr/local/share/ca-certificates/trustly-ca.crt

PostgreSQL-server

/var/lib/postgresql/9.1/main/server.key
/var/lib/postgresql/9.1/main/server.crt
/var/lib/postgresql/9.1/main/root.crt
/usr/local/share/ca-certificates/trustly-ca.crt
/usr/local/share/ca-certificates/trustly-postgresql.crt
/etc/ssl/certs/trustly-ca.pem -> /usr/local/share/ca-certificates/trustly-ca.crt
/etc/ssl/certs/trustly-postgresql.pem -> /usr/local/share/ca-certificates/trustly-postgresql.crt
/etc/postgresql/9.1/main/pg_hba.conf

PostgreSQL-client

~/.postgresql/root.crt
~/.postgresql/postgresql.key
~/.postgresql/postgresql.crt

Secure deployment of PostgreSQL functions

Two people might be simultaneously modifying the same database function. To protect against this scenario, I created a little tool which stages the source code of all functions in pg_proc and the definitions of all views. You then deploy your changes, and afterwards you call fdiff() to show a diff of the changes made, by comparing the new content of pg_proc and the view definitions against the staged state. If the changes are what you expected, you go ahead and commit.

We’ve been using this methodology every day at both my current company and my previous one, and it has prevented us from making mistakes.

Source code and installation instructions: https://github.com/joelonsql/fdiff

This is how to use it:

  1. Start a new transaction:
     BEGIN;

  2. Temporarily store the state of all views/functions:
     SELECT fstage();

  3. Deploy changes to the schema:
     CREATE OR REPLACE FUNCTION ...
     CREATE OR REPLACE VIEW ...

  4. Show a diff of the changes made:
     SELECT fdiff();

  5. If the changes are expected, go ahead and commit:
     COMMIT;

Example of the output from fdiff():


+-------------------------------+
| Updated or replaced functions |
+-------------------------------+

Schema................: public
Name..................: process_order
Argument data types...: _orderid bigint
Result data type......: boolean
Language..............: plpgsql
Type..................: normal
Volatility............: STABLE
Owner.................: amazon
20 c OR Orders.ShippingDate IS NOT NULL
20 c OR Orders.ShippingDate > now() - interval '2 month'

Hope you’ll like it!

Will PostgreSQL 9.5 Bring Back Moore’s Law?

I just watched a presentation from Mar 11, 2012, where Paul Graham / YCombinator gives away seven billion-dollar startup ideas:

  1. A New Search Engine
  2. Replace Email
  3. Replace Universities
  4. Internet Drama
  5. The Next Steve Jobs
  6. Bring Back Moore’s Law
  7. Ongoing Diagnosis

Idea 6 on the list is basically about the need to invent the sufficiently smart compiler: one capable of automatically breaking a problem (expressed in code) down into pieces that can be executed in parallel, in order to utilize GPUs and/or a lot of CPU cores.

Building such a compiler is probably extremely difficult, if possible at all, perhaps mainly due to the complexity that can be expressed in C, C++, Python, Java or any other ordinary imperative language.

Imagine instead the same problem but for a declarative language like SQL. Would it be equally difficult? Bruce Momjian, a core member of the PostgreSQL team, has already expressed his ideas on how to tackle the problem.

Come to think of it, the query planner is a bit like a compiler. It takes the SQL and figures out a smart way to execute it, breaking it down into a lot of sub-tasks, some of which could beneficially be executed in parallel.

So let’s hope the great pg-hackers will pull this off, so we don’t have to pay license fees to some future compiler-startup billionaire to get automatic parallelism.
