Will PostgreSQL 9.5 Bring Back Moore’s Law?

I just watched a presentation from Mar 11, 2012, where Paul Graham / YCombinator is giving away seven billion dollar startup ideas:

  1. A New Search Engine
  2. Replace Email
  3. Replace Universities
  4. Internet Drama
  5. The Next Steve Jobs
  6. Bring Back Moore’s Law
  7. Ongoing Diagnosis

Idea 6 on the list, is basically about the need to invent the sufficiently smart compiler, capable of understanding how to automatically break down a problem (expressed in code) into pieces which can be executed in parallell, in order to utilize GPUs and/or a lot of CPU cores.

Building such a compiler is probably extremely difficult, if at all possible, perhaps mainly due to the complexity possible to express in C, C++, Python, Java or any normal imperative language.

Imagine instead the same problem but for a declarative language like SQL. Would it be equally difficult? Core member Bruce Momjian of the PostgreSQL team have already expressed his ideas on how to tackle the problem.

When thinking about it, the query planner is a bit like a compiler, kind of. It takes the SQL and figures out a smart way to execute it, breaking it down into a lot of sub-tasks, of which some could with benefit be executed in parallel.

So let’s hope the great pg-hackers will pull this off, so we don’t have to pay some future start-up compiler-billionaire license fees to get automatic parallelism.

Announcing the pci-blackbox: PCI-DSS compliant system with 3-D Secure support, built on PostgreSQL and PL/pgSQL

The pci-blackbox aims to simplify the process of becoming PCI-DSS compliant, by handling card data in an isolated system, physically and logically separated from the rest of the system.

flow chart

The idea comes from a guy I met over a beer who works at Skype. He explained they had done something similar, they have/had a special “PCI server” kept secure behind locked doors, which only function was to encrypt/decrypt/process card data.

I thought it sounded like a smart idea, for merchants who for some reason need to become PCI-DSS compliant and cannot use a hosted payment solution.

The company I work for is in the process of implementing card payments, and we cannot use a hosted solution, so we decided to give this concept a shot, and see if we managed to come up with something useful. Hopefully we have, you be the judge.

The pci-blackbox must be run on a separate server from the main system. In this test however, everything is being run on the same machine.

The API consist of three functions:

  • encrypt_card(): Encrypt sensitive card data, return a CardKey.
  • authorise_payment_request(): Authorise payment using a CardKey.
  • authorise_payment_request_3d(): Authorise 3-D Secure payment.

Highlights:

  • 3-D Secure support.
  • Gateway-independent, like Spreedly, the API is generic, not specific to any gateway, currently only Adyen is supported though.
  • Host your own payment page, like Stripe, you design your own payment page, and POST directly to the pci-blackbox.
  • PCI-DSS compliant.
  • Open sourced under the MIT-license.
  • JSON API.

This module is work in progress and has not been put into production yet.

If anyone know of any other similar open source project, which provides an isolated card-component, please let me know. Couldn’t find any, so that’s why I started hacking on this.

The files under /nonpci are just an example implementation on how to use the pci-blackbox.

The installation instructions only setup a test environment.

Any feedback is very much appreciated, thank you!

Production upgrade from 8.4 to 9.1

At Wednesday, 2012-08-15 05:00 am, the work began. Four hours of downtime later at 09:00 am the upgrade was complete.
Two brand new identical HP DL380 Gen8 servers with 400GB SSD-disks and 192GB RAM are now serving our customers.

This was one of the most nervous moments in my life, a revert would have been impossible once the API was brought back online.
Luckily, it turned out we had done a decent job testing everything, only a few glitches caused some minor problems during the following hours.

Many thanks to Magnus Hagander from Redpill Linpro who helped out with the upgrade.
I had never seen pg_basebackup in action before, impressive, very user-friendly, it even had a nice progress bar!

Trustly Group AB is a very PostgreSQL centric payment company, so a moment like this of course called for some cake and champagne!

PostgreSQL upgrade from 8.4 to 9.1

It’s been very frustrating to see all new cool features in 9.0 and 9.1, while being stuck at 8.4.
What I’ve been looking forward to the most:

  • Streaming replication to off-load work from the master to slaves in Hot Standby mode
  • Synchronous replication to get rid of DRBD
  • Calling functions with named parameters
  • Foreign Key Locks

    In August 2010, I sent an email to pghackers reporting a strange deadlock problem.
    I couldn’t understand how it was possible you could get a deadlock by updating
    the very same row again later in the same transaction.

    http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg157869.html

    I also made a screencast demonstrating the problem:

    http://www.screencast.com/users/joeljacobson/folders/Jing/media/42c31028-80fa-45fe-b21f-9039110c3555

    Simon Riggs came up with a very clever solution to this problem.
    By introducing a new type of lock, it would be possible to avoid a lot of deadlocks.
    Álvaro Herrera has since then been working on the implementation.
    It turned out to be a lot of work, unfortunately the feature didn’t make it into 9.2.
    I bet this will be one of the most important performance features in 9.3.
    It will greatly improve the performance by increasing the concurrency possible if you have a high transaction load.

    I thought it would be fun to test my old example from 2010 using the latest version of the patch.

    Let’s checkout Álvaro’s latest version of the fklocks patch:

    git remote add fklocks git://github.com/alvherre/postgres.git
    git fetch fklocks
    git checkout fklocks

    CREATE TABLE A (
    AID integer not null,
    Col1 integer,
    PRIMARY KEY (AID)
    );
    
    CREATE TABLE B (
    BID integer not null,
    AID integer not null,
    Col2 integer,
    PRIMARY KEY (BID),
    FOREIGN KEY (AID) REFERENCES A(AID)
    );
    
    INSERT INTO A (AID) VALUES (1);
    INSERT INTO B (BID,AID) VALUES (2,1);
    
    -- Process 1:
    BEGIN;
        -- Process 2:
        BEGIN;
    
    UPDATE A SET Col1 = 1 WHERE AID = 1;
    
        UPDATE B SET Col2 = 2 WHERE BID = 2;
    
    UPDATE B SET Col2 = 3 WHERE BID = 2;
    -- Process 1 is waiting
    
        UPDATE B SET Col2 = 4 WHERE BID = 2;
        -- Process 2 is allowed to update the same row
        COMMIT;
    
    COMMIT;
    
    SELECT * FROM A;
     aid | col1 
    -----+------
       1 |    1
    (1 row)
    
    SELECT * FROM B;
     bid | aid | col2 
    -----+-----+------
       2 |   1 |    3
    (1 row)
    

    Both transactions were able to COMMIT successfully, hurray!

    Without this patch, you would get a “deadlock detected” on row 30,
    when Process 2 tries to update the same row again, BID = 2.

    However, I was a bit surprised the final value of Col2 is 3 and not 4,
    as Process 2 updated the row BID = 2 before Process 1,
    and Process 2 updated the row again after Process 1.
    This probably has a logical explanation.

    I hope to have time to do some concurrency stress testing using
    example code and data extracted from our production database.

    Would be interesting to see if my old read-life deadlock scenarios
    are fixed by this patch and how it affects concurrency and performance.

    PostgreSQL 9.3devel – plpgsql_check_function

    One of the biggest incompatibility issues between 9.x and 8.4 is the new behavior of
    PL/pgSQL functions where parameter names are in conflict with columns.

    In 8.4, PL/pgSQL silently use the value of the parameters.

    In 9.x, PL/pgSQL throws an error, but not when doing CREATE FUNCTION,
    only when actually executing the function and hitting a row in conflict.

    This means it is currently impossible to get a list of any functions with this problem.

    You could disable the new behavior by adding #variable_conflict use_variable
    to all existing functions or to the global configuration, but this is sub-optimal,
    it would be much better if the new improved behavior could be used for all functions.

    I asked around on IRC in #postgresql on irc.freenode.net and got lucky!

    A new feature will hopefully be added to 9.3, the function is called plpgsql_check_function developed by Pavel Stěhule.

    plpgsql_check_function will detect any problems you might get during execution,
    not only the errors you get when creating functions.

    Example

    ubuntu@pgsql93:~$ psql
    psql (9.3devel)
    Type "help" for help.
    
    ubuntu=# CREATE OR REPLACE FUNCTION myfunc(id integer) RETURNS TEXT AS $$
    ubuntu$# DECLARE
    ubuntu$# value text;
    ubuntu$# BEGIN
    ubuntu$# SELECT value INTO value FROM mytable WHERE id = id;
    ubuntu$# RETURN value;
    ubuntu$# END;
    ubuntu$# $$ LANGUAGE plpgsql;
    CREATE FUNCTION
    ubuntu=# SELECT plpgsql_check_function(oid) FROM pg_proc WHERE proname = 'myfunc';
                             plpgsql_check_function                          
    -------------------------------------------------------------------------
     error:42702:5:SQL statement:column reference "value" is ambiguous
     Query: SELECT value            FROM mytable WHERE id = id
     --            ^
     Detail: It could refer to either a PL/pgSQL variable or a table column.
    (4 rows)
    

    I’m just about to upgrade our production database from 8.4 to 9.1.
    Out of the 1474 functions (1.8 MB of code), 132 functions had some kind of problem.
    Thanks to plpgsql_check_function I didn’t have to go through all the PL/pgSQL functions manually.

    Thank you Pavel! You saved me a lot of work today! 🙂

    Deploying PostgreSQL Stored Procedures

    When you mention stored procedures, I see horror and disgust in the eyes of many developers.

    “Stored Procedures are like crack, they are addictive and harmful.”

    I agree they are addictive, but harmful?

    “They are difficult to version control, test and deploy.”

    This is of course not true.

    Deployments can be tricky though. How can you be sure none of the functions in the production
    database have been locally modified by some evil consultant?
    In a hurry, he did a \ef thinking “I’ll commit this change to the repo later, I won’t forget!”.

    Months later, the function needs to be modified again, this time by someone else, who makes
    the change to the function source code file in the VCS, commits, tests and deploys the function.

    Whops! The local modification made by the evil consultant is gone! Alarms goes off, people are
    screaming “the system is down!”, people are running around in circles, you revert your change,
    back to the previous version in the VCS, but people are screaming even higher “the system is down!”.

    But all the tests passed! How could this happened!

    To prevent this scenario from happening, I’ve made a simple tool which I always use when deploying functions.

    It’s called PgDeploy. I’ve been using it in production successfully since 2008.

    It gives you a second chance to review the diff your deployment would actually cause in the
    production database, by comparing the new source code with what’s already in the database.

    -- SYNOPSIS:
    SELECT deploy(SQL text, MD5 char(32));
    
    -- Steg 1, preview:
    -- Watch the diff. Get MD5 hash.
    
    SELECT deploy($SQL$
        CREATE OR REPLACE FUNCTION myfunc() RETURNS BOOLEAN AS $$
        DECLARE
        BEGIN
           -- source code of function
           RETURN TRUE;
        END;
        $$ LANGUAGE plpgsql;
    $SQL$, NULL);
    
    -- The output returned is a diff showing what functions were removed,
    -- created, replaced or updated.
    
    -- Steg 2, deploy:
    -- Pass MD5 from step 1 as second argument to confirm deployment.
    
    SELECT deploy($SQL$
        CREATE OR REPLACE FUNCTION myfunc() RETURNS BOOLEAN AS $$
        DECLARE
        BEGIN
           -- source code of function
           RETURN TRUE;
        END;
        $$ LANGUAGE plpgsql;
    $SQL$,'3b8297fb1408e480b05d1597ad5e65d5');
    

    Below is a usage example, where two developers modify the same function,
    where one of them forgets to commit, causing a difference of
    the source code in the database and in the VCS.

    DAY 1. You deploy a new function:

    SELECT deploy($SQL$
    CREATE OR REPLACE FUNCTION calc_value_added_tax(amount numeric) RETURNS NUMERIC AS $$
    DECLARE
    _vat numeric := 0.25;
    BEGIN
    RETURN amount * (1 + _vat);
    END;
    $$ LANGUAGE plpgsql IMMUTABLE;
    $SQL$, NULL);
    

    The deploy function returns as text showing any removed,
    new or updated/replaced functions caused by your SQL-code.
    In this case, we can see a new function was created, like expected:

                          deploy                      
    --------------------------------------------------
     +-------------------+                           +
     | Removed functions |                           +
     +-------------------+                           +
                                                     +
                                                     +
                                                     +
     +---------------+                               +
     | New functions |                               +
     +---------------+                               +
                                                     +
     Schema................+ public                  +
     Name..................+ calc_value_added_tax    +
     Argument data types...+ amount numeric          +
     Result data type......+ numeric                 +
     Language..............+ plpgsql                 +
     Type..................+ normal                  +
     Volatility............+ IMMUTABLE               +
     Owner.................+ postgres                +
     Source code (chars)...+ 70                      +
                                                     +
                                                     +
     +-------------------------------+               +
     | Updated or replaced functions |               +
     +-------------------------------+               +
                                                     +
     MD5 of changes: 74e9f61aa5a0ce1c333748ff589a3029
    (1 row)
    

    We will now pass the MD5 from the preview as second argument, confirming we want to deploy.

    SELECT deploy($SQL$
    CREATE OR REPLACE FUNCTION calc_value_added_tax(amount numeric) RETURNS NUMERIC AS $$
    DECLARE
    _vat numeric := 0.25;
    BEGIN
    RETURN amount * (1 + _vat);
    END;
    $$ LANGUAGE plpgsql IMMUTABLE;
    $SQL$, '74e9f61aa5a0ce1c333748ff589a3029');
    

    The function has now been deployed.

    DAY 2. Function is modified:

    The functions needs to be modified, VAT should not be added for amounts less than 100.
    Some evil consultant is assigned the task and makes the change locally in the database, but forgets to commit to the VCS.
    This causes a discrepancy between the source code in the production database and the VCS. Not good at all!

    CREATE OR REPLACE FUNCTION calc_value_added_tax(amount numeric) RETURNS NUMERIC AS $$
    DECLARE
    _vat numeric := 0.25;
    BEGIN
    IF amount < 100 THEN
        RETURN amount;
    ELSE
        RETURN amount * (1 + _vat);
    END IF;
    END;
    $$ LANGUAGE plpgsql IMMUTABLE;
    

    DAY 3. You modify the function.

    You are assigned the task to change the VAT from 25% to 12.5%.
    You edit the function in the VCS, and change the single line of code,
    and use the deploy function to make the deployment.

    SELECT deploy($SQL$
    CREATE OR REPLACE FUNCTION calc_value_added_tax(amount numeric) RETURNS NUMERIC AS $$
    DECLARE
    _vat numeric := 0.125;
    BEGIN
    RETURN amount * (1 + _vat);
    END;
    $$ LANGUAGE plpgsql IMMUTABLE;
    $SQL$, NULL);
    

    In the output from the deploy function you notice the change is not the single line
    you were expecting, but a much bigger change!

                          deploy                      
    --------------------------------------------------
     +-------------------+                           +
     | Removed functions |                           +
     +-------------------+                           +
                                                     +
                                                     +
                                                     +
     +---------------+                               +
     | New functions |                               +
     +---------------+                               +
                                                     +
                                                     +
                                                     +
     +-------------------------------+               +
     | Updated or replaced functions |               +
     +-------------------------------+               +
                                                     +
     Schema................: public                  +
     Name..................: calc_value_added_tax    +
     Argument data types...: amount numeric          +
     Result data type......: numeric                 +
     Language..............: plpgsql                 +
     Type..................: normal                  +
     Volatility............: IMMUTABLE               +
     Owner.................: postgres                +
     3 c _vat numeric := 0.25;                       +
     3 c _vat numeric := 0.125;                      +
                                                     +
     5 c IF amount < 100 THEN                     +
     5 c RETURN amount * (1 + _vat);                 +
                                                     +
     6 -     RETURN amount;                          +
     6 -                                             +
                                                     +
     7 - ELSE                                        +
     7 -                                             +
                                                     +
     8 -     RETURN amount * (1 + _vat);             +
     8 -                                             +
                                                     +
     9 - END IF;                                     +
     9 -                                             +
                                                     +
                                                     +
                                                     +
     MD5 of changes: 162d4dcc113345c71f6c9bc4448534aa
    (1 row)
    

    You realize someone must have made a local change without committing to the VCS!

    The deployment is aborted, the deploy function saved the day!

    https://github.com/joelonsql/PgDeploy

    Call PostgreSQL functions via HTTP JSON-RPC

    Everytime I deploy a new stored procedure, it bothers me I have to update the middle-layer to allow my front-end code to use the function.

    My middle-layer between the front-end code and the database simply translates JSON-RPC requests
    to SELECT queries calling stored procedures returning JSON back to the front-end.

    This is almost perfect.

    But, I still have to update a file where I keep all queries, queries.xml, every time I add a new function.


        <query name="Get_Bank_Transaction_Fee" 
               params="Username, Password, Host, TransactionType, ClearingHouse, BankCode, Amount, Currency" 
               result="hashref">
            SELECT * FROM Get_Bank_Transaction_Fee(?, ?, ?, ?, ?, ?, ?, ?)
        </query>

    This is a quite nice way to separate the SQL code from the programming code, but since the SQL query consists of
    nothing more than a SELECT statement, the name of the stored procedure and the input arguments,
    it might as well be automatically generated based on the JSON-RPC request.

    I wanted to do CREATE FUNCTION and nothing more before the front-end could start using the new function!

    PostgreSQL 9.0 introduced calling functions using named parameters, instead of relying on the position of the arguments.


    -- this,
    SELECT myfunc(foo := 'Hello', bar := 'World!');
    -- is the same thing as,
    SELECT myfunc(bar := 'World!', foo := 'Hello');
    -- before 9.0 you had to do,
    SELECT myfunc('Hello','World!');
    -- which means you must not get the positions wrong

    This nice feature made it trivial and safe to map a JSON-RPC request to a stored procedure.
    The method is the named of the stored procedure and the params is simply a key/value hash,
    where each key maps to a named input argument.

    The only tricky part which required some usage of pg_catalog.pg_proc was to determine if the function
    returned a SET OF or a single row.

    I’ve created two perl modules to do the trick, DBIx::Pg::CallFunction and pg_proc_jsonrpc.psgi.

    http://search.cpan.org/~joeljac/DBIx-Pg-CallFunction/scripts/pg_proc_jsonrpc.psgi

    http://search.cpan.org/~joeljac/DBIx-Pg-CallFunction/lib/DBIx/Pg/CallFunction.pm

    Proof-of-concept demonstration:


    joel@ubuntu:~$ psql -c 'CREATE OR REPLACE FUNCTION myfunc(foo text, bar text) RETURNS TEXT AS $ SELECT $1 || $2 $$ LANGUAGE sql SECURITY DEFINER'
    CREATE FUNCTION
    joel@ubuntu:~$ curl 'http://localhost/postgres/myfunc?foo=Hello&bar=World'
    {
       "error" : null,
       "result" : "HelloWorld"
    }
    joel@ubuntu:~$ curl 'http://localhost/postgres/now'
    {
       "error" : null,
       "result" : "2012-06-03 10:46:23.972644+07"
    }
    joel@ubuntu:~$ curl 'http://localhost/postgres/random'
    {
       "error" : null,
       "result" : "0.664589043240994"
    }
    joel@ubuntu:~$ 

    Extract from PostgreSQL log:


    2012-06-03 10:46:14.270 ICT,"www-data","joel",3601,"[local]",4fcade06.e11,2,"SELECT",2012-06-03 10:46:14 ICT,2/52,0,LOG,00000,"execute dbdpg_p26924_1: SELECT * FROM myfunc(bar := $1,foo := $2);","parameters: $1 = 'World', $2 = 'Hello'",,,,,,,,"pg_proc_jsonrpc"
    2012-06-03 10:46:23.972 ICT,"www-data","joel",3603,"[local]",4fcade0f.e13,2,"SELECT",2012-06-03 10:46:23 ICT,2/58,0,LOG,00000,"execute dbdpg_p27386_2: SELECT * FROM now();",,,,,,,,,"pg_proc_jsonrpc"
    2012-06-03 10:46:27.732 ICT,"www-data","joel",3605,"[local]",4fcade13.e15,2,"SELECT",2012-06-03 10:46:27 ICT,2/64,0,LOG,00000,"execute dbdpg_p17663_2: SELECT * FROM random();",,,,,,,,,"pg_proc_jsonrpc"

    No more middle-layer hassle!

    Well, OK, not yet, this is just a proof-of-concept, a lot of work and testing remains until it can be put into production.

    I think this could potentially be a very interesting way to make a PostgreSQL database more accessible to the outside world,
    in a transparent and convenient way, not requiring any extra work or restarts.

    Imagine front-end developers writing their own stored procedures in PLV8 (Javascript) and accessing them using JQuery (Javascript).

    All that’s left is the browser and the database!

    Less is more.