Call PostgreSQL functions via HTTP JSON-RPC

Every time I deploy a new stored procedure, it bothers me that I have to update the middle layer before my front-end code can use the function.

My middle layer between the front-end code and the database simply translates JSON-RPC requests
into SELECT queries calling stored procedures, and returns the results as JSON to the front-end.

This is almost perfect.

But I still have to update the file where I keep all queries, queries.xml, every time I add a new function.


    <query name="Get_Bank_Transaction_Fee" 
           params="Username, Password, Host, TransactionType, ClearingHouse, BankCode, Amount, Currency" 
           result="hashref">
        SELECT * FROM Get_Bank_Transaction_Fee(?, ?, ?, ?, ?, ?, ?, ?)
    </query>

This is quite a nice way to separate the SQL code from the programming code, but since the query consists of
nothing more than a SELECT statement, the name of the stored procedure and the input arguments,
it might as well be generated automatically from the JSON-RPC request.

I wanted to do CREATE FUNCTION and nothing more before the front-end could start using the new function!

PostgreSQL 9.0 introduced calling functions using named parameters, instead of relying on the position of the arguments.


-- this,
SELECT myfunc(foo := 'Hello', bar := 'World!');
-- is the same thing as,
SELECT myfunc(bar := 'World!', foo := 'Hello');
-- before 9.0 you had to do,
SELECT myfunc('Hello','World!');
-- which means you must not get the positions wrong

This nice feature made it trivial and safe to map a JSON-RPC request to a stored procedure.
The method is the name of the stored procedure, and the params object is simply a key/value hash
where each key maps to a named input argument.
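
To make the mapping concrete, a request for the myfunc function from the demo below would be translated roughly like this (a sketch of the idea, not necessarily the exact statement the module generates):

-- The JSON-RPC request {"method": "myfunc", "params": {"foo": "Hello", "bar": "World"}}
-- becomes a statement with named parameters,
SELECT * FROM myfunc(foo := $1, bar := $2);
-- executed with $1 = 'Hello' and $2 = 'World'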

The only tricky part, which required a look at pg_catalog.pg_proc, was to determine whether the function
returns a SETOF (multiple rows) or a single row.
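
A minimal sketch of such a lookup, assuming the function lives in the public schema (not necessarily the exact query the module uses):

-- proretset tells us whether the function returns a SETOF or a single row
SELECT p.proretset
FROM pg_catalog.pg_proc p
JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE n.nspname = 'public'
  AND p.proname = 'myfunc';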

I’ve created two Perl components to do the trick: the module DBIx::Pg::CallFunction and the PSGI application pg_proc_jsonrpc.psgi.

http://search.cpan.org/~joeljac/DBIx-Pg-CallFunction/scripts/pg_proc_jsonrpc.psgi

http://search.cpan.org/~joeljac/DBIx-Pg-CallFunction/lib/DBIx/Pg/CallFunction.pm

Proof-of-concept demonstration:


joel@ubuntu:~$ psql -c 'CREATE OR REPLACE FUNCTION myfunc(foo text, bar text) RETURNS TEXT AS $$ SELECT $1 || $2 $$ LANGUAGE sql SECURITY DEFINER'
CREATE FUNCTION
joel@ubuntu:~$ curl 'http://localhost/postgres/myfunc?foo=Hello&bar=World'
{
   "error" : null,
   "result" : "HelloWorld"
}
joel@ubuntu:~$ curl 'http://localhost/postgres/now'
{
   "error" : null,
   "result" : "2012-06-03 10:46:23.972644+07"
}
joel@ubuntu:~$ curl 'http://localhost/postgres/random'
{
   "error" : null,
   "result" : "0.664589043240994"
}
joel@ubuntu:~$ 

Extract from PostgreSQL log:


2012-06-03 10:46:14.270 ICT,"www-data","joel",3601,"[local]",4fcade06.e11,2,"SELECT",2012-06-03 10:46:14 ICT,2/52,0,LOG,00000,"execute dbdpg_p26924_1: SELECT * FROM myfunc(bar := $1,foo := $2);","parameters: $1 = 'World', $2 = 'Hello'",,,,,,,,"pg_proc_jsonrpc"
2012-06-03 10:46:23.972 ICT,"www-data","joel",3603,"[local]",4fcade0f.e13,2,"SELECT",2012-06-03 10:46:23 ICT,2/58,0,LOG,00000,"execute dbdpg_p27386_2: SELECT * FROM now();",,,,,,,,,"pg_proc_jsonrpc"
2012-06-03 10:46:27.732 ICT,"www-data","joel",3605,"[local]",4fcade13.e15,2,"SELECT",2012-06-03 10:46:27 ICT,2/64,0,LOG,00000,"execute dbdpg_p17663_2: SELECT * FROM random();",,,,,,,,,"pg_proc_jsonrpc"

No more middle-layer hassle!

Well, OK, not yet. This is just a proof of concept, and a lot of work and testing remains before it can be put into production.

I think this could be a very interesting way to make a PostgreSQL database more accessible to the outside world,
in a transparent and convenient way, without requiring any extra work or restarts.

Imagine front-end developers writing their own stored procedures in PLV8 (JavaScript) and accessing them using jQuery (JavaScript).

All that’s left is the browser and the database!

Less is more.

Add indexes without slowing down writes

I find myself facing a particular dilemma quite often.

Should I keep this rarely used index or drop it to enhance write performance?

If the index isn’t used at all, it’s no problem, just drop it. But if
something occasionally uses the index, it might be necessary to keep it.

This something might be a script generating a monthly report or some rarely
used back-office feature. Dropping the index would cause such applications
to take forever due to a sequential scan of the huge table.
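
A rough way to spot such candidates is to look at the index usage statistics; the thresholds are of course up to you, and the counters only cover the period since they were last reset:

-- Indexes that have rarely or never been scanned since the statistics were reset
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 20;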

In my case, our system processes millions of transactions, and
write performance will eventually become a bottleneck.
Dropping indexes would help, but would backfire by causing problems a few
times a month when the indexes are actually needed.

Now I am thinking: for these real-time write transactions, it would be awesome
if we could commit at once and postpone the work of updating the rarely used
indexes until some other read query actually needs them. They could be
updated as soon as possible, but not during peak time, allowing the more
important real-time transactions to be prioritized.

Now you are thinking: this can’t be done, because the index is useless if it’s
not in sync, and that’s correct. But wait, we can fix that. Every time we need
to use the index, we simply process the buffered changes to bring it in sync
before the index is used. That would probably take a second or two, which would
delay the monthly report batch job, but that’s totally OK: it’s not a real-time
feature, it takes many seconds to complete anyway, and it would again take
forever without the index in place.

In total CPU time, we would not save anything; the amount of work would be the
same. But we would cope better during write peaks. We want
to maximize write throughput at peak times, and do the updating of the rarely
used indexes later, when the system load is lower.

I don’t know how, or if, this could be implemented in PostgreSQL, but the recently
added CREATE INDEX CONCURRENTLY is in a way similar to what I need:
it builds the index outside of a single ACID transaction, and prevents anyone from
using the index until it is marked as valid.
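
For reference, using the Transactions table from the test further down, it looks something like this; the index is built without blocking writes, and if the build fails it is left behind marked as invalid:

-- Build the index without taking a lock that blocks concurrent writes
CREATE INDEX CONCURRENTLY Index_Transactions_AccountID ON Transactions(AccountID);

-- Check whether the index is valid, i.e. usable by the planner
SELECT indexrelid::regclass AS index_name, indisvalid
FROM pg_index
WHERE indexrelid = 'index_transactions_accountid'::regclass;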

Maybe a background process could handle the updating of the indexes, much like
what the autovacuum daemon does.

I can see there has been some discussion on this subject before on the mailing
lists:

“Scaling with lazy index updates”
http://archives.postgresql.org/pgsql-performance/2004-07/msg00114.php

But that was a very long time ago; conditions may have changed since
2004.

ACID must be maintained and the index must not be used while it is out of sync; I only
want to delay the index updating work a bit, and only be forced to get in sync
when a read query actually needs the index.

Thoughts, anyone?

In the test below, we can see how indexes affect insert speed.


joel@Joel-Jacobsons-MacBook-Pro ~ $ psql
psql (9.2beta1)
Type "help" for help.

joel=# CREATE TABLE Transactions (
joel(# TransactionID serial not null,
joel(# Amount numeric not null,
joel(# AccountID integer not null,
joel(# CustomerID integer not null,
joel(# ProductID integer not null,
joel(# Datestamp timestamptz not null default now(),
joel(# PRIMARY KEY (TransactionID)
joel(# -- No foreign keys in this example
joel(# );
NOTICE:  CREATE TABLE will create implicit sequence "transactions_transactionid_seq" for serial column "transactions.transactionid"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "transactions_pkey" for table "transactions"
CREATE TABLE
joel=# -- No indexes except the primary key:
joel=# 
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Insert on transactions  (cost=0.00..20.00 rows=1000 width=4) (actual time=454.232..454.232 rows=0 loops=1)
   ->  Function Scan on generate_series i  (cost=0.00..20.00 rows=1000 width=4) (actual time=14.762..131.197 rows=100000 loops=1)
 Total runtime: 489.204 ms
(3 rows)

joel=# CREATE INDEX Index_Transactions_AccountID ON Transactions(AccountID);
CREATE INDEX
joel=# 
joel=# -- 1 index:
joel=# 
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Insert on transactions  (cost=0.00..20.00 rows=1000 width=4) (actual time=739.295..739.295 rows=0 loops=1)
   ->  Function Scan on generate_series i  (cost=0.00..20.00 rows=1000 width=4) (actual time=10.160..135.399 rows=100000 loops=1)
 Total runtime: 741.141 ms
(3 rows)

joel=# CREATE INDEX Index_Transactions_CustomerID ON Transactions(CustomerID);
CREATE INDEX
joel=# 
joel=# -- 2 indexes:
joel=# 
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Insert on transactions  (cost=0.00..20.00 rows=1000 width=4) (actual time=1626.515..1626.515 rows=0 loops=1)
   ->  Function Scan on generate_series i  (cost=0.00..20.00 rows=1000 width=4) (actual time=9.739..530.770 rows=100000 loops=1)
 Total runtime: 1627.900 ms
(3 rows)

joel=# CREATE INDEX Index_Transactions_ProductID ON Transactions(ProductID);
CREATE INDEX
joel=# 
joel=# -- 3 indexes:
joel=# 
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Insert on transactions  (cost=0.00..20.00 rows=1000 width=4) (actual time=2161.321..2161.321 rows=0 loops=1)
   ->  Function Scan on generate_series i  (cost=0.00..20.00 rows=1000 width=4) (actual time=9.976..549.794 rows=100000 loops=1)
 Total runtime: 2164.164 ms
(3 rows)

Hasta la vista, SQL_ASCII

Converting the encoding of your production database from SQL_ASCII to UTF8 can be a tricky challenge if you cannot accept any downtime.

The method presented here scans through all tables with text/varchar columns and writes any rows containing invalid UTF8 to a separate schema, allowing you to dump that schema and analyze each individual row.

1. Define a function to check if a string is valid UTF8 or not.

CREATE OR REPLACE FUNCTION Is_UTF8(_Str text) RETURNS BOOLEAN AS $BODY$
SELECT
encode(textsend($1),'hex')
~
$REGEX$(?x)
^(?:
    (?:[0-7][0-9a-f])               # 00-7f: plain ASCII
    |
    (?:
        (?:c[2-9a-f]|d[0-9a-f])     # c2-df: lead byte of a 2-byte sequence
        |
        e0[ab][0-9a-f]              # e0 a0-bf: 3-byte, rejects overlong encodings
        |
        ed[89][0-9a-f]              # ed 80-9f: 3-byte, rejects UTF-16 surrogates
        |
        (?:
            (?:e[1-9abcef])         # e1-ec, ee, ef: lead byte of a 3-byte sequence
            |
            f0[9ab][0-9a-f]         # f0 90-bf: 4-byte, rejects overlong encodings
            |
            f[1-3][89ab][0-9a-f]    # f1-f3 80-bf: 4-byte
            |
            f48[0-9a-f]             # f4 80-8f: 4-byte, up to U+10FFFF
        )[89ab][0-9a-f]             # continuation byte (80-bf)
    )[89ab][0-9a-f]                 # continuation byte (80-bf)
)*$
$REGEX$;
$BODY$ LANGUAGE sql IMMUTABLE;

Source: http://tapoueh.org/blog/2010/02/23-getting-out-of-sql_ascii-part-2.html
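
A quick sanity check of the function in a SQL_ASCII database; the octal escape \351 below stands in for the single byte a latin1 'é' would be stored as:

SELECT Is_UTF8('hello');     -- true: plain ASCII is always valid UTF8
SELECT Is_UTF8(E'caf\351');  -- false: a bare latin1 0xe9 byte is not valid UTF8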

2. Create a schema for the invalid UTF8 rows:

CREATE SCHEMA invalidutf8;

3. Generate a script which, for each table with text/varchar columns, creates and populates a table containing the rows with invalid UTF8 text.


COPY (
SELECT 'CREATE TABLE invalidutf8.' || s.nspname || '_' || c.relname || ' AS SELECT * FROM ' || s.nspname || '.' || c.relname || ' WHERE Is_UTF8(' || array_to_string(array_agg(a.attname),') IS FALSE OR Is_UTF8(') || ') IS FALSE;'
FROM pg_attribute a
INNER JOIN pg_class c ON (a.attrelid = c.oid)
INNER JOIN pg_namespace s ON (s.oid = c.relnamespace)
WHERE a.atttypid IN (25, 1043) -- text, varchar
AND c.relkind = 'r' -- ordinary table
AND s.nspname NOT LIKE 'pg_%' AND s.nspname NOT IN ('pgx_diag','information_schema')
GROUP BY s.nspname, c.relname
ORDER BY s.nspname, c.relname
) TO '/tmp/invalidutf8.sql';
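
For a hypothetical table public.users with text columns firstname and lastname, the generated /tmp/invalidutf8.sql would contain a line like:

CREATE TABLE invalidutf8.public_users AS SELECT * FROM public.users WHERE Is_UTF8(firstname) IS FALSE OR Is_UTF8(lastname) IS FALSE;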

4. Run the script; in psql, type:


\i /tmp/invalidutf8.sql

5. Dump the schema so you can analyze the rows:


pg_dump -f invalidutf8.dump -F p -n invalidutf8 -h 127.0.0.1 --column-inserts -a thedatabasename

It is now easy to either delete the rows or convert them to UTF-8, provided you know the encoding; otherwise you can always replace any non-ASCII characters with whitespace.

Let’s say you notice you have a table Users with a text column Lastname containing invalid UTF-8, which you identify as being latin1.
The SQL below updates only the rows with invalid characters and also checks that the text converted from latin1 is valid UTF8.


UPDATE Users SET Lastname = convert( textsend(Lastname), 'latin1', 'utf8' )::text
WHERE UserID IN (
    SELECT UserID FROM invalidutf8.Users
    WHERE Is_UTF8(
        convert(
            textsend(invalidutf8.Users.Lastname),
            'latin1',
            'utf8'
        )::text
    ) IS TRUE
);

6. Drop the schema to drop all the invalidutf8.* tables we created.


DROP SCHEMA invalidutf8 CASCADE;

To verify, do the whole thing again. If the pg_dump of the invalidutf8 schema doesn’t contain any INSERT statements, you can be certain you have a clean UTF8 database.

7. It should now be safe to change the encoding of the database to UTF8.


UPDATE pg_database SET encoding = pg_char_to_encoding('UTF8') WHERE datname = 'thedatabasename';
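
You can then confirm what the catalog reports; a simple sanity check, with 'thedatabasename' standing in for your own database name:

SELECT datname, pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'thedatabasename';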

Lastly, after changing the encoding, do the whole thing again (steps 2-6) and verify that no invalid UTF8 rows managed to get inserted between your last check and the encoding change.

Done!

I’m not giving any guarantees that this is a safe way to do it, but personally it was my only option, as we couldn’t accept any downtime.
So far so good; I guess the only way to know for sure is to do a pg_dump / pg_restore and see if it complains.