Converting the encoding of your production database from SQL_ASCII to UTF8 can be a tricky challenge if you cannot accept any downtime.

The method presented here scans every table with text/varchar columns and copies any rows containing invalid UTF8 into a separate schema, allowing you to dump that schema and analyze each individual row.

1. Define a function to check if a string is valid UTF8 or not.

CREATE OR REPLACE FUNCTION Is_UTF8(_Str text) RETURNS BOOLEAN AS $BODY$
SELECT
encode(textsend($1),'hex')
~
$REGEX$(?x)
^(?:
(?:[0-7][0-9a-f])           # single byte 0x00-0x7F (plain ASCII)
|
(?:
(?:c[2-9a-f]|d[0-9a-f])     # 2-byte lead 0xC2-0xDF
|
e0[ab][0-9a-f]              # 0xE0 followed by 0xA0-0xBF
|
ed[89][0-9a-f]              # 0xED followed by 0x80-0x9F (excludes UTF-16 surrogates)
|
(?:
(?:e[1-9abcef])             # 3-byte lead 0xE1-0xEC, 0xEE-0xEF
|
f0[9ab][0-9a-f]             # 0xF0 followed by 0x90-0xBF
|
f[1-3][89ab][0-9a-f]        # 0xF1-0xF3 followed by 0x80-0xBF
|
f48[0-9a-f]                 # 0xF4 followed by 0x80-0x8F (caps the range at U+10FFFF)
)[89ab][0-9a-f]             # continuation byte 0x80-0xBF
)[89ab][0-9a-f]             # final continuation byte 0x80-0xBF
)*$
$REGEX$;
$BODY$ LANGUAGE sql IMMUTABLE;

Source: http://tapoueh.org/blog/2010/02/23-getting-out-of-sql_ascii-part-2.html
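
A couple of quick sanity checks (the second constructs an invalid byte by hand, assuming the database is still SQL_ASCII):

SELECT Is_UTF8('hello');                                        -- true: plain ASCII
SELECT Is_UTF8(convert_from(decode('e9', 'hex'), 'SQL_ASCII')); -- false: a lone latin1 0xE9 byte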

2. Create a schema for the invalid UTF8 rows:

CREATE SCHEMA invalidutf8;

3. Generate a script which, for each table with text/varchar columns, creates and populates a table in the invalidutf8 schema holding the rows with invalid UTF8 text.


COPY (
SELECT 'CREATE TABLE invalidutf8.' || s.nspname || '_' || c.relname
  || ' AS SELECT * FROM ' || s.nspname || '.' || c.relname
  || ' WHERE Is_UTF8(' || array_to_string(array_agg(a.attname), ') IS FALSE OR Is_UTF8(')
  || ') IS FALSE;'
FROM pg_attribute a
INNER JOIN pg_class c ON (a.attrelid = c.oid)
INNER JOIN pg_namespace s ON (s.oid = c.relnamespace)
WHERE a.atttypid IN (25, 1043) -- text, varchar
AND c.relkind = 'r' -- ordinary table
AND s.nspname NOT LIKE 'pg_%' AND s.nspname NOT IN ('pgx_diag','information_schema')
GROUP BY s.nspname, c.relname
ORDER BY s.nspname, c.relname
) TO '/tmp/invalidutf8.sql';
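
For a hypothetical table public.users with text columns firstname and lastname, the generated file will contain a line like:

CREATE TABLE invalidutf8.public_users AS SELECT * FROM public.users WHERE Is_UTF8(firstname) IS FALSE OR Is_UTF8(lastname) IS FALSE;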

4. Run the script. In psql, type:


\i /tmp/invalidutf8.sql

5. We will now dump the schema and analyze the rows:


pg_dump -f invalidutf8.dump -F p -n invalidutf8 -h 127.0.0.1 --column-inserts -a thedatabasename
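
To get a quick per-table overview of how many offending rows ended up in the dump (a rough check against the plain-format dump produced above):

grep '^INSERT INTO' invalidutf8.dump | awk '{print $3}' | sort | uniq -c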

It is now easy to either delete the rows or convert them to UTF-8, provided you know the source encoding; otherwise you can always replace any non-ASCII characters with whitespace, as sketched below.
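
If you do not know the source encoding, a blunt but effective option is to blank out every non-ASCII byte. A sketch, using the same hypothetical Users/Lastname/UserID names as in the example below:

-- Replace every byte outside 0x01-0x7F with a space, but only in rows already flagged as invalid.
UPDATE Users SET Lastname = regexp_replace(Lastname, E'[^\\x01-\\x7f]', ' ', 'g')
WHERE UserID IN (SELECT UserID FROM invalidutf8.Users);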

Let’s say you notice you have a table Users with a text column Lastname containing invalid UTF-8, which you identify as latin1.
The SQL below updates only the rows with invalid characters and also checks that the text converted from latin1 is valid UTF-8.


-- convert() returns bytea; convert_from(..., 'SQL_ASCII') turns the converted bytes back into
-- text unchanged (a plain ::text cast would give the hex/escape representation of the bytea).
UPDATE Users SET Lastname = convert_from( convert( textsend(Lastname), 'latin1', 'utf8' ), 'SQL_ASCII' )
WHERE UserID IN (
SELECT UserID FROM invalidutf8.Users
WHERE Is_UTF8(
convert_from(
convert(
textsend(invalidutf8.Users.Lastname),
'latin1',
'utf8'
),
'SQL_ASCII'
)
) IS TRUE
);

6. Drop the schema to drop all the invalidutf8.* tables we created.


DROP SCHEMA invalidutf8 CASCADE;

To verify, repeat steps 2-5. If the pg_dump of the invalidutf8 schema doesn’t contain any INSERT statements, you can be certain you have a clean UTF8 database.
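
A quick way to check the fresh dump is to count the INSERT statements in it:

grep -c '^INSERT INTO' invalidutf8.dump    # 0 means no invalid rows were found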

7. It should now be safe to change the encoding of the database to UTF8.


UPDATE pg_database SET encoding = pg_char_to_encoding('UTF8') WHERE datname = 'thedatabasename';
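
To confirm the change took effect (note that sessions opened before the change may not see the new encoding until they reconnect):

SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'thedatabasename';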

Lastly, after changing the encoding, do the whole thing again (steps 2-6) and verify that no invalid UTF8 rows managed to sneak in between your last check and the encoding change.

Done!

I’m not giving any guarantees this is a safe way to do it, but personally it was my only option, as we couldn’t accept any downtime.
So far so good; I guess the only way to know for sure is to do a pg_dump / pg_restore and see if it complains.