Thanks to Tomas Vondra’s presentation CREATE STATISTICS – What is it for? at Nordic PGDay 2018, my Trustly colleague Claes Jakobsson came up with the idea for a new extension.

The presentation showed how to compare the planner’s estimated row count with the actual row count by manually running EXPLAIN ANALYZE.
Claes’s idea was to perform this comparison automatically, right after each query has been executed.
During the conference, Claes implemented the idea and emailed me some code to test.

Below is a demo of this extension, using the same data set and queries as in the presentation.

git clone https://github.com/tvondra/create-statistics-talk.git
psql -X -f create-statistics-talk/create.sql

First, load the data set used in the presentation:

\copy zip_codes from ~/src/create-statistics-talk/no_postal_codes_utf.csv with csv header;
COPY 4574

Let’s first try the query without pg_badplan, using EXPLAIN ANALYZE:

EXPLAIN (ANALYZE, TIMING off)
SELECT * FROM zip_codes WHERE city = 'Oslo' AND county = 'Oslo';
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Seq Scan on zip_codes  (cost=0.00..108.61 rows=90 width=36) (actual rows=642 loops=1)
   Filter: ((city = 'Oslo'::text) AND (county = 'Oslo'::text))
   Rows Removed by Filter: 3932
 Planning time: 0.357 ms
 Execution time: 0.679 ms
(5 rows)

Comparing the actual 642 rows with the estimated 90 rows, we can manually see the ratio is approximately 7.133 (= 642/90).
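
The planner is this far off because it multiplies the selectivities of the two conditions, assuming city and county are statistically independent, while in this data set a city almost always determines its county. A quick sanity check of that dependency (using the column names from the demo table; output omitted):

SELECT city, count(DISTINCT county) AS counties
FROM zip_codes
GROUP BY city
HAVING count(DISTINCT county) > 1;

If this returns few or no rows, city functionally determines county, which is exactly the kind of correlation CREATE STATISTICS with dependencies is designed to capture.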

Next, let’s see how we can automate this manual process using pg_badplan.

git clone https://github.com/trustly/pg_badplan.git
cd pg_badplan
make install
echo "shared_preload_libraries = 'pg_badplan'" >> /usr/local/var/postgres/postgresql.conf
echo "pg_badplan.min_row_threshold = 10" >> /usr/local/var/postgres/postgresql.conf
brew services restart postgresql
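
After the restart, it is worth verifying that the library was actually loaded; SHOW should report pg_badplan:

psql -X -c 'SHOW shared_preload_libraries;'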

pg_badplan.min_row_threshold has been set to 10 just for testing;
the default is 1000, but the query in the example returns fewer rows than that, so it would otherwise be ignored.
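
Logging is also governed by a ratio threshold, pg_badplan.ratio, which we will see in action below. Assuming it is an ordinary GUC like min_row_threshold, it can be tuned the same way if the default logs too much or too little (5.0 here is just an illustrative value, not the default):

echo "pg_badplan.ratio = 5.0" >> /usr/local/var/postgres/postgresql.conf
brew services restart postgresql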

We can now run the same query without EXPLAIN ANALYZE,
and rely on pg_badplan to notify us via the normal log file.

SELECT * FROM zip_codes WHERE city = 'Oslo' AND county = 'Oslo';
tail -f /usr/local/var/postgres/log/postgresql.csv | grep pg_badplan
2018-03-19 11:46:54.154 CET,"joel","joel",91666,"[local]",5aaf951a.16612,5,"SELECT",2018-03-19 11:46:50 CET,4/34,0,LOG,00000,"pg_badplan: rows expected/actual ratio 7.133 exceeded for query SELECT * FROM zip_codes WHERE city = 'Oslo' AND county = 'Oslo';",,,,,,"SELECT * FROM zip_codes WHERE city = 'Oslo' AND county = 'Oslo';",,,"psql"

Finally, after analyzing the pg_badplan log output,
we will hopefully be smart enough to realize that
CREATE STATISTICS would help the planner make
better estimates, which in turn helps it select
better, faster plans.

CREATE STATISTICS s (dependencies) ON city, state, county FROM zip_codes;
ANALYZE zip_codes;
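
On PostgreSQL 10 and 11 we can inspect the functional dependencies that ANALYZE computed for the new statistics object (newer versions store this data in pg_statistic_ext_data instead):

SELECT stxname, stxdependencies
FROM pg_statistic_ext
WHERE stxname = 's';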

Run the same query again, and watch how pg_badplan now stays silent, since the ratio between the estimated and actual rows is below the pg_badplan.ratio threshold value.
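
To double-check, we can also run EXPLAIN ANALYZE one more time; with the dependency statistics in place, the rows estimate should land much closer to the actual 642:

EXPLAIN (ANALYZE, TIMING off)
SELECT * FROM zip_codes WHERE city = 'Oslo' AND county = 'Oslo';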

The code is available here: https://github.com/trustly/pg_badplan