I find myself facing a particular dilemma quite often.
Should I keep this rarely used index or drop it to enhance write performance?
If the index isn’t used at all, it’s no problem, just drop it, but if
something, sometimes is using the index, it might be necessary to keep it.
This something might be some script generating a monthly report or some rarely
used back office feature. Dropping the index would cause such applications
to take for ever due to seq_scan of the huge table.
In my case, our system processes millions of transactions, and the
write-performance will at some time become a bottle neck.
Dropping indexes would help me, but would backfire by causing problems a few
times a months when the indexes are necessary.
Now I am thinking, for these real-time write transactions, it would be awesome
if we commit at once, and post-pone the work of updating the rarely used
indexes until the point where some other read-query needs them. They could be
updated as soon as possible, but not during peak-time, allowing more important
real-time transactions to be prioritized.
Now you are thinking, this can’t be done, because the index is useless if it’s
not in sync, and that’s correct. But wait, we can fix that. Every time we need
to use the index, we simply process the buffer to get in sync before the index
is being used. It would probably take a second or two to process the buffer,
which would delay the monthly report batch job, but that’s totally OK as it’s
not a real-time feature and takes many seconds to complete anyway, but, would
again take forever without the index in place.
In total CPU time, we would not save anything. The amount of work would be the
same. But we would survive better during write-operation peak-times. We want
to maximize write throughput at peak-times, and do the updating of the rarely
used indexes later when the system load is lower.
I don’t know how/if this could be implemented in PostgreSQL, but the recently
added feature CREATE INDEX CONCURRENTLY is in a way similar to what I need,
it updates the index outside of a ACID transaction, and prevents anyone from
using the index until the index is marked as VALID.
Maybe a background process could handle the updating of indexes, like the
what we do with the vacuum daemon.
I can see there has been some discussion on this subject before on the mailing
lists:
“Scaling with lazy index updates”
http://archives.postgresql.org/pgsql-performance/2004-07/msg00114.php
But this was a very long time ago, the conditions may have changed since
2004.
ACID must be maintained, the index must not be used if not in sync, I only
want to delay the index updating work a bit, only be forced to get in sync
when necessary by a read query.
Thoughts, anyone?
In the test below, we can see how indexes affects insert speed.
joel@Joel-Jacobsons-MacBook-Pro ~ $ psql
psql (9.2beta1)
Type "help" for help.
joel=# CREATE TABLE Transactions (
joel(# TransactionID serial not null,
joel(# Amount numeric not null,
joel(# AccountID integer not null,
joel(# CustomerID integer not null,
joel(# ProductID integer not null,
joel(# Datestamp timestamptz not null default now(),
joel(# PRIMARY KEY (TransactionID)
joel(# -- No foreign keys in this example
joel(# );
NOTICE: CREATE TABLE will create implicit sequence "transactions_transactionid_seq" for serial column "transactions.transactionid"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "transactions_pkey" for table "transactions"
CREATE TABLE
joel=# -- No indexes except the primary key:
joel=#
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Insert on transactions (cost=0.00..20.00 rows=1000 width=4) (actual time=454.232..454.232 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..20.00 rows=1000 width=4) (actual time=14.762..131.197 rows=100000 loops=1)
Total runtime: 489.204 ms
(3 rows)
joel=# CREATE INDEX Index_Transactions_AccountID ON Transactions(AccountID);
CREATE INDEX
joel=#
joel=# -- 1 index:
joel=#
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Insert on transactions (cost=0.00..20.00 rows=1000 width=4) (actual time=739.295..739.295 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..20.00 rows=1000 width=4) (actual time=10.160..135.399 rows=100000 loops=1)
Total runtime: 741.141 ms
(3 rows)
joel=# CREATE INDEX Index_Transactions_CustomerID ON Transactions(CustomerID);
CREATE INDEX
joel=#
joel=# -- 2 indexes:
joel=#
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Insert on transactions (cost=0.00..20.00 rows=1000 width=4) (actual time=1626.515..1626.515 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..20.00 rows=1000 width=4) (actual time=9.739..530.770 rows=100000 loops=1)
Total runtime: 1627.900 ms
(3 rows)
joel=# CREATE INDEX Index_Transactions_ProductID ON Transactions(ProductID);
CREATE INDEX
joel=#
joel=# -- 3 indexes:
joel=#
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Insert on transactions (cost=0.00..20.00 rows=1000 width=4) (actual time=2161.321..2161.321 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..20.00 rows=1000 width=4) (actual time=9.976..549.794 rows=100000 loops=1)
Total runtime: 2164.164 ms
(3 rows)
I use partial indexes for this:
ALTER TABLE foo ADD COLUMN indexed BOOLEAN NOT NULL DEFAULT FALSE;
CREATE INDEX foo_indexed ON foo (indexed);
Then make your expensive indexes conditional on the indexed column:
CREATE INDEX foo_a_b_c ON foo (a, b, c) WHERE indexed;
Now inserts with indexed=false will not update the index. On a regular basis (during non-peak hours) or when you need to sync the index:
UPDATE foo SET indexed = TRUE WHERE NOT indexed;
I’m also using partial indexes, very useful for queues etc, where only a few rows have the value to index.
However, it won’t help when you need to index all values, which I would believe is the normal scenario.
Maybe I’m missing something in your example, but in my use case, all rows are to be indexed eventually. The partial index gives you a kind of “staging area” (those rows with indexed=false) so that inserting new data doesn’t block on updating all of the indexes.
Of course, you still have to pay the performance cost of updating those indexes, but you can defer those updates until off-peak hours if you don’t need the data indexed immediately.
I’m using this for web access logs on a moderately trafficked site. The inserts are very fast, but we can still generate relatively current stats hourly after rolling the last hour’s log entries into the partial index.
One minor gotcha with this approach is that you still need to phrase your queries in terms that Postgres can recognize as covered by the partial index, e.g. “WHERE my_condition AND indexed”.
Ah, now I see what you mean.
Nice work-around, hadn’t thought about this one before!
It would work if you have, like in my example, a lot of columns
you want to index. Then you only need to do the extra work of updating
the single indexed index within the transaction, and can do the other indexes
later on.
This won’t help though for situations where you have one single rarely
used index on a table.
It would of course be a lot better to have this built-into postgres, so you
wouldn’t have to customize all queries and avoid the extra work required to
update the indexed index, preventing the total work-load from increasing.
My husband and i have been quite reiveled when Louis managed to do his survey because of the precious recommendations he was given using your site. It’s not at all simplistic to just choose to be offering concepts that the rest could have been making money from. And we also know we’ve got the writer to thank for this. The type of illustrations you have made, the easy blog menu, the friendships you will make it easier to engender it’s mostly sensational, and it’s helping our son and our family reckon that this topic is amusing, which is truly fundamental. Thanks for all!
w1Rzpg szqojfieagpg
Marking an index as ‘UPDATE ON USE’ or some such, could save processing time on very infrequently used indexes, because:
(1) a record that had been updated multiple times since the last update of the index, would only result in just one index update and not one for every record modification
(2) very likely, the number of disk pages of the index that needed updating would be a lot less than the number or record insertions/deletes/updates (even assuming each recorded had been changed, had been modified exactly once), this would save on disk I/O
(3) potential for optimising batch update of the index, maybe having a bit map per delayed up index, to note the data pages that would need to be check for an actual index update? (losing them in a crash would not be too big a loss, as it would only mean a full scan rather than a scan of selected blocks)
another workaround is using table partitioning. When you create small buffer table without indexes to insert data and main table with indexes to update. At end of day move data to main table from buffer. I that case you can control when you use indexes.
Partitioning have few pluses – invisible for application if u correct inherits tables,
writes on two tables, constraint exclusion and so on…
i used this solution to heavy loaded tables for logs and it was good solution in my case.
I like the idea of using a per-index bitmap saying “this part of the table is indexed properly”. Rather like how autovacuum keeps track of to-be-vaccuumed pages. You could then mark your index as either “live” or “bitrotting”, and use some DDL to switch from one to the other. Switching to “live” state would use an algorythm similar to “create index concurrently”, but a bit more efficient (since it doesn’t have to scan all pages).
moltonel: Yes exactly, good description, thanks for finding the correct postgresish words for this. How could it be implemented then? Where in the source code should one start hacking?
joelonsql: no idea :p Bring the questions on the pg-hackers mailing list, it’s a better medium than a blog post.
I have realized that oinnle diploma is getting preferred because obtaining your degree oinnle has turned into a popular option for many people. A large number of people have not necessarily had a chance to attend a traditional college or university however seek the increased earning possibilities and career advancement that a Bachelors Degree gives you. Still others might have a college degree in one course but would wish to pursue one thing they now possess an interest in.