I find myself facing a particular dilemma quite often: should I keep this rarely used index, or drop it to improve write performance?
If the index isn’t used at all, there’s no problem, just drop it. But if something, sometimes, is using the index, it might be necessary to keep it.
This something might be a script generating a monthly report, or some rarely used back-office feature. Dropping the index would cause such applications to take forever, due to a sequential scan of the huge table.
In my case, our system processes millions of transactions, and write performance will at some point become a bottleneck.
Dropping indexes would help, but would backfire by causing problems a few times a month, when the indexes are actually needed.
Now I am thinking: for these real-time write transactions, it would be awesome if we could commit at once and postpone the work of updating the rarely used indexes until the point where some read query actually needs them. The indexes could be updated as soon as possible, but not during peak time, allowing the more important real-time transactions to be prioritized.
Now you are thinking: this can’t be done, because the index is useless if it’s not in sync, and that’s correct. But wait, we can fix that. Every time we need to use the index, we simply process the buffer of pending changes to get in sync before the index is used. Processing the buffer would probably take a second or two, which would delay the monthly report batch job, but that’s totally OK: it’s not a real-time feature, it takes many seconds to complete anyway, and it would again take forever without the index in place.
In total CPU time, we would not save anything; the amount of work would be the same. But we would survive write-operation peak times better. We want to maximize write throughput at peak times, and do the updating of the rarely used indexes later, when the system load is lower.
I don’t know how, or if, this could be implemented in PostgreSQL, but the CREATE INDEX CONCURRENTLY feature is in a way similar to what I need: it builds the index outside of a single ACID transaction (it cannot even be run inside a transaction block), and prevents anyone from using the index until it is marked VALID.
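For reference, a concurrent build of one of the indexes from the test below looks like this:

-- Builds the index without blocking concurrent writes to the table.
-- It cannot run inside a transaction block, and if the build fails,
-- the index is left behind marked INVALID and is ignored by queries.
CREATE INDEX CONCURRENTLY Index_Transactions_ProductID
    ON Transactions (ProductID);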
Maybe a background process could handle the updating of indexes, like what we do with the autovacuum daemon.
I can see there has been some discussion on this subject before on the mailing
lists:
“Scaling with lazy index updates”
http://archives.postgresql.org/pgsql-performance/2004-07/msg00114.php
But that was a very long time ago; the conditions may have changed since 2004.
ACID must be maintained, and the index must not be used while it is out of sync. I only want to delay the index updating work a bit, and only be forced to get in sync when a read query makes it necessary.
Thoughts, anyone?
In the test below, we can see how indexes affect insert speed.
joel@Joel-Jacobsons-MacBook-Pro ~ $ psql
psql (9.2beta1)
Type "help" for help.
joel=# CREATE TABLE Transactions (
joel(# TransactionID serial not null,
joel(# Amount numeric not null,
joel(# AccountID integer not null,
joel(# CustomerID integer not null,
joel(# ProductID integer not null,
joel(# Datestamp timestamptz not null default now(),
joel(# PRIMARY KEY (TransactionID)
joel(# -- No foreign keys in this example
joel(# );
NOTICE: CREATE TABLE will create implicit sequence "transactions_transactionid_seq" for serial column "transactions.transactionid"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "transactions_pkey" for table "transactions"
CREATE TABLE
joel=# -- No indexes except the primary key:
joel=#
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Insert on transactions (cost=0.00..20.00 rows=1000 width=4) (actual time=454.232..454.232 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..20.00 rows=1000 width=4) (actual time=14.762..131.197 rows=100000 loops=1)
Total runtime: 489.204 ms
(3 rows)
joel=# CREATE INDEX Index_Transactions_AccountID ON Transactions(AccountID);
CREATE INDEX
joel=#
joel=# -- 1 index:
joel=#
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Insert on transactions (cost=0.00..20.00 rows=1000 width=4) (actual time=739.295..739.295 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..20.00 rows=1000 width=4) (actual time=10.160..135.399 rows=100000 loops=1)
Total runtime: 741.141 ms
(3 rows)
joel=# CREATE INDEX Index_Transactions_CustomerID ON Transactions(CustomerID);
CREATE INDEX
joel=#
joel=# -- 2 indexes:
joel=#
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Insert on transactions (cost=0.00..20.00 rows=1000 width=4) (actual time=1626.515..1626.515 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..20.00 rows=1000 width=4) (actual time=9.739..530.770 rows=100000 loops=1)
Total runtime: 1627.900 ms
(3 rows)
joel=# CREATE INDEX Index_Transactions_ProductID ON Transactions(ProductID);
CREATE INDEX
joel=#
joel=# -- 3 indexes:
joel=#
joel=# EXPLAIN ANALYZE
joel-# INSERT INTO Transactions (Amount,AccountID,CustomerID,ProductID)
joel-# SELECT i,i,i,i FROM generate_series(1,100000) AS i;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Insert on transactions (cost=0.00..20.00 rows=1000 width=4) (actual time=2161.321..2161.321 rows=0 loops=1)
-> Function Scan on generate_series i (cost=0.00..20.00 rows=1000 width=4) (actual time=9.976..549.794 rows=100000 loops=1)
Total runtime: 2164.164 ms
(3 rows)
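In short: with three extra indexes, the same 100,000-row insert takes more than four times as long (2164 ms, versus 489 ms with only the primary key index).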
I use partial indexes for this:
ALTER TABLE foo ADD COLUMN indexed BOOLEAN NOT NULL DEFAULT FALSE;
CREATE INDEX foo_indexed ON foo (indexed);
Then make your expensive indexes conditional on the indexed column:
CREATE INDEX foo_a_b_c ON foo (a, b, c) WHERE indexed;
Now inserts with indexed=false will not update the index. On a regular basis (during non-peak hours) or when you need to sync the index:
UPDATE foo SET indexed = TRUE WHERE NOT indexed;
I’m also using partial indexes; they’re very useful for queues etc., where only a few rows have the value to index.
However, it won’t help when you need to index all values, which I believe is the normal scenario.
Maybe I’m missing something in your example, but in my use case, all rows are to be indexed eventually. The partial index gives you a kind of “staging area” (those rows with indexed=false) so that inserting new data doesn’t block on updating all of the indexes.
Of course, you still have to pay the performance cost of updating those indexes, but you can defer those updates until off-peak hours if you don’t need the data indexed immediately.
I’m using this for web access logs on a moderately trafficked site. The inserts are very fast, but we can still generate relatively current stats hourly after rolling the last hour’s log entries into the partial index.
One minor gotcha with this approach is that you still need to phrase your queries in terms that Postgres can recognize as covered by the partial index, e.g. “WHERE my_condition AND indexed”.
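A minimal illustration using the foo table from above (the value 42 is arbitrary): once the off-peak UPDATE has run, every row has indexed = TRUE, so the extra predicate doesn’t change the result set; it only lets the planner match the partial index.

-- Can use the partial index foo_a_b_c, since "indexed" is implied:
SELECT * FROM foo WHERE a = 42 AND indexed;

-- Cannot use foo_a_b_c; the planner can't prove the predicate holds:
SELECT * FROM foo WHERE a = 42;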
Ah, now I see what you mean.
Nice workaround, I hadn’t thought about this one before!
It would work if you have, like in my example, a lot of columns you want to index. Then you only need to do the extra work of updating the single “indexed” index within the transaction, and can update the other indexes later on.
It won’t help, though, in situations where you have one single rarely used index on a table.
It would of course be a lot better to have this built into Postgres, so you wouldn’t have to customize all your queries, and could avoid the extra work of updating the “indexed” index, keeping the total workload from increasing.
Marking an index as ‘UPDATE ON USE’ or some such could save processing time on very infrequently used indexes, because:
(1) a record that had been updated multiple times since the last update of the index would result in only one index update, not one for every record modification;
(2) very likely, the number of index disk pages needing an update would be a lot smaller than the number of record insertions/deletions/updates (even assuming each changed record had been modified exactly once), which would save on disk I/O;
(3) there is potential for optimising the batch update of the index, maybe by having a bitmap per delayed-update index to note the data pages that would need to be checked for an actual index update (losing the bitmap in a crash would not be too big a loss, as it would only mean a full scan rather than a scan of selected blocks).
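To make that concrete, the DDL might look something like this. To be clear, this is entirely hypothetical syntax: no PostgreSQL version implements an UPDATE ON USE clause or a SYNC INDEX command, and both names are invented purely for illustration.

-- Hypothetical: defer maintenance of this index until it is next used.
CREATE INDEX Index_Transactions_ProductID
    ON Transactions (ProductID)
    UPDATE ON USE;

-- Hypothetical: replay the pending changes explicitly, e.g. off-peak.
SYNC INDEX Index_Transactions_ProductID;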
Another workaround is table partitioning: you create a small buffer table without indexes to insert data into, and a main table with the indexes. At the end of the day you move the data from the buffer table into the main table. That way you control when the indexes get updated.
Partitioning has a few pluses: it is invisible to the application if you set up the inherited tables correctly, you can write to both tables, you get constraint exclusion, and so on…
I used this solution for heavily loaded log tables, and it was a good solution in my case.
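A minimal sketch of that buffer-table setup, using inheritance-style partitioning as it works in the 9.x era; the log_entries names are made up for illustration:

-- Parent table that the application queries; children inherit from it.
CREATE TABLE log_entries (
    id        bigserial   NOT NULL,
    message   text        NOT NULL,
    logged_at timestamptz NOT NULL DEFAULT now()
);

-- Main partition: carries the expensive indexes.
CREATE TABLE log_entries_main () INHERITS (log_entries);
CREATE INDEX log_entries_main_logged_at ON log_entries_main (logged_at);

-- Buffer partition: no indexes, so inserts are cheap. Writes go here.
CREATE TABLE log_entries_buffer () INHERITS (log_entries);

-- Off-peak job: atomically move the buffered rows into the indexed
-- partition (writable CTEs are available since PostgreSQL 9.1).
WITH moved AS (
    DELETE FROM log_entries_buffer RETURNING *
)
INSERT INTO log_entries_main SELECT * FROM moved;

-- SELECT ... FROM log_entries sees the rows from both partitions.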
I like the idea of using a per-index bitmap saying “this part of the table is indexed properly”, rather like how autovacuum keeps track of to-be-vacuumed pages. You could then mark your index as either “live” or “bitrotting”, and use some DDL to switch from one to the other. Switching to the “live” state would use an algorithm similar to CREATE INDEX CONCURRENTLY, but a bit more efficient (since it doesn’t have to scan all pages).
moltonel: Yes, exactly; good description, thanks for finding the correct postgresish words for this. How could it be implemented, then? Where in the source code should one start hacking?
joelonsql: no idea :p Bring the questions to the pg-hackers mailing list; it’s a better medium than a blog post.