I’ve recently been reminded, in no uncertain terms, of the importance of running the “ANALYZE” (or “VACUUM ANALYZE”) command periodically on a PostgreSQL database installation.
My problem:
Amongst others, a production database I administer has three tables of interest: two moderately large tables, A(1) and B, and one quite large one (>700 000 records), C (for “Cross”), which is essentially the full cross product of A and B, with an additional piece of information for each tuple (record). So I’m associating some information with every combination of records from tables A and B.
Queries pulling the data for a single record of A or B worked quickly, as the database engine correctly used a lookup index instead of scanning the entire table. But any query that required records from C corresponding to more than one record of both A and B immediately triggered a sequential scan. So, say, for two records from A, a query that should take at most (2 x 1ms) was taking about 1.5s — nearly a thousand times longer, and very noticeable in applications.
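For anyone who wants to check what their own planner is choosing, EXPLAIN prints the plan without running the query. A rough sketch, with a_id standing in for whatever C’s real key column is called:
-- single-record lookup: queries like this were planned as an index scan (the fast case)
EXPLAIN SELECT * FROM C WHERE a_id = 1;
-- multi-record lookup: queries like this fell back to a sequential scan over C (the slow case)
EXPLAIN SELECT * FROM C WHERE a_id IN (1, 2);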
Solution 1: Split the queries
One could, of course, handle this problem in the application code, by issuing a separate query for each record from A and combining the results in code. But this is really messy, and anyway surely direct data manipulation in PostgreSQL should be faster than PHP’s very flexible but not especially speedy number handling!
Solution 2: Set enable_seqscan to “off”
My initial research turned up a connection-wide variable, “enable_seqscan”, which can be set to off using
SET enable_seqscan TO off;
Once set, the planner avoids sequential scans wherever an index scan is possible. This works, and doesn’t require much additional application code. But it’s a blunt tool, as it discourages ALL sequential scans, even those that would in fact be more efficient.
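If you do go this route, it’s worth scoping the setting as tightly as possible and restoring the default afterwards. Roughly, again with the made-up column a_id:
SET enable_seqscan TO off;               -- discourage sequential scans for this session
SELECT * FROM C WHERE a_id IN (1, 2);    -- now planned using the index
RESET enable_seqscan;                    -- back to the default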
Really, what I wanted was for PostgreSQL’s query analyzer and planner to be better, i.e. to stop living in a fantasy world where sequential scans on table C were relatively fast. Turns out, however, that my problem was exactly that the planner was in such a world:
Solution 3: Run “ANALYZE”
This command takes a random sampling of rows from each table and compiles statistics about them to help the planner plan. And it turned out I hadn’t run it in ages; in fact, not since C had been a much, much smaller table. After running the command (it took a few seconds on over 50 tables), all my queries executed exactly as I wanted them to. Everywhere. A PHP page that took over two seconds to load is now effectively instantaneous, and much else is far improved.
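For completeness, the commands themselves, typed at a psql prompt connected to the database in question. A bare ANALYZE covers every table in the current database; naming a table restricts it to that one:
ANALYZE;           -- refresh statistics for all tables in the current database
ANALYZE C;         -- or just for the one table that has grown
VACUUM ANALYZE;    -- also reclaims space left behind by deleted and updated rows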
Needless to say, my crontab now includes a nightly “VACUUM ANALYZE”. Yours should too.
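The entry itself is nothing fancy; something along these lines, where the 3am schedule and the database name mydb are placeholders for my actual setup, and vacuumdb --analyze is the command-line equivalent of VACUUM ANALYZE:
# nightly at 3am: vacuum and analyze the whole database
0 3 * * * vacuumdb --quiet --analyze mydb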
(1) Names have been changed to protect witness confidentiality.
Can you go into detail on how to do a full analyze on all tables in a database? Maybe that info is easily available, but yours was one of the first posts that came up in Google, so it would be a good plus to have that info here.
Thanks.
Yeah, I’m really not sure how to be more precise; Solution 3 pretty much says it all: you run the command “ANALYZE”. Alternatively, you can run “VACUUM ANALYZE”, which cleans up deleted records as well, as a regular nightly cron job.
Details are in the PostgreSQL docs, e.g. http://www.postgresql.org/docs/8.2/interactive/sql-analyze.html
But if none of this means anything to you, you have a lot of reading to do 🙂
Maybe the question is “where do you run ANALYZE”. I first came upon ANALYZE after reading up on vacuum, which you can run from the command line as “vacuumdb -z <dbname>”.
It took me a while to understand that ANALYZE and VACUUM ANALYZE are meant to be run from within the psql command line. I guess the all-caps is kind of a giveaway, but the talk of cron jobs confused me for a little bit. I kept looking for a command to run at the shell, rather than a command to run with psql.
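So for a cron job you either use the vacuumdb wrapper mentioned above, or have psql run the SQL for you non-interactively, something like this (mydb is just a stand-in for your database name):
# equivalent ways to run it from the shell, and hence from a crontab
vacuumdb -z mydb
psql -d mydb -c "VACUUM ANALYZE"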
Hope that helps.