CLUSTERING_FACTOR

I had to analyze a performance problem this morning, and it occurred to me that it makes a good example of troubleshooting a query impacted by the clustering_factor of an index.  Below is the email exchange.

I took a look, and I think I understand why the query optimizer is choosing a full scan instead of the index on institution_number (column1 in the anonymized examples below).

The issue is the UPDATE_TIME filter.  If we were just asking for column2 for institution 22, Oracle could resolve the query entirely from the index on column1, column2.  Since we are also asking for UPDATE_TIME, we will *have* to visit the table at some point.  That is when it gets dicey.  There are only about 6 million storage *blocks* in the entire table, as the rows are fairly “skinny”.  There are about 3 million holdings for OSU.  If Oracle uses the index, it will get the 3 million rowids for that column1 value from the index, then look up each row one by one to see if it meets the UPDATE_TIME criterion.
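To make this concrete, the query in question was shaped roughly like the following.  The table and column names are the anonymized stand-ins used throughout this post, and the cutoff is just a bind variable:

SELECT column2, update_time
  FROM holdings
 WHERE column1 = 22                  -- institution_number for OSU
   AND update_time > :some_cutoff;   -- the filter that forces table visits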

The deciding factor for the query optimizer is what Oracle calls the “clustering_factor”.  It is a measure of how closely the physical order of the rows in the table matches the order of the entries in the index.  The bad news is that ours match hardly at all: the next entry in the index almost never refers to the same table block as the previous index entry did.
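As an aside, you can see the clustering_factor Oracle calculated at statistics-gathering time in the data dictionary.  A quick check (holdings is again the stand-in table name):

SELECT index_name, clustering_factor, num_rows, leaf_blocks
  FROM user_indexes
 WHERE table_name = 'HOLDINGS';

A clustering_factor near the number of blocks in the table means adjacent index entries tend to point into the same block; a value near num_rows, which is what we have here, means nearly every index entry points to a different block than the one before.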

Here is what that looks like for OSU (column1 = 22) and its entries in the column1, column2 index:

column1 22 – table rowid = file 12, block 123456
column1 22 – table rowid = file 14, block 345126
column1 22 – table rowid = file 15, block 759253
column1 22 – table rowid = file 12, block 784359
column1 22 – table rowid = file 15, block 834127

This requires Oracle to jump back and forth between the index and the table, reading scattered blocks one at a time (that is the real problem).  Since there are 3 million holdings, read one at a time from up to 3 million different table blocks, Oracle finds it quicker to full scan the 6 million blocks of the table 16 at a time, which works out to only about 375,000 multiblock reads, as the overall throughput will be better.  Think of it like looking up a word in a book’s index, where the word never occurs twice on the same page, and the pages on which it does occur are in no particular sorted order.  If you can speed read, it may be quicker to just read the whole book once the number of entries gets large enough.
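If you want to watch the optimizer make this choice, DBMS_XPLAN will show the access path and cost it settled on (same anonymized names as above):

EXPLAIN PLAN FOR
SELECT column2, update_time
  FROM holdings
 WHERE column1 = 22
   AND update_time > :some_cutoff;

SELECT * FROM TABLE(dbms_xplan.display);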

The only solutions I can think of are:

1.  Run it in parallel with a higher db_file_multiblock_read_count parameter value – I am the last guy on the block to suggest parallel query, as the “need” for it usually indicates a fixable design issue.  In this case, however, it may be the least invasive option, assuming you won’t be running multiple copies of this SQL simultaneously.  I retrieved the entire set in about 90 seconds with db_file_multiblock_read_count=128 and parallel 8, which leads me to believe you could probably do it with parallel 4 (see the SQL sketch after this list).
2.  Add update_time to the column1, column2 index (also sketched below).  This would allow the same query to be resolved in about a minute, as it could be answered from the index alone, with no table visits at all.  It may also be useful for other queries that need update_time.
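Both options in SQL, with the caveat that these are sketches against the anonymized names: the index name is made up, and the hint degree and parameter value are simply the ones from my test.

-- Option 1: brute force.  Raise the multiblock read size for the
-- session and let parallel query slaves divide up the full scan.
ALTER SESSION SET db_file_multiblock_read_count = 128;

SELECT /*+ PARALLEL(h, 8) */ column2, update_time
  FROM holdings h
 WHERE column1 = 22
   AND update_time > :some_cutoff;

-- Option 2: make the index cover the query, so the table (and with it
-- the clustering_factor) no longer matters.  The old column1, column2
-- index becomes redundant and can be dropped afterwards.
CREATE INDEX holdings_c1_c2_ut_ix
    ON holdings (column1, column2, update_time);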
