Space used with a multi-byte database characterset

We found our database was using a lot more space than when we issued a select dbms_lob.getlength() on CLOB columns. We found our issue was that our database characterset is AL32UTF8, so everything requires more space.

To test this, we first insert a row into each database and then dump the block on which it lives (not shown).


SQL> insert into lob_test values(1,rpad('*',1000,'*');

SQL> select value from v$nls_parameters where parameter = 'NLS_CHARACTERSET';

VALUE
----------------------------------------------------------------
WE8ISO8859P1

SQL>

…and in the block dump we see there are no null characters between the “*” characters we inserted (hexadecimal 2a, or 42 ASCII)…

col  0: [ 2]  c1 02
col  1: [3936]
 00 54 00 01 02 0c 00 00 00 01 00 00 00 01 00 00 00 05 04 2f 0f 4c 09 00 00
 00 00 00 0f 3c 00 00 00 00 00 01 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a

…but with a multi-byte characterset, we see there *are* null charaters (00 hex) between the ‘*’ characters we inserted (hexadecimal 2a, or 42 ASCII)…

SQL> select value from v$nls_parameters where parameter = 'NLS_CHARACTERSET';

VALUE
----------------------------------------------------------------
AL32UTF8

SQL>

col  0: [ 2]  c1 02
col  1: [2036]
 00 54 00 01 02 0c 80 00 00 02 00 00 00 01 00 02 3e 52 c5 e9 07 e0 09 00 00
 00 00 00 07 d0 00 00 00 00 00 01 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a
 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00
 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a
 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00
 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a

These nulls require real space, since they live between two other bytes on disk. The database characterset is a binding contract with oracle. When you define it as multi-byte, you allow oracle to simply issue a C fseek() call with two bytes in the call, rather than getting one and deciding whether the next byte is part of the current data, or part of some other row stored in the file. This makes it much faster and more efficient (less CPU for the check and less I/O for any calls that are a result). Oracle allocates (at least) two bytes for every character, even if the second one is null.

Any CLOB with less than 4000 bytes is stored in row. 36 bytes is used as control information, and the remaining can be used for the CLOB data. Since we use Unicode, however, any rows with less than 1982 bytes ((4000 – 36) / 2 bytes per character) are stored in row. That represents a lot of our INDEXDATA column rows. That is why the INDEXDATA CLOB segment is small (429GB) relative to the XMLRECORD (almost 1200GB) because the XMLRECORD is stored out of row in its CLOB segment, but the INDEXDATA CLOB “fits” in row. That is also what makes our table larger, since the INDEXDATA to a large extent is stored in the table row, rather than in the CLOB segment.

The bottom line is we require twice as much space as simple math dictates, so any estimate of space requirements we get based on “raw text” XML files need to account for this.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.