
The CTAS, a.k.a. Processing Data the Jetsons' Way

**Note:** This is an article I wrote about four years ago that, as far as I know, can only be found in hard copy. Although I've changed my stance on statistics, etc., with 11g/12c, the main point of CTAS benefits still stands the test of time.

As a kid, did you ever watch Hanna-Barbera's cartoon "The Jetsons," see Elroy drop out of George Jetson's flying saucer to go to school in his personal hyper-pod, and think, "Forget the bus, I want to go to school in the hyper-pod!"?

For any developer or DBA working with large objects in an OLAP environment, updating and deleting large tables, even when partitioned, can feel like you are on that school bus again. So how do you get off the school bus and ride Elroy's hyper-pod to get things done, Jetsons style?

In my time working with large objects, I've seen and discovered more ways to use the CTAS ("create table as select," for those who are new to the acronym!) than I ever thought possible. Any DBA or developer in a warehouse or mart environment should make it a point to add every way this feature can be used to their arsenal.

For our current-day vs. Jetsons-futuristic comparison, we'll say our environment consists of data partitioned by company_id, residing mostly in four primary tables. This partitioning method has been quite successful for how the business loads and reports on data, but your first challenge is to prune all purchase orders older than five years out of the system. Once that's complete, the business has some reporting requirements for you to invest time in, so keep an open mind and be ready to work through some challenges!

The tables involved have the following profiles, with example columns and sizes:

COMPANY_ACCT, 253 columns, 296GB, 1000 partitions total.
C_ID (partitioning key)
C_NBR
ACCT_TYPE_CD
SUB_CID
DEPT_ID
ADDRESS
PO_BOX
CITY
STATE
ZIP
LAST_ACTIVE_DT
… so on and so forth…

COMPANY_TRANS, 120 columns, 312GB, 1000 partitions total.
C_ID(partitioning key),
TRAN_NBR
TRAN_DT
… so on and so forth…

ID_COMP_DIM, 412 columns, 212GB, 1000 partitions total.
C_ID(partitioning key),
C_MOD_KEY,
COMP_MOD_ID,
… so on and so forth…

GEO_COMP_DIM, 82 columns, 180GB, 1000 partitions total.
C_ID,
GEO_KEY,
TERR_CD,
… so on and so forth…

The traditional OLTP processing method to tackle the removal would be to either:

  • Delete with a WHERE clause against TRAN_DT or LAST_ACTIVE_DT, with an index on those columns.
  • Bulk-process the same deletes, but in parallel, using C_ID (the company ID) for the most efficient performance.

How could you perform the delete faster? How could you accomplish the task without issuing a DELETE statement at all?

Elroy Jetson would say, "CTAS it!" Anything that can be done can be done faster, more efficiently, and with more style when it's done with a CTAS. Pair CTAS with the exchange partition feature and you have a match made in heaven, with opportunities anyone can appreciate.

Now to return to our challenge at hand: let's take our first table, COMPANY_ACCT, and remove all the accounts that haven't been active in over five years.

The traditional way,  version 1:

DELETE FROM COMPANY_ACCT
WHERE LAST_ACTIVE_DT < TO_DATE('01-JUN-2005','DD-MON-YYYY');

Or version 2, process via a procedure that accepts the company ID, run multi-threaded or in parallel:

DELETE FROM COMPANY_ACCT
WHERE C_ID=:b1
AND LAST_ACTIVE_DT < TO_DATE('01-JUN-2005','DD-MON-YYYY');

After either of these approaches, you will have stale statistics and a considerable number of empty blocks: blocks that will still be scanned on every hash and table scan until you "coalesce" this partition. This is something you must keep in mind when you process data the old-fashioned way.
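As an aside, if you do stay with the old-fashioned delete, one option for reclaiming those empty blocks is an online segment shrink. This is a sketch, assuming the partition lives in an ASSM tablespace (shrink is not available for freelist-managed segments) and that enabling row movement is acceptable:

-- shrink requires row movement on the table
ALTER TABLE COMPANY_ACCT ENABLE ROW MOVEMENT;
-- compact the partition and release the freed space
ALTER TABLE COMPANY_ACCT MODIFY PARTITION P_229 SHRINK SPACE;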

How much more efficient would it be to process this via a CTAS with an exchange partition?

**Note:** there is no DELETE in the following statement!

CREATE TABLE COMPANY_ACCT_229 AS
SELECT * FROM COMPANY_ACCT
WHERE C_ID=229
AND LAST_ACTIVE_DT > TO_DATE('01-JUN-2005','DD-MON-YYYY');

-- COMPANY_ACCT_229 is a standalone table, so the index is not declared LOCAL
CREATE INDEX CA_229_idx1 ON COMPANY_ACCT_229(LAST_ACTIVE_DT);

Update the statistics, and note that I use very, very small sample sizes for my estimate_percent. My data is very consistent across data sets, but there is a very large amount of it, which gives me the luxury of collecting stats without the time required for larger sample sizes. Due to the number of partitions, I am careful to gather only what I need, so as not to impact the performance of the data dictionary. I do not like seeing long operations on tables such as col$, tab$, ind$, etc. If you see this, look into the amount of statistics you are gathering: are you collecting what you need, or just flooding the data dictionary with data that cannot give you the consistent, solid execution plans that are your goal?

exec dbms_stats.gather_table_stats(ownname=>'OWNER', tabname=>'COMPANY_ACCT_229', estimate_percent=>.00001);

We are now ready to exchange our partition:

ALTER TABLE COMPANY_ACCT EXCHANGE PARTITION P_229 WITH TABLE COMPANY_ACCT_229 UPDATE INDEXES;

Your "delete" of this partition's data is complete. There is no fragmentation, and you have refreshed the stale statistics; you are as clean as possible for all processing going forward.
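One housekeeping note: the exchange is a segment swap, so the standalone COMPANY_ACCT_229 table now holds the partition's original data, old rows included. Once you have verified the exchange, drop it to finish the job:

-- the standalone table now contains the pre-exchange data
DROP TABLE COMPANY_ACCT_229 PURGE;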

What are the cost savings?

Original delete process per partition, no update of statistics, on average: 28 min.

CTAS and exchange partition, including index creation but minus the statistics update, on average: 7 min.

What if you're challenged to reproduce the whole table, all the partitions, without the older-than-five-years data? How can you complete that task with the CTAS method?

CREATE TABLE DW_PROD.COMPANY_ACCTS2
(
  C_ID          NUMBER not null,
  C_NBR         VARCHAR2(30) not null,
  ACCT_TYPE_CD  VARCHAR2(1) not null,
  …             -- columns to be included in the new table
)
partition by list (C_ID)
(
  partition P_1001 values (1001),
  partition P_1022 values (1022),
  partition P_1023 values (1023),
  partition P_1024 values (1024),
  partition P_1025 values (1025),
  partition P_1026 values (1026),
  partition P_1029 values (1029),
  partition P_1031 values (1031),
  …             -- remainder of the partition list
)
<storage parameters> PARALLEL (DEGREE 8)
AS select /*+ PARALLEL(ca 8) */
ca.C_ID,
ca.C_NBR,
ca.ACCT_TYPE_CD,
…
FROM COMPANY_ACCT ca
WHERE ca.LAST_ACTIVE_DT > TO_DATE('01-JUN-2005','DD-MON-YYYY');

Create the indexes that exist on the original, update the statistics on the secondary table, and then perform a rename: rename the original to "old" and COMPANY_ACCTS2 to the original name, COMPANY_ACCT.
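A minimal sketch of that swap (any indexes, grants, and synonyms would need the same housekeeping):

-- keep the original until the new table is verified
ALTER TABLE COMPANY_ACCT RENAME TO COMPANY_ACCT_OLD;
-- the pruned copy takes over the original name
ALTER TABLE COMPANY_ACCTS2 RENAME TO COMPANY_ACCT;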

What are the time savings of performing it this way vs. the original?

Delete Statement, in parallel, against all the partitions, WITHOUT update of statistics:  23 hrs, 12 min.

CTAS and rename, INCLUDING update of statistics:  7 hrs, 4 min.

This is nearly a 70% time savings, over 16 hours less maintenance window, whether counted against the DBA's personal time or as outage to the business. How can you say no to this?

Dimensional data the CTAS way…

As a special request from the business, a developer needs to create a table with multiple joins to three of the largest tables. The developer makes multiple attempts, but must stop due to performance hits on the database.

SELECT /*+ parallel(g 8) parallel(i 8) use_hash(g i) */
       i.c_id,
       i.c_mod_key,
       i.comp_mod_id,
       DECODE (i.c_id,
               0, 'N',
               NVL2 (g.geo_tract, 'B', 'Z'))
          AS cmatch,
       g.next_col1,
       g.next_col2,
       g.next_col3,
       g.next_col4,
       g.next_col5,
       g.next_col6,
       ...
       g.next_col19
  FROM geo_comp_dim g, id_comp_dim i
 WHERE g.c_id = i.c_id;

The explain plan shows the impressive cost of this query, along with the outrageous amount of temporary tablespace required to perform the hash join and sort.

Explain Plan:

Description                          Owner     Object         Cost     Rows       Temp
SELECT STATEMENT, GOAL = ALL_ROWS                              1145371  176422800
 PX COORDINATOR
  PX SEND QC (ORDER)                 SYS       :TQ10002        1145371  176422800
   SORT ORDER BY                                               1145371  176422800  2.4 TB
    PX RECEIVE                                                 3035     176422800
     PX SEND RANGE                   SYS       :TQ10001        3035     176422800
      HASH JOIN                                                3035     176422800
       PX RECEIVE                                              163      373294
        PX SEND BROADCAST            SYS       :TQ10000        163      373294
         PX BLOCK ITERATOR                                     163      373294
          TABLE ACCESS FULL          DM_OWNER  GEO_COMP_DIM    163      373294
       PX BLOCK ITERATOR                                       2730     176422800
        TABLE ACCESS FULL            DM_OWNER  ID_COMP_DIM     2730     176422800

What carries the heaviest weight in the cost of this query?

The hash join between the two tables. But what does the query really need from them? It needs almost everything from GEO_COMP_DIM, which is a relatively small cost; from ID_COMP_DIM, however, the developer needs only three columns out of the total 412!

Those of us who have studied hash joins know to expect the hash join to require roughly 1.6 times the size of the table we lead in with to build the hash. What if we created a smaller lead table for the hash, then?
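To put rough numbers on that (a back-of-envelope estimate, assuming purely for the sake of the arithmetic that the three columns are of about average width for the table):

    Full ID_COMP_DIM as the hash input:  212 GB x 1.6 ≈ 340 GB
    Three columns of the 412:            212 GB x 3/412 ≈ 1.5 GB, and 1.5 GB x 1.6 ≈ 2.5 GB

A narrow copy of just the needed columns shrinks the hash input by roughly two orders of magnitude, which is what drives the drop in cost and temp usage you will see shortly.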

CTAS to the rescue again!

CREATE TABLE ID_COMP_CID_DATA AS
SELECT C_ID, C_MOD_KEY, COMP_MOD_ID FROM ID_COMP_DIM;

Replace the original ID_COMP_DIM table in the query with ID_COMP_CID_DATA.  What is the performance improvement?
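With the swap in place, the only change to the developer's query is the table in the FROM clause; the hints and predicate stay exactly as they were:

SELECT /*+ parallel(g 8) parallel(i 8) use_hash(g i) */
       i.c_id,
       i.c_mod_key,
       i.comp_mod_id,
       DECODE (i.c_id,
               0, 'N',
               NVL2 (g.geo_tract, 'B', 'Z'))
          AS cmatch,
       g.next_col1,
       ...
       g.next_col19
  FROM geo_comp_dim g, id_comp_cid_data i
 WHERE g.c_id = i.c_id;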

 

Description                          Owner     Object             Cost   Rows       Temp
SELECT STATEMENT, GOAL = ALL_ROWS                                 41289  176422800
 PX COORDINATOR
  PX SEND QC (ORDER)                 SYS       :TQ10002           41289  176422800
   SORT ORDER BY                                                  41289  176422800  153 GB
    PX RECEIVE                                                    623    176422800
     PX SEND RANGE                   SYS       :TQ10001           623    176422800
      HASH JOIN                                                   623    176422800
       PX RECEIVE                                                 163    373294
        PX SEND BROADCAST            SYS       :TQ10000           163    373294
         PX BLOCK ITERATOR                                        163    373294
          TABLE ACCESS FULL          DM_OWNER  GEO_COMP_DIM       163    373294
       PX BLOCK ITERATOR                                          312    176422800
        TABLE ACCESS FULL            DM_OWNER  ID_COMP_CID_DATA   312    176422800

Same plan shape, just a different, far narrower table to work with… The time savings are considerable, too:

Original:  7 hrs, 16 min.

Same Query off of CTAS “dimensional”  table:

Creation of ID_COMP_CID_DATA: 10 min

Query:  28 min.

Total:  38 min.

The business has now been asked to change how they report data. Originally, the data was represented as yes or no, but it now needs to be presented as the values requested by the client, who is no longer satisfied with a simple Yes or No.

A simple CTAS of the table, with a DECODE of the columns to the new values, will create the new table in the format you need without having to perform a standard update.

create table COMP_BUS
  tablespace tbl_data1
  pctfree 10
  initrans 96
  maxtrans 255
  storage (freelists 48)
  nologging
as
select /*+ parallel(i, 8) */
       i.c_id
     , DECODE(i.sold_flag, 'Y', '1', '0') AS sold
     , i.pass_id AS p_id
     , i.sess_id
     , TO_NUMBER(DECODE(i.sl_dt, null, null, i.sl_dt - TO_DATE('01-JAN-1999', 'DD-MON-YYYY'))) AS sl_dt
     , i.c_dev_id
from DM_OWNER_A.comp_bus_dim i;

create index bus_cid_idx on DM_OWNER.comp_bus (c_id) tablespace BUS_INDX1 parallel 4;

exec dbms_stats.gather_table_stats…

ALTER TABLE COMP_BUS noparallel;
ALTER INDEX bus_cid_idx noparallel;

None of the empty-block or performance issues you would experience with an update will occur with the CTAS. Create the table, create any indexes, update statistics, and rename the table. The steps may seem more involved, but the time savings, and the way CTAS pays forward in performance benefits, are significant.

Here's where you have to ask yourself: do you want to be on the yellow school bus or in the hyper-pod?

Features that can be utilized in a CTAS and should be investigated for performance benefits (see the sketch after this list):

  • NOLOGGING
  • Parallel / parallel DML
  • Compression, for space savings
  • Initial transaction settings (INITRANS)
  • Partitioning / sub-partitioning
  • In some specialized cases, freelists/freelist groups instead of ASSM
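Below is a hedged sketch pulling several of these options together against the COMPANY_TRANS table from our scenario. The partition list is abbreviated, and the degree of parallelism, compression choice, and cutoff date are illustrative assumptions, not prescriptions:

CREATE TABLE COMPANY_TRANS_NEW
  INITRANS 96           -- initial transaction settings
  NOLOGGING             -- skip redo generation for the bulk load
  COMPRESS              -- basic compression for space savings
  PARTITION BY LIST (C_ID)
  (
    PARTITION P_1001 VALUES (1001),
    PARTITION P_1022 VALUES (1022)
    -- ... remainder of the company partition list
  )
  PARALLEL (DEGREE 8)   -- parallel create
AS SELECT /*+ PARALLEL(t 8) */ *
     FROM COMPANY_TRANS t
    WHERE t.TRAN_DT > TO_DATE('01-JUN-2005', 'DD-MON-YYYY');

As in the earlier examples, follow up by creating indexes, gathering statistics, and setting the table back to NOPARALLEL once the load completes.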

In the world of data warehouses and marts, theory says anyone should be able to update or delete rows easily and without issue, but in reality there are more efficient ways to process large amounts of data. We, as DBAs and developers, must be more proactive, more futuristic, more "Jetson"-like.

Kellyn

http://about.me/dbakevlar