
The CTAS, a.k.a. Processing Data the Jetsons' Way

**Note:** This is an article I wrote about four years ago that, as far as I know, can only be found in hard copy. Although I've changed my stance on statistics, etc., with 11g/12c, the main point of CTAS benefits still stands the test of time.

As a kid, did you ever watch Hanna-Barbera's cartoon "The Jetsons," see Elroy drop out of George Jetson's flying saucer to go to school in his personal hyper-pod, and think, "Forget the bus, I want to go to school in the hyper-pod!"?

For any developer or DBA working with large objects in an OLAP environment, updating and deleting large tables, even when partitioned, can feel like you are on that school bus again. So how do you get off the school bus and ride Elroy's hyper-pod to get things done, Jetsons style?

In my time working with large objects, I've seen and discovered more ways to use the CTAS ("create table as select," for those who are new to the acronym!) than I ever thought possible. Any DBA or developer in a warehouse or mart environment should make it a point to add every way this feature can be used to their arsenal.

For our current-day vs. Jetsons-futuristic comparison, we'll say our environment consists of data partitioned by company_id, residing mostly in four primary tables. This partitioning method has been quite successful for how the business loads and reports on data, but your first challenge is to prune all purchase orders older than five years out of the system. Once that's complete, the business has some reporting requirements for you to invest time in, so keep an open mind and be ready to work through some challenges!

The tables involved have the following profiles, with example columns and sizes:

COMPANY_ACCT, 253 columns, 296GB, 1000 partitions total.
C_ID (partitioning key)
C_NBR
ACCT_TYPE_CD
SUB_CID
DEPT_ID
ADDRESS
PO_BOX
CITY
STATE
ZIP
LAST_ACTIVE_DT
… so on and so forth…

COMPANY_TRANS, 120 columns, 312GB, 1000 partitions total.
C_ID(partitioning key),
TRAN_NBR
TRAN_DT
… so on and so forth…

ID_COMP_DIM, 412 columns, 212GB, 1000 partitions total.
C_ID(partitioning key),
C_MOD_KEY,
COMP_MOD_ID,
… so on and so forth…

GEO_COMP_DIM, 82 columns, 180GB, 1000 partitions total.
C_ID,
GEO_KEY,
TERR_CD,
… so on and so forth…

The traditional OLTP processing method to tackle the removal would be to either:

  • Delete with a WHERE clause against TRAN_DT or LAST_ACTIVE_DT, with an index on those columns.
  • Bulk-process the same deletes, but in parallel, using C_ID (the company ID) for the most efficient performance.

How could you perform the delete faster? How could you accomplish the task without issuing a DELETE statement at all?

Elroy Jetson would say, "CTAS it!" Anything that can be done can be done faster, more efficiently, and with more style when it's done with a CTAS. Pair CTAS with the exchange partition feature and you have a match made in heaven, with opportunities anyone can appreciate.

Now to return to our challenge at hand: let's take our first table, COMPANY_ACCT, and remove all the accounts that haven't been active in over five years.

The traditional way,  version 1:

DELETE FROM COMPANY_ACCT
WHERE LAST_ACTIVE_DT < TO_DATE('01-JUN-2005','DD-MON-YYYY');

Or version 2, process via a procedure that accepts the company ID, run multi-threaded or in parallel:

DELETE FROM COMPANY_ACCT
WHERE C_ID=:b1
AND LAST_ACTIVE_DT < TO_DATE('01-JUN-2005','DD-MON-YYYY');

After either of these approaches, you will have stale statistics and a considerable number of empty blocks: blocks that will still be scanned on every hash and table scan until you "coalesce" this partition. This is something you must keep in mind when you process data the old-fashioned way.
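As an aside, if you do stay with the old-fashioned delete, one option for reclaiming those empty blocks is an online segment shrink. This is a sketch, assuming the partition lives in an ASSM tablespace (shrink is not available for freelist-managed segments) and that enabling row movement is acceptable:

-- shrink requires row movement on the table
ALTER TABLE COMPANY_ACCT ENABLE ROW MOVEMENT;
-- compact the partition and release the freed space
ALTER TABLE COMPANY_ACCT MODIFY PARTITION P_229 SHRINK SPACE;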

How much more efficient would it be to process this via a CTAS with an exchange partition?

**Note:** there is no DELETE in the following statement!

CREATE TABLE COMPANY_ACCT_229 AS
SELECT * FROM COMPANY_ACCT
WHERE C_ID=229
AND LAST_ACTIVE_DT > TO_DATE('01-JUN-2005','DD-MON-YYYY');

-- COMPANY_ACCT_229 is a standalone table, so the index is not declared LOCAL
CREATE INDEX CA_229_idx1 ON COMPANY_ACCT_229(LAST_ACTIVE_DT);

Update the statistics, and note that I use very, very small sample sizes for my estimate_percent. My data is very consistent across data sets, but there is a very large amount of it, which gives me the luxury of collecting stats without the time required for larger sample sizes. Due to the number of partitions, I am careful to gather only what I need, so as not to impact the performance of the data dictionary. I do not like seeing long operations on tables such as col$, tab$, ind$, etc. If you see this, look into the amount of statistics you are gathering: are you collecting what you need, or just flooding the data dictionary with data that cannot give you the consistent, solid execution plans that are your goal?

exec dbms_stats.gather_table_stats(ownname=>'OWNER', tabname=>'COMPANY_ACCT_229', estimate_percent=>.00001);

We are now ready to exchange our partition:

ALTER TABLE COMPANY_ACCT EXCHANGE PARTITION P_229 WITH TABLE COMPANY_ACCT_229 UPDATE INDEXES;

Your "delete" of this partition's data is complete. There is no fragmentation, and you have refreshed the stale statistics; you are as clean as possible for all processing going forward.
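One housekeeping note: the exchange is a segment swap, so the standalone COMPANY_ACCT_229 table now holds the partition's original data, old rows included. Once you have verified the exchange, drop it to finish the job:

-- the standalone table now contains the pre-exchange data
DROP TABLE COMPANY_ACCT_229 PURGE;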

What are the cost savings?

Original delete process per partition, no update of statistics, on average: 28 min.

CTAS and exchange partition, including index creation but minus the statistics update, on average: 7 min.

What if you're challenged to reproduce the whole table, all the partitions, without the older-than-five-years data? How can you complete that task with the CTAS method?

CREATE TABLE DW_PROD.COMPANY_ACCTS2
(
  C_ID          NUMBER not null,
  C_NBR         VARCHAR2(30) not null,
  ACCT_TYPE_CD  VARCHAR2(1) not null,
  …             -- columns to be included in the new table
)
partition by list (C_ID)
(
  partition P_1001 values (1001),
  partition P_1022 values (1022),
  partition P_1023 values (1023),
  partition P_1024 values (1024),
  partition P_1025 values (1025),
  partition P_1026 values (1026),
  partition P_1029 values (1029),
  partition P_1031 values (1031),
  …             -- remainder of the partition list
)
<storage parameters> PARALLEL (DEGREE 8)
AS select /*+ PARALLEL(ca 8) */
ca.C_ID,
ca.C_NBR,
ca.ACCT_TYPE_CD,
…
FROM COMPANY_ACCT ca
WHERE ca.LAST_ACTIVE_DT > TO_DATE('01-JUN-2005','DD-MON-YYYY');

Create the indexes that exist on the original, update the statistics on the secondary table, and then perform a rename: rename the original to "old" and COMPANY_ACCTS2 to the original name, COMPANY_ACCT.
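A minimal sketch of that swap (any indexes, grants, and synonyms would need the same housekeeping):

-- keep the original until the new table is verified
ALTER TABLE COMPANY_ACCT RENAME TO COMPANY_ACCT_OLD;
-- the pruned copy takes over the original name
ALTER TABLE COMPANY_ACCTS2 RENAME TO COMPANY_ACCT;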

What are the time savings of performing it this way vs. the original?

Delete Statement, in parallel, against all the partitions, WITHOUT update of statistics:  23 hrs, 12 min.

CTAS and rename, INCLUDING update of statistics:  7 hrs, 4 min.

This is nearly a 70% time savings, over 16 hours less maintenance window, whether counted against the DBA's personal time or as outage to the business. How can you say no to this?

Dimensional data the CTAS way…

As a special request from the business, a developer needs to create a table with multiple joins to three of the largest tables. The developer makes multiple attempts, but must stop due to performance hits on the database.

SELECT /*+ parallel(g 8) parallel(i 8) use_hash(g i) */
       i.c_id,
       i.c_mod_key,
       i.comp_mod_id,
       DECODE (i.c_id,
               0, 'N',
               NVL2 (g.geo_tract, 'B', 'Z'))
          AS cmatch,
       g.next_col1,
       g.next_col2,
       g.next_col3,
       g.next_col4,
       g.next_col5,
       g.next_col6,
       ...
       g.next_col19
  FROM geo_comp_dim g, id_comp_dim i
 WHERE g.c_id = i.c_id;

The explain plan shows the impressive cost of this query, along with the outrageous amount of temporary tablespace required to perform the hash join and sort.

Explain Plan:

Description                          Owner     Object         Cost     Rows       Temp
SELECT STATEMENT, GOAL = ALL_ROWS                              1145371  176422800
 PX COORDINATOR
  PX SEND QC (ORDER)                 SYS       :TQ10002        1145371  176422800
   SORT ORDER BY                                               1145371  176422800  2.4 TB
    PX RECEIVE                                                 3035     176422800
     PX SEND RANGE                   SYS       :TQ10001        3035     176422800
      HASH JOIN                                                3035     176422800
       PX RECEIVE                                              163      373294
        PX SEND BROADCAST            SYS       :TQ10000        163      373294
         PX BLOCK ITERATOR                                     163      373294
          TABLE ACCESS FULL          DM_OWNER  GEO_COMP_DIM    163      373294
       PX BLOCK ITERATOR                                       2730     176422800
        TABLE ACCESS FULL            DM_OWNER  ID_COMP_DIM     2730     176422800

What carries the heaviest weight in the cost of this query?

The hash join between the two tables. But what does the query really need from them? It needs almost everything from GEO_COMP_DIM, which is a relatively small cost; from ID_COMP_DIM, however, the developer needs only three columns out of the total 412!

Those of us who have studied hash joins know to expect the hash join to require roughly 1.6 times the size of the table we lead in with to build the hash. What if we created a smaller lead table for the hash, then?
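To put rough numbers on that (a back-of-envelope estimate, assuming purely for the sake of the arithmetic that the three columns are of about average width for the table):

    Full ID_COMP_DIM as the hash input:  212 GB x 1.6 ≈ 340 GB
    Three columns of the 412:            212 GB x 3/412 ≈ 1.5 GB, and 1.5 GB x 1.6 ≈ 2.5 GB

A narrow copy of just the needed columns shrinks the hash input by roughly two orders of magnitude, which is what drives the drop in cost and temp usage you will see shortly.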

CTAS to the rescue again!

CREATE TABLE ID_COMP_CID_DATA AS
SELECT C_ID, C_MOD_KEY, COMP_MOD_ID FROM ID_COMP_DIM;

Replace the original ID_COMP_DIM table in the query with ID_COMP_CID_DATA.  What is the performance improvement?
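With the swap in place, the only change to the developer's query is the table in the FROM clause; the hints and predicate stay exactly as they were:

SELECT /*+ parallel(g 8) parallel(i 8) use_hash(g i) */
       i.c_id,
       i.c_mod_key,
       i.comp_mod_id,
       DECODE (i.c_id,
               0, 'N',
               NVL2 (g.geo_tract, 'B', 'Z'))
          AS cmatch,
       g.next_col1,
       ...
       g.next_col19
  FROM geo_comp_dim g, id_comp_cid_data i
 WHERE g.c_id = i.c_id;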

 

Description                          Owner     Object             Cost   Rows       Temp
SELECT STATEMENT, GOAL = ALL_ROWS                                 41289  176422800
 PX COORDINATOR
  PX SEND QC (ORDER)                 SYS       :TQ10002           41289  176422800
   SORT ORDER BY                                                  41289  176422800  153 GB
    PX RECEIVE                                                    623    176422800
     PX SEND RANGE                   SYS       :TQ10001           623    176422800
      HASH JOIN                                                   623    176422800
       PX RECEIVE                                                 163    373294
        PX SEND BROADCAST            SYS       :TQ10000           163    373294
         PX BLOCK ITERATOR                                        163    373294
          TABLE ACCESS FULL          DM_OWNER  GEO_COMP_DIM       163    373294
       PX BLOCK ITERATOR                                          312    176422800
        TABLE ACCESS FULL            DM_OWNER  ID_COMP_CID_DATA   312    176422800

Same plan shape, just a different, far narrower table to work with… The time savings are considerable, too:

Original:  7 hrs, 16 min.

Same Query off of CTAS “dimensional”  table:

Creation of ID_COMP_CID_DATA: 10 min

Query:  28 min.

Total:  38 min.

The business has now been asked to change how they report data. Originally, the data was represented as yes or no, but it now needs to be presented as the values requested by the client, who is no longer satisfied with a simple Yes or No.

A simple CTAS of the table, with a DECODE of the columns to the new values, will create the new table in the format you need without having to perform a standard update.

create table COMP_BUS
  tablespace tbl_data1
  pctfree 10
  initrans 96
  maxtrans 255
  storage (freelists 48)
  nologging
as
select /*+ parallel(i, 8) */
       i.c_id
     , DECODE(i.sold_flag, 'Y', '1', '0') AS sold
     , i.pass_id AS p_id
     , i.sess_id
     , TO_NUMBER(DECODE(i.sl_dt, null, null, i.sl_dt - TO_DATE('01-JAN-1999', 'DD-MON-YYYY'))) AS sl_dt
     , i.c_dev_id
from DM_OWNER_A.comp_bus_dim i;

create index bus_cid_idx on DM_OWNER.comp_bus (c_id) tablespace BUS_INDX1 parallel 4;

exec dbms_stats.gather_table_stats…

ALTER TABLE COMP_BUS noparallel;
ALTER INDEX bus_cid_idx noparallel;

None of the empty-block or performance issues you would experience with an update will occur with the CTAS. Create the table, create any indexes, update statistics, and rename the table. The steps may seem more involved, but the time savings, and the way CTAS pays forward in performance benefits, are significant.

Here's where you have to ask yourself: do you want to be on the yellow school bus or in the hyper-pod?

Features that can be utilized in a CTAS and should be investigated for performance benefits (see the sketch after this list):

  • NOLOGGING
  • Parallel / parallel DML
  • Compression, for space savings
  • Initial transaction settings (INITRANS)
  • Partitioning / sub-partitioning
  • In some specialized cases, freelists/freelist groups instead of ASSM
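Below is a hedged sketch pulling several of these options together against the COMPANY_TRANS table from our scenario. The partition list is abbreviated, and the degree of parallelism, compression choice, and cutoff date are illustrative assumptions, not prescriptions:

CREATE TABLE COMPANY_TRANS_NEW
  INITRANS 96           -- initial transaction settings
  NOLOGGING             -- skip redo generation for the bulk load
  COMPRESS              -- basic compression for space savings
  PARTITION BY LIST (C_ID)
  (
    PARTITION P_1001 VALUES (1001),
    PARTITION P_1022 VALUES (1022)
    -- ... remainder of the company partition list
  )
  PARALLEL (DEGREE 8)   -- parallel create
AS SELECT /*+ PARALLEL(t 8) */ *
     FROM COMPANY_TRANS t
    WHERE t.TRAN_DT > TO_DATE('01-JUN-2005', 'DD-MON-YYYY');

As in the earlier examples, follow up by creating indexes, gathering statistics, and setting the table back to NOPARALLEL once the load completes.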

In the world of data warehouses and marts, theory says anyone should be able to update or delete rows easily and without issue, but in reality there are more efficient ways to process large amounts of data. We, as DBAs and developers, must be more proactive, more futuristic, more "Jetson"-like.

Kellyn

http://about.me/dbakevlar