Nov 22 2011

Restarting a Duplicate Process From a VERY Failed State

Category: DB ChaosKellyn Pot'Vin @ 8:38 pm

As part of an 11g Upgrade, it was found that a database environment could be built from one of the upgraded databases through a duplicate.  As this process had never been performed before in this fashion, a test was in order.  The test was an excellent chance to discover that the OSUser that performs the duplicate process was the proud owner of a .kshrc file with hard coded Oracle variables set which was an excellent choice if you want to really screw up a new duplicate database… :)

Scenario:

1. Duplicate has started with correct variables set.

2.  Subsequent shell scripts then “upset” the environment mid-process leaving a failed duplicate with ASM files for the controlfiles created, but the DBID and dbname still set to the target database, not the auxillary.

3.  Post the failure, the auxillary, (duplicate) database can only be mounted, not opened.

After setting environment to building auxillary database environment, after duplicate failure and you see in the spfile, name of db is no longer the auxilary database.
1.  Shutdown the auxilary database:
 [oracledbs]$ sqlplus ‘/as sysdba’
SQL> shutdown;
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.@
In a second screen set to +ASM instance for auxillary, remove controlfiles for target db that have been set to that name through the asmcmd command console:
[oracle@dbs]$ asmcmd
ASMCMD> ls
DATA_H/
DATA_RAM_H/
ASMCMD> cd DATA_H
ASMCMD> ls
DB_H/
ASMCMD> cd DB_H
ASMCMD> ls
CONTROLFILE/
ASMCMD> cd CONTROLFILE
ASMCMD> ls
control1.ctl <–these are control files for auxillary, stuck with dbname of target, can’t be renamed, can’t mount db!
control2.ctl
control3.ctl
current.389.766743537
current.445.766743537
current.500.766743537
ASMCMD> rm control*
You may delete multiple files and/or directories.
Are you sure? (y/n) y
ASMCMD> quit
Back on your original screen now, restart the auxillary with pfile set to corect dbname and start with this pfile-
SQL>  startup nomount pfile=’/u01/app/oracle/product/11.2.0/dbhome_2/dbs/initdb_h.ora’
ORACLE instance started.
Total System Global Area 7.6964E+10 bytes
Fixed Size                  2215704 bytes
Variable Size            3.0065E+10 bytes
Database Buffers         4.6708E+10 bytes
Redo Buffers              189513728 bytes
SQL> create spfile from pfile;
File created.
SQL> shutdown;
ORACLE instance shut down.
SQL> startup nomount;
ORACLE instance started.
Total System Global Area 7.6964E+10 bytes
Fixed Size                  2215704 bytes
Variable Size            3.0065E+10 bytes
Database Buffers         4.6708E+10 bytes
Redo Buffers              189513728 bytes
Now you can restart the duplicate process and the database will again be recognized correctly.


Nov 22 2011

How to Break and EM12c and Other Hobbies

Category: DB ChaosKellyn Pot'Vin @ 8:25 pm

As many know I’ve been busy trying to bring our many-times patched 10g Enterprise Manager migrated to a new server with EM12c.  I thought it would be interesting to see how fast I could break it, considering my skills, I knew it might not be a challenge.

Scenario

1.  Bug with listener refusing to connect dynamically to EM12c repository database.

2.  After releasing code to the database for advanced notification,  the SYSMAN.MGMT_ADMIN_DATA package has gone invalid, which is used to manage much of the repository at the command line.

3.  The repository owner password has become corrupted.

These were my high level notes as I went through the troubleshooting:

Reconfigure Repository:
$AGENT_HOME/bin/emctl config oms -store_repos_details -repos_port 1521 -repos_sid emrep12c -repos_host host_nm -repos_user SYSMAN -repos_pwd password
Stopping Old Oracle OMS and configuring:
export ORACLE_HOME=/opt/oracle/app/OracleHomes/oms10g
cd $ORACLE_HOME/opmn/bin
./opmnctl stopall
cd $ORACLE_HOME/bin
./emctl config oms -change_repos_pwd
cd $ORACLE_HOME/opmn/bin
./opmnctl startall
Invalid package after loop from listener bug, corrupt SYSMAN password and invalid MGMT_XXX pkgs!
Can’t reset password, pkg used for it is invalid!
ORA-04063: package body “SYSMAN.MGMT_ADMIN_DATA” has errors
ORA-06508: PL/SQL: could not find program unit being called: “SYSMAN.MGMT_ADMIN_DATA”
ORA-06512: at line 1
Can’t start repository, password error in logs:
Error occurred. Check the log /local/u01/app/oracle/product/12.1.0/gc_inst/em/EMGC_OMS1/sysman/log/secure.log
[oracle@vwgrid01 bin]$ ./emctl start oms
Oracle Enterprise Manager Cloud Control 12c Release 12.1.0.1.0
Soooo close…but did I get far enough?, (Nope, failure…have to remove everything!!)
Can’t drop repository:
<Database ORACLE HOME>/bin/emca -deconfig dbcontrol db -repos drop -SYS_PWD sys_password -SYSMAN_PWD password
Can’t reconfigure through the system, either!
<Database ORACLE HOME>/bin/emca -deconfig dbcontrol db -repos drop -cluster -SYS_PWD sys_password -SYSMAN_PWD password
No chance either…I’m stuck in a terrible loop!
Solution-
backup and then edit out the 12c environment from the inventory.xml file in the ContentsXML dir of the Lsinventory
kill any process that are still running out the the EM12c home.
rm -rf the 12c home directory
Uninstall the emrep12 database in the 11g home
Create a new emrep12 database in the 11g home
recreate a new inventory file:
$ORACLE_HOME/oui/bin/runInstaller -silent -invPtrLoc “/u01/app/oraInventory/oraInst.loc” -attachHome ORACLE_HOME=”/u01/app/oracle/product/11.2.0″ ORACLE_HOME_NAME=”OraDb11g_home2″
Install the EM12c once again, as now the installation appears to have never existed…
If a restart of the installation needs to be done, you need to look in the home it fails on, (*saying that it’s already installed there) and remove the *_temp file from the dir, then try again.


Oct 20 2011

The NO_INVALIDATE Option in DBMS_STATS with 10g

Category: DB ChaosKellyn Pot'Vin @ 1:51 pm

I had worked hard on a report, 47 SQL statements in all to tune it down from 5 hrs to under 30 minutes.  The first runs had been quite successful, so when a third run sent an alert on temp usage, I knew something was wrong.

     SID PROCESS      MACHINE SQL_TEXT    SQL_ID     TOTAL MB
-------- ------------ -------------------- ---------------------
     507 1137	      HOST SELECT *** 7t3muww36xhzn  45516
     600 1139	      HOST SELECT *** 7t3muww36xhzn  45516
     525 1132	      HOST SELECT *** 7t3muww36xhzn  45516
     509 1135	      HOST SELECT *** 7t3muww36xhzn  45516

I checked the stats first, as one of the fixes was to ensure the staging tables in this process were collecting stats after the initial feeds came in, but both tables involved showed valid statistics:

SQL> select num_rows, last_analyzed from dba_tab_partitions 2 where table_name='<I_STAGE>' 3 and partition_name='P170';

  NUM_ROWS LAST_ANAL
---------- ---------
 480900000 17-OCT-11
SQL> select last_analyzed from dba_tables 2 where table_name='<SML_TBL>';

LAST_ANAL
---------
17-OCT-11

I ran a quick AWR report for the specific SQL_ID to see what I was dealing with, execution plan wise..


              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     46198 18-Oct-11 09:00:09       282       7.9
  End Snap:     46200 18-Oct-11 10:00:13       245       8.4
   Elapsed:               60.07 (mins)
   DB Time:            1,874.62 (mins)

 

SQL ID: 7t3muww36xhzn           DB/Inst: PRODUCTION/PROD  Snaps: 46198-46200
-> 1st Capture and Last Capture Snap IDs
   refer to Snapshot IDs witin the snapshot range
-> SELECT ***

    Plan Hash           Total Elapsed                 1st Capture   Last Capture
#   Value                    Time(ms)    Executions       Snap ID        Snap ID
--- ---------------- ---------------- ------------- ------------- --------------
1   324636810               4,785,428             4         46199          46199
2   4097803110                  1,047             1         46200          46200
          -------------------------------------------------------------

Plan 1(PHV: 324636810)
----------------------

Plan Statistics                 DB/Inst: PRODBASE/prodbase  Snaps: 46198-46200
-> % Total DB Time is the Elapsed Time of the SQL statement divided
   into the Total Database Time multiplied by 100

Stat Name                                Statement   Per Execution % Snap
---------------------------------------- ---------- -------------- -------
Elapsed Time (ms)                         4,785,428    1,196,357.0     4.3
CPU Time (ms)                             3,450,070      862,517.6     8.4
Executions                                        4            N/A     N/A
Buffer Gets                                 759,453      189,863.3     0.0
Disk Reads                                  683,619      170,904.8     2.8
Parse Calls                                      35            8.8     0.0
Rows                                              0            0.0     N/A
Execution Plan
------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name        | Rows  | Bytes | Cost  | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |             |       |       |     9 |       |       |        |      |
|   1 |  COUNT STOPKEY                |             |       |       |       |       |       |        |      |
|   2 |   PX COORDINATOR              |             |       |       |       |       |       |        |      |
|   3 |    PX SEND QC (ORDER)         | :TQ10002    |     1 |   240 |     9 |       |       |  Q1,02 | P->S | QC (ORDER)
|   4 |     VIEW                      |             |     1 |   240 |     9 |       |       |  Q1,02 | PCWP |
|   5 |      SORT ORDER BY STOPKEY    |             |     1 |   120 |     9 |       |       |  Q1,02 | PCWP |
|   6 |       PX RECEIVE              |             |     1 |   240 |       |       |       |  Q1,02 | PCWP |
|   7 |        PX SEND RANGE          | :TQ10001    |     1 |   240 |       |       |       |  Q1,01 | P->P | RANGE
|   8 |         SORT ORDER BY STOPKEY |             |     1 |   240 |       |       |       |  Q1,01 | PCWP |
|   9 |          HASH JOIN            |             |     1 |   120 |     5 |       |       |  Q1,01 | PCWP |
|  10 |           PX RECEIVE          |             |     1 |   109 |     3 |       |       |  Q1,01 | PCWP |
|  11 |            PX SEND BROADCAST  | :TQ10000    |     1 |   109 |     3 |       |       |  Q1,00 | P->P | BROADCAST
|  12 |             PX BLOCK ITERATOR |             |     1 |   109 |     3 |   KEY |   KEY |  Q1,00 | PCWC |
|  13 |              TABLE ACCESS FULL| I_STAGE     |     1 |   109 |     3 |   KEY |   KEY |  Q1,00 | PCWP |
|  14 |           PX BLOCK ITERATOR   |             |  5002 | 55022 |     2 |       |       |  Q1,01 | PCWC |
|  15 |            TABLE ACCESS FULL  | SML_TBL     |  5002 | 55022 |     2 |       |       |  Q1,01 | PCWP |
------------------------------------------------------------------------------------------------------------------------

This is the one that was eating up all the temp! Note that even though I checked stats, stats were correct as of the previous day, no changes to the partition stats, the execution plan only shows one row, which anyone who listens to Maria Colgan knows, that’s just Oracle giving you the benefit of the doubt and saying, “I don’t think there’s any rows in this object, (or sub-object in this case..) but I’ll give you 1 row for the fun of it!”

Second execution plan in the report is the one I desired:

Plan 2(PHV: 4097803110)
-----------------------

Plan Statistics                 DB/Inst: PRODUCTION/PROD  Snaps: 46198-46200
-> % Total DB Time is the Elapsed Time of the SQL statement divided
   into the Total Database Time multiplied by 100

Stat Name                                Statement   Per Execution % Snap
---------------------------------------- ---------- -------------- -------
Elapsed Time (ms)                             1,047        1,046.7     0.0
CPU Time (ms)                                   967          967.0     0.0
Executions                                        1            N/A     N/A
Buffer Gets                                   2,007        2,007.0     0.0
Disk Reads                                        3            3.0     0.0
Parse Calls                                       9            9.0     0.0
Rows                                             40           40.0     N/A
User I/O Wait Time (ms)                           1            N/A     N/A
Execution Plan
------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name        | Rows  | Bytes |TempSpc| Cost  | Pstart| Pstop |    TQ  |IN-OUT| PQ
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |             |       |       |       |  2834 |       |       |        |      |
|   1 |  COUNT STOPKEY                |             |       |       |       |       |       |       |        |      |
|   2 |   PX COORDINATOR              |             |       |       |       |       |       |       |        |      |
|   3 |    PX SEND QC (ORDER)         | :TQ10002    |  1272K|   291M|       |  2834 |       |       |  Q1,02 | P->S | QC
|   4 |     VIEW                      |             |  1272K|   291M|       |  2834 |       |       |  Q1,02 | PCWP |
|   5 |      SORT ORDER BY STOPKEY    |             |  1272K|   152M|   389M|  2834 |       |       |  Q1,02 | PCWP |
|   6 |       PX RECEIVE              |             |    40 |  9600 |       |       |       |       |  Q1,02 | PCWP |
|   7 |        PX SEND RANGE          | :TQ10001    |    40 |  9600 |       |       |       |       |  Q1,01 | P->P | RA
|   8 |         SORT ORDER BY STOPKEY |             |    40 |  9600 |       |       |       |       |  Q1,01 | PCWP |
|   9 |          HASH JOIN            |             |  1272K|   152M|       |     7 |       |       |  Q1,01 | PCWP |
|  10 |           PX RECEIVE          |             |  5005 | 55055 |       |     2 |       |       |  Q1,01 | PCWP |
|  11 |            PX SEND BROADCAST  | :TQ10000    |  5005 | 55055 |       |     2 |       |       |  Q1,00 | P->P | BR
|  12 |             PX BLOCK ITERATOR |             |  5005 | 55055 |       |     2 |       |       |  Q1,00 | PCWC |
|  13 |              TABLE ACCESS FULL| SML_TBL     |  5005 | 55055 |       |     2 |       |       |  Q1,00 | PCWP |
|  14 |           PX BLOCK ITERATOR   |             |  1271K|   139M|       |     4 |   KEY |   KEY |  Q1,01 | PCWC |
|  15 |            TABLE ACCESS FULL  | I_STAGE     |  1271K|   139M|       |     4 |   KEY |   KEY |  Q1,01 | PCWP |
------------------------------------------------------------------------------------------------------------------------

So what changed? What impacted my statistics?

Upon investigation, I came to the conclusion that it is a combination of a “feature” with what I think is a bug in 10g dbms_stats.

A search of stats processing showed that during the one process that was executing against the P170 partition on the I_STAGE, there were a number of other partitions in this same table having stats gathered post loading.

declare v_stage_table_name varchar2(64); begin select min(stage_table_name) into v_stage_table_name from stage_tables 
where stage_table_type_cd = '<I_STAGE>'; dbms_stats.gather_table_stats (ownname => 'dw_user', tabname => v_stage_table_name, 
partname => 'P450' ,estimate_percent =>.01, granularity=>'PARTITION', method_opt=>'for all columns size 1', 
no_invalidate=> false, cascade=>false, degree=>4); end;

Now the key here in the statement above is:

no_invalidate=>false

If you read the description for this from Oracle:

no_invalidate Does not invalidate the dependent cursors if set to TRUE. The procedure invalidates the dependent cursors immediately if set to FALSE. Use DBMS_STATS.AUTO_INVALIDATE. to have Oracle decide when to invalidate dependent cursors. This is the default. The default can be changed using the SET_PARAM Procedure.

The surmised bug is one where even though the dbms_stats being performed by another process is partition level, the invalidation of the cursors is across all partitions in the object, causing them all to be invalidated, requiring them to re-parse the SQL.  (There are a number of similar bugs already documented in 10.2.0.4.0 for partition level statistics gathering…)

The feature to allow Oracle to re-parse and take advantage of the newest statistics information in the data dictionary resulted in a poor performance challenge in this instance, as the cursors were invalidated on a process that needed no changes to statistics.

I tested repeatedly against partitions, collecting stats with the no_invalidate set to false or true and even to AUTO to see what would occur and it consistently impacted my cursors against other partitions.  I can find not documented bug, but as many know, I’m about to move everything to 11g  in short order and expect it would be a waste of time to pursue it to far…

I notified the Java developer who owns this code to please update the no_invalidate=>true to correct the performance impact short term and look forward to 11g bugs to replace my exhaustion on 10g ones! :)