KSCOPE 2012 and Other Happenings

This next week I’ll be speaking at KSCOPE 2012, helping out with the database track and flying the Enkitec flag, as well as hanging out at the Enkitec booth at the event.  Come by and see us; we have some cool stuff on hand and to discuss.  If you are interested in anything Exadata, Exalogic, you know the drill, let us know and we can tell you all about our upcoming E4 conference in the Dallas/Ft. Worth area on August 13-14th.
I will be attending the E4 event and the day after, August 15th, will be speaking on “Oracle for the SQL Server DBA” with the SQLPass group of Ft. Worth. I want to thank Chris Shaw for introducing me to Mary Mathais of the Ft. Worth SQLPass group, and Mary for welcoming me to speak with the group.
As with everyone, we are looking forward to Oracle Open World in the fall. Tim and I have booked the train, the California Zephyr, for our trip out again. It’s a 33hr trip, which is quite pleasant due to the lovely scenery, wonderful food, observation areas and the Bedroom car we booked, which is quite comfortable for the duration of the trip. Last year we took this same route, enjoyed the aspens in full fall glory and then dreaded the simple three hour flight back as soon as we arrived at the first security terminal at the airport!
I’ve submitted abstracts for UTOUG’s (Utah Oracle User Group) Symposium, coming up on September 1st, which is a quick one day trip for me from Denver. I’m also waiting, as are many, on UKOUG’s approval of abstract submissions.
Before I present at KSCOPE this next week, I’m going to attempt to add a few additional slides to my presentation on Windows installation gotchas with EM12c and go over best practices for job migrations, as Oracle has yet to release a utility to assist DBAs in this area, which is crucial for many.
See many of you in San Antonio next week for KSCOPE 2012!

EM12c and Alerting Check, Double-Checks and Where Else Can I Check?

Metric settings, alerting, notifications and escalations have been enhanced in EM12c to support the demands that are falling heavier on the DBA and the larger support scale of the database environment, (ASM, Exadata, middle-tier systems, etc…)  As frustrated as I know some folks became when trouble-shooting alerting issues in previous versions of the Enterprise Manager, it should be acknowledged that EM12c has many new features and a new interface for this important part of the Enterprise Manager.

I know many times, I’m so busy keeping my eye on the ball, that I can miss what’s stealing third base right under my nose.  This has always been a challenge for me and, from discussion, for a number of DBAs when in the middle of large migrations.  A while back I completed a “low target count, high quantity/complex job” EM12c migration and, as is often the case, ended up with a very small issue building into the perfect storm.

The migration was multi-step.  The first servers and jobs migrated over quite seamlessly; even with demanding job run schedules, I was able to migrate everything over and have it up and running without any downtime to the customer.  The last move was a RAC cluster with 40+ jobs, including backup and maintenance jobs that all needed to be created manually in the new EM12c, as this was the only supported process for moving the previous EM12c from a server being taken out of service to its replacement.  I completed the transition and through the night, the only issue appeared to be a few jobs that had been created with credentials the DBA team did not possess, which required a few simple changes to work around in the new environment.

The next day, no issues were seen in the early morning hours and all appeared to be working well.  Then app support sent a notification that the database was experiencing issues.  Logging into the system, I was able to quickly assess a hung archiver problem and started to research how to address it most efficiently.  I allocated more space to the FRA, but the FRA was on an ASM disk and had hit the space limit on the target in question.  Inspecting the backups, I found the migrated backup job had failed to execute with its tape settings and had switched to writing to the FRA location.  During this time, the sessions in the system had piled up and I was unable to connect with RMAN to delete the backup, either through the EM console job or the command line.  I could see the backup pieces residing in the ASM location with ASMCMD and, with time running out and access to the production environment compromised, I removed the files “the hard way”.

I still could not switch a logfile-  the system was simply refusing to release the space to the FRA so that archiving could resume.  This was a 24X7 system for the client and, as luck would have it, this was the busiest time of the day for them.  I notified the customer that I would likely have to cycle the database to force the database and server to recognize the freed space and they agreed.  After the cycle, the re-allocated space was quickly recognized and archiving commenced immediately.  I performed crosschecks of the backups, cleaning up after my earlier forced cleanup, and removed the jobs that were failing and redirecting to Oracle’s choice of the FRA destination.

So, we now need to circle back to what I have concerns with and feel is a disconnect in the default configuration of alerts and notifications in EM12c.  As soon as the backup job failed because it filled up the FRA location when it could not write to the SBT_TAPE location, I should have been notified by EM12c.  This is not like a job that I built in the EM job library where I choose to be notified or not.  There isn’t a section that asks if I want to be notified and I would think it would do so automatically, as important as backups are.

I also received the following errors in the alert log when the FRA filled up again from archiving:

 ORA-19809: limit exceeded for recovery files
 ORA-19804: cannot reclaim 536870912 bytes disk space from 107374182400 limit
 ARC0: Error 19809 Creating archive log file to '+ORAASM02'
 ARCH: Archival stopped, error occurred. Will continue retrying
 ORACLE Instance <sid> - Archival Error
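
To put the numbers in the ORA-19804 message in perspective, the byte counts can be converted to more familiar units.  This is only a quick sketch in Python; the helper name to_gib is my own, and I am assuming the “limit” in the message is the size set for the recovery area.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def to_gib(num_bytes):
    """Convert a byte count to GiB (hypothetical helper, for illustration only)."""
    return num_bytes / GIB


requested = 536870912      # bytes the archiver could not reclaim (from ORA-19804)
fra_limit = 107374182400   # limit reported in the same message

print(f"requested: {to_gib(requested)} GiB")  # requested: 0.5 GiB
print(f"limit: {to_gib(fra_limit)} GiB")      # limit: 100.0 GiB
```

In other words, the archiver needed half a gigabyte and could not find it anywhere inside a 100 GB limit.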

The above errors did not send out an alert notification email, even though an incident was created.  I was highly concerned as to what was mis-configured to cause this and as this is a brand new EM12c with the latest bundle patch, I was concerned that this was the result of the current configuration for anyone implementing the newest release.

When first setting up your EM12c, you will fill out your notification methods and email addresses, ensuring that you are set to receive emails and who will receive them during the correct monitoring window.  The next important piece and this is one of the areas that caused the problem, is to look at the default rule sets.  I always recommend copying and editing the rule set to fit your needs, but you need to also look a little deeper into what the rules are doing individually, otherwise you may end up in the same scenario I did.

You first inspect your rule set, (a copy of the original, so the default rule set will appear the same as the one you see below.)

Is it enabled?  Does it exclude any targets?  These are all valid questions, unfortunately, this was not the case:

 

Are there rules missing, not sending emails for the problem?

More valid questions, but no, this was also not the case:

 

So now we venture onto metric settings.  If our rules are correct and our incidents are being created, what is it about the metrics that keeps the emails from going out?

What you see below is just the section of metric settings for the errors above.  Note that in the left column the warning thresholds are set to “0”, while critical is set to null.  There are no settings for critical!

If you look back up at the rules, you will notice they are all set to alert when “Severity is Critical”.

If there are no metric settings for critical, then no response with emails will be created.  The rules do go on to create incidents, but again, no generation of alert notifications to the DBA to address the problem.  The easy fix for me, as I did not want to add to what I already had in rules, was to edit the metric settings and include “0” in the critical columns as well.  The EM will warn you that they are set the same, but for this type of metric alert, it functions fine and I was able to test successfully by forcing an ORA- error into the alert log:

 

SQL> exec dbms_system.ksdwrt(2,'ORA-00600: Test message, verifying alert log monitoring in EM12c');
PL/SQL procedure successfully completed.

Viewed the alert log:

Wed May 30 18:26:53 CDT 2012
 ORA-06512: Test message, verifying alert log monitoring in EM12c
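
If you want to double-check from outside the EM that an entry like this landed, a few lines of Python can pull the ORA- codes out of alert log text.  This is just a sketch of the idea and not how the EM agent scans the log; the pattern and the function name find_ora_codes are my own assumptions.

```python
import re

# Assumed pattern: "ORA-" followed by a five digit error number.
ORA_PATTERN = re.compile(r"ORA-\d{5}")


def find_ora_codes(lines):
    """Return every ORA- code found in the given lines, in order."""
    return [code for line in lines for code in ORA_PATTERN.findall(line)]


# The two alert log lines shown above.
sample = [
    "Wed May 30 18:26:53 CDT 2012",
    " ORA-06512: Test message, verifying alert log monitoring in EM12c",
]

print(find_ora_codes(sample))  # ['ORA-06512']
```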

I then uploaded the agent to speed along the alerting process, (I’m so impatient…)

Email received and the incident shows, the alerting mechanism now works with the rule set:

 

With the new EM12c, (this includes the latest bundle patch), GO THROUGH and inspect ALL OF THE METRIC SETTINGS for CRITICAL VALUES.  The rule set will not notify you on metrics that have no critical value unless you either update your rules to alert on warnings, which to me creates a lot of extra paging, or add secondary rules that look for warning alerts on these specific metrics, which are crucial to any DBA monitoring an environment.
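
The check I am describing boils down to: find every metric that has a warning threshold but no critical threshold, since rules that fire only on “Severity is Critical” will never notify on those.  A minimal sketch in Python, assuming you have copied the settings out of the console by hand; the metric names, the MetricSetting type and the function name are all made up for illustration and are not EM objects.

```python
from typing import NamedTuple, Optional


class MetricSetting(NamedTuple):
    metric: str
    warning: Optional[str]   # warning threshold, None if not set
    critical: Optional[str]  # critical threshold, None if not set


def missing_critical(settings):
    """Return names of metrics with a warning threshold but no critical one."""
    return [s.metric for s in settings
            if s.warning is not None and s.critical is None]


# Hypothetical rows, not taken from a real repository.
settings = [
    MetricSetting("archiver_hung_error", "0", None),
    MetricSetting("generic_alert_log_error", "0", None),
    MetricSetting("tablespace_pct_used", "85", "97"),
]

print(missing_critical(settings))
# ['archiver_hung_error', 'generic_alert_log_error']
```

Anything this list returns is a metric that can raise an incident without ever sending you an email under a critical-only rule set.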

EM on a VM- OxyMoron?

I’ve been part of multiple conversations, via Twitter, Facebook, personal and professional email on the choice of housing Enterprise Managers on a Virtual Machine.

Now in the title, I am not to be taken seriously, this is a bit of sarcasm, so please know, I am having a bit of fun with the title and hope to seriously look at the logic of why I avoid VM’s for my Enterprise Manager homes.

The conversations didn’t get heated, but included many who were passionately against an EM on VM, another set who stated that “Virtualization is the future, including the Enterprise Manager-  EMBRACE!” and I agree with both sides of the argument, but will continue to sit in the first group for reasons of experience.

My history involves four Enterprise Manager environments that resided on VM’s.  One was an EM12c linux/windows VM x86/VMWare combo, one 11g Linux VM x86 and two on 10g, one Linux, one Windows but too far back to remember much in specs.  I did not build or design any of them, solely came in as support after the environment was already in production.  For three of these environments, I was a remote DBA supporting the customers, so this must be taken into consideration.

I want to start off with the goal of any Enterprise Manager environment:

  • Robust, 24×7 environment monitoring.
  • No “white noise”- in other words, no alerting or paging outside of actual issues/incidents.
  • Secure, non-impacted by other applications/systems.
  • Complete,  multi-tiered monitoring for host, database, application, cloud environments and anything else the DBA can find and need a plug-in for.

Some of the cool features and goals of a VM:

  • Software and hardware isolation as part of the VM, while still able to share one set of hardware.
  • Para-virtualization, which can also be seen as load-balancing
    • IO resource allocation across VM’s as needed.
    • Memory resource allocation across VM’s as needed
  • Cost-saving distribution of a server for many purposes
  • Resource scheduling-  yes, a schedule of resources and where they may be needed most at scheduled times.

With both of these bullet point areas we can surmise a quick one line goal of each:

Enterprise Manager-  Consistent, reliable monitoring and alerting of the environment a DBA is responsible for.

Virtual Machine- Flexible architecture allowing dynamic re-allocation of resources to where most needed and often saving money by extending use of one server to many.

Looking at this again, different wording:

The concept of each leads the DBA and those they support to view the Enterprise Manager as their window, the sentinel of their environment-  It must be trusted to be there for them 24 hrs a day, 7 days a week, rain or shine.  The administrator and those they support view Virtualization as a less expensive way to get things done, with dynamic allocation of resources and hosts on demand.

On a stand alone server, what is the DBA most concerned with, let alone for their Enterprise Manager environment?

  • Do I have enough Memory?
  • Do I have the IO to perform the tasks that are necessary to day to day business?
  • Is my environment secure?
  • Who has access to these resources and could impact what I think I have, restricting my database and in turn, impacting me?

In each environment that I’ve been a DBA, Lead DBA or DBA and Developer in, there was a learning curve that was demanded of the server administration team to manage VM servers.   A VM doesn’t appear much different to the DBA than any other server, so the learning curve was much less on our side.  I was very sensitive to this for the administrator-  I experienced it for Oracle, SQL Server and MySQL installations, along with applications that interacted with the databases I supported.  Many of them released the servers to users with the default settings, receiving no additional training to offer anything more, which was acceptable for development or test or lesser impacting for a file server solution, but for anything mission critical, caused repeated impacts to service up-time.

For Windows environments, I’ve experienced automatic updates causing outages in my EM environment, once during a critical incident period, so no notifications were sent by the EM.  I received a page that the EM was down from a secondary server running an EM monitoring cron, but this didn’t let me know there was an issue in another environment being monitored by that EM.  When the hosting company was contacted in regards to the outage, I completely understood when I was informed that they hosted over 4000 Windows VMware hosts and it is in the SLA that automatic updates are turned on.

For a Linux VM that housed my OMS repository, each night, somewhere between 1:30am and 3am, there were pages escalated to the DBA oncall due to loss of contact between the OMS and the console, (Message=Agent is unable to communicate with the OMS. (REASON = Agent is Unreachable (REASON : Agent to OMS Communication is brokenOMS application is unavailable )

What is the issue with this?  Can’t you just ignore it?  Yes, you can, but this is the risk:

1.  White noise-  you learn to ignore pages at these times, assuming it’s just this bogus alert and sooner or later, you experience a real issue and ignore it.  This is not efficient alerting!

2. If the console can’t communicate with the OMS, how do you know your agents are uploading all appropriate data in a timely manner?

3.  If the console can’t communicate with the OMS, are you receiving alerts in a timely manner if there is an issue in our environment from any given target?

The cause of the OMS communication error?  A second VM sharing resources is utilized as an FTP server and floods the network each night around this time, impacting the OMS and console’s ability to communicate.

As there is definitive hardware isolation, it took a while and some research to figure out what was causing the outage.  This is where there is a catch-22 to the VM environment for trouble-shooting issues.

With 10g, as many have noted, there was the standard whining about memory and CPU starvation, poor management of the immature VM environments, etc.   With my previous employer, I started down the path of building an 11g EM on a Windows VMWare host that, due to poor network connectivity and inconsistent resource allocation, sat completely idle and was quickly replaced by a Linux stand alone server with EM12c last fall as part of a huge 11g upgrade project.

I think any DBA knows what is required for their Enterprise Manager hardware.  I think this is a solid case for RAC or Data Guard, but then we get into higher licensing cost and remember, the choice to put the EM on a VM was most likely to KEEP COSTS DOWN.  I believe that the EM is often looked upon as a luxury for the DBA by the business.  It is not producing revenue.  The users do not utilize it, it is for support.  It is easy to comprehend why it is often going to be an application/system that is deemed perfect for virtualization.

Pointing fingers is also easy when the EM does not get the resources it requires due to the basic nature and features of the VM without clear and skilled knowledge of what the EM requires.  I have spent a large amount of my time, due to this type of issue, creating very politically correct explanations of why the DBA staff was not aware of an outage/critical issue/spam bogus alerting and it is important to me to do so.  Everyone is a professional and they are doing their job to the best of their ability.  Knowing the critical importance of my Enterprise Manager environment to the business to notify me as the technical specialist to address issues and to keep revenue flowing, why would I put myself, the administrator of the VM environment or the company I am there to support in the position of having this critical environment reside on a technology whose basic nature is better suited to other applications/uses?

There are times when a VM is going to be the only choice for hardware for an Enterprise Manager project or environment a customer has and I will do everything in my power to ensure it has the best support and that I continue to learn more about virtualization.  I hope that enhancements and education on how best to build virtualized environments to support production 24×7 mission critical systems will continue and that at some point VM will be as easy a choice to recommend as other options when it comes to Oracle’s Enterprise Manager.