EM12c- Managing Incidents, Stopping the Insanity, Part III
Nothing is more annoying than getting alerted on things that are not critical to you, or that you already know are occurring and there is not a darn thing you can do about them.
I’ve also been frustrated to wake up in the morning and see my inbox flooded with a ton of alerts from numerous EM12c systems when really, none of them are truly critical, yet together they can appear simply overwhelming!
How do we steer the inbox from madness to manageable?
The answer to this question is not just managing metric settings and thresholds- there are a number of different ways to control what ends up in your inbox or alerting you at every waking and sleeping hour.
This part of the series focuses on fine-grained rule sets. This post will show you how in a few simple steps, in the hope that you can recreate this as simply or as elaborately as you wish, to make your EM12c serve you better.
Note: As I said, this post is about changing a rule set. Remember- changing a rule set is a change to ALL targets utilizing the rule set, vs. changing a metric policy or threshold at a target level, so think through what you want to change and how you want to make the change BEFORE you make it. Everything I show you here can be done in many different ways to address the problem. Think rationally about the change you need to make, test it out and use common sense.
So for our email today, one of my client’s systems has started to send this incident to my inbox at night:
Host=hostxyz.client7.com
Target type=Oracle Management Service
Target name=hostxyz.client7.com:4892_Management_Service
Categories=Performance
Message=Loader Throughput (rows per second) for Loader_D exceeded the critical threshold (75). Current value: 65.08
Severity=Critical
Event reported time=Dec 4, 2013 8:50:10 AM CST
Operating System=Linux
Platform=x86_64
Associated Incident Id=51218
Associated Incident Status=New
Associated Incident Owner=SYSMAN
Associated Incident Acknowledged By Owner=No
Associated Incident Priority=Very High
Associated Incident Escalation Level=0
Event Type=Metric Alert
Event name=Management_Loader_Status:load_processing
Metric Group=Active Loader Status
Metric=Loader Throughput (rows per second)
Metric value=65.08
Key Value=Loader_D
Key Column 1=Loader Name
Rule Name= Incident Management Ruleset,Incident creation Rule for metric alerts.
Rule Owner=SYSMAN
Update Details:
Loader Throughput (rows per second) for Loader_D exceeded the critical threshold (75). Current value: 65.08
Incident created by rule (Name = Incident Management Ruleset, Incident creation Rule for metric alerts.; Owner = SYSMAN).
I really couldn’t care less about this, and I am already receiving this information for review once it hits the warning threshold. I don’t want to be woken up in the middle of the night, nor is it something that I can really address as a critical outage for this client.
Let’s edit our rule set by taking the information offered us in the email from the “Rule Name” section:
Rule Name= Incident Management Ruleset,Incident creation Rule for metric alerts.
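If you deal with a lot of these emails, pulling the two halves of the “Rule Name” apart can be scripted. The following is only a minimal sketch, assuming the email body has one Key=Value field per line; the helper names parse_notification and split_rule_name are my own invention and not part of any Oracle tooling.

```python
# Hypothetical helpers (not an Oracle API): read the "Key=Value" fields of an
# EM12c notification email and split "Rule Name" into rule set and rule.

def parse_notification(body: str) -> dict:
    """Turn 'Key=Value' lines into a dict, skipping lines without '='."""
    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def split_rule_name(rule_name: str) -> tuple:
    """Split 'Rule Set,Rule' on the first comma only."""
    rule_set, _, rule = rule_name.partition(",")
    return rule_set.strip(), rule.strip()


# A few fields copied from the email above, one per line (assumed layout).
sample = """Categories=Performance
Severity=Critical
Rule Name= Incident Management Ruleset,Incident creation Rule for metric alerts.
Rule Owner=SYSMAN"""

fields = parse_notification(sample)
rule_set, rule = split_rule_name(fields["Rule Name"])
print(rule_set)  # Incident Management Ruleset
print(rule)      # Incident creation Rule for metric alerts.
```

The first value is the rule set to look for in the console and the second is the rule inside it.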
Incident Rule Sets
Log into EM12c and click on “Setup”, “Incidents” and then “Incident Rules”.
You should see your Incident Rule Sets listed.
We’ll take the following information from the Incident email we received:
Rule Name= Incident Management Ruleset,Incident creation Rule for metric alerts.
and then one more section, “Categories” (remember, some Incidents can belong to more than one category):
Categories=Performance
Taken together, these two lines tell us which incident rule is alerting us. What we may not realize is that, by default, critical metric alerts notify on ALL categories, and it is up to you to narrow this down by rule set, by target, by group, etc. This is where EM12c again proves itself to be a self-service product, giving the administrator the power to receive, or NOT receive, notifications on any incident in the way they want.
Armed with this information, we are now going to take this example for our client and edit the rule set-
We highlight the rule set that matches the first part of the “Rule Name” combination and click on Edit.
Then, once inside the rule set, we’ll need to edit the actual rule, which is the second part of the combination offered in the “Rule Name” from the email. Click on the Rules tab, select the rule we wish to edit, then click on “Edit”.
This will take you into the rule’s basic information: the criteria it requires in order to trigger an incident.
By default, most rules are created to be triggered by very simple choices-
- Type of Alert, in this case- Metric Alert
- Severity, in this case- Critical
All other granularity has been left wide open, but you can change this and finely tune the granularity.
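To picture what “wide open” means, here is a small model of how rule criteria behave. This is only a sketch of the idea, not how EM12c implements it; the Event and RuleCriteria classes are hypothetical, and “Availability” is just an example category, not necessarily the one you would pick.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Event:
    event_type: str
    severity: str
    category: str


@dataclass(frozen=True)
class RuleCriteria:
    """Hypothetical model of a rule's selection criteria."""
    event_type: str
    severity: str
    # None means the category was left wide open: any category matches.
    categories: Optional[FrozenSet[str]] = None

    def matches(self, event: Event) -> bool:
        if event.event_type != self.event_type:
            return False
        if event.severity != self.severity:
            return False
        return self.categories is None or event.category in self.categories


loader_alert = Event("Metric Alert", "Critical", "Performance")

default_rule = RuleCriteria("Metric Alert", "Critical")
tuned_rule = RuleCriteria("Metric Alert", "Critical", frozenset({"Availability"}))

print(default_rule.matches(loader_alert))  # True
print(tuned_rule.matches(loader_alert))    # False
```

With only type and severity set, the loader throughput alert from the email matches. Once a category filter is added that leaves out Performance, the same alert no longer triggers the rule.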
Now in the above email, we are told that the Category involved is “Performance”, but we really don’t want this waking us up in the middle of the night, as-
1. This may be a server that regularly has high resource usage.
2. It’s not a critical “pending outage” issue, but an issue that we would need to investigate in the morning or may already be a known issue that is scheduled to be addressed.
To address the emails, we are going to make two changes.
1. Only email on mission-critical outage issues for Metric Alerts.
2. Create a new rule that will create an incident for any categories that are outside of what we want to be notified of for metric alerts.
As you can see in the above example, I’ve added a check-mark for the Category and chosen the following:
I can then click on Next and Save the change.
We still have the other categories that are important to us, but we just don’t want them emailing anymore. I want them to create an incident and I’ll review them when I review my Incident Manager, as I’ve been a strong proponent of using the Incident Manager in this manner.
Create a New Rule for Category Level Metric Alert Coverage
We have now returned to the Rules that make up the Rule Set- Click on “Create” to create a new rule. We are going to create a rule very similar to the one we just edited, (just in case you need an example to use as a reference…) but we are going to choose the other categories and have the rule handle these categories of Metric Alerts incidents differently.
Choose the default radio button, “Incoming events and updates to events” and click Continue, which will take you to the rule wizard.
For this rule, we choose the following:
- Type: Metric Alert
- Severity: Critical
- Category: All the categories we didn’t choose for our rule that is still in place.
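The category choice for the second rule is simply “everything the first rule no longer covers”, and it is worth checking that the two rules leave no gap and no overlap. Below is a sketch of that check. Treat both sets as assumptions: the category list is written from memory of the EM12c console and should be verified against your own release, and the emailing set is a hypothetical pick.

```python
# Assumed list of event categories; verify against your own EM12c console.
ALL_CATEGORIES = {
    "Availability", "Business", "Capacity", "Configuration", "Diagnostics",
    "Error", "Fault", "Jobs", "Load", "Performance", "Security",
}

# Hypothetical choice: categories the edited rule still emails on.
EMAIL_CATEGORIES = {"Availability", "Fault"}

# Categories for the new rule: everything the edited rule no longer covers.
INCIDENT_ONLY_CATEGORIES = ALL_CATEGORIES - EMAIL_CATEGORIES

# The two rules together should cover every category exactly once.
no_gap = (EMAIL_CATEGORIES | INCIDENT_ONLY_CATEGORIES) == ALL_CATEGORIES
no_overlap = not (EMAIL_CATEGORIES & INCIDENT_ONLY_CATEGORIES)

print(sorted(INCIDENT_ONLY_CATEGORIES))
print(no_gap, no_overlap)
```

If a category is missed by both rules, critical metric alerts in that category would no longer create an incident at all, which is exactly the kind of surprise to avoid.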
Click on Next to proceed to the next step of the wizard, where we set the actions to perform when the metric alert is triggered:
We want the EM to ALWAYS perform the actions when a metric alert for these categories is triggered, so leave the default here.
To save us time and energy, I believe in automating the assignment of metric alerts and other common incidents.
- Assign to SYSMAN or a User created for this purpose in the EM12c. You can even assign a specific email address if one person or group is in charge of addressing these types of incidents, (more options, more options… :))
- This is still critical, so assign the priority to “Urgent” or “Very High”.
We want these categories to NOT email, so skip the email section.
Choose to clear events permanently, unless you wish to retain this data.
Proceed to the next section of the wizard, where you can review your Condition and Action Summary.
Click to proceed to the next screen and add a meaningful name for your rule and a meaningful description-
Click on Save and then you will need to click on OK to save the rule to your existing rule set.
By following this process, we have:
1. Removed specific categories from a specific rule so that only critical “possible” production outage incidents go to email/paging.
2. Added a second rule to handle the categories no longer in the original rule and to create incidents for review at an appropriate time.
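The end result of the two rules can be sketched as a tiny routing table. Again, this is an illustration under assumptions rather than EM12c’s actual behavior: the category choices and action strings are hypothetical, and actions_for is a made-up helper.

```python
# Hypothetical picture of the finished rule set: one rule that still emails,
# one that only creates an incident for later review.
RULES = [
    {
        "name": "Incident creation Rule for metric alerts.",
        "type": "Metric Alert",
        "severity": "Critical",
        "categories": {"Availability", "Fault"},  # example choice
        "actions": ["create incident", "send email"],
    },
    {
        "name": "Incident only, no email (new rule)",
        "type": "Metric Alert",
        "severity": "Critical",
        "categories": {"Performance", "Capacity", "Load"},  # example subset
        "actions": ["create incident (owner SYSMAN, priority Very High)"],
    },
]


def actions_for(event: dict, rules: list) -> list:
    """Collect the actions of every rule whose criteria match the event."""
    actions = []
    for rule in rules:
        if (event["type"] == rule["type"]
                and event["severity"] == rule["severity"]
                and event["category"] in rule["categories"]):
            actions.extend(rule["actions"])
    return actions


loader_alert = {"type": "Metric Alert", "severity": "Critical",
                "category": "Performance"}
outage_alert = {"type": "Metric Alert", "severity": "Critical",
                "category": "Availability"}

print(actions_for(loader_alert, RULES))
print(actions_for(outage_alert, RULES))
```

The loader throughput alert now lands only in Incident Manager, while an availability alert still creates an incident and sends the email.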
This could easily be built out so that a unique user is created with an email address used only for paging, and only mission-critical production OUTAGE incidents are assigned to alert it.
One rule could handle this for just the production admin group, one business line, etc., or it could be set to happen only after hours by editing the notification schedule.
You have the power in your hands to build out your Enterprise Manager 12c environment in the way you need it to support you and to do what your business needs to be more productive.