Most of you know I don’t like pink, but I also am not a fan of yellow.
- Yellow in my EM12c summary page
You know what yellow I’m talking about:
Now I’m just as disturbed by the red section of that graph, but today, we’re going to focus on the unknown, the yellow.
If something is down, then you know it isn’t uploading data and is most definitely a critical issue, but what about the unknown? Why is Enterprise Manager reporting something is unknown?
To tackle this type of summary, we first need to know which categories the Unknown targets fall into. By clicking on the Unknown section in the graph, or on its entry in the legend to the right, we can gather the first piece of information we’ll need:
We now know that of our 244 targets that have been marked as “Unknown”:
- 0 are Agent Down
- 158 are Agent Unreachable
- 7 are in Status Pending
- 79 are Metric Collection Errors
Note that the categories that do have counts are also links to a filtered target list for easy viewing.
We’ll start our investigation by clicking on the category Agent Unreachable.
A list this large can quickly get overwhelming, so I recommend clicking on the Target Type column header and sorting ascending. This brings the agents to the top of the list and makes it easier to dig into the ones that aren’t reachable, since the console has pinpointed them as the source of our problem:
Ah, much better! Now you can go through just these agents and inspect what is going on with each of them. You know they are unreachable, so this gives you a short list of hosts to log into and inspect (the host is no longer available, the agent just needs to be started, etc.).
Once you correct these issues, whether by removing hosts that are no longer available for monitoring, starting agents or resyncing agents, you can move on to the next unknown category in the list.
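If you would rather work from a terminal, the same agents-first ordering can be sketched in shell. This is only a rough sketch under assumptions: it expects comma-separated rows of status ID, status, target type and target name on standard input (confirm the column layout of your own export first), the function name agents_first is made up for illustration, and the status text to match is whatever your own output shows.

```shell
#!/bin/sh
# Hypothetical helper: reads CSV rows "status_id,status,target_type,target_name"
# on stdin (assumed layout), keeps rows whose status column equals $1, and
# prints agent rows (target type oracle_emd) first, then the rest sorted by
# target type and target name.
agents_first() {
    awk -F, -v want="$1" '
        $2 == want {
            rank = ($3 == "oracle_emd") ? 0 : 1   # 0 = agent, 1 = anything else
            print rank "," $0
        }' |
        sort -t, -k1,1n -k4,4 -k5,5 |
        cut -d, -f2-                              # drop the temporary rank column
}
```

One possible way to feed it is `emcli get_targets -noheader -format="name:csv" | agents_first "Agent Unreachable"`, assuming that output matches the column order described above in your release.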
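For the agents that just need to be started or resynced, the checks can be sketched roughly as below. Treat it as a sketch under assumptions, not a procedure: the AGENT_HOME path and the function names are placeholders I made up, and it assumes emctl returns a non-zero exit code when the agent is not running.

```shell
#!/bin/sh
# Assumed locations; adjust for your environment.
AGENT_HOME="${AGENT_HOME:-/u01/app/oracle/agent/agent_inst}"   # hypothetical path
EMCTL="${EMCTL:-$AGENT_HOME/bin/emctl}"
EMCLI="${EMCLI:-emcli}"

# On the agent host: report status, and try a start if the status check fails.
# Assumption: emctl exits non-zero when the agent is not running.
check_agent() {
    if "$EMCTL" status agent; then
        echo "agent reports a status; review the output for upload errors"
    else
        echo "status check failed; attempting to start the agent"
        "$EMCTL" start agent
    fi
}

# On the OMS host, after "emcli login": resynchronize one agent,
# named as host:port (for example host1:3872).
resync_agent() {
    "$EMCLI" resyncAgent -agent="$1"
}
```

Run `check_agent` on each unreachable host first. A resync is the heavier option and is usually only worth trying once the agent is up but still not talking to the OMS.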
We have seven targets that are listed as Status Pending. Just as before, we click on our category to enter the filtered All Targets list.
This is a small list, and that’s a good thing: this category not only holds items from many different target types, but the reasons they are in a pending status are countless and each must be investigated thoroughly:
The first item shown in the list is an ASM storage target. If we click on the ASM2 target name in the list, we’ll go to the target to investigate.
Helpful things to try when troubleshooting are:
1. Look at the last collected timestamp and the info tabs to see how long it’s been down and what other targets are connected to the unknown target. These may signal a bug or configuration issue that is causing the target to sit in a status pending.
2. A target may have never been added/configured correctly in the first place, making it more difficult to troubleshoot. If the target is not a parent for another target, consider removing it and re-adding.
If you are unable to remove the target in the console, you can remove the target via the command line interface with the following command:
emcli delete_target -name="<target_name>" -type="<target_type>"
Before you remove it, ensure that there aren’t any targets dependent upon it (a host target can’t be removed while databases, listeners, etc. on it are still part of the EM environment).
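Since delete_target identifies a target by both its name and its type, it helps to look the target up before removing it. The following is a minimal sketch, assuming you are already logged in with emcli and that the % wildcard is accepted in the -targets filter in your release. The function names find_target and remove_target are hypothetical.

```shell
#!/bin/sh
EMCLI="${EMCLI:-emcli}"

# Show targets whose name matches $1, of any type, so the exact
# name/type pair can be confirmed before anything is deleted.
find_target() {
    "$EMCLI" get_targets -targets="$1:%"
}

# Remove a single target once its name ($1) and type ($2) are confirmed.
remove_target() {
    "$EMCLI" delete_target -name="$1" -type="$2"
}
```

For the ASM example, that might be `find_target ASM2` followed by `remove_target ASM2 osm_instance`, where the type is whatever the first command actually reports for your target.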
Exalogic Control for VMs
For the Exalogic target stuck in a pending status, we’re quickly told what the issue is:
We then just need to check the agent version on the Exalogic and if it checks out, then verify that the VM Manager is registered.
Database targets are a more common target type for EM users. If no immediate errors show up when you first enter the database summary screen, I’d recommend clicking on Oracle Database, Monitoring, Status History. This will tell you when the database last had an active status.
As you can see, the database in question has been reporting in a pending status since May 15th, so for one week now. This can offer us the information we need to investigate that time in incidents, email notifications and/or the alert log.
And the incident manager does show that, on that day, a metric error cleared for this database target and it moved to status pending:
If you open this incident, the bottom section, under related events, shows that the target was up and running until 10:47pm on May 15th:
Next, test the monitoring configuration by clicking on Oracle Database, Target Setup, Monitoring Configuration. Click on Test Connection:
Upon testing the connection, we can see this is the issue and needs to be corrected for the database target. Edit the configuration for the database in the database targets list and either upload from the agent manually or wait a few minutes for it to completely synchronize and update the console to correct the status.
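If you don’t want to wait for the next scheduled synchronization, a manual upload from the agent host can be sketched as follows. The AGENT_HOME path and the function name push_upload are placeholders of mine, not anything from this environment.

```shell
#!/bin/sh
# Assumed location of the agent instance home; adjust for your host.
AGENT_HOME="${AGENT_HOME:-/u01/app/oracle/agent/agent_inst}"   # hypothetical path
EMCTL="${EMCTL:-$AGENT_HOME/bin/emctl}"

# Push pending data to the OMS now and, if that succeeds,
# show the agent status so the last upload can be checked.
push_upload() {
    "$EMCTL" upload agent && "$EMCTL" status agent
}
```

After running `push_upload`, give the console a minute or two before refreshing the target’s home page.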
I won’t spend too much time on listeners, but here are some quick tips:
- Verify what version of EM12c you are on and check for any missing patches. There are a couple bugs reported for scan listeners with earlier versions of EM12c, but even with EM12c Release 3, there was one I am aware of.
- Verify that the listener reported is the LISTENER BEING USED by the targets! Often I find that there are multiple listeners on a host that DBAs have created and only one is really being used.
- Check for misconfiguration or manual edits that may be causing the issue with the listener.
All the listeners in our example fell under one of the reasons I’ve listed above and either could be removed, needed to be reconfigured or required a patch.
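To see which listeners are really running on a host before deciding what to remove from EM, something like the following can help. It is a small sketch that assumes the usual `tnslsnr <LISTENER_NAME> -inherit` process command line. The function name running_listeners is made up for illustration.

```shell
#!/bin/sh
# Reads process command lines on stdin and prints the name of each running
# listener. Assumes the first word is the tnslsnr binary (with or without a
# path) and the second word is the listener name.
running_listeners() {
    awk '$1 ~ /(^|\/)tnslsnr$/ { print $2 }'
}
```

On the host, `ps -eo args | running_listeners` prints the names, and `lsnrctl status <listener_name>` then shows which services each one is handling, which you can compare against the listener target reported in the console.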
Metric Collection Error
A metric collection error is when the Enterprise Manager is unable to collect information in a timely manner on a metric from a target.
For our example, this is another long list (79 targets). Once you’ve entered the filtered targets list, you may find that sorting by target type or target name helps you decide where to start digging into each target “grouping” (targets that all belong to one host or family).
For me, working from the agent down is the easiest. As the agent handles the communication from any target to the OMS, it makes sense to me to always sort by agent and clear those first. That will often clear many other targets from the list, and then I can view what is left once I’ve resolved the ones that were tied to the agent issue.
With the sort on target type down to agent, I ended up with only three agents to start my analysis with. Clicking on the first one quickly let me know that the agent was broken and that the best course of action was to log into the host it resided on to troubleshoot further.
As the agent is marked as broken, there isn’t much more you can do from the console side.
Misc. Targets After Agent Grouping Fixed
The list after correcting the broken agents was quick and easy:
- Service Entry Point: Remove it; the post-cleanup was never performed originally.
- Weblogic Welcome Page: They even give you a link to configure it and clear up this one… 🙂
Don’t let yellow in your Enterprise Manager summary page take over your life. As you saw in this example, a high number of targets fell into this category, and the preference should be to avoid that. An Unknown status can be just as critical as a Down status. If you paid attention to some of these screenshots, you will have noticed how often targets with this status were in the Availability category for incident management. Unknown often equaled no status information and no metric data being uploaded, and that is not a good thing.
Managing it when it arises is a small task to keep your Enterprise Manager environment spotless and you in the KNOW instead of the UNKNOWN.