After upgrading to 10.2.0.5 on Linux, our Oracle Enterprise Manager (OEM) would intermittently report that the OC4J was down:
Target type=Oracle Application Server
Occurred At=March 12, 2010 3:09:52 PM MDT
Message=The application server instance is down
Notification Rule Name=Application Server Availability and Critical States
Notification Rule Owner=SYSMAN
If you immediately checked the status of the OEM, every component reported an acceptable status.
I first blamed the introduction of flash and the additional targets being monitored by the OEM. Following numerous recommendations from Oracle and from others affected by the same issue, I extended the interval on the thread timeouts for the alert errors:
# Timeout: The number of seconds before receives and sends time out.
# KeepAlive: Whether or not to allow persistent connections (more than
# one request per connection). Set to "Off" to deactivate.
# Changed parameter to address bug 5717633 KJP, 4/26/10
#KeepAlive Off <-- Commented out for the bug shown above
# MaxKeepAliveRequests: The maximum number of requests to allow
# during a persistent connection. Set to 0 to allow an unlimited amount.
# We recommend you leave this number high, for maximum performance.
# KeepAliveTimeout: Number of seconds to wait for the next request from the
# same client on the same connection.
Unfortunately, this did not correct the problem, and we continued to be paged from time to time without any particular issue appearing to trigger the alert.
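When editing these settings, it can help to confirm which directives are actually active rather than commented out. The following is only a minimal sketch, not part of the original fix: the function name `active_directives` is hypothetical, and the values in `sample` are made-up placeholders, not the settings from my server.

```python
# Directives discussed in the httpd.conf excerpt above.
DIRECTIVES = ("Timeout", "KeepAlive", "MaxKeepAliveRequests", "KeepAliveTimeout")


def active_directives(conf_text):
    """Return {directive: value} for uncommented lines in Apache-style config text."""
    found = {}
    for line in conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comments, including commented-out directives.
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split(None, 1)
        if len(parts) == 2 and parts[0] in DIRECTIVES:
            found[parts[0]] = parts[1]  # a later line overrides an earlier one
    return found


# Hypothetical sample values, for illustration only.
sample = """\
# Timeout: The number of seconds before receives and sends time out.
Timeout 300
#KeepAlive Off
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15
"""

if __name__ == "__main__":
    print(active_directives(sample))
```

The sketch matches directive names case-sensitively for simplicity; Apache itself treats directive names as case-insensitive, so a real check might normalize case first.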
I was finally able to locate the actual source of the problem while digging around deep in the agent for the Oracle Application Server that is part of the OEM.
Through the OEM interface, go to the OEM host > Middleware > Application Server Name:
Availability (%) 99
(Last 24 Hours)
Application URL http://serv3r:3338/
Installation Type J2EE and Web Cache
Oracle Home /u01/app/oracle/product/10.2.0/oms10g
Host : Serv3r
Each link worked well except for one, OC4J_EM, which reported issues. When I clicked on it, I received an error: “Can’t load oc4j_all_instances_rollup”. A quick Google search on “oc4j_all_instances_rollup” returned only two results, but one of them pointed to the OEM XML file that supports this final “up check” for the OC4J processes:
$OMS_HOME/j2ee/OC4J_EM/applications/em/em/WEB-INF/config/webappTargetTypes.xml
I noted that two lines in my file did not match: mine referred to “oc4j_instances_rollup”, not the “oc4j_all_instances_rollup” that the OEM was searching for. Since the web example was very close to my own file, I updated the metric names on those two lines to match it, but only after making a backup copy of the original (always best to keep a copy!).
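The backup-then-edit step can be sketched in a few lines of Python. This is an illustration under assumptions, not the exact procedure I followed (I edited the file by hand): the helper name `patch_metric_name`, the `.bak` suffix, and the XML element used in the demo are all hypothetical, and the demo runs against a throwaway temporary file rather than the real webappTargetTypes.xml.

```python
import shutil
import tempfile
from pathlib import Path

OLD_METRIC = "oc4j_instances_rollup"      # name found in my file
NEW_METRIC = "oc4j_all_instances_rollup"  # name the OEM was looking for


def patch_metric_name(xml_path, old=OLD_METRIC, new=NEW_METRIC):
    """Back up xml_path, then replace every occurrence of old with new.

    Returns the number of occurrences replaced (0 means nothing was changed).
    """
    path = Path(xml_path)
    text = path.read_text()
    count = text.count(old)
    if count == 0:
        return 0  # nothing to change, so no backup is written
    backup = path.with_name(path.name + ".bak")  # hypothetical backup naming
    shutil.copy2(path, backup)                   # keep a copy of the original
    path.write_text(text.replace(old, new))
    return count


if __name__ == "__main__":
    # Demo on a temporary file; the element below is made up for illustration.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "webappTargetTypes.xml"
        demo.write_text('<metric name="oc4j_instances_rollup"/>\n' * 2)
        print(patch_metric_name(demo), "occurrence(s) updated")
```

Because the new name does not contain the old one as a substring, running the helper a second time finds nothing to replace and leaves both the file and the backup alone.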
I then saved the file and reloaded the OEM.
After the OEM reload, viewing the same link in the GUI produced no error, and the response times displayed successfully. The timeout alert stopped now that ALL checks for up status resolve successfully. The root cause was an inaccurately reported error deep in the agent mechanism for OC4J monitoring, which does not reside in the Apache or standard directories we would normally inspect for misconfiguration.