OEM OC4J False Down/Timeouts

After upgrading to 10.2.0.5 on Linux, our Oracle Enterprise Manager (OEM) would intermittently report that the OC4J was down:

Target Name=EnterpriseManager0.serve3r
Target type=Oracle Application Server
Host=mtlincoln
Occurred At=March 12, 2010 3:09:52 PM MDT
Message=The application server instance is down
Severity=Critical
Acknowledged=No
Notification Rule Name=Application Server Availability and Critical States
Notification Rule Owner=SYSMAN

Checking the status of the OEM immediately after each alert showed everything in an acceptable state:

./opmnctl status

Processes in Instance: EnterpriseManager0.serv3r

--------------+---------------+-------+--------
ias-component | process-type  |   pid | status
--------------+---------------+-------+--------
HTTP_Server   | HTTP_Server   | 25823 | Alive
LogLoader     | logloaderd    |   N/A | Down
dcm-daemon    | dcm-daemon    |   N/A | Down
OC4J          | Home          | 25822 | Alive
OC4J          | OC4J_EMPROV   | 25824 | Alive
OC4J          | OC4J_EM       |  3539 | Alive
WebCache      | WebCache      | 25837 | Alive
WebCache      | WebCacheAdmin | 25827 | Alive
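
When an alert arrives, it can be quicker to list only the rows that are not Alive than to read the whole table. The following is a minimal sketch, assuming the pipe-delimited layout shown above; the function name not_alive is made up for illustration and is not part of opmnctl. With the output above it would list LogLoader and dcm-daemon, which were Down on this system while the false alerts were occurring.

```shell
#!/bin/sh
# Print "component/process-type: status" for every row whose status is not "Alive".
# Reads `opmnctl status` output on stdin, so it also works on saved output.
# not_alive is a hypothetical helper name, not an Oracle tool.
not_alive() {
    awk -F'|' '
        NF == 4 {
            # Trim surrounding whitespace from each of the four fields.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            # Skip the header row and any row that reports Alive.
            if ($4 != "status" && $4 != "Alive") print $1 "/" $2 ": " $4
        }'
}

# Usage (from the directory that contains opmnctl):
#   ./opmnctl status | not_alive
```

Separator lines contain no pipe characters, so they never have four fields and are ignored.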

I first blamed the introduction of Flash and the additional targets being monitored by the OEM. Following numerous recommendations from Oracle and from others affected by the same issue, I extended the timeout intervals in the Apache configuration:
$OMS_HOME/Apache/Apache/conf/httpd.conf

#
# Timeout: The number of seconds before receives and sends time out.
#
Timeout 300
#
# KeepAlive: Whether or not to allow persistent connections (more than
# one request per connection). Set to "Off" to deactivate.
#
KeepAlive On
# Changed parameter to address bug 5717633 KJP, 4/26/10
#KeepAlive Off <-- Commented out for the bug shown above
#
# MaxKeepAliveRequests: The maximum number of requests to allow
# during a persistent connection. Set to 0 to allow an unlimited amount.
# We recommend you leave this number high, for maximum performance.
#
MaxKeepAliveRequests 100
#
# KeepAliveTimeout: Number of seconds to wait for the next request from the
# same client on the same connection.
#
KeepAliveTimeout 15
#

Unfortunately, this did not correct the problem, and we continued to be paged from time to time with no identifiable event triggering the alerts.
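
Because httpd.conf mixes active directives with commented-out alternatives, it is easy to misread which values are in effect. A small sketch for listing only the active timeout-related directives follows; the function name active_timeouts is an assumption for illustration, and the directive names are the four shown in the excerpt above.

```shell
#!/bin/sh
# Print only the active (uncommented) timeout-related directives in an httpd.conf.
# active_timeouts is a hypothetical helper name.
active_timeouts() {
    # Lines starting with '#' are skipped because the pattern is anchored
    # at the start of the line and allows only leading whitespace.
    grep -E '^[[:space:]]*(Timeout|KeepAlive|MaxKeepAliveRequests|KeepAliveTimeout)[[:space:]]' "$1"
}

# Usage:
#   active_timeouts "$OMS_HOME/Apache/Apache/conf/httpd.conf"
```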

I was finally able to locate the actual source of the problem while digging around deep in the agent for the Oracle Application Server that is part of the OEM.

In the OEM interface, go to the OEM host > Middleware > Application Server Name:

Status: Up
Availability (%): 99 (Last 24 Hours)
Application URL: http://serv3r:3338/
Version: 10.1.2.3.0
Installation Type: J2EE and Web Cache
Oracle Home: /u01/app/oracle/product/10.2.0/oms10g
Host: Serv3r


Components

Name        | Type               | Current Status
------------+--------------------+---------------
home        | OC4J               | Up
HTTP_Server | Oracle HTTP Server | Up
OC4J_EM     | OC4J               | Up
OC4J_EMPROV | OC4J               | Up
Web Cache   | Web Cache          | Up
 
Each component link worked except one: OC4J_EM. Clicking it returned the error “Can’t load oc4j_all_instances_rollup”. A quick Google search on “oc4j_all_instances_rollup” returned only two results, but one of them pointed to the OEM XML file that supports this final “up check” for the OC4J processes:

$OMS_HOME/j2ee/OC4J_EM/applications/em/em/WEB-INF/config/webappTargetTypes.xml

I noticed that two lines in my copy did not match the web example: mine referred to “oc4j_instances_rollup” rather than the “oc4j_all_instances_rollup” that the OEM was searching for. Since the example was otherwise very close to my own file, I made a backup copy of the original (always best to keep a copy!) and then updated the metric names on those two lines to match the web example.
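
The backup-then-edit step can be sketched in shell as below. This is an illustration under assumptions, not the exact edit made here: the two XML lines are not reproduced in this post, so the sketch assumes the change is a plain substitution of the metric name wherever it occurs. The function name rename_rollup_metric and the .bak suffix are made up for the example.

```shell
#!/bin/sh
# Back up a file, then replace the old metric name with the one the console asks for.
# Assumption: a plain name substitution is the whole edit; review the diff before reloading.
# rename_rollup_metric is a hypothetical helper name.
rename_rollup_metric() {
    file=$1
    # Refuse to run twice so the pristine backup is never overwritten.
    [ -e "$file.bak" ] && return 1
    cp -p "$file" "$file.bak" || return 1
    # Write the edited text back into the original file, keeping its permissions.
    sed 's/oc4j_instances_rollup/oc4j_all_instances_rollup/g' "$file.bak" > "$file"
}

# Usage:
#   f=$OMS_HOME/j2ee/OC4J_EM/applications/em/em/WEB-INF/config/webappTargetTypes.xml
#   rename_rollup_metric "$f" && diff "$f.bak" "$f"
```

The new name does not contain the old one as a substring, so the substitution cannot double-apply to lines that are already correct.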

I then saved the file and reloaded the OEM:
 
./opmnctl reload
 
After the reload, the same link in the GUI no longer returned an error, and the response times displayed correctly. The timeout alerts stopped now that ALL of the up-status checks resolve successfully. The alert had been inaccurate all along: its cause sat deep in the OC4J monitoring mechanism of the agent, not in the Apache configuration or the standard directories we would normally inspect for misconfiguration.

May 3rd, 2010 by

facebook comments:

  • andjelko

    Hi,
    I have the same problem.
    After a server crash (Linux/Oracle NFS), I rebooted the system, but in order to start the repository, OMS, and OMA I had to unlock a bunch of files (control file, datafiles, … emkey.ora, …), and somehow the GC started to work, but …
    After this “crash” I’m getting the same error message as you:
    Notification Rule Name=Application Server Availability and Critical States
    I checked the OMS (./opmnctl status) and everything was up and running (exactly the same problem).
    I took the following steps to fix this problem:
    – in the file $OMS_HOME/jdk/jre/lib/security/java.security
    I set networkaddress.cache.ttl=180 (default -1)
    The system became more stable.
    Meanwhile I unlocked all the files and rebooted the system.
    The problem is gone!
    Unfortunately I’m not sure which step fixed this problem (networkaddress.cache.ttl=180 or unlock/reboot)!
    Sometimes encrypted data in Enterprise Manager will become unusable if the emkey.ora file is lost or corrupted.
    So check the emkey.ora:
    $ emctl status emkey

    Regards,
    Andjelko Miovcic
