Let’s say you’re on call and you’re woken from a deep, delightful sleep by the pager, alerting you that Enterprise Manager Cloud Control isn’t available.
You log into the host and check the status; it tells you that the WebLogic server is up and everything else is down. The host logs show that the servers were restarted unexpectedly, so you want a clean shutdown before bringing Enterprise Manager back up. You shut it down and then attempt a clean start:
$ ./emctl start oms
Oracle Enterprise Manager Cloud Control 13c Release 1
Copyright (c) 1996, 2015 Oracle Corporation. All rights reserved.
Starting Oracle Management Server...
WebTier Could Not Be Started.
Error Occurred: WebTier Could Not Be Started.
Please check /u01/app/oracle/gc_inst/em/EMGC_OMS1/sysman/log/emctl.log for error details
Well, of course you’re going to follow the recommendations and look at the log for errors!
2016-02-22 00:18:32,342 [main] INFO ctrl_extn.EmctlCtrlExtnLoader logp.251 -
2016-02-22 00:18:32,342 [main] INFO ctrl_extn.EmctlCtrlExtnLoader logp.251 -
2016-02-22 00:18:32,360 [main] INFO ctrl_extn.EmctlCtrlExtnLoader logp.251 - Connection refused
2016-02-22 00:18:32,360 [main] INFO commands.BaseCommand printMessage.426 - extensible_sample rsp is 1 message is JVMD Engine is Down
The error we notice is the refused connection. This is odd, and it really doesn’t give us a lot to go on. Logs are our friends, but this time we’re going to move from a log to a message file that may be able to assist us further: the emctl.msg file. There’s actually not a lot of data in this message file, but the health monitor does direct us to what we need:
HealthMonitor Feb 22, 2016 12:18:32 AM
JobDispatcher error: Could not connect to repository
/u01/app/oracle/gc_inst/user_projects/domains/GCDomain/servers/EMGC_OMS1/logs/EMGC_OMS1.out
This points us to the WebLogic domain log directory and to an output file that will offer us the insight we need:
view /u01/app/oracle/gc_inst/user_projects/domains/GCDomain/servers/EMGC_OMS1/logs/EMGC_OMS1.out
Unlike the message file, this output file contains a LOT of information, but there are some really cool entries in here to be aware of for the Node Manager, including environment paths used for each service/process, usernames used for logging in, IP addresses, resource allocation, ports used, AND process information.
Sure enough though, if we view the out file that coincides with the log entries, we see the following:
<Feb 22, 2016 00:18:32 AM PST> <INFO> <NodeManager> <The server 'EMGC_OMS1' with process id 3016 is no longer alive; waiting for the process to die.>
These errors are due to the inability of the OMS (Oracle Management Service) and JVMD to connect to the WebLogic tier processes. Even if you do a clean shutdown and restart, it’s still unable to spawn the WebLogic processes due to secondary processes left orphaned by the unexpected restart.
There are a couple of ways to look for these old processes and clean them up.
- If your EM environment runs as a separate OS user, grep for that user to identify the processes and kill them.
- Look at the path of the middleware home to identify something unique to grep for: /u01/app/oracle/13c/ohs/bin/ (in this case, 13c and ohs are our best bets).
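As a concrete sketch of the second approach, here’s one way to pull the parent PID out of the ps output with awk. The sample data below is inlined so the snippet stands on its own; on a live host you would pipe `ps -ef` directly instead.

```shell
# Sketch: isolate the parent OHS process (httpd.worker) from ps -ef style output.
# The inline sample mirrors typical "ps -ef | grep 13c" results.
ps_sample='oracle 3016    1 0 Jan28 ? 00:02:38 /u01/app/oracle/13c/ohs/bin/httpd.worker
oracle 3019 3016 0 Jan28 ? 00:01:27 /u01/app/oracle/13c/ohs/bin/odl_rotatelogs
oracle 3020 3016 0 Jan28 ? 00:01:10 /u01/app/oracle/13c/ohs/bin/odl_rotatelogs'

# httpd.worker is the parent; column 2 of ps -ef is the PID
parent_pid=$(printf '%s\n' "$ps_sample" | awk '/httpd\.worker/ {print $2}')
echo "parent PID: $parent_pid"
```

The same awk filter works for the first approach, too: match on the OS user in column 1 instead of the binary path.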
Perform the search for the processes:
$ ps -ef | grep 13c
oracle  3016     1  0 Jan28 ?  00:02:38 /u01/app/oracle/13c/ohs/bin/httpd.worker
oracle  3019  3016  0 Jan28 ?  00:01:27 /u01/app/oracle/13c/ohs/bin/odl_rotatelogs
oracle  3020  3016  0 Jan28 ?  00:01:10 /u01/app/oracle/13c/ohs/bin/odl_rotatelogs
Bad, BAD Web Tier! Look at you, leaving all those orphan processes after the reboot! The easiest way to address this is to kill these processes manually, but ensure that you kill only these, not your Oracle Management Repository (the database), the agent, or other processes such as the listener. You are looking for the OHS or WebLogic processes here.
Note that the processes are still running, even though all of EM is down, and they carry a date of Jan. 28th. By killing the parent, 3016, I’ll remove the other two child processes as well, but it’s always good to check whether any were left orphaned.
$ kill -9 3016
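Before restarting, it’s worth confirming that the children (3019 and 3020 here) actually went down with the parent. A minimal sketch of that check, filtering ps output on the PPID column of the PID just killed (the post-kill sample data is inlined for illustration; substitute a live `ps -ef` on the host):

```shell
killed_pid=3016

# Sample post-kill ps output: here the odl_rotatelogs children are gone,
# and only an unrelated process remains.
ps_after='oracle 4102    1 0 Feb22 ? 00:00:01 /usr/sbin/sshd'

# Column 3 of ps -ef is the parent PID; any match means a surviving child.
survivors=$(printf '%s\n' "$ps_after" | awk -v ppid="$killed_pid" '$3 == ppid {print $2}')
if [ -z "$survivors" ]; then
  echo "clean: no surviving children of $killed_pid"
else
  echo "still running, kill these too: $survivors"
fi
```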
Once I’ve verified that everything is clean and no orphaned processes for the EM tiers exist, restart Enterprise Manager:
$ ./emctl start oms
Oracle Enterprise Manager Cloud Control 13c Release 1
Copyright (c) 1996, 2015 Oracle Corporation. All rights reserved.
Starting Oracle Management Server...
WebTier Successfully Started
Oracle Management Server Successfully Started
Oracle Management Server is Up
JVMD Engine is Up
Starting BI Publisher Server ...
BI Publisher Server Successfully Started
BI Publisher Server is Up
All happy again!
The WebLogic log directory holds historical information, too, so don’t despair if you’re looking into something that happened before the last restart. The out files are rotated, retaining 8 in total: the newest has the plain .out extension, and the numbered copies count back down to 00001 for the oldest.
ls -ltr *.out*
-rw-r----- 1 oracle dba   28261 Jan 12 14:30 EMGC_OMS1.out00001
-rw-r----- 1 oracle dba 5120147 Jan 19 06:31 EMGC_OMS1.out00002
-rw-r----- 1 oracle dba 2568825 Jan 22 23:16 EMGC_OMS1.out00003
-rw-r----- 1 oracle dba   25591 Jan 22 23:16 EMGC_OMS1.out00004
-rw-r----- 1 oracle dba 5121593 Feb  4 10:16 EMGC_OMS1.out00005
-rw-r----- 1 oracle dba 4215122 Feb 10 13:21 EMGC_OMS1.out00006
-rw-r----- 1 oracle dba  224605 Feb 22 00:52 EMGC_OMS1.out00007
-rw-r----- 1 oracle dba  233190 Feb 22 16:00 EMGC_OMS1.out
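With that many rotated files, `grep -l` across the whole set beats opening each one to find which file covers an incident. A self-contained demo (the file names and the NodeManager message come from the examples earlier; the temp directory and file contents are fabricated purely for illustration):

```shell
# Demo: fake two rotated out files; only the newer one holds the NodeManager error.
tmp=$(mktemp -d)
echo "server EMGC_OMS1 started" > "$tmp/EMGC_OMS1.out00001"
echo "The server 'EMGC_OMS1' with process id 3016 is no longer alive" > "$tmp/EMGC_OMS1.out00007"

# -l prints only the names of files containing a match
match=$(grep -l "no longer alive" "$tmp"/EMGC_OMS1.out*)
echo "$match"

rm -rf "$tmp"
```

On a real host you’d run the grep in the GCDomain logs directory, then view the matching file just as we did above.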