The other day I had a customer who was asked to send me an ASH report after there was a challenge in providing EM diagnostics from their repository database. When the report hadn't been submitted after a number of days, the customer admitted that they would execute an ASH report request and then kill the session after eight hours or so of non-completion.
Although I hadn't experienced this myself, there are a number of causes behind an ASH report not running, and troubleshooting it is a straightforward process.
The first thing to do is gather some information on the cause of the issue. Running the report repeatedly without looking into the cause isn't going to make the end result any better.
Upon executing the ASH report (either from Enterprise Manager or SQL*Plus, using the $ORACLE_HOME/rdbms/admin/ashrpt.sql script), completion should be seen in a minute or so. If it doesn't complete in that time, inspect the session details and pull the execution plan for the SQL currently running for the session.
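If you need to find the offending session and its plan yourself, a sketch along these lines works (the SQL*Plus program filter is an assumption; adjust it to however the report was launched):

```sql
-- Find the active session running the ASH report and the SQL it is stuck on
SELECT s.inst_id, s.sid, s.serial#, s.sql_id, s.event, s.seconds_in_wait
FROM   gv$session s
WHERE  s.status = 'ACTIVE'
AND    s.program LIKE 'sqlplus%';  -- assumption: report launched from SQL*Plus

-- Pull the execution plan for the SQL_ID identified above
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', NULL, 'TYPICAL'));
```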
In the customer's experience, the ASH report was "hung up" on a step to capture the top SQL using a join between GV$ACTIVE_SESSION_HISTORY and a number of X$ tables. The plan had a number of merge join Cartesians, ending with a result of 3.501 million rows.
The amount of ASH data involved could in no way come up to that kind of row count, and the X$ tables aren't "static" values you can work off of. It became apparent that the stats were definitely awry, but not on the ASH objects. To update the stats on the X$ tables, I had the user collect fixed object stats and then run the ASH report again.
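Fixed object statistics aren't gathered automatically in many releases, so the collection step might look like this (run as a DBA-privileged user):

```sql
-- Gather optimizer statistics on the X$ fixed objects backing the V$/GV$ views
BEGIN
  DBMS_STATS.GATHER_FIXED_OBJECTS_STATS;
END;
/
```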
After collecting the stats, the report ran in a matter of a few minutes, and I could now have the original report I had requested, but I also asked them to run one more for me:
This is the General AWR Information report. I often recommend this report for sizing an AWR Warehouse, but for a very specific reason: many earlier releases of 11g experienced issues with cleanup of ASH data.
There is a section in the report that states the size of individual components in the SYSAUX tablespace.
Here's the output from a "healthy" AWR in a clustered 12c repository:
COMPONENT        MB  % AWR  KB_PER_SNAP  MB_PER_DAY  MB_PER_WEEK  TABLE% : INDEX%
---------  --------  -----  -----------  ----------  -----------  ---------------
SQLPLAN       697.0   30.2        1,028        24.1        168.7        69% : 31%
FIXED         618.5   26.8          913        21.4        149.7        50% : 50%
EVENTS        180.5    7.8          266         6.2         43.7        47% : 53%
SQL            52.1    2.3           77         1.8         12.6        73% : 27%
ASH            48.2    2.1           71         1.7         11.7        89% : 11%
SPACE          41.4    1.8           61         1.4         10.0        72% : 28%
SQLTEXT        16.3    0.7           24         0.6          3.9        98% :  2%
RAC            12.2    0.5           18         0.4          3.0        52% : 48%
SQLBIND         1.3    0.1            2         0.0          0.3        55% : 45%
Here's the output from the customer's AWR in their repository, DB version 11.2.0.2:
COMPONENT         MB  % AWR  KB_PER_SNAP  MB_PER_DAY  MB_PER_WEEK  TABLE% : INDEX%
---------  ---------  -----  -----------  ----------  -----------  ---------------
ASH        #########   67.0      222,497     5,214.8     36,503.5        87% : 13%
FIXED       28,382.3   10.9       36,149       847.2      5,930.6        48% : 52%
EVENTS      19,503.7    7.5       24,841       582.2      4,075.4        40% : 60%
SQLPLAN     16,093.0    6.2       20,497       480.4      3,362.7        70% : 30%
SPACE        1,460.4    0.6        1,860        43.6        305.2        67% : 33%
RAC            552.6    0.2          704        16.5        115.5        59% : 41%
SQL            100.6    0.0          128         3.0         21.0        69% : 31%
SQLBIND          6.0    0.0            8         0.2          1.3        50% : 50%
SQLTEXT          4.2    0.0            5         0.1          0.9        96% :  4%
Although both Enterprise Managers are on the same release, both are on RAC, and both have the same interval and retention time set for AWR, the customer's 11.2.0.2 repository is retaining much larger amounts of data, and ASH is by far the largest contributor to space allocated (to the point that it's more than the report was formatted to display…)
There are a number of MOS notes covering what may be the cause, but there are helpful steps to follow to diagnose the specific challenge that's causing the unbridled size demands for AWR and/or ASH in your own environment.
One of the most important is to follow the appropriate steps to isolate what is consuming space, often in the way of partitions not splitting properly, and to manually purge data, as seen in the following MOS Doc:
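Before purging anything, it's worth confirming which occupant is actually consuming SYSAUX. A minimal sketch, with hypothetical snapshot IDs that you would replace with values from DBA_HIST_SNAPSHOT:

```sql
-- Confirm which component is consuming SYSAUX
SELECT occupant_name, space_usage_kbytes
FROM   v$sysaux_occupants
ORDER  BY space_usage_kbytes DESC;

-- Manually purge a snapshot range once the offender is confirmed
BEGIN
  DBMS_WORKLOAD_REPOSITORY.DROP_SNAPSHOT_RANGE(
    low_snap_id  => 100,   -- hypothetical; pick from DBA_HIST_SNAPSHOT
    high_snap_id => 500);  -- hypothetical
END;
/
```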
The Oracle Management Repository (OMR) is a busy database, having to serve the Oracle Management Service, the web tier, the targets and anything else that EM expects of it. If undue pressure is added by other features, as we see here with ASH and AWR data, it is important to dig down into the cause and address it. This issue was impacting our ability to even pull diagnostics data to offer an intelligent recommendation on upgrade options. By investigating the problem in a logical manner, we were able to quickly diagnose the cause of the problem and correct the issue.
This question was posted on Twitter by @matvarsh30, who asked, "How can I display CPU usage over different periods of time for databases in Enterprise Manager?"
Everyone loves their trusty Top Activity, but the product's functionality is limited when it comes to custom views, and this is what our user had run into. There are numerous ways to display this data, but I'm going to focus on one of my favorite features in the product, created to replace Top Activity: ASH Analytics.
Note: This process to display CPU graphs will work for EM12c and EM13c. Other than the location of the target menu, not much else has changed.
The default display is for one hour, and ASH Analytics is dependent upon AWR data. So although 8 days of detailed information is easy, it is important that you set your retention in the source (target) databases appropriately to ensure you're able to view and research past the default 8-day retention of AWR in any database. I am a firm believer that if you have the Diagnostics and Tuning Packs for your EE databases, you should be getting the most out of these tools and up the retention time from the default by running the following command via SQL*Plus with the appropriate privileges:
BEGIN
  DBMS_WORKLOAD_REPOSITORY.modify_snapshot_settings(
    retention => 86400,  -- in minutes; 86400 minutes is 60 days
    interval  => 30);    -- in minutes; only change from 60 if doing workload tests, etc. A 60-minute interval is sufficient.
END;
/
Now you have not just the EM metric data that rolls up, but the AWR data for ASH Analytics to do deep analysis and reporting.
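To verify the change took effect, the current settings can be checked in the workload repository control view:

```sql
-- Both values are displayed as intervals, e.g. +00060 00:00:00.0 for 60 days
SELECT snap_interval, retention
FROM   dba_hist_wr_control;
```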
With this in place, let’s create some graphs that answer the question – “How do I display CPU usage for a database over one week, 10 days, 30 days, 60 days?”
Once logged into any database, you can access ASH Analytics from the Performance drop-down. If you have an 11g or earlier database, you may have to install the package that creates the EMCC views, but this will need to be done to utilize this powerful tool in Enterprise Manager. ASH Analytics works in database versions 10.2.0.4.0 and above.
Logging into ASH Analytics will display data for the instance based on one hour. To change to a one-week view, you'll simply click on "Week" and then stretch the displayed view in the bottom section graph out to encompass the entire week:
Using this example, you can see that I'm now showing a similar graph to Top Activity, but for a whole week and without the aggregation that Top Activity sometimes suffers from.
We're not going to stick with this view, though. Leaving it on "Activity", click on the Wait Class drop-down, go to Resource Consumption and click on Wait Event (it's on Wait Class by default).
As you can see on the right side, there is an overlap in the legend that needs to be fixed, (I’ll submit an ER for it, I promise!) but luckily, we’re focused on CPU and we all know, in EM, CPU is green!
When we highlight the green, it turns a bright yellow. Now that you have chosen this by hovering your cursor over it, double click to choose it. The graph will now update to display CPU:
You now possess a graph that displays all CPU usage over the last week vs. total wait classes. You can see the overall percentage of activity in the table on the left-hand side, and at the bottom right it's even displayed by top user sessions. You can also see the total CPU cores for the machine, which can offer a clear perspective on how CPU resources are being used.
Now you may want to see this data without the “white noise”. We can uncheck the “Total Activity” box to remove this information and only display CPU:
We could also choose to remove the CPU Cores and just display what is being used:
By removing the top core info, we see the patterns of usage, and just the cores used, much more clearly. We could also decide we want to view all the wait classes again, without the CPU Cores or Total Activity. The only drawback is the overlap in the legend, (I so hate this bug in the browser display….)
Now, as requested, how would you do this for 10, 30 and 60 days? In the top view, you'll note that you're offered a view by hour, day, week, month and custom. As many months have 31 days in them, you may choose to use the custom view for all three of those requests, and a custom request is quite simple:
Yep, just put in the dates that you want and then click OK. If you have already stretched your window from beginning to end in the lower view, don't be surprised if it retains this view and shows you all the data. All of it will display, that is, if your example database was active during that time… 🙂 Yes, the database I chose, as it's from one of our Oracle test environments, was pretty inactive during the Christmas and January time period.
And that's how you create custom CPU activity reports in ASH Analytics in Enterprise Manager!
SQL Monitor is one cool tool. Via EM12c, it empowers everyone, (yes, that means the developer, too, so give them the access to view this data via Cloud Control, please!) This is a top-five list, but it doesn't stop there; just remember, these are my favorite features when using Cloud Control with SQL Monitoring.
Once logged into the database, you can gain access to SQL Monitor by clicking on Performance –> SQL Monitoring.
The main view of active and inactive processing (just remember, SQL Monitor processes will age out, so keep in mind that this tool isn't for looking at historical processing) is a great place to start when we talk about value to your day-to-day job.
From the main view, you can easily see what sessions are active, the SQL_ID, whether parallelism is in use and the degree of parallelism (DOP) granted, database time (broken down by wait event percentage) and IO requests. There is a ton of pertinent data here, including plan hash values, start and stop times, plus the SQL text, but everyone has to admit, this is what we all want and need to know.
By clicking on any of the SQLIDs displayed in the SQL Monitor main page, you can then dig into the details surrounding a SQL process.
For the given SQL_ID, you can view the duration and type of wait event resource consumption. You can quickly see what the cost is and, with the graphs added, quickly identify which performance challenges a particular execution plan is most hindered by. There are arrows to help the DBA or developer navigate the plan, to understand which pain points are most evident and where the plan connects from bottom to top as the process filters through data to come up with its results.
Yes, you can use a query to see the offload percent for a process, but SQL Monitor is nice enough to show you right in Cloud Control-
If you’re working on an Exadata and want to make sure you’re using all those great engineered system features you paid for, it’s nice to know you’re offloading to a cell node vs. doing the work on your database node that should be used for other work than scanning objects.
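If you do want the query version, a commonly used sketch against V$SQL looks like the following; the offload-efficiency formula shown is one convention, not the only one, so treat the percentages as approximate:

```sql
-- Approximate offload efficiency per SQL_ID on an Exadata system
SELECT sql_id,
       io_cell_offload_eligible_bytes,
       io_interconnect_bytes,
       ROUND(100 * (io_cell_offload_eligible_bytes - io_interconnect_bytes)
             / NULLIF(io_cell_offload_eligible_bytes, 0), 2) AS offload_pct
FROM   v$sql
WHERE  io_cell_offload_eligible_bytes > 0
AND    sql_id = '&sql_id';
```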
Knowing how many resources are going to a process is essential to knowing how to eliminate extra resource usage.
Under Metrics for each SQLID captured by SQL Monitor, you’ll be able to view the amount of CPU, PGA and IO used by the process. You’ll even be able to tell what kind of waits you’re experiencing if you don’t use Google Chrome, (I’m joking…OK, maybe I’m not, but you understand browser challenges, I’m sure… :))
By clicking on the User from the main view, I'm able to get into the session details. The actual session ID won't be displayed (in this case, we just see SH for each session), but we can click on the link, thereby displaying vital info about the session details.
We can quickly link to the SQL_ID, view whether there are any sessions blocking the one we are investigating, and see pertinent wait information including the P1, P2 and P3 waits.
If I click on the bottom link in the WAIT section, it will take me to the object that is causing the most waits, which happens to be the SALES table.
I can now view column and constraint info without having to go to the schema feature within Cloud Control.
I can also view information about indexes and statistics, which no longer is accessible from SQL Details in Cloud Control.
As a bonus, knowing how many of us like to see tabular vs. graphical execution plans, it may be worthwhile to also view them in graphical form. Below, as I hover my cursor over the sections highlighted with yellow lines from the Sales table, and then red for the filtering, you can note that it clearly displays 13 million rows having a nested loop performed out of a 26-million-row table. I think most of us know that a hash join would have performed much faster, and this is why the section has been highlighted in red as the pain point for the SQL in question.
Just a little bonus for those of you still as hooked on tabular execution plans as I am… 🙂
Have a great week after the long weekend and hope your Thanksgiving was wonderful!
So I'm going to start this post with an admission: I don't have access to a cloud environment to test this out, but I know what I would do first if I experienced slow response time on database creation or cloning via EM12c to the cloud, and I would like to at LEAST post what I would do, to give others the chance to test it out and see if it offers them some help.
Knowing what is causing slow performance is essential to troubleshooting any problem. I've had a few folks come to me with complaints that their hybrid cloud clones or database creations in the cloud are slow, but from the little data I've seen in the exchanges, it's starting to give telltale signs that the cloud isn't the issue. There are some Enterprise Manager features in recent EM12c releases that may be able to assist, and that's what we'll discuss today.
Although log data and diagnostic utilities are crucial when I'm troubleshooting anything, simple visuals based on this data can be very helpful. The Page Performance feature for the Repository is one of those, but there is one caveat: the page that is experiencing slow performance (i.e. the database creation wizard with Cloud Control to the PaaS in the cloud, etc.) must be run at least two times in the last 24 hours to be captured in this tool.
The data provided by this tool is based on a considerable amount of what we collect via the EM Diagnostics utilities, so although it doesn't provide as deep a diagnostic as the utility does, it may shed some high-level light on what you're facing.
To access the page, you need to be in the OMR (Repository) target, which is commonly accessed via Setup, Manage Cloud Control, Repository. From there, click on OMS and Repository, Monitoring, Page Performance. You'll see the second tab is Page Level Performance, which has the data that we're about to go deeper into today.
Upon accessing the dashboard, you’ll quickly notice that many pages have very different requirements to produce the output that you view in the console pages, (i.e. not all plugins and collections are created the same… :))
Now I want you to focus on the headers to note the data collected:
Cloud Control breaks down the page processing time for a number of different areas-
The one I'm going to focus on is the Processing Time in Browser or Network. Note that there are sort options for each of the Avg and Max totals collected in each column.
If we sort by Max Processing Time in Browser/Network, we'll choose the first entry in the list that contains an active report (the others are stale, so those pages aren't links you can click on):
Again, I don’t have any cloud access to demonstrate this on and as you can see, the RAC Top Activity is what we’ll use for this example of how we can inspect wait information on page response.
The top of the page clearly shows the time distribution across Database, Java, Agent and Network/Browser. You can see in the lower section (actually to the far right in the console) the amount of seconds allocated to each type of task by the process, which for our example is considerably Java time. If there is SQL involved, it also breaks down the execution plans and wait events.
For this example, there is a small amount of network time, (this is the “High Availability” environment at my work, so it’s a pretty buff setup) but it does display the light pink section that shows there are some waits for the network and/or browser.
If we were inspecting a cloud usage process, we should see the time distribution for the network, and I'm pretty sure we would see that displayed clearly in these pages (again, this is an assumption on my part, as I haven't had the first-hand experience to investigate it…)
If you are working in the Oracle cloud and are experiencing slowness, run the cloud process via the console a couple times, (even if it does timeout) and consider using the Page Performance feature to make a quick inspection of the report. Although the example above clearly shows how much time is being spent on Java, you may find for the cloud, you’re dealing with network slowness that requires some investigation into firewall, DNS resolution and other challenges, but this report may quickly show the Oracle cloud as the innocent party.
Creating a trace file from EM12c is quite easy and doesn’t require a DBA offering up the world to allow a developer or support person to perform this action.
Enterprise Manager also removes some of the syntax options that could get a developer or app support person who is less specialized in creating trace files into trouble, while still providing all the options without requiring knowledge of the syntax, which creates an option for everyone to be successful.
The proposed solution that I'll post here is one where the EM12c user will be able to create trace files, but the DBA will control the OS-level files (delivering the file to the user after the trace file is created). The user will log into the database from Cloud Control with a defined account and will have no target access granted previously.
Connect to your PDB and create your user that you will be using for tracing….
create user dev_trace identified by <password>;
grant connect, select_catalog_role, create session, alter session to dev_trace;
Limit the size of trace files for the database to ensure that no trace file can be forgotten and grow unconstrained:
alter system set max_dump_file_size =40960;
Once this is complete, you can create your database target credentials for your developer role that will now use this new login and limit their rights in the target as they login.
Grant the following privileges at the “Target Type Privilege” for the new user:
Monitor Enterprise Manager: Monitor Enterprise Manager performance
Add any Target: Add any target in Enterprise Manager
View any Target: Ability to view all managed targets in Enterprise Manager
Yes, you do have the right to switch from "View any Target" to a list of specified targets. I just figured I'd lessen the steps here, so do what fulfills your IT security model.
Add the new role that you've created for your tracing user and then complete the wizard creation for a new user in Cloud Control.
Once the user logs in (and updates their password… :)) they should be able to log into a target and view performance data. For our example, we'll use the following database scenario, where the developer or app support user is concerned about "Other" in Top Activity. They've been given a high-level overview of Top Activity wait events and know that green (CPU) is expected, dark blue (IO) often occurs, and that colors of red (application or concurrency), brown (configuration or network), orange (commit) and pink (other) should be monitored more closely.
Now tracing is a session-level process that creates a trace file in an OS (Operating System) directory. For Oracle 11g and higher, this creates the file in the Automatic Diagnostic Repository (ADR) home. The base location can be easily located by the parameter DIAGNOSTIC_DEST + /diag/rdbms/<DB_NM>/<INSTANCE_NM>/trace.
show parameter DIAGNOSTIC_DEST;
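To find the exact trace file for a given session, a quick sketch like the following can be used (V$PROCESS exposes the trace file name in 11g and higher):

```sql
-- Locate the trace file for a given session (11g+)
SELECT p.tracefile
FROM   v$session s
       JOIN v$process p ON p.addr = s.paddr
WHERE  s.sid = &sid;
```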
In our above Top Activity example, we can see the sessions on the right hand side and see a couple that are of concern. Since we’ve decided that we need to inspect the “Other” waits more, we will look in the Session ID column and double click on the 1107 Session ID.
This brings us to the details page for the 1107 session:
We have a number of tabs in the Details page to inspect, but let’s say we want to create a trace file to dig deeper into the ENQUEUE waits instead. To the right, at the upper and lower section, you can see the button “Enable SQL Trace”. As the user has the privileges to “Alter Session”, they can click this and it will step into the creation steps for a trace file.
This is like performing a 10046 trace. As you can see, it's very straightforward and simple vs. executing the SQL*Plus commands that would be required. For both options set to "No":
execute dbms_support.start_trace_in_session (1107, 39645, waits=>false, binds=>false);
To create the option with the wait information set to “Yes” we’d update our request to the following:
execute dbms_support.start_trace_in_session (1107,39645,waits=>true);
And for both to be set to “Yes”, we’d execute this command that has both:
execute dbms_support.start_trace_in_session (1107,39645,waits=>true, binds=>true);
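As an aside, DBMS_SUPPORT is an unsupported package that may need to be installed separately; the documented alternative since 10g is DBMS_MONITOR, which achieves the same thing for this session:

```sql
-- Documented equivalent of the above, with waits and binds both enabled
BEGIN
  DBMS_MONITOR.SESSION_TRACE_ENABLE(
    session_id => 1107,
    serial_num => 39645,
    waits      => TRUE,
    binds      => TRUE);
END;
/

-- And to turn the trace back off:
BEGIN
  DBMS_MONITOR.SESSION_TRACE_DISABLE(
    session_id => 1107,
    serial_num => 39645);
END;
/
```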
Back to our Cloud Control option, (the easier option) once we decide to gather the wait and bind information, we can click “OK” and the trace will subsequently begin.
Once we’ve finished tracing, we should remember to turn the trace back off, (which is the reason for the DBA setting the max dump file size to have a safety net for those that forget!)
The option will then show right in the session, even if you leave and come back to the session in Cloud Control. You can click on it and disable the trace:
If you were to forget, the DBA could easily locate the session being traced and issue the following statement from the command line to disable it:
execute dbms_support.stop_trace_in_session (1107, 39645);
Now this is where your DBA comes in and you need to make friends with them. Most DBAs are not going to be comfortable granting access to OS level files and I don’t blame them. It’s best if you don’t have to worry about this access and better if they just retrieve the files for you. Before they do, you want them to convert the trace data into a readable report. This is done via the TKPROF utility.
The trace file should be easy to locate on the server in the ADR directory for trace files (the directory is called "trace", duh… :)). There are two files created as part of your trace: the .trm trace mapping file, which you won't need, and the .trc trace file, which is the one required to create the report.
The DBA will run the following to create the report:
tkprof <trace_file_nm>.trc <output_rpt_nm>.txt
They can SCP or FTP the output file to their workstation and this is the one that you’ll want them to send to you.
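A couple of commonly used TKPROF options can make the report easier to digest; sorting by elapsed time and suppressing recursive SYS statements are typical choices:

```shell
# sort=exeela,fchela orders statements by execute/fetch elapsed time;
# sys=no suppresses recursive SYS statements
tkprof <trace_file_nm>.trc <output_rpt_nm>.txt sort=exeela,fchela sys=no
```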
And there you have it- creating a secure user that can create trace files and then how to create the trace files once you have the access. Now you DBAs and developers make friends and bring each other donuts and maybe buy each other a drink… 🙂
When you need information about one SQL_ID and don't need everything and the kitchen sink, there are a few different ways to collect this via Oracle. I'm surprised how rarely this is covered in performance tuning/optimization (whatever the current "acceptable" term is for fixing a database when performance issues arise… J) classes, manuals and documentation. I'd like to cover two of my favorites here and know, there are other options that I won't be digging into (SQLd360, SQLTXPLAIN, tracing, etc.).
I also refer to this as the "SQL ID Specific AWR Report". Many of you may have come across this report via Cloud Control: when you go into a unique SQL_ID in the console and look at the details, you notice there are two options below the wait event activity, one to run an AWR SQL Report and one for an ASH report. What many don't realize is that this report is also available from the command line on the database host.
Report Location: $ORACLE_HOME/rdbms/admin/awrsqrpt.sql
Report Format: HTML from Cloud Control and the command line; TEXT is also an option from SQL*Plus.
Requirements: Execute privileges on DBMS_WORKLOAD_REPOSITORY and privileges to run reports and store them on the host; if you run it via Cloud Control, you can simply save the HTML report to your desktop.
Licensing: Yes, you do need diagnostics pack for this feature.
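Running it from the command line is just a matter of invoking the script from SQL*Plus on the database host with a suitably privileged login; the script then prompts for everything it needs:

```sql
-- From SQL*Plus on the database host:
-- prompts for report type (html/text), days of snapshots to list,
-- begin and end snap IDs, the SQL_ID, and a report file name
@$ORACLE_HOME/rdbms/admin/awrsqrpt.sql
```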
This report, like "Search SQL", which we'll go over soon enough, allows you to go through all of the AWR history you possess and learn the minimum and maximum runs, the execution plans for each plan hash value that exists, along with elapsed time and other pertinent statistical information about the SQL_ID in question.
Notice that the version of the database, the fact that it's a RAC instance, and which node in the RAC environment the statement was run on are all quickly stated, so if anyone questions this, you can quickly locate the information at the top of the report.
The report will display the SQL_ID in question, and identify whether there was more than one plan hash value (the report numbers each plan and identifies it by its PHV, or plan hash value, ID). This makes it easy to know which execution plan and events belong to which, if there is more than one. For our example, we only have one here.
Remember, the elapsed time is the total for the snapshot timeline involved; you need to look at the per-execution figure to see how long each run actually required to complete. These statistics are very helpful when identifying the wait event(s) involved in the performance challenge.
The plan is nicely formatted and you can step into the time used for each step in the process. I wish it defaulted with temp usage, but alas, that is not to be in the HTML report. This is the only area where I prefer the HTML report over the text, as otherwise there is a considerable size difference for storing the report, and I hate giving up any data… 🙂
If you’ve had an outage during your snapshot timeline, then this report, just like AWR, is dependent upon up time and will notify you of the issue, requiring you to pick snapshots that have no downtime involved.
Search SQL has been around since EM10g, but it’s one of the least known features. I like it as it gives a clean, high level view of performance over a large period of time by snapshot and then you can dig down to do research from there.
Report: None really from the command line- it’s a compilation of report features.
Report Format: None; this is via the console, but some of the reports can be saved off (such as ADDM and other reports…). You can also copy and paste or screenshot pages.
Licensing: Due to the way we are using it here, both the diagnostics and the tuning pack will be required.
Using our same run as we used above, we can identify a SQL_ID that is causing us concern and we want to investigate further-
Copy the SQL_ID so you can paste it into the proper field in the Search SQL console. You're now going to leave Top Activity (or ASH Analytics, whichever you are using; both work great for this type of initial identification…).
Each Search SQL console is database specific, so click on the Performance drop-down, SQL, and then Search SQL for EM12c (in EM10g and EM11g, this is directly under the Performance menu…).
Update the following sections:
Four tabs will be shown. Don't be surprised if you don't notice the completion, as the cache often only returns one row:
You CAN examine the existing run, and looking at it this way really shows you how it's currently running vs. how it has run, but if you're just looking for when it really ran well, then skip the cursor section and jump to the AWR snapshots to see this SQL_ID for ALL snapshots in the AWR-
You're seeing all of this from the largest elapsed time to the least, descending, so although we're able to view any patterns in how the SQL may be degrading over time, note whether there are plan hash value changes (which, in our example, there are, and we didn't see that for this statement in our AWR SQL Report) and click on the "Next 25" link at the bottom to get to the lesser elapsed run times.
There are some runs with 0.00 run times. You can investigate these, but we'd find that they hadn't completed or no rows were returned. What is more important are the runs where we see overall run times that are efficient in elapsed time and that returned rows. I've marked them in our report.
Upon closer inspection, you'll also notice that they all belong to the same plan hash value except for one run. This is a good indicator that this plan hash value should be identified as the best overall for this query.
Now as we all know, it’s not just about the one query. The database could impact the run as well.
On the right, you'll notice that the snapshots are links. You can use these to inspect the overall performance of the database during the snapshot interval in which the execution occurred.
Here’s all that high level statistics info at the database level. You also have the opportunity to run an ADDM report from here and see that “all that’s important” view of your database.
Now the second tab is going to offer you the workload repository report for this database, including the summary to tell you whether there is anything up in this database at that time that you should be aware of.
You can go back to the Search SQL report and then look at the individual executions listed, too.
This will take you to the SQL Details page for whichever one you click on-
Just as with any SQL Details, you can see the activity, the plan, plan control, etc. The example above is for an execution from the day before, so it’s kind of an easy way to get to it vs. going and searching historically through Top Activity.
Now why did we choose the Baselines as part of our search? Baselines, especially when they are automatically implemented, can impact performance at times, and knowing this information can quickly tell you whether another culprit was involved.
We can see which plan hash value is attributed to which baseline and the overall elapsed time, and you can click on the Baseline Name link to see when and if it was implemented, to diagnose whether it's part of your problem.
Hopefully this gives you some more tools for your arsenal when that boss or peer is standing at your desk and demanding an answer to a challenge- what more can we ask for but answers! 🙂
While presenting at HotSos in Dallas, TX, an attendee (thank you, Wenju! :)) asked how page performance issues could be identified when the call was being made remotely and not from the Oracle Management Repository (OMR), using the Page Performance console within Cloud Control.
This is a very good question, and although not straightforward, I would like to demonstrate what is available to assist in answering the challenge.
First, I have to add that the answer is provided using a recent release of Enterprise Manager 12c, so if you are still running an earlier version, your results may not render as successful an outcome, but I will also provide you with a command line option to help those of you on those earlier releases, too.
The Page Performance Console is accessed once you’ve entered the Cloud Control Management feature from the Setup –> Manage Cloud Control –> Health Overview. After entering the Health Overview, at the left, top menu, click on OMS and Repository –> Monitoring –> Page Performance. There are three tabs available in this console and using the second tab, Page Level Performance, you will view the list of pages that are experiencing the longest waits, in descending order.
Using the first one in the list, we note that this is the AWR Warehouse. The calls are made to a remote database, (the AWR Repository is not stored in the same database as the Oracle Management Repository, so of course, the calls are 90%+ remote…) so it’s an excellent choice to use as an example.
Clicking on the link for the page, I'm brought to the Page Performance Report that has all the details on the performance waits for the console page. As I've been a heavy user of this, it's no wonder it's showing up at the top in page processing time.
The initial section clearly shows us how much of the performance issue is derived from java waits. Knowing that this is the case, you would think that we could just bypass this, but having the data that verifies the culprit behind the waits is essential for a DBA. We should never assume or guess when it comes to performance, and this is just one more valuable feature of EM12c that eliminates guesswork from the demanding challenges users pose to you each day. To make the above section easier to understand: if there were additional time distribution for database, agent or network/browser time, it would be displayed similarly to the top sessions and top SQL in Top Activity:
When you are viewing the data in the Page Performance report, the translation from Top Activity for the DBA is:
Time Distribution = Activity(%)
Even though the database calls are completely insignificant, the data is still provided as part of the report. Just saying the remote calls are insignificant is not enough; we need to see why they were deemed insignificant. If we want to see how our distributed (aka remote) calls contribute to the time consumed by the total process, all we have to do is scan down to the Top SQL part of the report to investigate.
Note that no SQL_IDs are displayed, even though these are all queries. The reason is that the OMR is not able to pull the SQL_IDs for these remote calls to the AWR Warehouse being queried by Cloud Control. What is displayed is the time CONSUMED by each of the executions until the results are returned to the console, and no Top SQL is listed, as nothing was over 10ms. As I've stated, there is incredible value in knowing that database time is still not the main consumer of the remote calls. The second piece of valuable information is displayed at the very right of the image: the remote session IDs, which can be used if you wish to investigate this via EM12c in the AWR Repository database with an ASH report for the time this executed, or via another session search option, (there are a number of options available.)
If you are on an earlier version and don't have the Page Performance Report available to you, there are great command line tools available via the EM CLI. The following graphic takes you through the steps, and the translation here for the DBA is that it's similar to TKPROF: you collect the session ID, create a trace of that session, stop the trace, then create the report from the trace file.
Although the report isn't as aesthetically pleasing as what I've shown you from the Page Performance Report, you are still provided the data you need to answer those important performance questions about console page response, and this data can be very valuable if you are looking into an issue or working with Oracle Support.
Active Session History, (ASH) reports are one of my favorites when investigating a performance issue. The biggest challenge facing many administrators, I think, is knowing when to turn to an ASH report and how to use it to diagnose an issue.
Today's example is brought to us thanks to a Workload replay and should be a simple, but clear, step-by-step look at when and why to use an ASH report.
In our EM12c environment, we note some high activity in a database and use Top Activity to view what’s going on:
As a DBA, I’m not a fan of pink, red, brown or worse, black, in my Top Activity. I’m displaying the information from the Top SQL and Top Sessions from just before 3pm and we can see that “Other” was the wait event that is shown to be most impacting in our Top SQL.
We can then double click on the highlighted Top SQL session to dig down a bit more into that unique SQL_ID.
Noting that this is an insert statement, I can then scan lower to see more detailed information about the SQL statement, waits and sessions involved:
The top wait is Buffer Exterminate. This wait event might not be familiar to everyone, but we're going to work through this post, investigate the information provided in the ASH report and experience the type of impact this wait event has on the database.
You’ll notice that the SQL Details page offers you two reports that you can run- AWR SQL Report and the ASH Report. These reports can both be found in the $ORACLE_HOME/rdbms/admin directory.
The AWR SQL_ID-specific report is named awrsqrpt.sql and the ASH report is named ashrpt.sql, if you're interested in running them from the command line via SQL*Plus instead of using EM12c.
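If you'd rather skip the console, a minimal SQL*Plus sketch for kicking off both scripts looks like this (the `?` shorthand expands to $ORACLE_HOME):

```sql
-- Run from SQL*Plus as a DBA user; each script prompts interactively for
-- report type, begin/end snapshot or time window and, for awrsqrpt.sql,
-- the SQL_ID of interest.
@?/rdbms/admin/ashrpt.sql
@?/rdbms/admin/awrsqrpt.sql
```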
We are going to choose to use the ASH report instead of the SQL_ID-specific AWR report OR a full AWR report for a couple of reasons:
Running an ASH report after clicking on the button is quite easy. Not only is the time pre-filled-
we also have the SQL_ID filter to give us more accurate information about this one SQL_ID. You can change the window for the beginning and end time if you’d like to extend it, but know that ASH is much more effective for smaller snapshots of time.
Once satisfied with the criteria for the report, click on Run Report and the HTML version of an ASH Report will display in short order.
The first thing to remember is that the report is generated by database node if you are in a RAC environment. The left hand Target Navigation will show you what target node the report has been sourced from:
The report starts out with basic information about the environment the report was run in:
The first section we jump to, which we almost always jump to, is the top wait events:
I find it difficult to spend much time on anything that comes up under 10%, (at least it's a pretty good rule of thumb to follow…) but as you can see, we are showing over 86% of the event waits on buffer exterminate.
If we go down to the second section we go to most often, the top SQL, we’ll see more verification of this issue:
For SQL_ID 67sjxrgv720p4, we can see that over 86% is spent on the wait event buffer exterminate.
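As a cross-check, a sketch of a direct query against the in-memory ASH samples for this SQL_ID could look like the following (standard gv$active_session_history columns; the one-hour window is an assumption for illustration):

```sql
-- Break down recent ASH samples for one SQL_ID by wait event;
-- NULL event means the session was on CPU for that sample.
SELECT NVL(event, 'ON CPU') AS event,
       COUNT(*)             AS samples,
       ROUND(100 * RATIO_TO_REPORT(COUNT(*)) OVER (), 2) AS pct
  FROM gv$active_session_history
 WHERE sql_id = '67sjxrgv720p4'
   AND sample_time >= SYSDATE - 1/24
 GROUP BY event
 ORDER BY samples DESC;
```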
We can also view the sessions involved with this SQL_ID and if there were any Blocked Sessions, that is listed in its own section. No blocked sessions here, so we can go onto more important research.
Objects most impacted by the waits for this statement are also displayed, and if this had been part of a stored procedure or other internally stored code, that would have been shown, again, in its own section.
At the bottom of the report, we see what top waits happened in succession during the time we chose to examine the ASH samples:
As you can see, the samples are each about one minute of an approximately five minute examination of the ASH sample time. At the 3pm mark, (just as was displayed in Top Activity) we can see the impact of the buffer exterminate wait, then the waits for CPU and cluster GC following the "hit".
So what is Buffer Exterminate? A lot of folks like to refer to this as "SGA Thrashing"- sounds lovely, doesn't it? The wait event happens when a database uses Automatic Memory Management, (AMM) and the MMON background process decides to shrink the buffer cache and reallocate the memory to other areas of the SGA after advice from the memory advisors. A lot of DBAs haven't been sold on AMM yet and I have to admit, I'm one of them. I only use it in small, non-important databases and often avoid setting it for any OLTP environment. One thing to keep in mind though: I'm more often in the EM12c world these days vs. working with memory allocation issues in DB12c, so I don't have the experience in DB12c and multi-tenant to give an educated review of AMM with the new version. 🙂
The above ASH Report should be enough to let you know that this database is impacted by current AMM settings during resize, but if you need to provide more information, you can do this in two ways:
Here’s a simple query to start:
SELECT parameter, initial_size, target_size, start_time
  FROM v$memory_resize_ops
 WHERE initial_size >= 0
   AND final_size >= 0
 ORDER BY parameter, start_time;
Another choice is to set DB_CACHE_ADVICE to off and then set the db_cache_size directly.
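As a sketch, assuming you've decided to step away from AMM-driven resizing of the buffer cache, those two settings could be applied like this (the 2G value is purely illustrative; base yours on observed workload):

```sql
-- Turn off the buffer cache advisory and pin a floor under the buffer cache.
-- With AMM still enabled, db_cache_size acts as a minimum rather than a fixed size.
ALTER SYSTEM SET db_cache_advice = OFF;
ALTER SYSTEM SET db_cache_size = 2G SCOPE = BOTH;  -- example value only
```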
And that concludes an example of how and why you would use an ASH report to investigate an issue in Oracle. Have a great day and as always, do everything you can to be bulletproof!
Today we’re going to review another great feature in the EM12c that you may not have realized was available. Once logged into a database target, click on the Performance menu and navigate to the Optimizer Statistics Console:
The new console page is clean, easy to navigate and has great access points to manage and monitor optimizer statistics for the given database target.
We’ll actually start at the bottom of the page with the Statistics Status and go up into the links. Viewing the graph, you get a quick and clear idea of the status of your statistics for the database target you are logged into. You can easily see if there are any stale stats that may be impacting performance and if there are any missing stats. You are shown how many objects are involved in the status category and can then move your way up into the links to review and manage your database statistics configuration.
We’re going to go through the Operations by order of logic and not by order in the console, so we’ll start with View.
This link will take you to a great little report console that will display information about statistics in the database. Even though our example will display results for Stale statistics, note the other great filters for the report:
As we want to see everything, we're not going to choose any other filters for our report until we get to the bottom, where we have the options of Current, Pending or All for our Scope. We're going to change it to All, considering the version of the database is 22.214.171.124 and we could have pending statistics waiting to be implemented.
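Outside the console, pending statistics can also be checked directly from the data dictionary; a minimal sketch:

```sql
-- Is the database publishing statistics immediately, or holding them as pending?
SELECT DBMS_STATS.GET_PREFS('PUBLISH') AS publish_pref FROM dual;

-- Any table-level statistics gathered but not yet published
SELECT owner, table_name, num_rows, last_analyzed
  FROM dba_tab_pending_stats
 ORDER BY owner, table_name;
```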
The report quickly showed that both data dictionary and fixed objects were stale, (schemas are up to date!) so we could multi-select objects on the left of the report and gather stats, (along with other options) or we could use the next section we'll be covering to gather those stats in an EM job and address the stale statistics issue in what I feel is a more user-friendly interface.
Back in the Optimizer Statistics Console, if we click on the Gather link, we are taken directly to the Gather Statistics Wizard:
There is a clear warning at the top letting you know that as of DB11g, automated maintenance tasks should be enabled to gather nightly statistics. This is turned on by default in most databases, so this warning is a nice addition to this page for those that may not be aware.
Below this warning, you are able to choose what level of statistics gathering you wish to perform, (database, schema, objects, fixed objects or data dictionary…)
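For reference, the wizard's levels map roughly to these DBMS_STATS calls (the SCOTT/EMP names are placeholders, not from the post):

```sql
EXEC DBMS_STATS.GATHER_DATABASE_STATS;                   -- database level
EXEC DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'SCOTT'); -- schema level (placeholder schema)
EXEC DBMS_STATS.GATHER_TABLE_STATS('SCOTT', 'EMP');      -- object level (placeholder table)
EXEC DBMS_STATS.GATHER_FIXED_OBJECTS_STATS;              -- fixed objects (X$ tables)
EXEC DBMS_STATS.GATHER_DICTIONARY_STATS;                 -- data dictionary
```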
By default, Oracle's guidelines for statistics collection options will be chosen, but you can change this to customize if you wish to work outside of Oracle's recommendations. You can view the default values before deciding if, for some reason, you wish to use manual configuration options:
The wizard won’t ask you to set the manual configurations until later into the setup steps and if you change your mind, you can still choose the defaults.
At the bottom of the wizard, you also have the opportunity to use the Validate with the SQL Performance Analyzer, but as noted, the changes won’t be published and you’ll have to do that manually post the statistics collection run.
The next page will take you through the customized options you want to use instead of GATHER AUTO, (although, like I said, you could just leave it as is and have it perform the default anyway! :))
Then you get to schedule it via the EM Job Service and would monitor and manage this job via the EM12c Job Activity console.
This means that this is not an automated maintenance task in the database job scheduler, and if you are not aware of how to view jobs via DBMS_SCHEDULER, you could end up with two stats jobs running for a database, or even worse, running simultaneously, so BE AWARE.
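Before scheduling your own EM job, it's worth confirming whether the automated maintenance task is already enabled; one way to check from the dictionary:

```sql
-- Status of the nightly automated statistics task (available 11g and later)
SELECT client_name, status
  FROM dba_autotask_client
 WHERE client_name = 'auto optimizer stats collection';
```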
As the Lock, Unlock and Delete links take you to similar wizards that do just the opposite action, we’ll group them together in one section. Using the Unlock statistics wizard in our example, you can click on the link and choose to unlock a schema or specific tables:
If you decide to unlock just a few or even just one object, the wizard makes it quite easy to search and choose:
In the example above, I clicked on the magnifying glass next to the box for the Schema and then chose the DBSNMP schema. I can use a wild card search in the object name box or leave it blank and all tables in the schema are returned, and a simple click in the box to the left of the object name will select it to lock, delete or unlock, (depending on which wizard you've chosen…) You can also view information on whether the object is already locked or unlocked, along with partitioning information, as you may have partitions that are locked while the table is not.
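The same lock, unlock and delete actions the wizard performs map to DBMS_STATS calls; a sketch against the DBSNMP schema from the example (the table name is hypothetical):

```sql
-- Schema-wide lock/unlock
EXEC DBMS_STATS.LOCK_SCHEMA_STATS('DBSNMP');
EXEC DBMS_STATS.UNLOCK_SCHEMA_STATS('DBSNMP');

-- Or a single object; SOME_TABLE is a hypothetical name
EXEC DBMS_STATS.LOCK_TABLE_STATS('DBSNMP', 'SOME_TABLE');
EXEC DBMS_STATS.DELETE_TABLE_STATS('DBSNMP', 'SOME_TABLE');
```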
The restore option is a great feature for those that may not have the restore statistics syntax at the top of their head. Now, I have to admit, some of the options in this wizard make me very nervous. The idea that someone would dial back database level statistics vs. locating the one or two offenders that changed just seems like throwing the baby out with the bath water, but it is an option in the restore statistics command, so here it is in the wizard, as well.
You have the option to override locked objects and force a restore, too. Like with locking and unlocking objects, the next screen in the wizard will allow you to choose a schema and object(s) that you wish to restore from and then once chosen, you will be asked when to restore to, including the earliest restore timestamp available:
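Under the covers this is DBMS_STATS.RESTORE_*; a hedged example for a single table (SCOTT.EMP is a placeholder, and FORCE => TRUE corresponds to the wizard's "override locked objects" option):

```sql
-- How far back can we restore? (statistics history retention defaults to 31 days)
SELECT DBMS_STATS.GET_STATS_HISTORY_AVAILABILITY FROM dual;

BEGIN
  DBMS_STATS.RESTORE_TABLE_STATS(
    ownname         => 'SCOTT',   -- placeholder schema
    tabname         => 'EMP',     -- placeholder table
    as_of_timestamp => SYSTIMESTAMP - INTERVAL '1' DAY,
    force           => TRUE);     -- override locked statistics
END;
/
```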
Post these choices, you then schedule the EM Job to run the task and you’re set.
Manage Optimizer Statistics
You must be granted the Create Job and Create Any Job privileges to take advantage of these features and will be warned if you haven’t been granted one or both.
Operations links include the ability to Gather Optimizer Statistics, which includes database and schema level, along with distinct object level. Secondary links to restore, lock, unlock and delete statistics for each statistics gathering type are available as well.
The Related Links section includes links for research and configuration settings, such as current object statistics, global statistic gathering options, the job scheduler to view current intervals for jobs involving statistics as well as automated maintenance tasks which inform you of any clean up and maintenance jobs that are part of the overall Cost Based Optimizer world.
These links will configure the Automated Maintenance Tasks, allowing you to update schedules of execution, disable/enable and work with SPA results, (SQL Performance Analyzer.)
If you haven’t used SPA yet, it has some pretty cool features allowing you to simulate and analyze different performance changes before you make them. Nothing like being able to see in the future!
Working with some of these features may require a few management packs, (tuning, real application testing, etc.) but if you're curious whether you're wandering into new management pack land, it's easy to find out from any EM12c console page:
You will receive information about any management packs involved with the features you are using in the EM12c console for the page you’re on:
So embrace the power with optimizer statistics in EM12c Cloud Control and if you want to know more about managing Optimizer Statistics click here for the Oracle documentation or this whitepaper for more info.
While enjoying the lovely Liverpool, UK weather at Tech14 with UKOUG, (just kidding about that weather part and apologies to the poor guy who asked me the origin of “Kevlar” which in my pained, sleep-deprived state I answered with a strange, long-winded response…. :)) a customer contacted me in regards to a challenge he was experiencing starting an agent on a host that was home to 100’s of targets.
oracle_database.DB301.com - LOAD_TARGET_DYNAMIC running for 596 seconds
oracle_database.rcvcat11 - LOAD_TARGET_DYNAMIC running for 596 seconds
oracle_database.DB302.com - LOAD_TARGET_DYNAMIC running for 596 seconds
oracle_database.B303.com - LOAD_TARGET_DYNAMIC running for 596 seconds
oracle_database.DB304.com - LOAD_TARGET_DYNAMIC running for 596 seconds
oracle_database.DB305.com - LOAD_TARGET_DYNAMIC running for 596 seconds
oracle_database.DB307.com - LOAD_TARGET_DYNAMIC running for 596 seconds
oracle_database.DB309.com - LOAD_TARGET_DYNAMIC running for 596 seconds
oracle_database.B311.com - LOAD_TARGET_DYNAMIC running for 596 seconds
Dynamic property executor tasks running
------------------------------ ---------------------------------------------------------------
Agent is Running but Not Ready
The output from the “emctl start agent” wasn’t showing him anything he didn’t already know, but I asked him to send me the output and the following showed the actual issue that was causing the Agent not to finish out the run:
MaxThreads=96
agentJavaDefines=-Xmx345M -XX:MaxPermSize=96M
SchedulerRandomSpreadMins=5
UploadMaxNumberXML=5000
UploadMaxMegaBytesXML=50.0
Auto tuning was successful
----- Tue Dec 9 12:50:04 2014::5216::Finished auto tuning the agent at time Tue Dec 9 12:50:04 2014 -----
----- Tue Dec 9 12:50:04 2014::5216::Launching the JVM with following options: -Xmx345M -XX:MaxPermSize=96M -server -Djava.security.egd=file:///dev/./urandom -Dsun.lang.ClassLoader.allowArraySyntax=true -XX:+UseLinuxPosixThreadCPUClocks -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCompressedOops -----
Agent is going down due to an OutOfMemoryError
This host target was a unique environment in that it contained so many targets, especially database targets. One of the reasons that the management agent was created and OEM processing removed from an internal database back-end process was to lighten the footprint. As EM12c introduced numerous features that have assisted its move toward the center of the Oracle universe, the footprint became heavier, but I've been very impressed with development's continued investment in lightening that footprint, even when considerable additions such as plug-ins and metric extensions are added.
With all of this, the server administrator may have a different value set to limits on resource usage than what may be required for your unique environment. To verify this, I asked the customer to run the following for me:
ulimit -Su
ulimit -Hu
Which returned the following expected values:
$ ulimit -Su
8192
$ ulimit -Hu
3100271
The user limit values with these added arguments are to locate the following information:
-H display hard resource limits.
-S display soft resource limits.
I asked him to have the server administrator set both of these values to unlimited with the chuser command and restart the agent.
The customer came back to confirm that the agent had now started, (promptly!) and added the remaining 86 database targets without issue.
The customer and his administrator were also insightful and correctly assumed that I'd intended the unlimited values not as permanent settings, but as a troubleshooting step. The next step was to monitor the actual resource usage of the agent and then set the limits to values that would not only support the existing requirements, but allow enough of a ceiling to support additional database consolidation, metric extensions and plug-in growth.
Before heading off to UKOUG's Tech14 conference, I thought I would jump back from the Agents performance page and look into a very important page in the Oracle Management Service, (OMS) and Oracle Management Repository, (OMR) regarding metrics. Standard metrics collection is demanding in itself, so when we add plug-ins, metric extensions and manual metric collection changes, it's important to know how these changes can impact the performance of the Enterprise Manager 12c, (EM12c) Cloud Control environment.
We have returned to the same location in the EM12c console: Setup –> Manage Cloud Control –> Repository. The Repository page has three tabs; we're currently working with the middle tab, Metrics.
We’ll jump into the first graph, which is a fantastic bubble graph, (love seeing us use these more in EM12c console pages!)
Note the data is distributed across the graph by the left side, Number of Rows Loaded vs. bottom, Number of Collections. This information is important as we note that our heaviest hitters are the blue and the orange circles. If we hover our mouse over the blue circle, we then get to see a few more details:
We now can see that the blue circle is the metric collections for cluster information and can see not just the two graphing points, but the amount of data in MB loaded. We can use this information to make decisions on updating collections intervals to ease stress on the EM12c OMS and OMR.
If we then focus on the orange circle, we can view the same type of detail information:
So there are a number more collections on the InfiniBand. This is expected, as this is the network connectivity between the nodes of our engineered systems. The number of rows is higher, too, but note that the MB of data is no more than what the EM12c had to handle for the cluster data being uploaded. We can use this data to see if the collection interval pressure justifies the impact to the OMS and OMR. As we work through the rest of the data offered on this page, these are two important pain points to keep in the back of our mind regarding metric data uploaded vs. number of collections.
Now let's say we want to dig deeper into the InfiniBand metric info that's shown here. We can double click on the orange circle in the graph and we'll be taken to specific detail regarding this one metric, broken down by target and metric details.
Now we see the top 25 metrics data load information not just for the parent metric target type, but broken down by metric specifics. We can quickly see that the Switch Port Configuration Monitor consists of the most metric data. As the data is compiled and uploaded as one collection, the bubbles are superimposed on top of each other.
If we switch to a Target view, a very different story is presented:
Note that six different collection interval schedules are [most likely] displayed here. In the center, you can see the bubbles super-imposed on top of each other that are interconnected, but the large, red bubble is of significant interest. If we hover our cursor over this bubble:
One scan listener, (scan listener 3 for an exalogic environment) is uploading more data and more often than the rest of the environment? This is over 26% of the total metric impact for infiniband on the EM12c. Investigating this, reviewing agent patch levels and comparing collection intervals would be a good idea!
For those that prefer a more standard graph of the data, the right side graph displays the high level data in a second format:
You also have the option, instead of the top 10, to display 15 or 20 top metrics.
Awareness of metric alerts per day can be valuable, especially when there are OMS, OMR or agent patches missing! I can commonly look at this graph and quickly tell if an environment has skipped applying important EM12c patches, (these can be located in the master note Enterprise Manager Base Platform (All Releases) (Doc ID 822485.1))
Now you see the breaks in the graph and may wonder what’s up with that- this environment has been patched and down for quarterly maintenance. We can see this when we click on the Table View link and see the Unavailable sections:
This also gives you quick access to the recent, raw data without having to query the MGMT$EVENTS view in the OMR directly or using EM CLI. Once we close this view, we can go back and highlight the different links below the graph to show advanced options.
For the Opened, Closed and Backlogged Metric Alerts, we can view Problem Analysis, Metrics Detail or go to the Target home for this metric data. This is so cool that I’m going to do a separate blog post to do it justice, so be patient on this topic… 🙂
The last graph on this page is another one that can give away if patches are missing pretty quickly.
This is over 30 days, so if you are seeing 100's of metric collection errors, you should first check to verify that there aren't any patches that address metric collections of the type you are experiencing. If this isn't the case, investigate the error messages for the collections in the MGMT$METRIC_ERROR_HISTORY view in the OMR. You can start with something as simple as:
SELECT target_name, metric_name, coll_name, error_message
  FROM mgmt$metric_error_history
 WHERE error_type = 'ERROR'
   AND collection_timestamp >= SYSDATE - 30;
There is still a lot to cover in this series, but now it's time to get ready to head over to Liverpool, England next week!