August 30th, 2017 by dbakevlar

Delphix Engineering and Support are pretty amazing folks.  They keep pursuing solutions, no matter how much time it takes or how complex the challenges they face in supporting heterogeneous environments, hardware configurations and customer needs.

This post is in support of the effort from our team that brought stability to a previously impacted Solaris 11.2 cluster configuration.  The research, patching, testing and resulting certification from Oracle was a massive undertaking from our team.  I hope this information serves the community, but it is in no way a recommendation from Delphix.  It’s just what was done to resolve the problem, after our team made logical decisions based on how the system is used.

Challenge

Environment:  Solaris 11.3 (with SRU 17.5) + Oracle 12.2 RAC + ESX 5.5
Situation:
After an upgrade to 12.2, the environments were experiencing significant cluster instability and memory starvation due to the increased memory demands of the new release.
Upon inspection, it was found that numerous features required more memory than before and the system simply didn’t have the means to support it.  As ours was a Solaris environment with 12.2, there was a documented patch we needed to request from Oracle for RAC performance and node evictions.  Even with the patch, the environment was still experiencing node evictions, and the data showed that we’d have to triple the memory on each node to continue using the environment as we had before.  Our folks aren’t ones to give up that easily, so secondary research was performed to find out if some of the memory use could be trimmed down.
What we discovered is that what is old can become new again.  My buddy and fellow Oakie, Marc Fielding, had blogged (along with links to other posts, including credit to another Oakie, Jeremy Schneider) about how he’d limited resources back in 2015 after patching to 12.1.0.2.  His post really helped the engineers at Delphix get past the last hump on the environment, even after implementing the patch to address a memory leak.  Much of what you’re going to see here came from that post, focused on its use in a development/test system (Delphix’s sweet spot).

Research

Kernel memory out of control
Starting with kernel memory usage, the mdb -k command can be used to inspect at a percentage level:
$ echo "::memstat" | mdb -k
Page Summary           Pages                 MB          %Tot
  ------------     ----------------   ----------------     ----
  Kernel               151528              3183            24%
  Anon                 185037              1623            12%
  ...
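When these numbers need to be tracked over time, the ::memstat output can be turned into something a script can compare. The following is a minimal Python sketch, not part of the original troubleshooting: the function name parse_memstat is hypothetical, and it assumes each data row ends with a pages column, an MB column and a percentage, as in the excerpt above.

```python
import re

# Matches rows such as "Kernel   151528   3183   24%".
# Assumption: every data row ends with pages, MB and a percentage.
ROW = re.compile(r"^\s*(.+?)\s+(\d+)\s+(\d+)\s+(\d+)%\s*$")


def parse_memstat(text):
    """Return {category: (pages, mb, pct)} for each data row of ::memstat output."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            name, pages, mb, pct = match.groups()
            rows[name] = (int(pages), int(mb), int(pct))
    return rows


# Sample rows copied from the excerpt above.
sample = """\
Page Summary           Pages                 MB          %Tot
------------     ----------------   ----------------     ----
Kernel               151528              3183            24%
Anon                 185037              1623            12%
"""

stats = parse_memstat(sample)
for name, (pages, mb, pct) in stats.items():
    print(f"{name}: {mb} MB ({pct}% of total)")
```

The header and separator lines are skipped because they carry no numeric columns, so only the per-category rows end up in the result.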

We can also look at it a second way, breaking down the kernel memory areas with kmastat:

::kmastat

cache                        buf    buf    buf    memory     alloc alloc 
name                        size in use  total    in use   succeed  fail 
------------------------- ------ ------ ------ --------- --------- ----- 
kmem_magazine_1               16   3371   3556     57344      3371     0 
kmem_magazine_3               32  16055  16256    524288     16055     0 
kmem_magazine_7               64  29166  29210   1884160     29166     0 
kmem_magazine_15             128   6711   6741    876544      6711     0 
...

Oracle ZFS ARC Cache

Next, Oracle ZFS has a very smart cache layer, referred to as the ARC (Adaptive Replacement Cache).  Both a blessing and a curse, the ARC consumes as much memory as is available, but is supposed to release it to other applications when they need it.  This memory is used to compensate for slow disk I/O.  When inspecting our environment, we found a significant amount of memory was over-allocated to the ARC.  This may be due to the newness of Oracle 12.2, but in a cluster, memory starvation is a common cause of node eviction.

We can inspect the size stats for the ARC in the following file:

view /proc/spl/kstat/zfs/arcstats

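This path assumes the ZFS statistics are exposed under /proc, as they are with ZFS on Linux, so your actual arcstats location may differ.  On Solaris, the same counters are typically read with the kstat utility, for example kstat -p zfs:0:arcstats.  Whichever source you use, review the following values: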

  • c is the target size of the ARC in bytes
  • c_max is the maximum size of the ARC in bytes
  • size is the current size of the ARC in bytes
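The three values above can be pulled out and compared with a few lines of Python. This is only a sketch under assumptions: it expects the three-column "name type data" layout that the arcstats file uses on ZFS on Linux, the function name parse_arcstats is hypothetical, and the byte counts in the sample are made up for illustration rather than taken from our cluster.

```python
def parse_arcstats(text):
    """Return {name: value} for every 'name type data' row with an integer value."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three columns; the last one is the counter value.
        if len(parts) == 3 and parts[2].isdigit() and not parts[0].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


# Hypothetical sample: 2 GiB target, 6 GiB maximum, 1 GiB currently in use.
sample = """\
name                            type data
c                               4    2147483648
c_max                           4    6442450944
size                            4    1073741824
"""

arc = parse_arcstats(sample)
gib = 1024 ** 3
print(f"ARC size:   {arc['size'] / gib:.2f} GiB")
print(f"ARC target: {arc['c'] / gib:.2f} GiB")
print(f"ARC max:    {arc['c_max'] / gib:.2f} GiB")
print(f"In use:     {100 * arc['size'] / arc['c_max']:.0f}% of c_max")
```

If size sits at or near c_max while other processes fail to allocate memory, the ARC is a reasonable first suspect.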

Ours was consuming all of the memory that remained, as we’ll discuss in the next section of this post.

Oracle Clusterware Memory

The Oracle Clusterware is a third area that was investigated for frivolous memory usage that could be trimmed down.  There are some clearly documented steps from Oracle for investigating misconfigurations and feature issues, and they can help identify many of these.

So, post upgrade and patching, what can you do to trim down memory usage to avoid memory upgrades to support the cluster upgrade?

Changes

From the list of features and installations that weren’t offering a benefit to a development/test environment, these are the ones that made the cut and why:
  • Updated the /etc/system file (requires a reboot and must be performed as root):
    • Added set user_reserve_hint_pct=80
    • This change limits how much memory ZFS can use for the ARC.  There was a significant issue for the customer when CRS processes weren’t able to allocate memory.  80% was the highest percentage this could be set to without a node reboot being experienced, something we all prefer not to happen.
  • Stopped the Cluster Health Monitor (CHM) process.  This is a brand new background process in 12c Clusterware that collects workload data, which is significantly more valuable in a production environment, but in development and test?  It can easily be a significant drain on CPU and memory that could be better put to use for more virtual databases.
    • To perform this, the following commands were run as the root user:
$ crsctl stop res ora.crf -init
$ crsctl delete res ora.crf -init
  • Removed the Trace File Analyzer Collector (tfactl).  This background process collects the many trace files Oracle generates into a single location.  Handy for troubleshooting, but it’s Java-based, has a significant memory footprint and is subject to Java heap issues.
    • It was uninstalled with the following command as the $ORACLE_HOME owner on each node of the cluster:
$ tfactl uninstall
  • Engineering stopped and disabled the Cluster Verification Utility (CVU).  In previous versions this was a utility that could be manually added to the installation or run afterwards by an admin to troubleshoot issues.  This is another feature that simply eats up resources that could be reallocated to dev and test environments, so it was time to stop and disable it with the following:
$ srvctl stop cvu
$ srvctl disable cvu
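To get a feel for what the user_reserve_hint_pct=80 setting above means in gigabytes, here is a small Python sketch.  It assumes the commonly described behaviour of the tunable, namely that the percentage is the share of physical memory hinted as reserved for applications, with the remainder left to the kernel and the ARC.  The function name reserve_split is hypothetical, and because the setting is a hint the real split will only be approximate.

```python
def reserve_split(total_gb, user_reserve_hint_pct):
    """Split physical memory into the share hinted for applications and the rest.

    Assumption: user_reserve_hint_pct is the percentage of physical memory
    reserved for applications; what remains is available to kernel and ARC.
    """
    if not 0 <= user_reserve_hint_pct <= 100:
        raise ValueError("percentage must be between 0 and 100")
    user_gb = total_gb * user_reserve_hint_pct / 100
    return user_gb, total_gb - user_gb


# Node sizes discussed in this post: 8GB before and 16GB after the upgrade.
for total in (8, 16):
    user, rest = reserve_split(total, 80)
    print(f"{total}GB node: ~{user:.1f}GB for applications, ~{rest:.1f}GB for kernel and ARC")
```

Running the numbers for both node sizes makes it easier to judge whether the kernel and ARC have enough headroom before committing to a reboot.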

Additional Changes

  • Reduced memory allocation for the ASM instance.
    • The ASM instance in 12.2 now uses 1GB of memory, where it previously used 256MB.  That’s a huge change that can impact other features dependent on that memory.
    • Upon research, it was found that 750MB was adequate, so if more memory needs to be reallocated, consider lowering the ASM memory on each node to 750MB.
  • To make this set of instance-level parameter changes, run the following on any one of the nodes and then restart each node, until the whole cluster has been cycled, to put the change into effect:
$ export ORACLE_HOME=<Grid Home>

$ export ORACLE_SID=<Local ASM SID>

$ sqlplus / as sysasm
alter system set "_asm_allow_small_memory_target"=true scope=spfile;
alter system set memory_target=750m scope=spfile;
alter system set memory_max_target=750m scope=spfile;

Features with high CPU usage can be troubling for most DBAs, but when they show up on development and test databases, which are often granted fewer resources than production to begin with, a change can often enhance the stability and longevity of these environments.

  • Disabled high-res time ticks in all databases, including ASM DBs, regular DBs, and the Grid Infrastructure Management Repository DB (GIMR, SID is -MGMTDB).  High-res ticks are a new feature in 12c, and they seem to cause a lot of CPU usage from cluster time-keeping background processes like VKTM.  Here’s the SQL to disable high-res ticks (must be run once in each DB):
alter system set "_disable_highres_ticks"=TRUE scope=spfile;
The team, after all these changes, found the Solaris kernel was still consuming more memory than before the upgrade, but it was more justifiable:
  • Solaris Kernel: 1GB of RAM
  • ARC Cache: between 1GB and 2GB
  • Oracle Clusterware: 3GB

Memory Upgrade

We Did Add Memory, but Not as Much as Expected
After all the adjustments, we were still using over 5GB of memory for these three areas, so we upped each node from 8GB to 16GB to ensure enough resources to support all dev and test demands after the upgrade.  We wanted to provision as many virtual databases (VDBs) as the development and test groups needed, so having more than 3GB free for databases was going to be required!
The Solaris cluster, at this time, has experienced no more kernel panics, node evictions or unexpected reboots, which we have to admit is the most important outcome.  It’s more difficult to explain an outage to users than to explain to Oracle why we shut down and uninstalled unused features…. 🙂
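The arithmetic behind that decision can be sketched in a few lines of Python.  The figures are the ones listed above (kernel 1GB, ARC between 1GB and 2GB, Clusterware 3GB); the function name free_for_databases is hypothetical and the result is a rough budget, not a measurement.

```python
# Post-tuning footprints reported above, in GB.
KERNEL_GB = 1
ARC_GB = (1, 2)        # observed range for the ARC
CLUSTERWARE_GB = 3


def free_for_databases(node_gb):
    """Return (worst_case, best_case) GB left for databases on one node."""
    low_overhead = KERNEL_GB + ARC_GB[0] + CLUSTERWARE_GB
    high_overhead = KERNEL_GB + ARC_GB[1] + CLUSTERWARE_GB
    return node_gb - high_overhead, node_gb - low_overhead


for node_gb in (8, 16):
    worst, best = free_for_databases(node_gb)
    print(f"{node_gb}GB node: {worst}GB to {best}GB left for databases")
```

On an 8GB node the overhead leaves at most 3GB for databases, which is why the nodes were doubled rather than tripled.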

Posted in Delphix, Oracle

July 31st, 2017 by dbakevlar

This is Part III of a four-part series on how to:

  1.  Enable VNC Viewer access on Amazon EC2 hosts.
  2.  Install DB12c and upgrade a Dsource for Delphix from 11g to 12c, (12.1)
  3.  Update the Delphix Configuration to point to the newly upgraded 12c database and the new Oracle 12c home.
  4.  Install DB12c and upgrade target VDBs for Delphix residing on AWS to 12.1 from the newly upgraded source.

In Part II, we finished upgrading the Dsource database, but now we need to get it configured on the Delphix side.

Log into the Delphix Admin console as the Delphix_Admin user to make the changes required for Delphix to recognize that the Dsource is now DB12c and has a new Oracle home, then go to Manage –> Environments.

Click on the Refresh button and let the system recognize the new Oracle Home for DB12c:

Once complete, you should see the 12.1 installation we performed on the Linux Source now listed in the Environments list.

Click on Manage –> Datasets and find the Dsource 11g database and click on it.

Click on the Configuration tab and click on the Upgrade icon, (a small up arrow in the upper right.)

Update to the new Oracle Home that will now be listed in the dropdown and scroll down to save.

Now click on the camera icon to take a snap sync to ensure everything is functioning properly.  This should only take a minute to complete.

The Dsource is now updated in the Delphix Admin console and we can turn our attention to the Linux target and the VDBs that source from this host.  In Part IV we’ll dig into the other half of the source/target configuration and how I upgraded Delphix environments, with a few surprises!

 

Posted in AWS, Delphix, Oracle

April 21st, 2016 by dbakevlar

I was surprised on April 20th when I awoke to find a 1.3GB OS update on my Samsung Galaxy S6 Edge+.  I’d never experienced any issues with an update before, so I quickly connected my phone to the WiFi and let it download and then upgrade my phone, eager to see what new Android features awaited me.


It Broke

I proceeded through my day, but was concerned as battery usage was higher than usual, I suffered email failures from Gmail and a few tweets didn’t go through.  I consider myself quite familiar with mobile phone troubleshooting and promptly performed the standard steps to address the issues, but the next morning I was faced with the same problems.

I researched and found that I wasn’t the only one, as numerous Note, Galaxy and even new Galaxy S7 users were reporting similar issues with texts, emails and network connectivity.

I happened to be running work errands and stopped at my neighborhood T-Mobile store to see if they’d heard anything.  The tech was surprised by what I’d tried:

  • Cleared the app cache for affected applications.
  • Uninstalled and reinstalled.
  • Cycled between the steps.

Then he was even more impressed with my phone.  I have my Samsung set up with the most optimal settings:

  • Shut off all unnecessary notifications at the application level.
  • Shut off Wifi scanning.
  • Keep Location and Wifi off unless I’m somewhere that I need to use it.
  • Keep my screen set to auto-dim to conserve battery.
  • All advanced features for the phone are set to optimum performance, balanced with conservative usage.

Yaaaass

I was having an issue clearing the cache partition on the phone and was looking into how it should be done with the 6.0.1 release.  There had been a change in the button combination (the volume down, home button, power button combo now brings up a screen instead of clearing the partition) and he was able to help me out with this, clearing the partition.

Clearing the partition brought a second fix along with it:

The actual 6.0.1 system update HADN’T finished!  Upon clearing the cache partition, the update completed and many of the issues I was experiencing stopped.

Then the second part of the problem showed itself.  To conserve battery on many Samsung S6 and S6 Edge devices, it was recommended to run in “Power Saving Mode”.  In 6.0.1, there is a change to the features provided as part of this mode.

It now LIMITS the amount of data allowed to be SENT or RECEIVED.

The reason tweets with pictures and emails with attachments were failing: SOLVED.  Take the phone out of “Power Saving Mode” and the emails and tweets stuck in “limbo” should immediately be sent!

So, to summarize:  If you are having issues with emails, network connectivity and social media, do two things:

  1.  Use the button combination for your device to clear the cache partition.  You will most likely also see a secondary system update complete afterwards.  This means your phone didn’t complete its update to 6.0.1 to begin with.
  2.  If you use “Power Saving Mode”, shut it off and try to resend a tweet or email with an attachment.  If it works, you now know that you hit the size limitation.  You can tweak this and do a lot of it manually with the following steps:
    • Go to Settings
    • Click on Battery
    • Click on Ultra Power Saving Mode, App Power Saving and go to Details
    • Choose the apps that are using power and set them to Save Power.  For those that need to send attachments and such, like email, Twitter and Facebook, leave them on Automatic only.

I’m happy I figured this out, as the Samsung Galaxy S6 Edge+ has been my favorite phone ever, so having it at top functionality after the upgrade was important!  I hope these tips for fixing issues after the upgrade to 6.0.1 help you, too!

 

Posted in DBA Rants
