AzureOracle

Addressing Oracle Redo Latency with Ultra Disk

*Previously posted on the Microsoft Data Architecture Blog for the Tech Community.

An enterprise cloud, like Azure, handles an incredible variety of workloads and to be successful running Oracle in Azure means you need to know what you’re doing and where the sweet spot is for a relational workload.

I don’t want to get too deep here, as Azure Oracle SMEs are both data and infra, which is a hybrid area resulting in us splitting between the two focuses.

The Oracle Must-Knows

An Oracle database is a complex relational database, but there are terminology and physical architecture that is important to know before we proceed into this post:

  • Database instance, (background processes)
  • Memory, (SGA/PGA)
  • Control files
  • Datafiles
  • Redo log files
  • Archive logs

We’re going to discuss a common performance challenge when coming to the cloud and that’s around REDO LOGS.  Oracle redo logs are the equivalent in Microsoft databases to the transaction log.  Unlike SQL Server, Oracle has multiple logs that are part of the database that it switches between active, non-active and archiving.  All operations that modify are written to the redo log buffer and then the log writer (LGWR) writes it to the active redo log.  It’s not uncommon for the fastest storage IO to be required for the redo logs vs. datafiles because of this demand.

Redo logs are commonly mirrored to protect any changes not written to the datafiles and/or from a physical removal of one redo log “member” from a redo log member group.

Group 1st Member 2nd Member
Group 1 Redo01a.log Redo01b.log
Group 2 Redo02a.log Redo02b.log
Group 3 Redo03a.log Redo03b.log
Group 4 Redo04a.rdo Redo04b.rdo

Table 1

Notice that the extension can be anything you want- .log or. rdo is most common.

Oracle Wait Events

For Oracle, wait events are metrics that provide information about what is causing latency in performance of an Oracle workload.  This information can be provided in multiple ways, but most often it is provided by the following:

  1. An Automatic Workload Repository (AWR) Active Session History (ASH) or Statspack report
  2. A query of top waits in Oracle
  3. A trace file or trace file TKPROF report

There are two main wait events that we see for Oracle databases:

  • Log File Sync
  • Log File Parallel Write

A report showing this type of latency would look like this:

Top 5 Timed Events                    Avg %Total

~~~~~~~~~~~~~~~~~~    wait   Call

Event                            Waits       Time (s)   (ms)         DBTime

—————————– ———— ———– —— ——

CPU time                                                 5,099            41.0

log file sync                       621,615       3,520      6    28.3

log file parallel write        575,295       2,573      4    20.7

 

For the example above, we expect CPU time to be the top consumer-  CPU is on or off and for a database, we expect the usage to be high and Oracle has included the “wait for CPU” in the total, too, so again, it’s not surprising that it’s #1.  I commonly look for those wait events that have over a 10% of DB time.  In the example, log file sync is 28% and log file parallel write is almost 21%.  The two waits together are 49% of the database time consumption.  Fix these two waits and the database performance will significantly improve.

What is Log File Sync

As changes happen in the Oracle database, changes are written to the redo log.  Once changes are committed or rolledback, the LGWR will then write what is in the redo buffer to the redo log that is active.  Before the process is considered complete, a confirmation has to be sent back and this latency is what is referred to as log file sync or the time required to flush the log buffer to disk and writes confirmed.  To address this type of wait event, optimization of the code can be performed or to physically address it, faster storage for the redo can be recommended.

What is Log File Parallel Write

An Oracle database will have at least three redo logs and if the DBA wants to mirror those redo logs, that means each redo log has been mirrored and placed in a group.  This results in all writes having to be done twice from the changes in the database.  As shown in table 1, the database could have two or more, (older days we saw more than 2 redo log members per group) that are being written to each time a write is performed.  This is another area where faster disk, (write faster) or optimized code, (write less and less often) can remove latency.

Ultra Disk to the rescue

Ultra disk is a managed disk with configurable IO that the customer pays for depending on the IO demand.  For a large Oracle database, it rarely makes sense for datafiles due to the cost, but for redo logs, the size is small and the ability to configure the IO specifically for the high write needs can make a huge difference for the wait events discussed in this blog post.

Calculate the needs of the log files using the IO STAT FUNCTION SUMMARY in the AWR report, which will tell you the number of MBPs/IO Requests that are needed to meet demand.  Pad a little over this to address averages in the report and this should give you the values to target for:

 

 

 

 

 

 

 

 

 

Using this information, along with the size of your redo logs, you can then price out ultra efficiently and pay for what you need:

 

 

 

 

 

 

 

If you only have a Log File Sync issue, then a single ultra disk and moving the redo logs to this new disk can address this problem.  If you have both Log File Sync and Log File Parallel Write, then allocate TWO ultra disks, moving the 1st members of each group to one ultra disk and the 2nd members of each redo log group to the second ultra disk:

 

 

 

 

 

 

 

 

Redo logs can be moved online, so outside of enabling the use of ultra disk on a VM, there isn’t an outage required on the Oracle side to move the redo logs, which simply relies on new redo log groups created in the new ultra disk storage, checkpointing, switching logs and dropping the old groups on the old storage.

Once this has been performed, a full workload should be run and the waits should be eliminated on the log files.

Top 5 Timed Events                     Avg %Total

~~~~~~~~~~~~~~~~~~     wait   Call

Event                            Waits        Time (s)   (ms)          DBTime

—————————————– ———– ————–

CPU time                                                     4,057           9.1

log file sync                           988,793       1,382      1    3.1

log file parallel write          1,060,731       1,130      1    2.5

 

You should see the time waited and the percentage of DB Time consumed drop considerably using this method of optimizing with faster storage.  Anything under 10% is often a good indication that you can focus on other priorities.

Caveats of using Ultra Disk with Oracle on Azure IaaS

  • Ultra Disk is still not an option for volume snapshot backups, so RMAN or backups to a secondary location is required.
  • Ultra disk is only available on some VM series, so make sure it is available for the system you are hoping to use it on.
  • Azure Backup for Oracle is impacted when Ultra disk is introduced, but there is a disk exclusion policy that will address this in a future release.
  • Ultra disk can be expensive, so identify how much IO is required and optimize it to save, using the wait events and acceptable performance from the user experience.
  • Ultra disk, just like managed disk, is subject to IO throttling, based on the IO limit posed for the VM it resides on