Oracle I/O latency monitoring
One thing that I have found sorely missing in the performance pages of Enterprise Manager is latency values for various types of I/O. The performance page or top activity may show high I/O waits but it won’t indicated if the latency of I/O is unusually high or not. Thus I put together a shell script that shows latency for the main I/O waits
- db file sequential read
- db file scattered read
- log file parallel write
- direct path reads
- direct path reads temp
Of course it would be nice to add a few others like direct path writes, direct path writes temp and log file sync but there is only so much room in the screen width.
oramon.sh
The script is called oramon.sh and is available on github at
https://github.com/khailey/oramon/blob/master/oramon.sh
Example:
$ oramon.sh Usage: oramon.sh [username] [password] [host] [sid] <port=1521> <runtime=3600> $ ./oramon.sh system sys 172.16.100.81 vsol RUN_TIME=-1 COLLECT_LIST= FAST_SAMPLE=iolatency TARGET=172.16.100.81:vsol DEBUG=0 Connected, starting collect at Wed Apr 18 18:41:13 UTC 2012 starting stats collecting single block logfile write multi block direct read direct read temp ms IOP/s ms IOP/s ms IOP/s ms IOP/s ms IOP/s 20.76 27.55 32.55 .71 3.50 .00 .00 .01 .00 .00 .00 .20 .00 .00 .00 .00 34.93 369.64 116.79 3.55 .00 .00 .00 31.43 640.33 92.40 8.33 .00 .00 .00 39.39 692.33 111.69 8.00 .00 .00 .00
The first line of output is the average since the database started up.
The subsequent lines are the averages since the last line which is 5 seconds by default.
One should be able to see immediately how much activity there is on the database and the latency for the basic types of database I/O.
Reads
Single block reads are the typical I/O from a database which would happen for example when reading a row in a table with indexes in place.
Multi block reads are common as well is which would happen when for example summing the values over all rows in a table.
Direct reads are less common but quite normal and happen almost exclusively for parallel query though may be used for other activities especially in newer version of Oracle such as 11.2. Direct reads are multiblock reads that by pass the Oracle buffer cache. The size varies from a datablock, such as 8k to 1MB.
Direct read temp happens when a sort has overflowed memory limits and been written to disk. Direct reads temp are multiblock reads that by pass the Oracle buffer cache. The size varies from a datablock, such as 8k to 1MB.
Writes
Logfile writes are the only writes that database users wait for in general. Actually users only wait when the commit, which then is a wait for a signal from the log writer that their particular redo data is on disk which could have already happened. Typically the user wait time is a bit slower than the logwrite time but in general it’s close, ie within a few milliseconds. The farther apart the user wait time is from the log write time the more likely there is a CPU, paging or other concurrency problem on the VDB host slowing down the users signalling and wake up time.
oramon.sql : Oracle Latency Query
If for some reason the shell script isn’t able to connect to the database, then the same data can be collected manually by running the SQL query in SQL*Plus by hand.
The following two SQL queries, oramon_setup.sql and oramon.sql are available on github at
https://github.com/khailey/oramon
If you want to see the latencies over periods shorter than 60s, then you have to collect the values of the cumulative counters at time A, then again at time B and take the difference. The following two queries, oramon.sql and oramon_setup.sql, are available on ftp site
column seq_ms for 9999.99 column seq_ct for 9999.99 column lfpw_ms for 9999.99 column lfpw_ct for 9999.99 column seq_ms for 9999.99 column scat_ct for 9999.99 column dpr_ms for 9999.99 column dpr_ct for 9999.99 column dprt_ms for 9999.99 column dprt_ct for 9999.99 column prevdprt_ct new_value prevdprt_ct_var column prevdprt_tm new_value prevdprt_tm_var column prevdpwt_ct new_value prevdpwt_ct_var column prevdpwt_tm new_value prevdpwt_tm_var column prevdpr_ct new_value prevdpr_ct_var column prevdpr_tm new_value prevdpr_tm_var column prevdpw_ct new_value prevdpw_ct_var column prevdpw_tm new_value prevdpw_tm_var column prevseq_ct new_value prevseq_ct_var column prevseq_tm new_value prevseq_tm_var column prevscat_ct new_value prevscat_ct_var column prevscat_tm new_value prevscat_tm_var column prevlfpw_ct new_value prevlfpw_ct_var column prevlfpw_tm new_value prevlfpw_tm_var column prevsec new_value prevsec_var select 0 prevsec from dual; select 0 prevseq_tm from dual; select 0 prevseq_ct from dual; select 0 prevscat_ct from dual; select 0 prevscat_tm from dual; select 0 prevlfpw_ct from dual; select 0 prevlfpw_tm from dual; select 0 prevdprt_ct from dual; select 0 prevdprt_tm from dual; select 0 prevdpwt_ct from dual; select 0 prevdpwt_tm from dual; select 0 prevdpr_ct from dual; select 0 prevdpr_tm from dual; select 0 prevdpw_ct from dual; select 0 prevdpw_tm from dual; column prevdprt_ct noprint column prevdprt_tm noprint column prevdpwt_ct noprint column prevdpwt_tm noprint column prevdpr_ct noprint column prevdpr_tm noprint column prevdpw_ct noprint column prevdpw_tm noprint column prevseq_ct noprint column prevseq_tm noprint column prevscat_ct noprint column prevscat_tm noprint column prevlfpw_ct noprint column prevlfpw_tm noprint column prevsec noprint
Run following query to see the current latency for
- single block read
- log file parallel write
- multi-block read
oramon.sql
select round(seqtm/nullif(seqct,0),2) seq_ms, round(seqct/nullif(delta,0),2) seq_ct, round(lfpwtm/nullif(lfpwct,0),2) lfpw_ms, round(lfpwct/nullif(delta,0),2) lfpw_ct, round(scattm/nullif(scatct,0),2) scat_ms, round(scatct/nullif(delta,0),0) scat_ct, round(dprtm/nullif(dprct,0),2) dpr_ms, round(dprct/nullif(delta,0),2) dpr_ct, round(dprttm/nullif(dprtct,0),2) dprt_ms, round(dprtct/nullif(delta,0),2) dprt_ct, prevseq_ct, prevscat_ct, prevseq_tm, prevscat_tm, prevsec,prevlfpw_tm,prevlfpw_ct , prevdpr_ct, prevdpr_tm , prevdprt_ct, prevdprt_tm , prevdpw_ct, prevdpw_tm , prevdpwt_ct, prevdpwt_tm from (select sum(decode(event,'db file sequential read', round(time_waited_micro/1000) - &prevseq_tm_var,0)) seqtm, sum(decode(event,'db file scattered read', round(time_waited_micro/1000) - &prevscat_tm_var,0)) scattm, sum(decode(event,'log file parallel write', round(time_waited_micro/1000) - &prevlfpw_tm_var,0)) lfpwtm, sum(decode(event,'db file sequential read', round(time_waited_micro/1000) ,0)) prevseq_tm, sum(decode(event,'db file scattered read', round(time_waited_micro/1000) ,0)) prevscat_tm, sum(decode(event,'log file parallel write', round(time_waited_micro/1000) ,0)) prevlfpw_tm, sum(decode(event,'db file sequential read', total_waits - &prevseq_ct_var,0)) seqct, sum(decode(event,'db file scattered read', total_waits - &prevscat_ct_var,0)) scatct, sum(decode(event,'log file parallel write', total_waits - &prevlfpw_ct_var,0)) lfpwct, sum(decode(event,'db file sequential read', total_waits ,0)) prevseq_ct, sum(decode(event,'db file scattered read', total_waits ,0)) prevscat_ct, sum(decode(event,'log file parallel write', total_waits ,0)) prevlfpw_ct, sum(decode(event,'direct path read', round(time_waited_micro/1000) - &prevdpr_tm_var,0)) dprtm, sum(decode(event,'direct path read', round(time_waited_micro/1000) ,0)) prevdpr_tm, sum(decode(event,'direct path read', total_waits - &prevdpr_ct_var,0)) dprct, sum(decode(event,'direct path read', total_waits ,0)) prevdpr_ct, sum(decode(event,'direct path write', round(time_waited_micro/1000) - &prevdpw_tm_var,0)) dpwtm, sum(decode(event,'direct path write', round(time_waited_micro/1000) ,0)) prevdpw_tm, sum(decode(event,'direct path write', total_waits - &prevdpw_ct_var,0)) dpwct, sum(decode(event,'direct path write', total_waits ,0)) prevdpw_ct, sum(decode(event,'direct path write temp', round(time_waited_micro/1000) - &prevdpwt_tm_var,0)) dpwttm, sum(decode(event,'direct path write temp', round(time_waited_micro/1000) ,0)) prevdpwt_tm, sum(decode(event,'direct path write temp', total_waits - &prevdpwt_ct_var,0)) dpwtct, sum(decode(event,'direct path write temp', total_waits ,0)) prevdpwt_ct, sum(decode(event,'direct path read temp', round(time_waited_micro/1000) - &prevdprt_tm_var,0)) dprttm, sum(decode(event,'direct path read temp', round(time_waited_micro/1000) ,0)) prevdprt_tm, sum(decode(event,'direct path read temp', total_waits - &prevdprt_ct_var,0)) dprtct, sum(decode(event,'direct path read temp', total_waits ,0)) prevdprt_ct, to_char(sysdate,'SSSSS')-&prevsec_var delta, to_char(sysdate,'SSSSS') prevsec from v$system_event where event in ('db file sequential read', 'db file scattered read', 'direct path read temp', 'direct path write temp', 'direct path read', 'direct path write', 'log file parallel write') ) ;
Output looks like
SEQ_MS SEQ_CT LFPW_MS LFPW_CT SEQ_MS SCAT_CT DPR_MS DPR_CT DPRT_MS DPRT_CT -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- 115.71 422.67 76.17 12.00 .00 .00 .00
The first execution of the query is I/O since database startup, so should most likely be ignored.
Subsequent executions are the I/O since the last execution
The columns are
- SEQ_MS: single block latency
- SEQ_CT: single block reads per second
- LFPW_MS: log file parallel write latency
- LFPW_CT: log file parallel write count per second
- SCAT_MS: multi-block latency
- SCAT_CT: multi-block reads per second
- DPR_MS: direct path read latency
- DPR_CT: direct path read count
- DPRT_MS: direct path read temp latency
- DPRT_CT: direct path read temp count
Simple but only once a minute
select n.name event, m.wait_count cnt, 10*m.time_waited ms, nvl(round(10*m.time_waited/nullif(m.wait_count,0),3) ,0) avg_ms from v$eventmetric m, v$event_name n where m.event_id=n.event_id and ( wait_class_id= 1740759767 -- User I/O or wait_class_id= 4108307767 -- System I/O ) and m.wait_count > 0 ;
.
Pingback: Kyle Hailey » Wait Metrics vs Wait Events
Pingback: Oracle IO latency monitoring | Vishal desai's Oracle Blog
Thanks for sharing the details …
One questions. Last week my report was took more than 5 hours instead of 3 hour… When I check all inner & outer details with the help of tools & tracing ..etc.. but not found any interesting things even execution plan.. gather stats was also intact properly.
When we did last check on storage level & found one storage path was dead & all traffic was moving only one path which cause this issue….after fixing again reports are back to normal 3 hours..
Is there way can we find pro active this type of unexpected issues..
Please request to shed some light & how to handle this type of issue..
Thanks
Yousuf
Note: I have both good & bad AWR report of this issue.
@Yousuf: ASH will give you a break down of where the time was spent and why more time was spent on the 5 hour instead of the 3 hour. Once you identify the difference you can set an alert when that type of even occurs
Thanks for your reply…..
Sure, I will validate the result as suggested with ASH (Active Session History) report..