Home > Performence Tuning > Procedure need to follow to determine the SAN issue.

Procedure need to follow to determine the SAN issue.


Procedure need to follow to determine the SAN issue.
Normally everything is fine, however over the past few weeks we have been having huge lag spikes in sql server. It’s clear that SQL Server is waiting on disk I/O. But I keep getting told that it is because SQL Server is asking for abnormally high I/O. Which isn’t the case?
IO issue identification due to SAN:
Identify I/O bottlenecks by examining the latch waits. Run the following DMV query to find I/O latch wait statistics.
select wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms, wait_time_ms / waiting_tasks_count
from sys.dm_os_wait_stats
where wait_type like ‘PAGEIOLATCH%’ and waiting_tasks_count > 0
order by wait_type

Identify the number of pending I/Os that are waiting to be completed for the entire SQL Server instance:
SELECT SUM(pending_disk_io_count) AS “Number of pending I/Os” FROM sys.dm_os_schedulers

Capture the details about the stalled I/O count reported by the above query.
SELECT * FROM sys.dm_io_pending_io_requests
To identify the stalled I/O per each database files in the SQL Server instance, run the below query:
SELECT DB_NAME(database_id) AS “Database”,file_id, io_stall_read_ms, io_stall_write_ms, io_stall FROM sys.dm_io_virtual_file_stats(NULL,NULL)

Note: The values are cumulative since SQL Server started. You can use sample_ms column output to compare the output between two instances of the output and find which file is the cause of the stalled I/O. (sample_ms is the number of milliseconds since the computer was started). Once you identify the file and the drive with high stall IO, next step is to capture the IO latency
SELECT
–virtual file latency
[ReadLatency] =
CASE WHEN [num_of_reads] = 0
THEN 0 ELSE ([io_stall_read_ms] / [num_of_reads]) END,
[WriteLatency] =
CASE WHEN [num_of_writes] = 0
THEN 0 ELSE ([io_stall_write_ms] / [num_of_writes]) END,
[Latency] =
CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0)
THEN 0 ELSE ([io_stall] / ([num_of_reads] + [num_of_writes])) END,
–avg bytes per IOP
[AvgBPerRead] =
CASE WHEN [num_of_reads] = 0
THEN 0 ELSE ([num_of_bytes_read] / [num_of_reads]) END,
[AvgBPerWrite] =
CASE WHEN [io_stall_write_ms] = 0
THEN 0 ELSE ([num_of_bytes_written] / [num_of_writes]) END,
[AvgBPerTransfer] =
CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0)
THEN 0 ELSE
(([num_of_bytes_read] + [num_of_bytes_written]) /
([num_of_reads] + [num_of_writes])) END,
LEFT ([mf].[physical_name], 2) AS [Drive],
DB_NAME ([vfs].[database_id]) AS [DB],
–[vfs].*,
[mf].[physical_name]
FROM
sys.dm_io_virtual_file_stats (NULL,NULL) AS [vfs]
JOIN sys.master_files AS [mf]
ON [vfs].[database_id] = [mf].[database_id]
AND [vfs].[file_id] = [mf].[file_id]
— WHERE [vfs].[file_id] = 2 — log files
— ORDER BY [Latency] DESC
— ORDER BY [ReadLatency] DESC
ORDER BY [WriteLatency] DESC;
GO

First see which database files have the most IO bottleneck by running this query(Query by Glenn Berry)
SELECT DB_NAME(fs.database_id) AS [Database Name] ,
mf.physical_name ,
io_stall_read_ms ,
num_of_reads ,
CAST(io_stall_read_ms / ( 1.0 + num_of_reads ) AS NUMERIC(10, 1)) AS [avg_read_stall_ms] ,
io_stall_write_ms ,
num_of_writes ,
CAST(io_stall_write_ms / ( 1.0 + num_of_writes ) AS NUMERIC(10, 1)) AS [avg_write_stall_ms] ,
io_stall_read_ms + io_stall_write_ms AS [io_stalls] ,
num_of_reads + num_of_writes AS [total_io] ,
CAST(( io_stall_read_ms + io_stall_write_ms ) / ( 1.0 + num_of_reads
+ num_of_writes ) AS NUMERIC(10,
1)) AS [avg_io_stall_ms]
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS fs
INNER JOIN sys.master_files AS mf WITH ( NOLOCK ) ON fs.database_id = mf.database_id
AND fs.[file_id] = mf.[file_id]
ORDER BY avg_io_stall_ms DESC
OPTION ( RECOMPILE );
 Then run this query to see the top ten events your server is waiting on(query by Jonathan Kehayias). You will also find similar query from Glenn Berry diagnostic queries.
SELECT TOP 10
wait_type ,
max_wait_time_ms wait_time_ms ,
signal_wait_time_ms ,
wait_time_ms – signal_wait_time_ms AS resource_wait_time_ms ,
100.0 * wait_time_ms / SUM(wait_time_ms) OVER ( ) AS percent_total_waits ,
100.0 * signal_wait_time_ms / SUM(signal_wait_time_ms) OVER ( ) AS percent_total_signal_waits ,
100.0 * ( wait_time_ms – signal_wait_time_ms )
/ SUM(wait_time_ms) OVER ( ) AS percent_total_resource_waits
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0 — remove zero wait_time
AND wait_type NOT IN — filter out additional irrelevant waits
( ‘SLEEP_TASK’, ‘BROKER_TASK_STOP’, ‘BROKER_TO_FLUSH’, ‘SQLTRACE_BUFFER_FLUSH’,
‘CLR_AUTO_EVENT’, ‘CLR_MANUAL_EVENT’, ‘LAZYWRITER_SLEEP’, ‘SLEEP_SYSTEMTASK’,
‘SLEEP_BPOOL_FLUSH’, ‘BROKER_EVENTHANDLER’, ‘XE_DISPATCHER_WAIT’,
‘FT_IFTSHC_MUTEX’, ‘CHECKPOINT_QUEUE’, ‘FT_IFTS_SCHEDULER_IDLE_WAIT’,
‘BROKER_TRANSMITTER’, ‘FT_IFTSHC_MUTEX’, ‘KSOURCE_WAKEUP’,
‘LAZYWRITER_SLEEP’, ‘LOGMGR_QUEUE’, ‘ONDEMAND_TASK_QUEUE’,
‘REQUEST_FOR_DEADLOCK_SEARCH’, ‘XE_TIMER_EVENT’, ‘BAD_PAGE_PROCESS’,
‘DBMIRROR_EVENTS_QUEUE’, ‘BROKER_RECEIVE_WAITFOR’,
‘PREEMPTIVE_OS_GETPROCADDRESS’, ‘PREEMPTIVE_OS_AUTHENTICATIONOPS’, ‘WAITFOR’,
‘DISPATCHER_QUEUE_SEMAPHORE’, ‘XE_DISPATCHER_JOIN’, ‘RESOURCE_QUEUE’ )
ORDER BY wait_time_ms DESC
Once you have this information at hand it would be much easier to troubleshoot the problem.

Once you identify the file and the drive with high stall IO, next step is to capture logical disk counters using Perfmon (Windows Performance monitor).
Logical Disk
Avg. Disk sec/read
Avg. Disk sec/write
Avg. Disk sec/transfer
%Disk time
Disk Read Bytes/sec
Disk Write Bytes/sec
Split IO/sec
Process
IO Data Bytes/sec
IO Read Data Bytes/sec
IO Write Data Bytes/sec
Remember to capture the counters for respective drive or process not in _Total
Average Disk Sec/Transfer on a SQL Server machine is typically less than 15ms. If this value climbs, it may indicate that the I/O subsystem is not keeping up with I/O demand. To gauge I/O properly, here are the values for reference:
Any number Less than 10ms – very good
Between 10 – 20ms – okay
Between 20 – 50ms – slow, needs attention
Greater than 50ms – Serious I/O bottleneck
If this value is above the recommended value, check if SQL Server is flooding the IO subsystem – in case of poorly written IO intensive query and explore the opportunity of tuning such queries, an addition index or creating missing or updating existing statistics or re-writing the problematic query might help reduce the IO.

Advertisements
Categories: Performence Tuning
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: