SQL Centre of Excellence

Great news today: I just got the acceptance email for the SSAS Maestro programme in June over in Redmond!!

You can read a bit about the programme from Kasper and Denny’s blogs:

http://www.powerpivotblog.nl/do-you-have-what-it-takes-to-become-ssas-maestro

http://sqlcat.com/msdnmirror/archive/2011/01/30/what-is-the-ssas-maestros.aspx

If I’m successful, I’ll have the SQL MCA, MCM, MVP and SSAS Maestro. Hmm, might look a bit pretentious on an email signature.

I’m sure the course content will be under NDA so I won’t be able to blog much, but Vidas made up a good reading list so we know what to expect.

http://www.ssas-info.com/VidasMatelisBlog/208_ssas-white-paper-list

Luckily, I’ve already read (err, skimmed and know they exist) all the background reading, since part of my day job as a consultant is pointing customers to the appropriate white papers that underpin whatever problem they may be having.

I will have to go back and re-read as many as possible, and brush up on my MDX and SSAS internals, which are my real weak points at the best of times.

For me the most frustrating thing about Analysis Services is that, thanks to Paul Randal, Kimberly Tripp and many others, we know “so” much about the SQL Server DB Engine: how it works, why it makes the choices it does, and even the on-disk structures. With Analysis Services, however, a lot of the product is still a black box to all but a few insiders. Even if we can quote best practices, we don’t often truly understand why something is a best practice well enough to evaluate the choices we make. Hopefully the Maestro programme will change all that for me.

While I’m lording it up in Redmond for a week (fancy hotel, food and uninterrupted sleep), Carmel will have to get the six kiddies up and off to school for a week without me, as well as putting up with me trying to cram in reading between now and June. I will owe some major brownie points for sure.

Thanks to everyone who attended my session. Here are some links that might be useful for further reading:

- Presentation Deck (attached to this blog)

- Microsoft Assessment and Planning Application Toolkit

 http://www.microsoft.com/downloads/en/details.aspx?FamilyID=67240b76-3148-4e49-943d-4d9ea7f77730&displaylang=en

- SQL Consolidation Planning Add-In for Excel

http://www.microsoft.com/downloads/en/details.aspx?FamilyID=eda8f544-c6eb-495f-82a1-b6ae53b30f0a

- Demo of using MAP and Excel to produce IO Histogram

http://blogs.prodata.ie/post/Using-MAP-Tool-and-Excel-to-Analyse-IO-for-SQL-Consolidation-Part-I-e28093-basic-distribution.aspx

- Demo of using MAP and PowerPivot to analyse IO for a SQL Estate

http://blogs.prodata.ie/post/Using-MAP-Tool-to-Analyse-IO-for-SQL-Consolidation-Part-II-e28093-Using-Power-Pivot.aspx

Some great questions came from the audience, which I’ll try to summarise here:

Q: What is the maximum recommended CPU limit to decide what SQL Servers to virtualise?
In Hyper-V this is an easy question to answer, as it only supports four virtual CPUs per guest. VMware supports more, BUT the general recommendation is NOT to use more virtual CPUs than are available on a single NUMA node, which is typically 4 or 8. The reasoning is that your virtual operating system will not be “NUMA aware”, and having some of the CPUs on one NUMA node and some on another will be very expensive. There is not a lot of research yet on this type of scenario, so most people avoid it. Your mileage may vary, but we avoid crossing NUMA boundaries with virtualisation.
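As a quick sanity check, this sketch shows how many schedulers (CPUs) SQL Server itself sees on each NUMA node of whatever box you run it on:

-- Visible schedulers per NUMA node
SELECT parent_node_id AS numa_node,
       COUNT(*) AS visible_schedulers
FROM sys.dm_os_schedulers
WHERE [status] = N'VISIBLE ONLINE'
GROUP BY parent_node_id;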

Q: Does the MAP tool account for growth?

No, it only captures a snapshot of resource usage. My concern is not usually the growth in CPU usage and IOPS of workloads, but the fact that as time goes on MORE workloads will be added to the virtualised environment. You need to figure out what spare capacity will be allocated for this on the virtual hosts in terms of CPU, IOPS and disk space.

How much disk space you need to park for “growth” depends on how you are virtualising. If you are only moving physical SQL Servers and moving drives like for like, then they may already have had capacity planning for X time.

If you are doing SQL consolidation (moving data and log files to new environment), then this is trickier. It is possible to run the MAP Inventory tool at different months to measure % growth in databases to help plan for growth.

Q: Why don’t we ask for IOPS based on the MAX amount used?

This would be cool to do for a single server, as you would get guaranteed performance no matter how busy it is! However, imagine you have 100 servers, each doing an average of 50 IOPS, with an 80th percentile of 80 IOPS and a max of 250 IOPS.

Commissioning for the maximum (100 x 250 = 25,000 IOPS) at 8ms could be quite expensive, but commissioning for the 80th percentile (100 x 80 = 8,000 IOPS) might be more practical.

Note we are not saying that the maximum IOPS we would ask for is 8,000, just that we want 8,000 IOPS at an average of 8ms latency. Hopefully the storage will be able to spike above 8,000 IOPS.

It’s unlikely that all 100 servers will need their “max” at the same time, so we can save some cost with shared storage.

 

Thanks. Do post any follow-up questions ;-)

The Microsoft Assessment and Planning Toolkit is a fantastic tool for helping gather resource usage for a SQL consolidation / virtualisation project.

http://technet.microsoft.com/en-us/library/bb977556.aspx

One thing that is quite basic, though, is its presentation of disk IOPS (see below). This article shows how to analyse the data in a bit more detail using the underlying SQL database and Excel.

[Screenshot: MAP disk IOPS summary]

1) IOPS distribution graph

This is useful for seeing the distribution of IOPS so we can determine what percentage of IO falls within percentile boundaries. For example, in the chart above, would we want to purchase a device only capable of performing 15 IOPS? Probably not, as that would mean only 50% of the IO is within performance targets. We may want 80% or 95% of the IO coming in within performance targets. So how do we determine this?

a) Locate the database on the “MAPS” instance and query it for IOPS broken down into 15-minute periods:

-- Average IOPS per 15-minute period (average per sample multiplied by the number of devices)
SELECT
    CAST(FLOOR(CAST(pd.collection_datetime AS float(53)) * 24 * 4) / (24 * 4) AS smalldatetime) AS [time],
    CEILING(SUM(pd.disk_transfers_per_sec) / COUNT(*) * COUNT(DISTINCT pd.device_number)) AS IOPS
FROM dbo.performance_disk AS pd
INNER JOIN dbo.devices AS d
    ON pd.device_number = d.device_number
WHERE pd.instance = N'_Total'
GROUP BY CAST(FLOOR(CAST(pd.collection_datetime AS float(53)) * 24 * 4) / (24 * 4) AS smalldatetime)

OK, I know the query is horrible, but it’s a once-off!

b) Copy and paste the results into Excel.

[Screenshot: query results pasted into Excel]

c) Click Insert > PivotTable and build a pivot table with IOPS on the rows and the count of time twice in the values area. Name the first count of time “Freq” and the second “Quartile”.

[Screenshot: pivot table with the Freq and Quartile value fields]

d) Show the second value field (the one we named “Quartile”) as “% Running Total In” IOPS.

[Screenshot: the “% Running Total In” value field setting]

e) Click Insert > 2-D Line Chart, right-click on the second series (Quartile), select Format Data Series > Plot Series on Secondary Axis, and tidy up the chart with some nice formatting.

[Screenshot: IOPS distribution chart with the cumulative % on the secondary axis]
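If you would rather stay in T-SQL than build the Excel pivot, a rough sketch of the same running-total idea is below. It assumes you have saved the output of the 15-minute query above into a work table (here called #iops_by_period, a made-up name) with columns [time] and IOPS:

-- Frequency and cumulative % per IOPS value (the same thing the Freq/Quartile pivot shows)
SELECT i.IOPS,
       COUNT(*) AS Freq,
       100.0 * (SELECT COUNT(*) FROM #iops_by_period AS i2 WHERE i2.IOPS <= i.IOPS)
             / (SELECT COUNT(*) FROM #iops_by_period) AS RunningPct
FROM #iops_by_period AS i
GROUP BY i.IOPS
ORDER BY i.IOPS;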

 

We can now see some interesting data: while the average IOPS is about 19.5, the 80th percentile is about 25 and the 90th percentile is about 34.

This might help specify SAN requirements as we could now say that our workload requires:

  • 25 IOPS at a latency of 8ms (80th percentile)
  • 34 IOPS at a latency of 20ms (90th percentile)
  • a maximum of 37 IOPS (100th percentile)

OK, these are just small play numbers, but the concept is there. Some wider points to consider:

a) This assumes that we don’t want to massively increase performance.

b) We may need to add more IOPS for future growth and other factors.

A sample Excel sheet with the results is attached. I’d love to hear if anyone finds this useful!

 

Download File - IO_Distribution.zip

As mentioned previously (http://blogs.prodata.ie/post/Using-MAP-Tool-and-Excel-to-Analyse-IO-for-SQL-Consolidation-Part-I-e28093-basic-distribution.aspx), the MAP tool has a database containing all your disk IOPS data, nicely contained in a single table called “performance_disk”.

We can use PowerPivot to analyse the IOPS from different perspectives and start to apply some logic, such as excluding non-business hours, seeing demand by date, time and other factors, and examining read/write ratios.

I’ve attached a sample PowerPivot mashup that shows us some more details on one sheet (sample below)

[Screenshot: PowerPivot IOPS analysis sheet]

As it’s PowerPivot, we can do some funky stuff like dragging “ServerName” onto a chart to see the breakdown by actual server and help locate the IO gobblers (see below).
[Screenshot: IOPS breakdown by server]

Non-BI folk can stop reading now. Go away and play with the PowerPivot sheet, refresh it against your MAPS database and hopefully it’s useful.

----------------------------------------------------------------------------------------------------

For those BI folk, a few interesting challenges for PowerPivot:

  • The IOPS measure is semi-additive, in that we need to average the IOPS across time but sum it across servers to get the correct figure. We do this by simply averaging the IOPS and then multiplying by the number of unique servers; the number of unique servers is determined by the DAX expression below:
    [Screenshot: DAX expression for the distinct server count]
  • For a better distribution graph we want to “band” IOPS into buckets, for example 20-25, 25-30 and so forth. To do this we can use the CEILING DAX function, which is more flexible than the T-SQL equivalent as it takes a significance argument for rounding up to whole numbers like 5, 10, 50 or 100. The example below rounds IOPS up to the nearest two (a T-SQL comparison sketch follows after this list).
    [Screenshot: DAX CEILING expression banding IOPS to the nearest two]
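For comparison, here is a minimal T-SQL sketch of the same banding idea. T-SQL’s CEILING has no significance argument, so we scale, round up and scale back (the sample values are made up):

-- Band IOPS up to the nearest 2 in T-SQL
SELECT v.IOPS,
       CEILING(v.IOPS / 2.0) * 2 AS IOPSBand
FROM (VALUES (19), (23), (24), (37)) AS v(IOPS);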

 

Download File - MAP_IO_Analysis.xlsx

Storage area networks are getting larger and more sophisticated. One of the problems I face as a SQL consultant is running sqlio on them.

One common stumbling block is that the cache on some storage area networks can be huge (128GB plus), especially with the use of solid state devices as a tier “zero” cache accelerator. If our test file is too small we simply get the “from cache” speed of the SAN, and while this is interesting it’s not indicative of performance when we throw a couple of TB of SQL Server data onto it. The fact is that read cache on a SAN is often a waste of money for SQL Server anyway, as the cache hit ratio on the SAN will be incredibly low – SQL Server already has its own cache in the buffer pool, so it’s not going to be reading data from the SAN’s cache anytime soon unless the SAN cache is larger than the SQL Server buffer pool (possible but unlikely).

So what do we do when the SAN has a 64GB cache? Well, the guideline we use is that the test file should be about four times the size of the cache, so you’re going to need a 256GB test file.

What if you don’t want to wait for a 256GB test file to be created as you only have limited test time?

Here’s a neat trick that Joe Sack from the SQL MCM program showed me for using SQL Server’s “Instant File Initialisation” feature to massively speed up creation of test files.

1. Ensure that SQL DB Engine is installed and that Instant File Initialisation is configured in the local Security Policy Editor.

2. Create a database with an MDF file of the correct size (say 256GB) and detach the database (see the T-SQL sketch after these steps)

3. Modify the parameter file that sqlio uses to specify that you have pre-sized the file (see below)

4. Away you go.
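For step 2, a minimal T-SQL sketch might look like the following (the database name, paths and sizes are just examples; note the log file is still zero-initialised, so keep it small):

-- Pre-create a 256GB data file that sqlio can reuse, then detach the database
CREATE DATABASE IOTest
ON PRIMARY (NAME = IOTest_data, FILENAME = 'E:\IOTest.mdf', SIZE = 256GB)
LOG ON (NAME = IOTest_log, FILENAME = 'E:\IOTest_log.ldf', SIZE = 1GB);
GO
EXEC sp_detach_db @dbname = N'IOTest';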

Modifying the sqlio parameter file:

Usually the parameter file for sqlio is called something like “param.txt” and looks like this:

[Screenshot: the original param.txt]

You can simply change the filename to be the MDF file and then delete the last parameter so that the size is not specified.

[Screenshot: the modified param.txt pointing at the MDF file]
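Since the screenshots are not reproduced here, a hypothetical example (assuming the usual sqlio parameter layout of file path, thread count, mask and file size in MB) would be changing a line like:

E:\testfile.dat 2 0x0 256000

to:

E:\IOTest.mdf 2 0x0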

Using this technique I recently had a 1TB test file generated in under three seconds. Nice ;-)


SQL IO Characteristics

by Bob Duffy 8. August 2010 05:20

This is fairly well documented in various books and white papers, but can be tricky to find out on the Internet so I am summarising it here in this blog.

The key question is “what are SQL Server’s I/O characteristics?”. This helps answer questions like: why is a 1MB block size bad; why don’t we often care about random I/O for log files; what extra I/O benefit does Enterprise Edition have; and why do we mainly test random I/O for data files in OLTP systems.

The table below shows general I/O characteristics for the storage engine that will be important to most OLTP workloads. If you “really” know your workload you can optimise your storage. For example:

  • If you know that most of your workload is index seeks, you may benefit from an 8K block size on the SAN - good luck convincing your SAN team of this ;-)
  • If you are doing mostly very large table scans and you have Enterprise Edition, then it is possible that a 1MB block size may be most performant.
  • If you are not running Enterprise Edition, then a 1MB block size will never be optimal, as the maximum IO size SQL Server will issue is 256K, so at least three quarters of each I/O will be wasted.

Operation | Random / Sequential | Read / Write | Size Range
OLTP – Log | Sequential | Write | Sector-aligned, up to 60K
OLTP – Log | Sequential | Read | Sector-aligned, up to 120K
OLTP – Data (Index Seeks) | Random | Read | 8K
OLTP – Lazy Writer (scatter/gather) | Random | Write | Any multiple of 8K up to 256K
OLTP – Checkpoint (scatter/gather) | Random | Write | Any multiple of 8K up to 256K
Read Ahead (DSS, Index/Table Scans) | Sequential | Read | Any multiple of 8K up to 256K (up to 1024K on Enterprise Edition)
Bulk Insert | Sequential | Write | Any multiple of 8K up to 128K

 

This secondary table is useful as trivia, but we rarely optimise storage for these operations. One observation is that backup and restore can issue up to 4MB IO sizes; another is the importance of instant file initialisation in speeding up database creation (or file growth).

Operation | Random / Sequential | Read / Write | Size Range
CREATE DATABASE (or file growth) | Sequential | Write | 512KB (SQL 2000), up to 4MB (SQL 2005+); only the log file is initialised in SQL Server 2005+ if instant file initialisation is enabled
BACKUP | Sequential | Read/Write | Multiple of 64K (up to 4MB)
RESTORE | Sequential | Read/Write | Multiple of 64K (up to 4MB)
DBCC CHECKDB | Sequential | Read | 8K – 64K
ALTER INDEX REBUILD (read phase; replaces DBREINDEX) | Sequential | Read | Any multiple of 8K up to 256K
ALTER INDEX REBUILD (write phase; replaces DBREINDEX) | Sequential | Write | Any multiple of 8K up to 128K
DBCC SHOWCONTIG (deprecated, use sys.dm_db_index_physical_stats) | Sequential | Read | 8K – 64K

 

Further References:

Hitachi – Tuning SQL Server 2005 Performance: http://www.servicesorientedstorage.com/assets/pdf/tuning-microsoft-sql-server-2005-performance-wp.pdf

High Availability Best Practices: I/O Subsystem

http://download.microsoft.com/download/f/e/3/fe32de98-8be3-4212-9cf9-be75fc699df8/SQL_Server_Always_On_Tec_IO.ppt

SQL Server 2000 I/O Basics (Old but still mostly relevant today)

http://technet.microsoft.com/en-us/library/cc966500.aspx

I must be going mad. I’ve worked on thousands of database schemas over the years, including a lot of “conformed data warehouses” where there are a lot of different business areas in one user database, separated either by schema (newer style) or by some funky naming convention (older style). I’ve always found that SSMS can get a bit clunky and slow when there are lots of objects – in fact, on a recent customer site with Navision (which uses a silly number of objects) SSMS actually couldn’t enumerate all the tables and bombed out. We had to resort to just querying sys.objects to list tables.

I’ve just noticed that you can filter tables in SSMS to show only a subset. Very handy to select just the tables in a schema, or just the “Sales” tables. Here’s a screenshot of the feature in action.

How did I miss that? It’s like it’s been sitting on the toolbar since SSMS 2005… doh.

[Screenshot: the SSMS Object Explorer table filter in action]

Whenever we chat about IO and storage there are actually FOUR factors to determine requirements. Often people only discuss the first one (Capacity) so we have to tease out the others:

1. Capacity

2. Throughput

3. Latency

4. Service Levels

1. Capacity

Capacity is the easy one that no one misses. How many GB do I need?

The obvious mistake we see people making is to size only for the raw data. You also need to size for:

  • Indexes. These can be 20% to 100% of the data size, depending on how heavily indexed the database is. A database that is used for both OLTP and relational reporting (and is tuned) is likely to be heavily indexed. A data warehouse that just exists to load a cube might not have any indexes on it at all!!
  • TempDB. How to size TempDB is a topic in itself, but as a rough rule of thumb you’ll need to store at least 1.5 to 2 times your largest table (or partition) in TempDB.
  • Backup and general space. You are going to need somewhere to back up your SQL Server to, or restore from. You may also need more than one full copy to be stored. How much space you need can depend on whether your backups are compressed, but in a lot of cases the space required is 2 or more times the database + index size. The good news is that this “general” storage space does not often need top-of-the-range performance, so use of near-line storage in RAID 5 can save buckets.
  • OLAP. If you have cubes, then these need to be sized. A cube can be 15% to 100% of the size of the data warehouse, and you may need multiple copies. Often cubes don’t need terribly performant storage if queries are served from RAM.
  • Data processing. It’s not uncommon to have ETL processes drop files onto or pick files up from a SQL Server, and this requires both space and a drive letter. You can’t really put this on the same volume as the SQL data files or you will get fragmentation.

2. Throughput

Throughput is how fast the IO needs to be, or the rate of transfer. It is typically measured in IOPS, which is the Windows perfmon counter “Logical Disk\Disk Transfers/sec”, or it can also be measured in MB/sec.

Usually we measure how many 64K random read/write IOs we need per second at the correct latency (8-20ms) for data files.

An exception is data warehouse design, where we sometimes speak in terms of MB/sec. The reasoning here is that a lot of research has gone into Microsoft’s “Fast Track” design for data warehouses, which estimates that a single CPU core can consume about 200 MB/sec of sequential reads. While a full-blown Fast Track design might be too expensive for most Irish pockets, the underlying principles and estimator calculations are still really useful for sizing.

http://www.microsoft.com/sqlserver/2008/en/us/fasttrack.aspx

 

3. Latency

 

Latency is how responsive the disks need to be, and it is measured by the perfmon counter “Logical Disk\Avg. Disk sec/Transfer”. SQL Server is very prescriptive here:

  • Data Files need 8-20ms latency
  • Log files need 1-5ms latency

The exception to this is data warehouses, where we accept higher latency (8-50ms). The reasoning is very simple – if a query is going to take 30 seconds to scan millions of rows then having to wait 50ms to get started isn’t going to lose anyone any sleep…
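As a cross-check on the perfmon counter, a rough T-SQL sketch using SQL Server’s own cumulative file statistics (accumulated since instance startup, so treat it as indicative only) is:

-- Average read/write latency per database file since startup
SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.file_id,
       vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
ORDER BY database_name, vfs.file_id;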

4. Service Levels

This is how resilient we need the storage to be:

  • Do we need the throughput and latency guaranteed 9-5 or 24x7?
  • What throughput and latency can we accept during a disk failure event?
  • What throughput and latency can we accept during a backup event?
  • Following a failover to a DR site, what storage throughput and latency do you expect (or is this all bets are off!)?

As an example, imagine you have a 24x7 SQL Server on RAID 5 and a disk blows and will take eight hours to rebuild. What if the IOPS drops from 1,000 to 500 during the rebuild and latency goes from 8ms to 60ms? This could mean the event is technically an outage for eight hours, and if a disk blows a few times a year you will have lots of outages. Therefore the requirements should state “we need 1,000 IOPS and 8ms latency DURING a disk failure event”, or “we can accept 800 IOPS and 10ms latency during a disk failure event”. This will help the storage designer make the correct choice of configuration. For example, he may use 15 spindles on RAID 5 rather than 8 to ensure requirements are met, or he may change from RAID 5 to RAID 6. If you don’t specify the requirements, you can’t really blame the storage designer or vendor for not meeting them!!

For a real mission-critical SQL Server, I would run the SQLIO test under these conditions to baseline expectations and check that the requirements are met.

The bottom line is that service levels are the stick you use to get the storage designer to over-spec the IO beyond the basic requirements ;-) It will raise the cost, of course.

 

OK, this topic comes up all the time, and I partly feel like I’m going to get less work out of posting it by giving away some trade secrets, but it’s all in the public domain anyway. So here goes…

The scenario: you are an experienced infrastructure manager considering purchasing a new SAN, and you have lots of workloads to consider: file and print, VMware servers, Exchange, SQL Server, ERPs, web servers and lots of application servers. For a lot of these workloads you just tell the SAN guy how much space you need and away you go.

But what sort of questions are you going to want to ask a SQL consultant to help with the storage design for your SAN? I’ve made up ten questions that I think are quite common:

1. What are the storage requirements for SQL Server?
2. Why can’t I use the 128K or 1MB block size JimBob recommended?
3. What are the best RAID types and disk speeds for SQL Server?
4. How do I measure what my current SQL Server needs?
5. Any special cache settings?
6. Can I share SQL Server with my other workloads?
7. How do I know that my storage meets SQL Server’s requirements?
8. How do I convince the SAN vendor to meet SQL Server’s requirements?
9. How do I know if things are working well or not?
10. Useful links on SQL IO and storage area networks

1. What are the storage requirements for running SQL Server?

The magic numbers that every SQL consultant has in his back pocket are 8-20ms and 1-5ms ;-)

This means that the number one requirement for SQL Server is that when it reads from or writes to the disk subsystem, it MUST get the response back fast enough. If it does not, it will kind of go into “throttle mode” and slow down the IO process.

Your SQL Server will be split into three basic components: data files, log files and TempDB.

  • Data files must have a response time averaging about 8ms and a maximum response time of around 20ms. This is the Windows performance counter “Logical Disk: Avg. Disk sec/Transfer”.
  • Log files must have a response time averaging 1-5ms. The log files are often the most important thing to consider, as log writes can be synchronous in nature, so users will be waiting a lot if the log disks are slow.
  • TempDB performance must be about the same as data files (1-8ms). If a SQL consultant has performed a review of your workload and warned of “heavy” TempDB usage, then placing TempDB on your fastest disk subsystem will bring a lot of benefit.

Additionally, we can assume the following workload characteristics:

  • Assume that data files use 64K IOs that are random in nature, with both reads and writes performed by SQL Server.
  • Assume that log files use 64K IOs that are sequential in nature, write only.
  • Assume that an OLTP workload like your ERP or Navision is 60% reads and 40% writes. If you have existing SQL workloads you should establish the answer to “what is my read/write ratio?”. It is very common to see storage optimised for read performance when the workload is 90% writes!

It really is that simple: latency is the ONLY real requirement SQL Server has; everything else is just guidance to help you meet this requirement!! If the IO requests that SQL Server issues to your SAN meet those figures, then you have a highly performant storage subsystem and happy SQL Server users.

The main thing to specify apart from your latency requirement is the throughput (IOPS). It is no good meeting the 8ms target at 100 IOPS and then finding your workload needs 5,000 IOPS; you won’t be able to meet the 8ms target!!

2. Why can’t I use the 128K or 1MB block size JimBob recommended?

Well, if you can meet that 8ms target for data and 1ms for log writes, then knock yourself out; use a 5MB block size for all SQL Server cares ;-)

But realistically, you really need to use a 64K block size. There have been hundreds if not thousands of empirical tests showing that if the block size is larger than 64K you will get a massive increase in latency and will not meet the requirements for running SQL Server. For example, if the block size is 512K, then the SAN will need to return 512K when SQL Server has only asked for 64K; this could take 8 times longer, right…

3. What are the best RAID types and disk speeds for SQL Server?

Nowadays most SQL consultants try not to talk about RAID types and types of disk; it can be best to leave that up to the storage guys. If the storage team can meet my requirement for 5,000 random 64K read/write IOPS at 8ms latency by using 50 old SATA drives at 5,400 rpm in RAID 5, then knock yourself out – I’m happy. Well, maybe I’m happy until we have that chat about service level requirements during a disk degrade event, but that’s a different story…

Unfortunately, we often do have to discuss RAID or disk speed because the core requirements are not met, so we need to optimise what we have to lower latency at the target IOPS.

As a rule of thumb:

  • RAID 5 is generally optimised for read operations and will often give you a write performance hit, maybe an extra 1-2ms of latency or so. It is therefore a poor candidate for logs but may be a good candidate for read-intensive data volumes. It is probably a poor choice for TempDB, as TempDB is very write intensive. The golden rule here is “know your read/write ratio”: if your workload is 80% reads, then maybe RAID 5 will work for you.
  • RAID 10 is generally better optimised for writes and all-round performance. It is usually the best choice for TempDB and logs.
  • RAID 1 (mirror) is when you have just a single mirrored pair in a disk group. This is generally considered the holy grail for configuring log files: one dedicated pair per log file. This is because a single disk can deliver extremely fast and reliable sequential writes, which is perfect for log files. Now, why don’t we just put every log on its own disk? Cost is the main factor, so what we generally recommend is: IF you have a huge SQL Server that needs the very best in performance, then place the log files on their own RAID 1 disks; otherwise you’re going to want a RAID 10 pool for sharing log files.
  • RAID 6 is the new kid on the block. It is often preferred over RAID 5, as RAID 5 sucks when a disk degrades. The switch from RAID 5 to RAID 6 usually happens if there is a conversation about guaranteed service levels during a disk degradation event.

4. How do I measure what my SQL Server Needs?

Easy. Just collect the perfmon counters “Logical Disk: Disk Reads/sec” and “Logical Disk: Disk Writes/sec”. Over a period of time this will tell you the average IOPS you need and the maximum IOPS you need.

If you have SCOM installed, the Windows management pack actually captures these metrics, aggregates them up to the hour and puts them into a data warehouse. Give us a shout and we will gladly come in, suck the data out and present it in a nice report. We will even add reports into SCOM so you can get this data out at the push of a button.

You should quickly be able to say: “I need 750 IOPS, made up of 250 writes and 500 reads”.

Two words of warning:

1. Don’t use the average figure to specify requirements, or SQL Server will become dog slow during peak usage. Consider specifying the maximum or some figure in between.

2. If your target IOPS is 1,000, you probably don’t want to run the storage subsystem “hot”. Agree with your storage team what sort of headroom is needed, e.g. “My SQL Server needs 1,000 IOPS; can you give me a storage subsystem that can do 2,000 IOPS at a latency of 8ms please?”
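As an alternative to perfmon, here is a rough T-SQL sketch that samples SQL Server’s cumulative file statistics over a minute to estimate reads/sec and writes/sec (an approximation only, not a replacement for a proper perfmon capture):

-- Snapshot the cumulative counters, wait 60 seconds, then diff to estimate IO rates
SELECT num_of_reads, num_of_writes
INTO #io_before
FROM sys.dm_io_virtual_file_stats(NULL, NULL);

WAITFOR DELAY '00:01:00';

SELECT (SUM(vfs.num_of_reads) - (SELECT SUM(num_of_reads) FROM #io_before)) / 60.0 AS reads_per_sec,
       (SUM(vfs.num_of_writes) - (SELECT SUM(num_of_writes) FROM #io_before)) / 60.0 AS writes_per_sec
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs;

DROP TABLE #io_before;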

5. Any special Cache Settings?

Well, if the storage can meet my requirement for 5,000 random 64K read/write IOPS at 8ms latency using standard SAN cache settings, then who cares, right!! Knock yourself out.

But if you need to optimise the storage to lower latency, then there are a few common tricks:

  • Read cache is pretty much useless for SQL Server. The reasoning is that SQL Server uses most of its memory (when over 2GB) for the buffer pool, which is basically its own disk cache, so it won’t really ever request the same block from the SAN twice. Change the read cache to only 10%, or turn it off.
  • If the SAN cache is “overloaded” with writes, in some cases disabling the write cache and the cache “mirror” to the other controller can help. This is not a blanket recommendation though; it is something to test when desperate.

6. Can I share SQL Server disks with my other workloads on the SAN?

Well, if you can run Exchange, VMware and all your other workloads alongside SQL Server and still meet my requirement for 5,000 random 64K read/write IOPS at 8ms latency, then knock yourself out!!

Otherwise, a side effect of sharing SQL Server storage with other workloads is that SQL Server may sometimes have to wait longer than it would hope (8-20ms), and this will cause performance issues.

As a rule of thumb, sharing SQL data volumes with other workloads “can” sometimes be beneficial. For example, rather than each getting 1.5 disks, your 10 workloads can now enjoy 15 spindles.

As another rule of thumb, sharing log files is generally a no-no. This is because log files need sequential IO, and sharing workloads will degrade that sequential IO pattern to random.

As a third rule of thumb, the service level agreement is king. If the end users of your SQL Server need “guaranteed” performance of max 50ms per query, then you will need to supply the SQL Server with storage that can guarantee “8ms” per IO request. Sharing disks with Exchange might not be such a good idea here.

7. How do I know that my storage meets SQL Servers requirements?

This is quite easy. There is a freely available tool called “SQLIO.EXE” which you can run; it will output the number of IOPS that the volume can do, along with average and max latency figures. The results look something like the sample below:

sqlio v1.5.SG
using system counter for latency timings, 3306972 counts per second
parameter file used: param.txt
    file E:\testfile.dat with 1 thread (0) using mask 0x0 (0)
1 thread reading for 120 secs from file E:\testfile.dat
    using 64KB random IOs
    enabling multiple I/Os per thread with 6 outstanding
using specified size: 5000 MB for file: E:\testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:   500.59
MBs/sec:    31.28
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 8
Max_Latency(ms): 690

histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%:  1  1  4 10 11 13 15 10  8  6  4  3  2  2  1  2  1  1  1  1  1  0  0  0  4
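For reference, output like the above typically comes from a command line along these lines (the exact switches are my assumption based on the output shown: 64KB random reads, 6 outstanding IOs, 120 seconds, latency timings, with the file and thread count taken from param.txt):

sqlio -kR -s120 -o6 -b64 -frandom -LS -Fparam.txt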

Note that in the example above, although the Max_Latency is recorded as 690ms, we can see that 96% of the IO returned in 20ms or under, so this might be very acceptable.

If you speak to your friendly SQL consultant (cough, Prodata) they will have some hand-rolled tool that automates SQLIO measurement and presents back a nice graph that looks something like this:

[Screenshot: IOPS versus latency chart from automated SQLIO runs]

We can see from the chart above that the storage can sustain 8,944 IOPS and still hit the SQL requirements (8ms avg and 20ms max). Beyond that, the SAN is saturated and latency increases massively.

8. How do I convince the SAN vendor to meet SQL Server’s requirements?

Most of the conversations I have about poor SAN performance are based around the scenario where requirements are not met. How do you convince the SAN team or vendor to help meet the basic requirements for running SQL Server?

Well, ultimately it’s going to cost cash. There are some quick wins we discussed: block size, RAID type, increasing the spindle count, and volume alignment. It may also need some help from a performance tuning specialist to highlight IO bottlenecks such as HBA cards, switches, lack of multipathing in a VMware environment, or driver-related issues.

The “best” advice is to convince your SAN vendor to let you test their SAN BEFORE you pay for it; pre-sales help is a lot cheaper than post-sales. In Ireland a lot of the SAN manufacturers even have performance centres and know a lot about SQL Server. Not favouring anybody, but a big shout out to the EMC guys, who regularly attend every advanced SQL training seminar going in Ireland and really know their stuff.

If you can go armed with your requirements (remember: 5,000 random 64K IOPS at 8-20ms), they can show you a configuration meeting those requirements and everyone will be happy.

9. How do I know if things are working well or not?

Apart from your SQL users moaning, you will know whether your storage is causing a performance problem for SQL Server by using a few techniques:

The most basic is the Windows perfmon counters discussed above: you can monitor the IOPS and latency. If you are seeing consistent latency > 20ms then you have a problem.

Another route is to check SQL Server’s wait stats. SQL Server records every single time a query is halted due to resource waits, and you can query these to see what the biggest problems are (a simple query sketch follows the list below). At a high level:

  • The WRITELOG wait stat indicates that the log file cannot be written to fast enough.
  • PAGEIOLATCH, ASYNC_IO_COMPLETION and IO_COMPLETION waits are a sign of potential disk subsystem issues.
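A minimal sketch of that kind of wait stats query (the list of benign waits excluded below is far from complete; tune it for your environment):

-- Top waits since instance startup, to spot IO-related waits such as
-- WRITELOG, PAGEIOLATCH_* and IO_COMPLETION
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'SQLTRACE_BUFFER_FLUSH', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;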

10. Useful links on SQL IO and Storage Area Networks

Storage Top 10 Best Practices

http://technet.microsoft.com/en-ie/library/cc966534(en-us).aspx

Complete I/O Best practices whitepaper - http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx

Predeployment I/O Best Practices - http://sqlcat.com/whitepapers/archive/2007/11/21/predeployment-i-o-best-practices.aspx

SQLIO.EXE Tool

SQLIO Disk Subsystem Benchmark Tool 

Disk partition Alignment – Make the Case by Jimmy May

http://blogs.msdn.com/b/jimmymay/archive/2009/05/08/disk-partition-alignment-sector-alignment-make-the-case-with-this-template.aspx

Performance Tuning with Wait Stats by Joe Sack

http://www.digitalconcourse.com/dropzone/MSCOMM/PASSMN/PASSEVT20090616/Joe%20Sack%20Performance%20Troubleshooting%20with%20Wait%20Stats.pdf
