When migrating a SQL Server workload we need to think about a few things
- A baseline of what resources our workload needs, at normal and peak operation, often we may just capture performance counters for CPU, Disk, maybe some SQL Server counters such as Batch Requests per second and ideally a business counter such as OPM (Orders Per Minute).
- A way to measure the target platform to make sure it has sufficient resources to run the workload. For disk resources the defacto tool is SQLIO. I have some notes on how to specify requirements here (http://blogs.prodata.ie/post/How-to-Specify-SQL-Storage-Requirements-to-your-SAN-Dude.aspx). For CPU resources you may simply check the number of cores and frequency, or use an industry standard benchmark such as those available on www.spec.org
The cloud presents some challenges. If we are migrating to the Infrastructure as a Service,its really easy to verify we have the correct resources by simply running tools like SQLIO or running the workload and checking the performance counters for throughput and latency. If we are migrating a database to “Windows Azure SQL Database” or SQL Azure as it is still commonly called, we are a bit more restricted. WASD doesn’t allow us to use perfmon or SQLIO, so the best benchmark we can use is something measured from the application like OPM.
So what does SQLIO tell me about running SQL in the cloud ?
As a test we created a four core VM in the West Europe data centre and then run a series of SQLIO tests to measure the “saturation” point for disk IO (Thanks Fintan and Stephen!). The resulting chart is below.
This shows that the disk we tested scaled up to about 30MB/Second at just over 30ms latency. The highlighted section on the chart shows the “saturation point” where increasing the outstanding IO gives no more throughout. This is essentially the limit.
30MB/Sec equates to almost exactly 1,000 IOPS. Quite a nice round number! so much so that I think its safe to say that this is probably a limit imposed by throttling rather than the actual limitation of the hardware.
One key challenge for SQL infrastructure running in the cloud is the log drive performance. Traditionally we like to keep the latency at say 3ms, but assumabley due the replication and the fact the VHD is mounted on top of a blob store we have very high write latency. At low yields, say only a few MB/Sec, it maintains 10-20ms average, but then quickly increases to 100ms plus average!
I ran a sample ecommerce OLTP supplied by dell on a four core VM running in IaaS. We can see that the number one wait statistics was indeed WRITELOG, so a primary bottleneck is the disk subsystem. You can see the delays from the DMV “exec requests”.
So what does this mean again ?
This gives us a few facts we can use when doing capacity planning for workloads moving to the cloud:
- Azure IaaS (4 core) offering currently has a max ceiling of 1,000 IOPS for a disk. How much you get in terms of MB/Sec will depend on the block size: 30MB/Sec with an 8K block size and 200MB/Sec with 64k block size. You won’t get beyond 200MB/Sec or 1,000 IOPS due to throttling, but maybe this will change in the future.
- While the max IOPS is 1,000 I was just in a session with the SQLCAT team at SqlBits and they recommend to consider 500 IOPS per disk for capacity planning..
- Different VM Sized allow you to add more disks, and therefore get more IOPS. It seems that you probably want to consider using multiple disks as a matter of course in Azure IaaS as the charge is not based on the number of disks
- Write latency will start to affect service quality moreso than it will with on premise disk subsystems which are not replicated. You need to make sure that this is either not important or design applications to work more asynchronously, so this does not translate into poor user quality with log writes (as our sample OLTP suffered from)
Thoughts for some more testing
I note that there are two options for disk configuration in Azure Iaas: host caching and whether geo-replication is enabled. I would love to test the effect of both of these on IO throughput and latency. This could for example affect placement of TempDB. We ran the tests with geo-replication disabled.
I’m just going to see if we can get more throughput or better latency by utilising more than one logical disks. Will hopefully post the results ;-)
If you want to repeat these tests you can sign up for Azure IaaS here.
and my article here has some links and tips on running SQLIO