SQL Centre of Excellence

When migrating a SQL Server workload we need to think about a few things

  • A baseline of what resources our workload needs, at normal and peak operation, often we may just capture performance counters for CPU, Disk, maybe some SQL Server counters such as Batch Requests per second and ideally a business counter such as OPM (Orders Per Minute).
  • A way to measure the target platform to make sure it has sufficient resources to run the workload. For disk resources the defacto tool is SQLIO. I have some notes on how to specify requirements here (http://blogs.prodata.ie/post/How-to-Specify-SQL-Storage-Requirements-to-your-SAN-Dude.aspx). For CPU resources you may simply check the number of cores and frequency, or use an industry standard benchmark such as those available on www.spec.org

The cloud presents some challenges. If we are migrating to the Infrastructure as a Service,its really easy to verify we have the correct resources by simply running tools like SQLIO or running the workload and checking the performance counters for throughput and latency. If we are migrating a database to “Windows Azure SQL Database” or SQL Azure as it is still commonly called, we are a bit more restricted. WASD doesn’t allow us to use perfmon or SQLIO, so the best benchmark we can use is something measured from the application like OPM.

So what does SQLIO tell me about running SQL in the cloud ?

As a test we created a four core VM in the West Europe data centre and then run a series of SQLIO tests to measure the “saturation” point for disk IO (Thanks Fintan and Stephen!). The resulting chart is below.

This shows that the disk we tested scaled up to about 30MB/Second at just over 30ms latency. The highlighted section on the chart shows the “saturation point” where increasing the outstanding IO gives no more throughout. This is essentially the limit.

image

30MB/Sec equates to almost exactly 1,000 IOPS. Quite a nice round number! so much so that I think its safe to say that this is probably a limit imposed by throttling rather than the actual limitation of the hardware.

One key challenge for SQL infrastructure running in the cloud is the log drive performance. Traditionally we like to keep the latency at say 3ms, but assumabley due the replication and the fact the VHD is mounted on top of a blob store we have  very high write latency. At low yields, say only a few MB/Sec, it maintains 10-20ms average, but then quickly increases to 100ms plus average!

image

I ran a sample ecommerce OLTP supplied by dell on a four core VM running in IaaS. We can see that the number one wait statistics was indeed WRITELOG, so a primary bottleneck is the disk subsystem. You can see the delays from the DMV “exec requests”.

image

 

So what does this mean again ?

This gives us a few facts we can use when doing capacity planning for workloads moving to the cloud:

  • Azure IaaS (4 core) offering currently has a max ceiling of 1,000 IOPS for a disk. How much you get in terms of MB/Sec will depend on the block size: 30MB/Sec with an 8K block size and 200MB/Sec with 64k block size. You won’t get beyond 200MB/Sec or 1,000 IOPS due to throttling, but maybe this will change in the future.
  • While the max IOPS is 1,000 I was just in a session with the SQLCAT team at SqlBits and they recommend to consider 500 IOPS per disk for capacity planning..
  • Different VM Sized allow you to add more disks, and therefore get more IOPS. It seems that you probably want to consider using multiple disks as a matter of course in Azure IaaS as the charge is not based on the number of disks
  • Write latency will start to affect service quality moreso than it will with on premise disk subsystems which are not replicated.  You need to make sure that this is either not important or design applications to work more asynchronously, so this does not translate into poor user quality with log writes (as our sample OLTP suffered from)

Thoughts for some more testing

I note that there are two options for disk configuration in Azure Iaas: host caching and whether geo-replication is enabled. I would love to test the effect of both of these on IO throughput and latency. This could for example affect placement of TempDB. We ran the tests with geo-replication disabled.

I’m just going to see if we can get more throughput or better latency by utilising more than one logical disks. Will hopefully post the results ;-)

If you want to repeat these tests you can sign up for Azure IaaS here.

http://www.windowsazure.com/en-us/

and my article here has some links and tips on running SQLIO

http://blogs.prodata.ie/post/How-to-Specify-SQL-Storage-Requirements-to-your-SAN-Dude.aspx

As a recap for anyone who has made it this far, here are the other three blog entries in the series:

This weeks questions are on waits and extended events. OK admission – I suck at extended events. Never really got on the gravy train with SQL 2008 and still clinging onto the hope that RML utilities will be upgraded to support SQL 2012. I’d even do the upgrade myself if the source code was on codeplex, such is my aversion to extended events. Waits on the other hand are the bread and butter of performance tuning – to be a SQL MCM I would expect you not only need to be able to articulate main wait types, but at least share some horror stories of situations where you’ve seen them. Mere mention of CXPACKET should be enough for a one hour debate.

Waits (not just wait stats)

1. Can you define “signal wait time” ?

2. In the DMV ‘sys.dm_os_wait_stats’ does the column “wait_time” include or exclude signal wait time?

3. What does a high “signal wait time” usually mean ?

4. What does a high wait time but low signal wait time mean ?

5. What is the difference between WRITELOG and LOGBUFFER wait types ?

6. When running a query what could cause an IO_COMPLETION wait type ?

7. When running a query what could cause an ASYNC_IO_COMPLETION wait type ?

8. What can cause PAGELATCH_XX ?

9. If you  have a RESOURCE_SEMAPHORE wait for a large query how could you resolve this?

10. How could you resolve THREADPOOL waits on a server with a high number of concurrent users?

11. How would you determine why a log file was continually growing ?

12. What can sys.dm_io_virtual_file_stats show you that you cannot see from perfmon counters ?

Extended Events (Assume SQL 2008R2)

  1. What is a predicate WRT extended events ?
  2. Is a predicate at the event or session scope ?
  3. Explain the difference between sys.dm_xe_packages and sys.dm_xe_objects ?
  4. Name some targets available in SQL 2008R2 (six available) ?
  5. How big is an asynchronous buffer by default ?
  6. What happens if an event is bigger than this default size?
  7. Explain the difference between the three options for dealing with a full extended events buffer using EVENT_RETENTION_MODE
  8. If the services is restarted will an xevents session continue ?
  9. Why would you want to use extended events to track page splits instead of just using the performance counter ?
  10. What is the purpose of this xevents node “//RingBufferTarget/event/action/value”

I’m at the Belfast user Group Thurs the 11th April to demo and discuss two of my favourite topics (below). If you are interested in SQL Server or want to know a bit more about Windows Azure hype v feature then please do come alone.

SQL 2012 Always On Deep Dive (Beginner/Intermediate)

Always On Availability Groups offers a huge leap forward in terms of high availability. This sessions is a demo based introduction to the high availability changes in SQL 2012 emphasising key features and benefits of Availability Groups.

During the demonstration we will show how to seamlessly upgrade a database from SQL 2005+ to Denali with ZERO down time and then use the Always on features to simplify reporting, availability and disaster recovery.

This session will be useful for anyone who is working in a high availability environment or currently using replication to provide reporting and looking for a neater solution.

Windows Azure for SQL Folk (Beginner)

So what is this cloud stuff and how does it affect database professionals?

In this session we aim to cover the fundamental concepts that you should know on the Cloud and Windows Azure. In this demo based hour we will step through building Azure applications, deploying and managing them and what storage choices you have when developing applications in the cloud.

This sessions is Ideally suited to anyone who wants to understand more behind how Azure works, what it offers and what is probably just hype. Note that we are looking at the whole of Azure rather than just SQL Azure.

Date: Thursday 11th April 2013
Time: 5:45pm-8:45pm
Venue: Liberty IT, Adelaide Exchange, 24-26 Adelaide St, Belfast.

Registration is at http://www.eventbrite.com/event/5705276636#

Last year one of the most popular sessions for SQL Saturday were the lightning talks. No need to register in advance, just register at the event for a FIVE minute slot. one hour and maybe ten speakers.

The rules are simple:

- Topic can be ANYTHING loosely related to SQL Server
- Strict five minute time, buzzer provided for cut off

We will have some mystery prizes. Last year prize categories were:

  • Best “New” speaker (no MVPs and product team folk)
  • Most entertaining
  • Best speaker

This years promised to be a mixture of short nuggets form industry experts with some light hearted content thrown in.

If you missed Matt Massons hilarious “How to use SSIS and slipstream to manage facebook” he is presenting a sequel “How to use SSIS to manage online dating”

@CarmelGunn penned “Paradise Lost? Advise found” with the catch tag of “When your data fits on a USB stick, more hardware’s unlikely to do the trick” http://blogs.prodata.ie/post/Paradise-Lost-Advise-Found.aspx

To register and for more information on this years SQL Saturday Dublin go to
http://www.sqlsaturday.com/229/

or catch up at #sqlsatdublin on twitter

These questions are focused on availability, more next week on performance and tuning. I helped write some of the Mirror questions on the v1.0 MCM exam which must be well retired by now, and I must admit they were pretty obscure.  Learn those failure conditions and resources/threads required on the MSDN articles and MCM videos!

HA - Availability

1. Define the difference between RPO and RTO ?

2. What are the key components of an SLA ?

3. What are the key characteristics of an SLO (or Service Level Target) (eight possible)

4. Name five different types of disaster, what they affect and recommended mitigation steps

5. Articulate the key feature differences between Log shipping and Replication

6. When and why would you use Failover clustering instead of Mirroring

7. What would be a blocker to using failover clustering (name three blockers)

8. A major web site wants “zero down time” and “zero data loss” and “zero user impact”. What SQL Server technology will achieve this by itself?

9. How can developers ensure that mirroring and always on do not affect the user experience when they failover WRT uncommitted transactions?

10. Now that you have answered (9) do you want to change your answer to (8) ;-)

11. What is the max failover time with a mirror with default settings?

12. What types of failures can cause this?

13. Explain the difference between “Looks Alive” and “is Alive” in WFC?

14. How does this change in SQL 2012?

15. Windows 2008 cluster components no longer need the HCL, but what “certification” do they need?

16. A customer has added a new disk to the cluster. Do they need to rerun the validation tool to maintain supported status?

17. Should a cluster use the vendor MPIO driver or the MS MPIO ?

18. Any special anti-virus considerations for clustering? name them

19. There is a three node cluster with a single instance. How many nodes should be removed as possible owners before patching?

20. Why might log shipping offer better performance for a data warehouse workload than mirroring?

21. What’s the maximum number of databases that it is recommended to Mirror on a single instance?

22. How many additional threads does a mirror consume on the principal per database mirrored?

23. At what level of hardware does enterprise edition offer more redo threads with mirroring ?

24. What OS and SQL editions can be supported with the SVVP ?

25. Is VMWARE or Hyper-V “Snapshot” technology supported by MS ?

26. What does the RHS.exe do in Windows 2008 clustering ?

Few people are naturally “wide” enough to cover the whole of the SQL MCM with ease, and no I don't mean in terms of eating burgers, ice cream, beer and carbs.

When I went to Redmond, I knew that I was weak in clustering, replication and partitioning. I may have ahem “architected” some clusters while working for Microsoft Consultancy Services and even reviewed some design documents, best practices, etc but I had not actually got my hands dirty. It was getting embarrassing as sales guys were even saying “You have a SQL cluster issue, you need to speak to our guy Bob. Bob – show them what's wrong will you”. A friend who runs the dell training centre took me through a series of labs and we made clusters, broke them, player with quorum models, looked into logs, etc. After that one evening some weight was lifted off my shoulders. I could now start to merge the theory and practice a bit better with some affirmation from someone who has set up maybe hundreds of environments. I also stopped avoiding clusters like the plague.

Some advice. If you don’t have and know your weak areas, either your already already an MCM, MCA and BI Maestro (and your name is Dirk Gubbels). OR maybe you haven’t looked hard enough.

Find those weak areas and embrace them. If you are going for MCA don’t be afraid to make it clear on your skills, where they blur and where they end. In my MCA review board I was asked a truck load of questions on ITIL, operations and data centre manageability, I recommended that we bring in a team including an Ops consultant (for which I had some contact), and explained that while I could help them navigate through some of the (obvious) high level issues to engage deeper we needed a specialist rather than just myself. I was more a solution/technical design and infrastructure dude than hard core data centre operations. Expecting a ding, I actually got a comment from the review board of “solid – knows what he knows and what he doesn’t know”. So don’t expect to be an expert in everything. In the MCM you may have to guess some questions as its a Microsoft Learning exam but in MCA – don’t bullshit so much. In the real world you don’t need to personally be an expert in everything to get to the correct result.

If you have read the "http://blogs.prodata.ie/post/So-you-think-your-a-SQL-MCM-Test-Yourself-Part-I.aspx” here some more questions to test where your at. If you read these questions and say “that was fun, must check out 1-2 things” you are ready. If you read them and go “holy crap”, maybe you need to watch the videos some more and attend some of the level 400-500 training events that are going around.

Health Warning: These are not actual MCM questions. Just some questions to demonstrate the type depth I personally would expect to see. If you pass these questions and fail the MCM – sorry, I’ll buy you a beer…

Locking and Blocking

1. How can you see the locks for a specific SPID (two common methods)

2. Someone has a shared lock on a resource. Can an update lock also be serviced?

3. Name two types of locks that can be made if an IX lock is in place ?

4. What can cause a row lock to escalate to a page lock?

5. How can you implement partition level locking?

6. When would you consider the “UPDLOCK” instead of the default shared lock??

7. How can you control lock timeout

8. Name two ways to deal with lock timeout in procedural code?

9. How can you avoid blocking (up to five methods)?

10. What is the relevance of the “'blocked process threshold”

11. How can you avoid deadlocks (six techniques)

Snapshot Isolation

1. What is ACID?

2. Someone issues a begin tran and then two identical select statement. Under what isolation levels will the same results return if updates are ongoing?

3. When RCSI is enabled what is increase in row length and why?

4. What is the difference between RCSI and SNAPSHOT?

Partitioning

1. What are the pros and cons of PV’s versus PT’s?

2. Name two improvements in query plans on partitioned tables in 2008+ over 2005

3. Are there any queries that will run slower on partitioned tables v non partitioned. Name some examples?

4. How do you enable partition level lock escalation?

5. How and why would you disable lock escalation completely on a partitioned table?

6. A customer with a complex OLTP is having performance problems. The developers have read about partitioning and want to apply this. Is this a good idea?

Index Fragmentation

1. What is the largest IO size with read ahead in enterprise and std edition ?

2. What is the largest amount of read ahead that Enterprise and standard edition will do ?

3. Do you want to change your answer to question (1) after doing question 2 ;-)

4 Explain the difference between logical, extent and physical fragmentation. Which ones prevent SQL Server read ahead from using larger IO size?

5. Why is the perfmon counter “Page/Splits per second” not very useful for detecting harmful page splits?

6. How can you detect harmful page splits versus natural ones?

7. List three ways to fix fragmentation

8. What are the pros and cons of using MAXDOP of 1 on an index rebuild for a data warehouse ?

9. Does a clustered index rebuild rebuild all the Non clustered Indexes ?

10. What can prevent online index rebuild operations (name three things)

11. How can you monitor the progress of an index rebuild?

HA - Backup

1. A log backup is started while a full backup is being taken. Will the log record by cleared ?

2. A crash happens during a backup. Will transactions started before the backup be written to the data file on disk? why ?

3. Does a backup also include any log details ? why ?

4. What does BACKUP LOG WITH TRUNCATE ONLY do?

5. How do you backup the “tail” of the log following disk failure ?

6. Will taking a full backup break the log chain ?

7. Name two common ways the log chain is broken on log backups ?

8. What is the largest IO size you can get with backups and how can you get it ?

HA - Restore

1. How do you set instant file initialization?

2. How much faster does instant file initialization make the log file creation?

3. Apart from a faster disk (and instance file initialization) how can you make restore form a full back up faster?

4. You have a customer that needs to create a 100GB log file on restore. How can they make this faster without additional hardware?

5. How can you determine how long the “undo” phase of restore will take ?

6. If you restore with “NORECOVERY” is the redo phase applied

7. What affects how long the “redo” phase takes ?

8. Why should you not use SSMS when recovering from a disaster ?

9. When and why would you use RESTORE..WITH STANDBY ?

10. How does this DMV help with a restore strategy ? “dm_db_persisted_sku_features”

HA – Consistency and Corruption

1. What is the number one cause of corruption ?

2. What is the difference between a hard and soft IO issues ?

3. What is a read-retry error ? does it result in data loss ?

4. Name two places 823, 824 and 825 I/O errors are logged ?

5. What is the recommended tool to check a disk subsystem for IO errors for an OLTP SQL workload?

6. Can you whiteboard the difference between torn page detection and checksum ?

7. What does checksum protect from that torn pages does not ?

8. You switch a database from torn-page detection to checksum. You run a query which will read a page from disk that is corrupted. At what point will the error be detected.

9. What types of errors does automatic page repair work with ?

10. Name some improvements in DBCC CHECKDB introduced in 2005 ?

11. Name some improvements in DBCC CHECKDB introduced in 2008 ?

12. Does DBCC CHECKDB check constraints?

13. How do you estimate how much tempdb may be required for a DBCC CHECKDB ?

14. DBCC checkDB always takes 20 minutes to run, Today it is 60 minutes and still going.

a. What is the most likely cause?

b. How can we help answer how long it will take?

Post any comments on questions that bug you. Would love to get an existing MCM to help provide answers but I’m too lazy ;-)

Just found out that I’ve been selected to speak at SQL Saturday Edinburgh on the 7th/8th of June http://sqlsaturdayedinburgh.com

I’m super excited as Edinburgh is one of my favourite cities, having been through my student years there and sampled many a pint in the Last Drop on Grassmarket. I’ve been back a few time for new year but never during the summer.

So what topic am I speaking on ? Well a topic fitting of a SQL Server MCA, SQL MCM, SQL Server MVP and Analysis services Maestro. Its a tool that leaves SQL professionals, IT professionals and dot.net developers shaking in their boots (or pulling their hair out).

Yes, my “Microsoft Access Power Hour” is finally coming to a SQL Saturday near you.

Come along and jeer, see what the enemy is wooing the business with;  throw some rotten tomatoes as we show how fantastic Microsoft Access is . If you can’t beat the enemy, join them: find out how to get SQL Server and Access to co-exist, even maybe work together, and actually mention the world “performance” without giggling.

If you have not seen what access can do I guarantee that you will change your opinion – or at least know your enemy better so you can fight your battles.

Yes, with Microsoft Access you do common tasks ridiculously faster than with the “enterprise” equivalents.

Some popular demos:

- Using Access to build your windows front ends in 30 seconds, while dot.net developers are still getting their coffee before white boarding class diagrams.

- Developing reports in 15 seconds. Make reporting services and even PowerView look cumbersome.

- Building data driven web sites in 60 seconds. Make Access do the hard work and leave web designers to work on the nice logos and arty stuff.

- Transferring data for ETL solutions in five seconds. Leave SSIS and BCP for when you need to take more time to justify your bill.

- Developing SharePoint solutions with Microsoft Access in 90 seconds, or optionally employ a SharePoint “workflow” developer for six weeks.

- Using Microsoft Access to bulk load and transfer data.

- How to bypass common access security to prove access is very insecure.

- How to make Access databases secure to annoy people who are trying to prove it is insecure.

- How to secretly scale Microsoft Access front ends to many hundreds of concurrent users (people have been killed for less)

- Why Microsoft Access can teach SQL Server a thing or two about meta data, extended properties, data management and being RAD.

We have  few places left for the three day agile data warehouse design course with Lawrence Corr. This is the first time we have run a dimensional modelling course in Ireland and we picked Lawrence Corr to present the course. Numbers are limited to only 25, so do book as soon as you can – we can’t accommodate a lot of people on this course.

If you are working with data warehouse’s or star schemas in any way shape or form, then  this will be a really good course. While I speak at a lot of event in the year I only take about one 2-3 day course a year and this is the one I am attending this year.

Good data modelling skills are the foundation of any star schema design and Lawrence has lots of experience to help guide us through the topic, having been an associate of Ralph Kimball himself and worked for Ralph Kimball University.

  • Day 1 – Modelstorming/ Agile BI requirements gathering
  • Day 2 – Agile Star Schema Design
  • Day 3 – Dimension Design Patterns

For more information and to book, follow the link below:

http://www.prodata.ie/events/2013/Agile_Data_Warehouse_Design.aspx

I use the term “Checkpoint IO storm” to refer to issues caused by the checkpoint process “flooding” the disk subsystem. This is often diagnosed on OLTP systems as periodic high latency writes which occur at very regular intervals.

The basic problem is that SQL Server does not write data changes write to disk immediately (the MDF file), it will continue to only write to the log file and write dirty pages to the in memory buffer pool. Every now and then it will then flush the dirty pages to disk by the checkpoint process based on how long it thinks it would take to recover if the server crashed (the recovery interval).

For many customers this process goes un-noticed for other customers they suffer a very noticeable drop in service quality.

How do you Know you have a Checkpoint Issue ?

Well high latency periodically is a warning sign but the sure sign is when you can correlate the high latency to the checkpoint process using these performance counters:

- Buffer Manager\CheckPoint Pages/Sec

- Disk Secs / Write

- Disk Writes/Sec

As a test I ran a load test on a sample OLTP doing ecommerce transactions on the C: drive of my laptop and this is what they look like. You can see that my workload does about 105 IOPS and disk latency is usually pretty great (under 1ms as its SSD), but when the Checkpoint pages/sec kicks in the IOPS go up to 5,000 or so and disk latency goes up to 20ms. My critical inserts will now get 20ms of WRITELOG waits !!

image

Paul Randal goes into some detail in the checkpoint process on his Immersion course aimed at MCM candidates and also has a blog article on monitoring checkpoints further than the perfmon counters

http://www.sqlmag.com/blog/sql-server-questions-answered-28/sql-server/how-to-monitor-checkpoints-137148

How do you solve checkpoint issues

There are a few tools in your arsenal.

a) Reconfigure the disk subsystem to handle the burst smoother (if you can get past the SAN mafia).

b) Set the recovery interval on your instance This is roughly the number of minutes of data SQL Server will keep dirty before flushing. If you lower this the checkpoints will be more frequent.   http://msdn.microsoft.com/en-us/library/ms191154.aspx

c) Make the checkpoint flush over a longer time to regulate the IO using the checkpoint command and the [checkpoint duration] flag. See http://msdn.microsoft.com/en-us/library/ms188748.aspx 

Lets See this [checkpoint duration] thing in action

You need to be careful as if you specify a checkpoint duration too low, this will cause spikes and if you specify a period too long, SQL server will not be writing as fast as the pages are getting dirty so will need to do another unregulated checkpoint after the manual one!

As an example I ran the script below while a OLTP load was applied.

image

image

A bit better but not perfect ;-( you can see that the 5,10 and 20 second checkpoints for this particular workload still get a small latency spike (the red line). Not as bad as the 20ms, but 4-10ms all the same, not great for log writes.

I then tried the following script

image

image

Much better we can see that peak IOPS have gone down from 9,000 during the “IO storm” to under 3,000 and latency barely kisses 1ms. The SAN guys can buy us a Guinness…

So why don’t I go away and mess with this right now?

Two reasons. Firstly you need to know the workload super well to start messing with the checkpoint. The default settings are kind of self regulating, so can handle when the workload changes.

Secondly, this issue can more often be fixed by basic disk best practices such as not placing the data and log on the same spindles or LUN, or if the server has two HBA’s that are not muti-path, placing the data on one and the log on another,or if a big data load considering minimal logging.

One issue I have seen a few times is if the SAN runs out of write cache half way through the checkpoint. You can be in a world of pain as SQL Server will be pushing at potentially tens of thousands of IOPS to the cache and suddenly boom – the SAN can’t handle it as it runs out of cache and all hell breaks loose.

we finally have a date for this years SQL Saturday Dublin It is June 22nd, 2013. The registration site is below:

http://www.sqlsaturday.com/229/

Without question this will be the biggest SQL Server event in Ireland for 2013.

Last  year we had about 400 attendees, with many of the worlds top SQL speakers present, excellent prizes, barbeque, beer and an after event party with a real Irish theme. The event was sold out and the Hilton Hotel was packed with everything to do with SQL Server. Precon wise we had Matt Masson from the SSIS product team delivering a deep dive on SSIS design patterns for performance and new features in SQL 2012.

This year we are looking to top that. Speaker submissions are now active on the site and we will be looking to run TWO full day pre-conference seminars on the Friday. On the Saturday we will have FOUR tracks covering core database, BI, Developer and cloud/special interest.

For those of you who know Marco and Sandra, Inga, Niall and Carmel who are helping run the event , you’ll know that this will be a special/unique event and if you can stay for the evening you wont be disappointed.

Here is a clip of Jack Wise in action from last year

Jack wise at the last SQL Saturday

 

Hope to see you there!!

Page List

Page List