SQL Centre of Excellence

(Script can be downloaded here)

 

TempDB is the workhorse of most SQL Servers; if a query is doing something wrong, or is just being a resource hog, chances are it's TempDB picking up the slack. It's a safe bet that the majority of queries running on your system are using TempDB in some way. (There's a good TechNet article here on how TempDB is used.)

So it pays to give your TempDB some TLC. Starting out as a DBA, I found it pretty confusing as to what the "Best Practices" actually were when it came to TempDB configuration, as there are many differing opinions. So here is a script I've put together with what I'm going to call my "Recommended Practices" rather than "Best Practices". In my experience, structuring your layout to get green lights on each of these checks has yielded the best performance and stability, but your mileage may vary!

What am I checking, and why?

The script performs 6 checks, and I’ll break them down below:

 

1. If you have multiple CPUs, have you got multiple TempDB data files?

On a multi-core system, it is recommended to have multiple TempDB data files. This is to reduce contention between threads requiring TempDB space all trying to hit a single file. The official recommendation from Microsoft is to have one TempDB file for each CPU/core in the server.

Now this is massive overkill, particularly on large multi-core boxes, so a ratio of one file for every four cores is a good place to start, potentially rising to a 1:2 ratio if you still have some hotspots. Paul Randal has already written an excellent blog on this, so I won't repeat the effort.

This check measures the ratio of CPUs to TempDB data files, and reports back accordingly.
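
Just for illustration (the downloadable script does this properly), a bare-bones sketch of the check might look something like this, using sys.dm_os_schedulers for the CPU count and tempdb.sys.database_files for the file count:

```sql
-- Sketch: compare visible online schedulers (CPUs) to TempDB data files
DECLARE @Cpus int, @Files int;

SELECT @Cpus = COUNT(*)
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';

SELECT @Files = COUNT(*)
FROM tempdb.sys.database_files
WHERE type_desc = 'ROWS';   -- data files only, ignore the log

SELECT @Cpus  AS Cpus,
       @Files AS TempDbDataFiles,
       CASE WHEN @Files * 4 >= @Cpus
            THEN 'OK (1:4 ratio or better)'
            ELSE 'Consider adding TempDB data files'
       END AS Verdict;
```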

 

2. If you have multiple TempDB Data files, are they all sized equally?

If you have multiple files, you want to ensure they are all sized equally so that the Round Robin effect is in play, which gives you that performance boost. Again, Paul discusses this in the blog linked above, so I won't duplicate the reasoning.

This check ensures that if multiple files exist, they are all equally sized, and if not, it has a loud whinge.
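
If you want to eyeball this one yourself, a minimal sketch is to count the distinct data file sizes; anything other than one means the files are unequal:

```sql
-- Sketch: all TempDB data files should report the same size
-- (size is in 8KB pages)
SELECT CASE WHEN COUNT(DISTINCT size) = 1
            THEN 'OK - all data files equally sized'
            ELSE 'WARNING - unequal data file sizes'
       END AS Verdict
FROM tempdb.sys.database_files
WHERE type_desc = 'ROWS';
```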

3. Do you have TF1118 enabled?

This is a contentious one, as it's not necessarily a best practice, but I've found it really helps concurrency in any situation where I've had issues. This check looks at whether TF1118 is enabled; the flag disables mixed extents (over to Paul Randal again for this one in his blog), and in my experience it enhances concurrency.

One important note here: this disables mixed extents server-wide, not just for TempDB. This isn't a worry in most cases, as disk space isn't as expensive any more, but just be aware of it.
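
Checking for the flag is a one-liner; DBCC TRACESTATUS with -1 reports globally enabled trace flags:

```sql
-- Returns TraceFlag, Status, Global and Session columns;
-- Global = 1 means TF1118 is enabled server-wide
DBCC TRACESTATUS(1118, -1);
```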

4. If you have multiple data files, do you have TF1117 enabled?

This trace flag ensures that if one data file experiences a growth event, then all data files grow equally. This means that file growth on a data file doesn’t effectively take it out of the Round Robin pool, as all files must be sized equally to be considered.

You should be pre-allocating anyway, but this trace flag will ensure you don’t take a performance hit if one or more of your files grows.

Again, this is server-wide, so if you have any user DBs with multiple files in a filegroup, they will exhibit the same behaviour.
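
The check itself follows the same pattern as the TF1118 one above:

```sql
-- Global = 1 means TF1117 is enabled server-wide
DBCC TRACESTATUS(1117, -1);
```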

5. Have the TempDB data files experienced any auto growth events?

You should be pre-allocating space for these files where possible, and this check makes sure you got it right. If you have no auto growth events, then you are sized right for your system. If you do, it may be worth taking a look at how much each file has grown and re-evaluating your sizing.
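
One way to dig these events out yourself is via the default trace (a sketch, assuming the default trace is still enabled, which it is out of the box):

```sql
-- Sketch: list TempDB auto grow events captured by the default trace
DECLARE @TracePath nvarchar(260);

SELECT @TracePath = path
FROM sys.traces
WHERE is_default = 1;

SELECT te.name                  AS EventName,
       t.[FileName],
       t.StartTime,
       t.IntegerData * 8 / 1024 AS GrowthMB   -- IntegerData is in 8KB pages
FROM sys.fn_trace_gettable(@TracePath, DEFAULT) AS t
JOIN sys.trace_events AS te
  ON t.EventClass = te.trace_event_id
WHERE te.name IN ('Data File Auto Grow', 'Log File Auto Grow')
  AND t.DatabaseName = N'tempdb'
ORDER BY t.StartTime DESC;
```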

6. Are the TempDB data files on their own storage?

TempDB is generally the most I/O-intensive database on an instance, and it can be very heavy random I/O. While a lot of workloads now sit on SANs, it still makes sense to separate TempDB from all of your other files. This helps from a management perspective within the OS, but also, if you are using tiered storage such as the EMC VNX, the partition holding TempDB can be placed on the higher tier when it's out on its own, without having to move all of your user DBs too.

It's a recommended practice from me, but if you're having no issues and you are within your I/O thresholds, then this is by no means a must-do.
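
If you want a rough self-check, sys.master_files can show whether any other database shares TempDB's drive (a sketch that only compares drive letters, so mount points will fool it):

```sql
-- Sketch: non-TempDB databases with files on the same drive as TempDB
SELECT DISTINCT DB_NAME(mf.database_id)   AS DatabaseName,
                LEFT(mf.physical_name, 1) AS Drive
FROM sys.master_files AS mf
WHERE mf.database_id <> 2   -- 2 = TempDB
  AND LEFT(mf.physical_name, 1) IN
      (SELECT LEFT(physical_name, 1)
       FROM sys.master_files
       WHERE database_id = 2);
```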

</Lecture>

So there you have it: some recommended practices, and a way to give your TempDB a quick health check. If anyone has any feedback, questions, comments or rebuttals on any of this, as always I'd love to hear it.

Stephen

Few people are naturally "wide" enough to cover the whole of the SQL MCM with ease, and no, I don't mean in terms of eating burgers, ice cream, beer and carbs.

When I went to Redmond, I knew that I was weak in clustering, replication and partitioning. I may have, ahem, "architected" some clusters while working for Microsoft Consultancy Services, and even reviewed some design documents, best practices, etc., but I had not actually got my hands dirty. It was getting embarrassing, as sales guys were even saying "You have a SQL cluster issue, you need to speak to our guy Bob. Bob – show them what's wrong, will you". A friend who runs the Dell training centre took me through a series of labs and we made clusters, broke them, played with quorum models, looked into logs, etc. After that one evening, some weight was lifted off my shoulders. I could now start to merge the theory and practice a bit better, with some affirmation from someone who has set up maybe hundreds of environments. I also stopped avoiding clusters like the plague.

Some advice. If you don't have (and know) your weak areas, either you're already an MCM, MCA and BI Maestro (and your name is Dirk Gubbels), OR maybe you haven't looked hard enough.

Find those weak areas and embrace them. If you are going for MCA, don't be afraid to make clear where your skills lie, where they blur and where they end. In my MCA review board I was asked a truckload of questions on ITIL, operations and data centre manageability. I recommended that we bring in a team including an Ops consultant (for whom I had a contact), and explained that while I could help them navigate some of the (obvious) high-level issues, to engage deeper we needed a specialist rather than just myself. I was more a solution/technical design and infrastructure dude than hard-core data centre operations. Expecting a ding, I actually got a comment from the review board of "solid – knows what he knows and what he doesn't know". So don't expect to be an expert in everything. In the MCM you may have to guess some questions, as it's a Microsoft Learning exam, but in the MCA – don't bullshit so much. In the real world you don't need to personally be an expert in everything to get to the correct result.

If you have read "So you think you're a SQL MCM - Test Yourself Part I" (http://blogs.prodata.ie/post/So-you-think-your-a-SQL-MCM-Test-Yourself-Part-I.aspx), here are some more questions to test where you're at. If you read these questions and say "that was fun, must check out 1-2 things", you are ready. If you read them and go "holy crap", maybe you need to watch the videos some more and attend some of the level 400-500 training events that are going around.

Health Warning: These are not actual MCM questions, just some questions to demonstrate the type and depth I personally would expect to see. If you pass these questions and fail the MCM – sorry, I'll buy you a beer…

Locking and Blocking

1. How can you see the locks for a specific SPID (two common methods)?

2. Someone has a shared lock on a resource. Can an update lock also be serviced?

3. Name two types of locks that can be acquired if an IX lock is in place?

4. What can cause a row lock to escalate to a page lock?

5. How can you implement partition level locking?

6. When would you consider "UPDLOCK" instead of the default shared lock?

7. How can you control lock timeout?

8. Name two ways to deal with lock timeout in procedural code?

9. How can you avoid blocking (up to five methods)?

10. What is the relevance of the "blocked process threshold"?

11. How can you avoid deadlocks (six techniques)?

Snapshot Isolation

1. What is ACID?

2. Someone issues a begin tran and then two identical select statements. Under which isolation levels will the same results be returned if updates are ongoing?

3. When RCSI is enabled, what is the increase in row length and why?

4. What is the difference between RCSI and SNAPSHOT?

Partitioning

1. What are the pros and cons of PVs (partitioned views) versus PTs (partitioned tables)?

2. Name two improvements in query plans on partitioned tables in 2008+ over 2005.

3. Are there any queries that will run slower on partitioned tables versus non-partitioned ones? Name some examples.

4. How do you enable partition level lock escalation?

5. How and why would you disable lock escalation completely on a partitioned table?

6. A customer with a complex OLTP is having performance problems. The developers have read about partitioning and want to apply this. Is this a good idea?

Index Fragmentation

1. What is the largest IO size with read ahead in Enterprise and Standard edition?

2. What is the largest amount of read ahead that Enterprise and Standard edition will do?

3. Do you want to change your answer to question 1 after doing question 2? ;-)

4. Explain the difference between logical, extent and physical fragmentation. Which ones prevent SQL Server read ahead from using larger IO sizes?

5. Why is the perfmon counter "Page Splits/sec" not very useful for detecting harmful page splits?

6. How can you detect harmful page splits versus natural ones?

7. List three ways to fix fragmentation.

8. What are the pros and cons of using MAXDOP 1 on an index rebuild for a data warehouse?

9. Does a clustered index rebuild rebuild all the non-clustered indexes?

10. What can prevent online index rebuild operations (name three things)?

11. How can you monitor the progress of an index rebuild?

HA - Backup

1. A log backup is started while a full backup is being taken. Will the log records be cleared?

2. A crash happens during a backup. Will transactions started before the backup be written to the data file on disk? Why?

3. Does a backup also include any log details? Why?

4. What does BACKUP LOG WITH TRUNCATE_ONLY do?

5. How do you back up the "tail" of the log following a disk failure?

6. Will taking a full backup break the log chain?

7. Name two common ways the log chain is broken on log backups.

8. What is the largest IO size you can get with backups, and how can you get it?

HA - Restore

1. How do you set instant file initialization?

2. How much faster does instant file initialization make the log file creation?

3. Apart from a faster disk (and instant file initialization), how can you make a restore from a full backup faster?

4. You have a customer that needs to create a 100GB log file on restore. How can they make this faster without additional hardware?

5. How can you determine how long the "undo" phase of a restore will take?

6. If you restore with "NORECOVERY", is the redo phase applied?

7. What affects how long the "redo" phase takes?

8. Why should you not use SSMS when recovering from a disaster?

9. When and why would you use RESTORE...WITH STANDBY?

10. How does the DMV "sys.dm_db_persisted_sku_features" help with a restore strategy?

HA – Consistency and Corruption

1. What is the number one cause of corruption?

2. What is the difference between hard and soft IO issues?

3. What is a read-retry error? Does it result in data loss?

4. Name two places where 823, 824 and 825 I/O errors are logged.

5. What is the recommended tool to check a disk subsystem for IO errors for an OLTP SQL workload?

6. Can you whiteboard the difference between torn page detection and checksum?

7. What does checksum protect against that torn page detection does not?

8. You switch a database from torn page detection to checksum. You run a query which will read a page from disk that is corrupted. At what point will the error be detected?

9. What types of errors does automatic page repair work with?

10. Name some improvements in DBCC CHECKDB introduced in 2005.

11. Name some improvements in DBCC CHECKDB introduced in 2008.

12. Does DBCC CHECKDB check constraints?

13. How do you estimate how much tempdb may be required for a DBCC CHECKDB?

14. DBCC CHECKDB always takes 20 minutes to run. Today it has been running for 60 minutes and is still going.

a. What is the most likely cause?

b. How can we help answer how long it will take?

Post any comments on questions that bug you. I would love to get an existing MCM to help provide answers, but I'm too lazy ;-)

Whoops, I accidentally deleted the original list of questions. Like a true data professional, I don't back up my blog. Here are some of the original questions.

Internals

1. What's an example of a legacy LOB type versus a new LOB type?

2. Are legacy LOB types stored on/off row by default?

3. Are new LOB types stored on/off row by default?

4. How can you change this behaviour for legacy LOB types?

5. How can you change this behaviour for new LOB types?

6. If you change row storage for a LOB from "on row" to "off row" on a 100GB database, how long will it take?

7. How big in bytes is the "version tag" for versioned rows, and what does it store?

8. How often do PFS pages appear in a data file?

9. How often do IAM pages appear in a data file?

Index Strategies

1. A table has 1,000 pages with 100 rows per page. A query (select * from blah where MyColumn = 1) returns 500 rows, and there is an index on MyColumn. Will it use the index if the index is non-clustered?

2. Can an index include LOB data types?

3. How big can the clustered Index Keys on a Materialised View be?

4. Recommend two ways to get rid of unused indexes if you have a representative trace file.

Log Internals

1. Is there such a thing as a non-logged operation? If so, examples?

2. How can you get minimally logged inserts on a clustered table with rows in it? (Specify two things to do.)

3. A developer issues a begin tran followed by an update and then a checkpoint. Are the changes written to disk by the checkpoint?

4. How do you resolve a "checkpoint IO storm"?

5. Someone creates a 1GB log file. How many VLFs are there?

6. Is a log backup smaller or larger with BULK_LOGGED than with the FULL recovery model?

7. Is rebuilding an index in simple mode minimally logged?

8. Is a re-org of an index in simple mode minimally logged?

9. You need to update a 100GB BLOB when in bulk logged mode. Can this be done minimally logged? If so, how?

10. If you switch from full to bulk logged and then the MDF corrupts, can you back up the tail of the log?

11. How can you make log file initialisation faster?

Statistics

1. How many steps are in a stats histogram?

2. Are histogram values left or right based?

3. What can cause a recompile? (There are 14 events; list as many as you can.)

Paradise Lost? Advice Found

by Carmel 3. April 2012 03:28

Help me please my data’s fried

My manager is fuming

My users just cannot get in

It's very time-consuming

 

Resources have all gone to pot

My memory has amnesia

The processor is eaten up

And the network is freezing!

 

Our data feeds are going slow

And inserts do not happen

Our dashboards are not up to date

Yeah - Someone’s for a slappin’

 

Infrastructure took a look,

And bought some new equipment

They assured me things would go

At lightspeed, in an instant

 

We got a team and hauled it in

New servers, SAN and Rack

We moved the database and app

Improvement? - not jack.

 

Infrastructure bought some more

Disks, solid state and all

And memory, upgraded to the max

Things better? Not at all

 

Ah now they said, we have instead

Bought Fusion IO

They plugged it in, it was a SIN

Performance gain? NO

 

They asked their devs to take a look

They said it was the SQL

And that it was above their heads

So then they called us people

 

In we went to take a look

Your database size?, we asked

There was a pause, a big deep breath

A GIGABYTE they gasped

 

Quite curiously, I gazed

Upon their stored procedure

It was quite long, I will admit

I nearly had a seizure!

 

With UDFs, temp tables

And cursors abounding

Trigger-Happy to be sure

It really was astounding!

 

I took a look at their IO

T’was something like the matrix

Covering Indexes? Not a one

The scanning was CRAZY!

 

Non-sargable, crossed numa nodes

Blown cache, both plan and dough

A total mess – is this a test?

It really can’t be so!

 

Using Dynamic SQL doesn't always perform

And no param's is not the norm!

So what to do that is the question?

Maybe they'd take a small suggestion?

When your data fits on a memory stick

More hardware's unlikely to do the trick!!

 

 

My Sessions selected at SqlBits

by Bob Duffy 8. January 2012 07:50

I'm super excited that two of my sessions have been selected for SqlBits (http://sqlbits.com/) on the 29th–31st March 2012. SqlBits is all lined up to be the largest SQL event in Europe this year!

If you go to only one SQL event this year, this is the one. Now if you go to two events, you must of course come to our very own first ever SQL Saturday, which is also the Irish technical launch for SQL 2012. It's on the 23rd March and is free: http://www.sqlsaturday.com/105/eventhome.aspx

Here are my two sessions.

Load Testing Analysis Services (Friday 30th March)

http://sqlbits.com/Sessions/Event10/Load_Testing_Analysis_Services

Ever deployed an Analysis Services cube that worked perfectly well with one user on the development server, only to find that it doesn’t meet the required volumes of user concurrency?

This session focuses on tools and methodology to load test Analysis Services in highly concurrent environments, including how to locate resource bottlenecks and the appropriate configuration settings that can help improve performance.

Sample source code will be supplied to help you load test Analysis Services. We will be focused on the MOLAP engine, but the techniques are equally applicable to the new tabular data model available in BISM.

SQL 2012 Always On - Deep Dive (Sat 31st March)

http://sqlbits.com/Sessions/Event10/SQL_2012_Alwasy_On-Deep_Dive

Always On offers a huge leap forward in terms of high availability. This session is a demo-based introduction to the high availability changes in Denali, emphasising key features and benefits.

During the demonstration we will show how to seamlessly upgrade a database from SQL 2005+ to Denali with ZERO downtime, and then use the Always On features to simplify reporting, availability and disaster recovery.

This session will be useful for anyone who is working in a high availability environment or currently using replication to provide reporting and looking for a neater solution.

This session scored really highly at the last SqlBits, so I hope to improve the scores this time.

http://blogs.prodata.ie/post/Speaker-feedback-from-SqlBits-9-Scores-on-Doors.aspx

Thanks to everyone who attended my session. Here are some links that might be useful for further reading:

- Presentation Deck (attached to this blog)

- Microsoft Assessment and Planning Application Toolkit

 http://www.microsoft.com/downloads/en/details.aspx?FamilyID=67240b76-3148-4e49-943d-4d9ea7f77730&displaylang=en

- Sql Consolidation Planning Add-In for Excel

http://www.microsoft.com/downloads/en/details.aspx?FamilyID=eda8f544-c6eb-495f-82a1-b6ae53b30f0a

- Demo of using MAP and Excel to produce IO Histogram

http://blogs.prodata.ie/post/Using-MAP-Tool-and-Excel-to-Analyse-IO-for-SQL-Consolidation-Part-I-e28093-basic-distribution.aspx

- Demo of using MAP and PowerPivot to analyse IO for a SQL Estate

http://blogs.prodata.ie/post/Using-MAP-Tool-to-Analyse-IO-for-SQL-Consolidation-Part-II-e28093-Using-Power-Pivot.aspx

Some great questions from the audience, which I'll try to summarise here:

Q: What is the maximum recommended CPU limit to decide what SQL Servers to virtualise?

In Hyper-V this is an easy question to answer, as it only supports 4. VMware supports more cores, BUT the general recommendation is NOT to use more CPUs than are available on a single NUMA node, which is either 4 or 8. The reasoning is that your virtual operating system will not be "NUMA aware", and having some of the CPUs on one NUMA node and some on another will be very expensive. There is not a lot of research yet on this type of scenario, so most people avoid it. Your mileage may vary, but we avoid crossing NUMA boundaries with virtualisation.

Q: Does the MAP tool account for growth?

No, it only captures a snapshot of resource usage. My concern is not usually the growth in CPU usage and IOPS of workloads, but the fact that as time goes on MORE workloads will be added to the virtualised environment. You need to figure out what spare capacity will be allocated for this on the virtual hosts in terms of CPU, IOPS and disk space.

How much disk space you need to park for “growth” depends on how you are virtualising. If you are only moving physical SQL Servers and moving drives like for like, then they may already have had capacity planning for X time.

If you are doing SQL consolidation (moving data and log files to new environment), then this is trickier. It is possible to run the MAP Inventory tool at different months to measure % growth in databases to help plan for growth.

Q: Why don't we ask for IOPS based on the MAX amount used?

This would be cool to do for only one server, as you get guaranteed performance no matter how busy it is! However, imagine you have 100 servers, each doing an average of 50 IOPS, with an 80th percentile of 80 IOPS and a max of 250 IOPS.

Commissioning for the max, 100 × 250 = 25,000 IOPS at 8ms, could be quite expensive, but commissioning for the 80th percentile, 100 × 80 = 8,000 IOPS, might be more practical.

Note we are not saying that the maximum IOPS we would ask for is 8,000, just that we want 8,000 IOPS at an average 8ms latency. Hopefully the storage will be able to spike above 8,000 IOPS.

It's unlikely that all 100 servers will need their "max" at the same time, so we can save some costs with shared storage.

 

Thanks. Do post any follow-up questions ;-)
