SQL Centre of Excellence

How to Generate a Hash in SSIS

by Bob Duffy 30. March 2011 16:33

It's quite common to have a requirement to generate a SHA1 or MD5 hash in SSIS to help with comparing non-key columns for “delta” management.

One question that comes up is “What’s the fastest way to generate a hash?” I just did a little test to compare two methods:

Method a) SQL Server's HashBytes('SHA1', Column1 + coalesce(Column2, '')) function

Method b) In SSIS, using the MultipleHash component: http://ssismhash.codeplex.com/

The target data set was a 10 million row table based on the AdventureWorks Sales.SalesOrderHeader table, scaled up, running on my 64-bit laptop with an 8-core i7 CPU.

Method                                       | Time (Secs) | Avg %Processor Time | DTEXEC Working Set (MB)
Method A - SQL HashBytes function on source  | 104         | 50                  | 47
Method B - SSIS MultipleHash Component       | 644         | 107                 | 156

Conclusions

I'm not sure if anyone else has got better performance from generating the hash within SSIS as opposed to SQL Server, but my conclusions so far are:

  • The SQL Server HashBytes function is a lot faster than doing the hash calculation in the Data Flow (roughly 6x in my test).
  • There are other factors in choosing which method is appropriate:
    • The SSIS custom component is easier to use (point and click).
    • The SSIS custom component allows you to hash columns which are “introduced” in the pipeline.
    • The SSIS component requires shipping an extra DLL (bad).
    • The HashBytes function does not support nulls, so it needs a lot of casting to strings and checking for nulls (see the sketch after this list).
    • A checksum function is faster again, but carries more risk of not detecting a change.
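
As a rough illustration of the casting and null handling that HashBytes needs, here is a minimal sketch; the table and column names are purely illustrative, not from the test above:

-- Illustrative only: assumes a table dbo.MyTable with nullable columns of mixed types
SELECT HashBytes('SHA1',
           COALESCE(CAST(Column1 AS NVARCHAR(100)), N'')
         + COALESCE(CAST(Column2 AS NVARCHAR(100)), N'')
         + COALESCE(CAST(Column3 AS NVARCHAR(100)), N'')) AS RowHash
FROM dbo.MyTable;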

Turn that Screen Saver OFF

by Bob Duffy 26. March 2011 07:22

Just massaging the content for the SSIS Masterclass on Thursday (http://www.prodata.ie/Events/2011/SSIS-Deep-Dive.aspx). As usual I am actually playing with a few other things in the background on various virtualised environments.

I just realised what an annoyance screen savers are on virtualised environments. I mean, you come back after a cup of tea and you have your desktop screen saver to log back into, plus five other screens. Even worse on a customer site this week where I was logged into remote desktop on production, test and desktop domains, each with different passwords and some with a fob to co-authenticate with. Arghhhh, I can’t count the number of times a cup of tea has cost me an account lockout on that site.

Anyways, I turned the screen saver off on all my VMs and I’m sure I’ve saved 10 minutes a day ;-)

Ok, maybe some customers have security needs for screen savers, but for us I have decided on a new policy of screen savers ONLY on devices with screens… as a bonus, there's one less background task stealing my CPU cycles too.

Quick Notes on TempDB Sizing

by Bob Duffy 26. March 2011 05:24

Here are some common questions I am asked about TempDB:

Question 1) How big should my TempDB be?

On an existing system this is quite a simple question to answer. By default TempDB starts at 8MB and grows by 10%, so we just look at the size it has grown to and now we know how big it needs to be ;-)

With a “suck it and see” approach, the only consideration is to ensure that file growth is not left at the default values, so we can limit “growth” events. Consider growing in 50MB or 100MB+ increments, depending on the size of the databases.

Once we know how big it can grow to, it is really important to go back to the database size settings and correctly size the TempDB database. Bear in mind that when the instance is restarted, TempDB will be recreated at the original size specified. This can lead to “Monday blues” following a reboot on a Sunday. Warning signs for consultants are customer statements like “the system is slow every Monday morning”.
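
As a minimal sketch, pre-sizing the primary TempDB data file and replacing the default 10% growth with a fixed increment looks something like this (the sizes are placeholders to adjust for your system; "tempdev" is the default logical file name):

-- Pre-size the primary TempDB data file and use a fixed growth increment
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 4096MB, FILEGROWTH = 100MB);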

What if you don’t have an existing workload, e.g. you are doing capacity planning for a new solution that has not been tested yet? Well, here are a few underpinning questions to ask:

1) How big is the largest table, index or partition that will be rebuilt? If this is sorted in TempDB, then this is a minimum size, plus a bit more.

2) Do we make use of any features that use a lot of TempDB, such as snapshot isolation or lots of large sorts? http://msdn.microsoft.com/en-us/library/ms345368.aspx

3) If we have the database with sample data, we can run DBCC CHECKDB WITH ESTIMATEONLY to estimate the amount of space needed for consistency checking.
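
For example, a quick sketch of point 3 (the database name is just a placeholder):

-- Estimates the TempDB space DBCC CHECKDB would need, without running the full check
DBCC CHECKDB (AdventureWorks) WITH ESTIMATEONLY;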

Question 2) How many files do I need for TempDB?

This is always a contentious question, even amongst the SQL community. Microsoft do recommend one file per core http://msdn.microsoft.com/en-us/library/ms175527.aspx 

On more than eight cores, I don’t usually go that far in my recommendations, as this can be harder to manage and it's harder to make sure that growth in all files is even. On most systems I see, 2-4 files is more than enough to avoid concurrency issues.

Paul Randal also recommends avoiding that 1-1 ratio for most systems: http://www.sqlskills.com/BLOGS/PAUL/post/A-SQL-Server-DBA-myth-a-day-%281230%29-tempdb-should-always-have-one-data-file-per-processor-core.aspx 

However, there is the expression “No one ever got fired for having too many files for TempDB”. Too many files is likely to hurt less than not enough, but the point is it can hurt…
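
If you do decide to add a data file, it is one statement per file along these lines (the path, name and sizes are placeholders; keep all TempDB data files the same size so they fill evenly):

-- Add a second TempDB data file, sized and grown the same as the first
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdev2.ndf', SIZE = 4096MB, FILEGROWTH = 100MB);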

Question 3) Why is my TempDB so huge?!

Well there are three likely suspects which we need to eliminate from enquiries:

a) User objects: developers specifically creating temporary objects

b) Internal objects: hashing and sorting for monster queries or queries with bad plans and lack of indexes.

c) Version store: for snapshot isolation and triggers maintaining multiple versions of rows in TempDB

The DMV sys.dm_db_file_space_usage will show which one is the culprit. Here's a sample query that shows the total space (in KB) and how it breaks down by category:

SELECT
    user_object_perc     = CONVERT(DECIMAL(6,3), u*100.0/(u+i+v+f)),
    internal_object_perc = CONVERT(DECIMAL(6,3), i*100.0/(u+i+v+f)),
    version_store_perc   = CONVERT(DECIMAL(6,3), v*100.0/(u+i+v+f)),
    free_space_perc      = CONVERT(DECIMAL(6,3), f*100.0/(u+i+v+f)),
    [total_kb]           = (u+i+v+f)
FROM (
    -- Reserved page counts per category, converted from 8KB pages to KB
    SELECT
        u = SUM(user_object_reserved_page_count)*8,
        i = SUM(internal_object_reserved_page_count)*8,
        v = SUM(version_store_reserved_page_count)*8,
        f = SUM(unallocated_extent_page_count)*8
    FROM sys.dm_db_file_space_usage
) x;

Storage area networks are getting larger and more sophisticated. One of the problems I face as a SQL consultant is running sqlio on them.

One common stumbling block is that the cache on some storage area networks can be huge (128GB plus), especially with the use of solid state devices as a tier “zero” cache accelerator. If our test file is too small we simply get the “from cache” speed of the SAN, and while this is interesting it's not indicative of performance when we throw a couple of TB of SQL Server data onto it. The fact is that read cache on a SAN is often a waste of money for SQL Server anyway, as the cache hit ratio on the SAN will be incredibly low: SQL Server already has its own cache in the buffer pool, so it's not going to be reading data from the SAN's cache any time soon, unless the cache on the SAN is larger than the SQL Server buffer pool (possible but unlikely).

So what do we do when the SAN has a 64GB cache? Well, the guideline we use is that the test file should be about four times the size of the cache, so you're going to need a 256GB test file.

What if you don’t want to wait for a 256GB test file to be created as you only have limited test time?

Here's a neat trick that Joe Sack from the SQL MCM program showed me for using SQL Server's “Instant File Initialisation” feature to massively speed up creation of test files.

1. Ensure that SQL DB Engine is installed and that Instant File Initialisation is configured in the local Security Policy Editor.

2. Create a database MDF file of the correct size (say 256GB) and detach the database (see the T-SQL sketch after these steps)

3. Modify the parameter file that sqlio uses to specify that you have pre-sized the file (see below)

4. Away you go.
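
A rough T-SQL sketch of step 2, assuming placeholder names, paths and sizes:

-- Create a database with a 256GB data file. With Instant File Initialisation the data
-- file is not zero-initialised, so this is near-instant (the log file still is, so keep
-- it small). Then detach the database so sqlio can use the MDF as its test file.
CREATE DATABASE SqlioTest
ON (NAME = SqlioTest_data, FILENAME = 'T:\sqlio\SqlioTest.mdf', SIZE = 256GB)
LOG ON (NAME = SqlioTest_log, FILENAME = 'T:\sqlio\SqlioTest_log.ldf', SIZE = 1024MB);
GO
EXEC sp_detach_db 'SqlioTest';
GO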

Modifying the sqlio parameter file:

Usually the parameter file for sqlio is called something like “param.txt” and looks like this:

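Something along these lines, where the single line lists the test file path, the number of threads, a thread mask and the file size in MB (the values here are placeholders, not the original screenshot):

c:\sqlio\testfile.dat 8 0x0 102400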

You can simply change the filename to be the MDF file and then delete the last parameter so that the size is not specified.

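After the change it points at the pre-sized MDF, with no size parameter on the end (again, the path is a placeholder):

T:\sqlio\SqlioTest.mdf 8 0x0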

Using this technique I recently had a 1TB test file generated in under three seconds. Nice ;-)
