SQL Centre of Excellence

I haven’t been following what is now called Data Explorer very well, but just seen a great demo


Looks very interesting. Maybe this will be the future of ETL tools, or maybe this will become part of PowerPivot to improve the ETL capabilities which at the moment are pretty limited. for the moment is it effectively a cloud based ETL or data exploration tool.

According to the site it is due for release in November, 2011


We do a lot of code reviews and the use of SELECT DISTINCT always comes up as its picked up by many of the static code analysis tools as a known scalability and performance risk. The common question is “Why should I avoid distinct” and “How can I avoid it”

During a performance review we either Red Flag DISTINCT indicating that it is a severe performance issue, or orange flag it, meaning that its ok to keep it in the code, but consider avoiding this in future developments as its only slightly impacting performance and there may be bigger fish to fry performance wise.

To be honest, my main concern with select distinct is not related to performance it is more to do with the mind set behind distinct. A developer may have written a query seen that he has duplicate rows and rather than revisit the joins to see if he has joined to many tables, or if he can change a join criteria, he simply slaps a distinct on the query. So, a distinct on a high percentage of queries is a sign of “hiding” problems in query design.


Lets look at a few use cases. The terrible (red flag), The not optimal (orange flag) and the passable

1) Terrible. Using DISTINCT with expensive scalar functions or UDF’s

select distinct FirstName , reverse(Firstname) as MyFunction
from dbo.DimCustomer


In this case we are just using the replicate function, but often a customer may have far more expensive UDF functions (which by themselves are not so good for performance). The key point is that with DISTINCT, the scalar function has to be evaluated BEFORE the aggregate, so will be over the entire table, not just the few rows selected. On one customer engagement that was 100 million iterations when only seven rows were returned.

An alternative version of the same query is below. We can see that the Scalar function is now performed AFTER the aggregate. What difference does this make ? well it depends on the number of unique rows versus the number of rows in the table, and how expensive the scalar function is. So we cannot say that this is always a key issue, but its worth reviewing.

select FirstName , reverse(Firstname)
select distinct FirstName 
from dbo.DimCustomer
) a



During a performance tuning engagement we will only red flag this as critical if the workload analysis puts the query as one of the top five statements by CPU, otherwise we will mention it, but try and aim for bigger fish.

Note that I mentioned UDF’s are bad. This is well documented elsewhere, especially by Simon Sabin. If there was a holy war against UDF’s Simon would be leading the charge!






2) Bad. Using DISTINCT with One to Many Joins

select distinct EnglishPromotionName
from dbo.DimPromotion p
inner join dbo.FactInternetSales s on s.PromotionKey=p.PromotionKey


In this case, we need to perform a join and a Aggregate and a sort. A much more efficient query is written below using EXISTS

select p.EnglishPromotionName
from dbo.DimPromotion p
where exists (select * from dbo.FactInternetSales s where s.PromotionKey=p.PromotionKey) 

Notice as because we use EXISTS we do not need the aggregate or the sort in the query plan, the exists will just return the first occurrence, which is much more efficient.

3 Passable. Using DISTINCT on a single table or with 1-1 Joins

select distinct FirstName 
from dbo.DimCustomer



In this case its not really possible to get a better query plan, is DISTINCT is “ok”. However we would often write this as an aggregate query with GROUP BY

select FirstName 
from dbo.DimCustomer
group by FirstName



Why would we use the group by ? Well, clearly not for performance as the plans are the same, the reason is simply that we know that static code analysis tools will pick up distinct as a potential bad practice and we don’t want to have to continually justify its use. The group by will squeeze pass code reviews without raising an alert. Even though its technically the same plan anyway Winking smile


4 The Unforgivable. Using DISTINCT to  Hide Gratuitous Joins

select distinct Firstname
from dbo.DimCustomer c
INNER join dbo.FactInternetSales s on c.CustomerKey=s.CustomerKey

Assuming that every customer had at least one sale (eg that's why they are customers!), the inner join is not needed so we can simply write

select distinct Firstname
from dbo.DimCustomer c

5) The looks bad but is actually ok one

select distinct Firstname
from dbo.DimCustomer c
LEFT join dbo.FactInternetSales s on c.CustomerKey=s.CustomerKey

In this case the left join is worthless as the we are only doing a distinct on FirstName, so you would think this will result in a horrible query plan. However, god bless the Query Processor Team, they remove redundancy joins from the query so the actual plan for this is as below:



Anyone have any more base/ good use cases for distinct I would love to hear about or add them.




2012 is set to be SUPER cool year for SQL Server in Ireland. It was just announced at sqlpass that SQL Saturday 105 will be in DUBLIN, Ireland on March 24th, 2012.


We were hoping to have it on Paddy’s day, but town is mental then, so maybe not such a good idea. Anyone coming over, do be sure to stay over on the Saturday night as there will definitely be some Irish Style “craic” to be had.

Session submission is open but a date for closing submissions and when speaker announcements will be announced soon.

Really looking forward to the event Winking smile

The SQL 2010 Roadshow is announced as per the Microsoft Ireland blog site here


We are going to be spending nearly a week touring Ireland demoing the most important new features in SQL 2012 that will definitely affect anyone working with high availability of Business Intelligence.

Content wise we have some great demos covering Availability Groups where we will show an entire end to end data centre migration from a single node on 2005 to a four node always on configuration.

We will also be demoing great new Business Intelligence features such as :

- PowerPivot 2.0
- Power View 1.0 (aka Project Crescent)
- The BI Semantic Model and the Tabular Data Model

Dates, locations and venues are here:

Dublin, Monday 7th November 2011. Learn more and register here.

Belfast, Tuesday 8th November 2011. Learn more and register here.

Cork, Wednesday 9th November 2011. Learn more and register here.

Galway, Thursday 10th November 2011. Learn more and register here.

Well be hanging around at the events if you do have any SQL Server 2012 questions.

I just watched one of the project Juneau web casts. While there are lots of great features and I do see the long term value, I nearly fell of my chair looking at the “sample” stored Proc shown live on stage at Tech Ed 2011.


It’s a cursor looping through all the customer records assigning a segment to them and generating one update statement per row. While the ability to debug the cursor in the proc and view/set variables at run time is cool, surely we can use better examples that don’t make sql dudes cringe.

if you can avoid cringing at some the t-sql shown, this webcast is a good walk though of the database centric features available in Visual Studio 2010 and project Juneau.


Page List

Page List