SQL Centre of Excellence

We do a lot of code reviews and the use of SELECT DISTINCT always comes up as its picked up by many of the static code analysis tools as a known scalability and performance risk. The common question is “Why should I avoid distinct” and “How can I avoid it”

During a performance review we either Red Flag DISTINCT indicating that it is a severe performance issue, or orange flag it, meaning that its ok to keep it in the code, but consider avoiding this in future developments as its only slightly impacting performance and there may be bigger fish to fry performance wise.

To be honest, my main concern with select distinct is not related to performance it is more to do with the mind set behind distinct. A developer may have written a query seen that he has duplicate rows and rather than revisit the joins to see if he has joined to many tables, or if he can change a join criteria, he simply slaps a distinct on the query. So, a distinct on a high percentage of queries is a sign of “hiding” problems in query design.

 

Lets look at a few use cases. The terrible (red flag), The not optimal (orange flag) and the passable

1) Terrible. Using DISTINCT with expensive scalar functions or UDF’s

select distinct FirstName , reverse(Firstname) as MyFunction
from dbo.DimCustomer

image

In this case we are just using the replicate function, but often a customer may have far more expensive UDF functions (which by themselves are not so good for performance). The key point is that with DISTINCT, the scalar function has to be evaluated BEFORE the aggregate, so will be over the entire table, not just the few rows selected. On one customer engagement that was 100 million iterations when only seven rows were returned.

An alternative version of the same query is below. We can see that the Scalar function is now performed AFTER the aggregate. What difference does this make ? well it depends on the number of unique rows versus the number of rows in the table, and how expensive the scalar function is. So we cannot say that this is always a key issue, but its worth reviewing.

select FirstName , reverse(Firstname)
from
( 
select distinct FirstName 
from dbo.DimCustomer
) a

image

 

During a performance tuning engagement we will only red flag this as critical if the workload analysis puts the query as one of the top five statements by CPU, otherwise we will mention it, but try and aim for bigger fish.

Note that I mentioned UDF’s are bad. This is well documented elsewhere, especially by Simon Sabin. If there was a holy war against UDF’s Simon would be leading the charge!

http://sqlblogcasts.com/blogs/simons/archive/2008/11/03/TSQL-Scalar-functions-are-evil-.aspx

http://sqlblogcasts.com/blogs/simons/archive/2010/01/14/User-defined-function-performance-is-awful--Micorosft-please-sort-this-out.aspx

http://sqlblogcasts.com/blogs/simons/archive/2005/01/17/User-Defined-Function-performance-comparison.aspx

http://connect.microsoft.com/SQLServer/feedback/details/524983/user-defined-function-performance-is-unacceptable

 

2) Bad. Using DISTINCT with One to Many Joins

select distinct EnglishPromotionName
from dbo.DimPromotion p
inner join dbo.FactInternetSales s on s.PromotionKey=p.PromotionKey

image

In this case, we need to perform a join and a Aggregate and a sort. A much more efficient query is written below using EXISTS

select p.EnglishPromotionName
from dbo.DimPromotion p
where exists (select * from dbo.FactInternetSales s where s.PromotionKey=p.PromotionKey) 
image

Notice as because we use EXISTS we do not need the aggregate or the sort in the query plan, the exists will just return the first occurrence, which is much more efficient.

3 Passable. Using DISTINCT on a single table or with 1-1 Joins

select distinct FirstName 
from dbo.DimCustomer

image

 

In this case its not really possible to get a better query plan, is DISTINCT is “ok”. However we would often write this as an aggregate query with GROUP BY

select FirstName 
from dbo.DimCustomer
group by FirstName

image

 

Why would we use the group by ? Well, clearly not for performance as the plans are the same, the reason is simply that we know that static code analysis tools will pick up distinct as a potential bad practice and we don’t want to have to continually justify its use. The group by will squeeze pass code reviews without raising an alert. Even though its technically the same plan anyway Winking smile

 

4 The Unforgivable. Using DISTINCT to  Hide Gratuitous Joins

select distinct Firstname
from dbo.DimCustomer c
INNER join dbo.FactInternetSales s on c.CustomerKey=s.CustomerKey

Assuming that every customer had at least one sale (eg that's why they are customers!), the inner join is not needed so we can simply write

select distinct Firstname
from dbo.DimCustomer c

5) The looks bad but is actually ok one

select distinct Firstname
from dbo.DimCustomer c
LEFT join dbo.FactInternetSales s on c.CustomerKey=s.CustomerKey

In this case the left join is worthless as the we are only doing a distinct on FirstName, so you would think this will result in a horrible query plan. However, god bless the Query Processor Team, they remove redundancy joins from the query so the actual plan for this is as below:

image

 

Anyone have any more base/ good use cases for distinct I would love to hear about or add them.

 

Thanks,

Bob

blog comments powered by Disqus