How to Keep SQL From Selecting Duplicates

The go-to solution for removing duplicate rows from your result sets is to include the distinct keyword in your SELECT statement. It tells the query engine to remove duplicates to produce a result set in which every row is unique. Did you know that the group by clause can also be used to remove duplicates? If not, read on to find out what the main differences are between them and which to use to produce a desired outcome.

The Distinct and Distinctrow Keywords

The distinct keyword comes directly after the SELECT in the query statement and replaces the optional all keyword, which is the default. Distinctrow is an alias for distinct and produces the exact same results:

SELECT [ALL | DISTINCT | DISTINCTROW]
    select_expr
    [FROM table_references
    [WHERE where_condition]]

To illustrate how it works, let's select some data from the following table, which contains a list of fruits and their colors:

name	color
apple	red
apple	green
apple	yellow
banana	yellow
banana	green
grape	red
grape	white

The following query will retrieve all the fruit names from the table and list them in alphabetical order:

SELECT name FROM fruits ORDER BY name;

Without the color information, we have multiples of each fruit type:

name
apple
apple
apple
banana
banana
grape
grape

Now let's try the query again with the distinct keyword:

SELECT DISTINCT name FROM fruits;

As expected, we now have only one instance of each fruit type:

name
apple
banana
grape

If only it were always that easy! A quick Internet search on the phrase "sql eliminating duplicates" shows that there's more to removing duplicate values than inserting the distinct keyword into your SELECT statements.
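The fruit example can be reproduced end to end with a small script. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for the article's MySQL server; the variable names are illustrative and not part of the original article.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruits (name TEXT, color TEXT)")
conn.executemany(
    "INSERT INTO fruits VALUES (?, ?)",
    [
        ("apple", "red"), ("apple", "green"), ("apple", "yellow"),
        ("banana", "yellow"), ("banana", "green"),
        ("grape", "red"), ("grape", "white"),
    ],
)

# Without DISTINCT (ALL is the default): one result row per table row.
all_names = [row[0] for row in conn.execute(
    "SELECT name FROM fruits ORDER BY name")]

# With DISTINCT: repeated names collapse into a single row each.
distinct_names = [row[0] for row in conn.execute(
    "SELECT DISTINCT name FROM fruits ORDER BY name")]

print(all_names)
print(distinct_names)
```

The first list repeats each fruit once per color, while the second contains each fruit name exactly once.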

When Are Duplicate Rows Not Duplicate Rows?

One problem that the distinct keyword does nothing to solve is that sometimes removing duplicates creates misleading results. Consider the following scenario:

The client wants a list of their employees in order to generate some statistics. Here's some SQL to do that:

SELECT name,
       gender,
       salary
FROM employees
ORDER BY name;

Strangely, this produces duplicate rows for "Kirsten Ruegg":

name	gender	salary
Allan Smithie	m	4900
Barbara Breitenmoser	f	(NULL)
Jon Simpson	m	4500
Kirsten Ruegg	f	5600
Kirsten Ruegg	f	5600
Peter Jonson	m	5200
Ralph Teller	m	5100

The client responds that they don't want duplicates, so the developer adds the trusty distinct keyword to the SELECT statement. This produces the desired results, except for one small detail: There are two employees with the same name! Adding the distinct keyword created wrong results by removing a valid row. Adding the unique emp_id_number to the field list confirms that there are indeed two Kirsten Rueggs:

SELECT name,
       gender,
       salary,
       emp_id_number
FROM employees
ORDER BY name;

Here's the data in question showing the unique emp_id_numbers:

name	gender	salary	emp_id_number
Kirsten Ruegg	f	5600	3462
Kirsten Ruegg	f	5600	2223

The moral of the story is this: When using the distinct keyword, be certain that you aren't inadvertently removing valid data!
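One way to catch this situation before it causes trouble is to compare the row count with and without a unique column in the field list. The sketch below assumes Python's built-in sqlite3 module in place of MySQL and uses only three of the article's employee rows; the distinct_drops_rows check is a hypothetical helper, not something from the original article.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(name TEXT, gender TEXT, salary INTEGER, emp_id_number INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Kirsten Ruegg", "f", 5600, 3462),
        ("Kirsten Ruegg", "f", 5600, 2223),
        ("Peter Jonson", "m", 5200, 9747),
    ],
)

total_rows = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

# DISTINCT over non-unique columns merges the two Kirsten Rueggs.
without_id = conn.execute(
    "SELECT DISTINCT name, gender, salary FROM employees ORDER BY name"
).fetchall()

# Including the unique emp_id_number keeps both employees.
with_id = conn.execute(
    "SELECT DISTINCT name, gender, salary, emp_id_number "
    "FROM employees ORDER BY name, emp_id_number"
).fetchall()

# Hypothetical safety check: did DISTINCT silently drop real rows?
distinct_drops_rows = len(without_id) < total_rows

print(len(without_id), len(with_id), distinct_drops_rows)
```

If the check reports that rows were dropped, it is worth confirming that the merged rows really describe the same entity before relying on distinct.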

Comparing Distinct to Group By

Using distinct is logically equivalent to using group by on all selected columns with no aggregate function. For such a query, group by just produces a list of distinct grouping values. When displaying and grouping by a single column, the query produces the distinct values in that column. However, if you display and group by multiple columns, the query produces the distinct combinations of values across those columns. For instance, the following query produces the same set of rows as our first SELECT distinct did:

SELECT name
FROM fruits
GROUP BY name;

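Similarly, the following statement produces the same results as our SELECT distinct did on the employees table. Note that every selected column has to appear in the group by list for the two to be equivalent:

SELECT name,
       gender,
       salary
FROM employees
GROUP BY name, gender, salary;

A difference between distinct and group by is that, in MySQL versions before 8.0, group by also sorts the rows by the grouping columns. Hence, on those versions:

SELECT name,
       gender,
       salary
FROM employees
GROUP BY name, gender, salary;

…is the same as:

SELECT DISTINCT name,
       gender,
       salary
FROM employees
ORDER BY name, gender, salary;

MySQL 8.0 removed the implicit sort, so add an explicit ORDER BY whenever the row order matters.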
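The equivalence between distinct and group by on all selected columns can be checked directly. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for MySQL and the article's employee data; an explicit ORDER BY is used in both queries so that the comparison does not depend on any implicit sorting.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, gender TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Jon Simpson", "m", 4500),
        ("Barbara Breitenmoser", "f", None),
        ("Kirsten Ruegg", "f", 5600),
        ("Ralph Teller", "m", 5100),
        ("Peter Jonson", "m", 5200),
        ("Allan Smithie", "m", 4900),
        ("Kirsten Ruegg", "f", 5600),
        ("Kirsten Ruegg", "f", 4400),
    ],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT name, gender, salary FROM employees "
    "ORDER BY name, salary"
).fetchall()

# Grouping by every selected column yields the same rows as DISTINCT.
grouped_rows = conn.execute(
    "SELECT name, gender, salary FROM employees "
    "GROUP BY name, gender, salary ORDER BY name, salary"
).fetchall()

# Grouping by a single column collapses more rows than DISTINCT on three.
names_only = conn.execute(
    "SELECT name FROM employees GROUP BY name ORDER BY name"
).fetchall()

print(distinct_rows == grouped_rows, len(distinct_rows), len(names_only))
```

The two full-column queries return identical rows, while grouping by name alone returns fewer rows because the Kirsten Ruegg entries with different salaries are merged.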

Counting Duplicates

Distinct can be used with the COUNT() function to count how many distinct values a column contains. COUNT(DISTINCT expression) counts the number of distinct (unique) non-NULL values of the given expression. The expression can be a column name to count the number of distinct non-NULL values in the column.

Here's the full employee table data:

id	dept_id	gender	name	salary	emp_id_number
1	2	m	Jon Simpson	4500	1234
2	4	f	Barbara Breitenmoser	(NULL)	9999
3	3	f	Kirsten Ruegg	5600	3462
4	1	m	Ralph Teller	5100	6543
5	2	m	Peter Jonson	5200	9747
6	2	m	Allan Smithie	4900	6853
7	4	f	Kirsten Ruegg	5600	2223
8	3	f	Kirsten Ruegg	4400	2765

Applying COUNT(DISTINCT ...) to the name field produces six unique names:

SELECT COUNT(DISTINCT name) FROM employees;

It's also possible to give a list of expressions separated by commas. In this case, COUNT() returns the number of distinct combinations of values that contain no NULL values. The following query counts the number of distinct rows for which neither the name nor salary is NULL:

SELECT COUNT(DISTINCT name, salary) FROM employees;

COUNT(DISTINCT name, salary)
6

You can also count duplicates per group using a bit of math in conjunction with the group by clause. Here's a query to count duplicated names for each department:

SELECT dept_id,
       COUNT(*) - COUNT(DISTINCT name) AS 'duplicate names'
FROM employees
GROUP BY dept_id;

dept_id	duplicate names
1	0
2	0
3	1
4	0
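The counting queries above can be tried out with a short script. This is a sketch under stated assumptions: it uses Python's built-in sqlite3 module instead of MySQL, and because SQLite does not accept several expressions inside COUNT(DISTINCT ...), the multi-column count is emulated with a subquery that filters out NULL values first. The alias duplicate_names is written without spaces for portability.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (dept_id INTEGER, name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (2, "Jon Simpson", 4500),
        (4, "Barbara Breitenmoser", None),
        (3, "Kirsten Ruegg", 5600),
        (1, "Ralph Teller", 5100),
        (2, "Peter Jonson", 5200),
        (2, "Allan Smithie", 4900),
        (4, "Kirsten Ruegg", 5600),
        (3, "Kirsten Ruegg", 4400),
    ],
)

# Number of distinct non-NULL names.
distinct_names = conn.execute(
    "SELECT COUNT(DISTINCT name) FROM employees"
).fetchone()[0]

# Emulation of MySQL's COUNT(DISTINCT name, salary):
# distinct combinations in which neither value is NULL.
distinct_pairs = conn.execute(
    "SELECT COUNT(*) FROM ("
    "  SELECT DISTINCT name, salary FROM employees"
    "  WHERE name IS NOT NULL AND salary IS NOT NULL"
    ")"
).fetchone()[0]

# Duplicated names per department: total rows minus distinct names.
duplicates_per_dept = conn.execute(
    "SELECT dept_id, COUNT(*) - COUNT(DISTINCT name) AS duplicate_names "
    "FROM employees GROUP BY dept_id ORDER BY dept_id"
).fetchall()

print(distinct_names, distinct_pairs, duplicates_per_dept)
```

Only department 3 reports a duplicate, because it is the only department in which the same name appears on more than one row.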

These queries help you characterize the extent of duplicates, but don't show you which values are duplicated. To see which names are duplicated in the employees table, use a summary query that displays the non-unique values along with the counts:

SELECT dept_id,
       name,
       COUNT(name) AS name_count
FROM employees
GROUP BY name,
         dept_id;

dept_id	name	name_count
2	Allan Smithie	1
4	Barbara Breitenmoser	1
2	Jon Simpson	1
3	Kirsten Ruegg	2
4	Kirsten Ruegg	1
2	Peter Jonson	1
1	Ralph Teller	1

Since we're only interested in duplicates, we can filter out everything else using the HAVING clause. It's like a WHERE clause, except that it's used with group by to narrow down the results:

SELECT dept_id,
       name,
       COUNT(name) AS name_count
FROM employees
GROUP BY name,
         dept_id
HAVING name_count > 1;

Now we can see which names are duplicated, as well as how many there are:

dept_id	name	name_count
3	Kirsten Ruegg	2
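The HAVING filter can be exercised the same way. This is a minimal sketch, assuming Python's built-in sqlite3 module in place of MySQL; it repeats the aggregate in the HAVING clause rather than referring to the name_count alias, since the aggregate form is accepted by more database engines.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (dept_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        (2, "Jon Simpson"),
        (4, "Barbara Breitenmoser"),
        (3, "Kirsten Ruegg"),
        (1, "Ralph Teller"),
        (2, "Peter Jonson"),
        (2, "Allan Smithie"),
        (4, "Kirsten Ruegg"),
        (3, "Kirsten Ruegg"),
    ],
)

# Keep only the (name, dept_id) groups that occur more than once.
duplicated = conn.execute(
    "SELECT dept_id, name, COUNT(name) AS name_count "
    "FROM employees "
    "GROUP BY name, dept_id "
    "HAVING COUNT(name) > 1"
).fetchall()

print(duplicated)
```

The Kirsten Ruegg in department 4 is not reported, because the grouping is per name and department and she appears only once there.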

Displaying Per-Group Minimum or Maximum Values in Duplicated Rows

As we saw in the last example, the group by clause causes aggregate functions to be applied for each unique value in the field list. You should be aware that columns that are not in the group by field list do not necessarily belong to the same row as the aggregated values! An example is definitely in order here. The following query displays the highest salary for each department:

SELECT dept_id,
       name,
       gender,
       MAX(salary) AS max_salary
FROM employees
GROUP BY dept_id;

The intention is to also display information about the individual who earns the highest salary. However, that is not what is returned here:

dept_id	name	gender	max_salary
1	Ralph Teller	m	5100
2	Jon Simpson	m	5200
3	Kirsten Ruegg	f	5600
4	Barbara Breitenmoser	f	5600

The problem is that salary is the only aggregated field, because the MAX() aggregate function is applied to it. The name and gender columns are neither aggregated nor part of the group by list, so the server is free to return their values from any row in each group, typically the first one it encounters. Looking at the table, you'll see that, while Ralph Teller is the only member of department 1, Jon Simpson only earned $4500. Peter Jonson is really the owner of that distinction, but the query engine selected the first name and gender that it came across having a dept_id of 2. Note that MySQL only accepts such a query when the ONLY_FULL_GROUP_BY SQL mode is disabled; with that mode enabled, which is the default as of MySQL 5.7.5, the query is rejected with an error.

The solution is to join the group by results with the original table on the grouped field, dept_id, together with the aggregated value, max_salary. Matching on the salary alone would also pair a department with any employee in another department who happens to earn the same amount:

SELECT emp2.dept_id,
       emp1.name,
       emp1.gender,
       emp2.max_salary
FROM (
    SELECT dept_id,
           MAX(salary) AS max_salary
    FROM employees
    GROUP BY dept_id
) AS emp2
JOIN employees AS emp1
  ON emp1.dept_id = emp2.dept_id
 AND emp1.salary = emp2.max_salary
ORDER BY emp2.dept_id;

Now the name and gender fields belong to the earner of the greatest salary:

dept_id	name	gender	max_salary
1	Ralph Teller	m	5100
2	Peter Jonson	m	5200
3	Kirsten Ruegg	f	5600
4	Kirsten Ruegg	f	5600
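The per-department maximum can be verified with a small script. This is a minimal sketch of the join approach, assuming Python's built-in sqlite3 module as a stand-in for MySQL; it matches on both dept_id and salary so that a maximum from one department cannot pair with an employee from another.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(dept_id INTEGER, gender TEXT, name TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (2, "m", "Jon Simpson", 4500),
        (4, "f", "Barbara Breitenmoser", None),
        (3, "f", "Kirsten Ruegg", 5600),
        (1, "m", "Ralph Teller", 5100),
        (2, "m", "Peter Jonson", 5200),
        (2, "m", "Allan Smithie", 4900),
        (4, "f", "Kirsten Ruegg", 5600),
        (3, "f", "Kirsten Ruegg", 4400),
    ],
)

# Step 1 (subquery): highest salary per department.
# Step 2 (join): fetch the employee row that earns that salary
# in that same department.
top_earners = conn.execute(
    "SELECT emp2.dept_id, emp1.name, emp1.gender, emp2.max_salary "
    "FROM ("
    "  SELECT dept_id, MAX(salary) AS max_salary"
    "  FROM employees GROUP BY dept_id"
    ") AS emp2 "
    "JOIN employees AS emp1 "
    "  ON emp1.dept_id = emp2.dept_id "
    " AND emp1.salary = emp2.max_salary "
    "ORDER BY emp2.dept_id"
).fetchall()

for row in top_earners:
    print(row)
```

If two employees in the same department tie for the highest salary, this join returns both of them, which is usually the honest answer for a "who earns the most" report.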

There are other techniques that were not covered, such as the use of temporary tables and dynamic SQL. Here is more in-depth information on removing duplicate records. This article discusses the group by and HAVING clauses in more detail.


Robert Gravelle

Rob Gravelle resides in Ottawa, Canada, and has been an IT guru for over 20 years. In that time, Rob has built systems for intelligence-related organizations such as Canada Border Services and various commercial businesses. In his spare time, Rob has become an accomplished music artist with several CDs and digital releases to his credit.