How To Keep SQL From Selecting Duplicates
The go-to solution for removing duplicate rows from your result sets is to include the DISTINCT keyword in your SELECT statement. It tells the query engine to remove duplicates to produce a result set in which every row is unique. Did you know that the GROUP BY clause can also be used to remove duplicates? If not, read on to find out what the main differences are between them and which to use to produce a desired outcome.
The Distinct and Distinctrow Keywords
The DISTINCT keyword comes directly after the SELECT in the query statement and replaces the optional ALL keyword, which is the default. DISTINCTROW is an alias for DISTINCT and produces the exact same results:
SELECT [ALL | DISTINCT | DISTINCTROW] select_expr [FROM table_references [WHERE where_condition]]
To illustrate how it works, let's select some data from the following table, which contains a list of fruits and their colors:
name | color |
apple | red |
apple | green |
apple | yellow |
banana | yellow |
banana | green |
grape | red |
grape | white |
The following query will retrieve all the fruit names from the table and list them in alphabetical order:
SELECT name FROM fruits ORDER BY name;
Without the color information, we have multiples of each fruit type:
name |
apple |
apple |
apple |
banana |
banana |
grape |
grape |
Now let's try the query again with the DISTINCT keyword:
SELECT DISTINCT name FROM fruits;
As expected, we now have only one instance of each fruit type:
name |
apple |
banana |
grape |
If only it were always that easy! A quick Internet search on the phrase "sql eliminating duplicates" shows that there's more to removing duplicate values than inserting the DISTINCT keyword into your SELECT statements.
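For readers who want to try the fruit example without a MySQL server, here is a minimal sketch that runs the same two queries from Python. It assumes an in-memory SQLite database (Python's standard sqlite3 module) as a stand-in for MySQL; the table and column names follow the article, and the ORDER BY is included so the row order is guaranteed.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruits (name TEXT, color TEXT)")
conn.executemany(
    "INSERT INTO fruits VALUES (?, ?)",
    [
        ("apple", "red"), ("apple", "green"), ("apple", "yellow"),
        ("banana", "yellow"), ("banana", "green"),
        ("grape", "red"), ("grape", "white"),
    ],
)

# Without DISTINCT: one row per fruit/color pair, so names repeat.
all_names = [row[0] for row in conn.execute(
    "SELECT name FROM fruits ORDER BY name")]

# With DISTINCT: each name appears exactly once.
distinct_names = [row[0] for row in conn.execute(
    "SELECT DISTINCT name FROM fruits ORDER BY name")]

print(all_names)       # seven names, with repeats
print(distinct_names)  # ['apple', 'banana', 'grape']
```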
When are Duplicate Rows Not Duplicate Rows
One problem that the DISTINCT keyword does nothing to solve is that sometimes removing duplicates creates misleading results. Consider the following scenario:
The client wants a list of their employees in order to generate some statistics. Here's some SQL to do that:
SELECT name, gender, salary FROM employees ORDER BY name;
Strangely, this produces duplicate rows for "Kirsten Ruegg":
name | gender | salary |
Allan Smithie | m | 4900 |
Barbara Breitenmoser | f | (NULL) |
Jon Simpson | m | 4500 |
Kirsten Ruegg | f | 5600 |
Kirsten Ruegg | f | 5600 |
Peter Jonson | m | 5200 |
Ralph Teller | m | 5100 |
The client responds that they don't want duplicates, so the developer adds the trusty DISTINCT keyword to the SELECT statement. This produces the desired results, except for one small detail: There are two employees with the same name! Adding the DISTINCT keyword created incorrect results by removing a valid row. Including the unique emp_id_number in the field list confirms that there are indeed two Kirsten Rueggs:
SELECT name, gender, salary, emp_id_number FROM employees ORDER BY name;
Here's the data in question showing the unique emp_id_numbers:
name | gender | salary | emp_id_number |
Kirsten Ruegg | f | 5600 | 3462 |
Kirsten Ruegg | f | 5600 | 2223 |
The moral of the story is this: When using the DISTINCT keyword, be certain that you aren't inadvertently removing valid data!
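The pitfall is easy to reproduce. The sketch below is an illustration only: it assumes an in-memory SQLite database in place of MySQL and loads a trimmed, three-row subset of the employees data (the two Kirsten Rueggs plus Peter Jonson). It compares DISTINCT with and without the unique emp_id_number column.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL; only three rows are loaded.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(name TEXT, gender TEXT, salary INTEGER, emp_id_number INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Kirsten Ruegg", "f", 5600, 3462),
        ("Kirsten Ruegg", "f", 5600, 2223),
        ("Peter Jonson", "m", 5200, 9747),
    ],
)

# DISTINCT over name, gender, salary merges the two different employees.
without_id = conn.execute(
    "SELECT DISTINCT name, gender, salary FROM employees ORDER BY name"
).fetchall()

# Adding the unique identifier keeps every real employee in the result.
with_id = conn.execute(
    "SELECT DISTINCT name, gender, salary, emp_id_number "
    "FROM employees ORDER BY name, emp_id_number"
).fetchall()

print(len(without_id))  # 2 rows: one Kirsten Ruegg was silently dropped
print(len(with_id))     # 3 rows: both Kirsten Rueggs survive
```

A quick row-count comparison like this is a cheap way to check whether DISTINCT is hiding valid rows before the results go to a client.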
Comparing DISTINCT to GROUP BY
Using DISTINCT is logically equivalent to using GROUP BY on all selected columns with no aggregate function. For such a query, GROUP BY just produces a list of distinct grouping values. When displaying and grouping by a single column, the query produces the distinct values in that column. However, if you display and group by multiple columns, the query produces the distinct combinations of values across those columns. For instance, the following query produces the same set of rows as our first SELECT DISTINCT did:
SELECT name FROM fruits GROUP BY name;
Similarly, the following statement produces the same results as our SELECT DISTINCT did on the employees table:
SELECT name, gender, salary FROM employees GROUP BY name, gender, salary;
A difference between DISTINCT and GROUP BY is that, in MySQL versions before 8.0, GROUP BY also causes row sorting (MySQL 8.0 removed this implicit sort). Hence, on those versions:
SELECT name, gender, salary FROM employees GROUP BY name, gender, salary;
…is the same as:
SELECT DISTINCT name, gender, salary FROM employees ORDER BY name;
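The equivalence only holds when every selected column appears in the GROUP BY list. As a hedged sketch, the following Python script loads the article's employees data into an in-memory SQLite database (an assumption; SQLite stands in for MySQL here) and compares the two forms, using an explicit ORDER BY on both so the comparison does not depend on any implicit sorting.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, dept_id INTEGER, gender TEXT, "
    "name TEXT, salary INTEGER, emp_id_number INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 2, "m", "Jon Simpson", 4500, 1234),
        (2, 4, "f", "Barbara Breitenmoser", None, 9999),
        (3, 3, "f", "Kirsten Ruegg", 5600, 3462),
        (4, 1, "m", "Ralph Teller", 5100, 6543),
        (5, 2, "m", "Peter Jonson", 5200, 9747),
        (6, 2, "m", "Allan Smithie", 4900, 6853),
        (7, 4, "f", "Kirsten Ruegg", 5600, 2223),
        (8, 3, "f", "Kirsten Ruegg", 4400, 2765),
    ],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT name, gender, salary FROM employees "
    "ORDER BY name, salary"
).fetchall()

# Grouping by every selected column yields the same distinct combinations.
grouped_rows = conn.execute(
    "SELECT name, gender, salary FROM employees "
    "GROUP BY name, gender, salary ORDER BY name, salary"
).fetchall()

# Grouping by name alone collapses all three Kirsten Rueggs into one group.
name_groups = conn.execute(
    "SELECT name FROM employees GROUP BY name ORDER BY name"
).fetchall()

print(distinct_rows == grouped_rows)          # True
print(len(distinct_rows), len(name_groups))   # 7 6
```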
Counting Duplicates
DISTINCT can be used with the COUNT() function to count how many distinct values a column contains. COUNT(DISTINCT expression) counts the number of distinct (unique) non-NULL values of the given expression. The expression can be a column name to count the number of distinct non-NULL values in the column.
Here's the full employee table data:
id | dept_id | gender | name | salary | emp_id_number |
1 | 2 | m | Jon Simpson | 4500 | 1234 |
2 | 4 | f | Barbara Breitenmoser | (NULL) | 9999 |
3 | 3 | f | Kirsten Ruegg | 5600 | 3462 |
4 | 1 | m | Ralph Teller | 5100 | 6543 |
5 | 2 | m | Peter Jonson | 5200 | 9747 |
6 | 2 | m | Allan Smithie | 4900 | 6853 |
7 | 4 | f | Kirsten Ruegg | 5600 | 2223 |
8 | 3 | f | Kirsten Ruegg | 4400 | 2765 |
Applying the COUNT DISTINCT function on the name field produces six unique names:
SELECT Count(DISTINCT name) FROM employees;
It's also possible to give a list of expressions separated by commas. In this case, COUNT() returns the number of distinct combinations of values that contain no NULL values. The following query counts the number of distinct rows for which neither the name nor salary is NULL:
SELECT COUNT(DISTINCT name, salary) FROM employees;
COUNT(DISTINCT name, salary) |
6 |
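The multi-expression form of COUNT(DISTINCT ...) is a MySQL feature. On engines that lack it, one possible equivalent (offered as a sketch, not as the article's method) is to count the rows of a SELECT DISTINCT subquery that filters out NULLs. The script below assumes an in-memory SQLite database as the test bed; the subquery alias pairs is a made-up name.

```python
import sqlite3

# Assumption: SQLite in memory; it does not accept COUNT(DISTINCT a, b).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, dept_id INTEGER, gender TEXT, "
    "name TEXT, salary INTEGER, emp_id_number INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 2, "m", "Jon Simpson", 4500, 1234),
        (2, 4, "f", "Barbara Breitenmoser", None, 9999),
        (3, 3, "f", "Kirsten Ruegg", 5600, 3462),
        (4, 1, "m", "Ralph Teller", 5100, 6543),
        (5, 2, "m", "Peter Jonson", 5200, 9747),
        (6, 2, "m", "Allan Smithie", 4900, 6853),
        (7, 4, "f", "Kirsten Ruegg", 5600, 2223),
        (8, 3, "f", "Kirsten Ruegg", 4400, 2765),
    ],
)

# Single-column form: counts distinct non-NULL names.
distinct_names = conn.execute(
    "SELECT COUNT(DISTINCT name) FROM employees"
).fetchone()[0]

# Portable stand-in for COUNT(DISTINCT name, salary): count the distinct
# (name, salary) pairs in which neither value is NULL.
distinct_pairs = conn.execute(
    "SELECT COUNT(*) FROM ("
    "  SELECT DISTINCT name, salary FROM employees"
    "  WHERE name IS NOT NULL AND salary IS NOT NULL"
    ") AS pairs"
).fetchone()[0]

print(distinct_names)  # 6
print(distinct_pairs)  # 6
```

Both counts happen to be six here for different reasons: Barbara Breitenmoser's NULL salary removes her pair, while the Kirsten Ruegg earning 4400 adds a second pair for that name.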
You can also count duplicates per group using a bit of math in conjunction with the GROUP BY clause. Here's a query to count duplicated names for each department:
SELECT dept_id, COUNT(*) - COUNT(DISTINCT name) AS 'duplicate names' FROM employees GROUP BY dept_id;
dept_id | duplicate names |
1 | 0 |
2 | 0 |
3 | 1 |
4 | 0 |
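The arithmetic is simple: COUNT(*) is the number of rows in the group, COUNT(DISTINCT name) is the number of different names, and the difference is how many rows repeat a name already seen. Here is a small sketch of the same query, assuming an in-memory SQLite database in place of MySQL and using duplicate_names (with an underscore) as the column alias.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, dept_id INTEGER, gender TEXT, "
    "name TEXT, salary INTEGER, emp_id_number INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 2, "m", "Jon Simpson", 4500, 1234),
        (2, 4, "f", "Barbara Breitenmoser", None, 9999),
        (3, 3, "f", "Kirsten Ruegg", 5600, 3462),
        (4, 1, "m", "Ralph Teller", 5100, 6543),
        (5, 2, "m", "Peter Jonson", 5200, 9747),
        (6, 2, "m", "Allan Smithie", 4900, 6853),
        (7, 4, "f", "Kirsten Ruegg", 5600, 2223),
        (8, 3, "f", "Kirsten Ruegg", 4400, 2765),
    ],
)

# Rows in the group minus distinct names = surplus (duplicated) names.
duplicates_per_dept = conn.execute(
    "SELECT dept_id, COUNT(*) - COUNT(DISTINCT name) AS duplicate_names "
    "FROM employees GROUP BY dept_id ORDER BY dept_id"
).fetchall()

print(duplicates_per_dept)  # [(1, 0), (2, 0), (3, 1), (4, 0)]
```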
These queries help you characterize the extent of duplicates, but don't show you which values are duplicated. To see which names are duplicated in the employees table, use a summary query that displays the non-unique values along with the counts:
SELECT dept_id, name, COUNT(name) AS name_count FROM employees GROUP BY name, dept_id;
dept_id | name | name_count |
2 | Allan Smithie | 1 |
4 | Barbara Breitenmoser | 1 |
2 | Jon Simpson | 1 |
3 | Kirsten Ruegg | 2 |
4 | Kirsten Ruegg | 1 |
2 | Peter Jonson | 1 |
1 | Ralph Teller | 1 |
Since we're only interested in duplicates, we can filter out everything else using the HAVING clause. It's like a WHERE clause, except that it's used with GROUP BY to narrow down the results:
SELECT dept_id, name, COUNT(name) AS name_count FROM employees GROUP BY name, dept_id HAVING name_count > 1;
Now we can see which names are duplicated, as well as how many there are:
dept_id | name | name_count |
3 | Kirsten Ruegg | 2 |
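To experiment with the HAVING filter outside MySQL, the sketch below runs the query against an in-memory SQLite database (an assumption made so the example is self-contained). It repeats the aggregate expression in the HAVING clause rather than the alias, which is the more portable spelling across database engines.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, dept_id INTEGER, gender TEXT, "
    "name TEXT, salary INTEGER, emp_id_number INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 2, "m", "Jon Simpson", 4500, 1234),
        (2, 4, "f", "Barbara Breitenmoser", None, 9999),
        (3, 3, "f", "Kirsten Ruegg", 5600, 3462),
        (4, 1, "m", "Ralph Teller", 5100, 6543),
        (5, 2, "m", "Peter Jonson", 5200, 9747),
        (6, 2, "m", "Allan Smithie", 4900, 6853),
        (7, 4, "f", "Kirsten Ruegg", 5600, 2223),
        (8, 3, "f", "Kirsten Ruegg", 4400, 2765),
    ],
)

# HAVING filters groups after aggregation; WHERE could not see the count.
duplicated = conn.execute(
    "SELECT dept_id, name, COUNT(name) AS name_count "
    "FROM employees GROUP BY name, dept_id "
    "HAVING COUNT(name) > 1"
).fetchall()

print(duplicated)  # [(3, 'Kirsten Ruegg', 2)]
```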
Displaying Per-Group Minimum or Maximum Values in Duplicated Rows
As we saw in the last example, the GROUP BY clause causes aggregate functions to be applied for each unique value in the field list. You should be aware that columns that are not in the GROUP BY field list do not necessarily belong to the same row as the aggregated values! An example is definitely in order here. The following query displays the highest salary for each department:
SELECT dept_id, name, gender, MAX(salary) AS max_salary FROM employees GROUP BY dept_id;
The intention is to also display information about the individual who earns the highest salary. However, that is not what is returned here:
dept_id | name | gender | max_salary |
1 | Ralph Teller | m | 5100 |
2 | Jon Simpson | m | 5200 |
3 | Kirsten Ruegg | f | 5600 |
4 | Barbara Breitenmoser | f | 5600 |
The problem is that the salary is the only aggregated field because the MAX() aggregate function is applied to it. Consequently, the first name and gender values encountered for each GROUP BY field are what are displayed. Looking at the table, you'll see that, while Ralph Teller is the only member of department 1, Jon Simpson only earned $4500. Peter Jonson is really the owner of that distinction, but the query engine selected the first name and gender that it came across having a dept_id of 2. Strictly speaking, MySQL is free to return values from any row in the group for such columns, and when the ONLY_FULL_GROUP_BY SQL mode is enabled it rejects the query instead.
The solution is to join the GROUP BY results with the original table on the grouping field and the aggregated value. In this example, those are the dept_id and the maximum salary:
SELECT emp2.dept_id, emp1.name, emp1.gender, emp2.max_salary FROM ( SELECT dept_id, MAX(salary) AS max_salary FROM employees GROUP BY dept_id ) AS emp2 JOIN employees AS emp1 ON emp1.dept_id = emp2.dept_id AND emp1.salary = emp2.max_salary ORDER BY emp2.dept_id;
Now the name and gender fields belong to the earner of the greatest salary:
dept_id | proper noun | gender | max_salary |
1 | Ralph Teller | chiliad | 5100 |
two | Peter Jonson | thousand | 5200 |
3 | Kirsten Ruegg | f | 5600 |
4 | Kirsten Ruegg | f | 5600 |
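Here is a hedged sketch of the per-group maximum technique, run against an in-memory SQLite database (an assumption; SQLite stands in for MySQL). It joins the grouped subquery back to the table on both dept_id and the maximum salary, so that a salary that happens to equal another department's maximum cannot be paired with the wrong group. Note that if two employees in one department tie for the top salary, this form returns both of them.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, dept_id INTEGER, gender TEXT, "
    "name TEXT, salary INTEGER, emp_id_number INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 2, "m", "Jon Simpson", 4500, 1234),
        (2, 4, "f", "Barbara Breitenmoser", None, 9999),
        (3, 3, "f", "Kirsten Ruegg", 5600, 3462),
        (4, 1, "m", "Ralph Teller", 5100, 6543),
        (5, 2, "m", "Peter Jonson", 5200, 9747),
        (6, 2, "m", "Allan Smithie", 4900, 6853),
        (7, 4, "f", "Kirsten Ruegg", 5600, 2223),
        (8, 3, "f", "Kirsten Ruegg", 4400, 2765),
    ],
)

# Step 1 (subquery emp2): the highest salary per department.
# Step 2 (join to emp1): fetch the row that actually holds that salary.
top_earners = conn.execute(
    "SELECT emp2.dept_id, emp1.name, emp1.gender, emp2.max_salary "
    "FROM (SELECT dept_id, MAX(salary) AS max_salary "
    "      FROM employees GROUP BY dept_id) AS emp2 "
    "JOIN employees AS emp1 "
    "  ON emp1.dept_id = emp2.dept_id AND emp1.salary = emp2.max_salary "
    "ORDER BY emp2.dept_id"
).fetchall()

for row in top_earners:
    print(row)
# (1, 'Ralph Teller', 'm', 5100)
# (2, 'Peter Jonson', 'm', 5200)
# (3, 'Kirsten Ruegg', 'f', 5600)
# (4, 'Kirsten Ruegg', 'f', 5600)
```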
There are other techniques that were not covered, such as the use of temporary tables and dynamic SQL. Here is more in-depth information on removing duplicate records. This article discusses the GROUP BY and HAVING clauses in more detail.
» See All Articles by Columnist Rob Gravelle
Source: https://www.databasejournal.com/mysql/eliminating-duplicate-rows-from-mysql-result-sets/
Posted by: walshfrivis.blogspot.com