What is the difference between Postgres DISTINCT vs DISTINCT ON?

postgresql distinct count
distinct on postgres
postgres count(distinct multiple columns)
postgresql multiple distinct on
select distinct on expressions must match initial order by expressions django
difference between distinct and distinct all
postgres distinct on subquery
postgres distinct group by

I have a Postgres table created with the following statement. This table is filled by as dump of data from another service.

CREATE TABLE data_table (
    date date DEFAULT NULL,
    dimension1 varchar(64) DEFAULT NULL,
    dimension2 varchar(128) DEFAULT NULL
) TABLESPACE pg_default;

One of the steps in a ETL I'm building is extracting the unique values of dimension1 and inserting them in another intermediary table. However, during some tests I found out that the 2 commands below do not return the same results. I would expect for both to return the same sum. The first command returns more results compared with the second (1466 rows vs. 1504.

-- command 1
SELECT DISTINCT count(dimension1)
FROM data_table;

-- command 2    
SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;

Any obvious explanations for this? Alternatively to an explanation, is there any suggestion of any check on the data I should do?

EDIT: The following queries both return 1504 (same as the "simple" DISTINCT)

SELECT count(*)
FROM data_table WHERE dimension1 IS NOT NULL;

SELECT count(dimension1)
FROM data_table;

Thank you!

DISTINCT and DISTINCT ON have completely different semantics.

First the theory

DISTINCT applies to an entire tuple. Once the result of the query is computed, DISTINCT removes any duplicate tuples from the result.

For example, assume a table R with the following contents:

#table r;
a | b 
---+---
1 | a
2 | b
3 | c
3 | d
2 | e
1 | a

(6 rows)

SELECT distinct * from R will result:

# select distinct * from r;
 a | b 
---+---
 1 | a
 3 | d
 2 | e
 2 | b
 3 | c
(5 rows)

Note that distinct applies to the entire list of projected attributes: thus

select distinct * from R

is semantically equivalent to

select distinct a,b from R

You cannot issue

select a, distinct b From R

DISTINCT must to follow SELECT. It applies to the entire tuple, not to an attribute of the result.

DISTINCT ON is a postgresql addition to the language. It is similar, but not identical, to group by.

Its syntax is:

 SELECT DISTINCT ON (attributeList) <rest as any query>

For example:

 SELECT DISTINCT ON (a) * from R

It semantics can be described as follows. Compute the query as usual, but before the projection of the result, sort the current result and group it according to the attribute list in DISTINCT ON (similar to group by). Now, do the projection using the first tuple in each group and ignore the other tuples.

Example:

 SELECT DISTINCT on (a) * from r;
  a | b 
 ---+---
 1 | a
 2 | b
 3 | c
 (3 rows)

Now, back to your problem:

First query:

SELECT DISTINCT count(dimension1)
FROM data_table;

computes the count of dimension1 (number of tuples in data_table that where dimension1 is not null). This query returns one tuple, which is always unique (hence DISTINCT is redundant).

Query 2:

SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;

This is query in a query. Let me rewrite it for clarity:

WITH tmp_table AS (
   SELECT DISTINCT ON (dimension1) 
     dimension1 FROM data_table
     GROUP by dimension1) 
SELECT count(*) from tmp_table

Let us compute first tmp_table. As I mentioned above, let us first ignore the DISTINCT ON and do the rest of the query. This is a group by by dimension1. Hence this part of the query will result in one tuple per different value of dimension1.

Now, the DISTINCT ON. It uses dimension1 again. But dimension1 is unique already (due to the group by). Hence this makes the DISTINCT ON superflouos (it does nothing). The final count is simply a count of all the tuples in the group by.

As you can see, there is an equivalence in the following query (it applies to any relation with an attribute a):

SELECT (DISTINCT ON a) a
FROM R

and

SELECT a FROM R group by a

and

SELECT DISTINCT a FROM R

Warning

Using DISTINCT ON results in a query might be non-deterministic for the any given instance of the database. In other words, the query might return different results for the same tables.

One interesting aspect

Distinct ON emulates a bad behaviour of sqlite and mysql in a much cleaner way. Assume that R has two attributes a and b:

SELECT a, b FROM R group by a

is an illegal statement in SQL. Yet, it runs on mysql and sqlite. It simply takes a random value of b from any of the tuples in the group of same values of a. In Postgresql this statement is illegal. Instead, you must use DISTINCT ON and write:

SELECT DISTINCT ON (a) a,b from R

* Corollary *

DISTINCT ON is useful in a group by when you want to access a value that is functionally dependent on the group by attributes. In other words, if you know that for every group of attributes they always have the same value of the third attribute, then use DISTINCT ON that group of attributes. Otherwise you would have to make a JOIN to retrieve that third attribute.

SQL DISTINCT vs UNIQUE – What's the Difference?, 3 powerful uses of DISTINCT in PostgreSQL. Haki Benita Comparing values in SQL can result in three outcomes — true , false or unknown : Oracle — Provide a function called LNNVL to compare nullable fields (good luck with that…). sql: distinct, distinct on and all It is not uncommon to have duplicate data in the results of a query. We can use the DISTINCT clause of SELECT to remove those and return only the unique results.

The first query gives the number of not null values of dimension1, while the second one returns the number of distinct values of the column. These numbers obviously are not equal if the column contains duplicates or nulls.

The word DISTINCT in

SELECT DISTINCT count(dimension1)
FROM data_table;

makes no sense, as the query returns a single row. Maybe you wanted

SELECT count(DISTINCT dimension1)
FROM data_table;

which returns the number of distinct not null values of dimension1. Note, that it is not the same as

SELECT count(*)
FROM (
    SELECT DISTINCT ON (dimension1) dimension1
    FROM data_table
    -- GROUP BY dimension1 -- redundant
    ) AS tmp_table;

The last query yields the number of all (null or not) distinct values of the column.

The many faces of DISTINCT in PostgreSQL - Statuscode, clause is used in a SELECT statement to filter duplicate rows in the result set. It ensures that rows returned are unique for the column or columns specified in the SELECT clause. table; In this statement, the values in the column_1 of the table are compared to determine the duplicates. The main difference between unique and distinct is that UNIQUE is a constraint that is used on the input of data and ensures data integrity. While DISTINCT keyword is used when we want to query our results or in other words, output the data.

Try

SELECT count(dimension1a)
FROM (SELECT DISTINCT ON (dimension1) dimension1a
FROM data_table
ORDER BY dimension1) AS tmp_table;

DISTINCT ON appears to be synonymous with GROUP BY.

Difference between distinct and group by in SQL Server, . Feel free to test this out in the editor to see what happens! Normally you would use distinct on in combination with an order by so that you get the particular unique row you are interested in. For example if col1 and col2 were an event type and col3 was a timestamp for the event, you could use distinct on to get the earliest or latest timestamp for each event type.

To learn and understand what happens by visual example. Here's a bit of SQL to execute on a PostgreSQL:

DROP TABLE IF EXISTS test_table;
CREATE TABLE test_table (
    id int NOT NULL primary key,
    col1 varchar(64) DEFAULT NULL
);

INSERT INTO test_table (id, col1) VALUES 
(1,'foo'), (2,'foo'), (3,'bar'), (4,null);

select count(*) as total1 from test_table;
-- returns: 4
-- Because the table has 4 records.

select distinct count(*) as total2 from test_table;
-- returns: 4
-- The count(*) is just one value. Making 1 total unique can only result in 1 total. 
-- So the distinct is useless here.

select col1, count(*) as total3 from test_table group by col1 order by col1;
-- returns 3 rows: ('bar',1),('foo',2),(NULL,1)
-- Since there are 3 unique col1 values. NULL's are included.

select distinct col1, count(*) as total4 from test_table group by col1 order by col1;
-- returns 3 rows: ('bar',1),('foo',2),(NULL,1)
-- The result is already grouped, and therefor already unique. 
-- So again, the distinct does nothing extra here.

select count(distinct col1) as total5 from test_table;
-- returns 2
-- NULL's aren't counted in a count by value. So only 'foo' & 'bar' are counted

select distinct on (col1) id, col1 from test_table order by col1 asc, id desc;
-- returns 3 rows: (2,'a'),(3,'b'),(4,NULL)
-- So it gets the records with the maximum id per unique col1
-- Note that the "order by" matters here. Changing that DESC to ASC would get the minumum id.

select count(*) as total6 from (select distinct on (col1) id, col1 from test_table order by col1 asc, id desc) as q;
-- returns 3. 
-- After seeing the previous query, what else would one expect?

select distinct col1 from test_table order by col1;
-- returns 3 unique values : ('bar'),('foo'),(null)

select distinct id, col1 from test_table order by col1;
-- returns all records. 
-- Because id is the primary key and therefore makes each returned row unique

Oracle SELECT DISTINCT By Practical Examples, No, this is not a typical DISTINCT . This is different. It is perfect when you have groups of data that are similar and want to pull a single record is a single identifier that could be used), a LATERAL join or a WINDOW function. >"distinct" unless you know exactly why your query is generating duplicate >rows in the first place. On that note: I know most people here don't pay much - or any - attention to SQLServer, however there was an interesting article recently regarding significant performance differences between DISTINCT and GROUP BY as used to remove duplicates.

PostgreSQL SELECT DISTINCT By Practical Examples, The DISTINCT clause keeps one row for each group of duplicates. The DISTINCT clause can be used on one or more columns of a table. The following  The main difference between Unique and Distinct in SQL is that Unique helps to ensure that all the values in a column are different while Distinct helps to remove all the duplicate records when retrieving the records from a table.

Can we apply DISTINCT to a SELECT query with multiple columns , We can use the DISTINCT clause of SELECT to remove those and return only If we only want the distinct set of people who are or have been  DISTINCT: DIFFERENT: Meaning: distinguished as not being the same: not alike in character or quality: Etymology: Distinct is etymologically related to distincten “to distinguish”. Different is etymologically related to differre “to set apart”. Modern Usage: Distinct suggests different identity, although the things might be similar in other ways.

SELECT DISTINCT ON in PostgreSQL, (Each element in the FROM list is a real or virtual table.) SELECT DISTINCT eliminates duplicate rows from the result. syntax, if one of the functions requires a column definition list, it's preferred to put the column definition list after the  Distinct usually means “noticeably different” or “separate.” Distinct is used when you want to say that something is obviously separate or different from other things. It is often used when talking about more than one thing. Below are some examples of this use. The painting has three distinct colors: red, blue, and yellow.

Comments
  • Weird. Side note: Logically, distinct and group by should accomplish the same thing so it's redundant to have them both in play.
  • @n8. thank you for pointing that out. I will have more fields (outside the DISTINCT ON expression which will require me to have the ORDER BY > From the docs: Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first
  • Here is something interesting: medium.com/statuscode/…
  • I don't use PostgreSQL anymore and when I did it wasn't often, so I didn't know that there is a difference between DISTINCT and DISTINCT ON.
  • Thank you for the explanation. Definitely helpful for understanding these differences. I read through the docs, apparently not well enough.
  • Hello @dmg, again, I see you updated your answer with a much more detailed explanation, I really appreciate your taking your time.
  • What is good way to learn like this, where you know what's going on behind scenes at least theoretically. Most sites are mere syntax reference and provide no background information on this all ties together behind scenes. @dmg
  • damn.... moving the DISTINCT from outside the count to inside (your block of code SELECT count(DISTINCT dimension1) FROM data_table; shows it. Thank you!
  • Thank you, but the answer from @klin actually solve it. The DISTINCT outside the count wasn't doing anything.
  • Thank you @LukStorms, extremely useful stuff! I'm in the middle of translating a bunch of aggregations, originally written for MySQL, to Postgres and it the more examples I can get the better.
  • @tekneee Nice to hear that it helps. But I think it's mostly that DISTINCT ON(column) that's not standard SQL and specific to PostgreSQL. The rest of those queries would behave the same on MySQL or other database types.
  • Sure @LukStorms. Curiously, this also arose form the need to replace a Group By from MySQL which on its hand also behaves in a non-standard way because it doesn't require an aggregation - and when this happens, it gets only ONE row of values, even if the aggregation resulted in the last step of aggregation returning more rows. (just a curiousity)
  • @tekneee If you're talking about GROUP BY in MySql. I remember that at least older versions of MySql are peculiar with the GROUP BY compared to other database types. In any other database something like select foo, bar, count(*) from test group by foo wouldn't be allowed. Because you'd get the error that 'bar' isn't in the GROUP BY. But in MySql it's often allowed. Which can sometimes lead to unexpected results.
  • @tekneee they explained it though : here. But you can actually disable that behaviour. F.e. check this old SO post