Sum distinct, duplicate rows

sum distinct sql
sql query to combine duplicate records and sum qty
sql sum duplicate rows
sql sum multiple rows
sql sum row values
inner join sum duplicates
sql distinct
sql distinct sum group by

The tl/dr summary: 3 tables with hierarchical relationship, a number field in the middle level, need a sum of that number without duplicating because of the lower level, looking for an alternative using OLAP functions in DB2.

This somewhat revisits these two topics (SUM(DISTINCT) Based on Other Columns and Sum Values based on Distinct Guids) - but I'm bumping as a separate topic because I'm wondering if there's a way to accomplish this with OLAP functions.

I'm working in DB2. The scenario (not the actual tables, due to client confidentiality) is:

   Table: NEIGHBORHOOD, field NEIGHBORHOOD_NAME
   Table: HOUSEHOLD, fields NEIGHBORHOOD_NAME, HOUSEHOLD_NAME, and HOUSEHOLD_INCOME
   Table: HOUSEHOLD_MEMBER, fields HOUSEHOLD_NAME, PERSON_NAME

Right now we've got the data pulled by a single flatten-it-all view. So we would get something like

 Shady Acres, 123 Shady Lane, 25000, Jane
 Shady Acres, 123 Shady Lane, 25000, Mary
 Shady Acres, 123 Shady Lane, 25000, Robert
 Shady Acres, 126 Shady Lane, 15000, George
 Shady Acres, 126 Shady Lane, 15000, Tom
 Shady Acres, 126 Shady Lane, 15000, Betsy
 Shady Acres, 126 Shady Lane, 15000, Timmy

If I want

    Shady Acres, 123 Shady Lane, 25000, 3  (household income, count of members)
    Shady Acres, 125 Shady Lane, 15000, 4

it's no problem:

SELECT N.NEIGHBORHOOD_NAME, H.HOUSEHOLD_NAME, H.HOUSEHOLD_INCOME, count(1)
from NEIGHBORHOOD N join HOUSEHOLD H on N.HOUSEHOLD_NAME = H.HOUSEHOLD_NAME
join HOUSEHOLD_MEMBER M on H.HOUSEHOLD_NAME = M.HOUSEHOLD_NAME
group by N.NEIGHBORHOOD_NAME, H.HOUSEHOLD_NAME, H.HOUSEHOLD_INCOME

However, if I want

   Shady Acres, 2, 40000, 7 (i.e. neighborhood, number of households, sum of income, count of members)

I can't accomplish it without a subquery, as seen in the related links.

The best I've gotten so far is

select NEIGHBORHOOD.NEIGHBORHOOD_NAME,
count(distinct HOUSEHOLD.HOUSEHOLD_NAME) household_Count,
sum(distinct HOUSEHOLD.HOUSEHOLD_INCOME) total_income,
count(1) household_members group by N.NEIGHBORHOOD_NAME

This won't work if you have two households with the same income, of course. I was frankly surprised that "sum(distinct)" even worked, since it just doesn't make sense to me.

I tried

sum(household_income) over (partition by household.household_name) 

and it threw an error:

An‬‎ ‪expression‬‎ ‪starting‬‎ ‪with‬‎ ‪‬‎"HOUSEHOLD_INCOME"‬‎ ‪specified‬‎ ‪in‬‎ ‪a‬‎ ‪SELECT‬‎ ‪clause‬‎,‪‬‎ ‪HAVING‬‎ ‪clause‬‎,‪‬‎ ‪or‬‎ ‪ORDER‬‎ ‪BY‬‎ ‪clause‬‎ ‪is‬‎ ‪not‬‎ ‪specified‬‎ ‪in‬‎ ‪the‬‎ ‪GROUP‬‎ ‪BY‬‎ ‪clause‬‎ ‪or‬‎ ‪it‬‎ ‪is‬‎ ‪in‬‎ ‪a‬‎ ‪SELECT‬‎ ‪clause‬‎,‪‬‎ ‪HAVING‬‎ ‪clause‬‎,‪‬‎ ‪or‬‎ ‪ORDER‬‎ ‪BY‬‎ ‪clause‬‎ ‪with‬‎ ‪a‬‎ ‪column‬‎ ‪function‬‎ ‪and‬‎ ‪no‬‎ ‪GROUP‬‎ ‪BY‬‎ ‪clause‬‎ ‪is‬‎ ‪specified‬‎.‪‬‎.‪‬‎ ‪SQLCODE‬‎=‪‬‎-‪119‬‎,‪‬‎ ‪SQLSTATE‬‎=‪42803‬‎,‪‬‎ ‪DRIVER‬‎=‪4‬‎.‪19‬‎.‪56

Attempting to add either HOUSEHOLD_INCOME or HOUSEHOLD_NAME to the grouping causes the wrong results since we don't want to break it out by those fields.

It's entirely possible that there's no solution to this aside from using a subquery, but we'd have to do some significant redesign of the underlying view (including adding additional views), so I figured it couldn't hurt to ask.


I agree that this is not possible without a sub-query if you want to use OLAP fucntions

A co-related sub-select would work, but is inelegant, inefficient and likely not what you want

WITH NEIGHBORHOOD(NEIGHBORHOOD_NAME) AS (VALUES ('Shady Acres'))
,   HOUSEHOLD (NEIGHBORHOOD_NAME, HOUSEHOLD_NAME, HOUSEHOLD_INCOME)
AS (VALUES 
   ('Shady Acres', '123 Shady Lane', 25000)
  ,('Shady Acres', '126 Shady Lane', 15000) 
  )
, HOUSEHOLD_MEMBER ( HOUSEHOLD_NAME, PERSON_NAME )
AS(VALUES
      ('123 Shady Lane', 'Jane'  )
     ,('123 Shady Lane', 'Mary'  )
     ,('123 Shady Lane', 'Robert')
     ,('126 Shady Lane', 'George')
     ,('126 Shady Lane', 'Tom'   )
     ,('126 Shady Lane', 'Betsy' )
     ,('126 Shady Lane', 'Timmy' )    
)
SELECT
    NEIGHBORHOOD_NAME
,   COUNT(DISTINCT HOUSEHOLD_NAME  )  AS HOUSEHOLD_COUNT
--,   SUM(DISTINCT   HOUSEHOLD_INCOME)  AS TOTAL_INCOME       -- not valid if two househols have the same income
--,   SUM(HOUSEHOLD_INCOME) OVER (PARTITION BY HOUSEHOLD_NAME)  -- not valid unless we GROUP BY HOUSEHOLD_NAME in the main body
,   SUM( (SELECT SUM(S.HOUSEHOLD_INCOME) FROM HOUSEHOLD S 
            WHERE S.HOUSEHOLD_NAME = H.HOUSEHOLD_NAME
            AND   M.PERSON_NAME = (SELECT MAX(SS.PERSON_NAME) 
                                   FROM HOUSEHOLD_MEMBER SS
                                   WHERE SS.HOUSEHOLD_NAME = H.HOUSEHOLD_NAME))
           )            AS TOTAL_INCOME
,   COUNT(1)                          AS HOUSEHOLD_MEMBERS
FROM NEIGHBORHOOD N
JOIN HOUSEHOLD    H     USING ( NEIGHBORHOOD_NAME )
JOIN HOUSEHOLD_MEMBER M USING ( HOUSEHOLD_NAME )
GROUP BY N.NEIGHBORHOOD_NAME

returns

 NEIGHBORHOOD_NAME  HOUSEHOLD_COUNT     TOTAL_INCOME    HOUSEHOLD_MEMBERS
 -----------------  ---------------     ------------    -----------------
 Shady Acres                      2     40000           7

Sum distinct records in a table with duplicates in Teradata, Your B92_Sum currently returns either NULL , 1 or 2 , this is definitely no sum. To sum distinct values you need something like sum(distinct  The primary key ensures that the table has no duplicate rows. However, when you use the SELECT statement to query a portion of the columns in a table, you may get duplicates. To remove duplicates from a result set, you use the DISTINCT operator in the SELECT clause as follows: SELECT DISTINCT column1, column2,


So this would be a very hacky solution, that will work as long as you get no hash collisions on the Key of the denormalised / duplicated data and you don't have more duplicates than than the number of decimal places I pushed the HASH value out of the way for the SUM()

WITH NEIGHBORHOOD(NEIGHBORHOOD_NAME) AS (VALUES ('Shady Acres'))
,   HOUSEHOLD (NEIGHBORHOOD_NAME, HOUSEHOLD_NAME, HOUSEHOLD_INCOME)
AS (VALUES 
   ('Shady Acres', '123 Shady Lane', 25000)
  ,('Shady Acres', '126 Shady Lane', 25000) 
  )
, HOUSEHOLD_MEMBER ( HOUSEHOLD_NAME, PERSON_NAME )
AS(VALUES
      ('123 Shady Lane', 'Jane'  )
     ,('123 Shady Lane', 'Mary'  )
     ,('123 Shady Lane', 'Robert')
     ,('126 Shady Lane', 'George')
     ,('126 Shady Lane', 'Tom'   )
     ,('126 Shady Lane', 'Betsy' )
     ,('126 Shady Lane', 'Timmy' )    
)
SELECT
    NEIGHBORHOOD_NAME
,   COUNT(DISTINCT HOUSEHOLD_NAME  )  AS HOUSEHOLD_COUNT
,   BIGINT(SUM(DISTINCT DECFLOAT(HOUSEHOLD_INCOME || '.000000' || ABS(HASH4(HOUSEHOLD_NAME))))) AS TOTAL_INCOME
,   COUNT(1)                          AS HOUSEHOLD_MEMBERS
FROM NEIGHBORHOOD N
JOIN HOUSEHOLD    H     USING ( NEIGHBORHOOD_NAME )
JOIN HOUSEHOLD_MEMBER M USING ( HOUSEHOLD_NAME )
GROUP BY N.NEIGHBORHOOD_NAME

which returns

 NEIGHBORHOOD_NAME  HOUSEHOLD_COUNT     TOTAL_INCOME    HOUSEHOLD_MEMBERS
 -----------------  ---------------     ------------    -----------------
 Shady Acres                      2        50000                    7

Note I made the income of the two households the same to prove that this solution works in that scenario

I guess you could argue that SQL is missing some syntactic feature where OLAP functions using the DISTINCT keyword should be able to define what is being made DISTINCT separately from what is being aggregated.

Sum distinct records in a table with duplicates, I have a table that has some duplicates. I can count the distinct records to get the Total Volume. When I try to Sum when the CompTia Code is  Combine duplicate rows and sum the values with Consolidate function. 1 . Click a cell where you want to locate the result in your current worksheet. 2 . Go to click Data > Consolidate , see screenshot: 3 . In the Consolidate dialog box: 4 . After finishing the settings, click OK , and the duplicates


The following query returns the result you need:

WITH 
  NEIGHBORHOOD(NEIGHBORHOOD_NAME) AS 
(
VALUES ('Shady Acres')
)
, HOUSEHOLD (NEIGHBORHOOD_NAME, HOUSEHOLD_NAME, HOUSEHOLD_INCOME) AS 
(
VALUES 
  ('Shady Acres', '123 Shady Lane', 25000)
, ('Shady Acres', '126 Shady Lane', 15000) 
)
, HOUSEHOLD_MEMBER (HOUSEHOLD_NAME, PERSON_NAME) AS
(
VALUES
      ('123 Shady Lane', 'Jane'  )
     ,('123 Shady Lane', 'Mary'  )
     ,('123 Shady Lane', 'Robert')
     ,('126 Shady Lane', 'George')
     ,('126 Shady Lane', 'Tom'   )
     ,('126 Shady Lane', 'Betsy' )
     ,('126 Shady Lane', 'Timmy' )    
)
, TMP AS 
(
SELECT 
  N.NEIGHBORHOOD_NAME
, CASE WHEN H.HOUSEHOLD_NAME = LAG(H.HOUSEHOLD_NAME) OVER (PARTITION BY N.NEIGHBORHOOD_NAME ORDER BY H.HOUSEHOLD_NAME) THEN 0 ELSE 1 END AS HOUSEHOLD_NAME_CHANGED
, H.HOUSEHOLD_INCOME
FROM NEIGHBORHOOD N
JOIN HOUSEHOLD H ON H.NEIGHBORHOOD_NAME = N.NEIGHBORHOOD_NAME
JOIN HOUSEHOLD_MEMBER M ON M.HOUSEHOLD_NAME = H.HOUSEHOLD_NAME
)
SELECT 
  NEIGHBORHOOD_NAME
, SUM(HOUSEHOLD_NAME_CHANGED) AS HOUSEHOLD_NAMES
, SUM(HOUSEHOLD_NAME_CHANGED * HOUSEHOLD_INCOME) AS HOUSEHOLD_INCOME
, COUNT(1) AS MEMBERS
FROM TMP
GROUP BY NEIGHBORHOOD_NAME;

How to eliminate duplicate values in aggregate function with , With SUM(), AVG(), and COUNT(expr), DISTINCT eliminates duplicate values before the sum, average, or count is calculated. DISTINCT isn't meaningful with MIN() and MAX(); you can use it, but it won't change the result. You can't use DISTINCT with COUNT(*). You can use the Consolidate feature to combine duplicate rows and then sum the values in excel, let’s see the below steps: 1# select a cell that you want to display the result combined 2# on the DATA tab, click Consolidate command under Data Tools group. 3# the Consolidate window will appear.


Another option could be

with base (NEIGHBORHOOD_NAME,HOUSEHOLD_NAME, HOUSEHOLD_INCOME, HOUSEHOLD_MEMBER) as (
 values ('Shady Acres', '123 Shady Lane', 25000, 'Jane')
,( 'Shady Acres', '123 Shady Lane', 25000, 'Mary')
,( 'Shady Acres', '123 Shady Lane', 25000, 'Robert')
,( 'Shady Acres', '126 Shady Lane', 15000, 'George')
,( 'Shady Acres', '126 Shady Lane', 15000, 'Tom')
,( 'Shady Acres', '126 Shady Lane', 15000, 'Betsy')
,( 'Shady Acres', '126 Shady Lane', 15000, 'Timmy')
)
, temp as (
select NEIGHBORHOOD_NAME,HOUSEHOLD_NAME, HOUSEHOLD_INCOME, HOUSEHOLD_MEMBER
     , row_number() over (partition by NEIGHBORHOOD_NAME,HOUSEHOLD_NAME order by HOUSEHOLD_MEMBER) as rownum_asc
     , row_number() over (partition by NEIGHBORHOOD_NAME,HOUSEHOLD_NAME order by HOUSEHOLD_MEMBER desc) as rownum_desc
FROM base
)
SELECT NEIGHBORHOOD_NAME, sum(HOUSEHOLD_INCOME) as TOTAL_INCOME, sum(rownum_desc) as member_count

  FROM temp
 WHERE rownum_asc = 1
 GROUP BY NEIGHBORHOOD_NAME

There is a little trick with the rownumber in both directions - one to select only one of the rows per Household and the other to count the members so a sum would do the job in the end.

OLAP functions do not reduce the number of rows - this is a mayor difference and often advantage of OLAP functions compared to GROUP BY and column functions. But with you flattened base table a GROUP BY is needed.

SQL Server SUM() Function By Practical Examples, ALL instructs the SUM() function to return the sum of all values including duplicates. ALL is used by default. DISTINCT instructs the SUM() function to calculate  Sum only unique values in Excel with formulas. 1. Type this formula: =SUMPRODUCT(1/COUNTIF(A2:A15,A2:A15&""),A2:A15) into a blank cell, see screenshot: 2 . Then press Enter key, and the numbers which appear only one time have been added up. Notes: 1. Here is another formula also can help you:


Aggregating Distinct Values with DISTINCT, ALL is the default and rarely is seen in practice. With SUM(), AVG(), and COUNT(expr), DISTINCT eliminates duplicate values before the sum, average, or count is calculated. DISTINCT isn't meaningful with MIN() and MAX(); you can use it, but it won't change the result. You can't use DISTINCT with COUNT(*). If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables..keep_all: If TRUE, keep all variables in .data. If a combination of is not distinct, this keeps the first row of values.


How to Use GROUP BY with Distinct Aggregates and Derived tables , It seems like that should do the trick, since we only want to sum distinct shipping cost values, not all the duplicates. Let's give it a try: select o. And, for this small sample, it just randomly happens to be correct. But SUM(DISTINCT) works exactly the same as COUNT(DISTINCT): It simply gets all of the values eligible to be summed, eliminates all duplicate values, and then adds up the results. But it is eliminating duplicate values, not duplicate rows based on some primary key column!


Eliminating Duplicates Using SQL DISTINCT Operator, The SQL DISTINCT operator is used to eliminate the duplicate rows in the result set of SUM: SUM(DISTINCT column) to calculate the sum of distinct values. Hi, I'm new to Power BI and I am trying to determine to how to calculate an average amount based on distinct values in one column that have different values in another column. Specifically I am trying to create a column that gives me average expenditure per policy number, where a policy number is repeated in one column and has different values