T-SQL - Dedup a string

sql remove duplicate characters from a string
get distinct values from comma separated string in sql server
sql remove duplicate words from string
how to remove duplicates in sql query
t sql string aggregate distinct
sql filter duplicates
sql server split string
remove value from comma separated string sql

Code:

DECLARE @Table TABLE
               (
                   [Id]  INT IDENTITY(1, 1),
                   [Val] NVARCHAR(MAX)
               )

INSERT INTO @Table
    SELECT  N'123451234512345'
    UNION ALL
    SELECT  N'ABC@ABC.comABC@ABC.comABC@ABC.com'

Current output:

Id  Val
---------------------------------------
1   123451234512345
2   ABC@ABC.comABC@ABC.comABC@ABC.com

Desired Output:

Id  Val
------------------
1   12345
2   ABC@ABC.com

Goal: to dedup string in the Val column (no delimiter available). If there was a delimiter, I can split and dedup, but w/o one, I don't know how to approach this the most efficient way. The value can repeat 0 to many many times in one column.

I'm not going to argue that this is really efficient, but you can use a recursive CTE:

with cte as (
      select str, convert(varchar(max), left(str, 1)) as substr, 1 as lev
      from t
      union all
      select str, convert(varchar(max), left(str, lev)) as substr, 1 + lev
      from cte
      where lev <= len(str)
     )
select x.*
from t cross apply
     (select top (1) cte.*
      from cte
      where cte.str = t.str and
            replace(cte.str, cte.substr, '') = ''
      order by cte.lev asc
     ) x;

Here is a db<>fiddle.

How to remove duplicates from a string in SQL, Option 1 with a Parse Function Declare @YourTable table (ID int,YourCol varchar(50)) Insert Into @YourTable values (1,'qw"qw"er"er'), (2,'q"w"w"q'), (3,'f"k"​s"g')  7. Multi string search using normal T-SQL LIKE search. In example 6, multi string search was done using R script. If we want to do the same multi string search using normal T-SQL LIKE search, we can try by setting parameters values as shown below : Also in this example, let us see how we can search string with specific collation setting. 8.

Assuming there are at least two characters in [Val], try this UDF:

create function Dedup(@InCol nvarchar(max))
returns nvarchar(max) as
begin
    declare @i int = 2
    while charindex(left(@InCol, @i), @InCol, @i+1)>0 select @i += 1
    select @i = case when left(@InCol, @i-1) = right(@InCol, @i-1) then @i-1 else len(@InCol) end
    return left(@InCol, @i)
end

as

select Id, dbo.Dedup(Val) as Val from @Table

SQL SERVER, I would suggest that this UDF should be kept handy to perform this tedious task easily. CREATE FUNCTION dbo.DistinctList ( @List VARCHAR(  Attend this webinar to learn how to write T-SQL code and improve the performance of your queries along with how to import and export data with SQL Server. Problem I have seen many systems that accept or store multi-valued strings in parameters, variables, or columns, with no process set up for de-duplication.

As Gordon said - a recursive CTE will work but will be slow. For a nasty fast solution you could use dbo.rangeAB (code at the bottom of this post). First to create a logic to "dedupe" a string as you need.

-- Logic to create a function to "deduplicate" the values
DECLARE @string VARCHAR(8000) = '123451234512345'; -- 'ABC@ABC.comABC@ABC.comABC@ABC.com';

SELECT fnDedupe.itemLength, fnDedupe.item
FROM
(
  SELECT TOP (1) r.OP, lg.t
  FROM           dbo.rangeAB(1,LEN(@string),1,1)       AS r
  CROSS APPLY   (VALUES(SUBSTRING(@string,1,r.OP-1)))  AS lg(t)
  CROSS APPLY   (VALUES(CHARINDEX(lg.t,@string,r.OP))) AS f(x)
  WHERE          f.x > 0
  ORDER BY       r.RN
) AS fnDedupe(itemLength,item);

This returns:

itemLength  item
----------- ------
6           12345

Next we apply it to your table using APPLY; you will replace @string with t.v:

DECLARE @Table TABLE
(
    [Id]  INT IDENTITY(1, 1),
    [Val] NVARCHAR(MAX)
);
INSERT INTO @Table
SELECT  N'123451234512345' UNION ALL SELECT  N'ABC@ABC.comABC@ABC.comABC@ABC.com';

SELECT t.Id, t.Val, fnDedupe.item
FROM   @Table AS t
CROSS APPLY
(
  SELECT TOP (1) r.OP, lg.t
  FROM           dbo.rangeAB(1,LEN(t.val),1,1)       AS r
  CROSS APPLY   (VALUES(SUBSTRING(t.val,1,r.OP-1)))  AS lg(t)
  CROSS APPLY   (VALUES(CHARINDEX(lg.t,t.val,r.OP))) AS f(x)
  WHERE          f.x > 0
  ORDER BY       r.RN
) AS fnDedupe(itemLength,item);

This returns:

Id    Val                                     item
----- --------------------------------------- --------------
1     123451234512345                         12345
2     ABC@ABC.comABC@ABC.comABC@ABC.com       ABC@ABC.com

Here's rangeAB:

CREATE FUNCTION dbo.rangeAB
(
  @low  bigint, 
  @high bigint, 
  @gap  bigint,
  @row1 bit
)
/****************************************************************************************
[Purpose]:
 Creates up to 531,441,000,000 sequentia1 integers numbers beginning with @low and ending 
 with @high. Used to replace iterative methods such as loops, cursors and recursive CTEs 
 to solve SQL problems. Based on Itzik Ben-Gan's getnums function with some tweeks and 
 enhancements and added functionality. The logic for getting rn to begin at 0 or 1 is 
 based comes from Jeff Moden's fnTally function. 

 The name range because it's similar to clojure's range function. The name "rangeAB" as 
 used because "range" is a reserved SQL keyword.

[Author]: Alan Burstein

[Compatibility]: 
 SQL Server 2008+ and Azure SQL Database

[Syntax]:
 SELECT r.RN, r.OP, r.N1, r.N2
 FROM dbo.rangeAB(@low,@high,@gap,@row1) AS r;

[Parameters]:
 @low  = a bigint that represents the lowest value for n1.
 @high = a bigint that represents the highest value for n1.
 @gap  = a bigint that represents how much n1 and n2 will increase each row; @gap also
         represents the difference between n1 and n2.
 @row1 = a bit that represents the first value of rn. When @row = 0 then rn begins
         at 0, when @row = 1 then rn will begin at 1.

[Returns]:
 Inline Table Valued Function returns:
 rn = bigint; a row number that works just like T-SQL ROW_NUMBER() except that it can 
      start at 0 or 1 which is dictated by @row1.
 op = bigint; returns the "opposite number that relates to rn. When rn begins with 0 and
      ends with 10 then 10 is the opposite of 0, 9 the opposite of 1, etc. When rn begins
      with 1 and ends with 5 then 1 is the opposite of 5, 2 the opposite of 4, etc...
 n1 = bigint; a sequential number starting at the value of @low and incrimentingby the
      value of @gap until it is less than or equal to the value of @high.
 n2 = bigint; a sequential number starting at the value of @low+@gap and  incrimenting 
      by the value of @gap.

[Dependencies]:
N/A

[Developer Notes]:

 1. The lowest and highest possible numbers returned are whatever is allowable by a 
    bigint. The function, however, returns no more than 531,441,000,000 rows (8100^3). 
 2. @gap does not affect rn, rn will begin at @row1 and increase by 1 until the last row
    unless its used in a query where a filter is applied to rn.
 3. @gap must be greater than 0 or the function will not return any rows.
 4. Keep in mind that when @row1 is 0 then the highest row-number will be the number of
    rows returned minus 1
 5. If you only need is a sequential set beginning at 0 or 1 then, for best performance
    use the RN column. Use N1 and/or N2 when you need to begin your sequence at any 
    number other than 0 or 1 or if you need a gap between your sequence of numbers. 
 6. Although @gap is a bigint it must be a positive integer or the function will
    not return any rows.
 7. The function will not return any rows when one of the following conditions are true:
      * any of the input parameters are NULL
      * @high is less than @low 
      * @gap is not greater than 0
    To force the function to return all NULLs instead of not returning anything you can
    add the following code to the end of the query:

      UNION ALL 
      SELECT NULL, NULL, NULL, NULL
      WHERE NOT (@high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0)

    This code was excluded as it adds a ~5% performance penalty.
 8. There is no performance penalty for sorting by rn ASC; there is a large performance 
    penalty for sorting in descending order WHEN @row1 = 1; WHEN @row1 = 0
    If you need a descending sort the use op in place of rn then sort by rn ASC. 

Best Practices:
--===== 1. Using RN (rownumber)
 -- (1.1) The best way to get the numbers 1,2,3...@high (e.g. 1 to 5):
 SELECT RN FROM dbo.rangeAB(1,5,1,1);
 -- (1.2) The best way to get the numbers 0,1,2...@high-1 (e.g. 0 to 5):
 SELECT RN FROM dbo.rangeAB(0,5,1,0);

--===== 2. Using OP for descending sorts without a performance penalty
 -- (2.1) The best way to get the numbers 5,4,3...@high (e.g. 5 to 1):
 SELECT op FROM dbo.rangeAB(1,5,1,1) ORDER BY rn ASC;
 -- (2.2) The best way to get the numbers 0,1,2...@high-1 (e.g. 5 to 0):
 SELECT op FROM dbo.rangeAB(1,6,1,0) ORDER BY rn ASC;

--===== 3. Using N1
 -- (3.1) To begin with numbers other than 0 or 1 use N1 (e.g. -3 to 3):
 SELECT N1 FROM dbo.rangeAB(-3,3,1,1);
 -- (3.2) ROW_NUMBER() is built in. If you want a ROW_NUMBER() include RN:
 SELECT RN, N1 FROM dbo.rangeAB(-3,3,1,1);
 -- (3.3) If you wanted a ROW_NUMBER() that started at 0 you would do this:
 SELECT RN, N1 FROM dbo.rangeAB(-3,3,1,0);

--===== 4. Using N2 and @gap
 -- (4.1) To get 0,10,20,30...100, set @low to 0, @high to 100 and @gap to 10:
 SELECT N1 FROM dbo.rangeAB(0,100,10,1);
 -- (4.2) Note that N2=N1+@gap; this allows you to create a sequence of ranges.
 --       For example, to get (0,10),(10,20),(20,30).... (90,100):
 SELECT N1, N2 FROM dbo.rangeAB(0,90,10,1);
 -- (4.3) Remember that a rownumber is included and it can begin at 0 or 1:
 SELECT RN, N1, N2 FROM dbo.rangeAB(0,90,10,1);

[Examples]:
--===== 1. Generating Sample data (using rangeAB to create "dummy rows")
 -- The query below will generate 10,000 ids and random numbers between 50,000 and 500,000
 SELECT
   someId    = r.rn,
   someNumer = ABS(CHECKSUM(NEWID())%450000)+50001 
 FROM rangeAB(1,10000,1,1) r;

--===== 2. Create a series of dates; rn is 0 to include the first date in the series
 DECLARE @startdate DATE = '20180101', @enddate DATE = '20180131';

 SELECT r.rn, calDate = DATEADD(dd, r.rn, @startdate)
 FROM dbo.rangeAB(1, DATEDIFF(dd,@startdate,@enddate),1,0) r;
 GO

--===== 3. Splitting (tokenizing) a string with fixed sized items
 -- given a delimited string of identifiers that are always 7 characters long
 DECLARE @string VARCHAR(1000) = 'A601225,B435223,G008081,R678567';

 SELECT
   itemNumber = r.rn, -- item's ordinal position 
   itemIndex  = r.n1, -- item's position in the string (it's CHARINDEX value)
   item       = SUBSTRING(@string, r.n1, 7) -- item (token)
 FROM dbo.rangeAB(1, LEN(@string), 8,1)  r;
 GO

--===== 4. Splitting (tokenizing) a string with random delimiters
 DECLARE @string VARCHAR(1000) = 'ABC123,999F,XX,9994443335';

 SELECT
   itemNumber = ROW_NUMBER() OVER (ORDER BY r.rn), -- item's ordinal position 
   itemIndex  = r.n1+1, -- item's position in the string (it's CHARINDEX value)
   item       = SUBSTRING
               (
                 @string,
                 r.n1+1,
                 ISNULL(NULLIF(CHARINDEX(',',@string,r.n1+1),0)-r.n1-1, 8000)
               ) -- item (token)
 FROM dbo.rangeAB(0,DATALENGTH(@string),1,1) r
 WHERE SUBSTRING(@string,r.n1,1) = ',' OR r.n1 = 0;
 -- logic borrowed from: http://www.sqlservercentral.com/articles/Tally+Table/72993/

--===== 5. Grouping by a weekly intervals
 -- 5.1. how to create a series of start/end dates between @startDate & @endDate
 DECLARE @startDate DATE = '1/1/2015', @endDate DATE = '2/1/2015';
 SELECT 
   WeekNbr   = r.RN,
   WeekStart = DATEADD(DAY,r.N1,@StartDate), 
   WeekEnd   = DATEADD(DAY,r.N2-1,@StartDate)
 FROM dbo.rangeAB(0,datediff(DAY,@StartDate,@EndDate),7,1) r;
 GO

 -- 5.2. LEFT JOIN to the weekly interval table
 BEGIN
  DECLARE @startDate datetime = '1/1/2015', @endDate datetime = '2/1/2015';
  -- sample data 
  DECLARE @loans TABLE (loID INT, lockDate DATE);
  INSERT @loans SELECT r.rn, DATEADD(dd, ABS(CHECKSUM(NEWID())%32), @startDate)
  FROM dbo.rangeAB(1,50,1,1) r;

  -- solution 
  SELECT 
    WeekNbr   = r.RN,
    WeekStart = dt.WeekStart, 
    WeekEnd   = dt.WeekEnd,
    total     = COUNT(l.lockDate)
  FROM dbo.rangeAB(0,datediff(DAY,@StartDate,@EndDate),7,1) r
  CROSS APPLY (VALUES (
    CAST(DATEADD(DAY,r.N1,@StartDate) AS DATE), 
    CAST(DATEADD(DAY,r.N2-1,@StartDate) AS DATE))) dt(WeekStart,WeekEnd)
  LEFT JOIN @loans l ON l.lockDate BETWEEN  dt.WeekStart AND dt.WeekEnd
  GROUP BY r.RN, dt.WeekStart, dt.WeekEnd ;
 END;

--===== 6. Identify the first vowel and last vowel in a along with their positions
 DECLARE @string VARCHAR(200) = 'This string has vowels';

 SELECT TOP(1) position = r.rn, letter = SUBSTRING(@string,r.rn,1)
 FROM dbo.rangeAB(1,LEN(@string),1,1) r
 WHERE SUBSTRING(@string,r.rn,1) LIKE '%[aeiou]%'
 ORDER BY r.rn;

 -- To avoid a sort in the execution plan we'll use op instead of rn
 SELECT TOP(1) position = r.op, letter = SUBSTRING(@string,r.op,1)
 FROM dbo.rangeAB(1,LEN(@string),1,1) r
 WHERE SUBSTRING(@string,r.rn,1) LIKE '%[aeiou]%'
 ORDER BY r.rn;

---------------------------------------------------------------------------------------
[Revision History]:
 Rev 00 - 20140518 - Initial Development - Alan Burstein
 Rev 01 - 20151029 - Added 65 rows to make L1=465; 465^3=100.5M. Updated comment section
                   - Alan Burstein
 Rev 02 - 20180613 - Complete re-design including opposite number column (op)
 Rev 03 - 20180920 - Added additional CROSS JOIN to L2 for 530B rows max - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH L1(N) AS 
(
  SELECT 1
  FROM (VALUES
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0)) T(N) -- 90 values 
),
L2(N)  AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally AS (SELECT rn = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT
  r.RN,
  r.OP,
  r.N1,
  r.N2
FROM
(
  SELECT
    RN = 0,
    OP = (@high-@low)/@gap,
    N1 = @low,
    N2 = @gap+@low
  WHERE @row1 = 0
  UNION ALL -- COALESCE required in the TOP statement below for error handling purposes
  SELECT TOP (ABS((COALESCE(@high,0)-COALESCE(@low,0))/COALESCE(@gap,0)+COALESCE(@row1,1)))
    RN = i.rn,
    OP = (@high-@low)/@gap+(2*@row1)-i.rn,
    N1 = (i.rn-@row1)*@gap+@low,
    N2 = (i.rn-(@row1-1))*@gap+@low
  FROM iTally AS i
  ORDER BY i.rn
) AS r
WHERE @high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0;

SQL DISTINCT: Removing Duplicates In a Result Set Examples, How do you remove duplicates from a comma separated string in SQL? Sometimes when you are inspecting data, you come across duplicates that shouldn’t exist. Here’s a an easy way to remove duplicate rows using the ROW_NUMBER function. In a previous post about using ROW_NUMBER to find the latest record for each member in a group, I set up some sample data with this statement: CREATE TABLE metrics AS ( SELECT date, CASE WHEN n > random() * 2 and n < random

SQL SERVER - Remove Duplicate Chars From String - Part 2, How do I remove duplicate characters in a string in SQL Server? This stored procedure has the capability to do normal T-SQL LIKE operations as well as can search string using R grep function and this can be controlled through input parameter. Output of the search operation will be stored in a table named "Tbl_SearchString".

Eliminating Duplicate Rows from MySQL Result Sets, I have already made a post on SQL SERVER – UDF – Remove Duplicate Chars From String where I showed you a method of removing  In SQL Server, you can use the T-SQL CHARINDEX () function or the PATINDEX () function to find a string within another string. Here’s a quick overview of each function. The CHARINDEX () Function This function accepts 3 arguments; the string to find, the string to search, and an optional start position.

SQL, The idea is to group according to all columns to be selected in output. For example, if we wish to print unique values of “FirstName, LastName and MobileNo”, we  There are basically two different functions for finding a character string or a wildcard in T-SQL, CHARINDEX() and PATINDEX(). They work pretty much the same way, but the difference is that CHARINDEX(s, a) returns the first offset of the character string “s” in “a”, whereas PATINDEX(s, a) returns the first offset where the wildcard “s” is true in “a”.

Comments
  • I don't understand what defines duplication.
  • How do you determine when something is duplicated? Are you sure the data isn't really supposed to be "ABC@ABC.comABC@ABC.comABC@ABC.com"? What if the data was "AAAAAAABC@ABC.com"? Is the correct answer "ABC@ABC.com" or "AAAAAAABC@ABC.com" or "AAABC@ABC.com" or ...?
  • I doubt SQL Server is going to be the right choice here. It is far from the best tool for string manipulation.
  • Yea, I asked for "rules" to define a dup, but haven't heard back. From the sample examples they gave me, it's always same value that repeats (so, ABC@ABC.com, ABC@ABC.com" turns into ABC@ABC.com instead of "AAAABC@ABC.com" turning into ABC@ABC.com). Yea, I'll push back on doing this inside SQL. I wanted to see if there was a cool way to do it though. Thanks for your feedback everyone.
  • "ABC@ABC.com" has "ABC" duplicated in it.
  • You came up with a working solution in like 7 mins... very impressive!
  • This works as expected. And like @Shiva said, VERY impressive!! (you've answered a few of my previous "weird" questions and it always amazes me). Cheers!
  • Thank you for an alternative solution. This also works as expected. Cheers! p.s. going to take me a bit to understand everything going on here. looks neat,
  • This is sick! Thank you so much for sharing. tbh, I'm a bit intimidated by the code inside the fx. ;) Just one thing I notice w/ this and the previous solution, but not Gordon's is that when I pass in "N'ABC@ABC.comABC@ABC.comABC@ABC.comA'" (notice the extra A at the end) it turns the output into "N'ABC@ABC.comA" instead of keeping it original since there is not "complete" repeat.