How to handle a dimension table with duplicate IDs containing slightly different values in a data warehouse?

I'm building out the data warehouse at my company, and I've encountered a situation where I am pulling in data with slight variations in name that are tied to the same ID. This is obviously a problem, because my dimension table should have only one record per ID.

For example:

+======+===================+
|  id  |      name         |
+======+===================+
|  185 | AAAA              |
+------+-------------------+
|  185 | AAAB              |
+------+-------------------+
|  197 | XXXA              |
+------+-------------------+
|  197 | XXXB              |
+------+-------------------+
|  197 | XXXC              |
+------+-------------------+

As you can see, each ID should map to one unique value, but several strings with slight variations are tied to the same ID. One thought was to normalize the strings, but we would lose some of the metadata. I should also note that we are using Redshift, which is why the unique ID constraint is not being enforced. What would be the best solution to this issue?

Keep the latest name in the dimension table and create a secondary "history" table in case you need the other names in the future. I had a similar situation with a user dimension and implemented it the way I describe here. You can choose whatever rule you like to decide which name goes in the dimension table.

With your example, the two tables will look like this:

dim table
+======+===================+
|  id  |      name         |
+======+===================+
|  185 | AAAB              |
+------+-------------------+
|  197 | XXXC              |
+------+-------------------+

dim_hist table

+======+========+===================+
|  id  | dim_id |      name         |
+======+========+===================+
|  101 |    185 | AAAA              |
+------+--------+-------------------+
|  102 |    197 | XXXA              |
+------+--------+-------------------+
|  103 |    197 | XXXB              |
+------+--------+-------------------+

Using the id from the dim table (stored as dim_id in dim_hist), you can join the two tables and access the other names.
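A minimal sketch of this split, run against an in-memory SQLite database through Python's `sqlite3` module so it can be tried locally (it assumes SQLite 3.25+ for window functions; Redshift would need its own DDL). The `staging` table and its `loaded_at` column are hypothetical: the question has no timestamp, so an ordering column like this is assumed as the rule for picking the "latest" name. The history table's own `id` column is left out for brevity.

```python
import sqlite3

def split_dim_and_history(conn):
    """Build `dim` (one row per id) and `dim_hist` (all other names)."""
    conn.executescript("""
        -- Rank each id's names, newest first (loaded_at is an assumed column).
        create temp table ranked as
        select id, name,
               row_number() over (partition by id order by loaded_at desc) as rn
        from staging;

        create table dim as
        select id, name from ranked where rn = 1;

        create table dim_hist as
        select id as dim_id, name from ranked where rn > 1;
    """)

conn = sqlite3.connect(":memory:")
conn.execute("create table staging (id integer, name text, loaded_at integer)")
conn.executemany(
    "insert into staging values (?, ?, ?)",
    [(185, "AAAA", 1), (185, "AAAB", 2),
     (197, "XXXA", 1), (197, "XXXB", 2), (197, "XXXC", 3)],
)
split_dim_and_history(conn)

dim_rows = conn.execute("select id, name from dim order by id").fetchall()
# Join back to the history table to list the older names for one id.
old_names_197 = [r[0] for r in conn.execute(
    "select h.name from dim d join dim_hist h on h.dim_id = d.id "
    "where d.id = 197 order by h.name")]
print(dim_rows, old_names_197)
```

With the sample rows from the question, `dim` keeps AAAB and XXXC, and the join returns the two older names for id 197.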

I don't know if this is the optimal solution, but it is the one we chose for our situation. Essentially, I perform a self join on the dimension table and add a column that holds the minimum name per ID. The variations are typically caused by data appended to a base string, so the base string is the shortest variant and also sorts first, which means min() returns it. That base string is the most important part of the field for us.

Here is the SQL code I wrote to perform this:

create table tmp_dim_offers as (
  -- CTE: pair each row with the minimum name for its id
  -- (min() on strings is lexicographic, not length-based)
  with normalized_dim_offers as (
      select
        t1.id_offer,
        t1.dim_offer_name,
        min(t2.dim_offer_name) as normalized_offer_name
      from dim_offers as t1
        join dim_offers as t2 on t1.id_offer = t2.id_offer
      group by 1, 2
      order by t1.id_offer
  )
  -- select distinct ids and normalized offer name
  select distinct
    normalized_dim_offers.id_offer              as id_offer,
    normalized_dim_offers.normalized_offer_name as dim_offer_name
  from normalized_dim_offers
  order by normalized_dim_offers.id_offer
);

-- drop existing dim_offers table and replace with new normalized table
begin;
alter table dim_offers rename to dim_offers_to_delete;
alter table tmp_dim_offers rename to dim_offers;
drop table dim_offers_to_delete cascade;
commit;
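
`min()` on a string column compares lexicographically, so it returns the base name only when every variant extends that base. If that is not guaranteed, the shortest name can be picked by length explicitly. A sketch under that assumption, using in-memory SQLite via Python's `sqlite3` as a local stand-in for Redshift (SQLite 3.25+ for `row_number()`); the sample offer names are invented for illustration:

```python
import sqlite3

SHORTEST_NAME_SQL = """
    select id_offer, dim_offer_name
    from (
        select id_offer, dim_offer_name,
               row_number() over (
                   partition by id_offer
                   -- shortest first; ties broken alphabetically
                   order by length(dim_offer_name), dim_offer_name
               ) as rn
        from dim_offers
    ) as ranked
    where rn = 1
    order by id_offer
"""

conn = sqlite3.connect(":memory:")
conn.execute("create table dim_offers (id_offer integer, dim_offer_name text)")
conn.executemany(
    "insert into dim_offers values (?, ?)",
    [(185, "Spring Sale - affiliate A"), (185, "Spring Sale"),
     (197, "B-side promo"), (197, "Promo")],  # "Promo" is not a prefix here
)
shortest = conn.execute(SHORTEST_NAME_SQL).fetchall()
lexical_min = conn.execute(
    "select id_offer, min(dim_offer_name) from dim_offers "
    "group by id_offer order by id_offer").fetchall()
print(shortest)
print(lexical_min)
```

For id 185 both approaches agree, because the base name is a prefix of its variant. For id 197 they differ: the length-based query picks "Promo", while `min()` picks "B-side promo".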

How about adding a new column to your dimension table and filling it with a UUID value? The UUID column then acts as the primary key.

This is also how I keep track of historical data. I think your problem is that someone has modified the records in the source table over time. By using a UUID as the primary key, we don't need to overwrite the record, so we can keep every version over time.
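
A sketch of what this could look like, assuming a hypothetical `dim_versions` table where a generated UUID is the primary key and the source `id` is kept as an ordinary column. It uses Python's `uuid` and `sqlite3` modules as a local stand-in for Redshift; the table and column names are illustrative only.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table dim_versions (
        dim_key   text primary key,  -- surrogate key (UUID), unique per version
        source_id integer not null,  -- id from the source system, may repeat
        name      text not null
    )
""")

def add_version(conn, source_id, name):
    """Insert a new version row instead of overwriting the existing one."""
    dim_key = str(uuid.uuid4())
    conn.execute("insert into dim_versions values (?, ?, ?)",
                 (dim_key, source_id, name))
    return dim_key

keys = [add_version(conn, sid, name) for sid, name in
        [(185, "AAAA"), (185, "AAAB"), (197, "XXXA")]]

versions_185 = [r[0] for r in conn.execute(
    "select name from dim_versions where source_id = 185 order by name")]
print(len(set(keys)), versions_185)
```

Fact rows would then reference `dim_key`, so the load process still has to decide which version of a source id each fact should point to.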

Comments
  • How many such name variations are possible per unique ID? Do you need to query on name as well?
  • Does the name change over time? Are you interested in the old names? Is this a data quality issue?
  • I have at most 5 variations per ID and it is only an issue with around 10% of the total population of IDs. I don't foresee a situation where we would actually need to query on name.
  • The issue is the marketing platform we pull from. It is not an ideal platform to work with, and this is just one of its many shortcomings. Several affiliates run the same campaigns and append some metadata to the overall campaign name, which is where these variations come from. The extra metadata isn't really important, but it would be nice to find a solution that lets us keep it.
  • I figured out a satisfactory approach already, but this is a good extension to that approach.
  • Curious, how did you handle it?
  • I just posted my solution as an answer.
  • One small suggestion: always use column names instead of ordinal positions in ORDER BY, GROUP BY, etc.