AWS Glue Truncate Redshift Table
I have created a Glue job that copies data from S3 (csv file) to Redshift. It works and populates the desired table.
However, I need to purge the table during this process as I am left with duplicate records after the process completes.
I'm looking for a way to add this purge to the Glue process. Any advice would be appreciated.
Have you had a look at Job Bookmarks in Glue? It's a feature for keeping track of the high-water mark of already-processed data, and it works with S3 sources only. I'm not 100% sure, but it may require partitioning to be in place.
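If you do experiment with bookmarks, they are switched on per run through the standard `--job-bookmark-option` job argument; a minimal start-job-run call might look like this (the job name is a placeholder):

```
aws glue start-job-run \
  --job-name s3-to-redshift-load \
  --arguments '{"--job-bookmark-option":"job-bookmark-enable"}'
```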
From the Amazon Redshift documentation on TRUNCATE: it deletes all of the rows from a table without doing a table scan, and is a faster alternative to an unqualified DELETE operation. To merge (upsert) into an Amazon Redshift table from AWS Glue, load the data into a staging table first and then run a merge query against it, substituting your own values for: target_table (the Amazon Redshift table), test_red (the catalog connection to use), and stage_table (the Amazon Redshift staging table).
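The staging-table merge described above can be sketched as a small SQL builder; the table and key names below are hypothetical, and the function only assembles the SQL (you would execute it against Redshift through your Glue connection):

```python
def build_merge_sql(target_table, stage_table, key_columns):
    """Build a delete-then-insert merge (upsert) for Redshift.

    Classic staging-table pattern: delete target rows whose keys match
    the staging table, then insert everything from staging, all inside
    one transaction so readers never see a half-merged table.
    """
    join_cond = " AND ".join(
        f"{target_table}.{c} = {stage_table}.{c}" for c in key_columns
    )
    return (
        f"BEGIN;"
        f" DELETE FROM {target_table} USING {stage_table} WHERE {join_cond};"
        f" INSERT INTO {target_table} SELECT * FROM {stage_table};"
        f" DROP TABLE {stage_table};"
        f" END;"
    )

print(build_merge_sql("public.target_table", "public.stage_table", ["id"]))
```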
You need to modify the auto-generated code provided by Glue: connect to Redshift over a Spark JDBC connection and execute the purge query there. To spin up the Glue containers inside the Redshift VPC, specify the connection in the Glue job so it can reach the Redshift cluster.
Hope this helps.
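As a sketch of the purge-query part (the schema and table names are illustrative; actually running the statement needs a live cluster, so only the SQL construction is shown, with a cheap identifier check so user input can't smuggle extra SQL into the statement):

```python
import re

def build_purge_sql(schema, table):
    """Return a TRUNCATE statement for schema.table, rejecting names
    that are not plain identifiers (a simple guard against injection).

    The returned string would be executed on Redshift, e.g. through a
    Postgres-compatible driver or as a writer preaction.
    """
    ident = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
    if not (ident.match(schema) and ident.match(table)):
        raise ValueError(f"invalid identifier: {schema}.{table}")
    return f"TRUNCATE TABLE {schema}.{table}"

print(build_purge_sql("public", "my_target"))  # TRUNCATE TABLE public.my_target
```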
Note that to execute a TRUNCATE command, you must be the owner of the table or a superuser.
You can use the Spark/PySpark Databricks library to do an append after truncating the table (this performs better than an overwrite):

preactions = "TRUNCATE TABLE <schema.table>"
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", redshift_url) \
    .option("dbtable", redshift_table) \
    .option("user", user) \
    .option("password", redshift_password) \
    .option("aws_iam_role", redshift_copy_role) \
    .option("tempdir", args["TempDir"]) \
    .option("preactions", preactions) \
    .mode("append") \
    .save()
You can take a look at the Databricks documentation here.
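One way to keep those writer options manageable is to assemble them in a plain dict first; the names below mirror the snippet above, and the URL, credentials, and role are placeholders:

```python
def redshift_write_options(url, table, user, password, iam_role, tempdir,
                           preactions=""):
    """Collect options for the com.databricks.spark.redshift writer.

    preactions runs on Redshift before the COPY, which is where the
    TRUNCATE goes so the load replaces the old rows in one pass.
    """
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "aws_iam_role": iam_role,
        "tempdir": tempdir,
        "preactions": preactions,
    }

opts = redshift_write_options(
    "jdbc:redshift://example:5439/dev", "public.my_target",
    "awsuser", "secret", "arn:aws:iam::123456789012:role/copy-role",
    "s3://my-temp/", preactions="TRUNCATE TABLE public.my_target",
)
# then: df.write.format("com.databricks.spark.redshift").options(**opts).mode("append").save()
```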
That thread talked about using two jobs: truncate the Redshift table first, then run an AWS Glue job to load the data. I'm trying to find a way to achieve this in one Glue job, i.e., plug in code that connects to the Redshift DB, truncates the target table, and then loads the data. Is that possible? Any sample code?
AWS Glue and AWS Data Pipeline are two of the easiest services to use for loading data into a Redshift table; Data Pipeline's copy activity supports the insert modes KEEP EXISTING, OVERWRITE EXISTING, TRUNCATE, and APPEND. Alternatively, use a staging table: insert all rows there, then perform an upsert/merge into the main table (this has to be done outside of Glue). Or add another column to your Redshift table, like an insert timestamp, to allow duplicates but know which row came first or last, and then delete the duplicates afterwards if you need to.
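The insert-timestamp cleanup could be expressed as a single delete that keeps the newest row per key; the table, key, and timestamp column names below are hypothetical, and the exact SQL may need adapting to your Redshift version:

```python
def build_dedup_sql(table, key, ts_col):
    """Delete older duplicates, keeping the row with the latest ts_col
    per key, by comparing each row against the per-key max timestamp."""
    return (
        f"DELETE FROM {table} "
        f"USING (SELECT {key}, MAX({ts_col}) AS max_ts "
        f"FROM {table} GROUP BY {key}) latest "
        f"WHERE {table}.{key} = latest.{key} "
        f"AND {table}.{ts_col} < latest.max_ts"
    )

print(build_dedup_sql("public.my_target", "id", "inserted_at"))
```

Note this keeps ties (rows sharing the exact max timestamp), which is usually acceptable for load-level dedup; a ROW_NUMBER() window would break ties if that matters.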