Loading data from AWS Redshift using Python

I'm facing a near-impossible task: extracting a huge amount of data from Amazon Redshift into another table. It definitely requires a more efficient approach, but I'm new to SQL and AWS, so I decided to ask this smart community for advice.

This is my initial SQL query, which takes forever:

-- STEP 1: CREATE A SAMPLE FOR ONE MONTH
SELECT DISTINCT at_id, utc_time, name
INTO my_new_table
FROM s3_db.table_x
WHERE type = 'create' 
AND (dt BETWEEN '20181001' AND '20181031');

What would be the best approach? I was thinking of using Python and SQLAlchemy to read dataframes in chunks of 1m rows and insert them back into the new table (which I would need to create beforehand). Would this work?:

from sqlalchemy import create_engine
import os
import pandas as pd

redshift_user = os.environ['REDSHIFT_USER']
redshift_password = os.environ['REDSHIFT_PASSWORD']

# XXXX is a placeholder for the cluster port
engine_string = "postgresql+psycopg2://%s:%s@%s:%d/%s" \
    % (redshift_user, redshift_password, 'localhost', XXXX, 'redshiftdb')
engine = create_engine(engine_string)

# read the source in 1m-row chunks and append each chunk
# into my_new_table (created beforehand)
for df in pd.read_sql_query("""
                        SELECT DISTINCT at_id, utc_time, name
                        FROM s3_db.table_x
                        WHERE type = 'create'
                        AND (dt BETWEEN '20181001' AND '20181031');
                       """, engine, chunksize=1000000):
    df.to_sql('my_new_table', engine, if_exists='append', index=False)

You should use CREATE TABLE AS.

This allows you to specify a SELECT statement and have the results directly stored into a new table.

This is hugely more efficient than downloading data and re-uploading.
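
A minimal sketch of that approach, reusing only the table and column names from the question, could look like:

-- CREATE TABLE AS runs the whole copy inside Redshift, so no data
-- leaves the cluster. Table and column names are taken from the question.
CREATE TABLE my_new_table AS
SELECT DISTINCT at_id, utc_time, name
FROM s3_db.table_x
WHERE type = 'create'
  AND dt BETWEEN '20181001' AND '20181031';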

You can also CREATE TABLE LIKE and then load it with data. See: Performing a Deep Copy
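
The general shape of that pattern, sketched with hypothetical table names old_table and new_table standing in for your own, is roughly:

-- CREATE TABLE LIKE copies the structure (including keys), then an
-- INSERT ... SELECT reloads the data; names here are hypothetical.
CREATE TABLE new_table (LIKE old_table);

INSERT INTO new_table
SELECT * FROM old_table;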

You could also UNLOAD data to Amazon S3, then load it again via COPY, but using CREATE TABLE AS is definitely the best option.
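
A sketch of that route, where the S3 path and IAM role are placeholders you would replace with your own (and my_new_table must already exist before the COPY step):

-- UNLOAD the query result to S3, then COPY it into the target table.
-- The bucket and IAM role below are placeholders.
UNLOAD ('SELECT DISTINCT at_id, utc_time, name
         FROM s3_db.table_x
         WHERE type = ''create''
           AND dt BETWEEN ''20181001'' AND ''20181031''')
TO 's3://your-bucket/table_x_extract/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
FORMAT AS PARQUET;

COPY my_new_table
FROM 's3://your-bucket/table_x_extract/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
FORMAT AS PARQUET;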

Please refer to the AWS guidelines for Redshift and Spectrum best practices; I've put the links at the end of this post. Based on your question, I am assuming you want to extract, transform and load a huge amount of data from the Redshift Spectrum based table "s3_db.table_x" into the new Redshift table "my_new_table".

Here are some suggestions based on AWS recommendations:

  1. Create your Redshift table with an appropriate distribution key, sort key and compression encoding. At a high level, "at_id" seems best suited as the distribution key and "utc_time" as the sort key for your requirement, but make sure to refer to the AWS guidelines for Redshift table design [3].

  2. As you mentioned, your data volume is huge, so you may want to have your S3 source table "s3_db.table_x" partitioned on the "type" and "dt" columns (as suggested at point number 4 in the Spectrum best practices [1]).

  3. Replace DISTINCT with GROUP BY in the SELECT query against Spectrum (point number 9 in the Spectrum best practices [1]).

  4. AWS recommends (point number 7 in the Spectrum best practices [1]) simplifying your ETL process with CREATE TABLE AS SELECT or SELECT INTO statements, where you can put your transformation logic in the SELECT component to load data directly from S3 into Redshift; a combined sketch follows this list.
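
A rough sketch that pulls points 1, 3 and 4 together (the key choices are the assumptions from point 1; validate them against the table design playbook linked below):

-- CTAS that sets the assumed distribution and sort keys and uses
-- GROUP BY instead of DISTINCT to deduplicate.
CREATE TABLE my_new_table
DISTKEY (at_id)
SORTKEY (utc_time)
AS
SELECT at_id, utc_time, name
FROM s3_db.table_x
WHERE type = 'create'
  AND dt BETWEEN '20181001' AND '20181031'
GROUP BY at_id, utc_time, name;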

  1. Redshift Spectrum best practices
  2. Redshift best practices
  3. Redshift table design playbook

Comments
  • You're moving data from one Redshift table to another Redshift table?
  • Yes, that's correct.
  • A database operation will ultimately be faster than going to pandas and then back to Redshift; the problem is that it takes too long right now?
  • Exactly - it's nearly impossible to complete, as I get a broken pipe from time to time and lose everything. Is there a way to do it in chunks in SQL?
  • You need to take a more detailed look at how the source table is structured. Is it an actual Redshift table, or as the name "s3_db" seems to imply, is it a Spectrum table? If the latter, the data may exist as files in S3 rather than Redshift itself and you can potentially use other tools such as EMR/Spark, Athena or Glue to create the new data set. If it IS an actual Redshift table then check what the distribution key and sort key are for the source table. If you can apply a filter to the sort key this may help speed up the query.
  • John, this is good general logic, but in this case I think the OP is trying to copy from S3 to Redshift (we need clarification).