AWS: Automating queries in Redshift

I want to automate a Redshift insert query to run every day.

We use an AWS environment. I was told that using Lambda is not the right approach. What is the best ETL process for automating a query in Redshift?

To automate SQL on Redshift you have at least four options:

Simple - cron: Use an EC2 instance and set up a cron job on it to run your SQL code.

psql -U youruser -p 5439 -h hostname_of_redshift -f your_sql_file

Feature rich - Airflow (Recommended): If you have a complex schedule to run, it is worth investing time in learning and using Apache Airflow. It also needs to run on a server (EC2) but offers a lot of functionality.

https://airflow.apache.org/

AWS serverless - AWS Data Pipeline (NOT recommended)

https://aws.amazon.com/datapipeline/

CloudWatch -> Lambda -> EC2: the method described by John Rotenstein in his answer. This is a good method when you want to be AWS-centric, and it will be cheaper than having a dedicated EC2 instance.
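For the simple cron option above, cron does not need to know anything about SQL: it only runs a command on a schedule, and that command calls psql with your SQL file. The sketch below wraps the same psql call in a small Python script. The function names are made up for illustration, the optional database name and the ON_ERROR_STOP setting are additions not present in the original command, and the password is assumed to come from PGPASSWORD or ~/.pgpass.

```python
import subprocess


def build_psql_command(host, user, sql_file, port=5439, dbname=None):
    """Return the psql argument list for running sql_file against Redshift."""
    cmd = ["psql", "-U", user, "-p", str(port), "-h", host]
    if dbname:
        # Optional: without -d, psql picks its default database.
        cmd += ["-d", dbname]
    # ON_ERROR_STOP makes psql exit with a non-zero status when a statement
    # fails, so the failure is visible to cron or to a wrapper script.
    cmd += ["-v", "ON_ERROR_STOP=1", "-f", sql_file]
    return cmd


def run_sql_file(host, user, sql_file, **kwargs):
    """Run the SQL file through psql and return psql's exit code."""
    # No password is passed here; psql reads it from PGPASSWORD or ~/.pgpass.
    result = subprocess.run(build_psql_command(host, user, sql_file, **kwargs))
    return result.returncode


# Example call with the placeholder values from the command above:
# exit_code = run_sql_file("hostname_of_redshift", "youruser", "your_sql_file")
```

With the example call enabled at the bottom of the script, a crontab entry such as `0 6 * * * python3 /path/to/run_daily_sql.py` (the path is hypothetical) would run it every day at 06:00. A plain shell script containing the psql line would work just as well; the Python wrapper is only one way to do it.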

You can use boto3 and psycopg2 to run the queries by creating a Python script and scheduling it with cron to run daily.
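A minimal sketch of such a script is shown here, under several assumptions: it uses only psycopg2 (boto3 is left out), it reads connection details from environment variables whose names are invented for this example, and the INSERT statement is a placeholder rather than the query from the question.

```python
import os

# Placeholder statement; replace it with the real daily INSERT query.
DAILY_INSERT_SQL = "INSERT INTO target_table SELECT * FROM source_table;"


def connection_settings(env=os.environ):
    """Collect psycopg2.connect() keyword arguments from environment variables.

    The REDSHIFT_* variable names are assumptions made for this sketch.
    """
    return {
        "host": env["REDSHIFT_HOST"],
        "port": int(env.get("REDSHIFT_PORT", "5439")),
        "dbname": env["REDSHIFT_DB"],
        "user": env["REDSHIFT_USER"],
        "password": env["REDSHIFT_PASSWORD"],
    }


def run_statement(sql, settings):
    """Execute one SQL statement and commit it."""
    # Imported here so that connection_settings() can be used and tested
    # on a machine where the psycopg2 driver is not installed.
    import psycopg2

    conn = psycopg2.connect(**settings)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
        conn.commit()
    finally:
        conn.close()


# Example call, e.g. at the bottom of the script that cron runs:
# run_statement(DAILY_INSERT_SQL, connection_settings())
```

Keeping the credentials in environment variables (or another secret store) avoids hard-coding them in the script that cron executes.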

You can also try converting your queries into Spark jobs and scheduling them to run daily in AWS Glue. If that proves difficult, you can look into Spark SQL instead. If you go with Spark SQL, keep memory usage in mind, as Spark SQL is quite memory-intensive.

Comments
  • Can you provide more details about what the query is doing and how long it takes to run? Did they suggest why Lambda was not the right approach?
  • Have you looked into Amazon QuickSight for scheduled reports (queries from Redshift)?
  • @JohnRotenstein The query joins a few Redshift tables (such as stl_query, stl_session, stl_ddltext) and loads the result into a custom table, and it needs to run every day. The reason they gave for Lambda not being the right approach is that a function can only be active for 300 seconds, so what happens if my query takes more than 5 minutes to run? Please advise.
  • Yes, the 5-minute limit is the important factor. If the query is likely to take longer than 5 minutes, Lambda is not an option.
  • @JohnRotenstein Yes, my query shouldn't take more than 5 minutes, but in the worst case, if it does, this process will not be suitable. Please advise.
  • Thanks Jon! I'm not a cron expert, so I have a few questions. We have an EC2 instance and can set up a cron job on it. How will my cron job trigger the SQL? Should I write a shell script? Please explain.
  • That's also a valid option. I will add mention of it to my answer above. (I have not used that method myself yet)