Machine Learning & Big Data Blog

Writing SQL Statements in Amazon Redshift

Walker Rowe
4 minute read
Walker Rowe
image_pdfimage_print

In this tutorial, we show how to write Amazon Redshift SQL statements. Since this topic is large and complex, we start with the basics.

This tutorial will show you how to:

  • Use the query editor
  • Aggregate rows using group by
  • Convert dates to year and month
  • Export the results to a csv file

Redshift query editor

To open the query editor, click the editor from the clusters screen. Redshift will then ask you for your credentials to connect to a database. One nice feature is there is an option to generate temporary credentials, so you don’t have to remember your password. It’s good enough to have a login to the Amazon AWS Console.

Below we have one cluster which we are resuming after having it in a paused state (to reduce Amazon billing charges).

 

You write the SQL statement here. Only one statement is allowed at a time, since Redshift can only display one set of results at a time. To write more than one statement click the plus (+) to add an additional tab.

When you run each query, it takes a few seconds as it submits the job and then runs it. So, it’s not instantaneous, as you might expect with other products.

The results are shown at the bottom where you can export those as a CSV, TXT, or HTML. You can also chart the results.

Get table schema

For this tutorial, we use a table of weather data. (See more on loading data to Amazon Redshift from S3.) This is 20 years of weather data for Paphos, Cyprus. It has four columns:

  • dt_iso
  • temp
  • temp_min
  • temp_max

dt_dso is of type timestamp and is the primary key. One nice thing about Redshift is you can load the date in almost any format you want, and Redshift understands that. Then Redshift provides the to_char() function to print out any part of the date you want, like the hour, year, minute, etc.

To look at the table schema query the pg_table_def table.

SELECT *

FROM pg_table_def

WHERE tablename = 'paphos'

AND schemaname = 'public';

Here is the schema.

schemaname,tablename,column,type,encoding,distkey,sortkey,notnull

public,paphos,dt_iso,timestamp without time zone,none,t,1,t

public,paphos,temp,real,none,f,0,f

public,paphos,temp_min,real,none,f,0,f

public,paphos,temp_max,real,none,f,0,f

Aggregate SQL statements

This query calculates the average temperature per month for the summer months May through September. Notice:

  • to_char() extracts any portion of the date that you want, such as YYYY year or MM month number.
  • We use the in() statement to select the months.
  • The order statement uses a 1. That means use the first column returned by the query. That’s an alternative to typing the column name.
  • We group by the year and month since we want to calculate the average [avg()] for month within the year
  • We use the round() function to round two decimal places. Otherwise Redshift gives too many decimal places.
  • As with other databases, the as statement lets us give an alias to the column resulting from the calculating. Without it the column would not have a descriptive name. Here we call the average temperature aveTemp.
select round(avg(temp),2) as aveTemp,

                 to_char(dt_iso,'YYYY') as year,

                 to_char(dt_iso,'MM') as month

                 from paphos where

                 month in ('05','06','07','08','09')

                 group by year, month

                 order by 1 desc

Here are the results. It shows the hottest months for the 20 years of data. I have cut off the display to make it short. For example, in the 20 years, August 2010 was the hottest month.

We grouped by year then month as we want the month within the year given daily weather observation.

avetemp year month
84.11 2010 8
83.12 2012 8
83.05 2012 7
82.9 2015 8
82.39 2017 7
82.04 2014 8
81.85 2007 7
81.73 2020 9
81.72 2013 8
81.72 2008 8
81.62 2000 7
81.61 2009 8
81.49 2017 8

We export the data to a csv format using the button to the right of the results. Then we import it to a spreadsheet so that we can more easily see the results and give it colors and such.

Here are the hottest years. We get that by dropping the month from the aggregation.

select round(avg(temp),2) as aveTemp,

                 to_char(dt_iso,'YYYY') as year 

                                    from paphos 

                                                group by year

                 order by 1 desc

Additional resources

For more tutorials like this, explore these resources:

Automate workflows to simplify your big data lifecycle

In this e-book, you’ll learn how you can automate your entire big data lifecycle from end to end—and cloud to cloud—to deliver insights more quickly, easily, and reliably.


These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

See an error or have a suggestion? Please let us know by emailing blogs@bmc.com.

BMC Bring the A-Game

From core to cloud to edge, BMC delivers the software and services that enable nearly 10,000 global customers, including 84% of the Forbes Global 100, to thrive in their ongoing evolution to an Autonomous Digital Enterprise.
Learn more about BMC ›

About the author

Walker Rowe

Walker Rowe

Walker Rowe is an American freelancer tech writer and programmer living in Cyprus. He writes tutorials on analytics and big data and specializes in documenting SDKs and APIs. He is the founder of the Hypatia Academy Cyprus, an online school to teach secondary school children programming. You can find Walker here and here.