Amazon Redshift Spectrum at a glance

I wrote about Amazon Redshift in a previous article, and I chose to dedicate a separate article to Amazon Redshift Spectrum because its properties differ. This is based on my understanding of the available AWS documentation.

Introduction

  • Amazon Redshift Spectrum can be used to efficiently query very large amounts of data stored in files on S3.
  • The service is used from within a Redshift cluster using normal Redshift SQL queries. A SQL query can join normal Redshift tables with the virtual (external) tables accessed through Amazon Redshift Spectrum.
  • Amazon Redshift Spectrum runs on its own dedicated servers, which are independent of the Redshift cluster servers.
  • It uses partition pruning to skip partitions that cannot match the query's filters.
  • It supports several data formats, compressed files and S3 server-side encryption.
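
To make the partition pruning point more concrete, here is a toy sketch in Python. It assumes Hive-style S3 key prefixes such as year=2023/month=01/ (a common layout, but an assumption here), and the file paths and helper names are hypothetical. It only illustrates the idea of skipping partitions whose values cannot match the query filter; it does not describe how Spectrum is implemented internally.

```python
# Toy model of partition pruning; all names and paths are hypothetical.

def parse_partition(key: str) -> dict:
    """Extract Hive-style partition values (name=value) from an S3 key."""
    parts = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


def prune(keys: list, wanted: dict) -> list:
    """Keep only the keys whose partition values match every filter."""
    kept = []
    for key in keys:
        parts = parse_partition(key)
        if all(parts.get(name) == value for name, value in wanted.items()):
            kept.append(key)
    return kept


keys = [
    "sales/year=2022/month=12/part-0000.parquet",
    "sales/year=2023/month=01/part-0000.parquet",
    "sales/year=2023/month=02/part-0000.parquet",
]

# A filter like WHERE year = '2023' AND month = '01' needs only one file.
to_scan = prune(keys, {"year": "2023", "month": "01"})
print(to_scan)
```

In this model, the fewer partitions survive the filter, the less S3 data has to be read at all.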

Service type

Serverless: as a user you don't provision or control the servers used by the service.

How the service is accessed

  • Amazon Redshift Spectrum can be used only from within Redshift SQL queries, so the client must already be connected to a Redshift endpoint.
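
As an illustration, the Python sketch below builds a Redshift SQL query that joins an ordinary Redshift table with an external table exposed through Spectrum. The schema, table and column names (spectrum_schema.sales, public.customers, customer_id and so on) are hypothetical. The query is only built as a string; you would send it through whatever client is already connected to your Redshift endpoint.

```python
# Hypothetical names: spectrum_schema.sales stands for an external
# (S3-backed) table, public.customers for an ordinary Redshift table.

def build_join_query(external_table: str, local_table: str, year: str) -> str:
    """Return a SQL query joining an external table with a local one.

    The values are interpolated directly for readability; real code
    should use bound parameters for anything coming from user input.
    """
    return (
        "SELECT c.customer_name, SUM(s.amount) AS total_amount\n"
        f"FROM {external_table} AS s\n"
        f"JOIN {local_table} AS c ON c.customer_id = s.customer_id\n"
        f"WHERE s.year = '{year}'\n"
        "GROUP BY c.customer_name;"
    )


query = build_join_query("spectrum_schema.sales", "public.customers", "2023")
print(query)
```

From the client's point of view this is a normal Redshift query; the only visible difference is that one of the tables lives in an external schema.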

Scalability and performance

  • Amazon Redshift Spectrum scales automatically, possibly to thousands of nodes, to scan, filter, group and aggregate the data within S3.
  • The resulting data is then returned to the Redshift cluster nodes for further processing. This stage is limited by the resources available within your Redshift cluster.

Availability

  • Amazon Redshift Spectrum is highly available and multi-AZ.

Security

  • Amazon Redshift Spectrum doesn't persist any data; all processing is done on transient in-memory data.
  • It relies on the security features of other AWS services like S3 and Redshift.
  • Amazon Redshift Spectrum's access to your S3 buckets and AWS Glue catalogues is controlled by IAM policies.
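
To give an idea of what such an IAM policy might look like, the Python sketch below builds a read-only policy document as JSON. The bucket name is hypothetical, and the list of actions is my assumption of a plausible minimum (reading objects, listing the bucket, reading Glue catalog metadata), not an official or complete policy. Check the AWS documentation for the exact permissions your setup needs.

```python
import json

BUCKET = "my-spectrum-data"  # hypothetical bucket name

# Assumed minimal read-only permissions; adjust to your own setup.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Read data files and list the bucket contents.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        },
        {
            # Read table and partition metadata from the Glue catalog.
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetTable",
                "glue:GetPartitions",
            ],
            "Resource": "*",
        },
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

A policy like this would be attached to the IAM role that the Redshift cluster uses when it runs Spectrum queries.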

Integrations

  • S3: to read data files
  • Redshift: Amazon Redshift Spectrum can be used only within Redshift SQL queries.
  • AWS Glue: Amazon Redshift Spectrum can use an external data catalog defined in AWS Glue.
  • Apache Hive: Amazon Redshift Spectrum can use an external data catalog defined in an Apache Hive metastore that you manage.
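
The two catalog options above map to two variants of the CREATE EXTERNAL SCHEMA statement. The Python sketch below builds both as strings, based on my reading of the Redshift SQL reference. Every identifier in it is hypothetical: the schema and database names, the role ARN, and the metastore host and port. Verify the syntax against the current reference before using it.

```python
# Hypothetical identifiers throughout: role ARN, schema and database
# names, and the Hive metastore host and port.
ROLE_ARN = "arn:aws:iam::123456789012:role/MySpectrumRole"


def glue_schema_ddl(schema: str, database: str, role_arn: str) -> str:
    """DDL for an external schema backed by the AWS Glue Data Catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG DATABASE '{database}'\n"
        f"IAM_ROLE '{role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def hive_schema_ddl(
    schema: str, database: str, host: str, port: int, role_arn: str
) -> str:
    """DDL for an external schema backed by an Apache Hive metastore."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM HIVE METASTORE DATABASE '{database}'\n"
        f"URI '{host}' PORT {port}\n"
        f"IAM_ROLE '{role_arn}';"
    )


glue_ddl = glue_schema_ddl("spectrum_schema", "spectrum_db", ROLE_ARN)
hive_ddl = hive_schema_ddl("hive_schema", "hive_db", "172.10.10.10", 9083, ROLE_ARN)
print(glue_ddl)
print(hive_ddl)
```

Once either schema exists, the external tables registered in that catalog can be referenced in queries as schema_name.table_name.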

Service limitations

  • The S3 bucket and the Amazon Redshift cluster must reside in the same AWS Region.

Pricing

The cost of using the service consists of:

  • The Amazon Redshift Spectrum charge, based on the amount of S3 data scanned (per TB).
  • Normal S3 rates for data storage and requests.
  • The cost of the provisioned Redshift cluster, as the service can only be used from within Amazon Redshift.
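
Because the Spectrum charge depends on the data scanned, a rough estimate is simple arithmetic. The Python sketch below assumes a rate of 5 USD per TB and treats 1 TB as 1024^4 bytes; both numbers are assumptions for illustration only, so check the current AWS pricing page for your Region. S3 and Redshift cluster costs are not included.

```python
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabytes


def spectrum_scan_cost(bytes_scanned: int, price_per_tb: float) -> float:
    """Estimate the Spectrum charge for one query from the bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


PRICE_PER_TB = 5.0  # assumed example rate in USD; not an official figure

# Scanning a full 2 TB data set versus only a quarter of a terabyte
# after partitions that don't match the query have been skipped.
full_scan = spectrum_scan_cost(2 * BYTES_PER_TB, PRICE_PER_TB)
pruned_scan = spectrum_scan_cost(BYTES_PER_TB // 4, PRICE_PER_TB)
print(full_scan, pruned_scan)
```

The comparison shows why reducing the amount of scanned data matters: the charge falls in direct proportion to the bytes read.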