Amazon Redshift Spectrum at a glance
I wrote about Amazon Redshift in a previous article, and I preferred to dedicate a separate article to Amazon Redshift Spectrum due to its different properties. This is based on my understanding of the available AWS documentation.
- Amazon Redshift Spectrum can be used to efficiently query a very large amount of data that resides in S3 files.
- The service is used from within a Redshift cluster using normal Redshift SQL queries. The SQL query can join normal Redshift tables with the virtual tables accessed through Amazon Redshift Spectrum.
- Amazon Redshift Spectrum has its dedicated servers that are independent of the Redshift cluster servers.
- It has advanced partition pruning to skip partitions that a query's filters eliminate.
- It supports several data formats, compressed files, and S3 server-side encryption.
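As a sketch of how such virtual tables are typically defined (the schema, table, role, bucket, and column names below are all hypothetical placeholders), an external schema is registered against a data catalog, and an external table is mapped onto files in S3; declaring a partition column is what enables the partition pruning mentioned above:

```sql
-- Hypothetical example: register an external schema backed by the
-- AWS Glue Data Catalog (the IAM role ARN and names are placeholders).
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Map an external table onto files in S3. The partition column lets
-- Redshift Spectrum skip S3 prefixes that a query filters out.
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id  BIGINT,
    item_id  INT,
    amount   DECIMAL(10,2)
)
PARTITIONED BY (sale_date DATE)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/sales/';
```

Note that when the table is defined this way, each partition still has to be registered (for example with `ALTER TABLE ... ADD PARTITION`) before Redshift Spectrum can see its files.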
Serverless: as a user you don't provision or control the servers used by the service.
How the service is accessed
- Amazon Redshift Spectrum can be used only from within Redshift SQL queries, so the client must already be connected to a Redshift endpoint.
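A minimal sketch of such a query (all table and column names are hypothetical): a regular Redshift table is joined with a Spectrum external table, and the filter on the partition column means only the matching S3 partitions need to be scanned:

```sql
-- Hypothetical query: join a local Redshift table with a partitioned
-- external table. The WHERE clause on the partition column allows
-- Spectrum to prune S3 partitions outside the date range.
SELECT i.item_name,
       SUM(s.amount) AS total_amount
FROM   spectrum_schema.sales AS s   -- external table backed by S3
JOIN   public.items          AS i   -- regular Redshift table
       ON i.item_id = s.item_id
WHERE  s.sale_date BETWEEN '2023-01-01' AND '2023-01-31'
GROUP  BY i.item_name;
```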
Scalability and performance
- Amazon Redshift Spectrum scales automatically, potentially to thousands of nodes, to scan, filter, aggregate, and group the data in S3.
- The results are then returned to the Redshift cluster nodes for further processing (such as joins and final aggregation). This stage is limited to the resources available within your Redshift cluster.
- Amazon Redshift Spectrum is highly available and multi-AZ.
- Amazon Redshift Spectrum doesn't persist any data and all processing is done using transient in memory data.
Security
- It relies on the security features of the other AWS services it works with, such as S3 and Redshift.
- Amazon Redshift Spectrum access to your S3 buckets and AWS Glue catalogs is controlled by IAM policies.
Integration with other AWS services
- S3: Amazon Redshift Spectrum reads the data files directly from S3.
- Redshift: Amazon Redshift Spectrum can be used only within Redshift SQL queries.
- AWS Glue: Amazon Redshift Spectrum can use an external data catalog defined in AWS Glue.
- Apache Hive: Amazon Redshift Spectrum can use an external data catalog defined in your managed Apache Hive metastore.
- The S3 bucket and the Amazon Redshift cluster must reside in the same AWS Region.
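A minimal sketch of an IAM policy for the role the cluster assumes (the bucket name is a placeholder, and real policies are usually scoped more narrowly): it grants read access to the S3 data files and read access to the Glue Data Catalog:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSpectrumDataFiles",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-example-bucket",
        "arn:aws:s3:::my-example-bucket/*"
      ]
    },
    {
      "Sid": "ReadGlueCatalog",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions"
      ],
      "Resource": "*"
    }
  ]
}
```

The role with this policy is the one referenced in the `IAM_ROLE` clause when the external schema is created.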
The service is priced based on:
- The amount of S3 data scanned, billed per TB (at the commonly published rate of $5 per TB, a query that scans 2 TB would cost about $10; rates can vary by Region).
- Normal S3 rates apply for data storage and requests.
- Because the service is used from within Amazon Redshift, the additional cost of the provisioned Redshift cluster should also be considered.