Amazon Redshift Spectrum at a glance
I wrote about Amazon Redshift in a previous article, and I preferred to dedicate a separate article to Amazon Redshift Spectrum due to its different properties. This is based on my understanding of the available AWS documentation.
- Amazon Redshift Spectrum can be used to efficiently query a very large amount of data that resides in S3 files.
- The service is used from within a Redshift cluster using normal Redshift SQL queries. The SQL query can join normal Redshift tables with the virtual tables accessed through Amazon Redshift Spectrum.
- Amazon Redshift Spectrum has its dedicated servers that are independent of the Redshift cluster servers.
- It has advanced partition pruning to skip partitions that a query's filters eliminate.
- It supports several data formats, compressed files, and S3 server-side encryption.
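As a sketch of how such virtual tables are typically defined (the schema, table, role, bucket, and column names below are all hypothetical placeholders), an external schema is registered against a data catalog, and an external table is mapped onto files in S3; declaring a partition column is what enables the partition pruning mentioned above:

```sql
-- Hypothetical example: register an external schema backed by the
-- AWS Glue Data Catalog (the IAM role ARN and names are placeholders).
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Map an external table onto files in S3. The partition column lets
-- Redshift Spectrum skip S3 prefixes that a query filters out.
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id  BIGINT,
    item_id  INT,
    amount   DECIMAL(10,2)
)
PARTITIONED BY (sale_date DATE)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/sales/';
```

Note that when the table is defined this way, each partition still has to be registered (for example with `ALTER TABLE ... ADD PARTITION`) before Redshift Spectrum can see its files.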
Serverless: as a user you don't provision or control the servers used by the service.
How the service is accessed
- Amazon Redshift Spectrum can be used only from within Redshift SQL queries, so the client must already be connected to a Redshift endpoint.
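A minimal sketch of such a query (all table and column names are hypothetical): a regular Redshift table is joined with a Spectrum external table, and the filter on the partition column means only the matching S3 partitions need to be scanned:

```sql
-- Hypothetical query: join a local Redshift table with a partitioned
-- external table. The WHERE clause on the partition column allows
-- Spectrum to prune S3 partitions outside the date range.
SELECT i.item_name,
       SUM(s.amount) AS total_amount
FROM   spectrum_schema.sales AS s   -- external table backed by S3
JOIN   public.items          AS i   -- regular Redshift table
       ON i.item_id = s.item_id
WHERE  s.sale_date BETWEEN '2023-01-01' AND '2023-01-31'
GROUP  BY i.item_name;
```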
Scalability and performance
- Amazon Redshift Spectrum scales automatically, potentially to thousands of nodes, to scan, filter, aggregate, and group the data in S3.
- The results are then returned to the Redshift cluster nodes for further processing (such as joins and final aggregation). This stage is limited to the resources available within your Redshift cluster.
- Amazon Redshift Spectrum is highly available and multi-AZ.
- Amazon Redshift Spectrum doesn't persist any data and all processing is done using transient in memory data.
Security
- It relies on the security features of the other AWS services it works with, such as S3 and Redshift.
- Amazon Redshift Spectrum access to your S3 buckets and AWS Glue catalogs is controlled by IAM policies.
Integration with other AWS services
- S3: Amazon Redshift Spectrum reads the data files directly from S3.
- Redshift: Amazon Redshift Spectrum can be used only within Redshift SQL queries.
- AWS Glue: Amazon Redshift Spectrum can use an external data catalog defined in AWS Glue.
- Apache Hive: Amazon Redshift Spectrum can use an external data catalog defined in your managed Apache Hive metastore.
- The S3 bucket and the Amazon Redshift cluster must reside in the same AWS Region.
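A minimal sketch of an IAM policy for the role the cluster assumes (the bucket name is a placeholder, and real policies are usually scoped more narrowly): it grants read access to the S3 data files and read access to the Glue Data Catalog:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSpectrumDataFiles",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-example-bucket",
        "arn:aws:s3:::my-example-bucket/*"
      ]
    },
    {
      "Sid": "ReadGlueCatalog",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions"
      ],
      "Resource": "*"
    }
  ]
}
```

The role with this policy is the one referenced in the `IAM_ROLE` clause when the external schema is created.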
The service is priced based on:
- The amount of S3 data scanned, billed per TB (at the commonly published rate of $5 per TB, a query that scans 2 TB would cost about $10; rates can vary by Region).
- Normal S3 rates apply for data storage and requests.
- Because the service is used from within Amazon Redshift, the additional cost of the provisioned Redshift cluster should also be considered.