Amazon Redshift at a glance

This is a very short, point-by-point overview of Amazon Redshift, based on my understanding of the available AWS documentation.

Introduction

Amazon Redshift is a managed data warehouse database service. It is optimized for complex analytical queries that process large amounts of data in multi-stage operations.

It is based on a modified version of PostgreSQL 8.0.2, extended with the following features:

  • Parallel processing
  • Columnar data storage
  • Efficient data compression

An Amazon Redshift cluster is composed of a Leader Node and one or more Compute Nodes. Each Compute Node is further divided into Node Slices. The number and type of Compute Nodes are chosen by the user and determine the compute and storage capacity of the cluster.

The client communicates only with the Leader Node, which distributes the tasks to the Node Slices and then aggregates the results before returning them to the client.
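The scatter-and-gather flow described above can be sketched in a few lines of Python. This is a toy model, not Redshift's actual implementation: the function names (`distribute_rows`, `slice_partial_sum`, `leader_sum`) and the hash-by-key distribution are assumptions made only for illustration.

```python
from collections import defaultdict
import zlib


def distribute_rows(rows, num_slices):
    """Assign each (key, value) row to a slice by hashing its key (toy distribution)."""
    slices = defaultdict(list)
    for key, value in rows:
        slice_id = zlib.crc32(key.encode("utf-8")) % num_slices
        slices[slice_id].append((key, value))
    return slices


def slice_partial_sum(slice_rows):
    """Work done independently on one slice: sum only the values it holds."""
    return sum(value for _, value in slice_rows)


def leader_sum(rows, num_slices):
    """Leader role: scatter rows to slices, then combine the partial results."""
    slices = distribute_rows(rows, num_slices)
    partials = [slice_partial_sum(slices[i]) for i in range(num_slices)]
    return sum(partials)


rows = [("a", 10), ("b", 20), ("c", 30), ("d", 40)]
total = leader_sum(rows, num_slices=4)  # 100, regardless of how rows are spread
```

The point of the sketch is that the final answer does not depend on which slice holds which row; only the amount of work per slice does.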

Amazon Redshift supports workload management by assigning different priorities to running queries. The user can define up to 8 custom query queues.

Service type

Server-based: you have to specify the number and type of the nodes (servers) that make up the Redshift cluster.

How the service is accessed

Scalability and performance

  • Compute
    • Compute (read and write) load can be scaled by changing the number of compute nodes and their type.
    • Bursts of read queries can be absorbed by enabling Concurrency Scaling, which uses one or more extra clusters that have access to the current data. You are charged for the time these extra clusters are used. Concurrency Scaling is enabled per query queue.
  • Storage
    • Storage capacity depends on the number and type of compute nodes, so to scale storage you need to change the node type or increase the node count.
    • With RA3 node types, however, storage is managed independently of the compute nodes.

Availability

  • Point-in-time backups are provided using automatic Amazon Redshift Snapshots.
  • Backup storage costs and recovery time can be reduced by marking transient tables as BACKUP NO during creation.
  • Amazon Redshift doesn't support Multi-AZ configuration, so the cluster will be unavailable during an outage of the Availability Zone it runs in.
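As a minimal sketch of the BACKUP NO point above, the following Python helper builds a CREATE TABLE statement that excludes a table from snapshots. The helper name `create_staging_table_sql` and the table and column names are hypothetical, and the function does no input sanitizing, so it is only suitable for trusted, hard-coded values.

```python
def create_staging_table_sql(table_name, columns):
    """Build a CREATE TABLE statement that marks the table as BACKUP NO.

    `columns` is a list of (name, type) pairs. Illustration only:
    the values are interpolated as-is, with no escaping.
    """
    column_defs = ", ".join(f"{name} {col_type}" for name, col_type in columns)
    return f"CREATE TABLE {table_name} ({column_defs}) BACKUP NO;"


sql = create_staging_table_sql(
    "staging_events",  # hypothetical transient table
    [("event_id", "BIGINT"), ("payload", "VARCHAR(256)")],
)
print(sql)
# CREATE TABLE staging_events (event_id BIGINT, payload VARCHAR(256)) BACKUP NO;
```

The resulting statement would be run against the cluster with whatever SQL client is already in use.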

Security

  • Management of the Redshift cluster is controlled by IAM.
  • Optionally, cluster data is encrypted at the storage level (at rest).
  • Access to cluster data is protected at the network level with VPC and security groups.
  • Access to cluster data is managed within the DB engine using the privileges assigned to users and groups.
  • DB user credentials can be a password stored within the DB engine, or a temporary password obtained through IAM integration.
  • SSL encryption protects data in transit.

Integrations

  • S3: loading and saving multiple files with parallel processing
  • DynamoDB: parallel data loading from one DynamoDB table
  • SSH: loading data from multiple hosts in parallel using SSH connections
  • AWS Data Pipeline: moving data in or out of the Redshift cluster
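For the S3 integration listed above, loading is typically done with the COPY command, which reads all files under a common key prefix in parallel. The sketch below only builds the statement text; the helper name `build_copy_from_s3`, the bucket, the prefix, and the IAM role ARN are all placeholders assumed for illustration, and the CSV format is just one possible choice.

```python
def build_copy_from_s3(table_name, s3_prefix, iam_role_arn):
    """Build a COPY statement that loads every file under an S3 key prefix.

    Illustration only: values are interpolated without escaping.
    """
    return (
        f"COPY {table_name} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV;"
    )


copy_sql = build_copy_from_s3(
    "events",                                   # hypothetical target table
    "s3://example-bucket/events/part-",         # placeholder prefix shared by the input files
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",  # placeholder role
)
print(copy_sql)
```

Splitting the input into multiple files under one prefix is what lets the load be spread across the node slices.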

Migration

AWS Database Migration Service (DMS) can be used to migrate data from any of the DMS-supported sources.

Service limitations

Pricing

The service is priced based on:

  • Number and type of nodes in the cluster. On-demand (per hour) and reserved pricing models are available.
  • Concurrency Scaling is charged per second of actual usage. Each Concurrency Scaling cluster has the same number and type of nodes as the main cluster, so it is charged at the rate of the nodes configured for the main cluster.
  • With RA3 nodes, storage is priced separately per GB-month.
  • S3 standard rates apply to backup snapshot storage. Backup storage up to 100% of the total cluster storage is free of charge.
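The pricing components above can be combined into a rough estimate. The following is a minimal sketch, not an official calculator: the function name `estimate_monthly_cost` and all rates in the example are made-up placeholders, and the model ignores reserved pricing, free allowances, and snapshot storage beyond the free tier.

```python
def estimate_monthly_cost(
    node_count,
    node_hourly_rate,
    hours,
    scaling_seconds=0,
    managed_storage_gb=0,
    storage_rate_per_gb_month=0.0,
):
    """Rough on-demand estimate built from the three components listed above."""
    # Main cluster: every node is billed for every hour it runs.
    cluster_cost = node_count * node_hourly_rate * hours
    # Concurrency Scaling: same node count and type as the main cluster,
    # billed per second of actual usage (total seconds across extra clusters).
    scaling_cost = node_count * node_hourly_rate * (scaling_seconds / 3600)
    # RA3 only: managed storage is billed separately per GB-month.
    storage_cost = managed_storage_gb * storage_rate_per_gb_month
    return cluster_cost + scaling_cost + storage_cost


# Placeholder numbers, not real AWS prices.
estimate = estimate_monthly_cost(
    node_count=2,
    node_hourly_rate=1.00,
    hours=720,
    scaling_seconds=3600,
    managed_storage_gb=1000,
    storage_rate_per_gb_month=0.02,
)
print(f"Estimated monthly cost: {estimate:.2f}")
```

For clusters that do not use RA3 nodes, the storage arguments can be left at their defaults, since storage is then included in the node price.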