
Exploring Amazon Redshift Architecture: A Comprehensive Guide


Amazon Redshift is an AWS data warehousing service for storing large amounts of data and running complex analytical queries against it. Like most AWS solutions, it’s fully managed, cost-effective, and offers features that work for different use cases.

As exciting as the benefits of using this service seem, getting the most from Redshift starts with understanding its architecture. 

This blog will introduce you to the basics of Amazon Redshift by focusing on its core components and how it works. Read on!

What is Amazon Redshift?


Image Source: AWS Redshift

Amazon Redshift is a cloud-based data warehousing solution that stores, distributes, and processes data on a petabyte scale. Redshift offers a solution for enterprises and organizations struggling to store, organize, and interpret vast volumes of data for informed decision-making. 

It’s based on PostgreSQL and built for Online Analytical Processing (OLAP), which makes it excellent at handling complex queries and delivering business intelligence. Additionally, it integrates with other AWS tools, like Amazon SageMaker Autopilot, to create machine learning models for business forecasting.

Core components of Amazon Redshift’s architecture

Redshift relies on the following components to work as a data warehouse: 

Clusters

Clusters are one of the most essential components of the Redshift infrastructure. Within these clusters, you’ll find nodes: the basic computing units that enable Amazon Redshift to analyze, store, and distribute data. Redshift charges its users based on the number of nodes they use, not clusters, since nodes do the actual work.

There are two types of Amazon Redshift clusters:

  • Single-node clusters: Use one node that acts as both the leader node and the compute node.
  • Multi-node clusters: Have one leader node and multiple compute nodes. Depending on the node type in use, a cluster can accommodate up to 128 compute nodes.

Leader node

Leader nodes are in charge of query execution and data distribution within Redshift clusters. Whenever a query comes in from a client application or a user queries a database, the leader node parses the query and creates an execution plan. This plan breaks the query into tasks and assigns them to the compute nodes to work on.

However, if the query references only catalog tables or uses certain SQL functions that run exclusively on the leader node, the leader node handles the request and returns the result to the user without involving the compute nodes.

The leader node also caches queries and query results. If a user later submits a query whose result is cached, the leader node returns it immediately instead of distributing the work to the compute nodes again.
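For instance, catalog queries and certain SQL functions execute entirely on the leader node. A minimal sketch (the table name sales is hypothetical):

```sql
-- Catalog queries like this are answered by the leader node alone:
SELECT "column", type
FROM pg_table_def
WHERE tablename = 'sales';

-- Leader-node-only functions behave the same way:
SELECT current_schema(), version();
```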

Compute nodes

Compute nodes do the heavy lifting by storing data and executing the data query plan sent by the leader node. They come in multiples and work in parallel for fast query execution. 

Each compute node has its own CPU, memory, and disk storage to carry out data operations. Redshift hosts compute nodes on a separate, isolated network to keep the data within them secure.

Amazon Redshift offers two types of nodes:

  • Dense storage (DS): These nodes provide greater storage capacity and are best for large workloads of 500 GB and above. They use hard disk drives (HDDs).
  • Dense compute (DC): These nodes deliver high performance with limited storage capacity. They use solid state drives (SSDs), which offer faster processing speeds and durability.

Node slices

Node slices are subdivisions of Redshift compute nodes. Each slice gets a share of the node’s memory and disk space, which equips it to work on a portion of the query workload assigned to the node. The number of slices per node depends on the node type in use.
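To see how slices map to nodes in a running cluster, you can query the STV_SLICES system table. A short sketch:

```sql
-- Each row maps one slice to its compute node:
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;
```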

Parallel processing

Parallel processing, or massively parallel processing (MPP), occurs when processing power is shared among different computing units—i.e., nodes—allowing data queries to proceed simultaneously. 

This leader/compute hierarchy within Redshift’s clusters empowers the data warehouse service to analyze data efficiently. With this parallelism, users can run multiple queries (involving structured or semi-structured data) without the system lagging.

MPP is also known as a “shared-nothing” architecture because the different nodes work independently with their own operating systems and memory.
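How evenly work spreads across those independent nodes depends largely on how table rows are distributed. A minimal sketch (the table and columns are hypothetical) of a distribution key that collocates related rows on the same slice:

```sql
-- Rows sharing a store_id hash to the same slice, so each node can
-- scan and aggregate its own portion of the table in parallel.
CREATE TABLE sales (
    sale_id   BIGINT,
    store_id  INTEGER,
    sale_date DATE,
    amount    DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (store_id);
```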

Columnar storage

Amazon Redshift maximizes storage efficiency by storing data in columns rather than rows. In a conventional row-oriented database, each disk block holds entire rows, so reading one field drags the rest of the row along with it. In columnar storage, each block holds values from a single column, and because those values share a data type, a block can pack in roughly three times as many values. Values of the same type, such as dates, sit together in one column for better organization and access.

With columnar storage, data retrieval is also cheaper: a query reads only the columns it needs instead of scanning every row in full, as a spreadsheet-like row design would require.
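Reusing the hypothetical sales table from the earlier sketch, a query like this touches only the blocks for the two columns it names, not the whole table:

```sql
-- Only the sale_date and amount column blocks are read from disk:
SELECT sale_date, SUM(amount) AS daily_total
FROM sales
GROUP BY sale_date;
```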

Data compression

Data compression builds on Redshift’s columnar storage. To save storage space and improve query performance, Redshift encodes the data in each column using a compression technique suited to its data type. An example is replacing recurring data values with a shorter token to free up storage space.

For best results, users can do the following (a short sketch follows the list):

  • Allow the system to analyze the columns and apply the encoding that best suits each one.
  • Follow Redshift’s compression encoding reference and manually select the encoding that matches each column’s data type.
  • Rely on the default ENCODE AUTO behavior, which Redshift applies automatically to tables created without explicit encoding instructions.
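As a sketch, you can ask Redshift to recommend encodings for an existing table, or declare them yourself in the DDL (the table and encoding choices below are illustrative, not prescriptive):

```sql
-- Report the recommended compression encoding for each column:
ANALYZE COMPRESSION sales;

-- Or set encodings explicitly at table creation:
CREATE TABLE sales_encoded (
    sale_id   BIGINT       ENCODE az64,
    store_id  INTEGER      ENCODE bytedict,  -- suits low-cardinality repeats
    sale_date DATE         ENCODE az64,
    notes     VARCHAR(256) ENCODE lzo
);
```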

Query planning and optimization

For every SQL query that comes in, Redshift creates a query plan and execution workflow in response. This begins with the parser (a component within the leader node) that creates a query tree to make sense of the user’s request. 

Afterward, the parser hands the query tree to the optimizer, another component within the leader node. The optimizer modifies the query tree by rewriting the query, reordering joins, flattening subqueries, or applying other enhancement techniques.

It also develops a query plan laying out the execution steps, such as scans, joins, aggregations, and how data should be redistributed among nodes. Redshift’s execution engine turns the query plan into compiled code and distributes it to the compute nodes to work on.
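You can inspect the plan the leader node produces with EXPLAIN. A sketch against the hypothetical sales table (the output shown is illustrative):

```sql
EXPLAIN
SELECT store_id, SUM(amount)
FROM sales
GROUP BY store_id;

-- Typical output lists the planned steps, for example:
--   XN HashAggregate  (cost=...)
--     ->  XN Seq Scan on sales  (cost=...)
```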

Workload management

Workload management (WLM) is Redshift’s way of maximizing resources while prioritizing user queries. Without it, queries land in a single queue in arrival order, producing a mixture of queries with very uneven workloads.

For instance, a query that should take 10 seconds to run can take far longer because the cluster is busy with a larger query in the same queue that runs for 40 seconds.

With WLM, you decide which queues your queries fall into according to parameters you define. The result? You can work through queues and queries by level of importance, so resources aren’t wasted on simple queries and high-priority queries aren’t stuck waiting behind long-running ones.
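With manual WLM, one common pattern is routing a session’s queries to a specific queue by query group. A sketch, assuming a queue has already been configured with a query group named 'dashboard' (a hypothetical name):

```sql
-- Route subsequent queries in this session to the matching WLM queue:
SET query_group TO 'dashboard';

SELECT store_id, SUM(amount)
FROM sales
GROUP BY store_id;

-- Return to default queue assignment:
RESET query_group;
```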

How does Amazon Redshift handle data warehousing?

Amazon Redshift offers different features to make data warehousing effective. These features are:

1. Ingest and store data

There are two types of data ingestion on Redshift. 

One is streaming ingestion, where you ingest high-volume data directly from streaming sources with low latency and high throughput, notably Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK).
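Streaming ingestion works by mapping a stream to an external schema and materializing it into a view. A minimal sketch, assuming a Kinesis stream named clickstream and a suitably permissioned IAM role (both hypothetical):

```sql
-- Map the Kinesis stream into Redshift:
CREATE EXTERNAL SCHEMA kds
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/MyStreamingRole';

-- Materialize stream records; AUTO REFRESH keeps the view current:
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS payload
FROM kds."clickstream";
```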

The other is batch ingestion through Amazon Simple Storage Service (Amazon S3), which lets you load files directly from an S3 bucket into Redshift tables.
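A sketch of such a load using the COPY command (the bucket, prefix, and role below are placeholders):

```sql
-- Bulk-load CSV files from S3 into an existing table in parallel:
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1;
```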

Redshift’s columnar storage design seals the deal by organizing data from different sources efficiently and ensuring less I/O (input/output) is needed when queries run.

2. Process and analyze data

Redshift uses SQL to analyze data from different sources, including data lakes and external databases. Its other features help here too: MPP ensures fast processing, and compression encoding saves storage space, freeing up resources for better performance.
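One way to reach data lake sources is Redshift Spectrum, which exposes tables cataloged in AWS Glue as an external schema. A sketch, assuming a Glue database demo_db containing an external table page_views (all names hypothetical):

```sql
-- Register the Glue database as an external schema:
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'demo_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- Query S3-resident data without loading it into the cluster:
SELECT pv.page, COUNT(*) AS views
FROM spectrum_demo.page_views pv
GROUP BY pv.page;
```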

3. Manage and secure data

Just like other AWS services, Redshift offers robust security features to protect users’ data assets. Some of these are:

  • Encryption: Protects your Redshift clusters
  • Virtual private cloud (VPC): Provides a virtual networking environment for clusters
  • Cluster security groups: Control which inbound connections are allowed to reach a cluster
  • Access management: Gatekeeps access to databases and other Redshift resources

4. Scale and optimize costs

As your data grows on Redshift, you can use concurrency scaling to add compute capacity and maintain query performance. Concurrency scaling clusters cost extra, but only while they are actively running queries. When you need to scale down instead, you also have the option of resizing your clusters.

Another way to optimize cost is through Redshift Advisor, a tool that provides recommendations on the ideal Redshift features for a particular use case. 

For example, if a user is paying to store multiple uncompressed columns, the Advisor can detect them and recommend compression to save cost and storage space. You can also log in to your AWS account and check your AWS Cost and Usage Report for more insight.

Best practices for using Amazon Redshift effectively

To get started on Amazon Redshift, you’ll need to consider the following best practices to help you maximize the efficiency of the data warehousing service:

  • Design tables to maximize query efficiency: choose the best sort key, choose a suitable distribution style, let COPY choose compression encodings, define primary key and foreign key constraints, use the smallest possible column sizes, and use date/time data types for date columns. Several of these choices come together in the sketch after this list.
  • Use the Amazon Redshift Advisor to help improve performance and reduce operating costs.
  • Use the latest Redshift drivers from AWS.
  • Take advantage of Redshift features like elastic resize and concurrency scaling.
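A sketch that pulls several of these practices into one table definition (names and key choices are illustrative; the right sort and distribution keys depend on your query patterns):

```sql
CREATE TABLE events (
    event_id   BIGINT      NOT NULL,
    user_id    INTEGER     NOT NULL,
    event_type VARCHAR(32),           -- smallest practical column size
    event_time TIMESTAMP   NOT NULL,  -- date/time type, not a string
    PRIMARY KEY (event_id)            -- informational; helps the planner
)
DISTSTYLE KEY
DISTKEY (user_id)                     -- collocates a user's rows for joins
SORTKEY (event_time);                 -- speeds time-range scans
```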

Maximize AWS cost efficiency with ProsperOps

Service costs can be a major factor when signing up for new software, especially for cloud solutions that offer different pricing tiers. However, optimizing those costs doesn’t have to be a burden that requires manual oversight from your engineers—whose plates are already full. With ProsperOps, your business can automate the process of optimizing your AWS cloud costs.

ProsperOps Automated Discount Management for Amazon Redshift automatically maximizes your AWS discount instruments and operates in the background with no manual intervention from your team—securing you the best saving opportunities while minimizing long-term and inelastic commitment risk. And best of all? You pay nothing unless we save you money.

Sign up for a free demo to discover how ProsperOps can help your business realize greater AWS savings with less risk.
