Elastic MapReduce (EMR)
EMR vs. Redshift
Redshift is far more cost-effective than EMR for analytics that can be performed on a traditional relational database.
Redshift
- Redshift is ideal for large volumes of structured data that you want to persist and query using standard SQL and
your existing BI tools.
- Petabyte-scale
- Redshift is based on PostgreSQL, so it behaves like a traditional relational database.
- Maximum of 128 nodes for the larger node sizes.
- Sources: S3, DynamoDB, EMR, Data Pipeline, or any SSH-enabled host on EC2 or on-premises
- Encryption: KMS or HSM or none
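Since Redshift's most common ingest path is loading from S3, a minimal sketch of building a COPY statement may help; the table name, bucket path, and IAM role ARN below are hypothetical placeholders, not values from these notes.

```python
# Sketch: loading S3 data into Redshift with the COPY command.
# The table, bucket, and IAM role below are hypothetical placeholders.

def build_copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Build a Redshift COPY statement that ingests CSV data from S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV;"
    )

sql = build_copy_statement(
    table="sales",
    s3_path="s3://example-bucket/sales/2024/",
    iam_role="arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```

The generated statement would then be run against the cluster with any standard SQL client.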
EMR
- EMR is ideal for processing and transforming unstructured or semi-structured data to bring into Redshift.
- EMR is also a much better option for data sets that are relatively transitory, not stored for long-term use.
- Terabyte- to petabyte-scale
- Use EMR if you need map-reduce algorithms.
- Store and process massive quantities of data (hundreds of TBs to PBs) across many servers (hundreds to thousands).
- Sources: data from AWS services or on-premises systems staged in S3, then read by the EMR leader node; streaming sources
- Encryption
- Data at rest
- For EMRFS on S3: SSE-S3, SSE-KMS, CSE-KMS, or CSE-Custom
- For cluster nodes (EC2 local disks/HDFS): LUKS (Linux Unified Key Setup)
- Data in transit (in-flight)
- For EMRFS traffic between S3 and cluster nodes (enabled automatically): TLS encryption (certificates provided as PEM files).
- For data in transit between nodes in a cluster:
- SSL (Secure Sockets Layer) for MapReduce
- SASL (Simple Authentication and Security Layer) for Spark shuffle encryption
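The encryption options above are bundled into an EMR security configuration. A minimal sketch of one covering SSE-S3 at rest, KMS-backed local-disk (LUKS) encryption, and in-transit TLS with PEM certificates follows; the KMS key ARN and the certificate zip location are hypothetical placeholders.

```python
import json

# Sketch of an EMR security configuration matching the notes above:
# SSE-S3 for EMRFS data at rest, KMS-backed local-disk encryption for
# cluster nodes, and in-transit TLS using PEM certificates.
# The KMS key ARN and certificate location are hypothetical.
security_config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/example",
            },
        },
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://example-bucket/certs/my-certs.zip",
            },
        },
    }
}

# Serialized, this is the JSON you would hand to
# `aws emr create-security-configuration`.
serialized = json.dumps(security_config, indent=2)
print(serialized)
```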
HDFS vs. EMRFS
HDFS
- HDFS distributes the data it stores across multiple instances in the cluster.
- HDFS storage is lost when the cluster is terminated.
- Most often used for intermediate results.
- Underlying storage options for cluster filesystems: EBS volumes and locally attached instance storage (for HDFS), and S3 (via EMRFS).
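The "intermediate results" mentioned above are the map-phase output that HDFS typically holds between stages. A minimal word-count sketch of the map -> shuffle -> reduce pattern, in plain Python rather than actual Hadoop code:

```python
from collections import defaultdict

# Plain-Python sketch of the map -> shuffle -> reduce pattern that a
# Hadoop job runs on the cluster. The intermediate (word, 1) pairs from
# the map phase are the kind of data HDFS holds between stages.

def map_phase(lines):
    """Emit (word, 1) pairs, like a Hadoop mapper."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group values by key, like the shuffle/sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts per word, like a Hadoop reducer."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data big clusters", "big jobs"])))
print(counts)  # {'big': 3, 'data': 1, 'clusters': 1, 'jobs': 1}
```

On EMR the same three steps run distributed across core nodes, with the shuffle moving intermediate pairs between machines.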
EMRFS
- An implementation of the Hadoop filesystem interface used for reading and writing regular files from EMR directly to S3.
- For storing persistent data in S3 for use with Hadoop while also providing features like S3 server-side encryption,
read-after-write consistency, and list consistency.
- Consistent View
- Allows EMR clusters to check for list and read-after-write consistency for S3 objects written by or synced with EMRFS.
- Number of retries: 5 (default); if an inconsistency is detected, EMRFS retries the S3 call up to this many times.
- Retry period (in seconds): 10 (default)
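The retry behavior above can be sketched as a simple loop: re-check S3 up to the configured number of retries, pausing the retry period between attempts. This is an illustration only (a fixed wait is assumed here), with a fake consistency check standing in for S3.

```python
import time

# Sketch of EMRFS consistent-view retries: on a detected inconsistency,
# re-check S3 up to `retries` times, waiting `retry_period` seconds
# between attempts. A fixed wait is assumed in this sketch.

def check_with_retries(is_consistent, retries=5, retry_period=10):
    """Return True once the check passes, False if all retries fail."""
    for attempt in range(retries + 1):
        if is_consistent():
            return True
        if attempt < retries:
            time.sleep(retry_period)
    return False

# Fake S3 check that becomes consistent on the third call.
calls = {"n": 0}
def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3

ok = check_with_retries(fake_check, retries=5, retry_period=0)
print(ok)  # True
```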