aws-notebook

My AWS Notebook

S3

Topics

Limits

Consistent view: Allows EMR clusters to check for list and read-after-write consistency for Amazon S3 objects written by or synced with EMRFS.

Access Control and Permission

  1. IAM Policy
    • Attached to IAM User/Group/Role, not buckets or objects.
  2. Bucket Policy - (a resource based policy)
    • Attached to S3 bucket (not objects in bucket).
    • Permissions are applied to all objects in the bucket unless the object was created by a user in another account.
    • Can control unauthenticated (anonymous) access and supports conditions.
    • Written in JSON.
    • Use cases:
      1. Can be used to grant access to IAM Users, other AWS accounts, or an anonymous User (“Principal”).
      2. Can be used to grant access to non-AWS identities with conditions.
  3. ACL (Access Control List) - (a resource based policy)
    • Attached to S3 buckets and S3 objects.
    • Cannot deny permissions.
    • Does not support conditions.
    • Written in XML.
    • Use cases:
      1. Manage access to objects not owned by the bucket owner.
      2. Manage permissions at the object level, where permissions vary by object.
      3. Allow external accounts to manage policies on S3 objects.
  4. S3 Block Public Access
    • S3 Block Public Access provides controls across an entire AWS Account or at the individual S3 bucket level to ensure that objects never have public permissions.
    • If an object is written to an AWS Account or S3 bucket with S3 Block Public Access enabled, and that object specifies any type of public permissions via ACL or policy, those public permissions are blocked.
    • Block Public Access is a good second layer of protection to ensure that you don’t inadvertently grant broader access to objects than intended.
  5. S3 Access Points and Access Point Policies
    • https://aws.amazon.com/blogs/aws/easily-manage-shared-data-sets-with-amazon-s3-access-points/
    • Once an access point is created, it can be accessed by hostname using the format https://[access_point_name]-[accountID].s3-accesspoint.[region].amazonaws.com.
    • Via the SDKs and CLI, pass the access point ARN in place of the bucket name: aws s3api get-object --key /Alice/object.zip --bucket arn:aws:s3:us-east-1:[my-account-id]:accesspoint/alices-access-point download.zip
    • By default each account can create 1,000 access points per region.
    • If you use AWS Organizations, you can add a Service Control Policy (SCP) requiring that all access points be restricted to a VPC.
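
As a rough sketch of the bucket policy and access point notes above, the Python below builds a policy document with a wildcard Principal plus a Condition, and formats an access point hostname from the pattern quoted earlier. The bucket name, CIDR range, Sid, and function names are illustrative assumptions, not values from these notes, and nothing here calls AWS.

```python
import json


def read_only_policy_from_ip(bucket: str, cidr: str) -> dict:
    """Build a bucket policy allowing anonymous GetObject, but only from `cidr`.

    `bucket` and `cidr` are placeholders chosen for this example.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowGetFromKnownRange",  # hypothetical statement id
                "Effect": "Allow",
                "Principal": "*",  # any principal, including anonymous users
                "Action": "s3:GetObject",
                # The policy is attached to the bucket; the statement targets its objects.
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {"IpAddress": {"aws:SourceIp": cidr}},
            }
        ],
    }


def access_point_hostname(name: str, account_id: str, region: str) -> str:
    """Format [access_point_name]-[accountID].s3-accesspoint.[region].amazonaws.com."""
    return f"{name}-{account_id}.s3-accesspoint.{region}.amazonaws.com"


if __name__ == "__main__":
    print(json.dumps(read_only_policy_from_ip("mybucket", "203.0.113.0/24"), indent=2))
    print(access_point_hostname("alices-access-point", "123456789012", "us-east-1"))
```

An ACL could not express the Condition element shown here, which is one reason to prefer a bucket policy when access depends on context such as source address.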

Read Consistency Rules

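  1. Since December 2020, all regions provide strong read-after-write consistency for PUTs and DELETEs of objects, covering both new and existing objects.
  2. Before that change, PUTs of new objects were read-after-write consistent, while PUTs and DELETEs on existing objects were eventually consistent.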

Encryption

See also Encryption solution for data at rest and data in transit.

Data protection against accidental loss

  1. Cross-region replication
    • Cross-region replication is a bucket-level feature that enables automatic, asynchronous copying of objects across buckets in different AWS regions.
  2. MFA Delete
    • If a bucket’s versioning configuration is MFA Delete–enabled, the bucket owner must include the x-amz-mfa request header in requests to permanently delete an object version or change the versioning state of the bucket.
  3. Versioning
    • Once enabled, versioning cannot be removed; it can only be suspended.
    • Applies to all objects in a bucket.
  4. S3 Object Lock
    • S3 Object Lock lets you attach a retain-until date to an S3 object, or an indefinite retention called a Legal Hold. S3 then prevents deletion of that object until the date passes or the Legal Hold is removed.
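
The versioning and MFA Delete notes above can be sketched as request parameters. The helper below only builds a dictionary shaped like the arguments of boto3's `put_bucket_versioning`; it does not call AWS. The function names, bucket name, device serial, and token code are assumptions made for illustration.

```python
from typing import Optional


def mfa_header_value(device_serial: str, token_code: str) -> str:
    """Value for x-amz-mfa: the device serial (or ARN), a space, then the current code."""
    return f"{device_serial} {token_code}"


def versioning_request(
    bucket: str,
    enabled: bool,
    mfa_delete: bool = False,
    device_serial: Optional[str] = None,
    token_code: Optional[str] = None,
) -> dict:
    """Build kwargs for a versioning change.

    There is no "removed" state: the status is either Enabled or Suspended.
    """
    config = {"Status": "Enabled" if enabled else "Suspended"}
    kwargs = {"Bucket": bucket, "VersioningConfiguration": config}
    if mfa_delete:
        if not device_serial or not token_code:
            raise ValueError("MFA Delete needs a device serial and a current code")
        config["MFADelete"] = "Enabled"
        kwargs["MFA"] = mfa_header_value(device_serial, token_code)
    return kwargs


if __name__ == "__main__":
    print(versioning_request("mybucket", enabled=True))
    print(
        versioning_request(
            "mybucket",
            enabled=True,
            mfa_delete=True,
            device_serial="arn:aws:iam::123456789012:mfa/alice",  # hypothetical device
            token_code="123456",  # placeholder code
        )
    )
```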

Cross-Region Replication (CRR)

Optimal storage

Storage Classes

  1. Standard
    • Designed for frequently accessed data.
  2. RRS (Reduced Redundancy Storage)
    • Designed for frequently accessed, non-critical, reproducible data.
    • AWS recommends that you not use this storage class. The STANDARD storage class is more cost effective.
  3. Intelligent Tiering
    • Designed for long-lived data with changing or unknown access patterns.
    • Optimizes storage costs by automatically moving data to the most cost-effective access tier, without performance impact or operational overhead. INTELLIGENT_TIERING moves data at the object level between two access tiers, a frequent access tier and a lower-cost infrequent access tier, when access patterns change: an object not accessed for 30 consecutive days moves to the infrequent access tier.
  4. Standard-IA (Infrequent Access)
    • Designed for long-lived, infrequently accessed data.
    • Minimum object size: 128 KB
    • Minimum storage duration: 30 days
  5. OneZone-IA (Infrequent Access)
    • Designed for long-lived, infrequently accessed, non-critical data.
    • Minimum object size: 128 KB
    • Minimum storage duration: 30 days
  6. Glacier (via Lifecycle management)
    • Designed for long-term data archiving with retrieval times ranging from minutes to hours.
    • Minimum storage duration: 90 days
    • Glacier Vault: keep your data safe forever
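
To make the minimums in the list above concrete, here is a small sketch that treats them as billing floors: a smaller object is charged as if it were 128 KB, and an object removed early is charged for the full minimum duration. Only the figures listed above are encoded; the table layout and function names are assumptions for illustration, and no prices are modelled.

```python
from dataclasses import dataclass

KB = 1024


@dataclass(frozen=True)
class StorageClass:
    name: str
    min_object_bytes: int  # 0 means no minimum is listed in the notes above
    min_days: int          # 0 means no minimum is listed in the notes above


CLASSES = {
    "STANDARD": StorageClass("STANDARD", 0, 0),
    "STANDARD_IA": StorageClass("STANDARD_IA", 128 * KB, 30),
    "ONEZONE_IA": StorageClass("ONEZONE_IA", 128 * KB, 30),
    "GLACIER": StorageClass("GLACIER", 0, 90),
}


def billable_bytes(size_bytes: int, storage_class: str) -> int:
    """Size used for billing once the class's minimum object size is applied."""
    return max(size_bytes, CLASSES[storage_class].min_object_bytes)


def billable_days(days_stored: int, storage_class: str) -> int:
    """Days used for billing once the class's minimum storage duration is applied."""
    return max(days_stored, CLASSES[storage_class].min_days)


if __name__ == "__main__":
    print(billable_bytes(10 * KB, "STANDARD_IA"))  # billed at the 128 KB floor
    print(billable_days(5, "GLACIER"))             # billed at the 90 day floor
```

Under these floors, many small or short-lived objects can cost more in an infrequent-access class than in Standard.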

Lifecycle policy: S3 Standard -> S3 IA -> Glacier
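
A lifecycle policy like the one above could be expressed roughly as follows. The dictionary is shaped like the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`, but this sketch only builds it and never calls AWS. The rule ID, prefix, and the 30 and 90 day thresholds are assumptions for illustration.

```python
def standard_ia_glacier_rule(
    prefix: str = "",
    ia_after_days: int = 30,
    glacier_after_days: int = 90,
) -> dict:
    """Build a lifecycle configuration: Standard -> Standard-IA -> Glacier."""
    # S3 requires objects to stay in Standard for at least 30 days
    # before they can transition to Standard-IA.
    if ia_after_days < 30:
        raise ValueError("transition to STANDARD_IA needs at least 30 days")
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": "standard-to-ia-to-glacier",  # hypothetical rule name
                "Filter": {"Prefix": prefix},       # empty prefix matches every object
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


if __name__ == "__main__":
    import json

    print(json.dumps(standard_ia_glacier_rule(prefix="logs/"), indent=2))
```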

Event Notifications

PUTs, POSTs, DELETEs can trigger events.
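
As a minimal sketch of consuming such an event, the function below pulls the event name, bucket, and object key out of each record. The sample event is hand-written and trimmed to the fields used here, and the function name is an assumption. Object keys arrive URL-encoded in notifications, so the key is decoded before use.

```python
from urllib.parse import unquote_plus


def summarize_s3_event(event: dict) -> list:
    """Return (event_name, bucket, key) for every record in an S3 notification."""
    summary = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Keys are URL-encoded in the notification (a space arrives as "+").
        key = unquote_plus(s3["object"]["key"])
        summary.append((record["eventName"], s3["bucket"]["name"], key))
    return summary


# Hand-written sample with only the fields this sketch reads.
SAMPLE_EVENT = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "mybucket"},
                "object": {"key": "Monday/my+report.csv", "size": 1024},
            },
        }
    ]
}

if __name__ == "__main__":
    for name, bucket, key in summarize_s3_event(SAMPLE_EVENT):
        print(name, bucket, key)
```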

CloudWatch

Use cases

Best Practices

  1. Optimal storage - randomize naming structure for better throughput
    • Random prefixes on folders in S3 ensure higher throughput on read and write.
    • Data storage is “partitioned” across servers.
    • The “partition-key” is the folder (if exists) and/or the name of the object.
    • E.g. Monday is the name of the partition key from this bucket: s3://mybucket/Monday/103094.csv
    • Random prefix (e.g. s3://mybucket/kdd8-monday/1030934.csv)
    • Reverse sequence keys (e.g. s3://mybucket/10107102/1030934.csv)
  2. Upload operations: a single-operation upload can handle a file of up to 5 GB; however, use multipart upload for files larger than 100 MB.
  3. Compression: Compress files (if possible) - GZIP, LZO, Snappy
  4. Leverage custom data formats like ORC, Parquet, AVRO, JSON.
  5. Use CloudFront for intensive GET workloads (cache S3 objects instead of making requests
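
The two key-naming schemes from item 1 can be sketched as below. The helper names are hypothetical, and deriving the prefix from a SHA-256 hash of the key is an assumption made here so the prefix can be recomputed from the key later; the notes above do not say how the random prefix is produced.

```python
import hashlib


def hashed_prefix_key(key: str, length: int = 4) -> str:
    """Prepend a short hash-derived prefix, e.g. 'xxxx-monday/1030934.csv'.

    A hash (rather than a truly random value) keeps the mapping deterministic,
    so the full key can be rebuilt from the original name.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:length]}-{key}"


def reversed_sequence_key(sequence_id: str, filename: str) -> str:
    """Reverse a sequential identifier so the fastest-changing digits come first."""
    return f"{sequence_id[::-1]}/{filename}"


if __name__ == "__main__":
    print(hashed_prefix_key("monday/1030934.csv"))
    print(reversed_sequence_key("20170101", "1030934.csv"))  # 10107102/1030934.csv
```

Both schemes spread keys that would otherwise share a common leading string across different partitions, at the cost of making keys harder to list in their natural order.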