
Amazon Simple Storage Service (S3)

Concepts

Virtual-hosted-style URL for an object:

http://{bucketname}.s3.amazonaws.com/{objectname}

Buckets

A bucket is a container for objects stored in Amazon S3.
Every object is contained in a bucket.
In traditional storage terminology, a bucket is roughly analogous to a drive or volume.
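
As an illustration (not part of the original notes), creating a bucket with the AWS SDK for Python (boto3) might look like the following minimal sketch; the bucket name and Region are placeholders.

import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

# Bucket names are globally unique; this one is a hypothetical placeholder.
s3.create_bucket(
    Bucket="example-notes-bucket",
    # Required for any Region other than us-east-1.
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)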

Limitations

  • Limit of 100 buckets per account (a default quota).
  • Unlimited number of objects per bucket.
  • An object can be up to 5 TB in size.
  • The largest object that can be uploaded in a single PUT is 5 GB.
    For objects larger than 100 MB, you should consider using multipart upload.
  • A bucket is owned by the AWS account that created it, and bucket ownership is not transferable.
  • A bucket must be empty before you can delete it.
  • After a bucket is deleted, its name becomes available for reuse, but it might not be available for you to reuse for various reasons; if you expect to use the same bucket name again, do not delete the bucket.
  • A bucket name must be globally unique across all existing bucket names in Amazon S3.

Region

Amazon S3 creates buckets in the Region that you specify. Objects belonging to a bucket that you create in a specific Region never leave that Region unless you explicitly transfer them to another Region.

Versioning

You can use versioning to preserve, retrieve, and restore every version of every object stored in your Amazon S3 bucket, including recovering deleted objects.
With versioning, you can easily recover from both unintended user actions and application failures.

Once you enable versioning on a bucket, it can never return to an unversioned state. You can, however, suspend versioning.
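
A minimal boto3 sketch of enabling and later suspending versioning (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

# Enable versioning on the bucket.
s3.put_bucket_versioning(
    Bucket="example-notes-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Versioning cannot be disabled again, only suspended.
s3.put_bucket_versioning(
    Bucket="example-notes-bucket",
    VersioningConfiguration={"Status": "Suspended"},
)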

Object facets

  • Key
    Name that you assign to an object, which may include a simulated folder structure. Each key uniquely identifies an object within a bucket (with versioning enabled, a single key can refer to multiple versions of an object).
  • Version ID
    Within a bucket, a key and version ID together uniquely identify an object.
    If versioning is turned off, you have only a single version.
    If versioning is turned on, you may have multiple versions of a stored object.
  • Value
    Actual content that you are storing, any sequence of bytes.
  • Metadata
    A set of name-value pairs to store information regarding the object.
    AWS assigns system metadata to these objects, and you can assign custom metadata, referred to as user-defined metadata.
  • Subresources
    A mechanism to store additional object-specific information:
    • Access control list (ACL): A list of grants identifying the grantees and the permissions granted
    • Torrent: The torrent file associated with the specific object
  • Access Control Information
    Supports both resource-based access control, such as an ACL and bucket policies, and user-based access control.
  • Object Tagging
    Enables you to categorize storage (see the sketch after this list).
    Each tag is a key-value pair.
    - Keys and values are case sensitive.
    - Max 10 tags per object
    - Tag keys must be unique
    - Tag key can be up to 128 Unicode characters in length
    - Tag value can be up to 256 Unicode characters in length
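
A minimal boto3 sketch of uploading an object with user-defined metadata and tags; the bucket, key, and tag names are placeholders:

import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-notes-bucket",
    Key="reports/2023/summary.txt",          # the object key
    Body=b"hello world",                      # the value: any sequence of bytes
    Metadata={"department": "finance"},       # user-defined metadata
    Tagging="project=blue&owner=data-team",   # tags as a URL-encoded string
)

# Tags can also be read (or replaced) later.
tags = s3.get_object_tagging(Bucket="example-notes-bucket", Key="reports/2023/summary.txt")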

Cross-Origin Resource Sharing

Defines a way for client web applications that are loaded in one domain to interact with resources in a different domain. With CORS support in Amazon S3, you can selectively allow cross-origin access to your S3 resources while avoiding the need to use a proxy.
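
A minimal boto3 sketch of a CORS configuration that allows GETs from a single origin (origin and bucket name are placeholders):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_cors(
    Bucket="example-notes-bucket",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedOrigins": ["https://www.example.com"],
                "AllowedMethods": ["GET"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,  # how long browsers may cache the preflight response
            }
        ]
    },
)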

Storage classes

S3 storage classes are configured at the object level, so a single bucket can contain objects stored across S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, and S3 One Zone-IA (see the sketch below the list).

  • S3 Standard
    Scope: Region
    Durability: 11 9s (99.999999999%)
    Availability: 4 9s (99.99%) (over a given year)
    Replicated 3 times in 3 AZ
  • S3 Intelligent-Tiering
    Automatically moves objects to the most cost-effective access tier based on access frequency.
    - Objects not accessed for 30 days -> Infrequent Access tier (40% lower cost)
    - Objects not accessed for 90 days -> Archive Instant Access tier (68% lower cost)
  • S3 Standard-Infrequent Access [S3 Standard-IA]
    Lower cost per GB stored, but a per-GB retrieval fee applies.
    Higher cost per PUT, COPY, POST, or GET request; 30-day minimum storage duration.
    Availability: 3 9s (99.9%)
  • S3 One Zone-Infrequent Access [S3 One Zone-IA]
    For non-critical data that needs infrequent but rapid access (costs 20% less than S3 Standard-IA).
    Scope: AZ
    Availability: 2.5 9s (99.5%)
    Data replicated 3 times in the same AZ

See also S3 Glacier - Archive storage classes
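
Because the storage class is an object-level setting, it can be chosen per upload; a minimal boto3 sketch (bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")

# Store this particular object in S3 Standard-IA; other objects in the
# same bucket can use different storage classes.
s3.put_object(
    Bucket="example-notes-bucket",
    Key="backups/archive-2023.tar.gz",
    Body=b"backup bytes",
    StorageClass="STANDARD_IA",
)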

Data Consistency Model

S3 provides read-after-write consistency only for PUTs of new objects.

Amazon S3 offers eventual consistency for overwrite PUTs and DELETEs in all regions, and updates to a single key are atomic.

If you need a strongly consistent data store, choose a different data store than Amazon S3, or add consistency checks to your application code.

Encryption

You can protect data in transit by using the Amazon S3 SSL API endpoints, which ensure that all data sent to and from Amazon S3 is encrypted with HTTPS while in transit.

You can encrypt data at rest using different options of Server-Side Encryption (SSE).
Your objects in Amazon S3 are encrypted at the object level with AES-256 as they are written to disk in the data centers, and decrypted for you when you access them.

Server-Side Encryption (SSE)

There are three mutually exclusive options (an upload sketch follows this list):

  • SSE-S3 (Amazon S3 managed keys)
    • Each object is encrypted with a unique data key.
    • This key is encrypted with a periodically rotated master key managed by Amazon S3.
    • AES-256 is used for both object and master keys.
    • This feature is offered at no additional cost beyond what you pay for using Amazon S3.
  • SSE-C (Customer-provided keys)
    You can use your own encryption key while uploading an object to Amazon S3.
    This encryption key is used by S3 to encrypt your data using AES-256.
    After the object is encrypted, the encryption key you supplied is deleted.
    When you retrieve this object you must provide the same encryption key in your request.
    S3 verifies that the encryption key matches, decrypts the object, and returns the object.
    This feature is also offered at no additional cost beyond what you pay for using Amazon S3.
  • SSE-KMS (AWS KMS managed encryption keys)
    When you upload an object:
    • A request is sent to AWS KMS to create an object key.
    • KMS generates this object key and encrypts it using the master key that you specified earlier.
    • KMS returns this encrypted object key along with the plaintext object key to Amazon S3.
    • The S3 web server encrypts your object using the plaintext object key, stores the now-encrypted object, and deletes the plaintext object key from memory.
    When you download an object:
    • A request is sent to AWS KMS to decrypt the object key.
    • KMS decrypts the object key using the master key and returns the plaintext object key.
    • With the plaintext object key, Amazon S3 decrypts the encrypted object and returns it to you.
    This feature does incur an additional charge.
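
A minimal boto3 sketch of requesting SSE-S3 or SSE-KMS on upload; the bucket, keys, and KMS key alias are placeholders:

import boto3

s3 = boto3.client("s3")

# SSE-S3: Amazon S3 managed keys (AES-256).
s3.put_object(
    Bucket="example-notes-bucket",
    Key="sse-s3-object.txt",
    Body=b"secret",
    ServerSideEncryption="AES256",
)

# SSE-KMS: encrypt with a master key managed in AWS KMS.
s3.put_object(
    Bucket="example-notes-bucket",
    Key="sse-kms-object.txt",
    Body=b"secret",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/example-key",  # hypothetical key alias
)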

Client-Side Encryption

You can also use client-side encryption, with which you encrypt the objects before uploading to Amazon S3 and then decrypt them after you have downloaded them.
For extra protection you can use a combination of both server-side and client-side encryption.

There are two options for managing data encryption keys.

  • Client-Side Master Key

    The first option is to use a client-side master key of your own.
    When uploading an object, you provide a client-side master key to the Amazon S3 encryption client (for example, AmazonS3EncryptionClient when using the AWS SDK for Java). A standalone sketch follows this list.

  • AWS KMS-Managed Customer Master Key (CMK)
    This process is similar to the SSE-KMS flow described above, except that the encryption and decryption happen on the client before uploading and after downloading the object.
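
An illustrative sketch of the client-side master key idea, using a symmetric key with the Python cryptography package; this stands in for the SDK encryption clients mentioned above and is not the AWS implementation. Names are placeholders.

import boto3
from cryptography.fernet import Fernet

s3 = boto3.client("s3")

key = Fernet.generate_key()   # client-side master key; you are responsible for storing it safely
f = Fernet(key)

# Encrypt before upload; S3 only ever sees ciphertext.
s3.put_object(Bucket="example-notes-bucket", Key="cse-object.bin",
              Body=f.encrypt(b"plaintext data"))

# Decrypt after download.
obj = s3.get_object(Bucket="example-notes-bucket", Key="cse-object.bin")
plaintext = f.decrypt(obj["Body"].read())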

Access Control

By default, all S3 resources are private.
Only the resource owner, the AWS account that created it, can access the resource.

There are two access policy options to grant permissions to your Amazon S3 resources.
Both use a JSON-based access policy language, as do all AWS services that use policies.

Bucket Policies and User Policies

  • Bucket Policies
    A bucket policy is attached to an S3 bucket and can allow or deny requests based on the elements in the policy, including the requester, S3 actions, resources, and aspects or conditions of the request (for example, the IP address used to make the request). See the sketch after this list.
  • IAM User Policies
    A user policy is attached to IAM users to allow or deny actions on your AWS resources.
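
A minimal boto3 sketch of attaching a bucket policy; this example policy (names and statement are placeholders, not a recommendation) denies any request to the bucket that is not made over HTTPS:

import json
import boto3

s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::example-notes-bucket",
                "arn:aws:s3:::example-notes-bucket/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

s3.put_bucket_policy(Bucket="example-notes-bucket", Policy=json.dumps(policy))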

Presigned URLs / Query String Authentication

A presigned URL is a way to grant time-limited access to an object, allowing users to upload or download it without needing their own AWS credentials or direct access to the account.

Anyone with valid security credentials can create a presigned URL, but the URL only works if its creator has permission to perform the operation that it grants.

  • Expiration max 7 days

Presigned URLs can be generated with the AWS CLI or SDKs, but not with the Management Console (see the sketch after the example URL below).

https://s3.amazonaws.com/examplebucket/test.txt

?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Credential=<your-access-key-id>/20130721/us-east-1/s3/aws4_request
&X-Amz-Date=20130721T201207Z
&X-Amz-Expires=86400
&X-Amz-SignedHeaders=host
&X-Amz-Signature=<signature-value>
  • The line feeds are added for readability.
  • The X-Amz-Credential value in the URL shows the / character only for readability. In practice, it should be encoded as %2F.
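
A minimal boto3 sketch of generating a presigned GET URL like the one above; the bucket and key are placeholders, and the URL only works while the signer has permission to perform the operation:

import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    ClientMethod="get_object",
    Params={"Bucket": "examplebucket", "Key": "test.txt"},
    ExpiresIn=86400,  # seconds; the maximum is 7 days
)
print(url)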

Other features

S3 Select

A feature that lets you use structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve only the subset of data that you need.

  • Reduces the amount of data that Amazon S3 transfers
  • Reduces the cost and latency to retrieve this data

Can be used on objects stored in:

  • CSV (even compressed in GZIP or BZIP2)
  • JSON (even compressed in GZIP or BZIP2)
  • Apache Parquet

It works with server-side encrypted objects.

Output is in CSV or JSON, and you can determine how the records in the result are delimited.
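
A minimal boto3 sketch of an S3 Select query against a CSV object; the bucket, key, and column names are placeholders:

import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="example-notes-bucket",
    Key="data/sales.csv",
    ExpressionType="SQL",
    Expression="SELECT s.country, s.amount FROM S3Object s WHERE s.country = 'ES'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; Records events carry the filtered bytes.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())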

Requirements

  • You must have s3:GetObject permission
  • If the object is encrypted with server-side encryption with customer-provided keys (SSE-C), you must use https, and you must provide the encryption key in the request.

Limits

  • Maximum length of SQL expression: 256 KB
  • Maximum length of a record (in the input or result): 1 MB
  • Nested data only by using the JSON output format
  • Not possible to query objects in the following storage classes and tiers:
    • S3 Glacier Flexible Retrieval
    • S3 Glacier Deep Archive
    • Reduced Redundancy Storage (RRS)
    • S3 Intelligent-Tiering Archive Access tier
    • S3 Intelligent-Tiering Deep Archive Access tier

Additional limitations using Parquet objects:

  • Only GZIP or Snappy (columnar) compression is supported
  • Whole-object compression not supported
  • Parquet output not supported: only CSV or JSON
  • Maximum uncompressed row group size: 512 MB
  • You must use the data types that are specified in the object’s schema.
  • Selecting on a repeated field returns only the last value.

Hosting a Static Website

If your website contains static content and optionally client-side scripts, then you can host your static website directly in Amazon S3 without the use of web-hosting servers.

To host a static website, you configure an Amazon S3 bucket for website hosting and upload your website content to the bucket.

MFA Delete

MFA Delete is another way to control deletes of your objects in Amazon S3.
It adds another layer of protection against unintentional or malicious deletes by requiring an additional, MFA-authenticated request against Amazon S3 to delete the object.

Cross-Region Replication

Cross-region replication (CRR) is a bucket-level configuration that enables automatic, asynchronous copying of objects across buckets in different AWS Regions.
The source bucket and destination bucket can be owned by different accounts.

To activate this feature, add a replication configuration to your source bucket to direct Amazon S3 to replicate objects according to the configuration.
In the replication configuration, provide information including the following:
- The destination bucket
- The objects that need to be replicated
- Optionally, the destination storage class (otherwise the source storage class will be used)
All data is encrypted in transit across AWS Regions using SSL.
A replicated object cannot be replicated again.
Versioning must be enabled for both the source and destination buckets.
Source and destination buckets must be in different AWS Regions.
Amazon S3 must be granted appropriate permissions to replicate files.
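
A rough boto3 sketch of adding a replication configuration to a (versioned) source bucket; the role ARN, bucket names, and rule details are placeholders, and the exact schema should be checked against the current API reference:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="source-notes-bucket",  # versioning must already be enabled
    ReplicationConfiguration={
        # IAM role that grants Amazon S3 permission to replicate on your behalf.
        "Role": "arn:aws:iam::111122223333:role/example-replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix = all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::destination-notes-bucket",
                    "StorageClass": "STANDARD_IA",  # optional; defaults to the source storage class
                },
            }
        ],
    },
)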

VPC Endpoints

Enables you to connect your VPC privately to Amazon S3 without requiring an internet gateway, network address translation (NAT) device, virtual private network (VPN) connection, or AWS Direct Connect connection.
Instances in your VPC do not require public IP addresses to communicate with the resources in the service. Traffic between your VPC and Amazon S3 does not leave the Amazon network.

Amazon S3 Transfer Acceleration

A feature that optimizes throughput when transferring larger objects across long geographic distances. It leverages CloudFront edge locations to help you upload your objects more quickly when you are closer to an edge location than to the Region to which you are transferring your files.

For example, instead of using the public internet to upload objects from Southeast Asia across the globe to Northern Virginia, you can take advantage of the global Amazon content delivery network (CDN).
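
A minimal boto3 sketch of enabling Transfer Acceleration on a bucket and uploading through the accelerate endpoint; the bucket and file names are placeholders:

import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# Enable the accelerate endpoint for the bucket.
s3.put_bucket_accelerate_configuration(
    Bucket="example-notes-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Create a client that routes requests through the accelerate endpoint.
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("big-archive.tar.gz", "example-notes-bucket", "big-archive.tar.gz")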

Multipart Uploads

Upload large objects in parts to speed up the upload by transferring the parts in parallel.

To use multipart upload, you first break the object into smaller parts, upload the parts in parallel, and then submit a manifest telling Amazon S3 that all parts of the object have been uploaded. Amazon S3 then assembles those individual pieces into a single object. (A sketch of the full flow follows the lists below.)

Multipart upload can be used for objects ranging from 5 MB to 5 TB in size.

Multipart upload initiation

  • You send a request to initiate the upload.
  • S3 returns a response with an upload ID. You must include this upload ID whenever you upload parts, list the parts, complete the upload, or stop the upload; any metadata describing the object must be provided in this initiation request.

Parts upload

  • Specify upload ID and a part number (between 1 and 10,000)
    • It doesn’t need to be in a consecutive sequence (for example, it can be 1, 5, and 14)
    • A previously uploaded part can be overwritten by uploading a new part using the same part number
  • S3 returns an entity tag (ETag) for the part as a header in the response
  • For each part upload, you must record the part number and the ETag value

Only after you either complete or stop a multipart upload does Amazon S3 free up the storage for the parts and stop charging you for it. After stopping a multipart upload, you cannot upload any part using that upload ID again. To make sure you free all storage consumed by all parts, stop a multipart upload only after all part uploads have completed.

Multipart upload completion

  • The complete multipart upload request must include
    • the upload ID
    • a list of both part numbers and corresponding ETag values
  • Amazon S3 creates an object by concatenating the parts in ascending order
  • Any metadata provided in the initiate request is associated with the object
  • After a successful complete request, the parts no longer exist
  • S3 response includes an ETag that uniquely identifies the combined object data
  • This ETag is not necessarily an MD5 hash of the object data
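
A minimal boto3 sketch of the initiate / upload parts / complete flow using the low-level API; in practice the part uploads would run in parallel, and the bucket, key, local file, and part size are placeholders:

import boto3

s3 = boto3.client("s3")
bucket, key = "example-notes-bucket", "large-object.bin"

def iter_chunks(path, size=8 * 1024 * 1024):
    # Yield the file in 8 MiB pieces; every part except the last must be at least 5 MiB.
    with open(path, "rb") as f:
        while chunk := f.read(size):
            yield chunk

# 1. Initiate: S3 returns an upload ID used by every later call.
upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
upload_id = upload["UploadId"]

# 2. Upload parts (part numbers 1-10,000) and record each part's ETag.
parts = []
for part_number, chunk in enumerate(iter_chunks("large-object.bin"), start=1):
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                          PartNumber=part_number, Body=chunk)
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

# 3. Complete: S3 concatenates the parts in ascending part-number order.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)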

Range GETs

Range GETs are similar to multipart uploads, but in reverse.
If you are downloading a large object and tracking the offsets, use range GETs to download the object as multiple parts instead of a single part.
You can then download those parts in parallel and potentially see an improvement in performance.
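
A minimal boto3 sketch of a ranged GET that fetches only the first 1 MiB of an object (names are placeholders); further ranges could be fetched in parallel:

import boto3

s3 = boto3.client("s3")

resp = s3.get_object(
    Bucket="example-notes-bucket",
    Key="large-object.bin",
    Range="bytes=0-1048575",  # only bytes 0..1048575 are returned
)
first_chunk = resp["Body"].read()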

Object Lifecycle Management

A lifecycle configuration is a set of rules that defines actions that Amazon S3 applies to a group of objects.
There are two types of actions (a sketch combining both follows the list):

  • Transition actions
    Define when objects transition to another storage class
  • Expiration actions
    Define when objects expire. Amazon S3 deletes expired objects on your behalf.
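
A minimal boto3 sketch of a lifecycle configuration with one transition action and one expiration action; the prefix, day counts, and names are placeholders:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-notes-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "logs-tiering-and-expiry",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                # Transition action: move objects to Standard-IA after 30 days.
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                # Expiration action: delete objects after one year.
                "Expiration": {"Days": 365},
            }
        ]
    },
)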

S3 Storage Lens

It’s a cloud-storage analytics feature that you can use to gain organization-wide visibility into object-storage usage and activity.

  • Summary insights, such as finding out how much storage you have across your entire organization
  • Contextual recommendations that you can use to optimize storage costs and apply best practices for protecting your data.
  • Understand which are the fastest-growing buckets and prefixes
  • Identify cost-optimization opportunities
  • Implement data-protection and security best practices, and improve the performance of application workloads. For example:
    • Identify buckets that don’t have S3 Lifecycle rules to expire incomplete multipart uploads that are more than 7 days old.
    • Identify buckets that aren’t following data-protection best practices, such as using S3 Replication or S3 Versioning.

Metrics are exposed in the Account snapshot section on the Amazon S3 console Buckets page.

  • Interactive dashboard that you can use to visualize insights and trends, flag outliers, and receive recommendations for optimizing storage costs and applying data-protection best practices.
  • Your dashboard has drill-down options to generate and visualize insights at these levels:
    • organization
    • account
    • region
    • storage class
    • bucket
    • prefix
  • You can send a daily metrics export in CSV or Parquet format to an S3 bucket.

S3 Inventory

This feature helps you manage your storage. You can use it to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs.

S3 Inventory does not use the List API operations to audit your objects and does not affect the request rate of your bucket.

Provides a list of your objects and their corresponding metadata in these formats:

  • CSV
  • Apache optimized row columnar (ORC)
  • Apache Parquet

If you set up a weekly inventory, a report is generated every Sunday (UTC time zone) after the initial report. You can configure multiple inventory lists for a bucket. When you’re configuring an inventory list, you can specify the following:

  • What object metadata to include in the inventory
  • Whether to list all object versions or only current versions
  • Where to store the inventory list file output
  • Whether to generate the inventory on a daily or weekly basis
  • Whether to encrypt the inventory list file

Source bucket

  • Contains the objects that are listed in the inventory
  • Contains the configuration for the inventory

You can get an inventory list for an entire bucket, or you can filter the list by object key name prefix.

Destination bucket

  • Contains the manifest files that list all the inventory list files
  • Contains the inventory list files, filled with:
    • The list of the objects in the source bucket
    • Metadata for each object
    • Can be in these formats:
      • CSV (compressed with GZIP)
      • Apache optimized row columnar (ORC) (compressed with ZLIB)
      • Apache Parquet (compressed with Snappy)
  • Must have a bucket policy to give Amazon S3 permission to verify ownership of the bucket and permission to write files to the bucket
  • Must be in the same AWS Region as the source bucket
  • Can be the same as the source bucket
  • Can be owned by a different AWS account than the account that owns the source bucket

Server access logging

Detailed records for the requests that are made to a bucket. These can be useful for security and access audits, or to learn about your customer base and understand your Amazon S3 bill (see the sketch below).

  • Disabled by default
  • From source bucket to destination bucket (also known as a target bucket)
  • The destination bucket must be in the same AWS Region and AWS account as the source bucket
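
A minimal boto3 sketch of enabling server access logging from a source bucket to a destination (target) bucket in the same Region and account; the names are placeholders, and the target bucket needs the appropriate permissions for the logging service to write to it:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_logging(
    Bucket="example-notes-bucket",  # source bucket
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "example-log-bucket",  # destination (target) bucket
            "TargetPrefix": "access-logs/example-notes-bucket/",
        }
    },
)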

Batch Operations

Used to perform large-scale batch operations on Amazon S3 objects.

  • Can perform a single operation on lists of Amazon S3 objects that you specify
  • Tracks progress, sends notifications, and stores a detailed completion report of all actions
  • Operations:
    • Copy objects
    • Set object tags or access control lists (ACLs)
    • Initiate object restores from S3 Glacier Flexible Retrieval
    • Invoke an AWS Lambda function to perform custom actions using your objects.
  • You can use a custom list of objects, or you can use an Amazon S3 Inventory report

  • Job
    A job is the basic unit of work for S3 Batch Operations. A job contains all of the information necessary to run the specified operation on the objects listed in the manifest.
  • Operation
    The operation is the type of API action, such as copying objects. Each job performs a single type of operation across all objects that are specified in the manifest.
  • Task
    A task is the unit of execution for a job. A task represents a single call. Over the course of a job’s lifetime, one task is created for each object specified in the manifest.

S3 on Outposts

You can create S3 buckets on your Outposts and easily store and retrieve objects on premises. S3 on Outposts provides a new storage class, OUTPOSTS, which uses the Amazon S3 APIs and is designed to store data durably and redundantly across multiple devices and servers on your Outposts. You communicate with your Outposts bucket by using an access point and endpoint connection over a virtual private cloud (VPC).

Example of an (ARN) format for S3 on Outposts buckets: arn:aws:s3-outposts:<region>:<account-id>:outpost/<outpost-id>/bucket/<bucket-name>

S3 access points

Access points are named network endpoints that are attached to buckets that you can use to perform S3 object operations, such as GetObject and PutObject. Each access point has distinct permissions and network controls that S3 applies for any request that is made through that access point. Each access point enforces a customized access point policy that works in conjunction with the bucket policy that is attached to the underlying bucket. You can configure any access point to accept requests only from a virtual private cloud (VPC) to restrict Amazon S3 data access to a private network. You can also configure custom block public access settings for each access point.

Cost allocation S3 bucket tags

A cost allocation tag is a key-value pair that you associate with an S3 bucket. After you activate cost allocation tags, AWS uses the tags to organize your resource costs on your cost allocation report. Cost allocation tags can only be used to label buckets.

The cost allocation report lists the AWS usage for your account by product category and linked account user. The report contains the same line items as the detailed billing report and additional columns for your tag keys.
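
A minimal boto3 sketch of tagging a bucket; the tags still have to be activated as cost allocation tags in the Billing console, and the names are placeholders:

import boto3

s3 = boto3.client("s3")

# put_bucket_tagging replaces the whole tag set, so include every tag you want to keep.
s3.put_bucket_tagging(
    Bucket="example-notes-bucket",
    Tagging={"TagSet": [
        {"Key": "cost-center", "Value": "analytics"},
        {"Key": "environment", "Value": "production"},
    ]},
)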


Amazon S3 Glacier

Archive storage classes

  • S3 Glacier Instant Retrieval
    Lowest-cost storage for long-lived data rarely accessed that requires retrieval in milliseconds.
    (cost 68% less than S3 Standard-IA)
    Availability: 3 9s (99.9%)
    Minimum object size: 128 KB
  • S3 Glacier Flexible Retrieval (Formerly S3 Glacier)
    Ideal for archiving data that is accessed 1-2 times per year, asynchronously.
    Retrieval times: from minutes to hours, configurable
    (cost 10% less than S3 Glacier Instant Retrieval)
    Availability: 4 9s (99.99%)
  • S3 Glacier Deep Archive
    Lowest-cost storage class, for data that may be accessed 1-2 times per year.
    Designed for customers that must retain data for many years to meet regulatory compliance requirements.
    Retrieval time: 12 hours