Kinesis
Kinesis Data Streams vs. Kinesis Data Firehose
- Kinesis Data Streams: real-time ingestion for custom consumers you write (KCL, Lambda); capacity is managed in shards; data is retained (1 to 365 days) and can be replayed.
- Kinesis Data Firehose: fully managed, near-real-time delivery into destinations such as S3, Redshift, and OpenSearch; scales automatically; no data retention or replay.
KPL (Kinesis Producer Library) vs. AWS SDK
- KPL is best for high-rate producers.
The AWS SDK is best for low-rate producers (mobile apps, IoT devices, web clients).
- KPL is best for producers that need record batching (aggregation and collection).
The SDK can create and delete streams, but aggregation is something you must manage yourself.
- For operational reporting:
The KPL sends records asynchronously, so it can be used for informational messages.
The SDK's PutRecords is a synchronous call, so it must be used for critical events.
- KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable).
Larger values of RecordMaxBufferedTime result in higher packing efficiency and better performance.
Applications that cannot tolerate this additional delay may need to use the AWS SDK directly (see the sketch after this list).
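A minimal Java sketch contrasting the two send paths, assuming the KPL and the SDK for Java (v1) are on the classpath; the stream name, partition key, and buffering value are illustrative assumptions:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;

public class ProducerComparison {
    public static void main(String[] args) {
        // --- KPL: asynchronous, batched send ---
        KinesisProducerConfiguration config = new KinesisProducerConfiguration()
                // Buffer records for up to 100 ms before flushing; larger values
                // improve packing efficiency at the cost of added latency.
                .setRecordMaxBufferedTime(100);
        KinesisProducer kpl = new KinesisProducer(config);

        // addUserRecord returns immediately with a future; the KPL batches
        // and aggregates records behind the scenes.
        kpl.addUserRecord("example-stream", "partition-key-1",
                ByteBuffer.wrap("informational message".getBytes(StandardCharsets.UTF_8)));
        kpl.flushSync(); // block until all buffered records are sent
        kpl.destroy();

        // --- AWS SDK: synchronous send, no built-in aggregation ---
        AmazonKinesis sdk = AmazonKinesisClientBuilder.defaultClient();
        sdk.putRecord(new PutRecordRequest() // blocks until Kinesis acknowledges
                .withStreamName("example-stream")
                .withPartitionKey("partition-key-1")
                .withData(ByteBuffer.wrap("critical event".getBytes(StandardCharsets.UTF_8))));
    }
}
```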
Auto Scaling in KCL
- Read progress (checkpoints per shard) is kept in a DynamoDB lease table named after the consumer application.
- The KCL adapts to resharding (shard splits and merges) automatically, rebalancing shard leases across workers.
- Because the KCL uses this DynamoDB table to keep track of the data that has been read, you might need to increase
the table's provisioned throughput for high-throughput streams (see the sketch below).
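A minimal sketch of the KCL 1.x configuration, showing that the application name doubles as the DynamoDB lease-table name and that the table's initial capacity is configurable; the application name, stream name, and capacity values are assumptions:

```java
import java.util.UUID;

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;

public class KclConfigSketch {
    public static void main(String[] args) {
        KinesisClientLibConfiguration config = new KinesisClientLibConfiguration(
                "example-consumer-app",        // application name == DynamoDB lease table name
                "example-stream",
                new DefaultAWSCredentialsProviderChain(),
                UUID.randomUUID().toString())  // unique worker id
                // Every checkpoint is a DynamoDB write, so raise these
                // (illustrative values) for streams with many shards.
                .withInitialLeaseTableReadCapacity(50)
                .withInitialLeaseTableWriteCapacity(50);

        // A Worker built from this config creates/leases the table and
        // rebalances shard leases across workers, including after resharding.
    }
}
```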
Data encryption in Kinesis Data Stream
Data can be encrypted at rest using server-side encryption; it must be enabled on the stream (see the sketch below).
Server-side encryption is a feature in Kinesis Data Streams that automatically encrypts data before it’s at rest by
using an AWS KMS customer master key (CMK) you specify. Data is encrypted before it’s written to the Kinesis stream
storage layer, and decrypted after it’s retrieved from storage. As a result, your data is encrypted at rest within the
Kinesis Data Streams service.
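A minimal sketch of enabling server-side encryption on an existing stream with the SDK for Java (v1); the stream name is an assumption, and alias/aws/kinesis is the AWS-managed key:

```java
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.EncryptionType;
import com.amazonaws.services.kinesis.model.StartStreamEncryptionRequest;

public class EnableStreamEncryption {
    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        // Turn on KMS-based server-side encryption; records written after
        // this call are encrypted before reaching the storage layer.
        kinesis.startStreamEncryption(new StartStreamEncryptionRequest()
                .withStreamName("example-stream")
                .withEncryptionType(EncryptionType.KMS)
                .withKeyId("alias/aws/kinesis")); // or the ARN of a CMK you own
    }
}
```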
Data encryption in Kinesis Data Firehose
Amazon Kinesis Data Firehose allows you to encrypt your data after it's delivered to your S3 bucket.
When creating your delivery stream, you can choose to encrypt your data with an AWS Key Management Service (KMS) key
that you own (see the sketch below).
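A minimal sketch of creating a delivery stream whose S3 destination is encrypted with a KMS key you own, using the SDK for Java (v1); all names and ARNs below are illustrative assumptions:

```java
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
import com.amazonaws.services.kinesisfirehose.model.CreateDeliveryStreamRequest;
import com.amazonaws.services.kinesisfirehose.model.EncryptionConfiguration;
import com.amazonaws.services.kinesisfirehose.model.ExtendedS3DestinationConfiguration;
import com.amazonaws.services.kinesisfirehose.model.KMSEncryptionConfig;

public class CreateEncryptedDeliveryStream {
    public static void main(String[] args) {
        AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();

        firehose.createDeliveryStream(new CreateDeliveryStreamRequest()
                .withDeliveryStreamName("example-delivery-stream")
                .withExtendedS3DestinationConfiguration(new ExtendedS3DestinationConfiguration()
                        .withBucketARN("arn:aws:s3:::example-bucket")
                        // Role that lets Firehose write to the bucket and use the key.
                        .withRoleARN("arn:aws:iam::123456789012:role/example-firehose-role")
                        // Objects delivered to S3 are encrypted with your KMS key.
                        .withEncryptionConfiguration(new EncryptionConfiguration()
                                .withKMSEncryptionConfig(new KMSEncryptionConfig()
                                        .withAWSKMSKeyARN(
                                                "arn:aws:kms:us-east-1:123456789012:key/example-key-id")))));
    }
}
```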
Comparison of Streams
- Kinesis Data Streams: Ordered data, Replay capability, Record-by-record processing, Up to 1 MB per record.
- Apache Spark Streaming: Exactly-once delivery, Out-of-order data, Micro-batching.
- Apache Flink: Exactly-once delivery, Out-of-order data, Replay capability, Batch processing.
- Apache Kafka: Ordered data, Batch processing, Records can exceed 1 MB.