Data Duplication in Kinesis Streams: Possible Reasons | Valu$

Possible Reasons for Data Duplication in Kinesis Streams

Question

Valu$ is an online retail store that provides a powerful online marketplace platform, giving its buyers a highly secure and convenient shopping experience.

Valu$ runs an online web application to process business transactions. It uses Kinesis streams as its backbone, the KPL with aggregation for data ingestion, and the KCL to load data into DynamoDB.

The administrator observed a lot of duplicate data loaded into the table and, after analysis, identified that the duplication is happening in the Kinesis streams.

Which of the following options could be possible reasons? Select 2 options.

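Answers

A. Usage of default sequence number scheme in KPL to uniquely identify your KPL user records while loading

B. KPL Retry mechanism

C. Time to Live

D. Rate Limiting

Explanations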

Duplication of data in Kinesis streams can arise in several ways. Each option is assessed below:

A. Usage of default sequence number scheme in KPL to uniquely identify your KPL user records while loading:

The Kinesis Producer Library (KPL) does not generate sequence numbers; Kinesis Data Streams assigns a sequence number to each record when it is written to the stream. With KPL aggregation, several user records are packed into a single Kinesis record, so all of them share that record's sequence number and are distinguished only by a subsequence number. If the consumer relies on the default sequence number alone to identify or checkpoint KPL user records, it cannot tell them apart, and after a restart or retry the same user records may be processed and loaded again, which shows up as duplicate data.
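To make the identity problem concrete, here is a minimal sketch in plain Python. It does not call the KPL or KCL; the `UserRecord` class, its field names, and the sequence number value are illustrative assumptions that only mimic the idea of a sequence number shared by every user record in one aggregated record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserRecord:
    """Hypothetical stand-in for a de-aggregated KPL user record."""
    sequence_number: str      # assigned by the stream to the aggregated record
    sub_sequence_number: int  # position of the user record inside the aggregate
    data: str


def key_by_sequence_only(record):
    # Ambiguous: every user record in the aggregate gets the same key.
    return record.sequence_number


def key_by_sequence_and_subsequence(record):
    # Unique: the subsequence number separates the user records.
    return (record.sequence_number, record.sub_sequence_number)


# Three user records packed into one aggregated record (made-up values).
records = [
    UserRecord("4959", 0, "order-1"),
    UserRecord("4959", 1, "order-2"),
    UserRecord("4959", 2, "order-3"),
]

ambiguous_ids = {key_by_sequence_only(r) for r in records}
unique_ids = {key_by_sequence_and_subsequence(r) for r in records}
print(len(ambiguous_ids), len(unique_ids))
```

With only the sequence number as the key, the three user records collapse into one identity, so a consumer cannot record which of them it has already loaded. Pairing the sequence number with the subsequence number keeps them distinct.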

B. KPL Retry mechanism:

The KPL has a retry mechanism that resends failed records, within configurable limits. Duplicates arise when a write actually succeeds but the acknowledgement never reaches the producer, for example because of a network timeout. The KPL treats the attempt as failed and sends the record again, so the same data is written to the stream twice.
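The lost-acknowledgement case can be simulated without any AWS dependency. The sketch below is a toy model, not KPL code: `FlakyStream`, `put_with_retry`, and `load_idempotently` are hypothetical names, and a Python dict stands in for a DynamoDB table keyed on a producer-generated record ID.

```python
import uuid


class FlakyStream:
    """Toy stream whose first acknowledgement is lost after a successful write."""

    def __init__(self):
        self.records = []
        self._drop_next_ack = True

    def put(self, payload):
        self.records.append(payload)  # the write itself succeeds
        if self._drop_next_ack:
            self._drop_next_ack = False
            raise TimeoutError("acknowledgement lost")


def put_with_retry(stream, payload, max_attempts=3):
    """Retry on timeout; the producer cannot tell whether the write landed."""
    for _ in range(max_attempts):
        try:
            stream.put(payload)
            return
        except TimeoutError:
            continue
    raise RuntimeError("all attempts failed")


def load_idempotently(records, table):
    """Consumer-side de-duplication: the record ID acts as the primary key."""
    for record in records:
        table.setdefault(record["id"], record)


stream = FlakyStream()
put_with_retry(stream, {"id": str(uuid.uuid4()), "order": "order-1"})

table = {}
load_idempotently(stream.records, table)
print(len(stream.records), len(table))
```

The stream ends up holding two copies of the record, while the table holds one, because the second copy carries the same ID. Embedding a unique ID in each record and making the load idempotent is one common way to tolerate producer retries.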

C. Time to Live:

Kinesis streams do not have a "Time to Live (TTL)" setting; TTL is a DynamoDB table feature. The closest stream-level concept is the retention period, which controls how long records remain readable (24 hours by default). Once the retention period lapses, records are no longer available to consumers. Retention determines how long data can be read, not how many copies of it exist, so it does not cause duplication.

D. Rate Limiting:

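Each shard in a Kinesis stream accepts a limited write throughput (up to 1 MB per second or 1,000 records per second). When a producer exceeds the limit, the write is rejected with a ProvisionedThroughputExceededException and the record is not stored. Retrying a throttled record therefore does not, by itself, create a duplicate, so rate limiting is not a cause of the duplication observed here.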
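For producers that call the PutRecords API directly rather than through the KPL, the response reports a result for each entry in request order, and throttled entries carry an `ErrorCode` instead of a `SequenceNumber`. The sketch below assumes that response shape; the helper name `entries_to_retry` and the sample request and response values are made up for illustration, and no AWS call is made.

```python
def entries_to_retry(request_entries, response):
    """Return only the request entries whose result carries an ErrorCode."""
    return [
        entry
        for entry, result in zip(request_entries, response["Records"])
        if "ErrorCode" in result
    ]


# Hand-written request and response mimicking a partially throttled batch.
request_entries = [
    {"Data": b"order-1", "PartitionKey": "customer-1"},
    {"Data": b"order-2", "PartitionKey": "customer-2"},
    {"Data": b"order-3", "PartitionKey": "customer-3"},
]
response = {
    "FailedRecordCount": 1,
    "Records": [
        {"SequenceNumber": "1", "ShardId": "shardId-000000000000"},
        {
            "ErrorCode": "ProvisionedThroughputExceededException",
            "ErrorMessage": "Rate exceeded",
        },
        {"SequenceNumber": "2", "ShardId": "shardId-000000000000"},
    ],
}

retry = entries_to_retry(request_entries, response)
print(retry)
```

Resending only the rejected entries, rather than the whole batch, keeps the entries that were already accepted from being written a second time.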

Therefore, options A and B could be the possible reasons for duplication of data in Kinesis streams: identifying KPL user records by the default sequence number alone, and the KPL retry mechanism.