Designing a Flexible Architecture for Data Ingestion and Processing in Google Cloud

Supporting Unstructured Data Ingestion and Processing Pipelines on Google Cloud

Question

Your company is designing its data lake on Google Cloud and wants to develop different ingestion pipelines to collect unstructured data from different sources.

After the data is stored in Google Cloud, it will be processed in several data pipelines to build a recommendation engine for end users on the website.

The structure of the data retrieved from the source systems can change at any time.

The data must be stored exactly as it was retrieved for reprocessing purposes in case the data structure is incompatible with the current processing pipelines.

You need to design an architecture to support the use case after you retrieve the data.

What should you do?

Answers

A. Send the data through the processing pipeline, and then store the processed data in a BigQuery table for reprocessing.
B. Store the data in a BigQuery table. Design the processing pipelines to retrieve the data from the table.
C. Send the data through the processing pipeline, and then store the processed data in a Cloud Storage bucket for reprocessing.
D. Store the data in a Cloud Storage bucket. Design the processing pipelines to retrieve the data from the bucket.

Correct Answer: D

Explanations

To design an architecture that can support the use case, you need to consider the following requirements:

  • Ability to collect unstructured data from different sources
  • Store the data exactly as it was retrieved for reprocessing purposes in case the data structure is incompatible with the current processing pipelines
  • Process the data in several data pipelines to build a recommendation engine for end users on the website

Considering these requirements, the best approach would be to store the raw data in an object storage service like Google Cloud Storage (GCS) and process it using data pipelines that can handle schema evolution.
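As a concrete illustration, here is a minimal ingestion sketch in Python using the google-cloud-storage client: it writes each payload to the bucket byte-for-byte before any parsing happens. The bucket name, object prefix, and source label are placeholders for this example, not values from the question.

    from datetime import datetime, timezone
    from google.cloud import storage  # pip install google-cloud-storage

    def store_raw_payload(payload: bytes, source: str,
                          bucket_name: str = "example-data-lake-raw") -> str:
        """Persist a payload exactly as retrieved so it can be reprocessed later."""
        client = storage.Client()
        bucket = client.bucket(bucket_name)

        # Partition raw objects by source and ingestion time. Nothing is parsed
        # or validated here, so a change in the source structure cannot break ingestion.
        object_name = "raw/{}/{}.bin".format(
            source, datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M%S%f"))
        blob = bucket.blob(object_name)
        blob.upload_from_string(payload, content_type="application/octet-stream")
        return object_name

Because the object is written unmodified, a future version of the processing pipelines can re-read exactly the same bytes if the current pipelines cannot handle a new structure.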

Option A: Send the data through the processing pipeline, and then store the processed data in a BigQuery table for reprocessing. Because only the processed output is retained, the raw data is not stored as it was retrieved; if the source structure changes and breaks the current pipelines, there is nothing left to reprocess.

Option B: Store the data in a BigQuery table. Design the processing pipelines to retrieve the data from the table. Loading the data into a table does not preserve it exactly as it was retrieved, and BigQuery expects a defined schema, which makes it a poor fit for unstructured data whose structure can change at any time.

Option C: Send the data through the processing pipeline, and then store the processed data in a Cloud Storage bucket for reprocessing. Although Cloud Storage is the right storage layer, only the processed output is kept. The raw data, exactly as retrieved, is lost, so it cannot be reprocessed if its structure turns out to be incompatible with the current pipelines.

Option D: Store the data in a Cloud Storage bucket. Design the processing pipelines to retrieve the data from the bucket. This approach meets all of the requirements: Cloud Storage can hold unstructured objects of any format, the raw data is preserved exactly as it was retrieved, and the processing pipelines read from the bucket, so the same objects can be reprocessed whenever the structure changes or the pipelines are updated to handle a new schema.

Therefore, the correct answer is D. Store the data in a Cloud Storage bucket. Design the processing pipelines to retrieve the data from the bucket.
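To illustrate how the pipelines in option D might consume the bucket, here is a minimal sketch that assumes the raw objects happen to be JSON documents under a raw/website/ prefix; the bucket name and the field names (user_id, item_id) are assumptions made for this example only.

    import json
    from google.cloud import storage  # pip install google-cloud-storage

    def iter_usable_events(bucket_name: str = "example-data-lake-raw",
                           prefix: str = "raw/website/"):
        """Read raw objects back from Cloud Storage and yield the ones this pipeline understands."""
        client = storage.Client()
        for blob in client.list_blobs(bucket_name, prefix=prefix):
            record = json.loads(blob.download_as_bytes())

            # Tolerate schema evolution: skip records that lack the fields this
            # pipeline needs instead of failing. The raw objects stay in the
            # bucket, so they can be reprocessed once the pipeline is updated.
            if "user_id" not in record or "item_id" not in record:
                continue
            yield record["user_id"], record["item_id"]

Records the current pipeline cannot interpret are skipped rather than rejected; because the raw objects remain in the bucket unchanged, they can be picked up again once the pipeline supports the new structure.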