You are building a real-time prediction engine that streams files which may contain Personally Identifiable Information (PII) to Google Cloud.
You want to use the Cloud Data Loss Prevention (DLP) API to scan the files.
How should you ensure that the PII is not accessible by unauthorized individuals?
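A. Stream all files to Google Cloud, and then write the data to BigQuery. Periodically conduct a bulk scan of the table using the DLP API.
B. Stream all files to Google Cloud, and write batches of the data to BigQuery. While the data is being written to BigQuery, conduct a bulk scan of the data using the DLP API.
C. Create two buckets of data: Sensitive and Non-sensitive. Write all data to the Non-sensitive bucket. Periodically conduct a bulk scan of that bucket using the DLP API, and move the sensitive data to the Sensitive bucket.
D. Create three buckets of data: Quarantine, Sensitive, and Non-sensitive. Write all data to the Quarantine bucket. Periodically conduct a bulk scan of that bucket using the DLP API, and move the data to either the Sensitive or Non-sensitive bucket.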
The Cloud Data Loss Prevention (DLP) API can detect PII in the streamed files, but detection alone does not control access. What decides whether unauthorized individuals can reach the PII is where the data sits, and who can read that location, before the scan has classified it.
Option A streams all files to Google Cloud, writes the data to BigQuery, and periodically runs a bulk scan of the table using the DLP API. This does not keep PII away from unauthorized individuals: between scans, unscanned PII sits in the table and is readable by anyone with access to it.
Option B streams all files to Google Cloud and writes batches of the data to BigQuery, running a bulk DLP scan while the data is being written. This shortens the exposure window compared with option A, but the data still lands in BigQuery before it has been classified, so PII can be readable in the table.
Option C creates two buckets: Sensitive and Non-sensitive. All data is written to the Non-sensitive bucket, which is scanned periodically using the DLP API, and sensitive data is then moved to the Sensitive bucket. The weakness is that every unscanned file, including any that contains PII, lands in the Non-sensitive bucket first, so PII is exposed to anyone who can read that bucket until the next scan moves it.
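The exposure window in the two-bucket flow can be made concrete with a minimal Python sketch. Everything here is an assumption for illustration: buckets are modelled as plain dictionaries rather than Cloud Storage buckets, and `looks_sensitive` is a hypothetical regex stand-in for a DLP scan, not a call to the DLP API.

```python
import re

# Hypothetical stand-in for a DLP inspection: flags SSN-like patterns only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def looks_sensitive(content):
    """Return True if the content matches the toy PII pattern."""
    return bool(SSN_PATTERN.search(content))


def exposed_objects(non_sensitive_bucket):
    """Names of objects containing PII that sit in the widely readable bucket."""
    return sorted(
        name for name, content in non_sensitive_bucket.items()
        if looks_sensitive(content)
    )


def periodic_scan(non_sensitive_bucket, sensitive_bucket):
    """The Option C cleanup step: move flagged objects to the Sensitive bucket."""
    for name in list(non_sensitive_bucket):
        if looks_sensitive(non_sensitive_bucket[name]):
            sensitive_bucket[name] = non_sensitive_bucket.pop(name)


# All incoming files are written to the Non-sensitive bucket first.
non_sensitive = {"a.txt": "temperature=21", "b.txt": "ssn=123-45-6789"}
sensitive = {}

before = exposed_objects(non_sensitive)   # PII readable until the scan runs
periodic_scan(non_sensitive, sensitive)
after = exposed_objects(non_sensitive)    # nothing flagged remains

print(before, after)
```

In this sketch `b.txt` is readable in the Non-sensitive bucket for the whole interval between its upload and the next scan, which is the gap that makes this design unsuitable.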
Option D creates three buckets: Quarantine, Sensitive, and Non-sensitive. All data is written to the Quarantine bucket, which is scanned periodically using the DLP API, and each file is then moved to either the Sensitive or the Non-sensitive bucket. The difference from option C is that unclassified data never lands in a bucket intended for general access: it stays in quarantine, where access can be tightly restricted, until the scan has classified it.
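The quarantine flow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a Cloud implementation: the three buckets are in-memory dictionaries, and `scan_for_pii` is a hypothetical regex stand-in for the DLP API. The returned labels borrow DLP infoType names only for readability.

```python
import re
from dataclasses import dataclass, field

# Toy detectors standing in for a DLP scan (assumption, not the DLP API).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scan_for_pii(content):
    """Return a list of labels for the PII patterns found in the content."""
    findings = []
    if EMAIL.search(content):
        findings.append("EMAIL_ADDRESS")
    if SSN.search(content):
        findings.append("US_SOCIAL_SECURITY_NUMBER")
    return findings


@dataclass
class Buckets:
    quarantine: dict = field(default_factory=dict)
    sensitive: dict = field(default_factory=dict)
    non_sensitive: dict = field(default_factory=dict)


def ingest(buckets, name, content):
    """Every incoming file is written to the Quarantine bucket only."""
    buckets.quarantine[name] = content


def classify_quarantine(buckets, scan=scan_for_pii):
    """Scan each quarantined file and move it to its final bucket."""
    destinations = {}
    for name in list(buckets.quarantine):
        content = buckets.quarantine.pop(name)
        if scan(content):
            buckets.sensitive[name] = content
            destinations[name] = "sensitive"
        else:
            buckets.non_sensitive[name] = content
            destinations[name] = "non_sensitive"
    return destinations


buckets = Buckets()
ingest(buckets, "readings.csv", "sensor=7,value=42")
ingest(buckets, "contacts.csv", "alice@example.com,123-45-6789")
result = classify_quarantine(buckets)
print(result)
```

Before `classify_quarantine` runs, neither final bucket holds any data, so restricting access to the Quarantine bucket is enough to protect unscanned files.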
Overall, option D provides the highest level of protection, because no data reaches the Sensitive or Non-sensitive bucket until it has been scanned and classified. Options A, B, and C all leave unscanned PII in a location that is not set up to protect it. In practice, the right design still depends on the specific requirements and constraints of the use case.