Data Lake Security Measures for Machine Learning Data
Question
You work as a data scientist manager at a large financial services firm where your team is responsible for building machine learning solutions such as price prediction of equities, futures, and options.
You need petabytes of data from dozens of sources internal and external to your organization.
All external data sources are contractually constrained as to where the data is used and who has access to the data.
Your machine learning models require these data to be stored in a data lake that allows quick retrieval.
You have chosen to use S3 to house your data lake.
How will you most efficiently protect this data lake, your machine learning data source, against internal threats to data confidentiality and security?
Answers
A. Create IAM resource-based policies for each data lake S3 bucket. Use bucket policies and access control lists (ACLs) to control access at the bucket level and the object level.
B. Create IAM user policies that tie permissions for your S3 data lake assets to user roles and permissions. Place your data scientists into IAM groups and assign the user policies to those groups.
C. Create an access key ID and a secret access key for each internal user of your S3 data lake. Internal users gain access to the data lake only through these keys.
D. Use the AWS CloudHSM cloud-based hardware security module (HSM) to secure your S3 data lake. Internal users use encryption keys generated by CloudHSM to gain access to the data.
Explanations
Answer: B.
Option A is incorrect because it is a very inefficient way to secure a data lake.
Most large data lakes contain large numbers of buckets and objects.
Using resource-based policies would mean creating an extensive set of policies to secure the data lake.
Option B is correct.
Per the AWS white paper Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility - Securing, Protecting, and Managing Data, “for most data lake environments, we recommend using user policies so that permissions to access data assets can also be tied to user roles and permissions for the data processing and analytics services and tools that your data lake users will use.”
Option C is incorrect because access keys are used primarily for applications running outside the AWS environment.
For resources running inside AWS, as is the case in this scenario, the best practice is to use IAM roles and policies.
(See AWS Security blog entry Guidelines for protecting your AWS account while using programmatic access: https://aws.amazon.com/blogs/security/guidelines-for-protecting-your-aws-account-while-using-programmatic-access/)
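To make the contrast concrete, below is a minimal sketch in Python with boto3 of the role-based pattern the blog post recommends; the role ARN and session name are hypothetical placeholders. The application assumes an IAM role through STS and receives short-lived, auto-expiring credentials instead of a permanent access key pair.

```python
import boto3

sts = boto3.client("sts")

# Assume an IAM role to obtain short-lived, auto-expiring credentials
# instead of distributing a permanent access key ID / secret access key.
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/DataLakeReadRole",  # hypothetical
    RoleSessionName="price-prediction-training",  # hypothetical
)
creds = response["Credentials"]

# Use the temporary credentials to read from the data lake.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```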
Option D is incorrect.
The CloudHSM module is used to generate and manage encryption keys, not to manage user access to S3.
Since this scenario involves users internal to AWS, IAM role-based security is the best practice.
Reference:
Please see the AWS white paper Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility - Securing, Protecting, and Managing Data.
As the data science manager, you are responsible for protecting petabytes of data from dozens of sources internal and external to your organization. All external data sources are contractually constrained as to where the data is used and who has access to it, and you have chosen S3 to house your data lake.
To protect the data lake against internal threats to data confidentiality and security most efficiently, you need the right access-control model. Among the given options, the most suitable approach is to create IAM user policies that tie permissions for your data lake assets to user roles and permissions, placing your data scientists into IAM groups and assigning the policies to those groups (Option B).
Option A offers granular control. IAM resource-based policies let you manage access to S3 resources such as buckets and objects: bucket policies specify who can access a bucket, which actions they can perform on the bucket and its contents, and under what conditions, while access control lists (ACLs) can grant read and write permissions on individual objects. However, most large data lakes contain large numbers of buckets and objects, so maintaining a separate resource policy for each one is very inefficient, as the per-bucket sketch below illustrates.
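For illustration, here is a minimal sketch of what the Option A pattern involves; the bucket name, account ID, and principal ARN are hypothetical placeholders. A policy like this would have to be written and maintained for every bucket in the data lake.

```python
import json
import boto3

s3 = boto3.client("s3")

# One resource-based policy, scoped to a single bucket.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDataScienceRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/DataScienceRole"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::equities-data-lake",
                "arn:aws:s3:::equities-data-lake/*",
            ],
        }
    ],
}

# Repeat this call (with a hand-tailored policy) for every bucket.
s3.put_bucket_policy(
    Bucket="equities-data-lake",  # hypothetical bucket name
    Policy=json.dumps(bucket_policy),
)
```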
Option B creates IAM user policies, which tie permissions for your S3 data lake assets to user roles and permissions. You place your data scientists into IAM groups and assign the user policies to those groups; these policies also define access to the data processing and analytics services your data scientists will use. Because permissions are managed once per group rather than once per bucket or object, this is the most efficient way to protect the data lake against internal threats, and it is the approach AWS recommends for most data lake environments.
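By contrast, a minimal sketch of the Option B pattern shows how one user policy attached to one group can cover the whole data lake; the group name, policy name, and bucket-naming convention are hypothetical placeholders.

```python
import json
import boto3

iam = boto3.client("iam")

# Group all data scientists so permissions are managed in one place.
iam.create_group(GroupName="DataScientists")

# One policy covers every data-lake bucket via a naming convention,
# rather than one resource policy per bucket.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::datalake-*",
                "arn:aws:s3:::datalake-*/*",
            ],
        }
    ],
}

policy = iam.create_policy(
    PolicyName="DataLakeReadAccess",
    PolicyDocument=json.dumps(policy_document),
)

# Every member of the group inherits the permissions.
iam.attach_group_policy(
    GroupName="DataScientists",
    PolicyArn=policy["Policy"]["Arn"],
)
```

Adding or removing a data scientist then becomes a single group-membership change, with no per-bucket policy edits.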
Option C involves creating an access key ID and a secret access key for each internal user of your S3 data lake, with internal users gaining access only through these keys. Access keys are intended primarily for programmatic access from outside AWS; for identities operating inside AWS, long-lived keys are harder to rotate and audit than roles, so this is not an efficient or secure way to manage access for a large number of users and data sources.
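For completeness, a minimal sketch of the Option C mechanism follows; the user name is a hypothetical placeholder. Each key pair is long-lived and must be issued, rotated, and revoked per user, which is the operational burden that role-based access avoids.

```python
import boto3

iam = boto3.client("iam")

# Generates a permanent access key ID / secret access key pair for one user.
key = iam.create_access_key(UserName="analyst-jdoe")  # hypothetical user

print(key["AccessKey"]["AccessKeyId"])
# The secret is returned only once, at creation time:
# key["AccessKey"]["SecretAccessKey"]
```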
Option D suggests using the AWS CloudHSM cloud-based hardware security module (HSM) to secure your S3 data lake, with internal users relying on encryption keys generated by the CloudHSM module to reach the data needed for their machine learning models. CloudHSM generates and manages encryption keys; it does not manage user access to S3, and it adds complexity and cost without addressing the access-control requirement.
In summary, the most efficient way to protect your S3 data lake against internal threats to data confidentiality and security is to create IAM user policies tied to user roles and permissions, place your data scientists into IAM groups, and assign those policies to the groups.