You recently joined an enterprise-scale company that has thousands of datasets.
You know that there are accurate descriptions for each table in BigQuery, and you are searching for the proper BigQuery table to use for a model you are building on AI Platform.
How should you find the data that you need?
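A. Use Data Catalog to search for BigQuery datasets based on keywords found in the table descriptions.
B. Tag each model and version resource on AI Platform with the name of the BigQuery table used for training.
C. Maintain a lookup table in BigQuery that maps table descriptions to their respective table IDs.
D. Query the lookup table to find the correct table ID for the data needed.
E. Execute a query in BigQuery to retrieve all existing table names in the project using the INFORMATION_SCHEMA metadata tables.
Listed answer: B.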
A machine learning engineer at an enterprise-scale company can struggle to find the right BigQuery table among thousands of datasets. The multiple-choice answers describe several ways to approach the problem, and each is discussed below.
Option A suggests using Data Catalog to search for BigQuery datasets based on keywords found in the table descriptions. Data Catalog is a fully managed and scalable metadata management service that helps organizations discover, understand, and manage their data assets. It catalogs BigQuery metadata automatically, so you can find tables by searching their descriptions, tags, and other metadata without building anything yourself. Because the scenario states that every table already has an accurate description, this option uses the existing metadata directly and is designed for organizations with thousands of datasets.
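A description-based search of this kind might look like the sketch below. It assumes the `google-cloud-datacatalog` Python client is installed and that the caller has permission to search the listed projects; the helper names `build_description_query` and `search_tables` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List


def build_description_query(keywords: Iterable[str]) -> str:
    """Build a Data Catalog search string restricted to BigQuery tables."""
    terms = [f"description:{kw.strip().lower()}" for kw in keywords if kw.strip()]
    if not terms:
        raise ValueError("at least one non-empty keyword is required")
    # Whitespace-separated qualifiers are combined with a logical AND.
    return " ".join(["system=bigquery", "type=table", *terms])


def search_tables(project_ids: List[str], query: str) -> List[str]:
    """Run the search and return the linked resource names of matching entries."""
    # Imported lazily so the query builder above works without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    scope = datacatalog_v1.SearchCatalogRequest.Scope()
    scope.include_project_ids.extend(project_ids)  # assumption: search by project
    results = client.search_catalog(scope=scope, query=query)
    return [result.linked_resource for result in results]
```

For example, `search_tables(["my-project"], build_description_query(["churn"]))` would return the resource names of BigQuery tables whose description mentions "churn", where `my-project` is a placeholder project ID.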
Option B proposes that you tag each model and version resource on AI Platform with the name of the BigQuery table used for training. AI Platform is a machine learning platform that enables you to build, deploy, and manage machine learning models at scale. Tagging models and versions with the table name lets you track which data each model was trained on. That is useful for recording the data sources of existing models, but it only documents tables after they have been chosen; it does not help you discover a table you have not used yet.
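As a rough illustration of the tagging idea, the sketch below turns a BigQuery table ID into a value that could be stored as a resource label. It assumes the usual Google Cloud label constraints (lowercase letters, digits, underscores, and dashes, at most 63 characters); the label key `training_table` and both function names are hypothetical.

```python
import re

MAX_LABEL_LENGTH = 63  # assumed label length limit


def table_id_to_label_value(table_id: str) -> str:
    """Convert 'project.dataset.table' into a label-safe string."""
    # Dots and uppercase letters are not allowed in label values, so normalize them.
    normalized = re.sub(r"[^a-z0-9_-]", "_", table_id.lower())
    # Truncation is lossy: very long table IDs cannot be recovered from the label.
    return normalized[:MAX_LABEL_LENGTH]


def training_table_labels(table_id: str) -> dict:
    """Build a labels mapping to attach to a model or version resource."""
    return {"training_table": table_id_to_label_value(table_id)}
```

Because the conversion replaces dots and may truncate, the label is a hint for humans rather than a reliable pointer back to the table.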
Option C suggests maintaining a lookup table in BigQuery that maps table descriptions to their respective table IDs. This involves creating a separate table that stores the descriptions and IDs of all tables so that you can search by description. It duplicates metadata that BigQuery already holds, and the lookup table must be updated by hand whenever a table is created, renamed, or deleted, which is time-consuming and prone to errors.
Option D recommends querying the lookup table to find the correct table ID for the data needed. Once the lookup table is set up, you can use SQL queries to retrieve the table ID for the dataset you need. This option can be useful when you need to find a specific dataset quickly, but it requires manual maintenance of the lookup table and may not scale well in large organizations.
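The maintenance cost of a hand-built lookup table can be made concrete with a small drift check. This is a minimal sketch in plain Python, assuming the lookup table has been loaded into a dictionary of description to table ID and that the set of live table IDs comes from elsewhere; `lookup_drift` is a hypothetical helper.

```python
from typing import Dict, Set, Tuple


def lookup_drift(
    lookup: Dict[str, str], live_table_ids: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (tables missing from the lookup, lookup entries with no live table)."""
    listed = set(lookup.values())
    missing = live_table_ids - listed   # live tables nobody added to the lookup
    orphaned = listed - live_table_ids  # lookup rows pointing at deleted tables
    return missing, orphaned


lookup = {
    "Daily order totals": "sales.orders_daily",
    "Customer churn features": "ml.churn_features_v1",
}
live = {"sales.orders_daily", "ml.churn_features_v2"}
missing, orphaned = lookup_drift(lookup, live)
```

Here `missing` contains `ml.churn_features_v2` and `orphaned` contains `ml.churn_features_v1`: a single table replacement already leaves the lookup table wrong in both directions until someone updates it.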
Option E suggests executing a query in BigQuery to retrieve all existing table names in the project using the INFORMATION_SCHEMA metadata tables. INFORMATION_SCHEMA is a set of read-only views, queried with standard SQL, that expose metadata about BigQuery objects. Querying these views gives you a list of table names, which you would then have to scan to find the table you need. The approach works for a single project, but the views are scoped to a dataset or region, and a list of names alone does not take advantage of the accurate descriptions mentioned in the question, so it becomes laborious across thousands of datasets.
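A metadata query of this kind can be sketched as follows. The example goes one step beyond listing names and filters on the description stored in the `INFORMATION_SCHEMA.TABLE_OPTIONS` view for one dataset. It assumes the `google-cloud-bigquery` client is installed; the function names and the `@keyword` parameter name are illustrative choices, not part of the original question.

```python
from typing import List, Tuple


def description_search_sql(project: str, dataset: str) -> str:
    """Build SQL that finds tables in one dataset whose description matches @keyword."""
    return (
        "SELECT table_name, option_value AS description\n"
        f"FROM `{project}`.{dataset}.INFORMATION_SCHEMA.TABLE_OPTIONS\n"
        "WHERE option_name = 'description'\n"
        "  AND LOWER(option_value) LIKE CONCAT('%', LOWER(@keyword), '%')"
    )


def find_tables(project: str, dataset: str, keyword: str) -> List[Tuple[str, str]]:
    """Run the query and return (table_name, description) pairs."""
    # Imported lazily so the SQL builder above works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("keyword", "STRING", keyword)]
    )
    rows = client.query(description_search_sql(project, dataset), job_config=job_config).result()
    return [(row.table_name, row.description) for row in rows]
```

Note that the query covers a single dataset, so searching a whole organization this way would mean repeating it for every dataset or region.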
Overall, the best approach to finding the data needed for a model on AI Platform will depend on the specific requirements of the project, the size and complexity of the organization's data infrastructure, and the availability of resources and tools. Therefore, it is important to consider all available options and choose the one that best suits your needs.