TerramEarth manufactures heavy equipment for the mining and agricultural industries.
About 80% of their business is from mining and 20% from agriculture.
They currently have over 500 dealers and service centers in 100 countries.
Their mission is to build products that make their customers more productive.
Solution Concept - There are 20 million TerramEarth vehicles in operation that collect 120 fields of data per second.
Data is stored locally on the vehicle and can be accessed for analysis when a vehicle is serviced.
The data is downloaded via a maintenance port.
This same port can be used to adjust operational parameters, allowing the vehicles to be upgraded in the field with new computing modules.
Approximately 200,000 vehicles are connected to a cellular network, allowing TerramEarth to collect data directly.
At a rate of 120 fields of data per second, with 22 hours of operation per day, TerramEarth collects a total of about 9 TB/day from these connected vehicles.
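As a rough sanity check, the daily volume is consistent with the stated rate if each field averages about 5 bytes on the wire (an assumption; the scenario does not give a per-field size):

```python
# Back-of-the-envelope check of the ~9 TB/day figure.
# The 5 bytes/field value is an assumption, not stated in the scenario.
vehicles = 200_000
fields_per_second = 120
seconds_per_day = 22 * 3600  # 22 hours of operation per day
bytes_per_field = 5

bytes_per_day = vehicles * fields_per_second * seconds_per_day * bytes_per_field
print(f"{bytes_per_day / 1e12:.1f} TB/day")  # ~9.5 TB/day
```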
Existing Technical Environment - TerramEarth's existing architecture is composed of Linux and Windows-based systems that reside in a single U.S. west coast-based data center.
These systems gzip CSV files from the field, upload them via FTP, and place the data in their data warehouse.
Because this process takes time, aggregated reports are based on data that is 3 weeks old.
With this data, TerramEarth has been able to preemptively stock replacement parts and reduce unplanned downtime of their vehicles by 60%.
However, because the data is stale, some customers are without their vehicles for up to 4 weeks while they wait for replacement parts.
Business Requirements -
- Decrease unplanned vehicle downtime to less than 1 week
- Support the dealer network with more data on how their customers use their equipment to better position new products and services
- Have the ability to partner with different companies, especially with seed and fertilizer suppliers in the fast-growing agricultural business, to create compelling joint offerings for their customers

Technical Requirements -
- Expand beyond a single datacenter to decrease latency to the American midwest and east coast
- Create a backup strategy
- Increase security of data transfer from equipment to the datacenter
- Improve data in the data warehouse
- Use customer and equipment data to anticipate customer needs

Application 1: Data ingest - A custom Python application reads uploaded datafiles from a single server and writes to the data warehouse.
Compute:
- Windows Server 2008 R2
- 16 CPUs
- 128 GB of RAM
- 10 TB local HDD storage

Application 2: Reporting - An off-the-shelf application that business analysts use to run a daily report to see what equipment needs repair.
Only 2 analysts of a team of 10 (5 west coast, 5 east coast) can connect to the reporting application at a time.
Compute:
- Off-the-shelf application; license tied to the number of physical CPUs
- Windows Server 2008 R2
- 16 CPUs
- 32 GB of RAM
- 500 GB HDD

Data warehouse:
- A single PostgreSQL server
- RedHat Linux
- 64 CPUs
- 128 GB of RAM
- 4x 6 TB HDD in RAID 0

Executive Statement -
Suggested Answer: A
Based on the provided scenario, TerramEarth is collecting a large amount of data from their vehicles that could be utilized to improve their products and services, as well as reduce vehicle downtime for their customers. The current data infrastructure of TerramEarth is not meeting the business and technical requirements, which include decreasing vehicle downtime, expanding to other regions, improving data security, and analyzing data in real-time.
To address these requirements, TerramEarth needs to implement a cloud-based solution that can handle the high volume of data, provide real-time analytics, and offer scalability and availability. Google Cloud Platform (GCP) offers a suite of services that can meet these requirements. Among the available options, option A seems to be the best fit for TerramEarth.
Option A: Use BigQuery as the data warehouse. Connect all vehicles to the network and stream data into BigQuery using Cloud Pub/Sub and Cloud Dataflow. Use Google Data Studio for analysis and reporting.
Explanation: BigQuery is a fully managed, cloud-native data warehouse that offers scalability, performance, and real-time analytics. It can handle large volumes of data and provide insights through SQL queries, machine learning, and other tools. BigQuery also integrates with other GCP services, such as Cloud Dataflow and Cloud Pub/Sub, to support real-time data ingestion and processing.
By connecting all vehicles to the network, TerramEarth can collect and stream data into BigQuery using Cloud Pub/Sub, which is a messaging service that can handle large volumes of data and support real-time data transfer. Cloud Dataflow can be used to transform and process the data before storing it in BigQuery.
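As a minimal sketch of what this ingestion path could look like, the Apache Beam (Cloud Dataflow) pipeline below reads telemetry messages from a Pub/Sub subscription and streams them into BigQuery. The subscription, table, and field names are hypothetical, since the scenario does not define a schema:

```python
# Minimal streaming sketch: Pub/Sub -> Dataflow (Apache Beam) -> BigQuery.
# Subscription, table, and field names are hypothetical examples.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_message(message: bytes) -> dict:
    """Decode one JSON-encoded telemetry message into a BigQuery row dict."""
    record = json.loads(message.decode("utf-8"))
    return {
        "vehicle_id": record["vehicle_id"],
        "timestamp": record["timestamp"],
        "engine_temp": record.get("engine_temp"),  # one of the 120 fields
    }


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadTelemetry" >> beam.io.ReadFromPubSub(
                subscription="projects/PROJECT/subscriptions/vehicle-telemetry"
            )
            | "ParseJson" >> beam.Map(parse_message)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "PROJECT:telemetry.vehicle_data",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```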
Google Data Studio can be used to create custom dashboards and reports that can provide real-time insights into the data. This can help TerramEarth to identify potential issues with their vehicles and reduce unplanned downtime. Data Studio can also be used to share insights with dealers and service centers, which can help them to better understand their customers and improve their services.
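For example, a report query along the following lines could flag vehicles that may need service; the table and column names are again hypothetical, since the scenario gives no schema:

```python
# Hypothetical repair-report query against the streamed telemetry table.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT vehicle_id,
           MAX(timestamp) AS last_report,
           AVG(engine_temp) AS avg_engine_temp
    FROM `PROJECT.telemetry.vehicle_data`
    WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    GROUP BY vehicle_id
    HAVING avg_engine_temp > 110  -- example threshold for flagging service
"""

for row in client.query(query).result():
    print(row.vehicle_id, row.last_report, row.avg_engine_temp)
```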
Option B: Use BigQuery as the data warehouse. Connect all vehicles to the network and upload gzip files to a Multi-Regional Cloud Storage bucket using gcloud. Use Google Data Studio for analysis and reporting.
Explanation: Option B is similar to Option A, but instead of using Cloud Pub/Sub and Cloud Dataflow for real-time data processing, it suggests uploading gzip files to a Multi-Regional Cloud Storage bucket using gcloud. This method is not suitable for real-time data analysis and can result in stale data. It may also require more storage and processing resources than necessary.
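For reference, the batch path in Option B amounts to an object upload like the sketch below (shown with the google-cloud-storage Python client rather than gcloud; the bucket and object names are made up):

```python
# Batch-style upload of one gzip-compressed CSV to Cloud Storage.
# Bucket and object names are hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("terramearth-telemetry")

blob = bucket.blob("uploads/vehicle-12345/2024-01-01.csv.gz")
blob.upload_from_filename("2024-01-01.csv.gz", content_type="application/gzip")
```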
Option C: Use Cloud Dataproc Hive as the data warehouse. Upload gzip files to a Multi-Regional Cloud Storage bucket. Upload this data into BigQuery using gcloud. Use Google Data Studio for analysis and reporting.
Explanation: Option C suggests using Cloud Dataproc Hive as the data warehouse, which is a managed Hadoop service that can be used for data processing and analysis. However, Hive is not suitable for real-time data ingestion and analysis. It may also require additional resources for managing and maintaining the Hadoop cluster.
Uploading gzip files to a Multi-Regional Cloud Storage bucket is similar to Option B and may result in stale data. Uploading data to BigQuery using gcloud can be slow and inefficient compared to using Cloud Pub/Sub and Cloud Dataflow for real-time data ingestion.
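A batch load of those gzip CSV files into BigQuery would look roughly like the sketch below (hypothetical URIs and table name). Note that gzip-compressed CSV is not splittable, so BigQuery cannot parallelize the load within a file, which is part of why this path is slower than streaming ingestion:

```python
# Batch load of gzip CSV files from Cloud Storage into BigQuery.
# Bucket path and destination table are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # BigQuery decompresses gzip CSV automatically
)

load_job = client.load_table_from_uri(
    "gs://terramearth-telemetry/uploads/*.csv.gz",
    "PROJECT.telemetry.vehicle_data",
    job_config=job_config,
)
load_job.result()  # block until the batch load completes
```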
Option D: Use Cloud Dataproc Hive as the data warehouse. Directly stream data into partitioned Hive tables. Use Pig scripts to analyze data.
Explanation: Option D suggests using Cloud Dataproc Hive and Pig for data processing and analysis. This approach requires managing and maintaining a Hadoop cluster, which can be complex and resource-intensive. While streaming data directly into partitioned Hive tables keeps the data fresher than batch uploads, Hive is still not designed for low-latency analytics, and Pig scripts are a poor substitute for the SQL-based reporting the business analysts need. It would also require additional resources for ongoing cluster management and maintenance.