Performing Data Aggregations and Count of Distinct Data Frame Operations with PySpark in Databricks | Whizlabs Inc

Question

Henry is a Data Engineer at Whizlabs Inc working on Databricks Spark Streaming.

He is using PySpark to develop dataframes.

He needs to perform data aggregation and count-distinct operations on a dataframe.

Which of the following is the correct code snippet in this scenario?

Answers

Explanations


Correct Answer: A.

The correct code snippet in this scenario is:

select("emp_id", "emp_name").groupBy("emp_id").agg(countDistinct("emp_name").alias("distinct_emp_name")).display()

Explanation:

This snippet performs the aggregation and count-distinct operations on the dataframe. Let's break it down step by step:

Step 1: select("emp_id", "emp_name") selects the two columns "emp_id" and "emp_name" from the dataframe.

Step 2: groupBy("emp_id") groups the rows by the "emp_id" column.

Step 3: agg(countDistinct("emp_name").alias("distinct_emp_name")) applies the countDistinct aggregation to the "emp_name" column within each group and aliases the result as "distinct_emp_name".

Step 4: display() renders the result as a table in the Databricks notebook. (display() is Databricks-specific; outside Databricks, show() serves the same purpose.)

Therefore, the correct code snippet is:

select("emp_id", "emp_name").groupBy("emp_id").agg(countDistinct("emp_name").alias("distinct_emp_name")).display()