Infosys Certified PySpark Professional

Question 1

Consider a scenario where an HDFS file is divided into four blocks and is to be processed by a Spark application. During RDD creation, how is the data in each block represented?

Accepted Answer

Partitions.

Question 2

Consider a scenario in which a large amount of data is processed in a distributed manner by a Spark program. Where is the actual data to be processed stored?

Accepted Answer

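Worker nodes. The executors on the worker nodes hold the partitions being processed; the driver only coordinates the job.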

Question 3

Consider a scenario where a few partitions of an RDD are lost while a Spark job is executing. Which of the following components facilitates the recreation of the lost partitions?

Accepted Answer

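The RDD lineage graph (DAG), which records the transformations needed to recompute lost partitions. Accumulators are shared variables for aggregating values and play no part in recovery.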
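Spark recovers a lost partition by replaying the recorded transformations on the parent data. The sketch below is a plain-Python model of that idea, not Spark code: `ToyRDD` and its methods are hypothetical names invented for illustration, and the base data is assumed to remain available (as a file in HDFS would).

```python
# Plain-Python model of lineage-based recovery; this is not Spark code.
# ToyRDD, lose() and get() are hypothetical names used only for illustration.

class ToyRDD:
    def __init__(self, partitions=None, parent=None, func=None):
        self.parent = parent  # lineage: which dataset this one was derived from
        self.func = func      # lineage: the transformation that was applied
        if parent is None:
            # Base data, assumed to stay available (like a file in HDFS).
            self.partitions = [list(p) for p in partitions]
        else:
            self.partitions = [[func(x) for x in p] for p in parent.partitions]

    def map(self, func):
        """Derive a new dataset and remember how it was derived."""
        return ToyRDD(parent=self, func=func)

    def lose(self, index):
        """Simulate losing one computed partition of a derived dataset."""
        self.partitions[index] = None

    def get(self, index):
        """Return a partition, recomputing it from the parent if it was lost."""
        if self.partitions[index] is None:
            source = self.parent.get(index)
            self.partitions[index] = [self.func(x) for x in source]
        return self.partitions[index]


base = ToyRDD(partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)

doubled.lose(1)             # partition 1 is gone
recovered = doubled.get(1)  # rebuilt from base partition 1 via the stored function
print(recovered)
```

Only the lost partition is recomputed; partition 0 of `doubled` is left untouched.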

Question 4

Which of the following are features of the PySpark framework?

Accepted Answer

In-memory computation.

Question 5

Consider a scenario where a PySpark job is being deployed to a cluster. What does the "--master" parameter of the spark-submit command indicate?

Accepted Answer

It specifies the cluster manager (the master URL) used to run the Spark application.

Question 6

Sam works for a banking client and performs data analysis using Spark. Which of the following commands can he use to get the URL of the Spark web UI, where the details of jobs and executors can be viewed?

Accepted Answer

sc.uiWebUrl

Question 7

Which of the following operations may result in data skew, that is, an uneven distribution of data across partitions?

Accepted Answer

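A key-based shuffle operation such as groupByKey() or join(). cache() only stores the existing partitions and does not redistribute data, so it cannot introduce skew.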
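Skew typically appears when records are redistributed by key and one key dominates the dataset. The following plain-Python sketch models that with simple hash partitioning; it does not call Spark, and `partition_sizes` is a hypothetical helper written for this illustration. It assumes records are assigned to partitions by `hash(key) % num_partitions`.

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(i, 0) for i in range(num_partitions)]

# 1000 records with distinct integer keys spread evenly.
balanced_keys = list(range(1000))

# 1000 records where a single "hot" key (0) accounts for 900 of them.
skewed_keys = [0] * 900 + list(range(1, 101))

balanced = partition_sizes(balanced_keys, 4)
skewed = partition_sizes(skewed_keys, 4)

print(balanced)
print(skewed)
```

With the hot key, one partition receives most of the records while the others stay nearly empty, which is the pattern that slows down a single task in a real job.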

Question 8

Shane works on a data analytics project and needs to perform analysis on employee data (Employee.csv file).

Schema: EmployeeID, EmployeeName, Age, Salary, Department.

Which of the following code snippets can be used to sort the employees by department in descending order?

Accepted Answer

logsRDD = sc.textFile("/dataset/Employee")
FieldsRDD = logsRDD.map(lambda var1: var1.split(","))
logdata = FieldsRDD.map(lambda var1: (var1[4], [var1[0], var1[1], var1[2], var1[3]]))
sortdata = logdata.sortByKey(ascending=False, keyfunc=lambda k: k)
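The key/value reshaping in the snippet above can be checked without a cluster. This is a plain-Python sketch of the same logic, with `sorted(..., reverse=True)` standing in for `sortByKey(ascending=False)`; the three sample rows are made up for illustration and are not from Employee.csv.

```python
# Hypothetical sample rows following the schema:
# EmployeeID, EmployeeName, Age, Salary, Department
lines = [
    "101,Asha,30,50000,Finance",
    "102,Ben,41,62000,Sales",
    "103,Chen,28,48000,HR",
]

# Same steps as the RDD version: split each line, then key by Department.
fields = [line.split(",") for line in lines]
pairs = [(f[4], [f[0], f[1], f[2], f[3]]) for f in fields]

# Sort on the key in descending order, as sortByKey(ascending=False) would.
sorted_pairs = sorted(pairs, key=lambda kv: kv[0], reverse=True)
departments = [dept for dept, _ in sorted_pairs]

print(departments)
```

Note that all fields remain strings after `split(",")`, so the sort on Department is lexicographic.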

Question 9

Which of the following storage levels in the persist() method is equivalent to the cache() method?

Accepted Answer

MEMORY_ONLY.

Question 10

Consider the following code snippet:

def fun(x):
    return x.split(",")

rdd = sc.parallelize(["1002,John,20000", "1003,harry,7000", "1004,lookie,900"])
newrdd = rdd.map(fun)  # Line 1
newrdd.first()  # Line 2

Predict the correct output.

Accepted Answer

['1002', 'John', '20000']
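The result can be confirmed without Spark, because `map(fun)` applies `fun` to each element and `first()` returns the first transformed element. The sketch below uses a list comprehension as a stand-in for the RDD operations; `records`, `mapped` and `first` are names chosen for this illustration.

```python
def fun(x):
    return x.split(",")

records = ["1002,John,20000", "1003,harry,7000", "1004,lookie,900"]

mapped = [fun(r) for r in records]  # stands in for rdd.map(fun)
first = mapped[0]                   # stands in for newrdd.first()

print(first)
```

Each element of the result is a string, since `str.split` does not convert numeric-looking fields.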
