Problem Scenario 52 : You have been given below code snippet.
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

Operation_xyz -
Write a correct code snippet for Operation_xyz which will produce below output.
scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)

Solution :
b.countByValue
Returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: This operation finally aggregates the information in a single reducer.)

Listing Variants -
def countByValue(): Map[T, Long]
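
The counting behaviour described above can be illustrated without a Spark cluster. The following is a minimal sketch on plain Scala collections, not Spark's implementation; the object name CountByValueSketch and its helper are hypothetical and exist only for illustration.

```scala
object CountByValueSketch {
  // Plain-collections stand-in for RDD.countByValue():
  // group equal values together, then count the size of each group.
  def countByValue[T](items: Seq[T]): Map[T, Long] =
    items.groupBy(identity).map { case (value, group) => value -> group.size.toLong }

  def main(args: Array[String]): Unit = {
    val data = List(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1)
    // Sort by key only to make the printed order stable; a Map has no guaranteed order.
    println(countByValue(data).toSeq.sortBy(_._1))
  }
}
```

Because the result is a single in-memory Map, the same caveat as in the warning applies: all counts end up in one place, so this pattern suits data with a modest number of distinct values.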

Problem Scenario 81 : You have been given MySQL DB with following details. You have been given following product.csv file
product.csv
productID,productCode,name,quantity,price
1001,PEN,Pen Red,5000,1.23
1002,PEN,Pen Blue,8000,1.25
1003,PEN,Pen Black,2000,1.25
1004,PEC,Pencil 2B,10000,0.48
1005,PEC,Pencil 2H,8000,0.49
1006,PEC,Pencil HB,0,9999.99
Now accomplish following activities.
1. Create a Hive ORC table using SparkSql
2. Load this data in Hive table.
3. Create a Hive parquet table using SparkSQL and load data in it.

Solution :
Step 1 : Create this file in HDFS under following directory (without header)
Step 2 : Now using Spark-shell read the file as RDD
// load the data into a new RDD
val products = sc.textFile("/user/cloudera/he/exam/task1/product.csv")
// Return the first element in this RDD
products.first()
Step 3 : Now define the schema using a case class
case class Product(productid: Integer, code: String, name: String, quantity: Integer, price:
Step 4 : create an RDD of Product objects
val prdRDD = products.map(_.split(",")).map(p =>
Step 5 : Now create data frame
val prdDF = prdRDD.toDF()
Step 6 : Now store data in hive warehouse directory. (However, table will not be created)
import org.apache.spark.sql.SaveMode
prdDF.write.mode(SaveMode.Overwrite).format("orc").saveAsTable("product_orc_table")
Step 7 : Now create table using data stored in warehouse directory, with the help of hive.
hive
show tables;
CREATE EXTERNAL TABLE products (productid int, code string, name string, quantity int, price float)
STORED AS orc
LOCATION '/user/hive/warehouse/product_orc_table';
Step 8 : Now create a parquet table
import org.apache.spark.sql.SaveMode
prdDF.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("product_parquet_table")
Step 9 : Now create table using this
CREATE EXTERNAL TABLE products_parquet (productid int, code string, name string, quantity int, price float)
STORED AS parquet
LOCATION '/user/hive/warehouse/product_parquet_table';
Step 10 : Check data has been loaded or not.
Select * from products;
Select * from products_parquet;
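
Steps 3 and 4 (define a case class, then turn each CSV line into an object) can be tried without Spark. The following is a minimal sketch on plain Scala; the object name ProductParseSketch and the parse helper are hypothetical, and the Float type for price is an assumption taken from the Hive column definitions, since the case class line in the source is cut off.

```scala
object ProductParseSketch {
  // Field types mirror the Hive DDL (int, string, string, int, float).
  // Float for price is an assumption; the original case class line is truncated.
  case class Product(productid: Int, code: String, name: String, quantity: Int, price: Float)

  // Split one header-less CSV line on commas and convert each field to its type.
  def parse(line: String): Product = {
    val p = line.split(",")
    Product(p(0).toInt, p(1), p(2), p(3).toInt, p(4).toFloat)
  }

  def main(args: Array[String]): Unit = {
    val lines = List("1001,PEN,Pen Red,5000,1.23", "1006,PEC,Pencil HB,0,9999.99")
    lines.map(parse).foreach(println)
  }
}
```

In spark-shell the same function body would typically sit inside the second map call on the RDD of split lines. The sketch assumes clean input; a line with a missing field or a non-numeric quantity would throw an exception.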

Problem Scenario 19 : You have been given following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Now accomplish following activities.
1. Import departments table from mysql to hdfs as textfile in departments_text directory.
2. Import departments table from mysql to hdfs as sequencefile in departments_sequence directory.
3. Import departments table from mysql to hdfs as avro file in departments_avro directory.
4. Import departments table from mysql to hdfs as parquet file in departments_parquet directory.

Solution :
Step 1 : Import departments table from mysql to hdfs as textfile
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments \
--as-textfile \
--target-dir=departments_text
verify imported data
hdfs dfs -cat departments_text/part*
Step 2 : Import departments table from mysql to hdfs as sequencefile
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments \
--as-sequencefile \
--target-dir=departments_sequence
verify imported data
hdfs dfs -cat departments_sequence/part*
Step 3 : Import departments table from mysql to hdfs as avro file
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments \
--as-avrodatafile \
--target-dir=departments_avro
verify imported data
hdfs dfs -cat departments_avro/part*
Step 4 : Import departments table from mysql to hdfs as parquet file
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments \
--as-parquetfile \
--target-dir=departments_parquet
verify imported data
hdfs dfs -cat departments_parquet/part*

Problem Scenario 58 : You have been given below code snippet.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
operation1
Write a correct code snippet for operation1 which will produce desired output, shown below.
Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)),
(3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))

Solution :
b.groupByKey.collect
groupByKey [Pair]
Very similar to groupBy, but instead of supplying a function, the key component of each pair is automatically presented to the partitioner.

Listing Variants -
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
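
To see what keyBy followed by groupByKey produces, the pairing and grouping can be mimicked on plain Scala collections. This is only a sketch of the semantics, with no partitioner involved; the object name GroupByKeySketch and both helper functions are hypothetical stand-ins, not Spark APIs.

```scala
object GroupByKeySketch {
  // keyBy: pair every element with a key derived from it.
  def keyBy[K, V](items: Seq[V])(f: V => K): Seq[(K, V)] =
    items.map(v => (f(v), v))

  // groupByKey: gather all values that share a key, keeping their original order.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  def main(args: Array[String]): Unit = {
    val words = List("dog", "tiger", "lion", "cat", "spider", "eagle")
    val grouped = groupByKey(keyBy(words)(_.length))
    // Sort by key for stable printing; the grouping itself is unordered.
    grouped.toSeq.sortBy(_._1).foreach(println)
  }
}
```

On a real RDD the order of the groups, and of the values inside each group, depends on partitioning, which is why the expected output in the problem lists the keys as 4, 6, 3, 5.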

Problem Scenario 53 : You have been given below code snippet.
val a = sc.parallelize(1 to 10, 3)
operation1
b.collect

Output 1 -
Array[Int] = Array(2, 4, 6, 8, 10)

Output 2 -
Array[Int] = Array(1, 2, 3)
Write a correct code snippet for operation1 and operation2 which will produce desired output, shown above.

Solution :
val b = a.filter(_ % 2 == 0)
a.filter(_ < 4).collect
filter evaluates a boolean function for each data item of the RDD and puts the items for which the function returned true into the resulting RDD.
When you provide a filter function, it must be able to handle all data items contained in the RDD. Scala provides so-called partial functions to deal with mixed data types. (Tip: Partial functions are very useful if some of your data may be bad and you do not want to handle it, but you want to apply some kind of map function to the good (matching) data. Partial functions are written with case.)
Examples for mixed data without partial functions
val b = sc.parallelize(1 to 8)
b.filter(_ < 4).collect
res15: Array[Int] = Array(1, 2, 3)
val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
a.filter(_<4).collect
error: value < is not a member of Any
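
The partial-function approach mentioned in the tip can be sketched on a plain Scala List holding the same mixed data. The object and value names are hypothetical. The collect method with a partial function exists on Scala collections, and RDD offers an overload of collect that also takes a PartialFunction, so the same case expression should carry over, though that is not shown in the source.

```scala
object MixedFilterSketch {
  // Same mixed data as the failing example; its element type is Any.
  val mixed: List[Any] = List("cat", "horse", 4.0, 3.5, 2, "dog")

  // A partial function defined only for Int values below 4.
  // Strings and Doubles do not match the pattern and are skipped.
  val smallInts: PartialFunction[Any, Int] = { case i: Int if i < 4 => i }

  // collect applies the partial function where it is defined and drops everything else.
  val result: List[Int] = mixed.collect(smallInts)

  def main(args: Array[String]): Unit =
    println(result)
}
```

Unlike filter(_ < 4), this compiles because the comparison only runs after the pattern has narrowed the value to Int. Note that 3.5 is not kept: it is a Double, and the pattern matches Int only.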

