Which two of the following are true about this trivial Pig program? (Choose two.)
Answer : A,D
Which best describes what the map method accepts and emits?
Answer : D
Explanation: public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> extends Object
Maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks which transform input records into intermediate records.
The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
Reference: org.apache.hadoop.mapreduce, Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
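The "zero or many output pairs" point can be illustrated with a minimal plain-Java sketch. It deliberately does not use the Hadoop API: the class MapSketch and its map method are hypothetical names chosen for this example, and the word-count logic is only an assumed use case. Note that the input pair (Long, String) and the output pairs (String, Integer) have different types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is NOT the Hadoop Mapper class.
class MapSketch {
    // Hypothetical map function: takes one (offset, line) input pair and
    // returns zero or more (word, 1) intermediate pairs.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One input pair mapped to many output pairs.
        System.out.println(map(0L, "to be or not to be").size()); // prints 6
        // One input pair mapped to zero output pairs.
        System.out.println(map(19L, "   ").size()); // prints 0
    }
}
```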
Which one of the following Hive commands uses an HCatalog table named x?
Answer : C
Given the following Hive command:
INSERT OVERWRITE TABLE mytable SELECT * FROM myothertable;
Which one of the following statements is true?
Answer : B
Assuming default settings, which best describes the order of data provided to a reducer's reduce method?
Answer : D
Explanation: Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).
The shuffle and sort phases occur simultaneously, i.e., while outputs are being fetched they are merged.
SecondarySort -
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
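The secondary-sort idea described above (sort on the entire key, group on part of it) can be sketched in plain Java without any Hadoop classes. Everything here is an assumption made for illustration: CompositeKey, SORT, and shuffleAndGroup are hypothetical names, and the in-memory sort merely stands in for what the framework does across the network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; no Hadoop classes are used.
class SecondarySortSketch {
    // Hypothetical composite key: the natural key extended with a secondary key.
    static final class CompositeKey {
        final String natural;
        final int secondary;

        CompositeKey(String natural, int secondary) {
            this.natural = natural;
            this.secondary = secondary;
        }
    }

    // Sort comparator: orders by the entire key (natural, then secondary).
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.natural)
                      .thenComparingInt(k -> k.secondary);

    // Sorts by the entire key, then groups by the natural key only, which is
    // the role the grouping comparator plays in the description above.
    static Map<String, List<Integer>> shuffleAndGroup(List<CompositeKey> mapOutput) {
        List<CompositeKey> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (CompositeKey k : sorted) {
            groups.computeIfAbsent(k.natural, n -> new ArrayList<>()).add(k.secondary);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> mapOutput = Arrays.asList(
                new CompositeKey("b", 2), new CompositeKey("a", 9),
                new CompositeKey("b", 1), new CompositeKey("a", 3));
        // Each group corresponds to one reduce call; its values arrive sorted.
        System.out.println(shuffleAndGroup(mapOutput)); // prints {a=[3, 9], b=[1, 2]}
    }
}
```

Without the secondary key in the sort, only the order of the keys would be defined; the order of the values within a group would not.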
Table metadata in Hive is:
Answer : C
Explanation: By default, Hive uses an embedded Derby database to store metadata information. The metastore is the "glue" between Hive and HDFS. It tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, etc.
The Metastore is an application that runs on an RDBMS and uses an open source ORM layer called DataNucleus to convert object representations into a relational schema and vice versa. The Hive developers chose this approach, as opposed to storing this information in HDFS, because they need the Metastore to be very low latency. The DataNucleus layer allows them to plug in many different RDBMS technologies.
Note:
* By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used.
* Features of Hive include:
Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
Reference: Store Hive Metadata into RDBMS
You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface.
Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop.
Answer : C
Explanation: Configure the property using the -D key=value notation:
-D mapred.job.name='My Job'
You can list a whole bunch of options by calling the streaming jar with just the -info argument.
Reference: Python hadoop streaming : Setting a job name
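To make the -D key=value notation concrete, here is a minimal plain-Java sketch of how such options could be separated from the remaining job arguments. It is not Hadoop's own option parser: GenericOptionSketch and parseProperties are hypothetical names, the sketch handles only the space-separated "-D key=value" form shown above, and the "input" and "output" arguments are assumed placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is NOT Hadoop's option parsing code.
class GenericOptionSketch {
    // Collects every "-D key=value" pair into a map and appends all other
    // arguments, in order, to the caller-supplied 'remaining' list.
    static Map<String, String> parseProperties(String[] args, List<String> remaining) {
        Map<String, String> props = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            boolean isProperty = args[i].equals("-D")
                    && i + 1 < args.length
                    && args[i + 1].contains("=");
            if (isProperty) {
                // Split on the first '=' only, so values may contain '='.
                String[] kv = args[++i].split("=", 2);
                props.put(kv[0], kv[1]);
            } else {
                remaining.add(args[i]);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // Assumed example command line; "input" and "output" are placeholders.
        String[] cli = {"-D", "mapred.job.name=Example", "input", "output"};
        List<String> remaining = new ArrayList<>();
        Map<String, String> props = parseProperties(cli, remaining);
        System.out.println(props);     // prints {mapred.job.name=Example}
        System.out.println(remaining); // prints [input, output]
    }
}
```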
In Hadoop 2.0, which one of the following statements is true about a standby NameNode?
The Standby NameNode:
Answer : B
Which one of the following statements describes the relationship between the ResourceManager and the ApplicationMaster?
Answer : A
Review the following 'data' file and Pig code.
Answer : A
What does Pig provide to the overall Hadoop solution?
Answer : B
In the reducer, the MapReduce API provides you with an iterator over Writable values.
What does calling the next() method return?
Answer : C
Explanation: Calling Iterator.next() will always return the SAME EXACT instance of IntWritable, with the contents of that instance replaced with the next value.
Reference: manupulating iterator in mapreduce
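The practical consequence of this instance reuse can be shown with a small plain-Java sketch. It does not use Hadoop's IntWritable: MutableInt is a hypothetical stand-in for a reusable value object, and reusingIterator imitates the behavior described above under that assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Plain-Java illustration only; MutableInt stands in for a reusable Writable.
class ReusedValueSketch {
    static final class MutableInt {
        int value;
    }

    // Returns an iterator that hands out the SAME instance on every call to
    // next(), overwriting its contents with the next value each time.
    static Iterator<MutableInt> reusingIterator(int... data) {
        MutableInt shared = new MutableInt();
        return new Iterator<MutableInt>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < data.length;
            }

            @Override
            public MutableInt next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.value = data[i++];
                return shared;
            }
        };
    }

    // Pitfall: keeping the returned references keeps one object many times.
    static List<Integer> keepReferences(int... data) {
        List<MutableInt> kept = new ArrayList<>();
        Iterator<MutableInt> it = reusingIterator(data);
        while (it.hasNext()) {
            kept.add(it.next());
        }
        List<Integer> seen = new ArrayList<>();
        for (MutableInt m : kept) {
            seen.add(m.value);
        }
        return seen;
    }

    // Fix: copy the contents out before advancing the iterator.
    static List<Integer> copyValues(int... data) {
        List<Integer> copied = new ArrayList<>();
        Iterator<MutableInt> it = reusingIterator(data);
        while (it.hasNext()) {
            copied.add(it.next().value);
        }
        return copied;
    }

    public static void main(String[] args) {
        System.out.println(keepReferences(1, 2, 3)); // prints [3, 3, 3]
        System.out.println(copyValues(1, 2, 3));     // prints [1, 2, 3]
    }
}
```

The same reasoning suggests copying a value's contents inside reduce whenever it must outlive the current iteration step.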
Which one of the following statements is true about a Hive-managed table?
Answer : B
Which two of the following statements are true about HDFS? (Choose two.)
Answer : A,B
Given the following Hive commands:
Answer : A