What is a SequenceFile?
Answer : D
Explanation: SequenceFile is a flat file consisting of binary key/value pairs.
There are 3 different SequenceFile formats:
Uncompressed key/value records.
Record compressed key/value records - only 'values' are compressed here.
Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
Reference: http://wiki.apache.org/hadoop/SequenceFile
Which TWO of the following statements are true regarding Hive? Choose 2 answers
Answer : A,C
Which HDFS command displays the contents of the file x in the user's HDFS home directory?
Answer : C
Indentify which best defines a SequenceFile?
Answer : D
Explanation: SequenceFile is a flat file consisting of binary key/value pairs.
There are 3 different SequenceFile formats:
Uncompressed key/value records.
Record compressed key/value records - only 'values' are compressed here.
Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
Reference: http://wiki.apache.org/hadoop/SequenceFile
You have written a Mapper which invokes the following five calls to the
OutputColletor.collect method:
output.collect (new Text (Apple), new Text (Red) ) ;
output.collect (new Text (Banana), new Text (Yellow) ) ;
output.collect (new Text (Apple), new Text (Yellow) ) ;
output.collect (new Text (Cherry), new Text (Red) ) ;
output.collect (new Text (Apple), new Text (Green) ) ;
How many times will the Reducers reduce method be invoked?
Answer : B
Explanation: reduce() gets called once for each [key, (list of values)] pair. To explain, let's say you called: out.collect(new Text("Car"),new Text("Subaru"); out.collect(new Text("Car"),new Text("Honda"); out.collect(new Text("Car"),new Text("Ford"); out.collect(new Text("Truck"),new Text("Dodge"); out.collect(new Text("Truck"),new Text("Chevy");
Then reduce() would be called twice with the pairs
reduce(Car, <Subaru, Honda, Ford>)
reduce(Truck, <Dodge, Chevy>)
Reference: Mapper output.collect()?
MapReduce v2 (MRv2/YARN) is designed to address which two issues?
Answer : A,B
Reference: Apache Hadoop YARN – Concepts & Applications
Which process describes the lifecycle of a Mapper?
Answer : B
Explanation: For each map instance that runs, the TaskTracker creates a new instance of your mapper.
Note:
* The Mapper is responsible for processing Key/Value pairs obtained from the InputFormat.
The mapper may perform a number of Extraction and Transformation functions on the
Key/Value pair before ultimately outputting none, one or many Key/Value pairs of the same, or different Key/Value type.
* With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class. This class defines an 'Identity' map function by default - every input Key/Value pair obtained from the InputFormat is written out.
Examining the run() method, we can see the lifecycle of the mapper:
/**
* Expert users can override this method for more complete control over the
* execution of the Mapper.
* @param context
* @throws IOException
*/
public void run(Context context) throws IOException, InterruptedException { setup(context); while (context.nextKeyValue()) { map(context.getCurrentKey(), context.getCurrentValue(), context);
}
cleanup(context);
}
setup(Context) - Perform any setup for the mapper. The default implementation is a no-op method. map(Key, Value, Context) - Perform a map operation in the given Key / Value pair. The default implementation calls Context.write(Key, Value) cleanup(Context) - Perform any cleanup for the mapper. The default implementation is a no-op method.
Reference: Hadoop/MapReduce/Mapper
You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product indentifies
(Text).
Indentify what determines the data types used by the Mapper for a given job.
Answer : D
Explanation: The input types fed to the mapper are controlled by the InputFormat used.
The default input format, "TextInputFormat," will load data in as (LongWritable, Text) pairs.
The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.
Note: The data types emitted by the reducer are identified by setOutputKeyClass() andsetOutputValueClass(). The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass().
By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the methods setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class will override these.
Reference: Yahoo! Hadoop Tutorial, THE DRIVER METHOD
Assuming default settings, which best describes the order of data provided to a reducers reduce method:
Answer : D
Explanation: Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).
The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
SecondarySort -
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key,
(collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Which one of the following is NOT a valid Oozie action?
Answer : D
For each intermediate key, each reducer task can emit:
Answer : C
Reference: Hadoop Map-Reduce Tutorial; Yahoo! Hadoop Tutorial, Module 4: MapReduce
You need to move a file titled weblogs into HDFS. When you try to copy the file, you cant.
You know you have ample space on your DataNodes. Which action should you take to relieve this situation and store more files in HDFS?
Answer : C
You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mappers map method?
Answer : C
Explanation: The mapper output (intermediate data) is stored on the Local file system
(NOT HDFS) of each individual mapper nodes. This is typically a temporary directory location which can be setup in config by the hadoop administrator. The intermediate data is cleaned up after the Hadoop Job completes.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, Where is the Mapper Output (intermediate kay-value data) stored ?
Which project gives you a distributed, Scalable, data store that allows you random, realtime read/write access to hundreds of terabytes of data?
Answer : A
Explanation: Use Apache HBase when you need random, realtime read/write access to your Big Data.
Note: This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A
Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides
Bigtable-like capabilities on top of Hadoop and HDFS.
Features -
Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server side Filters
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
Extensible jruby-based (JIRB) shell
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via
JMX -
Reference: http://hbase.apache.org/ (when would I use HBase? First sentence)
Indentify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
Answer : D
Explanation: Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
Reference: http://hadoop.apache.org/common/docs/r0.20.1/streaming.html (Hadoop
Streaming, second sentence)