Hortonworks Data Platform Certified Developer v5.0 (HDPCD)


What is a SequenceFile?

  • A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.
  • B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.
  • C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
  • D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.


Answer : D

Explanation: SequenceFile is a flat file consisting of binary key/value pairs.
There are 3 different SequenceFile formats:
Uncompressed key/value records.
Record compressed key/value records - only 'values' are compressed here.
Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
Reference: http://wiki.apache.org/hadoop/SequenceFile
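As an illustration of answer D, here is a minimal sketch (using the classic org.apache.hadoop.io.SequenceFile API; the file path is hypothetical) that writes a SequenceFile whose key type and value type are fixed when the file is created:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("example.seq"); // hypothetical output path

        // Key and value classes are fixed when the file is created:
        // every record must use the same key type and the same value type.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, Text.class, IntWritable.class);
        try {
            writer.append(new Text("apple"), new IntWritable(1));
            writer.append(new Text("banana"), new IntWritable(2));
        } finally {
            writer.close();
        }
    }
}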

Which TWO of the following statements are true regarding Hive? Choose 2 answers

  • A. Useful for data analysts familiar with SQL who need to do ad-hoc queries
  • B. Offers real-time queries and row level updates
  • C. Allows you to define a structure for your unstructured Big Data
  • D. Is a relational database


Answer : A,C

Which HDFS command displays the contents of the file x in the user's HDFS home directory?

  • A. hadoop fs -ls x
  • B. hdfs fs -get x
  • C. hadoop fs -cat x
  • D. hadoop fs -cp x


Answer : C

Identify which best defines a SequenceFile.

  • A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects
  • B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects
  • C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
  • D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.


Answer : D

Explanation: SequenceFile is a flat file consisting of binary key/value pairs.
There are 3 different SequenceFile formats:
Uncompressed key/value records.
Record compressed key/value records - only 'values' are compressed here.
Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
Reference: http://wiki.apache.org/hadoop/SequenceFile
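Complementing the writer sketch shown earlier, here is a minimal reader sketch (again the classic API, with a hypothetical path) that iterates over the records; note that a single key type and a single value type are used for the whole file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("example.seq"); // hypothetical input path

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Text key = new Text();                  // single key type for the whole file
            IntWritable value = new IntWritable();  // single value type for the whole file
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}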

You have written a Mapper which invokes the following five calls to the
OutputCollector.collect method:
output.collect(new Text("Apple"), new Text("Red"));
output.collect(new Text("Banana"), new Text("Yellow"));
output.collect(new Text("Apple"), new Text("Yellow"));
output.collect(new Text("Cherry"), new Text("Red"));
output.collect(new Text("Apple"), new Text("Green"));
How many times will the Reducer's reduce method be invoked?

  • A. 6
  • B. 3
  • C. 1
  • D. 0
  • E. 5


Answer : B

Explanation: reduce() gets called once for each [key, (list of values)] pair. To explain, let's say you called:
out.collect(new Text("Car"), new Text("Subaru"));
out.collect(new Text("Car"), new Text("Honda"));
out.collect(new Text("Car"), new Text("Ford"));
out.collect(new Text("Truck"), new Text("Dodge"));
out.collect(new Text("Truck"), new Text("Chevy"));
Then reduce() would be called twice with the pairs
reduce(Car, <Subaru, Honda, Ford>)
reduce(Truck, <Dodge, Chevy>)
Reference: Mapper output.collect()?
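To make the counting concrete, here is a minimal reducer sketch in the old org.apache.hadoop.mapred API (matching the OutputCollector style above; the class name is hypothetical). For the five pairs emitted by the mapper it would be invoked three times, once per distinct key:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Called once per distinct key. For the five collect() calls above, reduce()
// runs three times, with value lists:
//   reduce(Apple,  <Red, Yellow, Green>)
//   reduce(Banana, <Yellow>)
//   reduce(Cherry, <Red>)
public class ColorListReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Concatenate all values seen for this key into one output record.
        StringBuilder colors = new StringBuilder();
        while (values.hasNext()) {
            if (colors.length() > 0) {
                colors.append(",");
            }
            colors.append(values.next().toString());
        }
        output.collect(key, new Text(colors.toString()));
    }
}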

MapReduce v2 (MRv2/YARN) is designed to address which two issues?

  • A. Single point of failure in the NameNode.
  • B. Resource pressure on the JobTracker.
  • C. HDFS latency.
  • D. Ability to run frameworks other than MapReduce, such as MPI.
  • E. Reduce complexity of the MapReduce APIs.
  • F. Standardize on a single MapReduce API.


Answer : B,D

Explanation: YARN splits the two roles of the MRv1 JobTracker (resource management and job scheduling/monitoring) into a global ResourceManager and a per-application ApplicationMaster, relieving resource pressure on the JobTracker, and it lets frameworks other than MapReduce, such as MPI, run on the same cluster. The NameNode single point of failure is addressed separately by HDFS high availability, not by YARN.
Reference: Apache Hadoop YARN – Concepts & Applications

Which process describes the lifecycle of a Mapper?

  • A. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.
  • B. The TaskTracker spawns a new Mapper to process all records in a single input split.
  • C. The TaskTracker spawns a new Mapper to process each key-value pair.
  • D. The JobTracker spawns a new Mapper to process all records in a single file.


Answer : B

Explanation: For each map instance that runs, the TaskTracker creates a new instance of your mapper.
Note:
* The Mapper is responsible for processing Key/Value pairs obtained from the InputFormat.
The mapper may perform a number of Extraction and Transformation functions on the
Key/Value pair before ultimately outputting none, one or many Key/Value pairs of the same, or different Key/Value type.
* With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class. This class defines an 'Identity' map function by default - every input Key/Value pair obtained from the InputFormat is written out.
Examining the run() method, we can see the lifecycle of the mapper:
/**
* Expert users can override this method for more complete control over the
* execution of the Mapper.
* @param context
* @throws IOException
*/
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
setup(Context) - Perform any setup for the mapper. The default implementation is a no-op method.
map(Key, Value, Context) - Perform a map operation on the given Key/Value pair. The default implementation calls Context.write(Key, Value).
cleanup(Context) - Perform any cleanup for the mapper. The default implementation is a no-op method.
Reference: Hadoop/MapReduce/Mapper
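A minimal sketch of a new-API Mapper (the class name is hypothetical) that overrides the three lifecycle methods called from run(): setup() once per split, map() once per key/value pair, and cleanup() once per split:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One Mapper instance handles all records of a single input split:
// setup() runs once, map() runs once per key/value pair, cleanup() runs once.
public class LineLengthMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text outKey = new Text();
    private final IntWritable outValue = new IntWritable();

    @Override
    protected void setup(Context context) {
        // One-time initialization for this split (e.g. read configuration values).
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit each line keyed by its text, with its length in bytes as the value.
        outKey.set(line);
        outValue.set(line.getLength());
        context.write(outKey, outValue);
    }

    @Override
    protected void cleanup(Context context) {
        // One-time teardown for this split (e.g. flush any buffered state).
    }
}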

You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text).
Identify what determines the data types used by the Mapper for a given job.

  • A. The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods
  • B. The data types specified in HADOOP_MAP_DATATYPES environment variable
  • C. The mapper-specification.xml file submitted with the job determines the mapper's input key and value types.
  • D. The InputFormat used by the job determines the mapper’s input key and value types.


Answer : D

Explanation: The input types fed to the mapper are controlled by the InputFormat used.
The default input format, "TextInputFormat," will load data in as (LongWritable, Text) pairs.
The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.
Note: The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass().
By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class will override these.
Reference: Yahoo! Hadoop Tutorial, THE DRIVER METHOD
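A minimal old-API driver sketch (the class name and path arguments are hypothetical, and the mapper/reducer class setters are omitted) showing where the InputFormat and the output type methods mentioned above are configured:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SalesReportDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SalesReportDriver.class);

        // The InputFormat decides the mapper's input key/value types:
        // TextInputFormat feeds the mapper (LongWritable, Text) pairs.
        conf.setInputFormat(TextInputFormat.class);

        // Output types of the reducer (and, by default, of the mapper too).
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Override only if the mapper's output types differ from the reducer's.
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);

        // conf.setMapperClass(...) and conf.setReducerClass(...) omitted in this sketch.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}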

Assuming default settings, which best describes the order of data provided to a reducer's reduce method?

  • A. The keys given to a reducer aren't in a predictable order, but the values associated with those keys always are.
  • B. Both the keys and values passed to a reducer always appear in sorted order.
  • C. Neither keys nor values are in any predictable order.
  • D. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order.


Answer : D

Explanation: Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).
The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.

Secondary Sort:
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce (a sketch of this follows the explanation below).
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key,
(collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
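A sketch of the secondary-sort pieces described above; every class name here is hypothetical and not part of any particular library. The composite key sorts on the full (natural key, secondary key) pair, while the grouping comparator compares only the natural key:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical composite key: a natural key plus a secondary sort field.
public class CompositeKey implements WritableComparable<CompositeKey> {
    private final Text naturalKey = new Text();
    private final IntWritable secondary = new IntWritable();

    public void set(String natural, int sec) {
        naturalKey.set(natural);
        secondary.set(sec);
    }

    public void write(DataOutput out) throws IOException {
        naturalKey.write(out);
        secondary.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        naturalKey.readFields(in);
        secondary.readFields(in);
    }

    // Full ordering: natural key first, then secondary key.
    public int compareTo(CompositeKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return cmp != 0 ? cmp : secondary.compareTo(other.secondary);
    }

    // Grouping comparator: compares only the natural key, so all values for
    // one natural key are delivered to a single reduce() call, already
    // ordered by the secondary key.
    public static class GroupComparator extends WritableComparator {
        public GroupComparator() { super(CompositeKey.class, true); }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return ((CompositeKey) a).naturalKey.compareTo(((CompositeKey) b).naturalKey);
        }
    }
    // In the driver: job.setGroupingComparatorClass(CompositeKey.GroupComparator.class);
}

A complete job would also set a partitioner that partitions on the natural key alone (and give the composite key matching hashCode/equals), so that all records for one natural key reach the same reducer.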

Which one of the following is NOT a valid Oozie action?

  • A. mapreduce
  • B. pig
  • C. hive
  • D. mrunit


Answer : D

For each intermediate key, each reducer task can emit:

  • A. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
  • B. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
  • C. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
  • D. One final key-value pair per value associated with the key; no restrictions on the type.
  • E. One final key-value pair per key; no restrictions on the type.


Answer : C

Reference: Hadoop Map-Reduce Tutorial; Yahoo! Hadoop Tutorial, Module 4: MapReduce
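To illustrate answer C, here is a minimal new-API reducer sketch (class name hypothetical) that emits several final pairs per intermediate key; the final types differ from the intermediate value type, but every emitted key shares one type and every emitted value shares one type:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each intermediate key this reducer emits two final pairs whose types
// (Text, DoubleWritable) differ from the intermediate value type (IntWritable),
// which is allowed as long as all emitted keys share one type and all emitted
// values share one type.
public class StatsReducer
        extends Reducer<Text, IntWritable, Text, DoubleWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        long count = 0;
        for (IntWritable v : values) {
            sum += v.get();
            count++;
        }
        // Two output pairs per intermediate key: a sum and a mean.
        context.write(new Text(key + ":sum"), new DoubleWritable(sum));
        context.write(new Text(key + ":mean"), new DoubleWritable((double) sum / count));
    }
}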

You need to move a file titled weblogs into HDFS. When you try to copy the file, you can't. You know you have ample space on your DataNodes.
Which action should you take to relieve this situation and store more files in HDFS?

  • A. Increase the block size on all current files in HDFS.
  • B. Increase the block size on your remaining files.
  • C. Decrease the block size on your remaining files.
  • D. Increase the amount of memory for the NameNode.
  • E. Increase the number of disks (or size) for the NameNode.
  • F. Decrease the block size on all current files in HDFS.


Answer : C

You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper's map method?

  • A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
  • B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
  • C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
  • D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer
  • E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.


Answer : C

Explanation: The mapper output (intermediate data) is stored on the local file system
(NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, Where is the Mapper Output (intermediate key-value data) stored?

Which project gives you a distributed, scalable data store that allows you random, real-time read/write access to hundreds of terabytes of data?

  • A. HBase
  • B. Hue
  • C. Pig
  • D. Hive
  • E. Oozie
  • F. Flume
  • G. Sqoop


Answer : A

Explanation: Use Apache HBase when you need random, realtime read/write access to your Big Data.
Note: This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A
Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides
Bigtable-like capabilities on top of Hadoop and HDFS.

Features:
Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables.
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
Easy to use Java API for client access (see the sketch below).
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server side Filters.
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
Extensible jruby-based (JIRB) shell.
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
Reference: http://hbase.apache.org/ (when would I use HBase? First sentence)
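As a brief illustration of the Java client API item above, here is a minimal sketch using the HBase 1.x+ client classes (the table name, column family, row key, and values are hypothetical) that performs one random write and one random read:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("weblogs"))) { // hypothetical table

            // Random write: one cell in one row.
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("url"),
                    Bytes.toBytes("http://example.com"));
            table.put(put);

            // Random read: fetch that row back.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] url = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("url"));
            System.out.println(Bytes.toString(url));
        }
    }
}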

Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.

  • A. Oozie
  • B. Sqoop
  • C. Flume
  • D. Hadoop Streaming
  • E. mapred


Answer : D

Explanation: Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
Reference: http://hadoop.apache.org/common/docs/r0.20.1/streaming.html (Hadoop
Streaming, second sentence)
