Cloudera Certified Developer for Apache Hadoop (CCDH) v1.0 (CCD-410)

Page:    1 / 4   
Total 66 questions

In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?

  • A. Increase the parameter that controls minimum split size in the job configuration.
  • B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.
  • C. Set the number of mappers equal to the number of input files you want to process.
  • D. Write a custom FileInputFormat and override the method isSplitable to always return false.


Answer : D

FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of getSplits(JobContext). Subclasses of FileInputFormat can also override the isSplitable(JobContext, Path) method to ensure input-files are not split-up and are processed as a whole by Mappers.
Reference: org.apache.hadoop.mapreduce.lib.input, Class FileInputFormat<K,V>
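A minimal sketch of option D, assuming the input is ordinary text files; the class name WholeFileAsSplitInputFormat is illustrative, not a Hadoop class:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Reuses TextInputFormat's record reader but refuses to split any file,
// so getSplits() emits exactly one split (and hence one map task) per input file.
public class WholeFileAsSplitInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split, regardless of how many HDFS blocks the file spans
    }
}

The job would then register it with job.setInputFormatClass(WholeFileAsSplitInputFormat.class).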

Which process describes the lifecycle of a Mapper?

  • A. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.
  • B. The TaskTracker spawns a new Mapper to process all records in a single input split.
  • C. The TaskTracker spawns a new Mapper to process each key-value pair.
  • D. The JobTracker spawns a new Mapper to process all records in a single file.


Answer : B

For each map task that runs, the TaskTracker creates a new instance of your Mapper; that single instance then processes every record in its input split.
Note:
* The Mapper is responsible for processing Key/Value pairs obtained from the InputFormat. The mapper may perform a number of Extraction and Transformation functions on the Key/Value pair before ultimately outputting none, one or many Key/Value pairs of the same, or different Key/Value type.
* With the new Hadoop API, mappers extend the org.apache.hadoop.mapreduce.Mapper class. This class defines an 'Identity' map function by default - every input Key/Value pair obtained from the InputFormat is written out.
Examining the run() method, we can see the lifecycle of the mapper:
/**
* Expert users can override this method for more complete control over the
* execution of the Mapper.
* @param context
* @throws IOException
*/
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}
setup(Context) - Perform any setup for the mapper. The default implementation is a no-op method.
map(Key, Value, Context) - Perform a map operation on the given Key/Value pair. The default implementation calls Context.write(Key, Value).
cleanup(Context) - Perform any cleanup for the mapper. The default implementation is a no-op method.
Reference: Hadoop/MapReduce/Mapper
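As an illustration of that lifecycle from the user's side, here is a minimal sketch of a Mapper that overrides all three hooks; the class, field, and counter names are illustrative only:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private long linesSeen;                       // per-task state, created once per input split

    @Override
    protected void setup(Context context) {
        linesSeen = 0;                            // called once, before the first map() call
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        linesSeen++;                              // called once per key-value pair in the split
        context.write(new Text("length"), new IntWritable(value.getLength()));
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // called once, after the last map() call for this split
        context.getCounter("stats", "lines").increment(linesSeen);
    }
}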

Which best describes when the reduce method is first called in a MapReduce job?

  • A. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.
  • B. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
  • C. Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only or reduce-only jobs.
  • D. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.


Answer : B

* In a MapReduce job, reducers do not start executing the reduce method until all the map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.
* Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account the data transfer performed by the reduce process, so reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer. Although the reducer's progress is updated, the programmer-defined reduce method is still called only after all the mappers have finished.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, When are the reducers started in a MapReduce job?
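The percentage mentioned in option A does exist, but it only controls when reduce tasks are scheduled to begin copying, not when reduce() is called. A hedged driver-side sketch, assuming the MRv1-era property name mapred.reduce.slowstart.completed.maps:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowStartDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Start scheduling reduce tasks (i.e. the copy phase) only after 80% of maps have completed.
        // reduce() itself still runs only after all map output has been copied and merged.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
        Job job = new Job(conf, "slow-start example");
        // ... set mapper, reducer, input/output paths as usual, then job.waitForCompletion(true)
    }
}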

You have written a Mapper which invokes the following five calls to the OutputCollector.collect method:
output.collect(new Text("Apple"), new Text("Red"));
output.collect(new Text("Banana"), new Text("Yellow"));
output.collect(new Text("Apple"), new Text("Yellow"));
output.collect(new Text("Cherry"), new Text("Red"));
output.collect(new Text("Apple"), new Text("Green"));
How many times will the Reducer's reduce method be invoked?

  • A. 6
  • B. 3
  • C. 1
  • D. 0
  • E. 5


Answer : B

reduce() gets called once for each [key, (list of values)] pair. To explain, let's say you called:
out.collect(new Text("Car"), new Text("Subaru"));
out.collect(new Text("Car"), new Text("Honda"));
out.collect(new Text("Car"), new Text("Ford"));
out.collect(new Text("Truck"), new Text("Dodge"));
out.collect(new Text("Truck"), new Text("Chevy"));
Then reduce() would be called twice with the pairs
reduce(Car, <Subaru, Honda, Ford>)
reduce(Truck, <Dodge, Chevy>)
Reference: Mapper output.collect()?
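A minimal sketch of a Reducer that makes the grouping visible; the names are illustrative and not from the question:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountValuesReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int n = 0;
        for (Text ignored : values) {   // e.g. key "Apple" -> Red, Yellow, Green
            n++;
        }
        context.write(key, new IntWritable(n));
    }
}

For the five collect() calls in the question this reduce method runs three times, once each for Apple, Banana and Cherry.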

To process input key-value pairs, your mapper needs to load a 512 MB data file into memory. What is the best way to accomplish this?

  • A. Serialize the data file, insert in it the JobConf object, and read the data into memory in the configure method of the mapper.
  • B. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.
  • C. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.
  • D. Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.


Answer : D

Hadoop provides a distributed cache mechanism to make files that Map/Reduce jobs need available locally on every node. Reading the cached file once in the configure method (setup() in the new API) avoids re-reading it for every call to map().

Use Case -
Let's look at the use case in a bit more detail so that the code snippet is easier to follow.
We have a key-value file that we need to use in our map tasks. For simplicity, let's say we need to replace every keyword we encounter during parsing with some other value.

So what we need is -
A key-value file (let's use a Properties file)
Mapper code that uses it
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DistributedCacheMapper extends Mapper<LongWritable, Text, Text, Text> {

    Properties cache;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        Path[] localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());

        if (localCacheFiles != null) {
            // expecting only a single file here
            for (int i = 0; i < localCacheFiles.length; i++) {
                Path localCacheFile = localCacheFiles[i];
                cache = new Properties();
                cache.load(new FileReader(localCacheFile.toString()));
            }
        } else {
            // do your error handling here
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // use the cache here
        // if value contains some attribute, cache.get(<value>)
        // do some action or replace with something else
    }
}
Note:
* Distribute application-specific large, read-only files efficiently.
DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications.
Applications specify the files, via urls (hdfs:// or http://) to be cached via the JobConf. The DistributedCache assumes that the files specified via hdfs:// urls are already present on the FileSystem at the path specified by the url.
Reference: Using Hadoop Distributed Cache
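For completeness, a hedged driver-side counterpart to the mapper above: the properties file is registered with the DistributedCache before the job is submitted. The HDFS path and job name are examples only:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class CacheDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The file must already exist in HDFS; it is copied to each task node's local disk
        // before setup() runs, where DistributedCache.getLocalCacheFiles() finds it.
        DistributedCache.addCacheFile(new URI("/user/example/keywords.properties"), conf);
        Job job = new Job(conf, "distributed-cache example");
        job.setMapperClass(DistributedCacheMapper.class);
        // ... set input/output formats and paths, then job.waitForCompletion(true)
    }
}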

In a MapReduce job, the reducer receives all values associated with the same key. Which statement best describes the ordering of these values?

  • A. The values are in sorted order.
  • B. The values are arbitrarily ordered, and the ordering may vary from run to run of the same MapReduce job.
  • C. The values are arbitrarily ordered, but multiple runs of the same MapReduce job will always have the same ordering.
  • D. Since the values come from mapper outputs, the reducers will receive contiguous sections of sorted values.


Answer : B

Note:
* Input to the Reducer is the sorted output of the mappers.
* The framework calls the application's Reduce function once for each unique key in the sorted order.
* Example:
For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks?

  • A. Processor and network I/O
  • B. Disk I/O and network I/O
  • C. Processor and RAM
  • D. Processor and disk I/O


Answer : B
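To see why, here is a minimal sketch of the mapper the question describes (the class name is illustrative): every input character becomes an intermediate record, so the map output that must be spilled to local disk and shuffled over the network is far larger than the input.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CharFrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text character = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String s = line.toString();
        for (int i = 0; i < s.length(); i++) {
            // One output record per input character: disk I/O (map-side spills) and
            // network I/O (the shuffle) become the bottlenecks, not CPU or RAM.
            character.set(String.valueOf(s.charAt(i)));
            context.write(character, ONE);
        }
    }
}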

You want to count the number of occurrences for each unique word in the supplied input data. You've decided to implement this by having your mapper tokenize each word and emit a literal value 1, and then have your reducer increment a counter for each literal 1 it receives. After successfully implementing this, it occurs to you that you could optimize this by specifying a combiner. Will you be able to reuse your existing Reducer as your combiner in this case, and why or why not?

  • A. Yes, because the sum operation is both associative and commutative and the input and output types to the reduce method match.
  • B. No, because the sum operation in the reducer is incompatible with the operation of a Combiner.
  • C. No, because the Reducer and Combiner are separate interfaces.
  • D. No, because the Combiner is incompatible with a mapper which doesn't use the same data type for both the key and value.
  • E. Yes, because Java is a polymorphic object-oriented language and thus reducer code can be reused as a combiner.


Answer : A

Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed: Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?
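A hedged driver-side sketch of reusing the reducer as a combiner; Hadoop's stock TokenCounterMapper and IntSumReducer library classes stand in here for "your" word-count mapper and reducer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count with combiner");
        job.setMapperClass(TokenCounterMapper.class);   // tokenizes lines and emits (word, 1)
        job.setCombinerClass(IntSumReducer.class);      // the reducer class doubles as the combiner:
        job.setReducerClass(IntSumReducer.class);       // summing is associative and commutative,
                                                        // and its input/output types match (Text, IntWritable)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // ... set input/output paths, then job.waitForCompletion(true)
    }
}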

Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.

  • A. TaskTracker
  • B. NameNode
  • C. DataNode
  • D. JobTracker
  • E. Secondary NameNode


Answer : D

JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running on any Hadoop cluster. The JobTracker runs in its own JVM process, and in a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node's location.
The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted. The JobTracker in Hadoop performs the following actions (from the Hadoop wiki):
Client applications submit jobs to the Job tracker.
The JobTracker talks to the NameNode to determine the location of the data
The JobTracker locates TaskTracker nodes with available slots at or near the data
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop cluster?

Which project gives you a distributed, scalable data store that allows random, real-time read/write access to hundreds of terabytes of data?

  • A. HBase
  • B. Hue
  • C. Pig
  • D. Hive
  • E. Oozie
  • F. Flume
  • G. Sqoop


Answer : A

Use Apache HBase when you need random, realtime read/write access to your Big Data.
Note: This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Features -
Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server side Filters
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
Extensible jruby-based (JIRB) shell
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
Reference: http://hbase.apache.org/ (When would I use HBase? First sentence)
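As a hedged illustration of that random, real-time access, a sketch using the classic HBase client API of that era (HTable, Put, Get); the table, row, and column names are examples only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");

        // Random write: one cell in column family "contents"
        Put put = new Put(Bytes.toBytes("com.example/index.html"));
        put.add(Bytes.toBytes("contents"), Bytes.toBytes("html"), Bytes.toBytes("<html>...</html>"));
        table.put(put);

        // Random read of the same row, available immediately (strictly consistent reads and writes)
        Result result = table.get(new Get(Bytes.toBytes("com.example/index.html")));
        byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));

        table.close();
    }
}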

You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when trying to access this file?

  • A. They would see Hadoop throw an ConcurrentFileAccessException when they try to access this file.
  • B. They would see the current state of the file, up to the last bit written by the command.
  • C. They would see the current state of the file through the last completed block.
  • D. They would see no content until the whole file written and closed.


Answer : C

HDFS's coherency model makes a block visible to new readers once that block has been completely written; only the block currently being written is not yet visible. After 200 MB have been written with a 64 MB block size, the first three blocks are complete, so another user sees the file's contents through the last completed block.
Note:
* put
Usage: hadoop fs -put <localsrc> ... <dst>
Copy single src, or multiple srcs from the local file system to the destination filesystem. Also reads input from stdin and writes to the destination filesystem.

Identify the tool best suited to import a portion of a relational database every day as files into HDFS, and generate Java classes to interact with that imported data?

  • A. Oozie
  • B. Flume
  • C. Pig
  • D. Hue
  • E. Hive
  • F. Sqoop
  • G. fuse-dfs


Answer : F

Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities:
Imports individual tables or entire databases to files in HDFS
Generates Java classes to allow you to interact with your imported data
Provides the ability to import from SQL databases straight into your Hive data warehouse
Note:
Data Movement Between Hadoop and Relational Databases
Data can be moved between Hadoop and a relational database as a bulk data transfer, or relational tables can be accessed from within a MapReduce map function.
Note:
* Cloudera's Distribution for Hadoop provides a bulk data transfer tool (i.e., Sqoop) that imports individual tables or entire databases into HDFS files. The tool also generates Java classes that support interaction with the imported data. Sqoop supports all relational databases over JDBC, and Quest Software provides a connector (i.e., OraOop) that has been optimized for access to data residing in Oracle databases.
Reference: http://log.medcl.net/item/2011/08/hadoop-and-mapreduce-big-data-analytics-gartner/ (Data Movement Between Hadoop and Relational Databases, second paragraph)
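A hedged example of what such a daily import might look like on the command line; the connection string, credentials, table name, and paths are placeholders, not from the text:

sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P \
      --table orders --where "order_date = '2013-01-01'" \
      --target-dir /data/orders/2013-01-01

As part of the import, Sqoop also generates a Java class (named after the table) that can be reused to interact with the imported records.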

You have a directory named jobdata in HDFS that contains four files: _first.txt, second.txt, .third.txt and #data.txt. How many files will be processed by the FileInputFormat.setInputPaths() command when it's given a path object representing this directory?

  • A. Four, all files will be processed
  • B. Three, the pound sign is an invalid character for HDFS file names
  • C. Two, file names with a leading period or underscore are ignored
  • D. None, the directory cannot be named jobdata
  • E. One, no special characters can prefix the name of an input file


Answer : C

Files whose names start with '_' or '.' are treated as 'hidden' by FileInputFormat, much like Unix files starting with '.', and are skipped. The '#' character is allowed in HDFS file names, so #data.txt is processed along with second.txt.
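A hedged sketch of the related hook: FileInputFormat lets you install an extra PathFilter, which is applied in addition to the built-in hidden-file filter (so it cannot re-include the '_' and '.' files). The filter class, job name, and paths below are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FilteredInputDriver {
    // Accept only files ending in ".txt"
    public static class TxtOnlyFilter implements PathFilter {
        @Override
        public boolean accept(Path path) {
            return path.getName().endsWith(".txt");
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "filtered input");
        FileInputFormat.setInputPaths(job, new Path("/user/example/jobdata"));
        FileInputFormat.setInputPathFilter(job, TxtOnlyFilter.class);
        // With the directory above, only second.txt and #data.txt pass both filters.
        // ... set mapper/reducer and output path, then job.waitForCompletion(true)
    }
}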

You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses TextInputFormat: the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.

  • A. There is no difference in output between the two settings.
  • B. With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on HDFS.
  • C. With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS.
  • D. With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.


Answer : D

* It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them out to the FileSystem.
* Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.
Note:

Reduce -
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).
Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.
The output of the Reducer is not sorted.
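A hedged driver-side sketch of the map-only case described above; the nested RegexMatchMapper class is an illustrative stand-in for the question's regular-expression mapper, and the pattern is an example:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MapOnlyDriver {

    // Emits (matched text, "filename:offset") pairs, as described in the question.
    public static class RegexMatchMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Pattern PATTERN = Pattern.compile("ERROR \\d+");  // example pattern

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String file = ((FileSplit) context.getInputSplit()).getPath().getName();
            Matcher m = PATTERN.matcher(line.toString());
            while (m.find()) {
                context.write(new Text(m.group()), new Text(file + ":" + offset));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "map-only grep");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(RegexMatchMapper.class);
        job.setNumReduceTasks(0);   // no shuffle, no sort, no reduce phase:
                                    // each mapper writes its own part-m-NNNNN file directly to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // ... set input/output paths, then job.waitForCompletion(true)
    }
}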

A combiner reduces:

  • A. The number of values across different keys in the iterator supplied to a single reduce method call.
  • B. The amount of intermediate data that must be transferred between the mapper and reducer.
  • C. The number of input files a mapper must process.
  • D. The number of output files a reducer must produce.


Answer : B

Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed: Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?
