Certified Data Engineer Professional v1.0

Total 111 questions

A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.



Which command should be removed from the notebook before scheduling it as a job?

  • A. Cmd 2
  • B. Cmd 3
  • C. Cmd 4
  • D. Cmd 5
  • E. Cmd 6


Answer : E

The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for these dashboards completes in 10 minutes.

Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

  • A. Manually trigger a job anytime the business reporting team refreshes their dashboards
  • B. Schedule a job to execute the pipeline once an hour on a new job cluster
  • C. Schedule a Structured Streaming job with a trigger interval of 60 minutes
  • D. Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster
  • E. Configure a job that executes every time new data lands in a given directory


Answer : B
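
For reference, a minimal sketch of the hourly schedule as a Databricks Jobs API 2.1 payload expressed as a Python dictionary. The notebook path, cluster spec, runtime version, and cron expression are illustrative assumptions, not values taken from the question; the point is that a new job cluster only incurs compute cost for the roughly ten-minute run each hour.

# Illustrative Jobs API 2.1 payload: run the pipeline notebook at the top of
# every hour on a new (ephemeral) job cluster. All names, paths, and node
# types below are assumptions.
hourly_job = {
    "name": "hourly-dashboard-refresh",
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",   # second 0, minute 0, every hour
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "run_pipeline",
            "notebook_task": {"notebook_path": "/Repos/etl/dashboard_pipeline"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
}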

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:


SELECT COUNT(*) FROM table

Which of the following describes how results are generated each time the dashboard is updated?

  • A. The total count of rows is calculated by scanning all data files
  • B. The total count of rows will be returned from cached results unless REFRESH is run
  • C. The total count of records is calculated from the Delta transaction logs
  • D. The total count of records is calculated from the parquet file metadata
  • E. The total count of records is calculated from the Hive metastore


Answer : C
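
As a quick check of this behavior, the sketch below runs a filter-free count against a hypothetical Delta table and prints the physical plan; on recent Databricks runtimes such a count can be satisfied from the numRecords statistics kept in the Delta transaction log rather than by scanning every data file. The table name is an assumption.

# Hypothetical table; a COUNT(*) with no filters over a Delta table can be
# answered from per-file statistics recorded in the transaction log.
counts = spark.sql("SELECT COUNT(*) AS total FROM prod.sales")
counts.explain()   # inspect the plan: no full data-file scan should be required
counts.show()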

A Delta Lake table was created with the below query:



Consider the following query:


DROP TABLE prod.sales_by_store

If this statement is executed by a workspace admin, which result will occur?

  • A. Nothing will occur until a COMMIT command is executed.
  • B. The table will be removed from the catalog but the data will remain in storage.
  • C. The table will be removed from the catalog and the data will be deleted.
  • D. An error will occur because Delta Lake prevents the deletion of production data.
  • E. Data will be marked as deleted but still recoverable with Time Travel.


Answer : C
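
A minimal sketch of the difference between dropping a managed table and an external table, assuming the original CREATE TABLE did not specify a LOCATION (which is what makes the data itself subject to deletion). The source table prod.sales and the external path are assumptions.

# Managed table: no LOCATION clause, so DROP TABLE removes both the catalog
# entry and the underlying data files.
spark.sql("CREATE TABLE prod.sales_by_store AS SELECT * FROM prod.sales")
spark.sql("DROP TABLE prod.sales_by_store")          # metadata and data are gone

# External table (hypothetical path): DROP TABLE removes only the catalog
# entry; the files at the LOCATION remain in cloud storage.
spark.sql("""
    CREATE TABLE prod.sales_by_store_ext
    USING DELTA
    LOCATION '/mnt/prod/sales_by_store_ext'
    AS SELECT * FROM prod.sales
""")
spark.sql("DROP TABLE prod.sales_by_store_ext")      # data files are left in place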

Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().

Which of the following statements is correct?

  • A. DBFS is a file system protocol that allows users to interact with files stored in object storage using syntax and guarantees similar to Unix file systems.
  • B. By default, both the DBFS root and mounted data sources are only accessible to workspace administrators.
  • C. The DBFS root is the most secure location to store data, because mounted storage volumes must have full public read and write permissions.
  • D. Neither the DBFS root nor mounted storage can be accessed when using %sh in a Databricks notebook.
  • E. The DBFS root stores files in ephemeral block volumes attached to the driver, while mounted directories will always persist saved data to external storage between sessions.


Answer : A
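
A small sketch of what this means in practice: the same object-storage-backed files can be listed with Unix-style paths through dbutils or through the /dbfs FUSE mount on the driver. The paths are assumptions, and dbutils/display are only available inside a Databricks notebook.

# List the DBFS root through the file-system-style API.
display(dbutils.fs.ls("dbfs:/"))

# The same storage is visible to driver-local code via the /dbfs mount.
import os
print(os.listdir("/dbfs/tmp"))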

The following code has been migrated to a Databricks notebook from a legacy workload:



The code executes successfully and produces the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

  • A. %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
  • B. Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
  • C. %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
  • D. Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
  • E. %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.


Answer : E
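
A hedged sketch of the usual fix: rather than cloning a repository and copying files with shell commands on the driver, read the staged data with Spark so the extract and load are distributed across the workers. The source path, format, and target table are assumptions.

# Distributed read/write instead of driver-only shell operations.
df = (spark.read
      .format("json")
      .load("dbfs:/tmp/staged_extract/"))     # hypothetical staging path

(df.write
   .format("delta")
   .mode("append")
   .saveAsTable("bronze.legacy_extract"))     # hypothetical target table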

The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the following schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify whether any of 30 keywords exist in this field.

A junior data engineer suggests that converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

  • A. Delta Lake statistics are not optimized for free text fields with high cardinality.
  • B. Text data cannot be stored with Delta Lake.
  • C. ZORDER ON review will need to be run to see performance gains.
  • D. The Delta log creates a term matrix for free text fields to support selective filtering.
  • E. Delta Lake statistics are only collected on the first 4 columns in a table.


Answer : A
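
To make the point concrete, a keyword search over a long free-text column is a substring/regex match, which file-level min/max statistics cannot prune; the review strings still have to be read whether the files are Parquet or Delta. The keyword list and table name below are illustrative.

from pyspark.sql import functions as F

keywords = ["refund", "broken", "excellent"]      # 3 of the ~30 keywords
pattern = "(?i)" + "|".join(keywords)             # case-insensitive regex

reviews = spark.table("reviews")                  # hypothetical table name
flagged = reviews.filter(F.col("review").rlike(pattern))
flagged.select("review_id", "rating").show(5)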

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python wheel to object storage mounted with DBFS for use with a production job?

  • A. configure
  • B. fs
  • C. jobs
  • D. libraries
  • E. workspace


Answer : B

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:



For demand forecasting, the Lakehouse contains a validated table of all itemized sales updated incrementally in near real-time. This table, named products_per_order, includes the following fields:



Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization.

Which solution meets the expectations of the end users while controlling and limiting possible costs?

  • A. Populate the dashboard by configuring a nightly batch job to save the required values as a table overwritten with each update.
  • B. Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.
  • C. Configure a webhook to execute an incremental read against products_per_order each time the dashboard is refreshed.
  • D. Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.
  • E. Define a view against the products_per_order table and define the dashboard against this view.


Answer : A
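
A minimal sketch of that nightly batch, assuming illustrative column names (the schema images are not reproduced here): the job recomputes a compact summary table once per day and overwrites it, so interactive dashboard queries never touch the large itemized-sales table.

from pyspark.sql import functions as F

# Recompute the dashboard aggregates once nightly and overwrite a small
# summary table; column names and aggregates are assumptions.
summary = (spark.table("products_per_order")
           .groupBy("store_id")
           .agg(F.sum("sales_amount").alias("total_sales"),
                F.avg("sales_amount").alias("avg_sale")))

(summary.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("gold.store_sales_summary"))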

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

  • A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
  • B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
  • C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
  • D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.
  • E. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.


Answer : A
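
A sketch of that approach, with hypothetical source and target paths: sizing the input partitions at read time means the downstream narrow transformations preserve the partitioning, so roughly 512 MB part files are written without any shuffle.

# Control input partition size at read time; only narrow transformations
# follow, so no shuffle is introduced before the write.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

raw = spark.read.format("json").load("dbfs:/mnt/raw/events/")   # assumed path
cleaned = raw.dropna(subset=["event_id"])                       # example narrow transformation

cleaned.write.format("parquet").mode("overwrite").save("dbfs:/mnt/curated/events/")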

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:



Choose the response that correctly fills in the blank within the code block to complete this task.

  • A. withWatermark("event_time", "10 minutes")
  • B. awaitArrival("event_time", "10 minutes")
  • C. await("event_time + '10 minutes'")
  • D. slidingWindow("event_time", "10 minutes")
  • E. delayWrite("event_time", "10 minutes")


Answer : A
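
A sketch of the completed aggregation, using the schema from the question; the output variable name is an assumption. The watermark keeps state for ten minutes of event-time lateness, and the five-minute window is non-overlapping (tumbling) because no slide duration is given.

from pyspark.sql import functions as F

agg = (df
       .withWatermark("event_time", "10 minutes")
       .groupBy(F.window("event_time", "5 minutes"))
       .agg(F.avg("temp").alias("avg_temp"),
            F.avg("humidity").alias("avg_humidity")))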

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new promotion, and they would like to add a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.

Original query:



Proposed query:



.start("/item_agg")

Which step must also be completed to put the proposed query into production?

  • A. Specify a new checkpointLocation
  • B. Increase the shuffle partitions to account for additional aggregates
  • C. Run REFRESH TABLE delta.`/item_agg`
  • D. Register the data in the "/item_agg" directory to the Hive metastore
  • E. Remove .option("mergeSchema", "true") from the streaming write


Answer : A
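
A hedged sketch of the restart, since the exact query is only shown as an image: the modified aggregation cannot resume from the old checkpoint because the state schema has changed, so the stream is started against a fresh checkpoint directory. The DataFrame name and paths are assumptions.

# item_agg_df stands in for the streaming aggregation from the question.
query = (item_agg_df.writeStream
         .format("delta")
         .outputMode("complete")
         .option("checkpointLocation", "/checkpoints/item_agg_v2")   # new location
         .start("/item_agg"))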

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each microbatch of data is processed in less than 3s; at least 12 times per minute, a microbatch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.

Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

  • A. Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.
  • B. Increase the number of shuffle partitions to maximize parallelism, since the trigger interval cannot be modified without modifying the checkpoint directory.
  • C. Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to maximum allowable threshold should minimize this cost.
  • D. Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.
  • E. Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.


Answer : C
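
A minimal sketch of the change, with hypothetical names and paths: replacing the default trigger (run each micro-batch as soon as the previous one finishes) with a ten-minute processing-time trigger sharply reduces how often the source storage account is listed, while still meeting the ten-minute processing requirement.

query = (events_df.writeStream
         .format("delta")
         .option("checkpointLocation", "/checkpoints/events_agg")
         .trigger(processingTime="10 minutes")      # was the default trigger
         .start("/delta/events_agg"))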

Which statement describes the correct use of pyspark.sql.functions.broadcast?

  • A. It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.
  • B. It marks a column as small enough to store in memory on all executors, allowing a broadcast join.
  • C. It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.
  • D. It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.
  • E. It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.


Answer : D
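
A short sketch of the intended usage, with assumed table and column names: the hint tells Spark the dimension DataFrame is small enough to ship to every executor, so the large fact table is joined without a shuffle.

from pyspark.sql.functions import broadcast

orders = spark.table("prod.orders")     # large fact table (assumed name)
stores = spark.table("prod.stores")     # small dimension table (assumed name)

joined = orders.join(broadcast(stores), on="store_id", how="left")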

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

  • A. Set the configuration delta.deduplicate = true.
  • B. VACUUM the Delta table after each batch completes.
  • C. Perform an insert-only merge with a matching condition on a unique key.
  • D. Perform a full outer join on a unique key and overwrite existing data.
  • E. Rely on Delta Lake schema enforcement to prevent duplicate records.


Answer : C
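
A sketch of an insert-only merge, with hypothetical table, DataFrame, and key names: rows whose unique key already exists in the target are simply skipped, so duplicates of previously processed records are never re-inserted.

from delta.tables import DeltaTable

# batch_df stands in for the already de-duplicated incoming batch.
target = DeltaTable.forName(spark, "silver.events")

(target.alias("t")
    .merge(batch_df.alias("s"), "t.event_id = s.event_id")
    .whenNotMatchedInsertAll()
    .execute())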
