AWS Certified Machine Learning - Specialty (MLS-C01) v1.0 (AWS Certified Machine Learning - Specialty)

Disclaimer: We declare no affiliation, sponsorship, nor any partnerships with Amazon or any of its trademarks.
Page:    1 / 13   
Total 195 questions

A Machine Learning Specialist is working with a large cybersecurity company that manages security events in real time for companies around the world. The cybersecurity company wants to design a solution that will allow it to use machine learning to score malicious events as anomalies on the data as it is being ingested. The company also wants to be able to save the results in its data lake for later processing and analysis.
What is the MOST efficient way to accomplish these tasks?

  • A. Ingest the data using Amazon Kinesis Data Firehose, and use Amazon Kinesis Data Analytics Random Cut Forest (RCF) for anomaly detection. Then use Kinesis Data Firehose to stream the results to Amazon S3.
  • B. Ingest the data into Apache Spark Streaming using Amazon EMR, and use Spark MLlib with k-means to perform anomaly detection. Then store the results in an Apache Hadoop Distributed File System (HDFS) using Amazon EMR with a replication factor of three as the data lake.
  • C. Ingest the data and store it in Amazon S3. Use AWS Batch along with the AWS Deep Learning AMIs to train a k-means model using TensorFlow on the data in Amazon S3.
  • D. Ingest the data and store it in Amazon S3. Have an AWS Glue job that is triggered on demand transform the new data. Then use the built-in Random Cut Forest (RCF) model within Amazon SageMaker to detect anomalies in the data.


Answer : A

A Data Scientist wants to gain real-time insights into a data stream of GZIP files.
Which solution would allow the use of SQL to query the stream with the LEAST latency?

  • A. Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.
  • B. AWS Glue with a custom ETL script to transform the data.
  • C. An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster.
  • D. Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket.


Answer : A

Reference:
https://aws.amazon.com/big-data/real-time-analytics-featured-partners/

A retail company intends to use machine learning to categorize new products. A labeled dataset of current products was provided to the Data Science team. The dataset includes 1,200 products. The labeled dataset has 15 features for each product, such as title, dimensions, weight, and price. Each product is labeled as belonging to one of six categories, such as books, games, electronics, and movies.
Which model should be used for categorizing new products using the provided dataset for training?

  • A. An XGBoost model where the objective parameter is set to multi:softmax
  • B. A deep convolutional neural network (CNN) with a softmax activation function for the last layer
  • C. A regression forest where the number of trees is set equal to the number of product categories
  • D. A DeepAR forecasting model based on a recurrent neural network (RNN)


Answer : A

A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor, and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset.
Which tool should be used to improve the validation accuracy?

  • A. Amazon Comprehend syntax analysis and entity detection
  • B. Amazon SageMaker BlazingText cbow mode
  • C. Natural Language Toolkit (NLTK) stemming and stop word removal
  • D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer


Answer : D

Reference:
https://monkeylearn.com/sentiment-analysis/
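
As a rough illustration of what TF-IDF weighting does with a sparse vocabulary, the sketch below computes it by hand with the Python standard library. It is a simplified formula (raw term frequency times log(N / df)), not scikit-learn's exact implementation, which by default also smooths the IDF term and normalizes each vector. The function name `tf_idf` and the sample reviews are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf is the term count divided by the document length;
    idf is log(N / df), where df is the number of documents containing the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample data for illustration only.
reviews = ["great phone great battery", "terrible battery", "great screen"]
weights = tf_idf(reviews)
```

Terms that appear in few documents (such as "terrible") receive a higher weight than terms that appear in many (such as "battery"), which is the property the question relies on.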

A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors. While exploring the data, the Specialist notices that the magnitudes of the input features vary greatly. The Specialist does not want variables with a larger magnitude to dominate the model.
What should the Specialist do to prepare the data for model training?

  • A. Apply quantile binning to group the data into categorical bins to keep any relationships in the data by replacing the magnitude with distribution.
  • B. Apply the Cartesian product transformation to create new combinations of fields that are independent of the magnitude.
  • C. Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant magnitude.
  • D. Apply the orthogonal sparse bigram (OSB) transformation to apply a fixed-size sliding window to generate new features of a similar magnitude.


Answer : C

Reference:
https://docs.aws.amazon.com/machine-learning/latest/dg/data-transformations-reference.html
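
The normalization described in option C (mean of 0, variance of 1) can be sketched with the standard library as follows. This is a minimal illustration, assuming each feature is a non-constant list of floats; the feature names and values are made up.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and variance 1 (z-score normalization)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # Assumes std > 0, i.e. the feature is not constant.
    return [(v - mean) / std for v in values]


# Hypothetical features on very different scales.
gdp_billions = [1200.0, 1500.0, 900.0, 2100.0]
interest_rate = [0.02, 0.035, 0.01, 0.05]

gdp_scaled = standardize(gdp_billions)
rate_scaled = standardize(interest_rate)
```

After scaling, both features are on a comparable scale, so neither dominates the model purely because of its units.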

A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena. The dataset contains more than 800,000 records stored as plaintext CSV files. Each record contains 200 columns and is approximately 1.5 MB in size. Most queries will span 5 to 10 columns only.
How should the Machine Learning Specialist transform the dataset to minimize query runtime?

  • A. Convert the records to Apache Parquet format.
  • B. Convert the records to JSON format.
  • C. Convert the records to GZIP CSV format.
  • D. Convert the records to XML format.


Answer : A

Explanation:
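Parquet is a columnar format, so Amazon Athena reads only the columns a query references instead of scanning all 200 columns of every record. Using compression further reduces the amount of data scanned by Amazon Athena and also reduces your S3 bucket storage. It's a win-win for your AWS bill.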
Supported formats: GZIP, LZO, SNAPPY (Parquet) and ZLIB.
Reference:
https://www.cloudforecast.io/blog/using-parquet-on-athena-to-save-money-on-aws/
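
A back-of-the-envelope estimate shows why a columnar format helps here. The sketch below uses the figures from the question and assumes, purely for illustration, that all columns are the same size and that no compression is applied; real savings depend on the data.

```python
# Figures from the question.
records = 800_000
record_size_mb = 1.5
total_columns = 200
queried_columns = 10  # upper end of the "5 to 10 columns" range

# Row-oriented CSV: every query scans every byte of every record.
csv_scan_gb = records * record_size_mb / 1000

# Columnar layout: only the queried columns are read
# (assumes equal-sized columns and ignores compression).
columnar_scan_gb = csv_scan_gb * queried_columns / total_columns

print(f"CSV scan: {csv_scan_gb:.0f} GB, columnar scan: {columnar_scan_gb:.0f} GB")
```

Under these assumptions a query touches about 60 GB instead of about 1,200 GB, before any compression gains.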

A Machine Learning Specialist is developing a daily ETL workflow containing multiple ETL jobs. The workflow consists of the following processes:
• Start the workflow as soon as data is uploaded to Amazon S3.
• When all the datasets are available in Amazon S3, start an ETL job to join the uploaded datasets with multiple terabyte-sized datasets already stored in Amazon S3.
• Store the results of joining datasets in Amazon S3.
• If one of the jobs fails, send a notification to the Administrator.
Which configuration will meet these requirements?

  • A. Use AWS Lambda to trigger an AWS Step Functions workflow to wait for dataset uploads to complete in Amazon S3. Use AWS Glue to join the datasets. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.
  • B. Develop the ETL workflow using AWS Lambda to start an Amazon SageMaker notebook instance. Use a lifecycle configuration script to join the datasets and persist the results in Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.
  • C. Develop the ETL workflow using AWS Batch to trigger the start of ETL jobs when data is uploaded to Amazon S3. Use AWS Glue to join the datasets in Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.
  • D. Use AWS Lambda to chain other Lambda functions to read and join the datasets in Amazon S3 as soon as the data is uploaded to Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.


Answer : A

Reference:
https://aws.amazon.com/step-functions/use-cases/

An agency collects census information within a country to determine healthcare and social program needs by province and city. The census form collects responses for approximately 500 questions from each citizen.
Which combination of algorithms would provide the appropriate insights? (Choose two.)

  • A. The factorization machines (FM) algorithm
  • B. The Latent Dirichlet Allocation (LDA) algorithm
  • C. The principal component analysis (PCA) algorithm
  • D. The k-means algorithm
  • E. The Random Cut Forest (RCF) algorithm


Answer : CD

Explanation:
PCA reduces the roughly 500 response dimensions to a smaller set of components, and k-means then clusters the respondents into groups with similar needs.

A large consumer goods manufacturer has the following products on sale:
• 34 different toothpaste variants
• 48 different toothbrush variants
• 43 different mouthwash variants
The entire sales history of all these products is available in Amazon S3. Currently, the company is using custom-built autoregressive integrated moving average (ARIMA) models to forecast demand for these products. The company wants to predict the demand for a new product that will soon be launched.
Which solution should a Machine Learning Specialist apply?

  • A. Train a custom ARIMA model to forecast demand for the new product.
  • B. Train an Amazon SageMaker DeepAR algorithm to forecast demand for the new product.
  • C. Train an Amazon SageMaker k-means clustering algorithm to forecast demand for the new product.
  • D. Train a custom XGBoost model to forecast demand for the new product.


Answer : B

Explanation:
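The Amazon SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). Classical forecasting methods, such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ETS), fit a single model to each individual time series. They then use that model to extrapolate the time series into the future. DeepAR instead trains a single model jointly over many related time series, so it can generate forecasts for a new series that is similar to the ones it was trained on, which suits a product launch with no sales history of its own.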
Reference:
https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html

A Machine Learning Specialist uploads a dataset to an Amazon S3 bucket protected with server-side encryption using AWS KMS.
How should the ML Specialist define the Amazon SageMaker notebook instance so it can read the same dataset from Amazon S3?

  • A. Define security group(s) to allow all HTTP inbound/outbound traffic and assign those security group(s) to the Amazon SageMaker notebook instance.
  • B. Configure the Amazon SageMaker notebook instance to have access to the VPC. Grant permission in the KMS key policy to the notebook's KMS role.
  • C. Assign an IAM role to the Amazon SageMaker notebook with S3 read access to the dataset. Grant permission in the KMS key policy to that role.
  • D. Assign the same KMS key used to encrypt data in Amazon S3 to the Amazon SageMaker notebook instance.


Answer : C

Reference:
https://docs.aws.amazon.com/sagemaker/latest/dg/encryption-at-rest.html
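
The arrangement described in option C pairs two permissions: the notebook's IAM role needs S3 read access, and the KMS key policy must let that role decrypt. The sketch below builds the two policy statements as Python dictionaries to show their shape. The account ID, role name, and bucket name are hypothetical placeholders, and a real key policy would contain additional statements (for example, for key administrators).

```python
import json

# Hypothetical identifiers, for illustration only.
ROLE_ARN = "arn:aws:iam::111122223333:role/ExampleNotebookRole"
BUCKET_ARN = "arn:aws:s3:::example-dataset-bucket"

# Statement for the IAM policy attached to the notebook's role:
# allows reading objects from the dataset bucket.
role_s3_statement = {
    "Effect": "Allow",
    "Action": ["s3:GetObject"],
    "Resource": f"{BUCKET_ARN}/*",
}

# Statement for the KMS key policy: allows the same role to decrypt
# objects encrypted with this key. In a key policy, "*" refers to the key itself.
key_policy_statement = {
    "Sid": "AllowNotebookRoleToDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": ROLE_ARN},
    "Action": ["kms:Decrypt"],
    "Resource": "*",
}

print(json.dumps(key_policy_statement, indent=2))
```

Both halves are needed: S3 access alone returns the object's ciphertext path but the read fails without permission to use the key.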

A Data Scientist needs to migrate an existing on-premises ETL process to the cloud. The current process runs at regular time intervals and uses PySpark to combine and format multiple large data sources into a single consolidated output for downstream processing.
The Data Scientist has been given the following requirements for the cloud solution:
✑ Combine multiple data sources.
✑ Reuse existing PySpark logic.
✑ Run the solution on the existing schedule.
✑ Minimize the number of servers that will need to be managed.
Which architecture should the Data Scientist use to build this solution?

  • A. Write the raw data to Amazon S3. Schedule an AWS Lambda function to submit a Spark step to a persistent Amazon EMR cluster based on the existing schedule. Use the existing PySpark logic to run the ETL job on the EMR cluster. Output the results to a "processed" location in Amazon S3 that is accessible for downstream use.
  • B. Write the raw data to Amazon S3. Create an AWS Glue ETL job to perform the ETL processing against the input data. Write the ETL job in PySpark to leverage the existing logic. Create a new AWS Glue trigger to trigger the ETL job based on the existing schedule. Configure the output target of the ETL job to write to a "processed" location in Amazon S3 that is accessible for downstream use.
  • C. Write the raw data to Amazon S3. Schedule an AWS Lambda function to run on the existing schedule and process the input data from Amazon S3. Write the Lambda logic in Python and implement the existing PySpark logic to perform the ETL process. Have the Lambda function output the results to a "processed" location in Amazon S3 that is accessible for downstream use.
  • D. Use Amazon Kinesis Data Analytics to stream the input data and perform real-time SQL queries against the stream to carry out the required transformations within the stream. Deliver the output results to a "processed" location in Amazon S3 that is accessible for downstream use.


Answer : B

A Data Scientist is building a model to predict customer churn using a dataset of 100 continuous numerical features. The Marketing team has not provided any insight about which features are relevant for churn prediction. The Marketing team wants to interpret the model and see the direct impact of relevant features on the model outcome. While training a logistic regression model, the Data Scientist observes that there is a wide gap between the training and validation set accuracy.
Which methods can the Data Scientist use to improve the model performance and satisfy the Marketing team's needs? (Choose two.)

  • A. Add L1 regularization to the classifier
  • B. Add features to the dataset
  • C. Perform recursive feature elimination
  • D. Perform t-distributed stochastic neighbor embedding (t-SNE)
  • E. Perform linear discriminant analysis


Answer : AC

An aircraft engine manufacturing company is measuring 200 performance metrics in a time series. Engineers want to detect critical manufacturing defects in near-real time during testing. All of the data needs to be stored for offline analysis.
What approach would be the MOST effective to perform near-real time defect detection?

  • A. Use AWS IoT Analytics for ingestion, storage, and further analysis. Use Jupyter notebooks from within AWS IoT Analytics to carry out analysis for anomalies.
  • B. Use Amazon S3 for ingestion, storage, and further analysis. Use an Amazon EMR cluster to carry out Apache Spark ML k-means clustering to determine anomalies.
  • C. Use Amazon S3 for ingestion, storage, and further analysis. Use the Amazon SageMaker Random Cut Forest (RCF) algorithm to determine anomalies.
  • D. Use Amazon Kinesis Data Firehose for ingestion and Amazon Kinesis Data Analytics Random Cut Forest (RCF) to perform anomaly detection. Use Kinesis Data Firehose to store data in Amazon S3 for further analysis.


Answer : D

A Machine Learning team runs its own training algorithm on Amazon SageMaker. The training algorithm requires external assets. The team needs to submit both its own algorithm code and algorithm-specific parameters to Amazon SageMaker.
What combination of services should the team use to build a custom algorithm in Amazon SageMaker? (Choose two.)

  • A. AWS Secrets Manager
  • B. AWS CodeStar
  • C. Amazon ECR
  • D. Amazon ECS
  • E. Amazon S3


Answer : CE

A Machine Learning Specialist wants to determine the appropriate SageMakerVariantInvocationsPerInstance setting for an endpoint automatic scaling configuration. The Specialist has performed a load test on a single instance and determined that peak requests per second (RPS) without service degradation is about 20 RPS. As this is the first deployment, the Specialist intends to set the invocation safety factor to 0.5.
Based on the stated parameters and given that the invocations per instance setting is measured on a per-minute basis, what should the Specialist set as the SageMakerVariantInvocationsPerInstance setting?

  • A. 10
  • B. 30
  • C. 600
  • D. 2,400


Answer : C
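
The arithmetic behind this answer is peak RPS multiplied by the safety factor, converted from per-second to per-minute. A minimal sketch, with a hypothetical helper name:

```python
def invocations_per_instance(max_rps, safety_factor, seconds_per_minute=60):
    """Per-minute invocation target for one instance.

    max_rps: peak requests per second sustained without degradation.
    safety_factor: fraction of that peak to use as the scaling target.
    """
    return max_rps * safety_factor * seconds_per_minute


# Values from the question: 20 RPS peak, safety factor 0.5.
target = invocations_per_instance(max_rps=20, safety_factor=0.5)
print(target)  # 20 * 0.5 * 60 = 600.0
```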
