How to prepare for Microsoft Big Data, AI, ML Certifications 70-773, 70-774, 70-775

Exam 70-773 : Analyzing Big Data with Microsoft R

Read and explore big data

  • Read data with R Server
    • Read supported data file formats, such as text files, SAS, and SPSS; convert data to XDF format; identify trade-offs between XDF and flat text files; read data through Open Database Connectivity (ODBC) data sources; read in files from other file systems; use an internal data frame as a data source; process data from sources that cannot be read natively by R Server
  • Summarize data
    • Compute crosstabs and univariate statistics, choose when to use rxCrossTabs versus rxCube, integrate with open source technologies by using packages such as dplyrXdf, use group by functionality, create complex formulas to perform multiple tasks in one pass through the data, extract quantiles by using rxQuantile
  • Visualize data
    • Visualize in-memory data with base plotting functions and ggplot2; create custom visualizations with rxSummary and rxCube; visualize data with rxHistogram and rxLinePlot, including faceted plots

Process big data

  • Process data with rxDataStep
    • Subset rows of data, modify and create columns by using the Transforms argument, choose when to use on-the-fly transformations versus in-data transform trade-offs, handle missing values through filtering or replacement, generate a data frame or an XDF file, process dates (POSIXct, POSIXlt)
  • Perform complex transforms that use transform functions
    • Define a transform function; reshape data by using a transform function; use open source packages, such as lubridate; pass in values by using transformVars and transformEnvir; use internal .rx variables and functions for tasks, including cross-chunk communication
  • Manage data sets
    • Sort data in various orders, such as ascending and descending; use rxSort deduplication to remove duplicate values; merge data sources using rxMerge(); merge options and types; identify when alternatives to rxSort and rxMerge should be used
  • Process text using RML packages
    • Create features using RML functions, such as featurizeText(); create indicator variables and arrays using RML functions, such as categorical() and categoricalHash(); perform feature selection using RML functions
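
The chunk-at-a-time processing and cross-chunk communication mentioned above can be illustrated outside R. The following is a minimal Python sketch, not R Server code: `read_chunks` and `one_pass_summary` are hypothetical helper names, and the running `state` dictionary only stands in for the idea of carrying values from one chunk to the next while handling missing values in a single pass.

```python
def read_chunks(rows, chunk_size):
    """Yield consecutive slices of rows, imitating chunked reads of a large file."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def one_pass_summary(rows, chunk_size=2):
    """Compute the mean and the missing-value count in one pass over the chunks."""
    # State shared across chunks (hypothetical stand-in for cross-chunk variables).
    state = {"n": 0, "total": 0.0, "missing": 0}
    for chunk in read_chunks(rows, chunk_size):
        for value in chunk:
            if value is None:
                state["missing"] += 1   # missing values are filtered out
            else:
                state["n"] += 1
                state["total"] += value
    mean = state["total"] / state["n"] if state["n"] else None
    return {"mean": mean, "valid": state["n"], "missing": state["missing"]}


summary = one_pass_summary([1.0, None, 3.0, 5.0, None])
print(summary)
```

Only one chunk is held in the inner loop at a time, which is the reason chunked processing scales to data that does not fit in memory.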

Build predictive models with ScaleR

  • Estimate linear models
    • Use rxLinMod, rxGlm, and rxLogit to estimate linear models; set the family for a generalized linear model by using functions such as rxTweedie; process data on the fly by using the appropriate arguments and functions, such as the F function and Transforms argument; weight observations through frequency or probability weights; choose between different types of automatic variable selections, such as greedy searches, repeated scoring, and byproduct of training; identify the impact of missing values during automatic variable selection
  • Build and use partitioning models
    • Use rxDTree, rxDForest, and rxBTrees to build partitioning models; adjust the weighting of false positives and misses by using loss; select parameters that affect bias and variance, such as pruning, learning rate, and tree depth; use as.rpart to interact with open source ecosystems
  • Generate predictions and residuals
    • Use rxPredict to generate predictions; perform parallel scoring using rxExec; generate different types of predictions, such as link and response scores for GLM, response, prob, and vote for rxDForest; generate different types of residuals, such as Usual, Pearson, and DBM
  • Evaluate models and tuning parameters
    • Summarize estimated models; run arbitrary code out of process, such as parallel parameter tuning by using rxExec; evaluate tree models by using RevoTreeView and rxVarImpPlot; calculate model evaluation metrics by using built-in functions; calculate model evaluation metrics and visualizations by using custom code, such as mean absolute percentage error and precision recall curves
  • Create additional models using RML packages
    • Build and use a One-Class Support Vector Machine, build and use linear and logistic regressions that use L1 and L2 regularization, build and use a decision tree by using FastTree, use FastTree as a recommender with ranking loss (NDCG), build and use a simple three-layer feed-forward neural network
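
The custom evaluation metrics named above (mean absolute percentage error and precision-recall curves) are simple to compute by hand. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, labels are 0/1, and rows whose actual value is zero are skipped in the MAPE because the percentage error is undefined there.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent; rows with actual == 0 are skipped."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


def precision_recall_points(labels, scores):
    """Return (threshold, precision, recall) for every distinct score threshold."""
    total_pos = sum(labels)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Labels of every row predicted positive at this threshold.
        flagged = [label for label, score in zip(labels, scores) if score >= threshold]
        true_pos = sum(flagged)
        precision = true_pos / len(flagged)
        recall = true_pos / total_pos if total_pos else 0.0
        points.append((threshold, precision, recall))
    return points


print(mape([100, 200], [110, 180]))
for point in precision_recall_points([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]):
    print(point)
```

Plotting recall against precision for the returned points gives the precision-recall curve.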

Use R Server in different environments

  • Use different compute contexts to run R Server effectively
    • Change the compute context (rxHadoopMR, rxSpark, rxLocalseq, and rxLocalParallel); identify which compute context to use for different tasks; use different data source objects, depending on the context (RxOdbcData and RxTextData); identify and use appropriate data sources for different data sources and compute contexts (HDFS and SQL Server); debug processes across different compute contexts; identify use cases for RevoPemaR
  • Optimize tasks by using local compute contexts
    • Identify and execute tasks that can be run only in the local compute context, identify tasks that are more efficient to run in the local compute context, choose between rxLocalseq and rxLocalParallel, profile across different compute contexts
  • Perform in-database analytics by using SQL Server
    • Choose when to perform in-database versus out-of-database computations, identify limitations of in-database computations, use in-database versus out-of-database compute contexts appropriately, use stored procedures for data processing steps, serialize objects and write back to binary fields in a table, write tables, configure R to optimize SQL Server (chunksize, numtasks, and computecontext), effectively communicate performance properties to SQL administrators and architects (SQL Server Profiler)
  • Implement analysis workflows in the Hadoop ecosystem and Spark
    • Use appropriate R Server functions in Spark; integrate with Hive, Pig, and Hadoop MapReduce; integrate with the Spark ecosystem of tools, such as SparklyR and SparkR; profile and tune across different compute contexts; use doRSR for parallelizing code that was written using open source foreach
  • Deploy predictive models to SQL Server and Azure Machine Learning
    • Deploy predictive models to SQL Server as a stored procedure, deploy an arbitrary function to Azure Machine Learning by using the AzureML R package, identify when to use DeployR

Exam 70-774 : Perform Cloud Data Science with Azure Machine Learning

Prepare Data for Analysis in Azure Machine Learning and Export from Azure Machine Learning

  • Import and export data to and from Azure Machine Learning
    • Import and export data to and from Azure Blob storage, import and export data to and from Azure SQL Database, import and export data via Hive Queries, import data from a website, import data from on-premises SQL
  • Explore and summarize data
    • Create univariate summaries, create multivariate summaries, visualize univariate distributions, use existing Microsoft R or Python notebooks for custom summaries and custom visualizations, use zip archives to import external packages for R or Python
  • Cleanse data for Azure Machine Learning
    • Apply filters to limit a dataset to the desired rows, identify and address missing data, identify and address outliers, remove columns and rows of datasets
  • Perform feature engineering
    • Merge multiple datasets by rows or columns into a single dataset by columns, merge multiple datasets by rows or columns into a single dataset by rows, add columns that are combinations of other columns, manually select and construct features for model estimation, automatically select and construct features for model estimation, reduce dimensions of data through principal component analysis (PCA), manage variable metadata, select standardized variables based on planned analysis

Develop Machine Learning Models

  • Select an appropriate algorithm or method
    • Select an appropriate algorithm for predicting continuous label data, select an appropriate algorithm for supervised versus unsupervised scenarios, identify when to select R versus Python notebooks, identify an appropriate algorithm for grouping unlabeled data, identify an appropriate algorithm for classifying label data, select an appropriate ensemble
  • Initialize and train appropriate models
    • Tune hyperparameters manually; tune hyperparameters automatically; split data into training and testing datasets, including using routines for cross-validation; build an ensemble using the stacking method
  • Validate models
    • Score and evaluate models, select appropriate evaluation metrics for clustering, select appropriate evaluation metrics for classification, select appropriate evaluation metrics for regression, use evaluation metrics to choose between Machine Learning models, compare ensemble metrics against base models
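
Splitting data with a cross-validation routine, as listed above, amounts to rotating which slice of the rows is held out. This is a minimal Python sketch of k-fold index splitting, not the Azure Machine Learning module itself; `k_fold_indices` is a hypothetical helper and the fixed `seed` is an assumption made for reproducibility.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Return a list of (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k nearly equal, disjoint folds
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


for train, test in k_fold_indices(10, 5):
    print(sorted(test), len(train))
```

Every row lands in exactly one test fold, so averaging the metric over the folds uses each observation for validation once.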

Operationalize and Manage Azure Machine Learning Services

  • Deploy models using Azure Machine Learning
    • Publish a model developed inside Azure Machine Learning, publish an externally developed scoring function using an Azure Machine Learning package, use web service parameters, create and publish a recommendation model, create and publish a language understanding model
  • Manage Azure Machine Learning projects and workspaces
    • Create projects and experiments, add assets to a project, create new workspaces, invite users to a workspace, switch between different workspaces, create a Jupyter notebook that references an intermediate dataset
  • Consume Azure Machine Learning models
    • Connect to a published Machine Learning web service, consume a published Machine Learning model programmatically using a batch execution service, consume a published Machine Learning model programmatically using a request response service, interact with a published Machine Learning model using Microsoft Excel, publish models to the marketplace
  • Consume exemplar Cognitive Services APIs
    • Consume Vision APIs to process images, consume Language APIs to process text, consume Knowledge APIs to create recommendations

Use Other Services for Machine Learning

  • Build and use neural networks with the Microsoft Cognitive Toolkit
    • Use N-series VMs for GPU acceleration, build and train a three-layer feed forward neural network, determine when to implement a neural network
  • Streamline development by using existing resources
    • Clone template experiments from Cortana Intelligence Gallery, use Cortana Intelligence Quick Start to deploy resources, use a data science VM for streamlined development
  • Perform data science at scale by using HDInsight
    • Deploy the appropriate type of HDI cluster, perform exploratory data analysis by using Spark SQL, build and use Machine Learning models with Spark on HDI, build and use Machine Learning models using MapReduce, build and use Machine Learning models using Microsoft R Server
  • Perform database analytics by using SQL Server R Services on Azure
    • Deploy a SQL Server 2016 Azure VM, configure SQL Server to allow execution of R scripts, execute R scripts inside T-SQL statements

Here are documentation links for each topic:

Exam 70-775 : Perform Data Engineering on Microsoft Azure HDInsight

Administer and Provision HDInsight Clusters

  • Deploy HDInsight clusters
    • Create a cluster in a private virtual network, create a cluster that has a custom metastore, create a domain-joined cluster, select an appropriate cluster type based on workload considerations, customize a cluster by using script actions, provision a cluster by using Portal, provision a cluster by using Azure CLI tools, provision a cluster by using Azure Resource Manager (ARM) templates and PowerShell, manage managed disks, configure vNet peering
  • Deploy and secure multi-user HDInsight clusters
    • Provision users who have different roles; manage users, groups, and permissions through Apache Ambari, PowerShell, and Apache Ranger; configure Kerberos; configure service accounts; implement SSH tunneling; restrict access to data
  • Ingest data for batch and interactive processing
    • Ingest data from cloud or on-premises data; store data in Azure Data Lake; store data in Azure Blob Storage; perform routine small writes on a continuous basis using Azure CLI tools; ingest data in Apache Hive and Apache Spark by using Apache Sqoop, Azure Data Factory (ADF), AzCopy, and AdlCopy; ingest data from an on-premises Hadoop cluster
  • Configure HDInsight clusters
    • Manage metastore upgrades; view and edit Ambari configuration groups; view and change service configurations through Ambari; access logs written to Azure Table storage; enable heap dumps for Hadoop services; manage HDInsight configuration by using the HDInsight .NET SDK and PowerShell; perform cluster-level debugging; stop and start services through Ambari; manage Ambari alerts and metrics
  • Manage and debug HDInsight jobs
    • Describe YARN architecture and operation; examine YARN jobs through ResourceManager UI and review running applications; use YARN CLI to kill jobs; find logs for different types of jobs; debug Hadoop and Spark jobs; use Azure Operations Management Suite (OMS) to monitor and manage alerts, and perform predictive actions

Implement Big Data Batch Processing Solutions

  • Implement batch solutions with Hive and Apache Pig
    • Define external Hive tables; load data into a Hive table; use partitioning and bucketing to improve Hive performance; use semi-structured files such as XML and JSON with Hive; join tables with Hive using shuffle joins and broadcast joins; invoke Hive UDFs with Java and Python; design scripts with Pig; identify query bottlenecks using the Hive query graph; identify the appropriate storage format, such as Apache Parquet, ORC, Text, and JSON
  • Design batch ETL solutions for big data with Spark
    • Share resources between Spark applications using YARN queues and preemption, select Spark executor and driver settings for optimal performance, use partitioning and bucketing to improve Spark performance, connect to external Spark data sources, incorporate custom Python and Scala code in a Spark DataSets program, identify query bottlenecks using the Spark SQL query graph
  • Operationalize Hadoop and Spark
    • Create and customize a cluster by using ADF; attach storage to a cluster and run an ADF activity; choose between bring-your-own and on-demand clusters; use Apache Oozie with HDInsight; choose between Oozie and ADF; share metastore and storage accounts between a Hive cluster and a Spark cluster to enable the same table across the cluster types; select an appropriate storage type for a data pipeline, such as Blob storage, Azure Data Lake, and local Hadoop Distributed File System (HDFS)
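
Partitioning and bucketing, which appear in both the Hive and Spark objectives above, decide where each row is stored. This is a rough Python sketch of the idea only: the partition value selects a directory and a hash of the bucketing key selects a file within it. The CRC32 hash, the function names, and the `bucket_00003`-style file naming are assumptions for illustration and do not match the hash or file names that Hive or Spark actually use.

```python
import zlib


def bucket_for(key, num_buckets):
    """Map a bucketing key to a bucket number with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def layout_path(table, partition_col, partition_value, key, num_buckets):
    """Build a hypothetical storage path: one directory per partition, one file per bucket."""
    bucket = bucket_for(key, num_buckets)
    return f"{table}/{partition_col}={partition_value}/bucket_{bucket:05d}"


for customer in ["cust-1", "cust-2", "cust-3"]:
    print(layout_path("sales", "dt", "2017-01-01", customer, 8))
```

A query that filters on the partition column reads only the matching directory, and two tables bucketed the same way on the join key can be joined bucket by bucket.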

Implement Big Data Interactive Processing Solutions

  • Implement interactive queries for big data with Spark SQL
    • Execute queries using Spark SQL, cache Spark DataFrames for iterative queries, save Spark DataFrames as Parquet files, connect BI tools to Spark clusters, optimize join types such as broadcast versus merge joins, manage Spark Thrift server and change the YARN resources allocation, identify use cases for different storage types for interactive queries
  • Perform exploratory data analysis by using Spark SQL
    • Use Jupyter and Apache Zeppelin for visualization and developing tidy Spark DataFrames for modeling, use Spark SQL’s two-table joins to merge DataFrames and cache results, save tidied Spark DataFrames to performant format for reading and analysis (Apache Parquet), manage interactive Livy sessions and their resources
  • Implement interactive queries for big data with Interactive Hive
    • Enable Hive LLAP through Hive settings, manage and configure memory allocation for Hive LLAP jobs, connect BI tools to Interactive Hive clusters
  • Perform exploratory data analysis by using Hive
    • Perform interactive querying and visualization, use Ambari Views, use HiveQL, parse CSV files with Hive, use ORC versus Text for caching, use internal and external tables in Hive, use Zeppelin to visualize data
  • Perform interactive processing by using Apache Phoenix on HBase
    • Use Phoenix in HDInsight; use Phoenix Grammar for queries; configure transactions, user-defined functions, and secondary indexes; identify and optimize Phoenix performance; select between Hive, Spark, and Phoenix on HBase for interactive processing; identify when to share metastore between a Hive cluster and a Spark cluster.

Implement Big Data Real-Time Processing Solutions

  • Create Spark streaming applications using DStream API
    • Define DStreams and compare them to Resilient Distributed Datasets (RDDs), start and stop streaming applications, transform DStreams (flatMap, reduceByKey, updateStateByKey), persist long-term data in HBase and SQL, persist long-term data in Azure Data Lake and Azure Blob Storage, stream data from Apache Kafka or Event Hub, visualize streaming data in a Power BI real-time dashboard
  • Create Spark structured streaming applications
    • Use DataFrames and DataSets APIs to create streaming DataFrames and Datasets; create Window Operations on Event Time; define Window Transformations for Stateful and Stateless Operations; stream Window Functions, Reduce by Key, and Window to Summarize Streaming Data; persist long-term data in HBase and SQL; persist long-term data in Azure Data Lake and Azure Blob Storage; stream data from Kafka or Event Hub; visualize streaming data in a Power BI real-time dashboard
  • Develop big data real-time processing solutions with Apache Storm
    • Create Storm clusters for real-time jobs, persist long-term data in HBase and SQL, persist long-term data in Azure Data Lake and Azure Blob Storage, stream data from Kafka or Event Hub, configure event windows in Storm, visualize streaming data in a Power BI real-time dashboard, define Storm topologies and describe Storm Computation Graph Architecture, create Storm streams and conduct streaming joins, run Storm topologies in local mode for testing, configure Storm applications (Workers, Debug mode), conduct Stream groupings to broadcast tuples across components, debug and monitor Storm jobs
  • Build solutions that use Kafka
    • Create Spark and Storm clusters in the virtual network, manage partitions, configure MirrorMaker, start and stop services through Ambari, manage topics
  • Build solutions that use HBase
    • Identify HBase use cases in HDInsight, use HBase Shell to create, update, and drop HBase tables, monitor an HBase cluster, optimize the performance of an HBase cluster, identify use cases for using Phoenix for analytics of real-time data, implement replication in HBase
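
The DStream transformations named in the streaming objectives above (flatMap, reduceByKey, updateStateByKey) can be understood without a cluster. This is a plain Python sketch that imitates their semantics on a list of micro-batches; it does not use the Spark API, and every function name here is a hypothetical stand-in.

```python
def flat_map(func, records):
    """Apply func to each record and flatten the results into one list."""
    return [item for record in records for item in func(record)]


def reduce_by_key(pairs):
    """Sum the values of (key, value) pairs that share a key within one batch."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


def update_state_by_key(state, batch_totals):
    """Merge one batch's totals into the running state kept across batches."""
    new_state = dict(state)
    for key, value in batch_totals.items():
        new_state[key] = new_state.get(key, 0) + value
    return new_state


def run_stream(batches):
    """Process micro-batches of text lines and return the running word counts."""
    state = {}
    for batch in batches:
        words = flat_map(str.split, batch)
        totals = reduce_by_key((word, 1) for word in words)
        state = update_state_by_key(state, totals)
    return state


print(run_stream([["a b a"], ["b c"]]))
```

The first two steps are stateless and look only at the current batch, while the last step is the stateful one: it is the only place where information survives from one micro-batch to the next.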

Step-by-Step Tutorials

Learn how to use Azure HDInsight in different scenarios:

Microsoft Professional Program in Big Data


One of my favourite videos: a 5-hour session covering most of the topics of all three exams (70-773, 70-774, and 70-775).
