Understanding Binary Classification with ML.NET (Part 3 of 5)

This is the third part of a five-part blog series on ML.NET. Here are the first and second parts:

First blog post, an introduction to ML.NET: https://cloudandmobileblog.com/2018/07/09/introduction-of-machine-learning-net-part-1-of-5/

Second blog post, on clustering in ML.NET: https://cloudandmobileblog.com/2018/07/15/clustering-in-machinelearning-net/


Binary (or binomial) classification is the task of classifying the elements of a given set into two groups, predicting which group each one belongs to, on the basis of a classification rule. Binary classification generally falls into the domain of supervised learning, since the training dataset is labeled. As the name suggests, it is simply the special case of classification in which there are only two classes.
Some typical examples include:

  1. Fraudulent credit card transaction detection
  2. Medical Diagnosis
  3. Spam Detection

Several paradigms are used for learning binary classifiers, including:

  1. Decision Trees
  2. Neural Networks
  3. Bayesian Classification
  4. Support Vector Machines

The actual output of many binary classification algorithms is a prediction score. The score indicates the system’s certainty that the given observation belongs to the positive class. To decide whether the observation should be classified as positive or negative, you, as the consumer of this score, pick a classification threshold (cut-off) and compare the score against it. Observations with scores higher than the threshold are predicted as the positive class, and observations with scores lower than the threshold are predicted as the negative class.
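To make the thresholding step concrete, here is a minimal sketch in plain C# (it does not use ML.NET). The class and method names are hypothetical, and treating a score exactly equal to the threshold as negative is an assumption of this sketch, since the description above only covers scores strictly higher or lower than the threshold.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of ML.NET.
public static class ThresholdDemo
{
    // Returns true (positive class) when the score is strictly above the threshold.
    // Assumption: a score equal to the threshold is treated as negative.
    public static bool Classify(double score, double threshold) => score > threshold;

    // Applies the same cut-off to a batch of scores.
    public static bool[] ClassifyAll(double[] scores, double threshold) =>
        scores.Select(s => Classify(s, threshold)).ToArray();
}
```

For example, with scores 0.10, 0.45 and 0.80, a threshold of 0.5 marks only the last observation as positive, while lowering the threshold to 0.4 marks the last two as positive.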

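Depending on your business problem, you might be more interested in a model that performs well on a specific subset of evaluation metrics, such as precision (the fraction of positive predictions that are actually positive) and recall (the fraction of actual positives that the model finds). For example, two business applications might have very different requirements for their ML models: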

  • One application might need to be extremely sure about the positive predictions actually being positive (high precision) and be able to afford to misclassify some positive examples as negative (moderate recall).
  • Another application might need to correctly predict as many positive examples as possible (high recall) and will accept some negative examples being misclassified as positive (moderate precision).
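The trade-off between these two requirements can be sketched by counting outcomes at different thresholds. The following plain C# snippet is illustrative only: the class name, method name, and the sample data in the note below are hypothetical, and returning 0 when a ratio is undefined is an assumption of this sketch.

```csharp
using System;

// Hypothetical helper, not part of ML.NET.
public static class PrecisionRecallDemo
{
    // Counts true positives, false positives and false negatives at the given
    // threshold, then returns (precision, recall).
    public static (double Precision, double Recall) Compute(
        double[] scores, bool[] actual, double threshold)
    {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < scores.Length; i++)
        {
            bool predicted = scores[i] > threshold;
            if (predicted && actual[i]) tp++;        // correctly predicted positive
            else if (predicted && !actual[i]) fp++;  // negative misclassified as positive
            else if (!predicted && actual[i]) fn++;  // positive misclassified as negative
        }

        // Assumption: report 0 when the denominator is zero instead of dividing by zero.
        double precision = tp + fp == 0 ? 0.0 : (double)tp / (tp + fp);
        double recall = tp + fn == 0 ? 0.0 : (double)tp / (tp + fn);
        return (precision, recall);
    }
}
```

With made-up scores 0.9, 0.8, 0.6, 0.4, 0.3, 0.1 and actual labels positive, positive, negative, positive, negative, negative, a threshold of 0.7 gives precision 1.0 and recall of about 0.67 (the first kind of application), while a threshold of 0.35 gives precision 0.75 and recall 1.0 (the second kind).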

Problem

This problem is centered around predicting whether a passenger aboard the Titanic survived. We will use the data provided in the Real-World Machine Learning repo, in which each passenger has been assigned a label:

  • 0 – did not survive
  • 1 – survived

Using these datasets, we will build a model that takes a passenger's attributes (class, sex, age, and so on) and predicts whether that passenger survived.


Step 1: Create a new .NET Core console app. I am using Visual Studio for Mac, as shown below; you can also use Visual Studio Code on Linux or Visual Studio 2017 on Windows.

1-NewConsoleApp.png

I named my application TitanicSurvivalClassifier.

2-ProjectCreated.png


Step 2: Add the Microsoft.ML NuGet package, then add these two CSV files to the project for training and evaluating our model. Set the “Copy to output directory” property on each file so that it is copied next to the built application, which is where the program looks for it.

https://github.com/abhiongithub/ML-for-Dot-Net-developers/blob/master/3-BinaryClassification/TitanicSurvivalClassifier/TitanicSurvivalClassifier/titanic-train.csv

https://github.com/abhiongithub/ML-for-Dot-Net-developers/blob/master/3-BinaryClassification/TitanicSurvivalClassifier/TitanicSurvivalClassifier/titanic-test.csv

 


Step 3: Now add a TitanicData.cs file as shown below.

using System;
using Microsoft.ML.Runtime.Api;

namespace TitanicSurvivalClassifier
{
    // One row of the Titanic CSV. Each [Column] ordinal is the zero-based column index in the file.
    public class TitanicData
    {
        [Column("0")]
        public float PassengerId;
        // The value the model learns to predict: 0 = did not survive, 1 = survived.
        [Column(ordinal: "1", name: "Label")]
        public float Survived;
        [Column("2")]
        public float Pclass;
        [Column("3")]
        public string Name;
        [Column("4")]
        public string Sex;
        [Column("5")]
        public float Age;
        [Column("6")]
        public float SibSp;
        [Column("7")]
        public float Parch;
        [Column("8")]
        public string Ticket;
        // Fare is read as text here and one-hot encoded later in the pipeline.
        [Column("9")]
        public string Fare;
        [Column("10")]
        public string Cabin;
        [Column("11")]
        public string Embarked;
    }
}



Step 4: Now add a TitanicPrediction.cs file as shown below.

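using Microsoft.ML.Runtime.Api;
namespace TitanicSurvivalClassifier
{
    public class TitanicPrediction
    {
        // Filled from the model's "PredictedLabel" output column.
        [ColumnName("PredictedLabel")]
        public bool Survived;
    }
}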



Step 5: Now add TestTitanicData.cs, which holds a single passenger to predict on.

namespace TitanicSurvivalClassifier
{
    public class TestTitanicData
    {
        // A single passenger used for the sample prediction in Program.cs.
        public static readonly TitanicData Passenger = new TitanicData()
        {
            Pclass = 2,
            Name = "Shelley, Mrs. William (Imanita Parrish Hall)",
            Sex = "female",
            Age = 25,
            SibSp = 0,
            Parch = 1,
            Ticket = "230433",
            Fare = "26",
            Cabin = "",
            Embarked = "S"
        };
    }
}



Step 6: Now modify Program.cs.

using System;
using System.IO;
using System.Threading.Tasks;
// Namespaces of the LearningPipeline-based Microsoft.ML API used in this post.
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Models;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;

namespace TitanicSurvivalClassifier
{
    public static class Program
    {
        private static string AppPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
        private static string TrainDataPath => Path.Combine(AppPath, "titanic-train.csv");
        private static string TestDataPath => Path.Combine(AppPath, "titanic-test.csv");
        private static string ModelPath => Path.Combine(AppPath, "TitanicModel.zip");

        private static void Main(string[] args)
        {
            // STEP 1: Create a model
            var model = TrainAsync().GetAwaiter().GetResult();

            // STEP 2: Test accuracy
            Evaluate(model);

            // STEP 3: Make a prediction
            var prediction = model.Predict(TestTitanicData.Passenger);
            Console.WriteLine($"Did this passenger survive? Actual: Yes Predicted: {(prediction.Survived ? "Yes" : "No")}");
            Console.ReadLine();
        }

        public static async Task<PredictionModel<TitanicData, TitanicPrediction>> TrainAsync()
        {
            // LearningPipeline holds all steps of the learning process: data, transforms, learners.
            var pipeline = new LearningPipeline();

            // The TextLoader loads a dataset. The schema of the dataset is specified by passing a class containing
            // all the column names and their types.
            pipeline.Add(new TextLoader(TrainDataPath).CreateFrom<TitanicData>(useHeader: true, separator: ','));

            // Transform the text (categorical) features into numeric values.
            pipeline.Add(new CategoricalOneHotVectorizer(
                "Sex",
                "Ticket",
                "Fare",
                "Cabin",
                "Embarked"));

            // Put all features into a single vector column named "Features".
            pipeline.Add(new ColumnConcatenator(
                "Features",
                "Pclass",
                "Sex",
                "Age",
                "SibSp",
                "Parch",
                "Ticket",
                "Fare",
                "Cabin",
                "Embarked"));

            // FastTreeBinaryClassifier is the algorithm that will be used to train the model.
            // Three of its hyperparameters are set here: the number of leaves per tree,
            // the number of trees, and the minimum number of training examples per leaf.
            pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });

            Console.WriteLine("=============== Training model ===============");
            // The pipeline is trained on the dataset that has been loaded and transformed.
            var model = pipeline.Train<TitanicData, TitanicPrediction>();

            // Save the model as a .zip file.
            await model.WriteAsync(ModelPath);
            Console.WriteLine("=============== End training ===============");
            Console.WriteLine("The model is saved to {0}", ModelPath);
            return model;
        }

        private static void Evaluate(PredictionModel<TitanicData, TitanicPrediction> model)
        {
            // To evaluate how well the model predicts values, the model is run against a new set
            // of data (test data) that was not involved in training.
            var testData = new TextLoader(TestDataPath).CreateFrom<TitanicData>(useHeader: true, separator: ',');

            // BinaryClassificationEvaluator performs evaluation for binary classification problems.
            var evaluator = new BinaryClassificationEvaluator();
            Console.WriteLine("=============== Evaluating model ===============");
            var metrics = evaluator.Evaluate(model, testData);

            // BinaryClassificationMetrics contains the overall metrics computed by binary classification evaluators.
            // The Accuracy metric gets the accuracy of a classifier, which is the proportion
            // of correct predictions in the test set.
            // The Auc metric gets the area under the ROC curve.
            // The area under the ROC curve is equal to the probability that the classifier ranks
            // a randomly chosen positive instance higher than a randomly chosen negative one
            // (assuming 'positive' ranks higher than 'negative').
            // The F1Score metric gets the classifier's F1 score.
            // The F1 score is the harmonic mean of precision and recall:
            // 2 * precision * recall / (precision + recall).
            Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
            Console.WriteLine($"Auc: {metrics.Auc:P2}");
            Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
            Console.WriteLine("=============== End evaluating ===============");
            Console.WriteLine();
        }
    }
}
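The pipeline above first turns text columns into numbers with CategoricalOneHotVectorizer and then joins everything into one "Features" vector with ColumnConcatenator. The idea behind those two steps can be sketched in plain C#. This is only an illustration of one-hot encoding and concatenation, not how ML.NET implements them; all names are hypothetical, and mapping an unseen category to an all-zero vector is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ML.NET.
public static class OneHotDemo
{
    // Builds a vocabulary: each distinct category gets a slot, in order of first appearance.
    public static List<string> BuildVocabulary(IEnumerable<string> values)
    {
        var vocabulary = new List<string>();
        foreach (var value in values)
            if (!vocabulary.Contains(value)) vocabulary.Add(value);
        return vocabulary;
    }

    // Encodes one value as a vector with a single 1 in the slot of its category.
    // Assumption: a value that is not in the vocabulary becomes an all-zero vector.
    public static float[] Encode(string value, List<string> vocabulary)
    {
        var vector = new float[vocabulary.Count];
        int index = vocabulary.IndexOf(value);
        if (index >= 0) vector[index] = 1f;
        return vector;
    }

    // Joins numeric columns and encoded columns into one flat feature vector.
    public static float[] Concat(params float[][] parts) =>
        parts.SelectMany(p => p).ToArray();
}
```

For instance, a vocabulary built from "male", "female", "male" has two slots, "female" encodes to (0, 1), and concatenating a numeric part (2, 25) for Pclass and Age with that encoding gives the feature vector (2, 25, 0, 1).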



Step 7: Once you run this program, you should see output like the following.

TitanicOutput.png
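To help read the three numbers the program prints, the definitions given in the comments of Evaluate can be written out by hand. The sketch below is plain C# with hypothetical names and is not how ML.NET computes its metrics internally. The AUC method uses the pairwise-ranking definition quoted in those comments; counting tied scores as half a win is an assumption of this sketch.

```csharp
using System;

// Hypothetical helper, not part of ML.NET.
public static class MetricsDemo
{
    // Accuracy: the proportion of predictions that match the actual labels.
    public static double Accuracy(bool[] predicted, bool[] actual)
    {
        int correct = 0;
        for (int i = 0; i < actual.Length; i++)
            if (predicted[i] == actual[i]) correct++;
        return (double)correct / actual.Length;
    }

    // F1: the harmonic mean of precision and recall (0 when both are 0).
    public static double F1(double precision, double recall) =>
        precision + recall == 0 ? 0.0 : 2 * precision * recall / (precision + recall);

    // AUC: the fraction of (positive, negative) pairs in which the positive
    // example gets the higher score. Assumption: a tie counts as half a win.
    public static double Auc(double[] scores, bool[] actual)
    {
        double wins = 0;
        long pairs = 0;
        for (int i = 0; i < scores.Length; i++)
        {
            if (!actual[i]) continue;          // i ranges over positives
            for (int j = 0; j < scores.Length; j++)
            {
                if (actual[j]) continue;       // j ranges over negatives
                pairs++;
                if (scores[i] > scores[j]) wins += 1.0;
                else if (scores[i] == scores[j]) wins += 0.5;
            }
        }
        return pairs == 0 ? double.NaN : wins / pairs;
    }
}
```

As a worked example with made-up data: for scores 0.9, 0.8, 0.6, 0.4, 0.3, 0.1 and actual labels positive, positive, negative, positive, negative, negative, the positive example wins in 8 of the 9 positive/negative pairs, so the AUC is about 0.89. A precision of 0.75 with a recall of 1.0 gives an F1 score of about 0.857.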

You can download the source code of this application from the following GitHub repository.

https://github.com/abhiongithub/ML-for-Dot-Net-developers

Here is the link to the next blog post in this series:

https://cloudandmobileblog.com/2018/07/28/sentiment-analysis-using-machine-learning-net-part-4-of-5/
