Sentiment Analysis using Machine Learning .NET (Part 4 of 5)

This is the fourth part of a five-part blog series on ML.NET. Here are the links to the previous posts in this series.

First Blog Post on the introduction of Machine Learning.NET https://cloudandmobileblog.com/2018/07/09/introduction-of-machine-learning-net-part-1-of-5/

Second Blog Post on Clustering in Machine Learning .NET  https://cloudandmobileblog.com/2018/07/15/clustering-in-machinelearning-net/

Third Blog Post on Understanding Binary Classification  https://cloudandmobileblog.com/2018/07/28/understanding-binary-classification-using-sentiment-analysis-through-ml-net-part-3-of-5/


Classification is a machine learning task that uses data to determine the category, type, or class of an item or row of data. For example, you can use classification to:

  • Identify sentiment as positive or negative.
  • Classify email as spam, junk, or good.
  • Determine whether a patient’s lab sample is cancerous.
  • Categorize customers by their propensity to respond to a sales campaign.

Classification tasks are frequently one of the following types:

  • Binary: either A or B.
  • Multiclass: multiple categories that can be predicted by using a single model.

This sample is a console app that uses ML.NET to train a model that classifies and predicts sentiment as either positive or negative. It also evaluates the model with a second dataset for quality analysis. The sentiment datasets are from the WikiDetox project.

We first need to understand the problem so we can break it down into parts that support building and training the model. Breaking the problem down allows us to predict and evaluate the results.

The problem for this tutorial is to understand the sentiment of incoming website comments so that the appropriate action can be taken.

We can break the problem down into the sentiment text and sentiment value for the data we want to train the model with, and a predicted sentiment value that we can evaluate and then use operationally.


Step 1: Create a new .NET Core Console App. I am using Visual Studio for Mac as shown below, but you can also use Visual Studio Code on Linux or Visual Studio 2017 on Windows.

[Screenshot: 1-NewConsoleApp]


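Step 2: Now install the Microsoft.ML NuGet package and download the following datasets. The code in this post uses the LearningPipeline API, which belongs to the early preview (0.x) releases of ML.NET and is not part of ML.NET 1.0 and later, so install a matching preview version of the package.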

https://github.com/abhiongithub/ML-for-Dot-Net-developers/blob/master/4-SentimentAnalysis/SentimentAnalysis/SentimentAnalysis/sentiment-imdb-train.txt

https://github.com/abhiongithub/ML-for-Dot-Net-developers/blob/master/4-SentimentAnalysis/SentimentAnalysis/SentimentAnalysis/sentiment-yelp-test.txt

Add these txt files to the Visual Studio project and set the “Copy to output directory” property on each file so that they are copied next to the built executable.


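Step 3: Now create a new file SentimentData.cs as shown below

using Microsoft.ML.Runtime.Api;

public class SentimentData
{
    // Column 0 of the dataset: the text of the comment.
    [Column("0")]
    public string SentimentText;

    // Column 1 of the dataset: the sentiment value, exposed to the pipeline as "Label".
    [Column("1", name: "Label")]
    public float Sentiment;
}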
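To make the column mapping concrete, here is a minimal sketch of how one line of the dataset relates to the two fields above. It assumes the files are tab-separated with the sentence in column 0 and the label in column 1. The SentimentRow class, the ParseLine helper, and the sample line are hypothetical and only for illustration; in the real program the TextLoader used later in Program.cs does this work.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-in for SentimentData, without the ML.NET attributes.
public class SentimentRow
{
    public string SentimentText;  // column 0: the comment text
    public float Sentiment;       // column 1: the label
}

public static class LineParser
{
    // Splits one line on tabs (assumed separator) and maps the first two
    // columns to the fields, mirroring [Column("0")] and [Column("1")].
    public static SentimentRow ParseLine(string line)
    {
        var parts = line.Split('\t');
        if (parts.Length < 2)
        {
            throw new FormatException("Expected two tab-separated columns.");
        }

        return new SentimentRow
        {
            SentimentText = parts[0],
            Sentiment = float.Parse(parts[1], CultureInfo.InvariantCulture)
        };
    }
}

public static class ParseDemo
{
    public static void Main()
    {
        // A made-up line, not taken from the real dataset.
        var row = LineParser.ParseLine("A made-up example sentence.\t1");
        Console.WriteLine($"Text: {row.SentimentText} | Label: {row.Sentiment}");
    }
}
```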



Step 4: Now create a new file SentimentPrediction.cs

using Microsoft.ML.Runtime.Api;

public class SentimentPrediction
{
    // Filled from the model's "PredictedLabel" output column.
    [ColumnName("PredictedLabel")]
    public bool Sentiment;
}


Step 5: Now add a new file TestSentimentData.cs

using System.Collections.Generic;

internal class TestSentimentData
{
    // Sample comments to predict on. Sentiment is left at 0 as a placeholder;
    // only SentimentText is used as input when predicting.
    internal static readonly IEnumerable<SentimentData> Sentiments = new[]
    {
        new SentimentData
        {
            SentimentText = "Contoso's 11 is a wonderful experience",
            Sentiment = 0
        },
        new SentimentData
        {
            SentimentText = "The acting in this movie is very bad",
            Sentiment = 0
        },
        new SentimentData
        {
            SentimentText = "Joe versus the Volcano Coffee Company is a great film.",
            Sentiment = 0
        }
    };
}


If you have done everything correctly, your solution should now look like this:

[Screenshot: Screen Shot 2018-07-29 at 12.44.14 AM]


Step 6: Here we will use FastTreeBinaryClassifier, a boosted decision tree learner for binary classification.

A decision (or regression) tree is a binary tree-like flow chart, where at each interior node one decides which of the two child nodes to continue to based on one of the feature values from the input. At each leaf node, a value is returned. In the interior nodes, the decision is based on the test ‘x <= v’ where x is the value of the feature in the input sample and v is one of the possible values of this feature. The functions that can be produced by a regression tree are all the piece-wise constant functions.
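The description above can be sketched in a few lines of C#. This is only an illustration of the ‘x <= v’ test and the leaf values, not how FastTree stores or trains its trees; the TreeNode class, the thresholds, and the leaf values are all made up for the example.

```csharp
using System;

// A node is either a leaf holding a Value, or an interior node that tests
// features[FeatureIndex] <= Threshold to choose between its two children.
public sealed class TreeNode
{
    public int FeatureIndex;
    public double Threshold;
    public TreeNode Left;   // followed when the test is true
    public TreeNode Right;  // followed when the test is false
    public double Value;    // returned when this node is a leaf

    public bool IsLeaf => Left == null && Right == null;

    public static TreeNode Leaf(double value) => new TreeNode { Value = value };

    public static TreeNode Split(int featureIndex, double threshold, TreeNode left, TreeNode right) =>
        new TreeNode { FeatureIndex = featureIndex, Threshold = threshold, Left = left, Right = right };

    // Walks from the root to a leaf and returns the leaf's value.
    public double Evaluate(double[] features)
    {
        var node = this;
        while (!node.IsLeaf)
        {
            node = features[node.FeatureIndex] <= node.Threshold ? node.Left : node.Right;
        }
        return node.Value;
    }
}

public static class TreeDemo
{
    // A hand-built tree with three leaves, so it can only ever return one of
    // three constants: a piece-wise constant function of the input.
    public static TreeNode BuildExample() =>
        TreeNode.Split(0, 0.5,
            TreeNode.Leaf(-1.0),
            TreeNode.Split(1, 2.0, TreeNode.Leaf(0.25), TreeNode.Leaf(1.5)));

    public static void Main()
    {
        var tree = BuildExample();
        Console.WriteLine(tree.Evaluate(new[] { 0.2, 9.0 }));  // feature 0 <= 0.5: left leaf
        Console.WriteLine(tree.Evaluate(new[] { 0.9, 1.0 }));  // right, then feature 1 <= 2.0: left leaf
        Console.WriteLine(tree.Evaluate(new[] { 0.9, 3.0 }));  // right, then right leaf
    }
}
```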

FastTree is an implementation of MART (Multiple Additive Regression Trees), a gradient boosting algorithm. The ensemble of trees is produced by computing, in each step, a regression tree that approximates the gradient of the loss function, and adding it to the previous tree with coefficients that minimize the loss of the new tree. The output of the ensemble produced by MART on a given instance is the sum of the tree outputs.

  • In the case of a binary classification problem, the output is converted to a probability by using some form of calibration.
  • In the case of a regression problem, the output is the predicted value of the function.
  • In the case of a ranking problem, the instances are ordered by the output value of the ensemble.
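As a rough sketch of the binary classification case: the raw score is the sum of the tree outputs, and a calibration step turns it into a probability. The logistic sigmoid used below is an assumption chosen for simplicity, not necessarily the calibrator ML.NET applies, and the two "trees" are made-up functions rather than trained ones.

```csharp
using System;
using System.Linq;

public static class EnsembleSketch
{
    // Raw ensemble output: the sum of the individual tree outputs.
    public static double Score(Func<double[], double>[] trees, double[] features) =>
        trees.Sum(tree => tree(features));

    // Assumed calibration: the logistic sigmoid maps any score into (0, 1).
    public static double ToProbability(double score) => 1.0 / (1.0 + Math.Exp(-score));

    public static void Main()
    {
        // Two made-up "trees", each a piece-wise constant function of the input.
        var trees = new Func<double[], double>[]
        {
            x => x[0] <= 0.5 ? -1.0 : 1.0,
            x => x[1] <= 2.0 ? -0.5 : 0.5
        };

        var features = new[] { 0.9, 3.0 };
        double score = Score(trees, features);       // 1.0 + 0.5
        double probability = ToProbability(score);
        bool isPositive = probability >= 0.5;        // same as score >= 0

        Console.WriteLine($"score={score}, probability={probability:F3}, positive={isPositive}");
    }
}
```

For regression the summed score would be used directly, and for ranking the instances would simply be sorted by it.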

Now update your Program.cs as shown below

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Models;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;

namespace SentimentAnalysis
{
    static class Program
    {
        // The datasets are copied to the output directory, next to the executable.
        private static string AppPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
        private static string TrainDataPath => Path.Combine(AppPath, "sentiment-imdb-train.txt");
        private static string TestDataPath => Path.Combine(AppPath, "sentiment-yelp-test.txt");
        private static string ModelPath => Path.Combine(AppPath, "SentimentModel.zip");

        // An async Main method requires C# 7.1 or later.
        private static async Task Main(string[] args)
        {
            // STEP 1: Create a model
            var model = await TrainAsync();

            // STEP 2: Test accuracy
            Evaluate(model);

            // STEP 3: Make a prediction
            var predictions = model.Predict(TestSentimentData.Sentiments);
            var sentimentsAndPredictions =
                TestSentimentData.Sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));

            foreach (var item in sentimentsAndPredictions)
            {
                Console.WriteLine(
                    $"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")} sentiment");
            }

            Console.ReadLine();
        }

        public static async Task<PredictionModel<SentimentData, SentimentPrediction>> TrainAsync()
        {
            // LearningPipeline holds all steps of the learning process: data, transforms, learners.
            var pipeline = new LearningPipeline();

            // The TextLoader loads a dataset. The schema of the dataset is specified by passing a class
            // containing all the column names and their types.
            pipeline.Add(new TextLoader(TrainDataPath).CreateFrom<SentimentData>());

            // TextFeaturizer is a transform that converts the SentimentText column into a numeric
            // Features column that the learner can use.
            pipeline.Add(new TextFeaturizer("Features", "SentimentText"));

            // FastTreeBinaryClassifier is the algorithm that will be used to train the model.
            // Three hyperparameters are set here to tune the decision trees.
            pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });

            Console.WriteLine("=============== Training model ===============");
            // The pipeline is trained on the dataset that has been loaded and transformed.
            var model = pipeline.Train<SentimentData, SentimentPrediction>();

            // Save the model as a .zip file.
            await model.WriteAsync(ModelPath);

            Console.WriteLine("=============== End training ===============");
            Console.WriteLine("The model is saved to {0}", ModelPath);

            return model;
        }

        private static void Evaluate(PredictionModel<SentimentData, SentimentPrediction> model)
        {
            // To evaluate how well the model predicts values, the model is run against a new set
            // of data (test data) that was not involved in training.
            var testData = new TextLoader(TestDataPath).CreateFrom<SentimentData>();

            // BinaryClassificationEvaluator performs evaluation for binary classification problems.
            var evaluator = new BinaryClassificationEvaluator();

            Console.WriteLine("=============== Evaluating model ===============");
            var metrics = evaluator.Evaluate(model, testData);

            // BinaryClassificationMetrics contains the overall metrics computed by binary classification evaluators.
            // The Accuracy metric is the proportion of correct predictions in the test set.
            // The Auc metric is the area under the ROC curve. It is equal to the probability that
            // the classifier ranks a randomly chosen positive instance higher than a randomly chosen
            // negative one (assuming 'positive' ranks higher than 'negative').
            // The F1Score metric is the harmonic mean of precision and recall:
            // 2 * precision * recall / (precision + recall).
            Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
            Console.WriteLine($"Auc: {metrics.Auc:P2}");
            Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
            Console.WriteLine("=============== End evaluating ===============");
            Console.WriteLine();
        }
    }
}
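To see what the three metrics printed by Evaluate mean, here is a small self-contained sketch that computes them by hand from the definitions in the code comments. It does not use ML.NET; the MetricsSketch class, the confusion-matrix counts, and the scores are made up for illustration, and ML.NET's own evaluator may compute Auc differently internally.

```csharp
using System;

public static class MetricsSketch
{
    // Proportion of correct predictions: (TP + TN) / all predictions.
    public static double Accuracy(int tp, int fp, int tn, int fn) =>
        (double)(tp + tn) / (tp + fp + tn + fn);

    // Harmonic mean of precision and recall.
    // Yields NaN when there are no true positives at all.
    public static double F1(int tp, int fp, int fn)
    {
        double precision = (double)tp / (tp + fp);
        double recall = (double)tp / (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }

    // Probability that a randomly chosen positive instance is scored higher
    // than a randomly chosen negative one; ties count as half.
    public static double Auc(double[] positiveScores, double[] negativeScores)
    {
        double wins = 0;
        foreach (var p in positiveScores)
        {
            foreach (var n in negativeScores)
            {
                wins += p > n ? 1.0 : (p == n ? 0.5 : 0.0);
            }
        }
        return wins / (positiveScores.Length * negativeScores.Length);
    }

    public static void Main()
    {
        // Made-up counts: 40 true positives, 10 false positives,
        // 35 true negatives, 15 false negatives.
        Console.WriteLine($"Accuracy: {Accuracy(40, 10, 35, 15):P2}");
        Console.WriteLine($"F1Score: {F1(40, 10, 15):P2}");

        // Made-up model scores for two positive and two negative instances.
        Console.WriteLine($"Auc: {Auc(new[] { 0.9, 0.6 }, new[] { 0.7, 0.2 }):P2}");
    }
}
```

With these counts, three of the four positive/negative pairs are ranked correctly, so the pairwise Auc is 3/4.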


Step 7: Now when we run this program, we can see that it predicts positive and negative sentiments correctly for the test sentences, as shown below.

[Screenshot: 7-PredictedCorrectSenetiments.png]


The GitHub repository of this sample code is here.

https://github.com/abhiongithub/ML-for-Dot-Net-developers

Here is the link to the next blog post of this series https://cloudandmobileblog.com/2018/07/28/building-fare-predictor-using-regression-evaluator-of-machine-learning-netpart-5-of-5/
