Predict Credit Risk using Binary Classification – Part 1 of 3

This is a 3-part blog series:

  1. Part 1 of 3 – Predict Credit Risk using Binary Classification (This Post)
  2. Part 2 of 3 – Predict Credit Risk using Binary Classification 
  3. Part 3 of 3 – Predict Credit Risk using Binary Classification

In this blog post, we will demonstrate how to perform cost-sensitive binary classification in Azure ML Studio to predict credit risk based on the information given on a credit application.

The classification problem in this experiment is a cost-sensitive one because the cost of misclassifying the positive samples is five times the cost of misclassifying the negative samples.

In this experiment, we compare two different approaches to generate models to solve this problem:

  • Training using the original dataset
  • Training using a replicated dataset

In both approaches, we evaluate the models using the test data set with replication, to ensure that results are aligned with the cost function.

We test two classifiers in both approaches: Two-Class Support Vector Machine and Two-Class Boosted Decision Tree.

Data

We use the German Credit Card data set from the UC Irvine repository.

This dataset contains 1000 samples with 20 features and 1 label. Each sample represents a person. The 20 features include both numerical and categorical features. The last column is the label, which denotes the credit risk and has only two possible values: high credit risk = 2, and low credit risk = 1.

The cost of misclassifying a low-risk example as high is 1, whereas the cost of misclassifying a high-risk example as low is 5.

Data Processing

We started by using the Metadata Editor module to replace the default column names with more meaningful names, obtained from the data set description on the UCI site. The new column names are provided as comma-separated values in the New column names field of Metadata Editor.

Next, we generated training and test sets used for developing the risk prediction model. We split the original data set into training and test sets of the same size using the Split module. To create sets of equal size, we set the option, Fraction of rows in the first output, to 0.5.
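Outside Studio, the same 50/50 split can be sketched in plain Python (the helper below is an illustration, not the Split module's actual implementation):

```python
import random

# Sketch of a 50/50 split, analogous to the Split module with
# "Fraction of rows in the first output" set to 0.5.
def split_rows(rows, fraction=0.5, seed=0):
    rows = rows[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(rows)   # randomized split, like the module
    cut = int(len(rows) * fraction)
    return rows[:cut], rows[cut:]

samples = list(range(1000))             # stand-in for the 1000 credit samples
train, test = split_rows(samples)
print(len(train), len(test))            # 500 500 — equal-sized halves
```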

Generating the New Data Set

Because the cost of underestimating risk is high in the real world, we set the cost of misclassification as follows:

  • For high-risk cases misclassified as low risk: 5
  • For low-risk cases misclassified as high risk: 1

To reflect this cost function, we generated a new data set, in which each high-risk example is replicated five times, whereas the number of low-risk examples is kept as is. We split the data into training and test datasets before replication to prevent the same example from being in both the training and test sets.
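The replication scheme can be sketched as follows (a minimal illustration of the idea, not the module pipeline itself):

```python
# Sketch of the replication scheme: each high-risk (label 2) example is
# copied 5 times so that plain accuracy-style metrics reflect the 5:1
# misclassification cost; low-risk (label 1) examples are kept as-is.
HIGH_RISK, LOW_RISK = 2, 1
COST_FACTOR = 5

def replicate_high_risk(examples):
    out = []
    for features, label in examples:
        copies = COST_FACTOR if label == HIGH_RISK else 1
        out.extend([(features, label)] * copies)
    return out

data = [("applicant A", LOW_RISK), ("applicant B", HIGH_RISK)]
result = replicate_high_risk(data)
print(len(result))   # 6: one low-risk example plus 5 copies of the high-risk one
```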

Feature Engineering

One of the machine learning algorithms requires that data be normalized. Therefore, we used the Normalize Data module to normalize the ranges of all numeric features, using a tanh transformation. A tanh transformation converts all numeric features to values within a range of 0-1, while preserving the overall distribution of values.
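The idea can be sketched in Python. Note that the exact formula the Normalize Data module uses is not shown in the post; the "tanh estimator" variant below is an assumption chosen because it does map values into the 0-1 range:

```python
import math
import statistics

# Sketch of a tanh-estimator normalization into the 0-1 range.
# NOTE: this formula is an assumption for illustration; the Normalize Data
# module's exact transformation may differ.
def tanh_normalize(values):
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [0.5 * (math.tanh(0.01 * (v - mu) / sigma) + 1) for v in values]

amounts = [250, 1200, 3500, 9000, 15000]     # e.g. credit amounts
normed = tanh_normalize(amounts)
print(all(0.0 <= v <= 1.0 for v in normed))  # True — everything lands in 0-1
```

Because tanh is monotonic, the ordering of values (and hence the overall shape of the distribution) is preserved.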

The Two-Class Support Vector Machine module handles string features for us, converting them to categorical features and then to binary features having a value of 0 or 1, so there is no need to normalize these features.

Model

In this experiment, we applied two classifiers: Two-Class Support Vector Machine (SVM) and Two-Class Boosted Decision Tree. Because we also used two datasets, we generated a total of four models:

  • SVM, trained with original data
  • SVM, trained with replicated data
  • Boosted Decision Tree, trained with original data
  • Boosted Decision Tree, trained with replicated data

We used the standard experimental workflow to create, train, and test the models:

  1. Initialize the learning algorithms, using Two-Class Support Vector Machine and Two-Class Boosted Decision Tree
  2. Use Train Model to apply the algorithm to the data and create the actual model.
  3. Use Score Model to produce scores using the test examples.
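The three workflow steps above can be sketched outside Studio with scikit-learn stand-ins (SVC and GradientBoostingClassifier are assumptions standing in for the two Studio modules, not the same implementations):

```python
# Sketch of the initialize -> train -> score workflow, using scikit-learn
# stand-ins for the two Studio modules (an assumption for illustration).
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

# Tiny made-up dataset: two numeric features, labels 1 (low) / 2 (high risk).
X_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y_train = [1, 1, 1, 2, 2, 2]
X_test = [[0.5, 0.5], [5.5, 5.5]]

models = {
    "svm": SVC(),                                  # step 1: initialize
    "boosted_tree": GradientBoostingClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)                    # step 2: Train Model
    print(name, model.predict(X_test))             # step 3: Score Model
```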

  1. Let's start building our experiment by opening studio.azureml.net. Click New -> Experiment and name it “Predict Credit Risk using Binary Classification“.


2. Download the data from this link: https://1drv.ms/u/s!AiVE8zs7kZPmjbovMx2PHeW5kzdfUQ and save it to a file on your machine. Then upload that file as a dataset: click New -> Dataset and choose From Local File.


3. Browse to the file and click the tick icon.


4. Now drag the dataset from the My Datasets category onto the canvas.


5. Now drag the Edit Metadata module from the left onto the canvas.


6. Now select the Edit Metadata module and, in the properties pane on the right, click Launch column selector.


After closing the column selector, enter the following list of column names in the New column names textbox in the module's properties pane.

Status of checking account,Duration in months,Credit history,Purpose,Credit amount,Savings account/bond,Present employment since,Installment rate in percentage of disposable income,Personal status and sex,Other debtors/guarantors,Present residence since,Property,Age in years,Other installment plans,Housing,Number of existing credits at this bank,Job,Number of people being liable to provide maintenance for,Telephone,Foreign worker,Credit risk
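As a quick sanity check, the comma-separated list above should yield exactly 21 names: 20 features plus the Credit risk label. A small Python sketch:

```python
# Sanity-check the column-name list: 20 features + 1 label = 21 names.
names = (
    "Status of checking account,Duration in months,Credit history,Purpose,"
    "Credit amount,Savings account/bond,Present employment since,"
    "Installment rate in percentage of disposable income,"
    "Personal status and sex,Other debtors/guarantors,"
    "Present residence since,Property,Age in years,Other installment plans,"
    "Housing,Number of existing credits at this bank,Job,"
    "Number of people being liable to provide maintenance for,"
    "Telephone,Foreign worker,Credit risk"
).split(",")

print(len(names))    # 21
print(names[-1])     # Credit risk — the label column
```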

7. Now drag the “Split Data” module onto the canvas.

Here we will split the dataset into training and test sets.


8. Next, we create a set of standard statistical measures that describe each column in the input table.

Such summary statistics are useful when you want to understand the characteristics of the complete dataset. Drag the “Summarize Data” module onto the canvas and click Run Selected.


Once it has run successfully, click the output node and select Visualize.


Now we can see a grid of rows and columns filled with data.
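The kind of per-column summary Summarize Data produces can be sketched with Python's statistics module (the column values below are made up for illustration):

```python
import statistics

# Sketch of per-column summary statistics, similar in spirit to the
# Summarize Data module's output (sample values are made up).
ages = [23, 35, 44, 29, 61, 52, 38]   # e.g. an "Age in years" column

summary = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": round(statistics.mean(ages), 2),
    "median": statistics.median(ages),
    "stdev": round(statistics.stdev(ages), 2),
}
print(summary)
```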


In the next blog posts, we will add more modules to this experiment.

Continue to the Next Part (Part 2) of this Blog Post
