This is the second part of a five-part blog series on MachineLearning.net; here is the first part.
Clustering is the task of dividing a population of data points into groups such that points in the same group are more similar to one another than to points in other groups. In short, it groups objects on the basis of the similarity and dissimilarity between them.
As a first step, let us understand the clustering problem.
This problem is about dividing a set of iris flowers into groups based on the features of each flower. Those features are the length and width of a sepal and the length and width of a petal. For this tutorial, assume that the type of each flower is unknown. You want to learn the structure of the dataset from the features and predict how a data instance fits this structure.
As we don’t know which group each flower belongs to, we need an unsupervised machine learning task. To divide a data set into groups in such a way that elements in the same group are more similar to each other than to those in other groups, use a clustering machine learning task.
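To make "more similar" concrete, here is a minimal sketch in plain C# that treats similarity as Euclidean distance over the four features and assigns a flower to the nearest group centre. The class name, method names and centroid values are all hypothetical and chosen only for illustration; this is not the code the library runs internally.

```csharp
using System;

public static class NearestCentroidSketch
{
    // Euclidean distance between two feature vectors of equal length.
    public static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.Sqrt(sum);
    }

    // Returns the index of the centroid closest to the given point.
    public static int NearestCentroid(double[] point, double[][] centroids)
    {
        int best = 0;
        for (int i = 1; i < centroids.Length; i++)
        {
            if (Distance(point, centroids[i]) < Distance(point, centroids[best]))
                best = i;
        }
        return best;
    }

    public static void Main()
    {
        // Hypothetical group centres (sepal length, sepal width, petal length, petal width).
        double[][] centroids =
        {
            new[] { 5.0, 3.4, 1.5, 0.2 },
            new[] { 5.9, 2.8, 4.3, 1.3 },
            new[] { 6.6, 3.0, 5.6, 2.0 },
        };

        double[] flower = { 5.1, 3.5, 1.4, 0.2 };

        // The flower is closest to the first centre, so this prints 0.
        Console.WriteLine(NearestCentroid(flower, centroids));
    }
}
```

A clustering trainer has to find such centres itself; the sketch only shows the assignment step once the centres are known.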
Now let us create a new console application in .NET Core using Visual Studio, just as we did in the previous blog post. In Solution Explorer, right-click the project and select Add > New Folder. Type “Data” and hit Enter. Then install the Microsoft.ML NuGet package.
Download the iris.data dataset and save it to the Data folder you created in the previous step. In Solution Explorer, right-click the iris.data file and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.
The iris.data file contains five columns that represent:
- sepal length in centimetres
- sepal width in centimetres
- petal length in centimetres
- petal width in centimetres
- type of iris flower
For the sake of the clustering example, we are ignoring the last column.
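As an illustration of that layout, the sketch below parses one row and keeps only the four numeric features. It assumes the file is comma-separated, as in the commonly distributed UCI version of iris.data; the class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

public static class IrisLineSketch
{
    // Reads the first four comma-separated values and ignores the flower type.
    public static float[] ParseFeatures(string line)
    {
        string[] parts = line.Split(',');
        var features = new float[4];
        for (int i = 0; i < 4; i++)
        {
            // InvariantCulture so that "5.1" parses the same on every machine.
            features[i] = float.Parse(parts[i], CultureInfo.InvariantCulture);
        }
        return features;
    }

    public static void Main()
    {
        float[] features = ParseFeatures("5.1,3.5,1.4,0.2,Iris-setosa");
        Console.WriteLine(string.Join(" | ", features));
    }
}
```

In the actual project the library's text loader does this work; the sketch only shows which columns are used and which one is dropped.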
First, add the required namespace.
Now create the data classes:
Here IrisData is the input data class and has definitions for each feature from the data set. Use the Column attribute to specify the indices of the source columns in the dataset file.
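The original listing is not reproduced here, so the following is only a sketch of what the data classes might look like. It assumes a current Microsoft.ML release (1.0 or later), where the column-index attribute is called LoadColumn and lives in the Microsoft.ML.Data namespace; the preview API this post appears to target used a Column attribute instead. The ClusterPrediction class and its member names are assumptions, not taken from the post.

```csharp
using Microsoft.ML.Data;

// Input data class: one field per feature column of iris.data.
// The fifth column (flower type) is deliberately not mapped.
public class IrisData
{
    [LoadColumn(0)] public float SepalLength;
    [LoadColumn(1)] public float SepalWidth;
    [LoadColumn(2)] public float PetalLength;
    [LoadColumn(3)] public float PetalWidth;
}

// Output data class (assumed shape): the cluster assigned to a flower
// and its distances to every cluster centre.
public class ClusterPrediction
{
    [ColumnName("PredictedLabel")] public uint PredictedClusterId;
    [ColumnName("Score")] public float[] Distances;
}
```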
Now, in Program.cs, add two fields to hold the path to the dataset file and the path to the file where the model is saved:
_dataPath contains the path to the file with the data set used to train the model.
_modelPath contains the path to the file where the trained model is stored.
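A minimal sketch of the two fields, assuming both files live in the Data folder under the current working directory. The file names iris.data and IrisClusteringModel.zip come from this post; building the paths with Path.Combine is just one reasonable choice.

```csharp
using System;
using System.IO;

class Program
{
    // Path to the dataset used to train the model.
    static readonly string _dataPath =
        Path.Combine(Environment.CurrentDirectory, "Data", "iris.data");

    // Path where the trained model is saved.
    static readonly string _modelPath =
        Path.Combine(Environment.CurrentDirectory, "Data", "IrisClusteringModel.zip");

    static void Main()
    {
        Console.WriteLine(_dataPath);
        Console.WriteLine(_modelPath);
    }
}
```

Path.Combine inserts the correct directory separator for the operating system, so the same code works on Windows, Linux and macOS.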
Now our Program.cs (the main file) will look like this:
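The original listing is not available here, so what follows is a hedged sketch rather than the author's exact code. It assumes a current Microsoft.ML release (1.0 or later) with the MLContext API, a comma-separated iris.data without a header row, and three clusters (a guess based on the three iris species; the post does not state the number). The data classes are repeated so that the listing compiles on its own, and the ClusterPrediction shape is an assumption.

```csharp
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

public class IrisData
{
    [LoadColumn(0)] public float SepalLength;
    [LoadColumn(1)] public float SepalWidth;
    [LoadColumn(2)] public float PetalLength;
    [LoadColumn(3)] public float PetalWidth;
}

public class ClusterPrediction
{
    [ColumnName("PredictedLabel")] public uint PredictedClusterId;
    [ColumnName("Score")] public float[] Distances;
}

class Program
{
    static readonly string _dataPath =
        Path.Combine(Environment.CurrentDirectory, "Data", "iris.data");
    static readonly string _modelPath =
        Path.Combine(Environment.CurrentDirectory, "Data", "IrisClusteringModel.zip");

    static void Main()
    {
        // A fixed seed keeps runs repeatable.
        var mlContext = new MLContext(seed: 0);

        // Load the four feature columns; the flower type column is not mapped.
        IDataView dataView = mlContext.Data.LoadFromTextFile<IrisData>(
            _dataPath, hasHeader: false, separatorChar: ',');

        // Combine the features into one vector and train k-means on it.
        // numberOfClusters: 3 is an assumption, not a value from the post.
        var pipeline = mlContext.Transforms
            .Concatenate("Features",
                nameof(IrisData.SepalLength), nameof(IrisData.SepalWidth),
                nameof(IrisData.PetalLength), nameof(IrisData.PetalWidth))
            .Append(mlContext.Clustering.Trainers.KMeans("Features", numberOfClusters: 3));

        var model = pipeline.Fit(dataView);

        // Save the trained model to the Data folder.
        using (var stream = new FileStream(_modelPath, FileMode.Create, FileAccess.Write, FileShare.Write))
        {
            mlContext.Model.Save(model, dataView.Schema, stream);
        }

        // Predict the cluster for a single flower.
        var predictor = mlContext.Model.CreatePredictionEngine<IrisData, ClusterPrediction>(model);
        var sample = new IrisData
        {
            SepalLength = 5.1f, SepalWidth = 3.5f, PetalLength = 1.4f, PetalWidth = 0.2f
        };
        ClusterPrediction prediction = predictor.Predict(sample);

        Console.WriteLine($"Cluster: {prediction.PredictedClusterId}");
        Console.WriteLine($"Distances: {string.Join(" ", prediction.Distances)}");
    }
}
```

The cluster IDs are arbitrary labels chosen by the trainer, so they will not necessarily line up with the species order in the file.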
The screenshot shows the solution structure and the program output. When we ran this code, it generated IrisClusteringModel.zip in the Data folder.
Here is the GitHub repository.
Here is the link to the next blog post in this series.