GitHub - srijan-Git1247/ML.Net-Application-FileTypeClassifier: A file type classifier that predicts whether a file is a document, an executable or a script from a set of attributes statically extracted from the file, using a K-means clustering trainer based on the Yinyang method.

The K-means trainer used in this project is based on the Yinyang method. All inputs must be of Float type and must be normalized into a single feature vector. In clustering, each cluster represents a group of similar data points. K-means represents each cluster by its center point (also called a centroid) and calculates the distance from the data point to each centroid. The cluster with the smallest distance is the predicted cluster.
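The prediction step described above can be sketched in a few lines of Python. This is only an illustration of nearest-centroid assignment, not the repository's C# code or the ML.NET trainer itself. The function name `nearest_cluster`, the centroid values and the feature order are assumptions made for the example.

```python
import math

def nearest_cluster(point, centroids):
    """Return (index, distances) for the centroid closest to `point`."""
    # Euclidean distance from the data point to every centroid.
    distances = [math.dist(point, c) for c in centroids]
    # The predicted cluster is the one with the smallest distance.
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances

# Hypothetical centroids over three float features
# (IsBinary, IsMZHeader, IsPKHeader); real centroids come from training.
centroids = [
    (1.0, 1.0, 0.0),  # executable-like
    (0.0, 0.0, 0.0),  # script-like
    (1.0, 0.0, 1.0),  # document-like
]

cluster, distances = nearest_cluster((1.0, 0.0, 1.0), centroids)
print(cluster, distances)
```

The Yinyang method speeds up training by avoiding unnecessary distance calculations. The prediction rule stays the same: pick the closest centroid.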

The sampledata.csv file contains 80 rows of random files, comprising 30 Windows executables, 20 PowerShell scripts and 20 Word documents. Feel free to adjust the data to fit your own observations or to adjust the trained model.

Here is a snippet of the data:

Each row contains the values for the properties of the FileData class, in this order: Label, IsBinary, IsMZHeader, IsPKHeader.
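As a rough illustration of that column order, the sketch below reads rows into a small record type. It assumes comma-separated numeric values with no header row. The Python `FileData` dataclass only mirrors the property names listed above, and the sample values are made up rather than taken from sampledata.csv.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class FileData:
    # Field order follows the README: Label, IsBinary, IsMZHeader, IsPKHeader.
    label: float
    is_binary: float
    is_mz_header: float
    is_pk_header: float

def parse_rows(text):
    """Parse CSV text into FileData records, converting every column to float."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        rows.append(FileData(*(float(value) for value in record[:4])))
    return rows

# Made-up sample rows, for illustration only.
sample = "1,1,1,0\n2,0,0,0\n"
rows = parse_rows(sample)
print(rows)
```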

MZ and PK are the magic numbers of Windows executables and modern Microsoft Office files, respectively. Modern Office files are ZIP archives, which is why they start with the PK signature. Magic numbers are identifying byte sequences found at the beginning of many file formats.
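A minimal sketch of how such header checks could be turned into float features is shown below. The MZ and PK comparisons follow the description above. The binary test (a NUL byte in the first 1024 bytes) is an assumption for illustration and is not necessarily how the repository defines IsBinary. The function name `extract_features` is also hypothetical.

```python
def extract_features(data: bytes):
    """Return (is_binary, is_mz_header, is_pk_header) as floats."""
    head = data[:2]
    is_mz = head == b"MZ"  # Windows executable signature
    is_pk = head == b"PK"  # ZIP signature, used by modern Office files
    # Assumption: treat a NUL byte near the start of the file as "binary".
    is_binary = b"\x00" in data[:1024]
    return float(is_binary), float(is_mz), float(is_pk)

# Tiny hand-made byte strings standing in for real files.
exe_like = b"MZ\x90\x00\x03\x00"
docx_like = b"PK\x03\x04\x14\x00"
script_like = b"Write-Host 'hello'\r\n"

for name, blob in [("exe", exe_like), ("docx", docx_like), ("ps1", script_like)]:
    print(name, extract_features(blob))
```

Note that this simple NUL-byte heuristic would also flag UTF-16 encoded text as binary, so a real extractor may need a more careful check.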

In addition, the testdata.csv file contains further data points for testing and evaluating the newly trained model. The breakdown is even: 10 Windows executables, 10 PowerShell scripts and 10 Word documents.

Here is a snippet of the data:

Run the console application with command-line arguments:

Assuming folders named "TrainingData" and "TestData" exist, execute the following command. (This step is optional if you use the two pre-extracted feature files, sampledata.csv and testdata.csv, in the Data folder of this repository.)

    "D:\Machine Learning Projects\FileClassifier\bin\Debug\net8.0\FileTypeClassifier.exe" extract "D:\Machine Learning Projects\FileClassifier\TrainingData" "D:\Machine Learning Projects\FileClassifier\TestData"
    Extracted 80 to sampledata.csv
    Extracted 30 to testdata.csv

After extracting the data, train the model by passing in the newly created sampledata.csv and testdata.csv files.

To run the model against a file, pass the file name to the built application and the predicted output will be shown:

