Practical Assignment for Learning and Knowledge Extraction

leoproject/TP2EAC

TP2EAC

Portuguese version

TP2EAC is a project carried out for a course in the 4th year of the Master in Computer Engineering, within the Intelligent Systems profile. The goal is to obtain a classification model that produces good results. The problem involves preparing and analyzing a dataset describing the characteristics of employees of multiple companies, with the purpose of predicting each individual's annual salary level. The dataset is a set of case studies with information about employees from companies all over the world, each described by multiple characteristics.

About Team

The development team consisted of four master's students from the University of Minho, Braga, Portugal.

Dataset

The dataset provided for the proposed problem is in the [resources](resources/) directory, which contains:

  1. training.csv, which contains the cases used exclusively for training the predictive model;

  2. test.csv, which contains the data used only for the analysis and validation of the predictive model;

  3. attribute_info.docx, which describes the attributes of the dataset. There are 14 variables and one target attribute for annual salary classification: whether or not the employee receives 50,000 per year.
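As a minimal sketch, the two CSV files could be loaded and split into features and target using only the Python standard library. The helper `load_dataset` and the column name `above_50k` are hypothetical, and the sample rows are invented for illustration; the real attribute names are the ones documented in attribute_info.docx.

```python
import csv
import io


def load_dataset(handle, target_column):
    """Read a CSV file object and return (feature rows, target values)."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    features, targets = [], []
    for row in reader:
        # Pop the target so that it never leaks into the feature set.
        targets.append(row.pop(target_column))
        features.append(row)
    return features, targets


# Tiny in-memory stand-in for training.csv (invented values, hypothetical columns).
sample = io.StringIO("age,education,above_50k\n39,Bachelors,no\n52,Masters,yes\n")
X_train, y_train = load_dataset(sample, target_column="above_50k")
```

With the real files, the same function would be called on an open file handle, for example `open("resources/training.csv", newline="")`, passing whatever the target attribute is actually called.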

Development Methodology:

The methodology adopted by the team is CRISP-DM (Cross Industry Standard Process for Data Mining), a cycle consisting of 6 phases.

  1. Business Study: This initial phase was a study of the dataset to understand the objectives and the attributes it contains;

  2. Data Analysis: This phase consisted of analyzing the data. It can be found in the data analysis notebook;

  3. Data Preparation: In this phase, the data from both the training and test datasets was processed in a notebook, using techniques such as missing-value analysis, correlation analysis, and feature engineering. The notebook outputs new training and test datasets: one pair with the processed data only, and another pair with the processed data plus normalization. This results in 4 datasets (2 for training and 2 for testing), which are used at a later stage;

  4. Modeling of Algorithms: In this phase, we standardized the development flow of the model. You can see this in the standard notebook, which uses the logistic regression algorithm and outputs the classification model for the algorithm in use. Once the models had been generated for all the chosen algorithms, we moved on to the next phase;

  5. Evaluation of Models: This phase has an evaluation notebook that covers all the models, so that we can view and compare their results on the test datasets, normalized or not. On that basis, we decided which classification model is best for this problem;

  6. Deployment: This phase consisted of putting the chosen model into production. We used Heroku and an API that we created with Flask to access the model; the API lives in a separate repository. If you want to test the API with the model, there is a notebook for that in the testApi directory, and there is also an APK that can be installed on the Android operating system.
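The four datasets produced in the Data Preparation phase can be illustrated with a short sketch. The README does not say which normalization was applied, so min-max scaling is an assumption here, and the helpers `fit_min_max` and `apply_min_max` as well as the numeric rows are hypothetical. The scaling bounds are learned from the training data only and then reused for the test data.

```python
def fit_min_max(rows):
    """Compute per-column (min, max) from the training rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, bounds):
    """Rescale rows with the given bounds.

    Training values land in [0, 1]; test values may fall outside that
    range if they exceed what was seen during training.
    """
    scaled = []
    for row in rows:
        scaled.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, (low, high) in zip(row, bounds)
        ])
    return scaled


# Invented numeric rows standing in for the processed (non-normalized) datasets.
train_processed = [[20.0, 10.0], [40.0, 30.0], [60.0, 20.0]]
test_processed = [[30.0, 40.0]]

# The normalized pair completes the set of four datasets.
bounds = fit_min_max(train_processed)
train_normalized = apply_min_max(train_processed, bounds)
test_normalized = apply_min_max(test_processed, bounds)
```

Keeping both the processed and the normalized pair makes it possible to evaluate each algorithm on either version, which matters for distance-based methods that are sensitive to feature scale.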

Conclusion

By completing this work, the group developed the ability to create and train classification models using a wide range of algorithms. We also improved at data analysis and preparation, two important phases in the development of classification and prediction methods.

We conclude that the best model trained by the group, judged by accuracy alone, is the one that implements the K-Nearest Neighbors algorithm. However, according to the previously established criteria, the model the group would choose for the Deployment phase is the one that implements the Logistic Regression algorithm.
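Comparing models by accuracy alone can be sketched as follows. The labels, predictions, and the names `model_a` and `model_b` are invented purely for illustration and do not reproduce the project's results; any additional selection criteria would have to be applied on top of this ranking.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Invented test labels and predictions for two hypothetical models.
y_test = [1, 0, 1, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0],
    "model_b": [1, 1, 0, 1, 0],
}

scores = {name: accuracy(y_test, preds) for name, preds in predictions.items()}
best_by_accuracy = max(scores, key=scores.get)
```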
