4.Classification.md

1. The table below shows a sample dataset of whether a customer responds to a survey or not; "Outcome" is the class label. Construct a decision tree classifier for the dataset. For a new example (Rural, Semidetached, Low, No), what will be the predicted class?

Table 4.1

2. Briefly explain Bagging and Boosting of classifiers.

a) Bagging:

Bagging (short for Bootstrap Aggregation) is a way to decrease the variance of a prediction by generating additional training sets from the original dataset, sampling with replacement to produce multisets of the same cardinality/size as the original data. Increasing the amount of training data this way does not improve the model's predictive power; it only decreases the variance, tuning the prediction more narrowly toward the expected outcome.

Boosting is a two-step approach: first, subsets of the original data are used to produce a series of moderately performing models, and then their performance is "boosted" by combining them using a particular cost function (e.g., majority vote). Unlike bagging, in classical boosting the subset creation is not random but depends on the performance of the previous models: every new subset contains the elements that were (likely to be) misclassified by the previous models.

Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set, Di, of d tuples is sampled with replacement from the original set of tuples, D. Note that the term bagging stands for bootstrap aggregation. Each training set is a bootstrap sample, as described in Section 8.5.4. Because sampling with replacement is used, some of the original tuples of D may not be included in Di, whereas others may occur more than once. A classifier model, Mi, is learned for each training set, Di. To classify an unknown tuple, X, each classifier, Mi, returns its class prediction, which counts as one vote. The bagged classifier, M*, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple. The algorithm is summarized below. The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more robust to the effects of noisy data and overfitting. The increased accuracy occurs because the composite model reduces the variance of the individual classifiers.

Algorithm: Bagging. The bagging algorithm creates an ensemble of classification models for a learning scheme where each model gives an equally weighted prediction.

Input: D, a set of d training tuples; k, the number of models in the ensemble; a classification learning scheme (decision tree algorithm, naïve Bayesian, etc.).

Output: The ensemble—a composite model, M*.

Method:

    (1) for i = 1 to k do                // create k models
    (2)     create bootstrap sample, Di, by sampling D with replacement;
    (3)     use Di and the learning scheme to derive a model, Mi;
    (4) endfor
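As a rough illustration of this procedure, here is a minimal Python sketch of bagging, assuming scikit-learn's DecisionTreeClassifier as the base learner and NumPy arrays for the data (the helper names bagging_fit and bagging_predict are illustrative, not from the text):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=10, seed=None):
    """Learn k models, each on a bootstrap sample Di of the d training tuples."""
    rng = np.random.default_rng(seed)
    d = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, d, size=d)          # sample d tuples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Each model casts one equally weighted vote; return the majority class."""
    votes = np.array([m.predict(X) for m in models])   # shape (k, n_tuples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

For continuous-valued prediction, the majority vote in bagging_predict would simply be replaced by the average of the k individual predictions.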

b) Boosting:

In boosting, weights are also assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier, Mi, is learned, the weights are updated to allow the subsequent classifier, Mi+1, to "pay more attention" to the training tuples that were misclassified by Mi. The final boosted classifier, M*, combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy.

AdaBoost (short for Adaptive Boosting) is a popular boosting algorithm. Suppose we want to boost the accuracy of a learning method. We are given D, a data set of d class-labeled tuples, (X1, y1), (X2, y2), ..., (Xd, yd), where yi is the class label of tuple Xi. Initially, AdaBoost assigns each training tuple an equal weight of 1/d. Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm. In round i, the tuples from D are sampled to form a training set, Di, of size d. Sampling with replacement is used—the same tuple may be selected more than once. Each tuple's chance of being selected is based on its weight. A classifier model, Mi, is derived from the training tuples of Di. Its error is then calculated using Di as a test set. The weights of the training tuples are then adjusted according to how they were classified. If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly classified, its weight is decreased. A tuple's weight reflects how difficult it is to classify—the higher the weight, the more often it has been misclassified. These weights will be used to generate the training samples for the classifier of the next round. The basic idea is that when we build a classifier, we want it to focus more on the misclassified tuples of the previous round. Some classifiers may be better at classifying some "difficult" tuples than others. In this way, we build a series of classifiers that complement each other.

The error rate of Mi is the weighted sum of the misclassification errors over the tuples in Di:

    error(Mi) = Σ wj × err(Xj)    [summed over j = 1 to d]

where err(Xj) is the misclassification error of tuple Xj: if the tuple was misclassified, then err(Xj) is 1; otherwise, it is 0. If the performance of classifier Mi is so poor that its error exceeds 0.5, then we abandon it. Instead, we try again by generating a new training set Di, from which we derive a new Mi. The weight of classifier Mi's vote is

    log((1 − error(Mi)) / error(Mi))

Algorithm: AdaBoost. A boosting algorithm—create an ensemble of classifiers, where each one gives a weighted vote.

Input: D, a set of d class-labeled training tuples; k, the number of rounds (one classifier is generated per round); a classification learning scheme.

Output: A composite model.

Method:

    (1)  initialize the weight of each tuple in D to 1/d;
    (2)  for i = 1 to k do                                 // for each round
    (3)      sample D with replacement according to the tuple weights to obtain Di;
    (4)      use training set Di to derive a model, Mi;
    (5)      compute error(Mi), the error rate of Mi (the error equation above);
    (6)      if error(Mi) > 0.5 then
    (7)          go back to step 3 and try again;
    (8)      endif
    (9)      for each tuple in Di that was correctly classified do
    (10)         multiply the weight of the tuple by error(Mi)/(1 − error(Mi));   // update weights
    (11)     normalize the weight of each tuple;
    (12) endfor
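The loop above can be sketched in Python roughly as follows. This is only an illustration, assuming scikit-learn decision stumps as the weak learner; for simplicity the weighted error and weight updates are computed over the full set D rather than only over the sampled Di, and rounds whose error is 0 or above 0.5 are skipped (the function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=10, seed=None):
    """Learn up to k weak models; return (model, vote_weight) pairs."""
    rng = np.random.default_rng(seed)
    d = len(X)
    w = np.full(d, 1.0 / d)                          # (1) equal initial weights 1/d
    ensemble, attempts = [], 0
    while len(ensemble) < k and attempts < 10 * k:   # bounded number of retries
        attempts += 1
        idx = rng.choice(d, size=d, replace=True, p=w)               # (3) weighted sampling
        m = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])  # (4) weak model
        miss = m.predict(X) != y
        error = w[miss].sum()                        # (5) weighted error rate
        if error == 0 or error > 0.5:                # (6)-(7) discard and retry
            continue
        w[~miss] *= error / (1 - error)              # (10) shrink correctly classified tuples
        w /= w.sum()                                 # (11) normalize weights
        ensemble.append((m, np.log((1 - error) / error)))  # classifier's vote weight
    return ensemble

def adaboost_predict(ensemble, X):
    """Weighted vote: each model adds its vote weight to its predicted class."""
    classes = np.unique(np.concatenate([m.classes_ for m, _ in ensemble]))
    scores = np.zeros((len(X), len(classes)))
    for m, alpha in ensemble:
        pred = m.predict(X)
        for j, c in enumerate(classes):
            scores[pred == c, j] += alpha
    return classes[scores.argmax(axis=1)]
```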

3. Define classification and the issues of classification, and explain ID3 classification with an example.

4. Explain different methods that can be used to evaluate and compare the accuracy of different classification algorithms.

Using training data to derive a classifier and then estimating the accuracy of the resulting learned model can result in misleading, overoptimistic estimates due to overspecialization of the learning algorithm to the data. Instead, it is better to measure the classifier's accuracy on a test set consisting of class-labeled tuples that were not used to train the model. There are four terms that are the "building blocks" used in computing many evaluation measures. Understanding them will make it easy to grasp the meaning of the various measures.

True positives (TP): These refer to the positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.

True negatives (TN): These are the negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.

False positives (FP): These are the negative tuples that were incorrectly labeled as positive (e.g., tuples of class buys computer = no for which the classifier predicted buys computer=yes). Let FP be the number of false positives.

False negatives (FN): These are the positive tuples that were mislabeled as negative (e.g., tuples of class buys computer = yes for which the classifier predicted buys computer=no). Let FN be the number of false negatives.

The confusion matrix:

| Actual class \ Predicted class | yes | no | Total |
|---|---|---|---|
| yes | TP | FN | P |
| no | FP | TN | N |
| Total | P' | N' | P + N |

A confusion matrix is a table of at least size m by m. An entry CMi,j in the first m rows and m columns indicates the number of tuples of class i that were labeled by the classifier as class j. For a classifier to have good accuracy, ideally most of the tuples would be represented along the diagonal of the confusion matrix, from entry CM1,1 to entry CMm,m, with the rest of the entries being zero or close to zero. That is, ideally, FP and FN are around zero. The table may have additional rows or columns to provide totals. For example, in the confusion matrix above, P and N are shown. In addition, P' is the number of tuples that were labeled as positive (TP + FP) and N' is the number of tuples that were labeled as negative (TN + FN). The total number of tuples is TP + TN + FP + FN, or P + N, or P' + N'. Note that although the confusion matrix shown is for a binary classification problem, confusion matrices can easily be drawn for multiple classes in a similar manner.
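As a small illustration, a confusion matrix can be built directly from the actual and predicted labels; the sketch below is plain Python/NumPy with made-up labels (the helper name confusion_matrix is illustrative):

```python
import numpy as np

def confusion_matrix(actual, predicted, classes):
    """cm[i, j] = number of tuples of class classes[i] labeled as classes[j]."""
    index = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for a, p in zip(actual, predicted):
        cm[index[a], index[p]] += 1
    return cm

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
print(confusion_matrix(actual, predicted, ["yes", "no"]))
# [[2 1]   -> TP = 2, FN = 1
#  [1 2]]  -> FP = 1, TN = 2
```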

The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. That is,

    accuracy = (TP + TN) / (P + N)

The error rate or misclassification rate of a classifier, M, is simply 1 − accuracy(M), where accuracy(M) is the accuracy of M. It can also be computed as

    error rate = (FP + FN) / (P + N)

If we were to use the training set (instead of a test set) to estimate the error rate of a model, this quantity is known as the resubstitution error. This error estimate is optimistic of the true error rate (and similarly, the corresponding accuracy estimate is optimistic) because the model is not tested on any samples that it has not already seen.

We now consider the class imbalance problem, where the main class of interest is rare. That is, the data set distribution reflects a significant majority of the negative class and a minority positive class. For example, in fraud detection applications, the class of interest (or positive class) is "fraud," which occurs much less frequently than the negative "nonfraudulent" class. In medical data, there may be a rare class, such as "cancer." Suppose that you have trained a classifier to classify medical data tuples, where the class label attribute is "cancer" and the possible class values are "yes" and "no." An accuracy rate of, say, 97% may make the classifier seem quite accurate, but what if only, say, 3% of the training tuples are actually cancer? Clearly, an accuracy rate of 97% may not be acceptable—the classifier could be correctly labeling only the noncancer tuples, for instance, and misclassifying all the cancer tuples. Instead, we need other measures, which assess how well the classifier can recognize the positive tuples (cancer = yes) and how well it can recognize the negative tuples (cancer = no).

The sensitivity and specificity measures can be used, respectively, for this purpose. Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive tuples that are correctly identified), while specificity is the true negative rate (i.e., the proportion of negative tuples that are correctly identified). These measures are defined as

    sensitivity = TP / P
    specificity = TN / N

The precision and recall measures are also widely used in classification. Precision can be thought of as a measure of exactness (i.e., what percentage of tuples labeled as positive are actually such), whereas recall is a measure of completeness (what percentage of positive tuples are labeled as such). If recall seems familiar, that's because it is the same as sensitivity (or the true positive rate). These measures can be computed as

    precision = TP / (TP + FP)
    recall = TP / (TP + FN) = TP / P
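As a quick check on these definitions, the sketch below computes all of the measures from the four counts; the function name and the example counts are made up for illustration:

```python
def evaluation_measures(tp, tn, fp, fn):
    """Standard evaluation measures from the four confusion-matrix counts."""
    p, n = tp + fn, fp + tn               # actual positives P and negatives N
    return {
        "accuracy":    (tp + tn) / (p + n),
        "error_rate":  (fp + fn) / (p + n),
        "sensitivity": tp / p,            # recall / true positive rate
        "specificity": tn / n,            # true negative rate
        "precision":   tp / (tp + fp),
    }

# Hypothetical imbalanced test set: 3 positive ("cancer = yes") vs. 97 negative tuples
print(evaluation_measures(tp=1, tn=96, fp=1, fn=2))
# accuracy is 0.97, yet sensitivity is only 1/3 — the imbalance problem described above
```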

5. Use any classification technique and find out the class of (Homeowner = Yes, Status = Employed, Income = Average).

| ID | Homeowner | Status | Income | Defaulted |
|---|---|---|---|---|
| 1 | Yes | Employed | High | No |
| 2 | No | Business | Average | No |
| 3 | No | Employed | Low | No |
| 4 | Yes | Business | High | No |
| 5 | No | Unemployed | Average | Yes |
| 6 | No | Business | Low | No |
| 7 | Yes | Unemployed | High | No |
| 8 | No | Employed | Average | Yes |
| 9 | No | Business | Low | No |
| 10 | No | Employed | Average | Yes |

Using a Naive Bayes classifier:


x = (Homeowner = Yes, Status = Employed, Income = Average)

The class attribute Defaulted has two values, Yes and No; i.e., C1 = Yes and C2 = No.

For the given tuple x we have:

For class C1:

P(Defaulted = Yes | x) ∝ P(Defaulted = Yes) × P(Homeowner = Yes | Defaulted = Yes) × P(Status = Employed | Defaulted = Yes) × P(Income = Average | Defaulted = Yes) ... (1)

For class C2:

P(Defaulted = No | x) ∝ P(Defaulted = No) × P(Homeowner = Yes | Defaulted = No) × P(Status = Employed | Defaulted = No) × P(Income = Average | Defaulted = No) ... (2)

From the table we get:

- P(Defaulted = Yes) = 3/10 = 0.3
- P(Defaulted = No) = 7/10 = 0.7
- P(Homeowner = Yes | Defaulted = Yes) = 0/3 = 0
- P(Homeowner = Yes | Defaulted = No) = 3/7 ≈ 0.4286
- P(Status = Employed | Defaulted = Yes) = 2/3 ≈ 0.6667
- P(Status = Employed | Defaulted = No) = 2/7 ≈ 0.2857
- P(Income = Average | Defaulted = Yes) = 3/3 = 1
- P(Income = Average | Defaulted = No) = 1/7 ≈ 0.1429

Substituting these values into equations (1) and (2):

P(Defaulted = Yes | x) ∝ 0.3 × (0 × 0.6667 × 1) = 0 ... (1)

P(Defaulted = No | x) ∝ 0.7 × (0.4286 × 0.2857 × 0.1429) ≈ 0.0122 ... (2)

Since P(Defaulted = No | x) > P(Defaulted = Yes | x), the predicted class value using the Naive Bayes classifier is Defaulted = No. (Note that the zero estimate of P(Homeowner = Yes | Defaulted = Yes) forces the whole product for the Yes class to zero; in practice, Laplace (add-one) smoothing is often applied to avoid this.)
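The same hand calculation can be reproduced programmatically; below is a rough pandas sketch (unsmoothed, matching the calculation above; column and value names follow the table):

```python
import pandas as pd

data = pd.DataFrame({
    "Homeowner": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Status":    ["Employed", "Business", "Employed", "Business", "Unemployed",
                  "Business", "Unemployed", "Employed", "Business", "Employed"],
    "Income":    ["High", "Average", "Low", "High", "Average", "Low", "High",
                  "Average", "Low", "Average"],
    "Defaulted": ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

x = {"Homeowner": "Yes", "Status": "Employed", "Income": "Average"}

scores = {}
for c in data["Defaulted"].unique():
    subset = data[data["Defaulted"] == c]
    score = len(subset) / len(data)                 # prior P(Defaulted = c)
    for attr, value in x.items():
        score *= (subset[attr] == value).mean()     # P(attr = value | Defaulted = c)
    scores[c] = score

print(scores)                        # {'No': 0.01224..., 'Yes': 0.0}
print(max(scores, key=scores.get))   # predicted class: No
```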