forked from gavinwhyte/lessons
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlesson8.py
143 lines (101 loc) · 6.1 KB
/
lesson8.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
# Predictive Model Building Balancing Performance, Complexity and Big Data
# The lesson provides the start of the technical definitions of performance for
# different types of machine learning problems.
# In an e-commerce application good performance means returning correct search results
# or presenting ads that site visitors frequently click.
# In a genetic problem it means isolating a few genes responsible for a
# heritable condition.
# The lesson is the start of what describes relevant performance
# measures for these different problems
# The goal of selecting and fitting a predictive algorithm is to achieve the best
# possible performance
# Achieving performance goals involves three factors:
# complexity of the problem, complexity of the algorithmic model employed,
# and the amount and richness of the data available.
# The basic problem understanding function approximation
# The algorithms covered in these lessons address a specific class of predictive problems
# The problem statement for these problems has two types of variables
# The variable you are attempting to predict(for example whether a visitor to a
# website will click an ad)
# Other Variables (for example, the visitors demogrpahics or past behavior on the site)
# that you can use to make the prediction
# Problem of this type are referred to as function approximation problems because the
# goal is to construct a model generating predictions of the first of these as a
# function of the second
# In a function approximation problem, the designer starts with a collection of
# historical example for which the correct answer is known.
# For example, historical web logs files will indicate whether a visitor clicked an ad
# when shown the ad.
# The data scientist might try using other pages that the vistor viewed before seeing the
# ad.
# If the user is registered with the site, data on past purchases or pages viewed
# might be available for making a prediction
# The variable being predicted refers to a number of names, such as target,label
# and outcome
# The variables used to make the predictions are variously called predictors,
# regressors, features and attributes.
# These terms are used interchangeably in this text.
# Determing what attributes to use for making predictions is called feature
# Engineering
# Feature Engineering usually requires a manual iterative process for selecting
# features, determining optimal potential, and experimenting with Different combinations
# of features.
# The algorithms covered in this book assign numeric importance to each attribute,
# These values indicate the relative importance of attributes in making predictions
# That information speeds up the feature engineering process.
# Working with Training data
# The datascientist starts algorithm development with a training set.
# The training set consists of outcome examples and the assemblage of features chosen
# by the data scientist
# The training set comprises two types of data:
# The outcomes you want to predict
# Features available for making the prediction
# Example of a training set
# Feature 1 - Gender Feature 2 Money spent on site Feature 3 Age Outcomes Clicked
# M 0 25 Yes
# F 250 32 No
# F 12 17 Yes
# The predictor values (AKA Features, Attributes etc) can be arranged in the form
# of a matrix.
# A proper matrix contains variables that are all the same type. Predictors however
# may not all be the same type of variable.
# Using the example above about predicting advertising clicks , the predictors may at
# times include demographic data about the site visitor
# Data could include martial status, yearly income, among other things.
# YEarly income is a real number, marital status is a categorical variable.
# That means that marital status does not admit arithmetic operations such as
# addition or multiplication and that no order relation exists between single,
# married and divorced.
# The entries in a column from X all have the same type, but the type may vary from
# one column to the next
# Attributes such as marital status, gender or the state of residence go by several
# different designations.
# They are called factor or categorical. Attributes like age or income that
# are represented by numbers are called numeric or real-valued. The distinction
# between these 2 types of attributes are important because some algorithms may
# not handle one type or the other.
# For example, linear methods require numeric attributes.
# When we cover penalized linear regression, i will show methods for converting
# categorical variables to numeric in order to apply linear methods to problems with
# categorical variables.
# The targets corresponding to each row in X are arranged in a column vector Y
# Targets may be of several different forms.
# For example they may be real numbers, like if the objective were to predict how much
# a customer will spend
# When the targets are real numbers, the problem is called a regression problem.
# Linear regression implies using a linear method to solve a regression problem
# This lesson covers linear and non-linear regression methods.
# If the targets are two valued the problem is called a binary classification problem.
# Predicting whether a customer will clcik an advertisement is a binary
# classification problem.
# If the target contains several discrete values, the problem is called a
# multiclass classification problem.
# Predicting which of several ads a customer will click will be a multiclass
# classification problem. The basic problem is to find a prediction function pred()
# that uses attributes to predict outcomes
# function pred() uses the attribute Xi to predict yi. This course describes some
# of the very best current methods for producing the function pred()
# Assessing Performance of Predictive Models
# Good performance means using the attributes xi to generate a prediction thats
# close to yi but close has different meanings for different problems.
# for a regression problem where yi is a real number