diff --git a/mlcourse_ai_jupyter_book/book/topic07/topic7_pca_clustering.md b/mlcourse_ai_jupyter_book/book/topic07/topic7_pca_clustering.md
index ce2e09d2c..199064652 100644
--- a/mlcourse_ai_jupyter_book/book/topic07/topic7_pca_clustering.md
+++ b/mlcourse_ai_jupyter_book/book/topic07/topic7_pca_clustering.md
@@ -50,7 +50,7 @@ Principal Component Analysis is one of the easiest, most intuitive, and most fre
 More generally speaking, all observations can be considered as an ellipsoid in a subspace of an initial feature space, and the new basis set in this subspace is aligned with the ellipsoid axes. This assumption lets us remove highly correlated features since basis set vectors are orthogonal.

-In the general case, the resulting ellipsoid dimensionality matches the initial space dimensionality, but the assumption that our data lies in a subspace with a smaller dimension allows us to cut off the "excessive" space with the new projection (subspace). We accomplish this in a 'greedy' fashion, sequentially selecting each of the ellipsoid axes by identifying where the dispersion is maximal.
+In the general case, the resulting ellipsoid dimensionality matches the initial space dimensionality, but the assumption that our data lies in a subspace with a smaller dimension allows us to cut off the "excessive" space with the new projection (subspace). We accomplish this in a 'greedy' fashion, sequentially selecting each of the ellipsoid axes by identifying where the variance is maximal.

 > "To deal with hyper-planes in a 14 dimensional space, visualize a 3D space and say 'fourteen' very loudly. Everyone does it." - Geoffrey Hinton

@@ -58,15 +58,15 @@ In the general case, the resulting ellipsoid dimensionality matches the initial
 Let's take a look at the mathematical formulation of this process:

-In order to decrease the dimensionality of our data from $n$ to $k$ with $k \leq n$, we sort our list of axes in order of decreasing dispersion and take the top-$k$ of them.
+In order to decrease the dimensionality of our data from $n$ to $k$ with $k \leq n$, we sort our list of axes in order of decreasing variance and take the top-$k$ of them.

-We begin by computing the dispersion and the covariance of the initial features. This is usually done with the covariance matrix. According to the covariance definition, the covariance of two features is computed as follows:
+We begin by computing the variance and the covariance of the initial features. This is usually done with the covariance matrix. According to the covariance definition, the covariance of two features is computed as follows:

 $$cov(X_i, X_j) = E[(X_i - \mu_i) (X_j - \mu_j)] = E[X_i X_j] - \mu_i \mu_j,$$

-where $\mu_i$ is the expected value of the $i$th feature. It is worth noting that the covariance is symmetric, and the covariance of a vector with itself is equal to its dispersion.
+where $\mu_i$ is the expected value of the $i$th feature. It is worth noting that the covariance is symmetric, and the covariance of a vector with itself is equal to its variance.

-Therefore the covariance matrix is symmetric with the dispersion of the corresponding features on the diagonal. Non-diagonal values are the covariances of the corresponding pair of features. In terms of matrices where $\mathbf{X}$ is the matrix of observations, the covariance matrix is as follows:
+Therefore the covariance matrix is symmetric with the variance of the corresponding features on the diagonal. Non-diagonal values are the covariances of the corresponding pair of features. In terms of matrices where $\mathbf{X}$ is the matrix of observations, the covariance matrix is as follows:

 $$\Sigma = E[(\mathbf{X} - E[\mathbf{X}]) (\mathbf{X} - E[\mathbf{X}])^{T}]$$

@@ -251,7 +251,7 @@ plt.colorbar()
 plt.title("MNIST. t-SNE projection");
 ```

-In practice, we would choose the number of principal components such that we can explain 90% of the initial data dispersion (via the `explained_variance_ratio`). Here, that means retaining 21 principal components; therefore, we reduce the dimensionality from 64 features to 21.
+In practice, we would choose the number of principal components such that we can explain 90% of the initial data variance (via the `explained_variance_ratio`). Here, that means retaining 21 principal components; therefore, we reduce the dimensionality from 64 features to 21.

 ```{code-cell} ipython3
diff --git a/mlcourse_ai_jupyter_book/book/topic08/topic08_sgd_hashing_vowpal_wabbit.md b/mlcourse_ai_jupyter_book/book/topic08/topic08_sgd_hashing_vowpal_wabbit.md
index b1e841756..a841aeb34 100644
--- a/mlcourse_ai_jupyter_book/book/topic08/topic08_sgd_hashing_vowpal_wabbit.md
+++ b/mlcourse_ai_jupyter_book/book/topic08/topic08_sgd_hashing_vowpal_wabbit.md
@@ -233,7 +233,7 @@ def logistic_regression_accuracy_on(dataframe, labels):
     )

     logit = LogisticRegression()
-    logit.fit(train_features, train_labels)
+    logit.fit(train_features, train_labels.values.ravel())

     return classification_report(test_labels, logit.predict(test_features))
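Note on the topic07 hunk (not part of the patch): in scikit-learn the attribute is spelled `explained_variance_ratio_`, with a trailing underscore. Below is a minimal sketch of the 90%-variance selection the updated sentence describes, assuming the 8x8 digits data (64 features) that the notebook refers to as MNIST.

```python
# Sketch only: pick the smallest number of principal components whose
# cumulative explained-variance ratio reaches 90%.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)  # keep all 64 components for inspection
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.90)) + 1  # first index reaching 90%, plus one
print(n_components)  # the notebook reports 21 for this data

# scikit-learn can also apply the threshold itself when n_components is a float:
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X)
print(pca_90.n_components_)
```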
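The topic08 change passes `train_labels.values.ravel()` because scikit-learn estimators expect class labels as a 1-D array; the `.values` of a single-column DataFrame is a 2-D column vector and triggers a `DataConversionWarning` in `fit()`. A small illustration with made-up labels:

```python
import pandas as pd

# Hypothetical single-column label frame, standing in for train_labels.
train_labels = pd.DataFrame({"target": [0, 1, 0, 1]})
print(train_labels.values.shape)          # (4, 1): column vector, warns inside fit()
print(train_labels.values.ravel().shape)  # (4,): the flat 1-D array estimators expect
```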