diff --git a/ml/cc/exercises/linear_regression_with_a_real_dataset.ipynb b/ml/cc/exercises/linear_regression_with_a_real_dataset.ipynb index 061f1d8..e922115 100644 --- a/ml/cc/exercises/linear_regression_with_a_real_dataset.ipynb +++ b/ml/cc/exercises/linear_regression_with_a_real_dataset.ipynb @@ -5,8 +5,7 @@ "colab": { "name": "Linear Regression with a Real Dataset.ipynb", "provenance": [], - "private_outputs": true, - "collapsed_sections": [] + "private_outputs": true }, "kernelspec": { "name": "python3", @@ -18,9 +17,7 @@ "cell_type": "code", "metadata": { "id": "wDlWLbfkJtvu", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Copyright 2020 Google LLC. Double-click here for license information.\n", @@ -36,7 +33,7 @@ "# See the License for the specific language governing permissions and\n", "# limitations under the License." ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { @@ -53,8 +50,7 @@ { "cell_type": "markdown", "metadata": { - "id": "TL5y5fY9Jy_x", - "colab_type": "text" + "id": "TL5y5fY9Jy_x" }, "source": [ "# Linear Regression with a Real Dataset\n", @@ -69,8 +65,7 @@ { "cell_type": "markdown", "metadata": { - "id": "h8wtceyJj2uX", - "colab_type": "text" + "id": "h8wtceyJj2uX" }, "source": [ "## Learning Objectives:\n", @@ -78,7 +73,7 @@ "After doing this Colab, you'll know how to do the following:\n", "\n", " * Read a .csv file into a [pandas](https://developers.google.com/machine-learning/glossary/#pandas) DataFrame.\n", - " * Examine a [dataset](https://developers.google.com/machine-learning/glossary/#data_set). \n", + " * Examine a [dataset](https://developers.google.com/machine-learning/glossary/#data_set).\n", " * Experiment with different [features](https://developers.google.com/machine-learning/glossary/#feature) in building a model.\n", " * Tune the model's [hyperparameters](https://developers.google.com/machine-learning/glossary/#hyperparameter)." ] @@ -86,8 +81,7 @@ { "cell_type": "markdown", "metadata": { - "id": "JJZEgJQSjyK4", - "colab_type": "text" + "id": "JJZEgJQSjyK4" }, "source": [ "## The Dataset\n", @@ -98,8 +92,7 @@ { "cell_type": "markdown", "metadata": { - "id": "xchnxAsaKKqO", - "colab_type": "text" + "id": "xchnxAsaKKqO" }, "source": [ "## Import relevant modules\n", @@ -111,9 +104,7 @@ "cell_type": "code", "metadata": { "id": "9n9_cTveKmse", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Import relevant modules\n", @@ -121,23 +112,22 @@ "import tensorflow as tf\n", "from matplotlib import pyplot as plt\n", "\n", - "# The following lines adjust the granularity of reporting. \n", + "# The following lines adjust the granularity of reporting.\n", "pd.options.display.max_rows = 10\n", "pd.options.display.float_format = \"{:.1f}\".format" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "X_TaJhU4KcuY", - "colab_type": "text" + "id": "X_TaJhU4KcuY" }, "source": [ "## The dataset\n", "\n", - "Datasets are often stored on disk or at a URL in [.csv format](https://wikipedia.org/wiki/Comma-separated_values). \n", + "Datasets are often stored on disk or at a URL in [.csv format](https://wikipedia.org/wiki/Comma-separated_values).\n", "\n", "A well-formed .csv file contains column names in the first row, followed by many rows of data. A comma divides each value in each row. 
For example, here are the first five rows of the .csv file holding the California Housing Dataset:\n",
 "\n",
@@ -154,8 +144,7 @@
 {
 "cell_type": "markdown",
 "metadata": {
- "id": "sSFQkzNlj-l6",
- "colab_type": "text"
+ "id": "sSFQkzNlj-l6"
 },
 "source": [
 "### Load the .csv file into a pandas DataFrame\n",
@@ -171,9 +160,7 @@
 {
 "cell_type": "code",
 "metadata": {
- "id": "JZlvdpyYKx7V",
- "colab_type": "code",
- "colab": {}
+ "id": "JZlvdpyYKx7V"
 },
 "source": [
 "# Import the dataset.\n",
@@ -185,14 +172,13 @@
 "# Print the first rows of the pandas DataFrame.\n",
 "training_df.head()"
 ],
- "execution_count": 0,
+ "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
- "id": "5inxx49n4U9u",
- "colab_type": "text"
+ "id": "5inxx49n4U9u"
 },
 "source": [
 "Scaling `median_house_value` puts the value of each house in units of thousands. Scaling will keep loss values and learning rates in a friendlier range. \n",
@@ -203,17 +189,16 @@
 {
 "cell_type": "markdown",
 "metadata": {
- "id": "yMysi6-3IAbu",
- "colab_type": "text"
+ "id": "yMysi6-3IAbu"
 },
 "source": [
 "## Examine the dataset\n",
 "\n",
 "A large part of most machine learning projects is getting to know your data. The pandas API provides a `describe` function that outputs the following statistics about every column in the DataFrame:\n",
 "\n",
- "* `count`, which is the number of rows in that column. Ideally, `count` contains the same value for every column. \n",
+ "* `count`, which is the number of rows in that column. Ideally, `count` contains the same value for every column.\n",
 "\n",
- "* `mean` and `std`, which contain the mean and standard deviation of the values in each column. \n",
+ "* `mean` and `std`, which contain the mean and standard deviation of the values in each column.\n",
 "\n",
 "* `min` and `max`, which contain the lowest and highest values in each column.\n",
 "\n",
@@ -223,36 +208,31 @@
 {
 "cell_type": "code",
 "metadata": {
- "id": "rnUSYKw4LUuh",
- "colab_type": "code",
- "colab": {}
+ "id": "rnUSYKw4LUuh"
 },
 "source": [
 "# Get statistics on the dataset.\n",
- "training_df.describe()\n"
+ "training_df.describe()"
 ],
- "execution_count": 0,
+ "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
- "id": "f9pcW_Yjtoo8",
- "colab_type": "text"
+ "id": "f9pcW_Yjtoo8"
 },
 "source": [
 "### Task 1: Identify anomalies in the dataset\n",
 "\n",
- "Do you see any anomalies (strange values) in the data? "
+ "Do you see any anomalies (strange values) in the data?"
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "UoS7NWRXEs1H",
- "colab_type": "code",
- "cellView": "form",
- "colab": {}
+ "cellView": "form"
 },
 "source": [
 "#@title Double-click to view a possible answer.\n",
 "\n",
@@ -260,28 +240,27 @@
 "# The maximum value (max) of several columns seems very\n",
 "# high compared to the other quantiles. For example,\n",
 "# examine the total_rooms column. Given the quantile\n",
- "# values (25%, 50%, and 75%), you might expect the \n",
- "# max value of total_rooms to be approximately \n",
- "# 5,000 or possibly 10,000. However, the max value \n",
+ "# values (25%, 50%, and 75%), you might expect the\n",
+ "# max value of total_rooms to be approximately\n",
+ "# 5,000 or possibly 10,000. However, the max value\n",
 "# is actually 37,937.\n",
 "\n",
 "# When you see anomalies in a column, become more careful\n",
 "# about using that column as a feature. 
That said,\n", - "# anomalies in potential features sometimes mirror \n", - "# anomalies in the label, which could make the column \n", + "# anomalies in potential features sometimes mirror\n", + "# anomalies in the label, which could make the column\n", "# be (or seem to be) a powerful feature.\n", - "# Also, as you will see later in the course, you \n", - "# might be able to represent (pre-process) raw data \n", + "# Also, as you will see later in the course, you\n", + "# might be able to represent (pre-process) raw data\n", "# in order to make columns into useful features." ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "3014ezH3C7jT", - "colab_type": "text" + "id": "3014ezH3C7jT" }, "source": [ "## Define functions that build and train a model\n", @@ -289,7 +268,7 @@ "The following code defines two functions:\n", "\n", " * `build_model(my_learning_rate)`, which builds a randomly-initialized model.\n", - " * `train_model(model, feature, label, epochs)`, which trains the model from the examples (feature and label) you pass. \n", + " * `train_model(model, feature, label, epochs)`, which trains the model from the examples (feature and label) you pass.\n", "\n", "Since you don't need to understand model building code right now, we've hidden this code cell. You may optionally double-click the following headline to see the code that builds and trains a model." ] @@ -298,9 +277,7 @@ "cell_type": "code", "metadata": { "id": "pedD5GhlDC-y", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Define the functions that build and train a model\n", @@ -312,23 +289,23 @@ " # Describe the topography of the model.\n", " # The topography of a simple linear regression model\n", " # is a single node in a single layer.\n", - " model.add(tf.keras.layers.Dense(units=1, \n", + " model.add(tf.keras.layers.Dense(units=1,\n", " input_shape=(1,)))\n", "\n", " # Compile the model topography into code that TensorFlow can efficiently\n", - " # execute. Configure training to minimize the model's mean squared error. \n", - " model.compile(optimizer=tf.keras.optimizers.experimental.RMSprop(learning_rate=my_learning_rate),\n", + " # execute. Configure training to minimize the model's mean squared error.\n", + " model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=my_learning_rate),\n", " loss=\"mean_squared_error\",\n", " metrics=[tf.keras.metrics.RootMeanSquaredError()])\n", "\n", - " return model \n", + " return model\n", "\n", "\n", "def train_model(model, df, feature, label, epochs, batch_size):\n", " \"\"\"Train the model by feeding it data.\"\"\"\n", "\n", " # Feed the model the feature and the label.\n", - " # The model will train for the specified number of epochs. \n", + " # The model will train for the specified number of epochs.\n", " history = model.fit(x=df[feature],\n", " y=df[label],\n", " batch_size=batch_size,\n", @@ -340,26 +317,25 @@ "\n", " # The list of epochs is stored separately from the rest of history.\n", " epochs = history.epoch\n", - " \n", + "\n", " # Isolate the error for each epoch.\n", " hist = pd.DataFrame(history.history)\n", "\n", " # To track the progression of training, we're going to take a snapshot\n", - " # of the model's root mean squared error at each epoch. 
\n", + " # of the model's root mean squared error at each epoch.\n", " rmse = hist[\"root_mean_squared_error\"]\n", "\n", " return trained_weight, trained_bias, epochs, rmse\n", "\n", "print(\"Defined the build_model and train_model functions.\")" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "Ak_TMAzGOIFq", - "colab_type": "text" + "id": "Ak_TMAzGOIFq" }, "source": [ "## Define plotting functions\n", @@ -376,9 +352,7 @@ "cell_type": "code", "metadata": { "id": "QF0BFRXTOeR3", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Define the plotting functions\n", @@ -415,32 +389,29 @@ " plt.plot(epochs, rmse, label=\"Loss\")\n", " plt.legend()\n", " plt.ylim([rmse.min()*0.97, rmse.max()])\n", - " plt.show() \n", + " plt.show()\n", "\n", "print(\"Defined the plot_the_model and plot_the_loss_curve functions.\")" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "D-IXYVfvM4gD", - "colab_type": "text" + "id": "D-IXYVfvM4gD" }, "source": [ "## Call the model functions\n", "\n", - "An important part of machine learning is determining which [features](https://developers.google.com/machine-learning/glossary/#feature) correlate with the [label](https://developers.google.com/machine-learning/glossary/#label). For example, real-life home-value prediction models typically rely on hundreds of features and synthetic features. However, this model relies on only one feature. For now, you'll arbitrarily use `total_rooms` as that feature. \n" + "An important part of machine learning is determining which [features](https://developers.google.com/machine-learning/glossary/#feature) correlate with the [label](https://developers.google.com/machine-learning/glossary/#label). For example, real-life home-value prediction models typically rely on hundreds of features and synthetic features. However, this model relies on only one feature. For now, you'll arbitrarily use `total_rooms` as that feature.\n" ] }, { "cell_type": "code", "metadata": { "id": "nj3v5EKQFY8s", - "colab_type": "code", - "cellView": "both", - "colab": {} + "cellView": "both" }, "source": [ "# The following variables are the hyperparameters.\n", @@ -451,15 +422,15 @@ "# Specify the feature and the label.\n", "my_feature = \"total_rooms\" # the total number of rooms on a specific city block.\n", "my_label=\"median_house_value\" # the median value of a house on a specific city block.\n", - "# That is, you're going to create a model that predicts house value based \n", - "# solely on total_rooms. \n", + "# That is, you're going to create a model that predicts house value based\n", + "# solely on total_rooms.\n", "\n", "# Discard any pre-existing version of the model.\n", "my_model = None\n", "\n", "# Invoke the functions.\n", "my_model = build_model(learning_rate)\n", - "weight, bias, epochs, rmse = train_model(my_model, training_df, \n", + "weight, bias, epochs, rmse = train_model(my_model, training_df,\n", " my_feature, my_label,\n", " epochs, batch_size)\n", "\n", @@ -469,14 +440,13 @@ "plot_the_model(weight, bias, my_feature, my_label)\n", "plot_the_loss_curve(epochs, rmse)" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "Btp8zUNbYOcd", - "colab_type": "text" + "id": "Btp8zUNbYOcd" }, "source": [ "A certain amount of randomness plays into training a model. 
Consequently, you'll get different results each time you train the model. That said, given the dataset and the hyperparameters, the trained model will generally do a poor job describing the feature's relation to the label." @@ -485,8 +455,7 @@ { "cell_type": "markdown", "metadata": { - "id": "1xNqWWos_zyk", - "colab_type": "text" + "id": "1xNqWWos_zyk" }, "source": [ "## Use the model to make predictions\n", @@ -499,9 +468,7 @@ { "cell_type": "code", "metadata": { - "id": "nH63BmncAcab", - "colab_type": "code", - "colab": {} + "id": "nH63BmncAcab" }, "source": [ "def predict_house_values(n, feature, label):\n", @@ -519,14 +486,13 @@ " training_df[label][10000 + i],\n", " predicted_values[i][0] ))" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "NbBNQujU5WjK", - "colab_type": "text" + "id": "NbBNQujU5WjK" }, "source": [ "Now, invoke the house prediction function on 10 examples:" @@ -535,21 +501,18 @@ { "cell_type": "code", "metadata": { - "id": "Y_0DGBt0Kz_N", - "colab_type": "code", - "colab": {} + "id": "Y_0DGBt0Kz_N" }, "source": [ "predict_house_values(10, my_feature, my_label)" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "-gGaqArcpqY3", - "colab_type": "text" + "id": "-gGaqArcpqY3" }, "source": [ "### Task 2: Judge the predictive power of the model\n", @@ -561,32 +524,29 @@ "cell_type": "code", "metadata": { "id": "yVpjhUFm9uID", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Double-click to view the answer.\n", "\n", "# Most of the predicted values differ significantly\n", - "# from the label value, so the trained model probably \n", + "# from the label value, so the trained model probably\n", "# doesn't have much predictive power. However, the\n", - "# first 10 examples might not be representative of \n", - "# the rest of the examples. " + "# first 10 examples might not be representative of\n", + "# the rest of the examples." ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "wLoqis3IUPSd", - "colab_type": "text" + "id": "wLoqis3IUPSd" }, "source": [ "## Task 3: Try a different feature\n", "\n", - "The `total_rooms` feature had only a little predictive power. Would a different feature have greater predictive power? Try using `population` as the feature instead of `total_rooms`. \n", + "The `total_rooms` feature had only a little predictive power. Would a different feature have greater predictive power? Try using `population` as the feature instead of `total_rooms`.\n", "\n", "Note: When you change features, you might also need to change the hyperparameters." ] @@ -594,9 +554,7 @@ { "cell_type": "code", "metadata": { - "id": "H0ab6HD4ZO75", - "colab_type": "code", - "colab": {} + "id": "H0ab6HD4ZO75" }, "source": [ "my_feature = \"?\" # Replace the ? 
with population or possibly\n", @@ -609,7 +567,7 @@ "\n", "# Don't change anything below this line.\n", "my_model = build_model(learning_rate)\n", - "weight, bias, epochs, rmse = train_model(my_model, training_df, \n", + "weight, bias, epochs, rmse = train_model(my_model, training_df,\n", " my_feature, my_label,\n", " epochs, batch_size)\n", "plot_the_model(weight, bias, my_feature, my_label)\n", @@ -617,16 +575,14 @@ "\n", "predict_house_values(15, my_feature, my_label)" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "107mDkW7U6mg", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Double-click to view a possible solution.\n", @@ -640,7 +596,7 @@ "\n", "# Don't change anything below.\n", "my_model = build_model(learning_rate)\n", - "weight, bias, epochs, rmse = train_model(my_model, training_df, \n", + "weight, bias, epochs, rmse = train_model(my_model, training_df,\n", " my_feature, my_label,\n", " epochs, batch_size)\n", "\n", @@ -649,14 +605,13 @@ "\n", "predict_house_values(10, my_feature, my_label)" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "Nd_rHJ59AUtk", - "colab_type": "text" + "id": "Nd_rHJ59AUtk" }, "source": [ "Did `population` produce better predictions than `total_rooms`?" @@ -666,48 +621,43 @@ "cell_type": "code", "metadata": { "id": "F0tPEtzcC-vK", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Double-click to view the answer.\n", "\n", - "# Training is not entirely deterministic, but population \n", - "# typically converges at a slightly higher RMSE than \n", - "# total_rooms. So, population appears to be about \n", - "# the same or slightly worse at making predictions \n", + "# Training is not entirely deterministic, but population\n", + "# typically converges at a slightly higher RMSE than\n", + "# total_rooms. So, population appears to be about\n", + "# the same or slightly worse at making predictions\n", "# than total_rooms." ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "C8uYpyGacsIg", - "colab_type": "text" + "id": "C8uYpyGacsIg" }, "source": [ "## Task 4: Define a synthetic feature\n", "\n", "You have determined that `total_rooms` and `population` were not useful features. That is, neither the total number of rooms in a neighborhood nor the neighborhood's population successfully predicted the median house price of that neighborhood. Perhaps though, the *ratio* of `total_rooms` to `population` might have some predictive power. That is, perhaps block density relates to median house value.\n", "\n", - "To explore this hypothesis, do the following: \n", + "To explore this hypothesis, do the following:\n", "\n", "1. Create a [synthetic feature](https://developers.google.com/machine-learning/glossary/#synthetic_feature) that's a ratio of `total_rooms` to `population`. (If you are new to pandas DataFrames, please study the [Pandas DataFrame Ultraquick Tutorial](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb?utm_source=linearregressionreal-colab&utm_medium=colab&utm_campaign=colab-external&utm_content=pandas_tf2-colab&hl=en).)\n", "2. Tune the three hyperparameters.\n", - "3. 
Determine whether this synthetic feature produces \n", - " a lower loss value than any of the single features you \n", + "3. Determine whether this synthetic feature produces\n", + " a lower loss value than any of the single features you\n", " tried earlier in this exercise." ] }, { "cell_type": "code", "metadata": { - "id": "4Kx2xHSgdcpg", - "colab_type": "code", - "colab": {} + "id": "4Kx2xHSgdcpg" }, "source": [ "# Define a synthetic feature named rooms_per_person\n", @@ -730,16 +680,14 @@ "plot_the_loss_curve(epochs, rmse)\n", "predict_house_values(15, my_feature, my_label)" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "xRfxp_3yofe3", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Double-click to view a possible solution to Task 4.\n", @@ -762,14 +710,13 @@ "plot_the_loss_curve(epochs, mae)\n", "predict_house_values(15, my_feature, my_label)\n" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "HBiDWursB1Wi", - "colab_type": "text" + "id": "HBiDWursB1Wi" }, "source": [ "Based on the loss values, this synthetic feature produces a better model than the individual features you tried in Task 2 and Task 3. However, the model still isn't creating great predictions.\n" @@ -778,8 +725,7 @@ { "cell_type": "markdown", "metadata": { - "id": "XEG_9oU9O54u", - "colab_type": "text" + "id": "XEG_9oU9O54u" }, "source": [ "## Task 5. Find feature(s) whose raw values correlate with the label\n", @@ -789,7 +735,7 @@ "A **correlation matrix** indicates how each attribute's raw values relate to the other attributes' raw values. Correlation values have the following meanings:\n", "\n", " * `1.0`: perfect positive correlation; that is, when one attribute rises, the other attribute rises.\n", - " * `-1.0`: perfect negative correlation; that is, when one attribute rises, the other attribute falls. \n", + " * `-1.0`: perfect negative correlation; that is, when one attribute rises, the other attribute falls.\n", " * `0.0`: no correlation; the two columns [are not linearly related](https://en.wikipedia.org/wiki/Correlation_and_dependence#/media/File:Correlation_examples2.svg).\n", "\n", "In general, the higher the absolute value of a correlation value, the greater its predictive power. For example, a correlation value of -0.8 implies far more predictive power than a correlation of -0.2.\n", @@ -800,22 +746,19 @@ { "cell_type": "code", "metadata": { - "id": "zFGKL45LO8Tt", - "colab_type": "code", - "colab": {} + "id": "zFGKL45LO8Tt" }, "source": [ "# Generate a correlation matrix.\n", "training_df.corr()" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "hp0r3NAVPEdt", - "colab_type": "text" + "id": "hp0r3NAVPEdt" }, "source": [ "The correlation matrix shows nine potential features (including a synthetic\n", @@ -828,29 +771,26 @@ "cell_type": "code", "metadata": { "id": "RomQTd1OPVd0", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Double-click here for the solution to Task 5\n", "\n", - "# The median_income correlates 0.7 with the label \n", - "# (median_house_value), so median_income might be a \n", + "# The median_income correlates 0.7 with the label\n", + "# (median_house_value), so median_income might be a\n", "# good feature. 
The other seven potential features\n", - "# all have a correlation relatively close to 0. \n", + "# all have a correlation relatively close to 0.\n", "\n", "# If time permits, try median_income as the feature\n", "# and see whether the model improves." ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "8RqvEbaVSlRt", - "colab_type": "text" + "id": "8RqvEbaVSlRt" }, "source": [ "Correlation matrices don't tell the entire story. In later exercises, you'll find additional ways to unlock predictive power from potential features.\n", @@ -860,4 +800,4 @@ ] } ] -} +} \ No newline at end of file diff --git a/ml/cc/exercises/validation_and_test_sets.ipynb b/ml/cc/exercises/validation_and_test_sets.ipynb index 3291363..264f3e3 100644 --- a/ml/cc/exercises/validation_and_test_sets.ipynb +++ b/ml/cc/exercises/validation_and_test_sets.ipynb @@ -5,8 +5,7 @@ "colab": { "name": "Validation and Test Sets.ipynb", "provenance": [], - "private_outputs": true, - "collapsed_sections": [] + "private_outputs": true }, "kernelspec": { "name": "python3", @@ -18,9 +17,7 @@ "cell_type": "code", "metadata": { "id": "hMqWDc_m6rUC", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Copyright 2020 Google LLC. Double-click here for license information.\n", @@ -36,7 +33,7 @@ "# See the License for the specific language governing permissions and\n", "# limitations under the License." ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { @@ -53,8 +50,7 @@ { "cell_type": "markdown", "metadata": { - "id": "4f3CKqFUqL2-", - "colab_type": "text" + "id": "4f3CKqFUqL2-" }, "source": [ "# Validation Sets and Test Sets\n", @@ -69,8 +65,7 @@ { "cell_type": "markdown", "metadata": { - "id": "3spZH_kNkWWX", - "colab_type": "text" + "id": "3spZH_kNkWWX" }, "source": [ "## Learning objectives\n", @@ -86,8 +81,7 @@ { "cell_type": "markdown", "metadata": { - "id": "gV82DJO3kWpk", - "colab_type": "text" + "id": "gV82DJO3kWpk" }, "source": [ "## The dataset\n", @@ -106,8 +100,7 @@ { "cell_type": "markdown", "metadata": { - "id": "S8gm6BpqRRuh", - "colab_type": "text" + "id": "S8gm6BpqRRuh" }, "source": [ "## Import relevant modules\n", @@ -119,9 +112,7 @@ "cell_type": "code", "metadata": { "id": "9D8GgUovHbG0", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Import modules\n", @@ -133,14 +124,13 @@ "pd.options.display.max_rows = 10\n", "pd.options.display.float_format = \"{:.1f}\".format" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "xjvrrClQeAJu", - "colab_type": "text" + "id": "xjvrrClQeAJu" }, "source": [ "## Load the datasets from the internet\n", @@ -155,54 +145,48 @@ { "cell_type": "code", "metadata": { - "id": "zUnTc_wfd_o3", - "colab_type": "code", - "colab": {} + "id": "zUnTc_wfd_o3" }, "source": [ "train_df = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\")\n", "test_df = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv\")" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "P_KBdj2M_yjM", - "colab_type": "text" + "id": "P_KBdj2M_yjM" }, "source": [ "## Scale the label values\n", "\n", - "The following code cell scales the `median_house_value`. 
\n", + "The following code cell scales the `median_house_value`.\n", "See the previous Colab exercise for details." ] }, { "cell_type": "code", "metadata": { - "id": "3hc7QQhaAFXD", - "colab_type": "code", - "colab": {} + "id": "3hc7QQhaAFXD" }, "source": [ "scale_factor = 1000.0\n", "\n", "# Scale the training set's label.\n", - "train_df[\"median_house_value\"] /= scale_factor \n", + "train_df[\"median_house_value\"] /= scale_factor\n", "\n", "# Scale the test set's label\n", "test_df[\"median_house_value\"] /= scale_factor" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "FhessIIV8VPc", - "colab_type": "text" + "id": "FhessIIV8VPc" }, "source": [ "## Load the functions that build and train a model\n", @@ -210,7 +194,7 @@ "The following code cell defines two functions:\n", "\n", " * `build_model`, which defines the model's topography.\n", - " * `train_model`, which will ultimately train the model, outputting not only the loss value for the training set but also the loss value for the validation set. \n", + " * `train_model`, which will ultimately train the model, outputting not only the loss value for the training set but also the loss value for the validation set.\n", "\n", "Since you don't need to understand model building code right now, we've hidden this code cell. As always, you must run hidden code cells." ] @@ -219,9 +203,7 @@ "cell_type": "code", "metadata": { "id": "bvonhK857msj", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Define the functions that build and train a model\n", @@ -234,15 +216,15 @@ " model.add(tf.keras.layers.Dense(units=1, input_shape=(1,)))\n", "\n", " # Compile the model topography into code that TensorFlow can efficiently\n", - " # execute. Configure training to minimize the model's mean squared error. \n", - " model.compile(optimizer=tf.keras.optimizers.experimental.RMSprop(learning_rate=my_learning_rate),\n", + " # execute. 
Configure training to minimize the model's mean squared error.\n", + " model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=my_learning_rate),\n", " loss=\"mean_squared_error\",\n", " metrics=[tf.keras.metrics.RootMeanSquaredError()])\n", "\n", - " return model \n", + " return model\n", "\n", "\n", - "def train_model(model, df, feature, label, my_epochs, \n", + "def train_model(model, df, feature, label, my_epochs,\n", " my_batch_size=None, my_validation_split=0.1):\n", " \"\"\"Feed a dataset into the model in order to train it.\"\"\"\n", "\n", @@ -256,26 +238,25 @@ " trained_weight = model.get_weights()[0][0]\n", " trained_bias = model.get_weights()[1]\n", "\n", - " # The list of epochs is stored separately from the \n", + " # The list of epochs is stored separately from the\n", " # rest of history.\n", " epochs = history.epoch\n", - " \n", + "\n", " # Isolate the root mean squared error for each epoch.\n", " hist = pd.DataFrame(history.history)\n", " rmse = hist[\"root_mean_squared_error\"]\n", "\n", - " return epochs, rmse, history.history \n", + " return epochs, rmse, history.history\n", "\n", "print(\"Defined the build_model and train_model functions.\")" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "8gRu4Ri0D8tH", - "colab_type": "text" + "id": "8gRu4Ri0D8tH" }, "source": [ "## Define plotting functions\n", @@ -287,9 +268,7 @@ "cell_type": "code", "metadata": { "id": "QA7hsqPZDvVM", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Define the plotting function\n", @@ -304,7 +283,7 @@ " plt.plot(epochs[1:], mae_training[1:], label=\"Training Loss\")\n", " plt.plot(epochs[1:], mae_validation[1:], label=\"Validation Loss\")\n", " plt.legend()\n", - " \n", + "\n", " # We're not going to plot the first epoch, since the loss on the first epoch\n", " # is often substantially greater than the loss for other epochs.\n", " merged_mae_lists = mae_training[1:] + mae_validation[1:]\n", @@ -315,20 +294,19 @@ "\n", " top_of_y_axis = highest_loss + (delta * 0.05)\n", " bottom_of_y_axis = lowest_loss - (delta * 0.05)\n", - " \n", + "\n", " plt.ylim([bottom_of_y_axis, top_of_y_axis])\n", - " plt.show() \n", + " plt.show()\n", "\n", "print(\"Defined the plot_the_loss_curve function.\")" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "jipBqEQXlsN8", - "colab_type": "text" + "id": "jipBqEQXlsN8" }, "source": [ "## Task 1: Experiment with the validation split\n", @@ -345,15 +323,13 @@ "\n", "If the data in the training set is similar to the data in the validation set, then the two loss curves and the final loss values should be almost identical. However, the loss curves and final loss values are **not** almost identical. Hmm, that's odd. \n", "\n", - "Experiment with two or three different values of `validation_split`. Do different values of `validation_split` fix the problem? \n" + "Experiment with two or three different values of `validation_split`. Do different values of `validation_split` fix the problem?\n" ] }, { "cell_type": "code", "metadata": { - "id": "knP23Taoa00a", - "colab_type": "code", - "colab": {} + "id": "knP23Taoa00a" }, "source": [ "# The following variables are the hyperparameters.\n", @@ -362,66 +338,61 @@ "batch_size = 100\n", "\n", "# Split the original training set into a reduced training set and a\n", - "# validation set. 
\n", + "# validation set.\n", "validation_split = 0.2\n", "\n", "# Identify the feature and the label.\n", "my_feature = \"median_income\" # the median income on a specific city block.\n", "my_label = \"median_house_value\" # the median house value on a specific city block.\n", - "# That is, you're going to create a model that predicts house value based \n", - "# solely on the neighborhood's median income. \n", + "# That is, you're going to create a model that predicts house value based\n", + "# solely on the neighborhood's median income.\n", "\n", "# Invoke the functions to build and train the model.\n", "my_model = build_model(learning_rate)\n", - "epochs, rmse, history = train_model(my_model, train_df, my_feature, \n", - " my_label, epochs, batch_size, \n", + "epochs, rmse, history = train_model(my_model, train_df, my_feature,\n", + " my_label, epochs, batch_size,\n", " validation_split)\n", "\n", - "plot_the_loss_curve(epochs, history[\"root_mean_squared_error\"], \n", + "plot_the_loss_curve(epochs, history[\"root_mean_squared_error\"],\n", " history[\"val_root_mean_squared_error\"])" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "TKa11JK4Pm3f", - "colab_type": "text" + "id": "TKa11JK4Pm3f" }, "source": [ "## Task 2: Determine **why** the loss curves differ\n", "\n", - "No matter how you split the training set and the validation set, the loss curves differ significantly. Evidently, the data in the training set isn't similar enough to the data in the validation set. Counterintuitive? Yes, but this problem is actually pretty common in machine learning. \n", + "No matter how you split the training set and the validation set, the loss curves differ significantly. Evidently, the data in the training set isn't similar enough to the data in the validation set. Counterintuitive? Yes, but this problem is actually pretty common in machine learning.\n", "\n", "Your task is to determine **why** the loss curves aren't highly similar. As with most issues in machine learning, the problem is rooted in the data itself. To solve this mystery of why the training set and validation set aren't almost identical, write a line or two of [pandas code](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb?utm_source=validation-colab&utm_medium=colab&utm_campaign=colab-external&utm_content=pandas_tf2-colab&hl=en) in the following code cell. Here are a couple of hints:\n", "\n", " * The previous code cell split the original training set into:\n", " * a reduced training set (the original training set - the validation set)\n", - " * the validation set \n", + " * the validation set\n", " * By default, the pandas [`head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method outputs the *first* 5 rows of the DataFrame. To see more of the training set, specify the `n` argument to `head` and assign a large positive integer to `n`." ] }, { "cell_type": "code", "metadata": { - "id": "VJQcAZkwJt_p", - "colab_type": "code", - "colab": {} + "id": "VJQcAZkwJt_p" }, "source": [ "# Write some code in this code cell." 
], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "EnNvkFwwK8WY", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Double-click for a possible solution to Task 2.\n", @@ -430,18 +401,17 @@ "# of the training set\n", "train_df.head(n=1000)\n", "\n", - "# The original training set is sorted by longitude. \n", + "# The original training set is sorted by longitude.\n", "# Apparently, longitude influences the relationship of\n", "# median_income to median_house_value." ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "rw4xI1ZEckI8", - "colab_type": "text" + "id": "rw4xI1ZEckI8" }, "source": [ "## Task 3. Fix the problem\n", @@ -457,8 +427,8 @@ "2. Pass `shuffled_train_df` (instead of `train_df`) as the second argument to `train_model` (in the code call associated with Task 1) so that the call becomes as follows:\n", "\n", "```\n", - " epochs, rmse, history = train_model(my_model, shuffled_train_df, my_feature, \n", - " my_label, epochs, batch_size, \n", + " epochs, rmse, history = train_model(my_model, shuffled_train_df, my_feature,\n", + " my_label, epochs, batch_size,\n", " validation_split)\n", "```" ] @@ -467,9 +437,7 @@ "cell_type": "code", "metadata": { "id": "ncODhpv0h-LG", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Double-click to view the complete implementation.\n", @@ -480,36 +448,35 @@ "batch_size = 100\n", "\n", "# Split the original training set into a reduced training set and a\n", - "# validation set. \n", + "# validation set.\n", "validation_split = 0.2\n", "\n", "# Identify the feature and the label.\n", "my_feature = \"median_income\" # the median income on a specific city block.\n", "my_label = \"median_house_value\" # the median house value on a specific city block.\n", - "# That is, you're going to create a model that predicts house value based \n", - "# solely on the neighborhood's median income. \n", + "# That is, you're going to create a model that predicts house value based\n", + "# solely on the neighborhood's median income.\n", "\n", "# Shuffle the examples.\n", - "shuffled_train_df = train_df.reindex(np.random.permutation(train_df.index)) \n", + "shuffled_train_df = train_df.reindex(np.random.permutation(train_df.index))\n", "\n", "# Invoke the functions to build and train the model. 
Train on the shuffled\n", "# training set.\n", "my_model = build_model(learning_rate)\n", - "epochs, rmse, history = train_model(my_model, shuffled_train_df, my_feature, \n", - " my_label, epochs, batch_size, \n", + "epochs, rmse, history = train_model(my_model, shuffled_train_df, my_feature,\n", + " my_label, epochs, batch_size,\n", " validation_split)\n", "\n", - "plot_the_loss_curve(epochs, history[\"root_mean_squared_error\"], \n", + "plot_the_loss_curve(epochs, history[\"root_mean_squared_error\"],\n", " history[\"val_root_mean_squared_error\"])" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "tKN239_miW8C", - "colab_type": "text" + "id": "tKN239_miW8C" }, "source": [ "Experiment with `validation_split` to answer the following questions:\n", @@ -522,30 +489,27 @@ "cell_type": "code", "metadata": { "id": "-UAJ3Q86iz31", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Double-click for the answers to the questions\n", "\n", - "# Yes, after shuffling the original training set, \n", - "# the final loss for the training set and the \n", + "# Yes, after shuffling the original training set,\n", + "# the final loss for the training set and the\n", "# validation set become much closer.\n", "\n", "# If validation_split < 0.15,\n", "# the final loss values for the training set and\n", "# validation set diverge meaningfully. Apparently,\n", - "# the validation set no longer contains enough examples. " + "# the validation set no longer contains enough examples." ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "1PP-O8TOZOeo", - "colab_type": "text" + "id": "1PP-O8TOZOeo" }, "source": [ "## Task 4: Use the Test Dataset to Evaluate Your Model's Performance\n", @@ -556,9 +520,7 @@ { "cell_type": "code", "metadata": { - "id": "nd_Sw2cygOip", - "colab_type": "code", - "colab": {} + "id": "nd_Sw2cygOip" }, "source": [ "x_test = test_df[my_feature]\n", @@ -566,14 +528,13 @@ "\n", "results = my_model.evaluate(x_test, y_test, batch_size=batch_size)" ], - "execution_count": 0, + "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { - "id": "qoyQKvsjmV_A", - "colab_type": "text" + "id": "qoyQKvsjmV_A" }, "source": [ "Compare the root mean squared error of the model when evaluated on each of the three datasets:\n", @@ -589,18 +550,16 @@ "cell_type": "code", "metadata": { "id": "FxXtp-aVdIgJ", - "colab_type": "code", - "cellView": "form", - "colab": {} + "cellView": "form" }, "source": [ "#@title Double-click for an answer\n", "\n", - "# In our experiments, yes, the rmse values \n", - "# were similar enough. " + "# In our experiments, yes, the rmse values\n", + "# were similar enough." ], - "execution_count": 0, + "execution_count": null, "outputs": [] } ] -} +} \ No newline at end of file
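
A note on the one functional change in this patch, everything else being metadata cleanup (dropping the legacy `colab_type`/`colab` cell fields and `collapsed_sections`, resetting `execution_count` to `null`, and trimming trailing whitespace): in both notebooks, `build_model` now constructs its optimizer with `tf.keras.optimizers.RMSprop` instead of `tf.keras.optimizers.experimental.RMSprop`. The sketch below is not part of the patch; it simply extracts the patched function so the optimizer change can be sanity-checked in isolation. It assumes TensorFlow 2.x, where the stable `RMSprop` path exists and newer releases deprecate the `experimental` alias.

```python
import tensorflow as tf

def build_model(my_learning_rate):
    """Create and compile a simple linear regression model."""
    model = tf.keras.models.Sequential()

    # One node in one layer: the topography of simple linear regression.
    model.add(tf.keras.layers.Dense(units=1, input_shape=(1,)))

    # The patched line: stable RMSprop instead of the experimental alias.
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=my_learning_rate),
                  loss="mean_squared_error",
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])

    return model

# Quick check that the model compiles without the experimental namespace.
my_model = build_model(0.01)
print(type(my_model.optimizer).__name__)  # RMSprop
```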