{
 "nbformat": 4,
 "nbformat_minor": 0,
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "title"
   },
   "source": [
    "# Regularisation and Optimisation: House Price Regression\n",
    "\n",
    "## Overview\n",
    "\n",
    "In this exercise you will build a **regression neural network** in PyTorch to predict median house prices from census data. Along the way you will apply and compare the regularisation and optimisation techniques introduced in the lecture.\n",
    "\n",
    "**By the end of this exercise you will have practiced:**\n",
    "- Building a regression neural network (no sigmoid output — continuous prediction)\n",
    "- Applying dropout and L2 weight decay to reduce overfitting\n",
    "- Comparing three optimisers: SGD, Adam, and RMSprop\n",
    "- Evaluating regression quality with MSE and MAE\n",
    "- Visualising predictions against ground truth\n",
    "\n",
    "---\n",
    "\n",
    "## Dataset: California Housing\n",
    "\n",
    "The **California Housing dataset** is built into scikit-learn and requires no download. It contains 20,640 census block groups from the 1990 California census.\n",
    "\n",
    "| Feature | Description |\n",
    "|---|---|\n",
    "| `MedInc` | Median income in the block group (in tens of thousands of USD) |\n",
    "| `HouseAge` | Median house age in the block group |\n",
    "| `AveRooms` | Average number of rooms per household |\n",
    "| `AveBedrms` | Average number of bedrooms per household |\n",
    "| `Population` | Block group population |\n",
    "| `AveOccup` | Average household occupancy |\n",
    "| `Latitude` | Block group latitude |\n",
    "| `Longitude` | Block group longitude |\n",
    "\n",
    "**Target:** `MedHouseVal` — median house value in hundreds of thousands of USD.\n",
    "\n",
    "### Key characteristics to keep in mind\n",
    "- `Population` and `AveOccup` contain extreme outliers (some block groups have very unusual values). This makes feature scaling especially important.\n",
    "- `Latitude` and `Longitude` encode spatial structure (coastal vs inland, north vs south). A neural network can learn non-linear combinations of these, unlike linear models.\n",
    "- The target is **capped at 5.0** — the original dataset clips very high-value properties. This means the model will underestimate top-tier prices slightly, which is expected."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "step1_header"
   },
   "source": [
    "## Step 1 — Imports\n",
    "\n",
    "Import everything you will need upfront. For this exercise you need:\n",
    "- `numpy`, `pandas`, `matplotlib.pyplot`\n",
    "- From `sklearn`: `fetch_california_housing`, `train_test_split`, `StandardScaler`\n",
    "- From `sklearn.metrics`: `mean_squared_error`, `mean_absolute_error`\n",
    "- `torch`, `torch.nn`, `torch.optim`\n",
    "- From `torch.utils.data`: `TensorDataset`, `DataLoader`\n",
    "\n",
    "Also fix random seeds for reproducibility: `torch.manual_seed(42)` and `np.random.seed(42)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "imports"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.optim as optim\n",
    "from torch.utils.data import TensorDataset, DataLoader\n",
    "\n",
    "torch.manual_seed(42)\n",
    "np.random.seed(42)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "step2_header"
   },
   "source": [
    "## Step 2 — Load and Explore the Data\n",
    "\n",
    "Load the dataset using `fetch_california_housing(as_frame=True)` from scikit-learn. This returns a `Bunch` object — access `.frame` to get a single pandas DataFrame with both features and target.\n",
    "\n",
    "**Tasks:**\n",
    "1. Load the dataset and display the first few rows.\n",
    "2. Print the shape and check for missing values.\n",
    "3. Print summary statistics (`.describe()`). Pay attention to `Population` and `AveOccup` — do their max values look unusual compared to the mean?\n",
    "4. Plot a histogram of the target column `MedHouseVal`. Notice the spike at 5.0 — this is the cap described above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "load_data"
   },
   "outputs": [],
   "source": [
    "housing = fetch_california_housing(as_frame=True)\n",
    "df = housing.frame\n",
    "print(df.head())\n",
    "print(f\"\\nShape: {df.shape}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "explore"
   },
   "outputs": [],
   "source": [
    "print(f\"Missing values:\\n{df.isnull().sum()}\")\n",
    "print(f\"\\nSummary statistics:\")\n",
    "print(df.describe())\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "target_hist"
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(8, 4))\n",
    "df['MedHouseVal'].hist(bins=50)\n",
    "plt.xlabel('MedHouseVal')\n",
    "plt.ylabel('Count')\n",
    "plt.title('Distribution of Median House Value (note spike at 5.0 cap)')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "step3_header"
   },
   "source": [
    "## Step 3 — Preprocess the Data\n",
    "\n",
    "### Feature scaling\n",
    "\n",
    "All eight features are numeric, so no encoding is needed. However, they vary wildly in scale:\n",
    "- `MedInc` ranges from roughly 0.5 to 15.\n",
    "- `Population` ranges from 3 to over 35,000.\n",
    "\n",
    "Without scaling, the gradient updates for parameters connected to `Population` will dominate those connected to smaller-scale features, destabilising training. Use **`StandardScaler`** (zero mean, unit variance).\n",
    "\n",
    "> **Important:** fit the scaler only on training features, then apply the fitted scaler to the test set. Fitting on the full dataset would leak test-set statistics into training.\n",
    "\n",
    "**Tasks:**\n",
    "1. Separate features (`X`) and target (`y`) as NumPy arrays.\n",
    "2. Split into train (80 %) and test (20 %) sets with `random_state=42`.\n",
    "3. Fit `StandardScaler` on `X_train` and transform both `X_train` and `X_test`.\n",
    "4. Convert all four arrays to `torch.FloatTensor`.\n",
    "5. Wrap into `TensorDataset` objects and create `DataLoader`s with `batch_size=64`, `shuffle=True` for train and `shuffle=False` for test.\n",
    "6. Print the number of training and test samples to confirm the split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "preprocess"
   },
   "outputs": [],
   "source": [
    "X = df.drop(columns=['MedHouseVal']).values.astype(np.float32)\n",
    "y = df['MedHouseVal'].values.astype(np.float32).reshape(-1, 1)\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "print(f\"Train: {X_train.shape}, Test: {X_test.shape}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "split_scale"
   },
   "outputs": [],
   "source": [
    "scaler = StandardScaler()\n",
    "X_train = scaler.fit_transform(X_train)\n",
    "X_test  = scaler.transform(X_test)\n",
    "\n",
    "X_train_t = torch.FloatTensor(X_train)\n",
    "X_test_t  = torch.FloatTensor(X_test)\n",
    "y_train_t = torch.FloatTensor(y_train)\n",
    "y_test_t  = torch.FloatTensor(y_test)\n",
    "\n",
    "train_ds = TensorDataset(X_train_t, y_train_t)\n",
    "test_ds  = TensorDataset(X_test_t,  y_test_t)\n",
    "\n",
    "train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)\n",
    "test_loader  = DataLoader(test_ds,  batch_size=64, shuffle=False)\n",
    "\n",
    "print(f\"Training samples: {len(train_ds)}, Test samples: {len(test_ds)}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "step4_header"
   },
   "source": [
    "## Step 4 — Define the Neural Network\n",
    "\n",
    "### Regression vs classification — the key difference\n",
    "\n",
    "In the classification exercise, the output layer used a **Sigmoid** activation to squash the output to (0, 1). For regression we want an **unbounded continuous output**, so the output layer has **no activation function** (or equivalently, a linear activation). The model outputs a raw number that we interpret directly as the predicted house price.\n",
    "\n",
    "### Architecture requirements\n",
    "\n",
    "```\n",
    "Input (8 features)\n",
    "    → Linear(8  → 64)  → ReLU → Dropout(p)\n",
    "    → Linear(64 → 32)  → ReLU → Dropout(p)\n",
    "    → Linear(32 → 1)           ← no activation\n",
    "```\n",
    "\n",
    "**Tasks:**\n",
    "1. Define a class `HousePriceNN(nn.Module)` with:\n",
    "   - `__init__(self, dropout_rate=0.0, l1_lambda=0.0)` — builds the layers and stores `l1_lambda`.\n",
    "   - `forward(self, x)` — runs the forward pass.\n",
    "   - `l1_penalty(self)` — computes the L1 weight penalty over weight matrices only (skip biases). Return `torch.tensor(0.0)` when `l1_lambda == 0` so the method is always safe to add to the loss.\n",
    "2. Instantiate the model with default settings and print it.\n",
    "3. Count and print the total number of trainable parameters.\n",
    "\n",
    "> **Tip:** use `model.named_parameters()` in `l1_penalty` and filter for names containing `\"weight\"` to exclude biases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "model_def"
   },
   "outputs": [],
   "source": [
    "class HousePriceNN(nn.Module):\n",
    "    def __init__(self, dropout_rate=0.0, l1_lambda=0.0):\n",
    "        super().__init__()\n",
    "        self.l1_lambda = l1_lambda\n",
    "        self.network = nn.Sequential(\n",
    "            nn.Linear(8, 64),\n",
    "            nn.ReLU(),\n",
    "            nn.Dropout(dropout_rate),\n",
    "            nn.Linear(64, 32),\n",
    "            nn.ReLU(),\n",
    "            nn.Dropout(dropout_rate),\n",
    "            nn.Linear(32, 1)\n",
    "        )\n",
    "\n",
    "    def forward(self, x):\n",
    "        return self.network(x)\n",
    "\n",
    "    def l1_penalty(self):\n",
    "        if self.l1_lambda == 0:\n",
    "            return torch.tensor(0.0)\n",
    "        penalty = sum(p.abs().sum() for name, p in self.named_parameters() if 'weight' in name)\n",
    "        return self.l1_lambda * penalty\n",
    "\n",
    "model = HousePriceNN()\n",
    "print(model)\n",
    "total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
    "print(f\"\\nTrainable parameters: {total_params:,}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "step5_header"
   },
   "source": [
    "## Step 5 — Training Utility\n",
    "\n",
    "Write a reusable `run_experiment` function so you can run multiple configurations cleanly without copy-pasting the training loop.\n",
    "\n",
    "**Function signature:**\n",
    "```python\n",
    "def run_experiment(label, train_loader, test_loader,\n",
    "                   dropout_rate=0.0, l1_lambda=0.0, l2_lambda=0.0,\n",
    "                   optimizer_name=\"adam\", lr=1e-3, epochs=50):\n",
    "```\n",
    "\n",
    "**The function should:**\n",
    "1. Create a fresh `HousePriceNN` with the given `dropout_rate` and `l1_lambda`.\n",
    "2. Instantiate the chosen optimiser:\n",
    "   - `\"adam\"` → `optim.Adam(..., weight_decay=l2_lambda)`\n",
    "   - `\"sgd\"` → `optim.SGD(..., momentum=0.9, weight_decay=l2_lambda)`\n",
    "   - `\"rmsprop\"` → `optim.RMSprop(..., weight_decay=l2_lambda)`\n",
    "3. Use `nn.MSELoss()` as the loss criterion.\n",
    "4. Run a training loop for the given number of epochs:\n",
    "   - Each batch: `zero_grad → forward → loss + l1_penalty → backward → step`\n",
    "   - Track average **train MSE** and **test MSE** per epoch.\n",
    "   - For test loss: use `model.eval()` and `torch.no_grad()`. Evaluate on **raw MSE only** (no penalty) for a fair cross-configuration comparison.\n",
    "5. After training, compute **test MAE** and **test RMSE**.\n",
    "6. Print a one-line summary: label, final train MSE, test MSE, test MAE.\n",
    "7. Return a dict with keys: `label`, `train`, `test`, `mae`, `rmse`.\n",
    "\n",
    "> **Why MSE for training but report RMSE?** MSE is the loss we optimise (it is differentiable and penalises large errors strongly). RMSE is reported because it is in the same units as the target (hundreds of thousands of USD), making it more interpretable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "train_fn"
   },
   "outputs": [],
   "source": [
    "def run_experiment(label, train_loader, test_loader,\n",
    "                   dropout_rate=0.0, l1_lambda=0.0, l2_lambda=0.0,\n",
    "                   optimizer_name=\"adam\", lr=1e-3, epochs=50):\n",
    "    model = HousePriceNN(dropout_rate=dropout_rate, l1_lambda=l1_lambda)\n",
    "    criterion = nn.MSELoss()\n",
    "\n",
    "    if optimizer_name == \"adam\":\n",
    "        optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=l2_lambda)\n",
    "    elif optimizer_name == \"sgd\":\n",
    "        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=l2_lambda)\n",
    "    elif optimizer_name == \"rmsprop\":\n",
    "        optimizer = optim.RMSprop(model.parameters(), lr=lr, weight_decay=l2_lambda)\n",
    "    else:\n",
    "        raise ValueError(f\"Unknown optimizer: {optimizer_name}\")\n",
    "\n",
    "    train_losses = []\n",
    "    test_losses  = []\n",
    "\n",
    "    for epoch in range(1, epochs + 1):\n",
    "        model.train()\n",
    "        running = 0.0\n",
    "        for xb, yb in train_loader:\n",
    "            optimizer.zero_grad()\n",
    "            pred = model(xb)\n",
    "            loss = criterion(pred, yb) + model.l1_penalty()\n",
    "            loss.backward()\n",
    "            optimizer.step()\n",
    "            running += criterion(pred, yb).item() * len(xb)\n",
    "        train_losses.append(running / len(train_loader.dataset))\n",
    "\n",
    "        model.eval()\n",
    "        running = 0.0\n",
    "        with torch.no_grad():\n",
    "            for xb, yb in test_loader:\n",
    "                pred = model(xb)\n",
    "                running += criterion(pred, yb).item() * len(xb)\n",
    "        test_losses.append(running / len(test_loader.dataset))\n",
    "\n",
    "    # Final metrics\n",
    "    model.eval()\n",
    "    all_preds, all_true = [], []\n",
    "    with torch.no_grad():\n",
    "        for xb, yb in test_loader:\n",
    "            all_preds.append(model(xb).numpy())\n",
    "            all_true.append(yb.numpy())\n",
    "    preds = np.concatenate(all_preds).flatten()\n",
    "    truth = np.concatenate(all_true).flatten()\n",
    "    mae  = mean_absolute_error(truth, preds)\n",
    "    rmse = np.sqrt(mean_squared_error(truth, preds))\n",
    "\n",
    "    print(f\"{label:30s} | Train MSE: {train_losses[-1]:.4f} | Test MSE: {test_losses[-1]:.4f} | MAE: {mae:.4f}\")\n",
    "\n",
    "    return {\n",
    "        \"label\": label,\n",
    "        \"train\": train_losses,\n",
    "        \"test\":  test_losses,\n",
    "        \"mae\":   mae,\n",
    "        \"rmse\":  rmse,\n",
    "        \"model\": model\n",
    "    }\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "step6_header"
   },
   "source": [
    "## Step 6 — Experiment A: Baseline vs Regularisation\n",
    "\n",
    "Run four configurations using **Adam** at `lr=1e-3` for **50 epochs**, varying only the regularisation:\n",
    "\n",
    "| Config | Dropout | L1 | L2 (weight decay) |\n",
    "|---|---|---|---|\n",
    "| Baseline | — | — | — |\n",
    "| Dropout | 0.3 | — | — |\n",
    "| L1 | — | 1e-4 | — |\n",
    "| L2 | — | — | 1e-4 |\n",
    "\n",
    "After running all four, write a helper function `plot_loss_curves(results, title)` that:\n",
    "- Creates one subplot per configuration (1 row, 4 columns, shared y-axis).\n",
    "- Plots train loss (solid) and test loss (dashed) on each subplot.\n",
    "- Shades the area between the two curves — a larger shaded area signals more overfitting.\n",
    "- Labels each subplot with the configuration name and its final test RMSE.\n",
    "- Sets a shared y-axis label (`\"MSE Loss\"`) and a figure title.\n",
    "\n",
    "**What to look for:** Which configuration shows the smallest gap between training and test loss? Does any regularisation method hurt performance (test MSE rises)?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "expA_run"
   },
   "outputs": [],
   "source": [
    "results_a = []\n",
    "results_a.append(run_experiment(\"Baseline\",          train_loader, test_loader))\n",
    "results_a.append(run_experiment(\"Dropout(0.3)\",      train_loader, test_loader, dropout_rate=0.3))\n",
    "results_a.append(run_experiment(\"L1(1e-4)\",          train_loader, test_loader, l1_lambda=1e-4))\n",
    "results_a.append(run_experiment(\"L2(1e-4)\",          train_loader, test_loader, l2_lambda=1e-4))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "expA_plot"
   },
   "outputs": [],
   "source": [
    "def plot_loss_curves(results, title):\n",
    "    n = len(results)\n",
    "    fig, axes = plt.subplots(1, n, figsize=(4*n, 4), sharey=True)\n",
    "    for ax, r in zip(axes, results):\n",
    "        epochs = range(1, len(r['train']) + 1)\n",
    "        ax.plot(epochs, r['train'], label='train')\n",
    "        ax.plot(epochs, r['test'],  '--', label='val')\n",
    "        ax.fill_between(epochs, r['train'], r['test'], alpha=0.15)\n",
    "        ax.set_title(f\"{r['label']}\\nRMSE={r['rmse']:.3f}\")\n",
    "        ax.set_xlabel(\"Epoch\")\n",
    "        ax.legend(fontsize=8)\n",
    "        ax.grid(alpha=0.3)\n",
    "    axes[0].set_ylabel(\"MSE Loss\")\n",
    "    fig.suptitle(title)\n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "\n",
    "plot_loss_curves(results_a, \"Experiment A: Baseline vs Regularisation\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "step7_header"
   },
   "source": [
    "## Step 7 — Experiment B: Optimiser Comparison\n",
    "\n",
    "Now fix regularisation at **Dropout=0.3** (the most general method) and compare three optimisers at two learning rates each:\n",
    "\n",
    "| Config | Optimiser | LR |\n",
    "|---|---|---|\n",
    "| Adam lr=1e-3 | Adam | 0.001 |\n",
    "| Adam lr=1e-4 | Adam | 0.0001 |\n",
    "| SGD lr=1e-2 | SGD | 0.01 |\n",
    "| SGD lr=1e-3 | SGD | 0.001 |\n",
    "| RMSprop lr=1e-3 | RMSprop | 0.001 |\n",
    "| RMSprop lr=1e-4 | RMSprop | 0.0001 |\n",
    "\n",
    "Run all six and plot them.\n",
    "\n",
    "**What to look for:**\n",
    "- Does SGD at `lr=1e-2` converge, oscillate, or diverge?\n",
    "- How does RMSprop compare to Adam at the same learning rate?\n",
    "- Which optimiser reaches the lowest test MSE fastest (fewest epochs)?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "expB_run"
   },
   "outputs": [],
   "source": [
    "results_b = []\n",
    "configs_b = [\n",
    "    (\"Adam lr=1e-3\",    \"adam\",    1e-3),\n",
    "    (\"Adam lr=1e-4\",    \"adam\",    1e-4),\n",
    "    (\"SGD lr=1e-2\",     \"sgd\",     1e-2),\n",
    "    (\"SGD lr=1e-3\",     \"sgd\",     1e-3),\n",
    "    (\"RMSprop lr=1e-3\", \"rmsprop\", 1e-3),\n",
    "    (\"RMSprop lr=1e-4\", \"rmsprop\", 1e-4),\n",
    "]\n",
    "for label, opt, lr in configs_b:\n",
    "    results_b.append(run_experiment(label, train_loader, test_loader, dropout_rate=0.3, optimizer_name=opt, lr=lr))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "expB_plot"
   },
   "outputs": [],
   "source": [
    "plot_loss_curves(results_b, \"Experiment B: Optimiser Comparison (Dropout=0.3)\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "step8_header"
   },
   "source": [
    "## Step 8 — Summary Table\n",
    "\n",
    "Collect all results from both experiments into a single pandas DataFrame with columns:\n",
    "- `Configuration`\n",
    "- `Final Train MSE`\n",
    "- `Final Test MSE`\n",
    "- `Train-Test Gap` (test MSE − train MSE)\n",
    "- `Test MAE`\n",
    "- `Test RMSE`\n",
    "\n",
    "Print the table. Then print two lines:\n",
    "- Which configuration achieved the smallest train-test gap.\n",
    "- Which configuration achieved the lowest test RMSE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "summary"
   },
   "outputs": [],
   "source": [
    "all_results = results_a + results_b\n",
    "summary = pd.DataFrame([{\n",
    "    \"Configuration\":    r[\"label\"],\n",
    "    \"Final Train MSE\":  r[\"train\"][-1],\n",
    "    \"Final Test MSE\":   r[\"test\"][-1],\n",
    "    \"Train-Test Gap\":   r[\"test\"][-1] - r[\"train\"][-1],\n",
    "    \"Test MAE\":         r[\"mae\"],\n",
    "    \"Test RMSE\":        r[\"rmse\"],\n",
    "} for r in all_results])\n",
    "print(summary.to_string(index=False))\n",
    "\n",
    "best_gap  = summary.loc[summary[\"Train-Test Gap\"].abs().idxmin(), \"Configuration\"]\n",
    "best_rmse = summary.loc[summary[\"Test RMSE\"].idxmin(), \"Configuration\"]\n",
    "print(f\"\\nSmallest train-test gap: {best_gap}\")\n",
    "print(f\"Lowest test RMSE:        {best_rmse}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "step9_header"
   },
   "source": [
    "## Step 9 — Visualise Predictions\n",
    "\n",
    "Take the best-performing configuration from your summary table and produce a **scatter plot of predicted vs actual house prices** on the test set.\n",
    "\n",
    "**Tasks:**\n",
    "1. Retrain (or reuse) the best model.\n",
    "2. Run inference on the full test set to collect predicted values.\n",
    "3. Create a scatter plot:\n",
    "   - x-axis: actual `MedHouseVal`\n",
    "   - y-axis: predicted `MedHouseVal`\n",
    "   - Draw a diagonal dashed line representing perfect prediction (`y = x`). Points close to this line indicate good predictions.\n",
    "   - Colour points by prediction error magnitude (use `c=abs(pred - actual)` and a diverging colormap like `\"coolwarm\"`).\n",
    "4. Add axis labels, a title, and a colourbar.\n",
    "\n",
    "> **What to look for:** Is the model systematically under- or over-predicting? Notice the cluster of points at actual value 5.0 — this is the cap in the data. The model likely underestimates these, which is expected and not a modelling failure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "scatter_plot"
   },
   "outputs": [],
   "source": [
    "best_result = all_results[summary[\"Test RMSE\"].idxmin()]\n",
    "best_model  = best_result[\"model\"]\n",
    "\n",
    "best_model.eval()\n",
    "preds_list, truth_list = [], []\n",
    "with torch.no_grad():\n",
    "    for xb, yb in test_loader:\n",
    "        preds_list.append(best_model(xb).numpy())\n",
    "        truth_list.append(yb.numpy())\n",
    "\n",
    "preds = np.concatenate(preds_list).flatten()\n",
    "truth = np.concatenate(truth_list).flatten()\n",
    "errors = np.abs(preds - truth)\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(7, 6))\n",
    "sc = ax.scatter(truth, preds, c=errors, cmap='coolwarm', alpha=0.4, s=10)\n",
    "lim = [min(truth.min(), preds.min()), max(truth.max(), preds.max())]\n",
    "ax.plot(lim, lim, 'k--', lw=1.5, label='Perfect prediction')\n",
    "plt.colorbar(sc, ax=ax, label='|error|')\n",
    "ax.set_xlabel(\"Actual MedHouseVal\")\n",
    "ax.set_ylabel(\"Predicted MedHouseVal\")\n",
    "ax.set_title(f\"Predicted vs Actual — {best_result['label']}\")\n",
    "ax.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "reflection_space"
   },
   "source": [
    "## Answers to \"What to look for\" questions\n",
    "\n",
    "**Experiment A — Baseline vs Regularisation**\n",
    "\n",
    "- **Baseline** has the largest train–test gap: training loss drops steadily but the test curve flattens earlier, a clear sign of mild overfitting.\n",
    "- **Dropout (0.3)** reduces the gap the most: randomising neuron activation during training prevents co-adaptation and generalises better.\n",
    "- **L2 (weight decay 1e-4)** produces a similar effect to Dropout but acts directly on parameter magnitudes; it shows a slightly tighter gap than baseline with minimal performance loss.\n",
    "- **L1 (1e-4)** also closes the gap but can push some weights to near-zero, creating a sparser model; on this dataset the improvement is modest compared to Dropout.\n",
    "\n",
    "**Experiment B — Optimiser Comparison (Dropout=0.3)**\n",
    "\n",
    "- **SGD at lr=1e-2** either diverges or oscillates wildly in the first few epochs — the learning rate is too large for vanilla SGD without a momentum warm-up on this scale of data.\n",
    "- **SGD at lr=1e-3** converges but much more slowly than Adam or RMSprop; the curve is noisier and typically does not reach as low a test MSE within 50 epochs.\n",
    "- **RMSprop** at both learning rates is competitive with Adam and often converges in a similar number of epochs, reflecting that both methods adapt per-parameter learning rates.\n",
    "- **Adam at lr=1e-3** reaches the lowest test MSE the fastest and is the most stable across all configurations tried.\n",
    "\n",
    "**Step 9 — Predictions vs Ground Truth**\n",
    "\n",
    "- The model **underestimates** high-value houses: the scatter points above actual = 4.0 cluster noticeably below the perfect-prediction diagonal.\n",
    "- This is expected and not a modelling failure — the target is capped at 5.0, so every house truly worth more than $500k is labelled 5.0. The model has no signal to predict beyond that ceiling.\n",
    "- Predictions for mid-range values (1.0–3.5) are tightest around the diagonal, showing the model is most reliable in the middle of the distribution."
   ]
  }
 ]
}