{
 "nbformat": 4,
 "nbformat_minor": 0,
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Simple Neural Network Classifier in PyTorch\n",
    "\n",
    "## Overview\n",
    "\n",
    "In this exercise you will build, train, and evaluate a **binary classification neural network** using **PyTorch**.\n",
    "\n",
    "The business context is **customer purchase intention**: given a user's browsing behaviour on an e-commerce site, predict whether they will complete a purchase during that session.\n",
    "\n",
    "By the end of this notebook you will have practiced:\n",
    "- Loading and preprocessing a real-world tabular dataset\n",
    "- Encoding categorical features and normalising numerical ones\n",
    "- Building a feedforward neural network with `torch.nn.Module`\n",
    "- Writing a training loop with a loss function and optimiser\n",
    "- Evaluating model performance with accuracy, precision, recall, and F1-score\n",
    "- Visualising training progress\n"
   ],
   "metadata": {
    "id": "overview"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Dataset: Online Shoppers Purchasing Intention\n",
    "\n",
    "**Source:** [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset)\n",
    "\n",
    "The dataset contains 12,330 sessions collected from an e-commerce website. Each row represents one user session and includes:\n",
    "\n",
    "| Feature group | Examples |\n",
    "|---|---|\n",
    "| Page visit counts | `Administrative`, `Informational`, `ProductRelated` |\n",
    "| Time spent on pages | `Administrative_Duration`, `Informational_Duration`, `ProductRelated_Duration` |\n",
    "| Google Analytics metrics | `BounceRates`, `ExitRates`, `PageValues`, `SpecialDay` |\n",
    "| Session context | `Month`, `OperatingSystems`, `Browser`, `Region`, `TrafficType` |\n",
    "| Visitor type | `VisitorType` (Returning / New / Other) |\n",
    "| Timing flag | `Weekend` (Boolean) |\n",
    "\n",
    "**Target variable:** `Revenue` — `True` if the session ended in a purchase, `False` otherwise.\n",
    "\n",
    "The dataset is **imbalanced**: roughly 84 % of sessions do not result in a purchase.\n",
    "\n",
    "### Getting the data\n",
    "\n",
    "Run the cell below to download the CSV directly from the UCI repository."
   ],
   "metadata": {
    "id": "dataset_info"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "download_data"
   },
   "outputs": [],
   "source": [
    "import urllib.request\n",
    "import os\n",
    "\n",
    "url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv\"\n",
    "filename = \"online_shoppers_intention.csv\"\n",
    "\n",
    "if not os.path.exists(filename):\n",
    "    urllib.request.urlretrieve(url, filename)\n",
    "    print(\"Dataset downloaded.\")\n",
    "else:\n",
    "    print(\"Dataset already present.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Step 1 — Imports\n",
    "\n",
    "Import all libraries you will need for this exercise.\n",
    "\n",
    "**You will need at minimum:**\n",
    "- `pandas` and `numpy` for data handling\n",
    "- `sklearn` utilities: `train_test_split`, `StandardScaler`, and classification metrics\n",
    "- `torch`, `torch.nn`, `torch.optim`\n",
    "- `torch.utils.data`: `TensorDataset`, `DataLoader`\n",
    "- `matplotlib.pyplot` for plotting"
   ],
   "metadata": {
    "id": "step1_header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "imports"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.optim as optim\n",
    "from torch.utils.data import TensorDataset, DataLoader\n",
    "import matplotlib.pyplot as plt\n"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Step 2 — Load and Explore the Data\n",
    "\n",
    "Load the CSV file into a pandas DataFrame.\n",
    "\n",
    "**Tasks:**\n",
    "1. Load the dataset and display the first few rows.\n",
    "2. Check the shape, column names, and data types.\n",
    "3. Check for missing values.\n",
    "4. Inspect the class balance: how many sessions resulted in a purchase vs. not?\n"
   ],
   "metadata": {
    "id": "step2_header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "load_data"
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"online_shoppers_intention.csv\")\n",
    "print(df.head())\n",
    "print(f\"\\nShape: {df.shape}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "explore_data"
   },
   "outputs": [],
   "source": [
    "print(df.dtypes)\n",
    "print(\"\\nMissing values:\")\n",
    "print(df.isnull().sum())\n",
    "print(\"\\nClass balance:\")\n",
    "print(df['Revenue'].value_counts(normalize=True))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Step 3 — Preprocess the Data\n",
    "\n",
    "The dataset contains a mix of numerical and categorical columns. You need to prepare it before passing it to a neural network.\n",
    "\n",
    "**Tasks:**\n",
    "1. **Encode categorical columns.** The columns `Month`, `VisitorType`, and `Weekend` are not numeric. Use one-hot encoding (e.g., `pd.get_dummies`) or label encoding as appropriate. Drop the original columns after encoding.\n",
    "2. **Encode the target.** Convert the `Revenue` column (boolean) to integer (0/1) and separate it from the features.\n",
    "3. **Split the data** into training and test sets (80 % / 20 %). Use `random_state=42` and `stratify=y` to preserve class proportions.\n",
    "4. **Normalise numerical features** using `StandardScaler`. Fit the scaler on the training set only, then transform both train and test sets.\n",
    "\n",
 **Why stratify?**">
    "> **Why stratify?** The dataset is imbalanced. Stratified splitting keeps the proportion of positive examples in the train and test sets approximately equal to that of the full dataset."
   ],
   "metadata": {
    "id": "step3_header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "encode_categoricals"
   },
   "outputs": [],
   "source": [
    "# One-hot encode categorical columns\n",
    "df_encoded = pd.get_dummies(df, columns=['Month', 'VisitorType'], drop_first=False)\n",
    "df_encoded['Weekend'] = df_encoded['Weekend'].astype(int)\n",
    "\n",
    "# Encode target\n",
    "y = df_encoded['Revenue'].astype(int).values\n",
    "X = df_encoded.drop(columns=['Revenue']).values.astype(np.float32)\n",
    "\n",
    "print(f\"Features shape: {X.shape}\")\n",
    "print(f\"Target shape: {y.shape}\")\n",
    "print(f\"Feature columns: {df_encoded.drop(columns=['Revenue']).columns.tolist()}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "split_and_scale"
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42, stratify=y\n",
    ")\n",
    "\n",
    "scaler = StandardScaler()\n",
    "X_train = scaler.fit_transform(X_train)\n",
    "X_test = scaler.transform(X_test)\n",
    "\n",
    "print(f\"Train size: {X_train.shape}, Test size: {X_test.shape}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Step 4 — Create PyTorch Datasets and DataLoaders\n",
    "\n",
    "PyTorch models consume data through `DataLoader` objects, which handle batching and shuffling automatically.\n",
    "\n",
    "**Tasks:**\n",
    "1. Convert your NumPy arrays `X_train`, `X_test`, `y_train`, `y_test` to `torch.FloatTensor` (features) and `torch.FloatTensor` (labels — keep as float for `BCELoss` compatibility).\n",
    "2. Wrap each pair into a `TensorDataset`.\n",
    "3. Create a `DataLoader` for the training set with `batch_size=64` and `shuffle=True`, and one for the test set with `shuffle=False`.\n",
    "4. Print the number of batches in each loader to verify."
   ],
   "metadata": {
    "id": "step4_header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "create_dataloaders"
   },
   "outputs": [],
   "source": [
    "X_train_t = torch.FloatTensor(X_train)\n",
    "X_test_t  = torch.FloatTensor(X_test)\n",
    "y_train_t = torch.FloatTensor(y_train)\n",
    "y_test_t  = torch.FloatTensor(y_test)\n",
    "\n",
    "train_ds = TensorDataset(X_train_t, y_train_t)\n",
    "test_ds  = TensorDataset(X_test_t,  y_test_t)\n",
    "\n",
    "train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)\n",
    "test_loader  = DataLoader(test_ds,  batch_size=64, shuffle=False)\n",
    "\n",
    "print(f\"Train batches: {len(train_loader)}, Test batches: {len(test_loader)}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Step 5 — Define the Neural Network\n",
    "\n",
    "Build a feedforward neural network by subclassing `torch.nn.Module`.\n",
    "\n",
    "**Architecture requirements:**\n",
    "- **Input layer:** size equal to the number of features after preprocessing.\n",
    "- **Hidden layer 1:** 64 neurons, ReLU activation.\n",
    "- **Hidden layer 2:** 32 neurons, ReLU activation.\n",
    "- **Output layer:** 1 neuron, Sigmoid activation (outputs a probability between 0 and 1).\n",
    "\n",
    "**Tasks:**\n",
    "1. Define the class `CustomerClassifier(nn.Module)` with an `__init__` method that builds the layers and a `forward` method that defines the data flow.\n",
    "2. Instantiate the model and print it to verify the architecture.\n",
    "\n",
    "> **Tip:** `nn.Sequential` can help you chain layers cleanly inside `__init__`."
   ],
   "metadata": {
    "id": "step5_header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "define_model"
   },
   "outputs": [],
   "source": [
    "input_dim = X_train.shape[1]\n",
    "\n",
    "class CustomerClassifier(nn.Module):\n",
    "    def __init__(self, input_dim):\n",
    "        super().__init__()\n",
    "        self.network = nn.Sequential(\n",
    "            nn.Linear(input_dim, 64),\n",
    "            nn.ReLU(),\n",
    "            nn.Linear(64, 32),\n",
    "            nn.ReLU(),\n",
    "            nn.Linear(32, 1),\n",
    "            nn.Sigmoid()\n",
    "        )\n",
    "\n",
    "    def forward(self, x):\n",
    "        return self.network(x).squeeze(1)\n",
    "\n",
    "model = CustomerClassifier(input_dim)\n",
    "print(model)\n",
    "print(f\"\\nInput dimension: {input_dim}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Step 6 — Define Loss Function and Optimiser\n",
    "\n",
    "For binary classification with a sigmoid output, the standard choice is **Binary Cross-Entropy loss**.\n",
    "\n",
    "**Tasks:**\n",
    "1. Instantiate `nn.BCELoss()` as your loss function.\n",
    "2. Instantiate `torch.optim.Adam` with a learning rate of `0.001` as your optimiser, passing `model.parameters()`.\n",
    "\n",
 **Why Adam?**">
    "> **Why Adam?** Adam adapts the step size for each parameter from running estimates of the gradient's mean and variance, and it often converges with less learning-rate tuning than plain SGD."
   ],
   "metadata": {
    "id": "step6_header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "loss_optimizer"
   },
   "outputs": [],
   "source": [
    "criterion = nn.BCELoss()\n",
    "optimizer = optim.Adam(model.parameters(), lr=0.001)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Step 7 — Write the Training Loop\n",
    "\n",
    "Train the model for **30 epochs**. For each epoch you should:\n",
    "\n",
    "1. Set the model to training mode with `model.train()`.\n",
    "2. Iterate over batches from `train_loader`.\n",
    "3. For each batch:\n",
    "   - Zero the gradients with `optimizer.zero_grad()`.\n",
    "   - Run a **forward pass** to get predictions.\n",
    "   - Compute the **loss**.\n",
    "   - Run a **backward pass** with `loss.backward()`.\n",
    "   - Update the weights with `optimizer.step()`.\n",
    "4. After all batches, record the average epoch training loss.\n",
    "5. Run a **validation pass** (no gradient computation) over `test_loader` and record the average test loss.\n",
    "6. Print the losses every 5 epochs.\n",
    "\n",
    "Store training and test losses in lists so you can plot them in the next step."
   ],
   "metadata": {
    "id": "step7_header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "training_loop"
   },
   "outputs": [],
   "source": [
    "NUM_EPOCHS = 30\n",
    "train_losses = []\n",
    "test_losses  = []\n",
    "\n",
    "for epoch in range(1, NUM_EPOCHS + 1):\n",
    "    model.train()\n",
    "    epoch_loss = 0.0\n",
    "    for X_batch, y_batch in train_loader:\n",
    "        optimizer.zero_grad()\n",
    "        preds = model(X_batch)\n",
    "        loss  = criterion(preds, y_batch)\n",
    "        loss.backward()\n",
    "        optimizer.step()\n",
    "        epoch_loss += loss.item() * len(X_batch)\n",
    "    train_losses.append(epoch_loss / len(train_loader.dataset))\n",
    "\n",
    "    model.eval()\n",
    "    val_loss = 0.0\n",
    "    with torch.no_grad():\n",
    "        for X_batch, y_batch in test_loader:\n",
    "            preds    = model(X_batch)\n",
    "            val_loss += criterion(preds, y_batch).item() * len(X_batch)\n",
    "    test_losses.append(val_loss / len(test_loader.dataset))\n",
    "\n",
    "    if epoch % 5 == 0:\n",
    "        print(f\"Epoch {epoch:3d} | Train Loss: {train_losses[-1]:.4f} | Test Loss: {test_losses[-1]:.4f}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Step 8 — Visualise Training Progress\n",
    "\n",
    "Plot the training and test loss curves over epochs.\n",
    "\n",
    "**Tasks:**\n",
    "1. Create a line plot with epochs on the x-axis and loss on the y-axis.\n",
    "2. Show both training loss and test loss on the same plot with a legend.\n",
    "3. Add axis labels and a title.\n",
    "\n",
    "> **What to look for:** If training loss decreases but test loss plateaus or rises, the model may be overfitting."
   ],
   "metadata": {
    "id": "step8_header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "plot_losses"
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(8, 4))\n",
    "plt.plot(range(1, NUM_EPOCHS + 1), train_losses, label='Train Loss')\n",
    "plt.plot(range(1, NUM_EPOCHS + 1), test_losses,  label='Test Loss')\n",
    "plt.xlabel(\"Epoch\")\n",
    "plt.ylabel(\"Loss\")\n",
    "plt.title(\"Training and Test Loss over Epochs\")\n",
    "plt.legend()\n",
    "plt.grid(alpha=0.3)\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Step 9 — Evaluate the Model\n",
    "\n",
    "A single accuracy number is not enough for an imbalanced dataset. Evaluate using a full set of metrics.\n",
    "\n",
    "**Tasks:**\n",
    "1. Set the model to evaluation mode with `model.eval()`.\n",
    "2. Run inference over the entire test set (disable gradient tracking with `torch.no_grad()`).\n",
    "3. Convert predicted probabilities to binary labels using a threshold of 0.5.\n",
    "4. Compute and print:\n",
    "   - **Accuracy**\n",
    "   - **Precision**\n",
    "   - **Recall**\n",
    "   - **F1-score**\n",
    "5. Print a **classification report** using `sklearn.metrics.classification_report`.\n",
    "\n",
    "> **Tip:** Use `sklearn.metrics` functions. Remember to move tensors to CPU and convert to NumPy before passing to sklearn."
   ],
   "metadata": {
    "id": "step9_header"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "evaluate"
   },
   "outputs": [],
   "source": [
    "model.eval()\n",
    "all_preds = []\n",
    "all_true  = []\n",
    "\n",
    "with torch.no_grad():\n",
    "    for X_batch, y_batch in test_loader:\n",
    "        probs = model(X_batch)\n",
    "        preds = (probs >= 0.5).float()\n",
    "        all_preds.append(preds.cpu().numpy())\n",
    "        all_true.append(y_batch.cpu().numpy())\n",
    "\n",
    "all_preds = np.concatenate(all_preds)\n",
    "all_true  = np.concatenate(all_true)\n",
    "\n",
    "acc  = accuracy_score(all_true, all_preds)\n",
    "prec = precision_score(all_true, all_preds)\n",
    "rec  = recall_score(all_true, all_preds)\n",
    "f1   = f1_score(all_true, all_preds)\n",
    "\n",
    "print(f\"Accuracy : {acc:.4f}\")\n",
    "print(f\"Precision: {prec:.4f}\")\n",
    "print(f\"Recall   : {rec:.4f}\")\n",
    "print(f\"F1-Score : {f1:.4f}\")\n",
    "print(\"\\nClassification Report:\")\n",
    "print(classification_report(all_true, all_preds, target_names=['No Purchase', 'Purchase']))\n"
   ]
  }
 ]
}