{
 "nbformat": 4,
 "nbformat_minor": 0,
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wJPs-Mb4Lb2v"
   },
   "source": [
    "## Sales Forecasting with LSTM**\n",
    "\n",
    "### **Objective**\n",
    "Develop an LSTM model to predict future sales while capturing seasonal and trend components.\n",
    "\n",
    "### **Step 1: Dataset Selection**\n",
    "This notebook is inspired by the [Predict Future Sales](https://www.kaggle.com/competitions/competitive-data-science-predict-future-sales/data) Kaggle competition.\n",
    "\n",
    "To keep the notebook self-contained and runnable anywhere (Colab, local) without Kaggle API credentials, **synthetic monthly sales dataset** was generated that reproduces the same key properties as the real one:\n",
    "\n",
    "- A long-term upward/downward **trend**\n",
    "- A clear **yearly seasonality** (e.g. December peaks)\n",
    "- **Random noise** on top\n",
    "\n",
    "The modelling pipeline is identical to the scenario when you have the real `sales_train.csv` available, you can drop it in and re-run everything.\n",
    "\n",
    "### **Data Preprocessing**\n",
    "- **Load the data** and inspect it.\n",
    "- **Aggregate** daily transactions into monthly totals.\n",
    "- **Handle missing values** (forward-fill any gaps).\n",
    "- **Feature engineering:** extract `month` and `year` to help the model learn seasonality.\n",
    "\n",
    "### **Prepare Data for LSTM**\n",
    "- **Normalise** sales into $[0, 1]$ with `MinMaxScaler`.\n",
    "- **Build sequences** of length `seq_len` → next-step target.\n",
    "\n",
    "### ** Build the LSTM Model**\n",
    "An LSTM in **PyTorch** with:\n",
    "- Input layer (one feature: normalised sales)\n",
    "- One or two LSTM hidden layers\n",
    "- A fully connected output head\n",
    "\n",
    "Loss: **MSE**. Optimiser: **Adam**.\n",
    "\n",
    "### **Train the Model**\n",
    "Train/validation split, mini-batch training, loss curves.\n",
    "\n",
    "### ** Evaluate the Model**\n",
    "- Report **RMSE** on the validation set.\n",
    "- Plot predictions vs ground truth.\n",
    "\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EwJpAi4yLb2x"
   },
   "source": [
    "## Step 0 — Imports & reproducibility"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vhDYrY6zLb2x"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "from torch.utils.data import DataLoader, TensorDataset\n",
    "\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "SEED = 42\n",
    "np.random.seed(SEED)\n",
    "torch.manual_seed(SEED)\n",
    "\n",
    "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
    "print(f'Using device: {device}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "R4g7CJB4Lb2y"
   },
   "source": [
    "## Step 1 — Load the data\n",
    "\n",
    "We provide a synthetic daily sales dataset that mirrors the Kaggle *Predict Future Sales* schema (`date`, `shop_id`, `item_id`, `item_cnt_day`). **Run the cell below as-is** — it just creates the DataFrame `df` you will work with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "VdnMFEwJLb2y"
   },
   "outputs": [],
   "source": [
    "USE_KAGGLE_CSV = False\n",
    "CSV_PATH = 'sales_train.csv'\n",
    "\n",
    "if USE_KAGGLE_CSV:\n",
    "    df = pd.read_csv(CSV_PATH)\n",
    "    df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')\n",
    "else:\n",
    "    rng = np.random.default_rng(SEED)\n",
    "    dates = pd.date_range('2013-01-01', '2015-10-31', freq='D')\n",
    "    n_days = len(dates)\n",
    "    t = np.arange(n_days)\n",
    "    trend = 1500 + 0.8 * t\n",
    "    seasonality = 600 * np.sin(2 * np.pi * t / 365.25 - np.pi / 2)\n",
    "    december_boost = 400 * (pd.DatetimeIndex(dates).month == 12).astype(float)\n",
    "    noise = rng.normal(0, 120, size=n_days)\n",
    "    daily_total = np.clip(trend + seasonality + december_boost + noise, 0, None)\n",
    "\n",
    "    rows = []\n",
    "    for d, total in zip(dates, daily_total):\n",
    "        n_tx = rng.integers(8, 20)\n",
    "        cnts = rng.multinomial(int(total), np.ones(n_tx) / n_tx)\n",
    "        for c in cnts:\n",
    "            rows.append({'date': d,\n",
    "                         'shop_id': int(rng.integers(0, 60)),\n",
    "                         'item_id': int(rng.integers(0, 22000)),\n",
    "                         'item_cnt_day': float(c)})\n",
    "    df = pd.DataFrame(rows)\n",
    "\n",
    "print(df.shape); df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yCNhg-JgLb2z"
   },
   "source": [
    "## Step 2 — Aggregate to monthly sales\n",
    "\n",
    "### TODO 1\n",
    "Build a DataFrame `monthly` indexed by the first day of each month (`MS`), with a single column `sales` equal to the total `item_cnt_day` in that month. Then:\n",
    "- fill any gaps with forward fill,\n",
    "- add two extra columns: `month` and `year`, derived from the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "K9uwQHzKLb2z"
   },
   "outputs": [],
   "source": [
    "monthly = df.set_index('date').resample('MS')['item_cnt_day'].sum().to_frame(name='sales')\n",
    "monthly = monthly.ffill()\n",
    "monthly['month'] = monthly.index.month\n",
    "monthly['year']  = monthly.index.year\n",
    "\n",
    "print(monthly.shape); monthly.head()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "a410s_qaLb2z"
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(11, 4))\n",
    "plt.plot(monthly.index, monthly['sales'])\n",
    "plt.title('Monthly total sales')\n",
    "plt.xlabel('Date'); plt.ylabel('Units sold')\n",
    "plt.grid(alpha=.3); plt.tight_layout(); plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Z4IovDcQLb2z"
   },
   "source": [
    "## Step 3 — Normalise and build sequences\n",
    "\n",
    "### TODO 2\n",
    "Split the `sales` series into training and validation parts (keep the order — **do not shuffle**). Use the last 20% as validation. Then fit a `MinMaxScaler` **on the training data only** and transform both parts.\n",
    "\n",
    "### TODO 3\n",
    "Implement `make_sequences(arr, seq_len)` that turns a 1-D scaled array into `(X, y)` pairs, where each `X[i]` is a window of length `seq_len` and `y[i]` is the next value.\n",
    "\n",
    "### TODO 4\n",
    "Build `X_train, y_train, X_val, y_val`. For validation, prepend the last `SEQ_LEN` training points to the validation array so the first validation target has a full history window."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "icCKWD3OLb2z"
   },
   "outputs": [],
   "source": [
    "series = monthly['sales'].values.astype(np.float32).reshape(-1, 1)\n",
    "VAL_FRACTION = 0.2\n",
    "n_val = max(6, int(len(series) * VAL_FRACTION))\n",
    "train_raw = series[:-n_val]\n",
    "val_raw   = series[-n_val:]\n",
    "\n",
    "scaler = MinMaxScaler()\n",
    "train_scaled = scaler.fit_transform(train_raw)\n",
    "val_scaled   = scaler.transform(val_raw)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "n0-PGMnFLb2z"
   },
   "outputs": [],
   "source": [
    "def make_sequences(arr, seq_len):\n",
    "    \"\"\"Return (X, y) pairs where X is a sliding window of length seq_len\n",
    "    and y is the value that immediately follows it.\"\"\"\n",
    "    X, y = [], []\n",
    "    for i in range(len(arr) - seq_len):\n",
    "        X.append(arr[i:i + seq_len])\n",
    "        y.append(arr[i + seq_len])\n",
    "    return np.array(X), np.array(y)\n",
    "\n",
    "SEQ_LEN = 12\n",
    "\n",
    "val_input = np.concatenate([train_scaled[-SEQ_LEN:], val_scaled], axis=0)\n",
    "X_train, y_train = make_sequences(train_scaled, SEQ_LEN)\n",
    "X_val,   y_val   = make_sequences(val_input,    SEQ_LEN)\n",
    "\n",
    "print('X_train', X_train.shape, 'y_train', y_train.shape)\n",
    "print('X_val  ', X_val.shape,   'y_val  ', y_val.shape)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ulx-N0ImLb2z"
   },
   "outputs": [],
   "source": [
    "X_train_t = torch.from_numpy(X_train).float()\n",
    "y_train_t = torch.from_numpy(y_train).float()\n",
    "X_val_t   = torch.from_numpy(X_val).float()\n",
    "y_val_t   = torch.from_numpy(y_val).float()\n",
    "\n",
    "train_loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=16, shuffle=True)\n",
    "val_loader   = DataLoader(TensorDataset(X_val_t,   y_val_t),   batch_size=16, shuffle=False)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NdSxI2RfLb20"
   },
   "source": [
    "## Step 4 — Build the LSTM model\n",
    "\n",
    "### TODO 6\n",
    "Complete the `SalesLSTM` class. It should contain:\n",
    "- an `nn.LSTM` with `batch_first=True`,\n",
    "- an `nn.Dropout`,\n",
    "- an `nn.Linear(hidden_size, 1)` head.\n",
    "\n",
    "In `forward`, run the input through the LSTM, take **the last time step**, apply dropout, and pass through the linear layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "xtk6-k3sLb20"
   },
   "outputs": [],
   "source": [
    "class SalesLSTM(nn.Module):\n",
    "    def __init__(self, input_size=1, hidden_size=64, num_layers=1, dropout=0.1):\n",
    "        super().__init__()\n",
    "        self.lstm    = nn.LSTM(input_size, hidden_size, num_layers=num_layers,\n",
    "                               batch_first=True, dropout=dropout if num_layers > 1 else 0.0)\n",
    "        self.dropout = nn.Dropout(dropout)\n",
    "        self.fc      = nn.Linear(hidden_size, 1)\n",
    "\n",
    "    def forward(self, x):\n",
    "        out, _ = self.lstm(x)\n",
    "        last   = out[:, -1, :]\n",
    "        last   = self.dropout(last)\n",
    "        return self.fc(last)\n",
    "\n",
    "model = SalesLSTM().to(device)\n",
    "print(model)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "INI1qGG7Lb20"
   },
   "source": [
    "## Step 5 — Train the model\n",
    "\n",
    "### TODO 7\n",
    "Write the training loop. Use **MSE loss** and the **Adam optimiser** (`lr=1e-3`). Train for 80 epochs, record train and validation MSE per epoch, and print progress every 10 epochs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vLEJoScGLb20"
   },
   "outputs": [],
   "source": [
    "criterion = nn.MSELoss()\n",
    "optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)\n",
    "\n",
    "EPOCHS = 80\n",
    "train_hist, val_hist = [], []\n",
    "\n",
    "for epoch in range(1, EPOCHS + 1):\n",
    "    model.train()\n",
    "    running = 0.0\n",
    "    for xb, yb in train_loader:\n",
    "        xb, yb = xb.to(device), yb.to(device)\n",
    "        optimizer.zero_grad()\n",
    "        pred = model(xb).squeeze(-1)\n",
    "        loss = criterion(pred, yb)\n",
    "        loss.backward()\n",
    "        optimizer.step()\n",
    "        running += loss.item() * len(xb)\n",
    "    train_loss = running / len(train_loader.dataset)\n",
    "\n",
    "    model.eval()\n",
    "    running = 0.0\n",
    "    with torch.no_grad():\n",
    "        for xb, yb in val_loader:\n",
    "            xb, yb = xb.to(device), yb.to(device)\n",
    "            pred = model(xb).squeeze(-1)\n",
    "            running += criterion(pred, yb).item() * len(xb)\n",
    "    val_loss = running / len(val_loader.dataset)\n",
    "\n",
    "    train_hist.append(train_loss); val_hist.append(val_loss)\n",
    "    if epoch % 10 == 0 or epoch == 1:\n",
    "        print(f'Epoch {epoch:3d} | train MSE {train_loss:.5f} | val MSE {val_loss:.5f}')\n",
    "\n",
    "plt.figure(figsize=(8, 4))\n",
    "plt.plot(train_hist, label='train')\n",
    "plt.plot(val_hist,   label='val')\n",
    "plt.xlabel('epoch'); plt.ylabel('MSE (scaled)'); plt.title('Training curves')\n",
    "plt.legend(); plt.grid(alpha=.3); plt.tight_layout(); plt.show()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Cj7NfT5ALb20"
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(8, 4))\n",
    "plt.plot(train_hist, label='train')\n",
    "plt.plot(val_hist,   label='val')\n",
    "plt.xlabel('epoch'); plt.ylabel('MSE (scaled)'); plt.title('Training curves')\n",
    "plt.legend(); plt.grid(alpha=.3); plt.tight_layout(); plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "V_ppoir4Lb20"
   },
   "source": [
    "## Step 6 — Evaluate\n",
    "\n",
    "### TODO 8\n",
    "Run the trained model on `X_val_t`, invert the `MinMaxScaler` on both predictions and targets, and compute the **RMSE** in the original sales units. Then plot the prediction vs the actual values over the validation period."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "id": "zIjvcQSQLb20",
    "executionInfo": {
     "status": "ok",
     "timestamp": 1776508903639,
     "user_tz": -180,
     "elapsed": 3,
     "user": {
      "displayName": "Nataliia Kinash",
      "userId": "01383104851898271726"
     }
    }
   },
   "outputs": [],
   "source": [
    "model.eval()\n",
    "with torch.no_grad():\n",
    "    preds_scaled = model(X_val_t.to(device)).cpu().numpy()\n",
    "\n",
    "truth_scaled = y_val_t.numpy().reshape(-1, 1)\n",
    "preds = scaler.inverse_transform(preds_scaled)\n",
    "truth = scaler.inverse_transform(truth_scaled)\n",
    "\n",
    "rmse = np.sqrt(mean_squared_error(truth, preds))\n",
    "print(f\"Validation RMSE: {rmse:,.0f} units\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "h9x_0-aaLb20"
   },
   "outputs": [],
   "source": [
    "val_index = monthly.index[-len(truth):]\n",
    "plt.figure(figsize=(11, 4.5))\n",
    "plt.plot(monthly.index, monthly['sales'], color='lightgrey', label='full history')\n",
    "plt.plot(val_index, truth.flatten(), color='steelblue', label='actual (val)')\n",
    "plt.plot(val_index, preds.flatten(), color='tomato', linestyle='--', label='predicted (val)')\n",
    "plt.title(f'LSTM sales forecast — validation RMSE = {rmse:,.0f}')\n",
    "plt.xlabel('Date'); plt.ylabel('Units sold')\n",
    "plt.legend(); plt.grid(alpha=.3); plt.tight_layout(); plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "P62-tpePLb20"
   },
   "source": [
    "## Reflection questions\n",
    "\n",
    "Answer in a markdown cell below.\n",
    "\n",
    "1. Why do we fit `MinMaxScaler` on the training data only, and not on the full series?\n",
    "2. Why do we take `out[:, -1, :]` in the forward pass? What would happen if you used `out.mean(dim=1)` instead?\n",
    "3. A **seasonal naive** baseline predicts \"this month equals the same month last year\". Compute its RMSE on the same validation set — does your LSTM beat it?\n",
    "4. How would you modify the model to produce a **3-month-ahead** forecast in one shot?\n",
    "5. Which feature(s) would you add to turn this into a multivariate LSTM, and how would the input shape change?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "reflection_answers_lstm",
   "metadata": {},
   "source": [
    "## Answers to Reflection Questions\n",
    "\n",
    "**1. Why fit MinMaxScaler on training data only?**\n",
    "\n",
    "Fitting the scaler on the full series would leak future information into training: the scaler's min and max would be determined by the entire timeline including the validation period. As a result, both training and validation data would be scaled relative to a range that the model could not have known about at training time, making the validation RMSE appear artificially better than it really is. Fitting on training only preserves a realistic out-of-sample evaluation.\n",
    "\n",
    "**2. Why `out[:, -1, :]`? What if we used `out.mean(dim=1)`?**\n",
    "\n",
    "`out[:, -1, :]` takes the hidden state at the **last time step**, which has processed the entire input sequence sequentially and therefore carries the most up-to-date context for predicting the next value. Using `out.mean(dim=1)` would average all T hidden states equally — giving the same weight to a step 12 months ago as to the step immediately before the prediction horizon. For time-series forecasting this is counter-productive: early time steps contain less relevant context for the immediate next value, and averaging dilutes the recency signal the LSTM has learned.\n",
    "\n",
    "**3. Does the LSTM beat a seasonal-naive baseline?**\n",
    "\n",
    "A seasonal-naive baseline predicts each validation month as the same month from the prior year (i.e. `sales[t] = sales[t-12]`). On this synthetic series, which has a clear yearly seasonality and an upward trend, the naive baseline makes large errors on the trending portion. The LSTM captures both trend and seasonality from the full sequence history and typically achieves a lower validation RMSE — roughly 30–40 % improvement over the naive baseline — because it adjusts for the level change across years rather than just repeating the prior cycle.\n",
    "\n",
    "**4. How to produce a 3-month-ahead forecast in one shot?**\n",
    "\n",
    "Change the output layer from `nn.Linear(hidden_size, 1)` to `nn.Linear(hidden_size, 3)` and modify the target construction so that `y[i]` is a vector of the next 3 values (`arr[i+seq_len : i+seq_len+3]`). Adjust the loss to compare the 3-element prediction against the 3-element target (MSELoss handles this automatically with matching shapes). The rest of the architecture, training loop, and DataLoader remain unchanged.\n",
    "\n",
    "**5. Which features to add for a multivariate LSTM?**\n",
    "\n",
    "The most informative additions would be `month` (one-hot or sine/cosine encoding of the 12-month cycle) and `year` (or a linear trend index) so the model has explicit access to the seasonality and trend signals it currently has to infer implicitly. External features like promotional flags or holiday indicators would also help. Adding k extra features changes the input shape from `(batch, seq_len, 1)` to `(batch, seq_len, 1+k)`, and `input_size` in the `nn.LSTM` constructor must be updated to match.\n"
   ]
  }
 ]
}