Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks

Testing Richard Sutton's Bitter Lesson in the Markets with a Toy Deep Learning Model

In this tutorial, we implement the paper Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks by Lawrence Takeuchi and Yu-Ying (Albert) Lee. Originally published in 2013, this early and influential work explores the use of deep learning—specifically stacked restricted Boltzmann machines (RBMs) and feedforward neural networks—to extract return-predictive features from historical stock price data and improve momentum-based trading performance.

Why implement this paper

This paper is one of the earliest to explore the use of deep learning in financial markets, well before it became a mainstream tool in quant finance. Rather than relying on hand-crafted features or traditional factor models, the authors use a stacked autoencoder built from restricted Boltzmann machines (RBMs) to extract predictive signals directly from raw return data. These features are then passed to a feedforward neural network to classify stocks based on future return potential.

The reported results are striking: the deep learning model delivers an annualized return of 45.93% over the 1990–2009 test period, vastly outperforming the basic momentum strategy, which returned just 10.53%. It also achieved significantly higher Sharpe ratios, with consistent outperformance across deciles ranked by model confidence.

While our implementation uses a more modern approach (end-to-end training, no RBMs), and our dataset differs significantly in scope and period, the core motivation remains the same: to test whether deep learning can extract meaningful alpha from raw price signals, validating the broader claim of Richard Sutton’s Bitter Lesson—that general-purpose, computation-heavy methods outperform manually engineered ones over time.

For a full breakdown of the original paper and our experimental results, see my article on quantitativo.com. Here, we’ll focus on how to implement the model step by step.

Data

In our replication, we will use daily stock prices from January 1, 1990, to today, obtained from Norgate Data. Norgate provides high-quality, survivorship-bias-free daily data for the US stock market at an affordable price. For more information on acquiring a subscription, please check the Norgate website.

The first step is to retrieve the data for a given symbol:

import norgatedata
import numpy as np
import pandas as pd

def get_data(symbol):
    # Get raw price data (adjusted and unadjusted closes)
    df = norgatedata.price_timeseries(
        symbol,
        padding_setting=norgatedata.PaddingType.ALLMARKETDAYS,
        stock_price_adjustment_setting=norgatedata.StockPriceAdjustmentType.CAPITALSPECIAL,
        timeseriesformat='pandas-dataframe'
    )[['Close', 'Unadjusted Close']]
    df = df.ffill()
    df['date'] = df.index

    # Get end-of-month data: last trading day of each month
    eom = df.groupby([df.index.year, df.index.month]).last()\
        .set_index('date').iloc[:-1, :]
    eom['is_next_jan'] = (eom.index.month == 12).astype(int)  # Flag if the next month is January
    eom['next_month_ret'] = eom['Close'].shift(-1) / eom['Close'] - 1  # Forward return for next month

    # Compute monthly cumulative returns over the past 12 months (ret-m1 to ret-m12)
    monthly = pd.DataFrame(index=eom.index)
    for m in range(13, 0, -1):
        monthly[f'ret-m{m}'] = eom['Close'].shift(m - 1)
    monthly = (monthly.T / monthly.T.iloc[0] - 1).iloc[1:, :].T.astype(float)

    # Compute daily cumulative returns over the past 20 trading days (ret-d1 to ret-d20)
    dates = pd.DataFrame(index=eom.index)
    for d in range(21, 0, -1):
        dates[f'ret-d{d}'] = df['date'].shift(d - 1)
    dates = dates.dropna()
    daily = pd.DataFrame(index=dates.index, columns=dates.columns)
    
    # Map historical closing prices based on shifted dates
    for date in daily.index:
        daily.loc[date] = df.loc[dates.loc[date].values, 'Close'].values
    daily = (daily.T / daily.T.iloc[0] - 1).iloc[1:, :].T.astype(float)

    # Merge monthly and daily features with EOM targets and metadata
    df = monthly.join(daily, how='inner')\
        .join(eom[['Unadjusted Close', 'is_next_jan', 'next_month_ret']], how='inner')
    df = df.dropna().rename(columns={'Unadjusted Close': 'unadj_close'})

    # Return None if no usable rows
    if len(df) == 0:
        return None

    # Create MultiIndex (date, symbol)
    df.index = pd.MultiIndex.from_tuples([(d, symbol) for d in df.index])

    # Drop any rows with infinite values
    df = df[~df.isin([np.inf, -np.inf]).any(axis=1)]
    return df

The function get_data(symbol) retrieves daily price data for a given stock symbol using Norgate, then constructs a feature-rich DataFrame indexed by (date, symbol) for use in predictive models. It calculates monthly cumulative returns (ret-m1 to ret-m12) based on end-of-month closes and daily cumulative returns (ret-d1 to ret-d20) over the last 20 trading days. It also includes the unadjusted close price, a binary flag is_next_jan indicating if the next month is January, and the forward return for the next month (next_month_ret). The function ensures no missing or infinite values and returns a clean, multi-indexed dataset ready for modeling.

We can check it with a call like get_data('AAPL'):

Data for AAPL

Now, retrieving all data is straightforward:
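
Here is a minimal sketch of what that could look like. We assume the symbol universe comes from a Norgate watchlist; the watchlist name below is illustrative:

import norgatedata
import pandas as pd

# Pull the symbol universe from a Norgate watchlist, build one frame per
# symbol with get_data, and stack them into a single (date, symbol) DataFrame
symbols = norgatedata.watchlist_symbols('Russell 3000 Current & Past')
frames = [get_data(symbol) for symbol in symbols]
data = pd.concat([f for f in frames if f is not None]).sort_index()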

In about 30 minutes, we are ready to continue.

Pre-processing

Before training the model, we apply several pre-processing steps to clean the data, standardize features, and define the target variable:
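
Here is a minimal sketch of these steps, assuming the column layout produced by get_data; the preprocess helper and the next_month_ret_raw column name are ours:

import pandas as pd

def preprocess(df):
    # Drop low-priced stocks (unadjusted close <= $5), then the price column itself
    df = df[df['unadj_close'] > 5].drop(columns='unadj_close')

    ret_cols = [c for c in df.columns if c.startswith('ret-')]

    # Cross-sectional z-score: standardize each return feature across stocks per date
    z = df[ret_cols].groupby(level=0).transform(lambda x: (x - x.mean()) / x.std())

    # is_next_jan stays raw: it is constant across stocks on a given date,
    # so a cross-sectional z-score would be undefined
    z['is_next_jan'] = df['is_next_jan']

    # Binary target: 1 if next month's return beats that date's cross-sectional median
    median_ret = df['next_month_ret'].groupby(level=0).transform('median')
    z['target'] = (df['next_month_ret'] > median_ret).astype(int)

    # Keep the raw forward return for later performance analysis
    z['next_month_ret_raw'] = df['next_month_ret']
    return z.dropna()

processed = preprocess(data)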

  • We first filter out low-priced stocks (unadjusted close ≤ $5), which are often illiquid and noisy.

  • We then apply cross-sectional z-score standardization to all features (excluding the last two columns) on each date to normalize them.

  • The target variable is defined as a binary label: 1 if the next month’s return is above the median return for that date, and 0 otherwise.

  • We also preserve the raw (unstandardized) version of the last feature (is_next_jan) and the original forward return for later analysis.

  • Finally, all components are combined into a single DataFrame for model input.

Cross-validation splits

To evaluate model performance over time, we implement a rolling-window cross-validation framework tailored for time series data:
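
A minimal sketch of such a generator follows; the window lengths are illustrative defaults, not values from the paper:

def train_val_test_split(df, train_years=8, val_years=2, validation_first=False):
    # Slide the whole window forward one year at a time; each fold yields
    # chronological train/validation sets plus a single out-of-sample test year
    dates = df.index.get_level_values(0)
    first, last = dates.year.min(), dates.year.max()
    for test_year in range(first + train_years + val_years, last + 1):
        start = test_year - train_years - val_years
        if validation_first:
            val_mask = (dates.year >= start) & (dates.year < start + val_years)
            train_mask = (dates.year >= start + val_years) & (dates.year < test_year)
        else:
            train_mask = (dates.year >= start) & (dates.year < start + train_years)
            val_mask = (dates.year >= start + train_years) & (dates.year < test_year)
        yield df[train_mask], df[val_mask], df[dates.year == test_year]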

The train_val_test_split function generates chronological train/validation/test splits by sliding a multi-year window forward one year at a time. For each iteration, it defines a training period, a validation period, and a test year—ensuring no data leakage. We can choose whether to place the validation set before or after the training set using the validation_first flag. This approach mimics a realistic backtesting setup and provides out-of-sample evaluation across multiple time periods.

We can verify that the datasets are being generated correctly with the following code:
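
For example, we can print the date range and size of each split:

for train_df, val_df, test_df in train_val_test_split(processed):
    for name, part in [('train', train_df), ('val', val_df), ('test', test_df)]:
        d = part.index.get_level_values(0)
        print(f'{name}: {d.min():%Y-%m} to {d.max():%Y-%m} ({len(part):,} rows)')
    print('---')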

Train, validation, and test datasets

Now, we are ready to work on the model.

The model

We define a simple feedforward neural network (FFNN) architecture to classify the pre-processed features into two classes, as specified in the paper:
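
A sketch in PyTorch follows; the layer sizes (33, 40, 4, 50, 2) come from the paper, the rest is standard boilerplate:

import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, input_dim=33):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 40), nn.ReLU(),
            nn.Linear(40, 4), nn.ReLU(),
            nn.Linear(4, 50), nn.ReLU(),
            nn.Linear(50, 2),  # raw logits, to be consumed by CrossEntropyLoss
        )

    def forward(self, x):
        return self.net(x)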

The FFNN model consists of three hidden layers with ReLU activations, followed by a final output layer that produces logits for binary classification. The input layer expects 33 features, which are transformed through progressively deeper representations: from 40 units, to 4, then to 50, before reaching the final output layer with 2 neurons. The model returns raw logits, which are suitable for use with CrossEntropyLoss, a standard loss function for multi-class classification tasks.

In the original 2013 paper, Takeuchi and Lee used a stacked autoencoder built from restricted Boltzmann machines (RBMs) for pretraining, which was common at the time. They pre-trained the encoder (33→40→4) to extract features and then attached a feedforward classifier (4→50→2) on top. Our model directly implements the final architecture they found optimal through cross-validation, bypassing the pretraining phase.

We can skip RBMs and pretraining today thanks to major advances in hardware (GPUs/TPUs), software (PyTorch, TensorFlow), initialization methods, and optimization algorithms (like Adam). These allow deep neural networks to be trained end-to-end from scratch efficiently, even on noisy financial datasets. In contrast, a decade ago, training deep nets from scratch often resulted in poor convergence, which is why unsupervised layer-wise pretraining with RBMs was widely used to help the network learn useful representations before supervised fine-tuning.

We can verify that our architecture works using the following code:
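
Something like this works; the make_loader helper and the batch size are ours:

import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(df, batch_size=256, shuffle=False):
    # Features are the standardized return columns plus the January flag
    feature_cols = [c for c in df.columns if c.startswith('ret-')] + ['is_next_jan']
    X = torch.tensor(df[feature_cols].values, dtype=torch.float32)
    y = torch.tensor(df['target'].values, dtype=torch.long)
    return DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=shuffle)

train_df, val_df, test_df = next(train_val_test_split(processed))
X, y = next(iter(make_loader(train_df)))
print(X.shape, y.shape)   # expect torch.Size([256, 33]) and torch.Size([256])
print(FFNN()(X).shape)    # expect torch.Size([256, 2]) -- raw logits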

The code builds a PyTorch DataLoader from a DataFrame, converting features and labels to tensors for training. It retrieves one mini-batch, checks the shapes, and runs a forward pass through the FFNN to verify the model works as expected. We should see something like this:

Now, let's move to training.

Training

We now define the training loop used to optimize the model parameters using the training and validation sets:
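
A minimal version of such a loop (the epoch count and learning rate are illustrative):

import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=100, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best = {'val_loss': float('inf'), 'val_acc': 0.0, 'epoch': -1, 'state': None}

    for epoch in range(epochs):
        model.train()
        for X, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()

        # Evaluate on the validation set
        model.eval()
        val_loss, correct, total = 0.0, 0, 0
        with torch.no_grad():
            for X, y in val_loader:
                logits = model(X)
                val_loss += criterion(logits, y).item() * len(y)
                correct += (logits.argmax(dim=1) == y).sum().item()
                total += len(y)
        val_loss /= total

        # Keep the weights with the lowest validation loss seen so far
        if val_loss < best['val_loss']:
            best.update(val_loss=val_loss, val_acc=correct / total, epoch=epoch,
                        state=copy.deepcopy(model.state_dict()))

    model.load_state_dict(best['state'])
    return model, best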

The train function performs mini-batch training of the FFNN model using the Adam optimizer and cross-entropy loss. It tracks performance on both the training and validation sets, saving the model weights that yield the lowest validation loss. After training for a specified number of epochs, it returns the best-performing model along with key metrics such as validation loss, accuracy, and the epoch at which the best performance was achieved.

We now generate predictions and rank stocks by model confidence:
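
A sketch of this step; it assumes the test DataLoader preserves row order (shuffle=False) so predictions align with the DataFrame:

import pandas as pd
import torch

def predict(model, test_df, test_loader, n_quantiles=10):
    model.eval()
    preds, probs = [], []
    with torch.no_grad():
        for X, _ in test_loader:
            p = torch.softmax(model(X), dim=1)
            preds.append(p.argmax(dim=1))
            probs.append(p[:, 1])

    out = test_df.copy()
    out['pred'] = torch.cat(preds).numpy()
    out['prob'] = torch.cat(probs).numpy()
    # Cross-sectional ranking: bucket stocks by P(class 1) on each date
    out['quantile'] = out.groupby(level=0)['prob'].transform(
        lambda x: pd.qcut(x, n_quantiles, labels=False, duplicates='drop') + 1)
    return out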

The method above generates model predictions on the test set by applying the trained network to each mini-batch and collecting the predicted classes and class probabilities. It appends these outputs to the original test DataFrame and assigns each example to a quantile bucket based on its predicted probability of belonging to class 1. This enables cross-sectional ranking of stocks by confidence level, which is useful for evaluating strategy performance by signal strength.

We now run rolling-window training and evaluation to collect predictions and track model performance over time:
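
Putting the previous sketches together, the loop could look like this (the metric names are ours):

results, all_preds = [], []
for train_df, val_df, test_df in train_val_test_split(processed):
    model, best = train(FFNN(), make_loader(train_df, shuffle=True),
                        make_loader(val_df))
    all_preds.append(predict(model, test_df, make_loader(test_df)))
    results.append({'test_year': test_df.index.get_level_values(0)[0].year,
                    'val_loss': best['val_loss'], 'val_acc': best['val_acc'],
                    'best_epoch': best['epoch']})

performance = pd.DataFrame(results)
predictions = pd.concat(all_preds)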

Running the code above will take a bit over an hour. We should see something like this:

The DataFrame that tracks the performance should record something like this:

Results from training the model

Now, we are ready to see the results.

Results

Let's start reviewing the model's classification performance by analyzing the confusion matrix:
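
A minimal sketch, assuming the predictions DataFrame from the rolling loop above:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

cm = confusion_matrix(predictions['target'], predictions['pred'])
cm_norm = cm / cm.sum(axis=0, keepdims=True)  # column-normalize per predicted class
ConfusionMatrixDisplay(cm_norm).plot(values_format='.2f')
plt.show()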

This code compares the true labels and the model’s predicted classes across the entire test set to compute a confusion matrix, which is then column-normalized to show the distribution of predictions per class. This helps assess whether the model is biased toward one class and how well it distinguishes between them. The results are visualized using ConfusionMatrixDisplay, with values shown as proportions for easier interpretation.

We should see something like this:

Confusion matrix

Highlights:

  • The model is slightly better than random guessing, with ~52% accuracy for both predicted classes.

  • It shows no strong bias toward one class—performance is symmetrical.

  • This aligns with expectations for noisy financial datasets with low signal-to-noise ratios, where even a small edge (e.g., >50% accuracy) can be economically meaningful when scaled properly in a trading strategy.

We now compute the monthly long-short returns based on the spread between the top and bottom quantile predictions:
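
One way to build the monthly quantile-return table (column names follow the prediction step above):

dates = predictions.index.get_level_values(0)
quantile_rets = predictions.groupby([dates, 'quantile'])['next_month_ret_raw'] \
    .mean().unstack('quantile')

missing = quantile_rets.isna().any(axis=1).sum()
print(f'{missing} month(s) are missing one or more quantiles')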

This snippet computes the average forward return for each quantile bucket on a monthly basis by grouping predictions by date and quantile. It then unstacks the results into a table where each column represents a quantile. Finally, it prints how many months are missing one or more quantiles, which can occur due to duplicates='drop' in qcut, highlighting potential data sparsity in certain months.

We now visualize the annualized average return by quantile to assess the relationship between predicted confidence and realized performance:
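
A minimal sketch; annualizing by compounding the mean monthly return (as below) rather than simple scaling is a choice we make here:

import matplotlib.pyplot as plt

annualized = (1 + quantile_rets.mean()) ** 12 - 1
ax = annualized.plot(kind='bar')
ax.set_xlabel('Quantile (model confidence)')
ax.set_ylabel('Annualized mean return')
plt.show()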

This code calculates the mean return for each quantile across all months, annualizes it, and plots it as a bar chart. The chart highlights how returns vary by model confidence:

Mean return for each quantile

Highlights:

  • Returns increase monotonically from quantile 1 to quantile 10.

  • Quantile 1 has a negative return (-0.64%), indicating the model is correctly identifying poor-performing stocks.

  • Quantile 10 shows the highest return at 13.0%, suggesting strong predictive power in selecting top-performing stocks.

  • The spread between quantile 10 and quantile 1 is approximately 13.6 percentage points, which is economically meaningful.

  • The results are not as good as those reported in the paper, but they are still worth examining.

This pattern strongly supports the idea that the model’s confidence scores are informative, as shown by the authors in the paper. A long-short strategy (long top quantile, short bottom) would likely be profitable and consistent, as we can verify:
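
A minimal sketch of the long-short construction:

# Long the highest-confidence quantile, short the lowest, rebalanced monthly
long_short = (quantile_rets[quantile_rets.columns.max()]
              - quantile_rets[quantile_rets.columns.min()]).dropna()
cumulative_rets = (1 + long_short).cumprod()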

The snippet above generates the cumulative returns for the strategy. We can visualize it with one line, cumulative_rets.plot(logy=True), or if we add a bit more formatting and the benchmark:

Equity and drawdown curves

To get a summary of the backtest main stats, here's what we can do:
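
A minimal sketch; benchmark_rets is assumed to hold monthly S&P 500 returns on the same dates (not built here), and the Sharpe ratio below ignores the risk-free rate:

import numpy as np
import pandas as pd

def summary_stats(monthly_rets, benchmark_rets):
    # benchmark_rets: monthly benchmark returns aligned on the same dates
    equity = (1 + monthly_rets).cumprod()
    years = len(monthly_rets) / 12
    drawdown = equity / equity.cummax() - 1
    return pd.Series({
        'annualized return': equity.iloc[-1] ** (1 / years) - 1,
        'sharpe ratio': monthly_rets.mean() / monthly_rets.std() * np.sqrt(12),
        'max drawdown': drawdown.min(),
        'correlation to benchmark': monthly_rets.corr(benchmark_rets),
    })

print(summary_stats(long_short, benchmark_rets))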

We should see something like this:

Summary of backtest main stats

Overall, the strategy outperforms the benchmark across all key dimensions:

  • Annualized return is 12.8%, nearly double the S&P 500’s 7.0%;

  • Sharpe Ratio is 1.03, more than twice that of the benchmark, indicating superior risk-adjusted performance;

  • Maximum drawdown is limited to 24%, less than half of the S&P 500’s 52.6%, reflecting strong downside protection;

  • Correlation to the market is negative, offering valuable diversification benefits in a broader portfolio.

Conclusion

While our implementation successfully demonstrated the viability of applying deep learning to momentum-based stock prediction, the results we obtained were a far cry from those reported by Takeuchi and Lee (2013). There are several likely reasons for this discrepancy. First, we used a different and significantly smaller dataset, both in terms of the number of stock-month observations and market breadth. Second, our time horizon was more recent and extended into a different market regime, potentially less favorable to momentum strategies. Third, unlike the authors’ original approach which involved unsupervised pretraining using stacked RBMs, we trained the feedforward network end-to-end using modern initialization and optimization techniques. Additionally, we adopted a rolling-window training and evaluation protocol, whereas the original study trained once on a large historical sample and tested on a fixed out-of-sample window. These methodological and structural differences, combined with evolving market dynamics, likely account for the performance gap.

Nevertheless, replicating the exact results reported by Takeuchi and Lee was never the true goal. The purpose of this project was to explore, in a simplified and more accessible setting, the core insight behind Richard Sutton’s Bitter Lesson: that general methods powered by computation outperform systems infused with human-crafted rules and domain expertise. Our toy example, despite its simplicity and the constraints we faced, served as a proof of concept for this idea. By throwing a clean pipeline, raw price data, and compute at the return prediction problem—without relying on hand-engineered financial indicators—we were able to extract signal and generate meaningful returns. In that sense, the experiment achieved what it set out to do: demonstrate that scaling learning systems and leaning into computation-first approaches can produce viable results even in challenging, noisy domains like financial markets.
