Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks

Testing Richard Sutton's Bitter Lesson in the Markets with a Toy Deep Learning Model

In this tutorial, we implement the paper Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks by Lawrence Takeuchi and Yu-Ying (Albert) Lee. Originally published in 2013, this early and influential work explores the use of deep learning—specifically stacked restricted Boltzmann machines (RBMs) and feedforward neural networks—to extract return-predictive features from historical stock price data and improve momentum-based trading performance.

Why implement this paper

This paper is one of the earliest to explore the use of deep learning in financial markets, well before it became a mainstream tool in quant finance. Rather than relying on hand-crafted features or traditional factor models, the authors use a stacked autoencoder built from restricted Boltzmann machines (RBMs) to extract predictive signals directly from raw return data. These features are then passed to a feedforward neural network to classify stocks based on future return potential.

The reported results are striking: the deep learning model delivers an annualized return of 45.93% over the 1990–2009 test period, vastly outperforming the basic momentum strategy, which returned just 10.53%. It also achieved significantly higher Sharpe ratios, with consistent outperformance across deciles ranked by model confidence.

While our implementation uses a more modern approach (end-to-end training, no RBMs), and our dataset differs significantly in scope and period, the core motivation remains the same: to test whether deep learning can extract meaningful alpha from raw price signals, validating the broader claim of Richard Sutton’s Bitter Lesson—that general-purpose, computation-heavy methods outperform manually engineered ones over time.

For a full breakdown of the original paper and our experimental results, see my article on quantitativo.com. Here, we’ll focus on how to implement the model step by step.

Data

In our replication, we will use daily stock prices from January 1, 1990, to today, obtained from Norgate Data. Norgate provides high-quality, survivorship-bias-free daily data for the US stock market at an affordable price. For more information on acquiring a subscription, please check the Norgate website.

The first step is to retrieve the data for a given symbol:

import norgatedata
import numpy as np
import pandas as pd

def get_data(symbol):
    # Get raw price data (adjusted and unadjusted closes)
    df = norgatedata.price_timeseries(
        symbol,
        padding_setting=norgatedata.PaddingType.ALLMARKETDAYS,
        stock_price_adjustment_setting=norgatedata.StockPriceAdjustmentType.CAPITALSPECIAL,
        timeseriesformat='pandas-dataframe'
    )[['Close', 'Unadjusted Close']]
    df = df.ffill()
    df['date'] = df.index

    # Get end-of-month data: last trading day of each month
    eom = df.groupby([df.index.year, df.index.month]).last()\
        .set_index('date').iloc[:-1, :]
    eom['is_next_jan'] = (eom.index.month == 12).astype(int)  # Flag if the next month is January
    eom['next_month_ret'] = eom['Close'].shift(-1) / eom['Close'] - 1  # Forward return for next month

    # Compute monthly cumulative returns over the past 12 months (ret-m1 to ret-m12)
    monthly = pd.DataFrame(index=eom.index)
    for m in range(13, 0, -1):
        monthly[f'ret-m{m}'] = eom['Close'].shift(m - 1)
    monthly = (monthly.T / monthly.T.iloc[0] - 1).iloc[1:, :].T.astype(float)

    # Compute daily cumulative returns over the past 20 trading days (ret-d1 to ret-d20)
    dates = pd.DataFrame(index=eom.index)
    for d in range(21, 0, -1):
        dates[f'ret-d{d}'] = df['date'].shift(d - 1)
    dates = dates.dropna()
    daily = pd.DataFrame(index=dates.index, columns=dates.columns)
    
    # Map historical closing prices based on shifted dates
    for date in daily.index:
        daily.loc[date] = df.loc[dates.loc[date].values, 'Close'].values
    daily = (daily.T / daily.T.iloc[0] - 1).iloc[1:, :].T.astype(float)

    # Merge monthly and daily features with EOM targets and metadata
    df = monthly.join(daily, how='inner')\
        .join(eom[['Unadjusted Close', 'is_next_jan', 'next_month_ret']], how='inner')
    df = df.dropna().rename(columns={'Unadjusted Close': 'unadj_close'})

    # Return None if no usable rows
    if len(df) == 0:
        return None

    # Create MultiIndex (date, symbol)
    df.index = pd.MultiIndex.from_tuples([(d, symbol) for d in df.index])

    # Drop any rows with infinite values
    df = df[~df.isin([np.inf, -np.inf]).any(axis=1)]
    return df

The function get_data(symbol) retrieves daily price data for a given stock symbol using Norgate, then constructs a feature-rich DataFrame indexed by (date, symbol) for use in predictive models. It calculates monthly cumulative returns (ret-m1 to ret-m12) based on end-of-month closes and daily cumulative returns (ret-d1 to ret-d20) over the last 20 trading days. It also includes the unadjusted close price, a binary flag is_next_jan indicating if the next month is January, and the forward return for the next month (next_month_ret). The function ensures no missing or infinite values and returns a clean, multi-indexed dataset ready for modeling.

We can check it with a call like get_data('AAPL'):

Data for AAPL

Now, retrieving all data is straightforward:
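
Here is a minimal sketch of what that could look like. We assume the symbol universe comes from a Norgate watchlist; the watchlist name below is illustrative:

import norgatedata
import pandas as pd

# Pull the symbol universe from a Norgate watchlist, build one frame per
# symbol with get_data, and stack them into a single (date, symbol) DataFrame
symbols = norgatedata.watchlist_symbols('Russell 3000 Current & Past')
frames = [get_data(symbol) for symbol in symbols]
data = pd.concat([f for f in frames if f is not None]).sort_index()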

In about 30 minutes, we are ready to continue.

Pre-processing

Before training the model, we apply several pre-processing steps to clean the data, standardize features, and define the target variable:
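
Here is a minimal sketch of these steps, assuming the column layout produced by get_data; the preprocess helper and the next_month_ret_raw column name are ours:

import pandas as pd

def preprocess(df):
    # Drop low-priced stocks (unadjusted close <= $5), then the price column itself
    df = df[df['unadj_close'] > 5].drop(columns='unadj_close')

    ret_cols = [c for c in df.columns if c.startswith('ret-')]

    # Cross-sectional z-score: standardize each return feature across stocks per date
    z = df[ret_cols].groupby(level=0).transform(lambda x: (x - x.mean()) / x.std())

    # is_next_jan stays raw: it is constant across stocks on a given date,
    # so a cross-sectional z-score would be undefined
    z['is_next_jan'] = df['is_next_jan']

    # Binary target: 1 if next month's return beats that date's cross-sectional median
    median_ret = df['next_month_ret'].groupby(level=0).transform('median')
    z['target'] = (df['next_month_ret'] > median_ret).astype(int)

    # Keep the raw forward return for later performance analysis
    z['next_month_ret_raw'] = df['next_month_ret']
    return z.dropna()

processed = preprocess(data)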

  • We first filter out low-priced stocks (unadjusted close ≤ $5), which are often illiquid and noisy.

  • We then apply cross-sectional z-score standardization to all features (excluding the last two columns) on each date to normalize them.

  • The target variable is defined as a binary label: 1 if the next month’s return is above the median return for that date, and 0 otherwise.

  • We also preserve the raw (unstandardized) version of the last feature (is_next_jan) and the original forward return for later analysis.

  • Finally, all components are combined into a single DataFrame for model input.

Cross-validation splits

To evaluate model performance over time, we implement a rolling-window cross-validation framework tailored for time series data:
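
A minimal sketch of such a generator follows; the window lengths are illustrative defaults, not values from the paper:

def train_val_test_split(df, train_years=8, val_years=2, validation_first=False):
    # Slide the whole window forward one year at a time; each fold yields
    # chronological train/validation sets plus a single out-of-sample test year
    dates = df.index.get_level_values(0)
    first, last = dates.year.min(), dates.year.max()
    for test_year in range(first + train_years + val_years, last + 1):
        start = test_year - train_years - val_years
        if validation_first:
            val_mask = (dates.year >= start) & (dates.year < start + val_years)
            train_mask = (dates.year >= start + val_years) & (dates.year < test_year)
        else:
            train_mask = (dates.year >= start) & (dates.year < start + train_years)
            val_mask = (dates.year >= start + train_years) & (dates.year < test_year)
        yield df[train_mask], df[val_mask], df[dates.year == test_year]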

The train_val_test_split function generates chronological train/validation/test splits by sliding a multi-year window forward one year at a time. For each iteration, it defines a training period, a validation period, and a test year—ensuring no data leakage. We can choose whether to place the validation set before or after the training set using the validation_first flag. This approach mimics a realistic backtesting setup and provides out-of-sample evaluation across multiple time periods.

We can verify that the datasets are being generated correctly with the following code:
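
For example, we can print the date range and size of each split:

for train_df, val_df, test_df in train_val_test_split(processed):
    for name, part in [('train', train_df), ('val', val_df), ('test', test_df)]:
        d = part.index.get_level_values(0)
        print(f'{name}: {d.min():%Y-%m} to {d.max():%Y-%m} ({len(part):,} rows)')
    print('---')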

Train, validation, and test datasets

Now, we are ready to work on the model.

The model

We define a simple feedforward neural network (FFNN) architecture to classify the pre-processed features into two classes, as specified in the paper:
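
A sketch in PyTorch follows; the layer sizes (33, 40, 4, 50, 2) come from the paper, the rest is standard boilerplate:

import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, input_dim=33):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 40), nn.ReLU(),
            nn.Linear(40, 4), nn.ReLU(),
            nn.Linear(4, 50), nn.ReLU(),
            nn.Linear(50, 2),  # raw logits, to be consumed by CrossEntropyLoss
        )

    def forward(self, x):
        return self.net(x)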

The FFNN model consists of three hidden layers with ReLU activations, followed by a final output layer that produces logits for binary classification. The input layer expects 33 features, which are transformed through progressively deeper representations: from 40 units, to 4, then to 50, before reaching the final output layer with 2 neurons. The model returns raw logits, which are suitable for use with CrossEntropyLoss, a standard loss function for multi-class classification tasks.

In the original 2013 paper, Takeuchi and Lee used a stacked autoencoder built from restricted Boltzmann machines (RBMs) for pretraining, which was common at the time. They pre-trained the encoder (33→40→4) to extract features and then attached a feedforward classifier (4→50→2) on top. Our model directly implements the final architecture they found optimal through cross-validation, bypassing the pretraining phase.

We can skip RBMs and pretraining today thanks to major advances in hardware (GPUs/TPUs), software (PyTorch, TensorFlow), initialization methods, and optimization algorithms (like Adam). These allow deep neural networks to be trained end-to-end from scratch efficiently, even on noisy financial datasets. In contrast, a decade ago, training deep nets from scratch often resulted in poor convergence, which is why unsupervised layer-wise pretraining with RBMs was widely used to help the network learn useful representations before supervised fine-tuning.

We can verify that our architecture works using the following code:
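
Something like this works; the make_loader helper and the batch size are ours:

import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(df, batch_size=256, shuffle=False):
    # Features are the standardized return columns plus the January flag
    feature_cols = [c for c in df.columns if c.startswith('ret-')] + ['is_next_jan']
    X = torch.tensor(df[feature_cols].values, dtype=torch.float32)
    y = torch.tensor(df['target'].values, dtype=torch.long)
    return DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=shuffle)

train_df, val_df, test_df = next(train_val_test_split(processed))
X, y = next(iter(make_loader(train_df)))
print(X.shape, y.shape)   # expect torch.Size([256, 33]) and torch.Size([256])
print(FFNN()(X).shape)    # expect torch.Size([256, 2]) -- raw logits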

The code builds a PyTorch DataLoader from a DataFrame, converting features and labels to tensors for training. It retrieves one mini-batch, checks the shapes, and runs a forward pass through the FFNN to verify the model works as expected. We should see something like this:

Now, let's move to training.

Training

We now define the training loop used to optimize the model parameters using the training and validation sets:
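
A minimal version of such a loop (the epoch count and learning rate are illustrative):

import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=100, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best = {'val_loss': float('inf'), 'val_acc': 0.0, 'epoch': -1, 'state': None}

    for epoch in range(epochs):
        model.train()
        for X, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()

        # Evaluate on the validation set
        model.eval()
        val_loss, correct, total = 0.0, 0, 0
        with torch.no_grad():
            for X, y in val_loader:
                logits = model(X)
                val_loss += criterion(logits, y).item() * len(y)
                correct += (logits.argmax(dim=1) == y).sum().item()
                total += len(y)
        val_loss /= total

        # Keep the weights with the lowest validation loss seen so far
        if val_loss < best['val_loss']:
            best.update(val_loss=val_loss, val_acc=correct / total, epoch=epoch,
                        state=copy.deepcopy(model.state_dict()))

    model.load_state_dict(best['state'])
    return model, best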

The train function performs mini-batch training of the FFNN model using the Adam optimizer and cross-entropy loss. It tracks performance on both the training and validation sets, saving the model weights that yield the lowest validation loss. After training for a specified number of epochs, it returns the best-performing model along with key metrics such as validation loss, accuracy, and the epoch at which the best performance was achieved.

We now generate predictions and rank stocks by model confidence:
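
A sketch of this step; it assumes the test DataLoader preserves row order (shuffle=False) so predictions align with the DataFrame:

import pandas as pd
import torch

def predict(model, test_df, test_loader, n_quantiles=10):
    model.eval()
    preds, probs = [], []
    with torch.no_grad():
        for X, _ in test_loader:
            p = torch.softmax(model(X), dim=1)
            preds.append(p.argmax(dim=1))
            probs.append(p[:, 1])

    out = test_df.copy()
    out['pred'] = torch.cat(preds).numpy()
    out['prob'] = torch.cat(probs).numpy()
    # Cross-sectional ranking: bucket stocks by P(class 1) on each date
    out['quantile'] = out.groupby(level=0)['prob'].transform(
        lambda x: pd.qcut(x, n_quantiles, labels=False, duplicates='drop') + 1)
    return out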

The method above generates model predictions on the test set by applying the trained network to each mini-batch and collecting the predicted classes and class probabilities. It appends these outputs to the original test DataFrame and assigns each example to a quantile bucket based on its predicted probability of belonging to class 1. This enables cross-sectional ranking of stocks by confidence level, which is useful for evaluating strategy performance by signal strength.

We now run rolling-window training and evaluation to collect predictions and track model performance over time:
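
Putting the previous sketches together, the loop could look like this (the metric names are ours):

results, all_preds = [], []
for train_df, val_df, test_df in train_val_test_split(processed):
    model, best = train(FFNN(), make_loader(train_df, shuffle=True),
                        make_loader(val_df))
    all_preds.append(predict(model, test_df, make_loader(test_df)))
    results.append({'test_year': test_df.index.get_level_values(0)[0].year,
                    'val_loss': best['val_loss'], 'val_acc': best['val_acc'],
                    'best_epoch': best['epoch']})

performance = pd.DataFrame(results)
predictions = pd.concat(all_preds)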

Running the code above will take a bit over an hour. We should see something like this:

The DataFrame that tracks the performance should record something like this:

Results from training the model

Now, we are ready to see the results.

Results

Let's start reviewing the model's classification performance by analyzing the confusion matrix:
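
A minimal sketch, assuming the predictions DataFrame from the rolling loop above:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

cm = confusion_matrix(predictions['target'], predictions['pred'])
cm_norm = cm / cm.sum(axis=0, keepdims=True)  # column-normalize per predicted class
ConfusionMatrixDisplay(cm_norm).plot(values_format='.2f')
plt.show()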

This code compares the true labels and the model’s predicted classes across the entire test set to compute a confusion matrix, which is then column-normalized to show the distribution of predictions per class. This helps assess whether the model is biased toward one class and how well it distinguishes between them. The results are visualized using ConfusionMatrixDisplay, with values shown as proportions for easier interpretation.

We should see something like this:

Confusion matrix

Highlights:

  • The model is slightly better than random guessing, with ~52% accuracy for both predicted classes.

  • It shows no strong bias toward one class—performance is symmetrical.

  • This aligns with expectations for noisy financial datasets with low signal-to-noise ratios, where even a small edge (e.g., >50% accuracy) can be economically meaningful when scaled properly in a trading strategy.

We now compute the monthly long-short returns based on the spread between the top and bottom quantile predictions:
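
One way to build the monthly quantile-return table (column names follow the prediction step above):

dates = predictions.index.get_level_values(0)
quantile_rets = predictions.groupby([dates, 'quantile'])['next_month_ret_raw'] \
    .mean().unstack('quantile')

missing = quantile_rets.isna().any(axis=1).sum()
print(f'{missing} month(s) are missing one or more quantiles')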

This snippet computes the average forward return for each quantile bucket on a monthly basis by grouping predictions by date and quantile. It then unstacks the results into a table where each column represents a quantile. Finally, it prints how many months are missing one or more quantiles, which can occur due to duplicates='drop' in qcut, highlighting potential data sparsity in certain months.

We now visualize the annualized average return by quantile to assess the relationship between predicted confidence and realized performance:
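
A minimal sketch; annualizing by compounding the mean monthly return (as below) rather than simple scaling is a choice we make here:

import matplotlib.pyplot as plt

annualized = (1 + quantile_rets.mean()) ** 12 - 1
ax = annualized.plot(kind='bar')
ax.set_xlabel('Quantile (model confidence)')
ax.set_ylabel('Annualized mean return')
plt.show()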

This code calculates the mean return for each quantile across all months, annualizes it, and plots it as a bar chart. The chart highlights how returns vary by model confidence:

Mean return for each quantile

Highlights:

  • Returns increase monotonically from quantile 1 to quantile 10.

  • Quantile 1 has a negative return (-0.64%), indicating the model is correctly identifying poor-performing stocks.

  • Quantile 10 shows the highest return at 13.0%, suggesting strong predictive power in selecting top-performing stocks.

  • The spread between quantile 10 and quantile 1 is approximately 13.6 percentage points, which is economically meaningful.

  • The results are not as good as those reported in the paper, but they are still worth examining.

This pattern strongly supports the idea that the model’s confidence scores are informative, as shown by the authors in the paper. A long-short strategy (long top quantile, short bottom) would likely be profitable and consistent, as we can verify:
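
A minimal sketch of the long-short construction:

# Long the highest-confidence quantile, short the lowest, rebalanced monthly
long_short = (quantile_rets[quantile_rets.columns.max()]
              - quantile_rets[quantile_rets.columns.min()]).dropna()
cumulative_rets = (1 + long_short).cumprod()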

The snippet above generates the cumulative returns for the strategy. We can visualize it with one line, cumulative_rets.plot(logy=True), or if we add a bit more formatting and the benchmark:

Equity and drawdown curves

To get a summary of the backtest main stats, here's what we can do:
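
A minimal sketch; benchmark_rets is assumed to hold monthly S&P 500 returns on the same dates (not built here), and the Sharpe ratio below ignores the risk-free rate:

import numpy as np
import pandas as pd

def summary_stats(monthly_rets, benchmark_rets):
    # benchmark_rets: monthly benchmark returns aligned on the same dates
    equity = (1 + monthly_rets).cumprod()
    years = len(monthly_rets) / 12
    drawdown = equity / equity.cummax() - 1
    return pd.Series({
        'annualized return': equity.iloc[-1] ** (1 / years) - 1,
        'sharpe ratio': monthly_rets.mean() / monthly_rets.std() * np.sqrt(12),
        'max drawdown': drawdown.min(),
        'correlation to benchmark': monthly_rets.corr(benchmark_rets),
    })

print(summary_stats(long_short, benchmark_rets))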

We should see something like this:

Summary of backtest main stats

Overall, the strategy outperforms the benchmark across all key dimensions:

  • Annualized return is 12.8%, nearly double the S&P 500’s 7.0%;

  • Sharpe Ratio is 1.03, more than twice that of the benchmark, indicating superior risk-adjusted performance;

  • Maximum drawdown is limited to 24%, less than half of the S&P 500’s 52.6%, reflecting strong downside protection;

  • Correlation to the market is negative, offering valuable diversification benefits in a broader portfolio.

Conclusion

While our implementation successfully demonstrated the viability of applying deep learning to momentum-based stock prediction, the results we obtained were a far cry from those reported by Takeuchi and Lee (2013). There are several likely reasons for this discrepancy. First, we used a different and significantly smaller dataset, both in terms of the number of stock-month observations and market breadth. Second, our time horizon was more recent and extended into a different market regime, potentially less favorable to momentum strategies. Third, unlike the authors’ original approach which involved unsupervised pretraining using stacked RBMs, we trained the feedforward network end-to-end using modern initialization and optimization techniques. Additionally, we adopted a rolling-window training and evaluation protocol, whereas the original study trained once on a large historical sample and tested on a fixed out-of-sample window. These methodological and structural differences, combined with evolving market dynamics, likely account for the performance gap.

Nevertheless, replicating the exact results reported by Takeuchi and Lee was never the true goal. The purpose of this project was to explore, in a simplified and more accessible setting, the core insight behind Richard Sutton’s Bitter Lesson: that general methods powered by computation outperform systems infused with human-crafted rules and domain expertise. Our toy example, despite its simplicity and the constraints we faced, served as a proof of concept for this idea. By throwing a clean pipeline, raw price data, and compute at the return prediction problem—without relying on hand-engineered financial indicators—we were able to extract signal and generate meaningful returns. In that sense, the experiment achieved what it set out to do: demonstrate that scaling learning systems and leaning into computation-first approaches can produce viable results even in challenging, noisy domains like financial markets.
