
Powering Quant Finance with Qlib’s PyTorch MLP on Alpha360

5 min read
Vadim Nicolai
Senior Software Engineer at Vitrifi

Introduction

Qlib is an AI-oriented, open-source platform from Microsoft that streamlines the end-to-end quantitative research workflow. Because it builds on PyTorch, Qlib can integrate modern neural networks, such as Multi-Layer Perceptrons (MLPs), to process large datasets, engineer alpha factors, and run flexible backtests. In this post, we focus on a PyTorch MLP pipeline for Alpha360 data in the US market, examining a single YAML configuration that unifies data ingestion, model training, and performance evaluation.


PyTorch MLP on Alpha360: Overview

Below is the relevant YAML from workflow_config_mlp_Alpha360.yaml. A single qrun command triggers Qlib to:

  1. Initialize a US market environment.
  2. Load and preprocess daily data factors from Alpha360 (2008–2020).
  3. Train a PyTorch MLP (DNNModelPytorch) with 360 input features.
  4. Backtest a top-k strategy with transaction costs.
  5. Analyze signals, rank IC, and other alpha metrics.

The Command

qrun workflow_config_mlp_Alpha360.yaml
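
Before launching the full workflow, it helps to confirm that Qlib can read your local US data bundle. A minimal sanity check (the provider path and ticker below are placeholders for your own setup):

import qlib
from qlib.data import D

# Placeholder path: point at the same us_data bundle as the YAML below.
qlib.init(provider_uri="~/.qlib/qlib_data/us_data", region="us")

# Pull a few days of closes for one ticker; if rows print, the data
# layer is wired up and qrun should at least get past initialization.
df = D.features(["AAPL"], ["$close"], start_time="2019-01-01", end_time="2019-01-10")
print(df.head())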

YAML Breakdown

qlib_init:
    provider_uri: "/Users/vadimnicolai/Public/work/qlib-cookbook/.data/us_data"
    region: us
    kernels: 1

market: &market sp500
benchmark: &benchmark ^GSPC

data_handler_config: &data_handler_config
    start_time: 2008-01-01
    end_time: 2020-08-01
    fit_start_time: 2008-01-01
    fit_end_time: 2014-12-31
    instruments: *market
    infer_processors:
        - class: RobustZScoreNorm
          kwargs:
              fields_group: feature
              clip_outlier: true
        - class: Fillna
          kwargs:
              fields_group: feature
    learn_processors:
        - class: DropnaLabel
        - class: CSRankNorm
          kwargs:
              fields_group: label
    label: ["Ref($close, -2) / Ref($close, -1) - 1"]

port_analysis_config: &port_analysis_config
    strategy:
        class: TopkDropoutStrategy
        module_path: qlib.contrib.strategy
        kwargs:
            signal: <PRED>
            topk: 50
            n_drop: 5
    backtest:
        start_time: 2017-01-01
        end_time: 2020-08-01
        account: 100000000
        benchmark: *benchmark
        exchange_kwargs:
            limit_threshold: 0.095
            deal_price: close
            open_cost: 0.0005
            close_cost: 0.0015
            min_cost: 5

task:
    model:
        class: DNNModelPytorch
        module_path: qlib.contrib.model.pytorch_nn
        kwargs:
            batch_size: 1024
            max_steps: 4000
            loss: mse
            lr: 0.002
            optimizer: adam
            GPU: 0
            pt_model_kwargs:
                input_dim: 360
    dataset:
        class: DatasetH
        module_path: qlib.data.dataset
        kwargs:
            handler:
                class: Alpha360
                module_path: qlib.contrib.data.handler
                kwargs: *data_handler_config
            segments:
                train: [2008-01-01, 2014-12-31]
                valid: [2015-01-01, 2016-12-31]
                test: [2017-01-01, 2020-08-01]
    record:
        - class: SignalRecord
          module_path: qlib.workflow.record_temp
          kwargs:
              model: <MODEL>
              dataset: <DATASET>
        - class: SigAnaRecord
          module_path: qlib.workflow.record_temp
          kwargs:
              ana_long_short: False
              ann_scaler: 252
        - class: PortAnaRecord
          module_path: qlib.workflow.record_temp
          kwargs:
              config: *port_analysis_config
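
If you prefer the Python API to qrun, the task section maps almost one-to-one onto Qlib's init_instance_by_config. Below is a rough sketch modeled on Qlib's public workflow_by_code example; the provider path is a placeholder, and the handler's processor lists are omitted for brevity (Alpha360 then applies its own defaults, which differ slightly from the YAML above):

import qlib
from qlib.utils import init_instance_by_config

# Placeholder path: point at the same US data bundle as the YAML.
qlib.init(provider_uri="~/.qlib/qlib_data/us_data", region="us")

dataset = init_instance_by_config({
    "class": "DatasetH",
    "module_path": "qlib.data.dataset",
    "kwargs": {
        "handler": {
            "class": "Alpha360",
            "module_path": "qlib.contrib.data.handler",
            "kwargs": {
                "start_time": "2008-01-01",
                "end_time": "2020-08-01",
                "fit_start_time": "2008-01-01",
                "fit_end_time": "2014-12-31",
                "instruments": "sp500",
            },
        },
        "segments": {
            "train": ("2008-01-01", "2014-12-31"),
            "valid": ("2015-01-01", "2016-12-31"),
            "test": ("2017-01-01", "2020-08-01"),
        },
    },
})

model = init_instance_by_config({
    "class": "DNNModelPytorch",
    "module_path": "qlib.contrib.model.pytorch_nn",
    "kwargs": {
        "batch_size": 1024, "max_steps": 4000, "loss": "mse",
        "lr": 0.002, "optimizer": "adam", "GPU": 0,
        "pt_model_kwargs": {"input_dim": 360},
    },
})

model.fit(dataset)             # train on the train/valid segments
pred = model.predict(dataset)  # score the test segment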

Detailed Explanation

  1. Initialization (qlib_init)

    • region: us → Enforces US-like trading rules and cost assumptions.
    • kernels: 1 → Runs Qlib's data processing single-threaded, avoiding clashes with PyTorch’s own thread pools.
  2. Data Handler

    • Alpha360: A factor set built from the past 60 days of daily bars (6 price/volume fields × 60 days = 360 features).
    • Time Windows:
      • Full data: start_time=2008-01-01 ~ end_time=2020-08-01
      • Fit set: fit_start_time=2008-01-01 ~ fit_end_time=2014-12-31
    • Infer Processors:
      • RobustZScoreNorm → Scales features robustly (less sensitive to outliers).
      • Fillna → Fills remaining missing feature values instead of dropping rows.
    • Learn Processors:
      • DropnaLabel → Removes rows if target is NaN.
      • CSRankNorm → Rank-normalizes labels cross-sectionally within each trading day (see the sketch after this list).
  3. Task Model: DNNModelPytorch

    • loss: mse → Minimizes mean squared error against the label Ref($close, -2) / Ref($close, -1) - 1, i.e., the one-day return from the close of day t+1 to the close of day t+2.
    • batch_size: 1024, max_steps: 4000 → Medium-scale training loop.
    • GPU: 0 → Uses CUDA device 0 when available; on machines without CUDA (such as Apple Silicon Macs), training falls back to CPU.
    • pt_model_kwargs.input_dim=360 → Matches the factor dimension from Alpha360.
  4. Dataset

    • Splits:
      • Train: 2008–2014
      • Valid: 2015–2016
      • Test: 2017–2020
    • This forward-time partition prevents look-ahead bias.
  5. Record

    • SignalRecord: Saves the MLP’s predictions for further analysis.
    • SigAnaRecord: Analyzes correlation (IC, Rank IC) with real returns.
    • PortAnaRecord: Backtests the top-50 “dropout” strategy (hold 50 names, replacing the 5 weakest each rebalance) with the cost model above.
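
To make the label and CSRankNorm concrete, here is a small plain-pandas illustration of their semantics (a sketch with made-up numbers, not Qlib’s internal code):

import pandas as pd

# Label: Ref($close, -2) / Ref($close, -1) - 1
# For day t this is close[t+2] / close[t+1] - 1, i.e. the one-day
# return realized from the close of t+1 to the close of t+2.
closes = pd.Series([100.0, 102.0, 101.0, 103.0, 104.0])
label = closes.shift(-2) / closes.shift(-1) - 1
print(label)  # e.g. label[0] = 101/102 - 1; the last two entries are NaN

# CSRankNorm: within each trading day, rank labels across instruments
# to percentiles, then center and rescale toward zero mean and unit
# variance (Qlib ranks with pct=True, subtracts 0.5, multiplies by ~3.46).
df = pd.DataFrame({
    "datetime": ["2020-01-02"] * 3 + ["2020-01-03"] * 3,
    "label": [0.010, -0.020, 0.005, 0.030, 0.000, -0.010],
})
pct = df.groupby("datetime")["label"].rank(pct=True)
df["label_norm"] = (pct - 0.5) * 3.46
print(df)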

Running & Logs

When running:

qrun workflow_config_mlp_Alpha360.yaml

You might see logs like:

[8262:MainThread](...) INFO - qlib.Initialization - ...
ModuleNotFoundError. XGBModel...
[8262:MainThread](...) INFO - DNNModelPytorch - DNN parameters setting:
lr : 0.002
max_steps : 4000
batch_size : 1024
...
Time cost: 114.175s | Loading data Done
RuntimeWarning: All-NaN slice encountered
RobustZScoreNorm Done
DropnaLabel Done
CSRankNorm Done
Time cost: 85.667s | fit & process data Done
Time cost: 199.844s | Init data Done
[1] 8262 segmentation fault qrun ...

Why Segfault?

  • Most likely a concurrency or memory conflict:
    1. A large batch size combined with the 360-feature input set.
    2. Known PyTorch threading issues on Mac ARM (Apple Silicon).
    3. Running single-threaded (kernels: 1) sometimes fixes it, but you might still need to reduce batch_size or update PyTorch.
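
One low-cost mitigation to try first (an assumption about the cause, not a guaranteed fix) is capping thread counts yourself before any Qlib or PyTorch code runs:

import os

# Set before importing torch: caps OpenMP/MKL worker threads, which
# are a common source of oversubscription crashes on Apple Silicon.
os.environ["OMP_NUM_THREADS"] = "1"

import torch

torch.set_num_threads(1)  # limit PyTorch's intra-op parallelism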

If it completes:

{'IC': 0.0045, 'ICIR': 0.0559, 'Rank IC': 0.0048, 'Rank ICIR': 0.0607}
'The following are analysis results of the excess return with cost(1day).'
mean -0.000280
annualized_return -0.066664
information_ratio -0.940238
...
  • IC ~ 0.0045: Very low predictive correlation → might need more advanced networks or features.
  • Negative annualized return: Trading costs or frequent portfolio turnover can overshadow small alpha signals.
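
For context on what these numbers mean: IC is the average per-day correlation between predictions and realized labels, and ICIR is that mean divided by its standard deviation. A toy recomputation on random placeholder data (SigAnaRecord does the real computation internally):

import numpy as np
import pandas as pd

# Toy prediction table: (datetime, instrument) MultiIndex with random
# placeholder scores and labels, mimicking SignalRecord's output shape.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-02", "2020-01-03"]), ["AAPL", "MSFT", "GOOG"]],
    names=["datetime", "instrument"],
)
rng = np.random.default_rng(0)
df = pd.DataFrame({"score": rng.normal(size=6), "label": rng.normal(size=6)}, index=idx)

def daily_ic(frame: pd.DataFrame, rank: bool = False) -> pd.Series:
    # Pearson correlation per day for IC; Spearman (rank) for Rank IC.
    method = "spearman" if rank else "pearson"
    return frame.groupby(level="datetime").apply(
        lambda day: day["score"].corr(day["label"], method=method)
    )

ic, rank_ic = daily_ic(df), daily_ic(df, rank=True)
print({"IC": ic.mean(), "ICIR": ic.mean() / ic.std(),
       "Rank IC": rank_ic.mean(), "Rank ICIR": rank_ic.mean() / rank_ic.std()})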

Conclusion

Utilizing Qlib’s PyTorch MLP on Alpha360 is a powerful demonstration of an end-to-end AI quant workflow—from robust data transformations to neural net training and a straightforward top-k backtest. The single YAML config (workflow_config_mlp_Alpha360.yaml) encapsulates every step, streamlining reproducibility and collaboration.

Despite potential concurrency hiccups or segmentation faults, this pipeline highlights how advanced deep-learning methods can be integrated into classical alpha factor strategies. By fine-tuning hyperparameters, expanding feature sets, and managing data concurrency, researchers can push the boundaries of machine-driven alpha discovery—and tackle real-world challenges in quantitative finance more effectively.


Further Resources:

  1. Qlib on GitHub
  2. PyTorch Neural Networks in Qlib Docs
  3. Alpha360 Factor Library Explanation
  4. Advanced Example: Rolling Training and Tuning

Pro Tip: If segmentation faults persist, consider installing PyTorch in a conda environment with pinned versions, or run with smaller subsets of data/batch sizes. This keeps your pipeline stable and ready for iteration on more complex deep-learning models.