ML Models Need Better Training Data: The GenAI Solution

Our understanding of financial markets is inherently constrained by historical experience: a single realized timeline among the many that could have unfolded. Each market cycle, geopolitical event, or policy decision represents just one manifestation of the potential outcomes.

This limitation becomes particularly acute when training machine learning (ML) models, which can inadvertently learn from historical artifacts rather than underlying market dynamics. As complex ML models become more prevalent in investment management, their tendency to overfit to specific historical conditions poses a growing risk to investment outcomes.

Generative AI-based synthetic data (GenAI synthetic data) has emerged as a potential solution to this challenge. While GenAI has so far drawn attention primarily for natural language processing, its ability to generate sophisticated synthetic data may prove even more valuable for quantitative investment processes. By creating data that effectively represents “parallel timelines,” this approach can be designed to provide richer training datasets that explore counterfactual scenarios while preserving critical market relationships.

The Challenge: Moving Beyond Single-Timeline Training

Traditional quantitative models face an inherent limitation: they learn from the single sequence of events that led to current conditions. This creates what is called “empirical bias.” The challenge is more pronounced for complex machine learning models, whose capacity to learn intricate patterns makes them particularly vulnerable to overfitting on limited historical data. An alternative approach is to consider counterfactual scenarios: those that might have unfolded had certain, perhaps arbitrary, events, decisions, or shocks played out differently.

To illustrate these concepts, consider active international equity portfolios benchmarked to the MSCI EAFE. Figure 1 shows the performance characteristics of multiple portfolios (upside capture, downside capture, and overall relative returns) over the five years ending January 31, 2025.

Figure 1: Empirical data. EAFE-benchmarked portfolios, performance characteristics over the five years ending January 31, 2025.
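For readers less familiar with the performance characteristics named above, the following is a minimal sketch of how upside and downside capture might be computed from paired periodic returns. It assumes a simple average-return convention (compounded variants are also in use), and the function name `capture_ratios` and the return figures are hypothetical, not taken from the portfolios in Figure 1.

```python
def capture_ratios(portfolio, benchmark):
    """Return (upside_capture, downside_capture) from paired periodic returns.

    Assumes a simple average-return convention; compounded variants also exist.
    """
    if len(portfolio) != len(benchmark):
        raise ValueError("return series must have the same length")

    # Split the periods by the sign of the benchmark return.
    up = [(p, b) for p, b in zip(portfolio, benchmark) if b > 0]
    down = [(p, b) for p, b in zip(portfolio, benchmark) if b < 0]

    def ratio(pairs):
        if not pairs:
            return float("nan")
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        mean_b = sum(b for _, b in pairs) / len(pairs)
        return mean_p / mean_b

    return ratio(up), ratio(down)


# Hypothetical monthly returns, for illustration only.
benchmark = [0.02, -0.01, 0.03, -0.02]
portfolio = [0.03, -0.005, 0.03, -0.01]
upside, downside = capture_ratios(portfolio, benchmark)
```

Under this convention, an upside capture above 1 and a downside capture below 1 together indicate a portfolio that gained more than the benchmark in rising periods and lost less in falling ones.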

This empirical dataset represents a small sample of possible portfolios, and an even smaller sample of the outcomes that could have occurred had events unfolded differently. Traditional approaches to expanding this dataset have significant limitations.

Figure 2: Instance-based approaches: k-nearest neighbors, K-NN (left), SMOTE (right).

Traditional Synthetic Data: Understanding the Limitations

Traditional methods of synthetic data generation attempt to address data limitations, but they often fail to capture the complex dynamics of financial markets. Using the EAFE portfolio example, we can examine how different approaches perform.

Instance-based methods such as K-NN and SMOTE extend existing data patterns through local sampling, but they remain fundamentally constrained by the relationships in the observed data. They cannot generate scenarios far beyond their training examples, which limits their utility for understanding potential future market conditions.
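The constraint described above can be seen in a small sketch. The code below is a simplified illustration in the spirit of SMOTE (interpolating between an observed point and one of its nearest neighbors, without class labels), not a reproduction of the method behind Figure 2. The function name `smote_like_samples` and the data points are hypothetical.

```python
import math
import random


def smote_like_samples(data, n_samples, k=3, seed=0):
    """Create synthetic points by interpolating toward a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_samples):
        base = rng.choice(data)
        # k nearest neighbors of the base point, excluding the point itself.
        others = [p for p in data if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical (upside capture, downside capture) pairs, for illustration only.
observed = [(1.05, 0.95), (0.98, 0.90), (1.10, 1.02), (0.92, 0.88), (1.01, 0.97)]
synthetic = smote_like_samples(observed, n_samples=200)

# Interpolation never leaves the range spanned by the observed data
# (eps only absorbs floating-point rounding).
eps = 1e-12
lo = [min(p[i] for p in observed) for i in range(2)]
hi = [max(p[i] for p in observed) for i in range(2)]
inside = all(
    lo[i] - eps <= s[i] <= hi[i] + eps for s in synthetic for i in range(2)
)
```

Because every synthetic point lies on a segment between two observed points, `inside` is always true: the method densifies the observed region but cannot produce a scenario outside it.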

Figure 3: More flexible approaches generally improve results but still struggle to capture complex market relationships: GMM (left), KDE (right).

Traditional synthetic data generation approaches, whether instance-based methods or density estimation, face fundamental limitations. They can extend patterns incrementally, but they cannot generate realistic market scenarios that explore genuinely different market conditions while preserving complex interrelationships. This limitation becomes particularly clear when examining the density estimation approaches.

Density estimation approaches such as Gaussian mixture models (GMM) and kernel density estimation (KDE) offer more flexibility in extending data patterns, but they still struggle to capture the complex, interconnected dynamics of financial markets. These methods particularly falter during regime changes, when historical relationships may evolve.
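To make the density estimation idea concrete, here is a minimal sketch of sampling from a Gaussian KDE: pick an observed point at random, then add Gaussian noise scaled by the bandwidth. The function name `kde_samples`, the fixed isotropic bandwidth, and the data points are assumptions chosen for illustration, not the configuration behind Figure 3.

```python
import math
import random


def kde_samples(data, n_samples, bandwidth=0.02, seed=0):
    """Draw from a Gaussian KDE: choose an observed point, then add noise."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        center = rng.choice(data)
        samples.append(tuple(rng.gauss(c, bandwidth) for c in center))
    return samples


# Hypothetical (upside capture, downside capture) pairs, for illustration only.
observed = [(1.05, 0.95), (0.98, 0.90), (1.10, 1.02), (0.92, 0.88), (1.01, 0.97)]
synthetic = kde_samples(observed, n_samples=500)

# Largest distance from any synthetic point to its closest observed point.
max_gap = max(min(math.dist(s, p) for p in observed) for s in synthetic)
```

Unlike interpolation, these samples can drift slightly outside the observed range, but `max_gap` stays on the order of a few bandwidths. Each synthetic point remains a noisy copy of a historical one, which is one way to see why such methods have trouble representing a genuinely different regime.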

GenAI Synthetic Data: More Powerful Training

Recent research at City St Georges and the University of Warwick, presented at the NYU ACM International Conference on AI in Finance (ICAIF), shows how GenAI could potentially better approximate the underlying data-generating process of markets. Through neural network architectures, this approach aims to learn conditional distributions while preserving persistent market relationships.

The Research and Policy Center (RPC) will soon publish a report that defines synthetic data and outlines the generative AI approaches that can be used to create it. The report will highlight the best methods for assessing the quality of synthetic data and draw on the existing academic literature to highlight potential use cases.

Figure 4: Illustration of GenAI synthetic data expanding the space of realistic possible outcomes while maintaining key relationships.

This approach to synthetic data generation offers several potential benefits:

Expanded training sets: Realistic augmentation of limited financial datasets
Scenario exploration: Generation of plausible market conditions while maintaining persistent relationships
Tail event analysis: Creation of diverse but realistic stress scenarios

As shown in Figure 4, the GenAI synthetic data approach aims to expand the space of possible portfolio performance characteristics while respecting fundamental market relationships and realistic bounds. This provides a richer training environment for machine learning models, potentially reducing their vulnerability to historical artifacts and improving their ability to generalize across market conditions.
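One simple way to check whether a synthetic dataset respects a relationship present in the real data is to compare a summary statistic across the two. The sketch below compares the Pearson correlation between two performance characteristics. It is only an illustrative sanity check under stated assumptions, not the evaluation method used in the research cited above, and the names `pearson`, `relationship_gap`, and all data points are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def relationship_gap(real, synthetic):
    """Absolute difference in correlation between real and synthetic pairs."""
    r_real = pearson([p[0] for p in real], [p[1] for p in real])
    r_syn = pearson([p[0] for p in synthetic], [p[1] for p in synthetic])
    return abs(r_real - r_syn)


# Hypothetical (upside capture, downside capture) pairs, for illustration only.
real = [(1.05, 0.95), (0.98, 0.90), (1.10, 1.02), (0.92, 0.88), (1.01, 0.97)]
# A synthetic set that keeps the positive link between the two measures...
faithful = [(1.04, 0.96), (0.97, 0.89), (1.12, 1.03), (0.93, 0.87), (1.00, 0.95)]
# ...and one with the same marginal values but the pairing scrambled.
scrambled = [(1.04, 0.87), (0.97, 1.03), (1.12, 0.89), (0.93, 0.96), (1.00, 0.95)]

gap_faithful = relationship_gap(real, faithful)
gap_scrambled = relationship_gap(real, scrambled)
```

Here `gap_faithful` is small while `gap_scrambled` is large, even though both synthetic sets contain the same individual values. A single correlation is a coarse test; a fuller assessment would likely look at more than one statistic.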

Implementation in Security Selection

GenAI synthetic data offers three potential benefits for security selection models, which are especially prone to learning spurious historical patterns.

Reduced overfitting: Training on a variety of market conditions may help models better distinguish between persistent signals and temporary artifacts.
Enhanced tail risk management: More diverse scenarios in the training data could improve model robustness during market stress.
Better generalization: Expanded training data that maintains realistic market relationships may help models adapt to changing conditions.

Implementing effective GenAI synthetic data generation presents its own technical challenges, potentially surpassing the complexity of the investment models themselves. However, our research suggests that addressing these challenges could significantly improve risk-adjusted returns through more robust model training.

The GenAI Path to Better Model Training

GenAI synthetic data could provide more powerful, forward-looking insights for investment and risk models. Through neural network-based architectures, it aims to better approximate the market's data-generating process, potentially allowing a more accurate representation of future market conditions while preserving persistent interrelationships.

While this could benefit most investment and risk models, a key reason it represents such an important innovation right now is the increasing adoption of machine learning in investment management and the related risk of overfitting. GenAI synthetic data can generate plausible market scenarios that preserve complex relationships while exploring a variety of conditions. This technology offers a path to more robust investment models.

However, even the most sophisticated synthetic data cannot compensate for naive machine learning implementations. There is no safe fix for excessive complexity, opaque models, or weak investment rationales.

The Research and Policy Center will host a webinar tomorrow, March 18, featuring Marcos López de Prado, a world-renowned expert in financial machine learning and quantitative research.
