
Synthetic Backtesting Data


One of the techniques we use at MCI to develop robust strategies is backtesting over extended periods of time, which lets us model strategy performance across a wider variety of market conditions. This article explains the method we use to create synthetic data sets for backtesting purposes.

The Problem

Several of the ETFs (SPXL, TQQQ, TNA, for example) that we use for trading in our strategies did not exist prior to 2008, which limits our ability to perform backtesting over as wide a range of time as we would like.

The Solution: Synthetic Data

Fortunately, the behavior of the ETFs we want to simulate is well understood and easily modeled, because each one seeks to multiply the daily percentage moves of the market index it is tied to. We therefore employ a commonly used mathematical technique, linear regression, to build the simulated data. We analyze the relationship between the ETF price and its associated market index over the period for which market data is available, and then use that relationship to extrapolate the data points we need for the period before the ETF existed.

Creating the Data

We will demonstrate the technique we used to create the synthetic data for SPXL, which is a 3X leveraged ETF that tracks the S&P 500 index. We use SPXL extensively in our trading strategies, so there was a definite need to have backtesting data available for this product. This same technique will be used to build data sets for some of the other ETFs we use that also came into existence relatively recently. SPXL came into existence on November 5th, 2008, so we will want to generate a set of synthetic data prior to that date, and as far back as we deem necessary to capture a wide variety of market conditions for our backtesting scenarios.

S&P 500 market data is downloaded into a spreadsheet with open, high, low, and closing prices. We add calculations for the % change from open to high (%Chg_OH), the % change from open to low (%Chg_OL), the range as a % of the open price (%Range), the % change from the previous close to the close (%Chg_CC), and the % change from open to close (%Chg_OC). For example, the data for 11/5/2008 looks like this:


Market data for SPXL is also downloaded into the same spreadsheet. The SPXL data for 11/5/2008 looks like this:


We calculate the %Chg_CC of SPXL and of the S&P 500 for each date from 11/5/2008 to the present. A best-fit straight line through these paired values gives a multiplier (the slope) and an offset (the intercept). Applying them to the S&P 500 %Chg_CC values for each date back to 1/2/1980 lets us simulate the closing prices of SPXL prior to 11/5/2008.

A linear regression fit of the spreadsheet data produced the following:

Slope of SPXL Fit = 2.887873892
Intercept of SPXL Fit = -0.000285083
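As a rough illustration of how such a fit can be obtained outside a spreadsheet, the sketch below computes close-to-close changes and an ordinary least-squares line in plain Python. The function names and the price series are made up for the example (they are not the actual market data), and the changes are assumed to be expressed as fractions, so 0.01 means 1%.

```python
def pct_change_cc(closes):
    """Fractional close-to-close change for each day after the first."""
    return [closes[i] / closes[i - 1] - 1.0 for i in range(1, len(closes))]


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical index closes, for illustration only.
index_closes = [100.0, 101.0, 99.5, 100.2, 102.0, 101.1]

# Toy ETF that moves exactly 3x the index each day (an idealization).
etf_closes = [50.0]
for r in pct_change_cc(index_closes):
    etf_closes.append(etf_closes[-1] * (1.0 + 3.0 * r))

slope, intercept = fit_line(pct_change_cc(index_closes), pct_change_cc(etf_closes))
print(f"slope={slope:.6f} intercept={intercept:.6f}")
```

With real data the points do not fall exactly on a line, which is why the fitted slope reported above differs from 3 and the intercept differs from zero.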

We see that the slope is fairly close to the nominal 3X relationship of SPXL to the S&P 500. However, it should be noted that the 3X leverage of SPXL is maintained only on a daily-close basis and is not guaranteed to hold over an extended period of time. For more information on SPXL, please see the Direxion funds website.
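A small worked example may help show why daily 3X leverage does not translate into 3X over longer periods. The sketch below compounds two hypothetical daily index moves (+10%, then -10%) for the index and for an idealized fund that delivers exactly 3X each day; fees, financing costs, and tracking error are ignored, and the `compound` helper is invented for this illustration.

```python
def compound(daily_returns, leverage=1.0):
    """Total fractional return from compounding leveraged daily returns."""
    value = 1.0
    for r in daily_returns:
        value *= 1.0 + leverage * r
    return value - 1.0


# Hypothetical two-day sequence: up 10%, then down 10%.
index_days = [0.10, -0.10]

index_total = compound(index_days)                    # 1.1 * 0.9 - 1
leveraged_total = compound(index_days, leverage=3.0)  # 1.3 * 0.7 - 1

print(f"index: {index_total:.2%}, 3x daily: {leveraged_total:.2%}")
```

Here the index ends down about 1% while the idealized 3X fund ends down about 9%, not 3%, because the multiplier is applied to each day's move separately.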

For each row in the spreadsheet, starting with 1/2/1980, we calculate an estimated %Chg_CC value for SPXL based on the %Chg_CC of the S&P500, using the equation:

%Chg_CC_SPXL = %Chg_CC_SP500 * 2.887873892 - 0.000285083
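To make the equation concrete, here is a minimal sketch that applies it to a single value. It assumes the changes are stored as fractions (0.01 for a 1% move), which the article does not state explicitly; the function name is hypothetical.

```python
SLOPE = 2.887873892       # fitted slope from the article
INTERCEPT = -0.000285083  # fitted intercept from the article


def estimate_spxl_chg_cc(sp500_chg_cc):
    """Estimated SPXL close-to-close change from the S&P 500 change."""
    return sp500_chg_cc * SLOPE + INTERCEPT


# Hypothetical day on which the S&P 500 closes up 1%.
estimate = estimate_spxl_chg_cc(0.01)
print(f"{estimate:.8f}")  # 0.02859366, i.e. about +2.86%
```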

Next, for each row in the spreadsheet, starting with 11/5/2008 and working backwards, we enter a formula in the SPXL opening price cell that sets it equal to the previous day’s close.

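Then, for each row starting with 11/4/2008 and working backwards, we estimate a closing price for SPXL using the equation:

SPXL_C(n) = SPXL_C(n+1) / (1 + %Chg_CC_SPXL(n+1))

where n+1 denotes the following trading day, so that %Chg_CC_SPXL(n+1) is the estimated change from the close of day n to the close of day n+1.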
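The backward fill of closes and opens can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the anchor price of 100.0, and the daily changes are all made up, and each change is taken to be the fractional move from the prior day's close to that day's close.

```python
def backfill_closes(anchor_close, chg_cc):
    """Rebuild closes backwards from a known final close.

    chg_cc[k] is the fractional change from the close of day k-1 to the
    close of day k; chg_cc[0] is unused because day 0 has no prior close.
    Returns closes in chronological order, ending with anchor_close.
    """
    closes = [0.0] * len(chg_cc)
    closes[-1] = anchor_close
    for k in range(len(chg_cc) - 1, 0, -1):
        closes[k - 1] = closes[k] / (1.0 + chg_cc[k])
    return closes


def opens_from_closes(closes):
    """Set each open to the previous day's close (first day has none)."""
    return [None] + closes[:-1]


# Hypothetical estimated changes for four consecutive days; the last day
# plays the role of the first real trading day with a known close.
chg_cc = [0.0, 0.03, -0.02, 0.01]
closes = backfill_closes(100.0, chg_cc)
opens = opens_from_closes(closes)

for day, (o, c) in enumerate(zip(opens, closes)):
    print(day, o, round(c, 4))
```

Compounding the rebuilt closes forward with the same changes returns the anchor close, which is a convenient sanity check on the spreadsheet formulas.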

The daily high and low values for SPXL will be simulated as follows for each row of data:

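SPXL_H = (1 + 2.887873892 * %Chg_SP500_OH) * SPXL_O
SPXL_L = (1 + 2.887873892 * %Chg_SP500_OL) * SPXL_O

Here %Chg_SP500_OH and %Chg_SP500_OL are the S&P 500 %Chg_OH and %Chg_OL values for that row, and SPXL_O is the simulated SPXL opening price.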
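As a quick check of the high and low formulas, the sketch below evaluates them for one hypothetical row. The open price and the index moves are invented for the example, the changes are assumed to be fractions, and the function name is not from the article.

```python
SLOPE = 2.887873892  # fitted slope from the article


def simulate_high_low(spxl_open, sp500_chg_oh, sp500_chg_ol):
    """Scale the index's open-to-high and open-to-low moves by the slope."""
    high = (1.0 + SLOPE * sp500_chg_oh) * spxl_open
    low = (1.0 + SLOPE * sp500_chg_ol) * spxl_open
    return high, low


# Hypothetical row: simulated open of 50.0, index high 1% above its open,
# index low 2% below its open.
high, low = simulate_high_low(50.0, 0.01, -0.02)
print(round(high, 4), round(low, 4))  # 51.4439 47.1121
```

Note that only the slope is applied here, without the intercept, matching the formulas as written in the article.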

The adjusted close price is used for SPXL because it accounts for dividends, splits, and similar events. To calculate it, we determine the pre-11/5/2008 adjustment factor by dividing the closing price on 11/5/2008 by the downloaded adjusted close price on 11/5/2008. That value is 3.531635595, so each adjusted close price prior to 11/5/2008 is the simulated close price divided by 3.531635595.
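The adjustment step amounts to one division per row, as the following sketch shows. The helper names and the sample prices passed to them are hypothetical; only the factor 3.531635595 comes from the article.

```python
ADJUSTMENT_FACTOR = 3.531635595  # close / adjusted close on 11/5/2008, per the article


def adjustment_factor(close, adjusted_close):
    """Ratio of the raw close to the adjusted close on the anchor date."""
    return close / adjusted_close


def to_adjusted_close(simulated_close, factor=ADJUSTMENT_FACTOR):
    """Convert a simulated close into a simulated adjusted close."""
    return simulated_close / factor


# Hypothetical anchor-date prices, just to show how the ratio is formed.
example_factor = adjustment_factor(70.0, 20.0)

# Hypothetical simulated close chosen so the result is easy to verify.
example_adjusted = to_adjusted_close(35.31635595)
print(example_factor, round(example_adjusted, 6))
```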


The use of a simple mathematical technique allows us to create a synthetic set of data for several popular leveraged ETFs that are extensively used in our trading strategies. Because the relationship between these ETFs and their underlying index is well understood, we can create a reliable proxy for data that would have existed prior to the inception of these funds, thereby allowing us to do more exhaustive backtesting of our strategies over a larger set of market conditions.

Copyright © 2018, Market Chronologix, Inc.
