Alternative Data for Trading – Statistical Arbitrage

There are many ways to trade but only a few ways to gain an edge. One is speed: getting access to information quickly or acting on it quickly (think HFTs). The other is the ability to process data and extract signals better (MFT). Alternative data for trading offers an informational edge and forms the basis for a new age of statistical arbitrage.

There is no single market secret to discover, no single correct way to trade the markets. Those seeking the one true answer to the markets haven’t even gotten as far as asking the right question, let alone getting the right answer.
Jack Schwager – Author of Market Wizards

Statistical Edge

The second part, which is the focus of this article, is about processing the existing data better or gaining access to alternative sources, that is, data not related to price action.

Either process the existing information better,
or access information that is not widely used to gain an edge.

Trading edges have become harder to find over the years. It is getting difficult to squeeze more juice out of the same price action data, and with the growth of digitisation there are a number of really interesting alternative data sets available. A comprehensive collection of vendors selling alternative data sources is available here.

So, you have received a signal and need to check its efficacy against a well-known instrument. How do you go about doing it? This is an experimental case study and does not use any data I have worked on commercially.

This is what our sample data set looks like

We need to determine whether the signal is in any way useful. There are a few key steps to remember before checking the signal quality. A quick data verification comes first, and then we should be able to move ahead.

Data Quality Checks

The data looks fine except for a couple of obvious errors, such as the -150 closing price. We need an automated way to catch and fix such obvious errors, along with any others.

To deal with the data quality checks, I computed a couple of extra metrics:

The rolling mean of the last 5-20 data points.
The rolling standard deviation of the last 5-20 data points.
Lower and upper barriers around the rolling mean, using 6 sigma as the threshold.
A flag for any point lying more than six sigma from the rolling mean.

Interestingly, this seems to have caught the errors correctly. We go ahead and fix them by replacing the erroneous data points with the average values.
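The flag-and-replace procedure can be sketched roughly as follows. This is a minimal illustration with pandas, not the code used for the case study: the function name `clean_outliers`, the 20-point window and the synthetic price series are all assumptions. One further assumption is worth calling out: the rolling statistics are taken from the previous points only (via `shift(1)`), because a point that sits inside its own 20-point window can never be more than about 4.25 sample standard deviations from that window's mean, so a six-sigma test would never fire.

```python
import numpy as np
import pandas as pd


def clean_outliers(series: pd.Series, window: int = 20, n_sigma: float = 6.0) -> pd.DataFrame:
    """Flag points beyond n_sigma rolling standard deviations and replace them
    with the rolling mean. Hypothetical helper, for illustration only."""
    # Use the *previous* `window` points so that a bad point cannot widen
    # the very band it is being tested against (assumption, see lead-in).
    prior = series.shift(1)
    mean = prior.rolling(window).mean()
    std = prior.rolling(window).std()

    lower = mean - n_sigma * std
    upper = mean + n_sigma * std

    # Comparisons against NaN bands (the warm-up period) evaluate to False,
    # so the first `window` points are never flagged.
    is_outlier = (series < lower) | (series > upper)

    # Keep good points as they are, replace flagged points with the rolling mean.
    cleaned = series.where(~is_outlier, mean)

    return pd.DataFrame(
        {"raw": series, "lower": lower, "upper": upper,
         "is_outlier": is_outlier, "cleaned": cleaned}
    )


# Synthetic closing prices around 100 with one injected error (made-up data).
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 60))
close.iloc[40] = -150.0

report = clean_outliers(close)
print(report.loc[report["is_outlier"], ["raw", "cleaned"]])
```

A limitation of this sketch is that the bad raw value still enters the windows of the points that follow it, which inflates their bands for a while. Recomputing the statistics on the cleaned series, or using a median-based band, are possible refinements.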

We follow a similar process for the Signal, since extreme outliers are possible for this data vector too. The cleaned data is a lot more compressed, and we are good to go to the next steps.

Efficacy Check

Now we want to understand whether the signal has any predictive power. To do that, we need to define a metric and measure it. We take the percentage change in the signal and check whether it has any predictive power over the 1-day forward-looking price change, or return.

It’s necessary to work with something like returns rather than raw prices, since returns are approximately stationary, and many statistical methods assume stationarity and independence between data points.

Beware Forward Bias

We map X (the signal change) to Y (the forward-looking 1-day return). This alignment is necessary so that the signal change today is used to forecast the return realised by the end of tomorrow, and no future information leaks into X.
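A minimal sketch of this alignment in pandas is shown here. The helper name `build_xy`, the column names and the toy numbers are assumptions made for illustration. The key step is `shift(-1)` on the returns, which places the return from today's close to tomorrow's close on today's row, next to today's signal change.

```python
import pandas as pd


def build_xy(signal: pd.Series, close: pd.Series) -> pd.DataFrame:
    """Pair today's signal change (x) with tomorrow's return (y).
    Hypothetical helper, for illustration only."""
    # x: percentage change in the signal, known at the end of day t.
    x = signal.pct_change()

    # y: return from close of day t to close of day t+1.
    # pct_change() gives the return *into* each day; shift(-1) moves it
    # back one row so it lines up with the signal available on day t.
    y = close.pct_change().shift(-1)

    # The first row has no signal change and the last row has no
    # forward return, so both are dropped.
    return pd.DataFrame({"x": x, "y": y}).dropna()


# Toy data (made up).
signal = pd.Series([10.0, 11.0, 12.1, 11.0, 12.0])
close = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0])

xy = build_xy(signal, close)
print(xy)
```

Shifting in the other direction, or not shifting at all, would pair the signal with a return that has already happened, which is exactly the bias this section warns about.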

A quick peek into the distribution of X and Y is a good visual to have. The returns have a distribution ranging from -6% to around 4%.

On the other hand, the signal is very polarised in terms of its change. It can therefore be passed through a gate with a +1 or -1 output.

Error Metrics

Now, to measure the efficacy, I computed the RMSE (Root Mean Square Error) between the signal change and the actual returns. This should be as low a number as possible.

This RMSE turns out to be ~0.055. This roughly translates to a 5.5% error, which is quite significant given that the daily returns themselves range from only about -6% to 4%. Such a large error will wipe out any trading possibility, and this is validated by the cumulative performance graph.
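For reference, the RMSE calculation itself is short. The sketch below uses NumPy; the function name `rmse` and the toy numbers are illustrative assumptions, and it presumes that the prediction and the returns are expressed on a comparable scale (both as fractional changes), since otherwise the number is hard to interpret.

```python
import numpy as np


def rmse(predicted, actual) -> float:
    """Root Mean Square Error between two equally long sequences."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Square the errors, average them, then take the square root.
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Toy example (made-up numbers): predicted vs realised 1-day returns.
predicted = [0.01, -0.02, 0.03]
actual = [0.02, -0.01, 0.00]
error = rmse(predicted, actual)
print(round(error, 5))
```

A useful sanity check is to compare the result with the RMSE of a naive forecast of zero return every day; a signal that cannot beat that baseline adds no information by this measure.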

As can be seen, we don’t have a profitable strategy, and this is even before we have accounted for any kind of trading costs.
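One way to produce such a cumulative performance curve is sketched below. The trading rule is an assumption, since the article does not spell it out: go long (+1) when the signal change is positive and short (-1) when it is negative, hold for one day, and compound the results. The function name `cumulative_performance` and the optional `cost_per_unit` parameter are hypothetical and only there to show how costs would make the picture worse.

```python
import numpy as np
import pandas as pd


def cumulative_performance(signal_change: pd.Series,
                           forward_return: pd.Series,
                           cost_per_unit: float = 0.0) -> pd.Series:
    """Compounded return of a +1/-1 strategy driven by the sign of the
    signal change. Hypothetical helper, for illustration only."""
    # Gate the signal change to +1 / -1 (0 if the signal did not move).
    position = np.sign(signal_change)

    # Daily strategy return before costs.
    gross = position * forward_return

    # Turnover is the absolute change in position; the first day pays
    # for opening the initial position. Flipping from +1 to -1 costs 2 units.
    turnover = position.diff().abs().fillna(position.abs())
    net = gross - cost_per_unit * turnover

    # Compound the daily returns into a cumulative performance curve.
    return (1 + net).cumprod() - 1


# Toy data (made up): signal changes and the matching forward returns.
x = pd.Series([0.5, -0.4, 0.9])
y = pd.Series([0.01, 0.02, -0.01])

curve = cumulative_performance(x, y)
curve_with_costs = cumulative_performance(x, y, cost_per_unit=0.001)
print(curve.iloc[-1], curve_with_costs.iloc[-1])
```

In this toy case the strategy is right on the first day and wrong on the next two, so the curve ends below zero, and the version with costs ends lower still.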

Alternative data, in this case, was used as a direct input for trading the instrument. Instead of a simple model with the one-day change as its only input, we could have explored building a more involved model that uses sequential inputs of this signal.

This signal could further have been combined with previous price changes for predictive purposes. Some of these concepts and the use of sequential neural networks are explored here.

Conclusion

This is one way to go about measuring efficacy. It looks at using alternative data for trading instead of trying to process the same data points better, that is, with more complex models or data transformations.

While LSTMs and RNNs have been very useful across language/text problems, their usage has been a challenge in the market space. They are more useful as a means of extracting signals from alternative data sources than as a way of building complex models on raw data.