I want to share a popular but misguided idea that can make it difficult to develop very profitable systems. However, to understand why the idea is misguided, we first have to understand why it makes sense and where its limits are. The idea stems from the desire to avoid “data snooping” bias, and it basically says, in one form or another, that the system trader should be “data blind”. The goal is to avoid mining spurious noise out of the data.
To understand where the idea originates, we need to turn to the scientific method and what a backtest really means. A backtest doesn’t prove anything. It can only eliminate ideas that clearly don’t work, and even then only ideas stated as a conjecture of one specific form. The form is: “If I had traded this way consistently in the past, based on my recent past performance, then I’d make money in the future”. First, if you didn’t trade that way “consistently”, you could still have made money even if the backtest failed. The second point is a bit involved, but if you didn’t use your recent past performance to determine the optimal way to trade, then you might still have traded something very similar and been profitable. So all we’re really saying is: hey, I have this idea, and I think that if I had traded this way consistently in the past then I’d have made money. A backtest can only disprove that form of conjecture.
Now, on statistical significance: this is not too relevant to a backtest, simply because you can look at the profit and decide whether it is something you want to trade. However, it does come into play here. A result that is significant at the 5% level is one you’d expect to see about 5 times out of 100 purely by chance. So if you run 100 backtests on ideas with no real edge, you’d expect about 5 of them to look “that profitable” simply due to chance. If you run 1,000 backtests, the number rises to about 50.
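To make the arithmetic above concrete, here is a rough Python sketch. It is only an illustration under assumptions of my own: the function names are made up, each "system" is nothing but zero-mean random noise, and the 1.645 cutoff is the one-sided 5% threshold under a normal approximation.

```python
import random
import statistics


def expected_chance_winners(n_backtests, alpha=0.05):
    """Expected number of no-edge backtests that pass the alpha threshold by luck."""
    return n_backtests * alpha


def prob_at_least_one_chance_winner(n_backtests, alpha=0.05):
    """Chance that at least one of n independent no-edge backtests passes."""
    return 1.0 - (1.0 - alpha) ** n_backtests


def count_chance_winners(n_backtests=1000, n_trades=100, seed=0):
    """Simulate no-edge systems and count how many look significant at about 5%.

    Each hypothetical system is just n_trades draws of zero-mean noise.
    A system counts as a winner when its t-statistic exceeds 1.645, the
    one-sided 5% cutoff under a normal approximation.
    """
    rng = random.Random(seed)
    cutoff = 1.645
    winners = 0
    for _ in range(n_backtests):
        trades = [rng.gauss(0.0, 1.0) for _ in range(n_trades)]
        standard_error = statistics.stdev(trades) / n_trades ** 0.5
        if statistics.fmean(trades) / standard_error > cutoff:
            winners += 1
    return winners


for n in (100, 1000):
    print(n, expected_chance_winners(n), round(prob_at_least_one_chance_winner(n), 4))
print("simulated winners out of 1000:", count_chance_winners())
```

The simulated count will wobble around the expected value from seed to seed, which is the point: none of these "winners" has any edge at all.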
So we see a problem: if we run enough backtests, then by pure chance we expect to generate many profitable results, and our methods can only eliminate the unprofitable ideas. You can see now why the idea that you should be blind to the data comes into play. The idea is that if you have a logical basis for your trade or system first, then a profitable result is more likely to be meaningful. Or maybe it simply comes down to a numbers game, since testing fewer ideas produces fewer chance winners.
This is all very well and good, and it is essentially correct in many ways. However, there’s a problem, and it originates from the motive to “be data blind”. True data blindness means you don’t know the properties of your data. It means you can’t learn those properties, or what is and isn’t likely to work. True data blindness amounts to treating your data as random. And if your data is random, then you should not be making hypotheses that use random data for trading decisions! Moreover, if you embrace the data blind idea, it’s going to be extremely difficult to produce very profitable systems, because you are just throwing ideas in the dark. That’s not likely to produce anything in markets that are mostly efficient. Think about it: if there are any patterns to exploit in the market, who’s going to be more profitable, a trader who studies the market and whatever patterns it offers, or a trader throwing ideas in the dark? Moreover, if you do what most developers do and run a single backtest, your results are likely to be random anyway. See my blog post, under my profile, for why that’s the case.
The way out is, of course, to seek to understand your data. Really, if you’re building systems on price data, then you’ve already accepted the conjecture that your data isn’t random. If your data isn’t random, then the greatest advantage a system developer has is to study, understand, and optimize systems based on the data. If your data is truly random, then no amount of “data blindness” or attempts to avoid data snooping bias will keep you from producing junk systems.
What traders really want to avoid are two types of results that won’t lead to future profits, and they arise for different reasons: overfitting and fleeting temporal market biases. Overfitting comes from using too many variables for too little data. Completely avoiding temporal market biases would be both impossible and unprofitable, because all very profitable market biases are expected to have some temporality (a shelf life). But some market biases may be so fleeting that they can’t even be reproduced. Those are the ones that aren’t worth trading. The best way to avoid these undesired results is to have the best possible understanding of the data.
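The "too many variables for too little data" point can be seen in a toy Python sketch. Everything in it is hypothetical and chosen only for illustration: the returns are pure noise, and each "variable" is a free long/short choice for one slice of the bars (bar i belongs to slice i % n_vars). The optimizer just picks whichever side made money in-sample.

```python
import random


def fit_signs(returns, n_vars):
    """For each slice (i % n_vars), choose long (+1) or short (-1),
    whichever would have made money on these returns."""
    slice_sums = [0.0] * n_vars
    for i, r in enumerate(returns):
        slice_sums[i % n_vars] += r
    return [1 if s >= 0 else -1 for s in slice_sums]


def profit(returns, signs):
    """Total profit from applying the fitted long/short choices to a return series."""
    n_vars = len(signs)
    return sum(signs[i % n_vars] * r for i, r in enumerate(returns))


rng = random.Random(42)
in_sample = [rng.gauss(0.0, 1.0) for _ in range(64)]   # noise used for fitting
out_sample = [rng.gauss(0.0, 1.0) for _ in range(64)]  # fresh noise, never fitted

results = []
for n_vars in (1, 2, 4, 8, 16, 32, 64):
    signs = fit_signs(in_sample, n_vars)
    results.append((n_vars, profit(in_sample, signs), profit(out_sample, signs)))

for n_vars, fitted, unseen in results:
    print(f"{n_vars:>2} variables  in-sample {fitted:7.2f}  out-of-sample {unseen:7.2f}")
```

With these nested slices the in-sample profit can only go up as variables are added, and with 64 variables for 64 bars the system "earns" the absolute value of every single bar. None of that is knowledge about the data, so there is no reason for the out-of-sample column to follow.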
But it is reasonable to set aside a small section of “hold-out” data that you don’t test against. The only problem is that if we set aside the most recent data, and we believe the most recent data is the most valuable, then we’re not able to use the best data for developing the strategy. A second strategy, which can be used in conjunction with the first or independently and has the same motive of using the data to make better decisions, is to treat the parameters of the system as data to understand.
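Here is one possible reading of those two strategies as a Python sketch. It is not a definitive recipe: the toy momentum rule, the `lookback` parameter, the 20% hold-out size, and the random-walk prices are all assumptions of mine. The idea is to split off the most recent data first, then look at how results behave across the whole parameter range instead of only at the single best setting.

```python
import random
import statistics


def chronological_split(series, holdout_fraction=0.2):
    """Keep the most recent part of the series aside as hold-out data."""
    cut = len(series) - int(len(series) * holdout_fraction)
    return series[:cut], series[cut:]


def momentum_profit(prices, lookback):
    """Toy rule: be long one unit when price is above its level `lookback`
    bars ago, otherwise short one unit. Returns total profit in price points."""
    total = 0.0
    for t in range(lookback, len(prices) - 1):
        position = 1 if prices[t] > prices[t - lookback] else -1
        total += position * (prices[t + 1] - prices[t])
    return total


def parameter_profile(prices, lookbacks):
    """Profit for every lookback, so the parameter itself can be studied as data."""
    return {lookback: momentum_profit(prices, lookback) for lookback in lookbacks}


# Hypothetical price series: a plain random walk, so there is no real edge here.
rng = random.Random(7)
prices = [100.0]
for _ in range(999):
    prices.append(prices[-1] + rng.gauss(0.0, 1.0))

development, holdout = chronological_split(prices, holdout_fraction=0.2)
profile = parameter_profile(development, range(5, 55, 5))
profits = list(profile.values())

summary = {
    "best": max(profits),
    "median": statistics.median(profits),
    "share_profitable": sum(p > 0 for p in profits) / len(profits),
}
print(summary)

# The hold-out data is touched once, at the very end, for the chosen setting.
chosen_lookback = max(profile, key=profile.get)
print(chosen_lookback, momentum_profit(holdout, chosen_lookback))
```

If the best setting stands far above the median and most neighboring settings lose money, that is a hint the "best" result may be a chance peak rather than a property of the data.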