The vexing problem facing every system developer is the need to validate their backtest. One rigorous way to do that is walk forward optimization. However, an argument can be made that the alternative approach of taking all of the data into consideration can also make sense, and, in fact, some highly experienced system developers prefer it to walk forward analysis (WFA). The most commonly used way to validate a system is to test it on out-of-sample (OOS) data.
Most often, some percentage of the most recent data, typically 15% to 30%, is withheld from the optimizer; this is referred to as the hold-out. Performance on the OOS data then validates or invalidates the system. The problem with this technique is that the most recent data is the data most likely to reflect current market conditions: it is both the most valuable and in the shortest supply. One might think that holding out the start of the data, the oldest data, would be a solution. However, that doesn't resolve the problem, because a strategy might not perform well on the older data yet perform very well on the most recent data. Given that the most recent data is the most relevant, it is reasonable to ask whether the older performance should be discounted in favor of the more recent performance. Unfortunately, the confounding factor is that the most recent data is in-sample.
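The conventional hold-out split is easy to sketch (a hypothetical illustration in Python, using a plain list as a stand-in for a daily price series):

```python
# Conventional hold-out: withhold the most recent 30% of bars from the optimizer.
prices = list(range(1000))  # stand-in for a daily series, oldest bar first

holdout_pct = 30  # percent withheld, typically 15-30
split = len(prices) * (100 - holdout_pct) // 100

in_sample = prices[:split]       # the only block the optimizer ever sees
out_of_sample = prices[split:]   # the most recent data, reserved for validation

print(len(in_sample), len(out_of_sample))  # 700 300
```

Note that the withheld block is precisely the most recent, most relevant data.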
Regardless of whether one holds out a block of data at the beginning or the end of a backtest, we know that markets move through various regimes which both persist and change over time. Sometimes markets are volatile and choppy; at other times they trend smoothly. Holding out a continuous block of market data introduces the risk that important regimes are missing from the in-sample data. While deliberately excluding a regime might be warranted in certain cases, it is clearly wrong when it happens merely by chance, with no intention on the system developer's part. We need a better solution, and a technique developed to make the training of neural networks more robust presents itself as an ideal one: dropout.
Dropout is exactly what it sounds like: data is randomly dropped out during the training process to make the network more robust to changes. In our case, we will randomly drop out a percentage of market data to serve as our hold-out. We will optimize our system on the data that remains, and finally we will validate the system against the withheld data. Dropout can also be used to speed up the optimization process and introduce robustness, but we don't explore that in this article.
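In Python terms, the idea is simply a random partition of the trading days (a minimal sketch, assuming a 20% dropout rate and a fixed seed so the partition is reproducible):

```python
import random

random.seed(7)  # fixed seed: the same days are withheld on every run

trading_days = list(range(750))   # stand-in for ~3 years of daily sessions
dropout_pct = 0.20

# Scatter the hold-out across the whole history instead of one block.
dropped = {day for day in trading_days if random.random() < dropout_pct}
in_sample = [day for day in trading_days if day not in dropped]

print(len(dropped) / len(trading_days))  # roughly 0.20
```

Because the withheld days are scattered across the full history, every market regime is represented on both sides of the split.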
Intraday vs. End-of-Day Concerns
In our example, we will be developing an intraday system. Dropout is especially well suited to intraday systems because we do not have to concern ourselves with how to handle trades that may already be open. We simply calculate any indicators over all of the data but do not allow any trading on the dropped-out days, preventing the optimizer from taking advantage of those days. Applying dropout to multi-day systems, depending on how they work, might be more challenging. One solution might be to apply the dropout to the entry signal only and allow exits to happen normally. Another valid concern is data-snooping bias in cases where indicators are computed over days that weren't traded. Mitigation strategies that might help include replicating indicator values across dropped-out days or dropping out batches of several consecutive days.
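The entries-only variant for a multi-day system can be sketched as follows (a hypothetical illustration with made-up signal days): new entries are suppressed on dropped-out days, but an open position is always allowed to exit.

```python
# Entries-only dropout: block new entries on dropped-out days,
# but let an open position exit normally.
dropped = {2, 5, 6}     # hypothetical dropped-out days
entry_days = {2, 4}     # days on which the raw entry signal fires
exit_days = {3, 6}      # days on which the exit signal fires

position = False
trades = []
for day in range(8):
    if position and day in exit_days:
        position = False              # exits are always honored
        trades.append(("exit", day))
    elif not position and day in entry_days and day not in dropped:
        position = True               # the day-2 entry is suppressed (dropped)
        trades.append(("entry", day))

print(trades)  # [('entry', 4), ('exit', 6)]
```

Note that the exit on day 6 fires even though day 6 is dropped out, while the entry signal on day 2 is suppressed because day 2 is withheld.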
How to Drop Out Data: Easy Method vs. Correct Method
First, the easy method to drop out data takes advantage of the mod function, which returns the remainder of whole-number division. We can find every Nth bar by modding the current bar number by N; modding by 5, for example, drops out 20% of our data. In the MultiCharts PowerLanguage example below (very similar to TradeStation's EasyLanguage), we create a proxy for the current bar number that only updates on each new day, because our primary series is intraday:
If d <> d[1] then begin
    DailyBar = DailyBar + 1;
    Dropout = Mod(DailyBar, 5) = 0;
end;
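The same trick in Python, for clarity: every day whose count is a multiple of 5 is withheld, which removes exactly 20% of the days, but always at a perfectly regular interval:

```python
# Mod-based dropout: withhold every 5th day, i.e. exactly 20% of days.
daily_bars = range(1, 1001)  # one count per trading day
dropped = [bar for bar in daily_bars if bar % 5 == 0]

print(len(dropped) / 1000)  # 0.2
```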
The problem with the method above is that there might be patterns in the data that could skew our results. The correct method is to use a seeded random function, which would also be easy except that MultiCharts PowerLanguage doesn't have a random-with-seed function. As a result, we need to create a list of random numbers and persist them, because we want the numbers to remain the same on subsequent test runs. There are many ways to create such a list: we could use EasyLanguage and persist the results, or use Excel. In this case, because I'm trying to learn Python, I used Python via an IPython Notebook:
import random

random.seed(1)  # any fixed seed works; it keeps the randoms the same next time
for num in range(1, 2001):
    value = random.randint(1, 10)
    # % is Python's mod; print 3 assignments per row to save space
    end = "\n" if num % 3 == 0 else " "
    print("RandomList[%d]=%d;" % (num, value), end=end)
The code prints EasyLanguage array assignments for 2000 values, enough for approximately 8 years of data. As long as we have enough values, the exact amount isn't important. We will be testing over approximately 3 years of data, so we will have more than enough.
RandomList[1]=7; RandomList[2]=7; RandomList[3]=1;
RandomList[4]=5; RandomList[5]=9; RandomList[6]=8;
RandomList[7]=7; RandomList[8]=5; RandomList[9]=8;
RandomList[10]=6; RandomList[11]=10; RandomList[12]=4;
RandomList[13]=9; RandomList[14]=3; RandomList[15]=5;
RandomList[16]=3; RandomList[17]=2; RandomList[18]=10;
RandomList[19]=5; RandomList[20]=9; RandomList[21]=10;
I simply copy and paste the output into the code. The generator produces random values from 1 to 10 inclusive, which means the probability of any specific number appearing on any given day is 10%.
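A quick check of that arithmetic: excluding any two of the ten equally likely values withholds about 20% of the days. Here Python's random module stands in for the persisted list:

```python
import random

random.seed(3)
random_list = [random.randint(1, 10) for _ in range(2000)]

# Days whose value is 1 or 2 are held out: 2 of 10 values, so ~20%.
dropped = [v for v in random_list if v in (1, 2)]
print(len(dropped) / len(random_list))  # close to 0.20
```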
Putting It into Practice
For this example, we will be examining the E-mini EMD (S&P MidCap 400) futures contract. The EMD has a point value of $100 per point and should track closely with other equity indexes like the E-mini S&P 500. We will be looking to see whether any 5-minute period over the normal trading day has an exploitable bullish bias. We are not attempting to build a complete system, but rather to find filters that may be useful at a later stage, with additional components, to produce a tradeable system.
We hold out 20% of the data using the “logical or” operator.
If d <> d[1] then begin
    DailyBar = DailyBar + 1;
    Dropout = (RandomList[DailyBar] = 1) or (RandomList[DailyBar] = 2);
end;
In order to optimize the 5 minute periods, we create an intraday bar count that resets at the start of our session.
If time = 0925 then
    IntraDayBar = 0
else
    IntraDayBar = IntraDayBar + 1;
For the exhaustive optimization, we allow the optimizer to see all the days where dropout is false:
If IntraDayBar = IntraDayBarStart and Dropout = false then
    Buy next bar at market;
All trades are closed on the same bar at the close.
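The exhaustive optimization itself amounts to measuring the average open-to-close return of each 5-minute slot over the non-dropped days and picking the best. A toy version with synthetic returns (slot 2 is given a deliberate built-in bias; the real test of course uses the EMD bars):

```python
import random

random.seed(11)

n_days, n_slots = 250, 78        # ~1 year of 5-minute slots, 9:30-16:00
dropped = {d for d in range(n_days) if random.random() < 0.20}

# Synthetic per-slot returns; slot 2 carries a small built-in bullish bias.
returns = [[random.gauss(0, 1) + (0.5 if s == 2 else 0.0)
            for s in range(n_slots)] for d in range(n_days)]

def avg_return(slot, days):
    vals = [returns[d][slot] for d in days]
    return sum(vals) / len(vals)

# Optimize only on the days the optimizer is allowed to see...
in_days = [d for d in range(n_days) if d not in dropped]
best_slot = max(range(n_slots), key=lambda s: avg_return(s, in_days))

# ...then validate the winner on the dropped-out days.
print(best_slot, avg_return(best_slot, sorted(dropped)))
```

Here the planted bias is real, so it survives validation; the article's point is precisely that an optimized slot which fails this second step deserves suspicion.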
The optimizer finds the 5-minute period starting at 9:40 to be the most bullish. This is not too surprising, because the morning is the most volatile part of the session.
Finally, we verify against the OOS data by setting dropout to true.
The difference is stark and somewhat surprising. Comparing the trade statistics, our in-sample optimized trades averaged $12.70 per trade, while the OOS trades averaged a loss of ($15.21) per trade.
The annual returns indicate no trade clustering or grouping that would explain the discrepancy.
Reality Check Part 2
A question worth asking is whether the poor OOS results could simply be due to chance. One way to test this is to look at a Monte Carlo analysis of the in-sample data. But another, somewhat novel, way is to compare our 20% OOS sample against other 20% samples that we know are in-sample. We can do that by changing the dropout comparison numbers to values we did not exclude. We can make the dropout selection itself an optimization parameter and let the optimizer generate the various combinations for us.
Dropout = (RandomList[DailyBar] = DropStart) or (RandomList[DailyBar] = DropStart + 1);
We optimize only the dropout selection, being careful to leave our optimized start time unchanged. The first DropStart is our OOS data. The second, highlighted row is a mix of in-sample and OOS data. The rest are all in-sample.
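The sliding comparison can be sketched the same way: fix the optimized entry time, then score the strategy on each candidate 20% sample by moving the pair of excluded values (hypothetical per-day trade results stand in for the real P&L):

```python
import random

random.seed(5)

n_days = 750
random_list = [random.randint(1, 10) for _ in range(n_days)]
pnl = [random.gauss(10, 100) for _ in range(n_days)]  # stand-in trade P&L

# For each DropStart, score only the days whose random value is
# DropStart or DropStart + 1; each pair selects roughly 20% of the days.
for drop_start in range(1, 10):
    sample = [pnl[d] for d in range(n_days)
              if random_list[d] in (drop_start, drop_start + 1)]
    print(drop_start, round(sum(sample) / len(sample), 2))
```

DropStart = 1 reproduces the true OOS sample; every other value scores a sample the optimizer has already seen, giving a baseline for how much variation chance alone produces.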
Surprisingly, one randomly distributed 20% in-sample test was net negative, yet no sample fared anywhere near as poorly as our OOS.
The dropped-out OOS results provide a sanity check against our optimized results, and in this case they reveal we should be cautious. Of course, it is possible that our chosen OOS data happened, just by chance, to be somewhat unique. Barring any evidence of that, it is probably best to accept the results. But if we suspected that, we could re-run the optimization using another hold-out and see whether the new optimum proved any more stable. Of note, other profitable values were found during the initial optimization, and those non-best values might be more stable. Another question worth asking is whether 5 minutes might be too brief a period to reveal a time-of-day bias: even small fluctuations could introduce noise or swamp the effect, and searching over a longer period should help dampen that noise.
Of course, the real purpose of this article was to introduce the concept of random dropout. I was somewhat surprised by the results: the overall market was bullish over the period, and once the optimizer settled on the start of the day, it made sense to me that the most volatile time within a bullish period might be the most bullish. I therefore fully expected the OOS results to be similar.
Since I first wrote this, I have come across the term "stratified," which might be a related term for this kind of random sampling. I have also been wondering whether small differences in the uniformity of the random sampling, combined with the low profit per trade, might have had some impact on the results. In addition, a metric that maximized consistency might be more appropriate for a test like this than maximum profit, which was how the optimal parameter was determined. From a mathematical perspective, there aren't enough parameters for an "overfit," so we may be dealing with an "underfit" situation.