The vexing problem facing every system developer is the need to validate their backtest. One rigorous way to do that is to use walk-forward optimization, also called walk-forward analysis (WFA). However, an argument can be made that the alternative approach of taking all of the data into consideration can also make sense, and, in fact, some highly experienced system developers prefer that approach to WFA. The most commonly used way to validate a system is to use out-of-sample (OOS) data.

Most often, some percentage of the most recent data, from 15% to 30%, is withheld from the optimizer; this is referred to as a hold out. The performance on the OOS data validates or invalidates the results of the system. The problem with that technique is that the most recent data is the most indicative of current market conditions. The most recent data is both the most valuable and in the shortest supply. One might think that holding out the start of the data, the oldest data, would be a solution. However, the problem isn’t resolved, because a strategy might not perform as well on the older data but perform very well on the most recent data. Given that the most recent data is most relevant, it is reasonable to question whether the older performance data should be discounted in favor of the more recent performance. Unfortunately, the confounding factor is that, in that case, the most recent data is in-sample.

Regardless of whether one holds out a block of data at the beginning or end of their backtest, we know that markets have various regimes which both persist and change over time. Sometimes markets are very volatile and choppy, and at other times they trend smoothly. Holding out a continuous block of market data introduces the risk that important regimes might be lost. While it is debatable whether such a strategy might be warranted in certain cases, it is clearly wrong when it happens merely as the result of chance and with no intention on the system developer’s part. Clearly, we need a better solution. A technique developed to make neural networks more robust during training presents itself as an ideal candidate: drop out.

Drop out is exactly what it sounds like: in a neural network, randomly selected units are dropped out during the training process to make the network more robust to changes. In our case, we will instead drop out a percentage of the market data to serve as our hold out data. We will optimize our system on the data that remains, and finally we will validate the system against the withheld data. Drop out may also be used to speed up the optimization process and introduce robustness, but we don’t explore that in this article.

In our example, we will be developing an intraday system. Dropout is especially well suited for intraday systems because we do not have to concern ourselves with how to handle trades that may already be open. We simply calculate any indicators over all data but do not allow any trading on the dropped out days, which prevents the optimizer from taking advantage of those days. Applying dropout to multi-day systems, depending on how they work, might be more challenging. However, one solution might be to apply the dropout to the entry signal only and allow exits to happen normally. Another valid concern is data snooping bias in cases where indicators are optimized over certain days even though those days weren’t traded. Some mitigation strategies might be helpful in those cases, such as replicating indicator values across dropped out days or possibly dropping out batches of several days.
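The idea of applying dropout to the entry signal only can be sketched in Python. This is a minimal illustration rather than the article's strategy code: `simulate_entries`, the fixed `hold_days` exit, and the sample signals are all hypothetical and exist only to show the mechanics.

```python
def simulate_entries(entry_signals, dropout, hold_days=3):
    """Return the day indices on which a position is opened.

    Assumption for this sketch: a position opened on day d occupies
    days d .. d + hold_days - 1. Dropout blocks new entries only; an
    open position always runs to its normal exit.
    """
    entries = []
    next_free_day = 0  # first day on which the system is flat again
    for day, (signal, dropped) in enumerate(zip(entry_signals, dropout)):
        if day < next_free_day:
            continue  # still in a trade, even if this day is dropped out
        if signal and not dropped:
            entries.append(day)
            next_free_day = day + hold_days
    return entries


# Hypothetical signals: entry signals on days 0, 1 and 5.
signals = [True, True, False, False, False, True]
no_dropout = [False] * len(signals)
first_day_dropped = [True] + [False] * (len(signals) - 1)

print(simulate_entries(signals, no_dropout))         # [0, 5]
print(simulate_entries(signals, first_day_dropped))  # [1, 5]
```

Dropping day 0 lets the day 1 signal through, a trade that never occurs in the run without dropout. That shift in the trade list is part of what makes multi-day systems more challenging.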

First, the easy method to drop out the data is to take advantage of the mod function. The mod function returns the remainder of whole-number division. We can find every Nth bar by modding the current bar number with N. We can mod the current bar with, for example, 5 to drop out 20% of our data. In the MultiCharts PowerLanguage (very similar to TradeStation’s EasyLanguage) example below, we create a proxy for the current bar number that only updates on each new day, because our primary series is intraday:

`If d <> d[1] then begin`

`DailyBar = DailyBar + 1;`

`Dropout = Mod(DailyBar, 5) = 0;`

`end;`
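The same modulo rule can be checked in Python. This is only a sketch: `modulo_holdout` is a hypothetical helper name, and the 750-day length is an assumed stand-in for roughly three years of trading days.

```python
def modulo_holdout(num_days, every_nth=5):
    """Return one flag per trading day; True marks a day held out of optimization."""
    # Day numbers start at 1, mirroring a DailyBar counter that is
    # incremented on each new date.
    return [day % every_nth == 0 for day in range(1, num_days + 1)]


flags = modulo_holdout(750)  # assumed: about three years of trading days
held_out_fraction = sum(flags) / len(flags)
print(f"held out {sum(flags)} of {len(flags)} days ({held_out_fraction:.0%})")
# held out 150 of 750 days (20%)
```

One thing to watch with a fixed stride of 5: in a run of full five-day trading weeks, every fifth trading day lands on the same weekday until a holiday shifts the count.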

The problem with the method above is that there might be some patterns in the data that could skew or impact our results. The correct method is to use a seeded random function, which would also be easy except that MultiCharts PowerLanguage doesn’t have a random function that accepts a seed. As a result, we need to create a list of random numbers and persist them, because we want the numbers to remain the same on subsequent test runs. There are many ways of creating a list of random numbers. We could use EasyLanguage and persist the results, or use Excel. However, in this case, because I’m trying to learn Python, I used Python via IPython Notebook:

`import random`

`# Seed value so the random numbers will be the same next time`

`random.seed(0)`

`for num in range(1, 2001):`

`    # % is Python's mod operator; print 3 assignments per row to save space`

`    if num % 3 != 0:`

`        print("RandomList[{0}]={1};".format(num, random.randint(1, 10)), end=" ")`

`    else:`

`        print("RandomList[{0}]={1};".format(num, random.randint(1, 10)))`

The code prints EasyLanguage array assignments for 2000 values, enough for approximately 8 years of daily data. As long as we have enough values, the exact amount isn’t important. We will be testing over approximately 3 years of data, so we will have more than enough.

`RandomList[1]=7; RandomList[2]=7; RandomList[3]=1;`

` RandomList[4]=5; RandomList[5]=9; RandomList[6]=8;`

` RandomList[7]=7; RandomList[8]=5; RandomList[9]=8;`

` RandomList[10]=6; RandomList[11]=10; RandomList[12]=4;`

` RandomList[13]=9; RandomList[14]=3; RandomList[15]=5;`

` RandomList[16]=3; RandomList[17]=2; RandomList[18]=10;`

` RandomList[19]=5; RandomList[20]=9; RandomList[21]=10;`

I simply copy and paste the output into the strategy code. The code generates random values from 1 to 10 inclusive, so the probability of picking any specific number on any given day is 10%.

For this example, we will be examining the E-mini EMD (S&P MidCap 400) futures contract. The EMD has a point value of $100 per point and should track closely with other equity indexes like the E-mini S&P 500. We will be looking to see if there are any 5 minute periods during the normal trading day that have an exploitable bullish bias. We are not attempting to build a complete system but rather to find some filters that may be useful at a later stage, with additional components, to produce a tradeable system.

We hold out 20% of the data by using the “logical or” operator to select two of the ten possible values.

`If d <> d[1] then begin`

`DailyBar = DailyBar + 1;`

`Dropout = (RandomList[DailyBar] = 1) or (RandomList[DailyBar] = 2);`

`end;`
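As a rough Python counterpart to the snippet above, the sketch below builds a seeded list of values from 1 to 10 and flags the days whose value is 1 or 2. The helper names `build_random_list` and `holdout_flags`, and the 750-day length, are assumptions made for illustration only.

```python
import random


def build_random_list(num_days, seed=0):
    """Draw one integer from 1 to 10 per trading day using a fixed seed."""
    rng = random.Random(seed)  # private generator, so reruns give the same list
    return [rng.randint(1, 10) for _ in range(num_days)]


def holdout_flags(random_list, drop_values=(1, 2)):
    """True marks a day whose random value is one of the held-out values."""
    return [value in drop_values for value in random_list]


random_list = build_random_list(750)  # assumed: about three years of days
flags = holdout_flags(random_list)
share = sum(flags) / len(flags)
print(f"held out {sum(flags)} of {len(flags)} days ({share:.1%})")
```

Because the draw is random, the held-out share will usually be close to 20% rather than exactly 20%, unlike the modulo approach.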

In order to optimize the 5 minute periods, we create an intraday bar count that resets at the start of our session.

`If time = 0925 then begin`

` IntraDayBar = 0;`

` end;`

For the exhaustive optimization, we allow the optimizer to see all the days where dropout is false:

`If IntraDayBar = IntraDayBarStart and Dropout = false`

`then begin`

`Buy next bar open;`

`end;`

All trades are closed on the same bar at the close.
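The optimize-then-verify workflow can be sketched end to end in Python on synthetic data. Everything here is an assumption for illustration: the per-bar P&L is pure Gaussian noise, `BARS_PER_DAY` assumes 5 minute bars in a 6.5 hour session, and the function names are hypothetical. Ranking by average trade is equivalent to ranking by total profit here, since every candidate bar trades the same set of days.

```python
import random

BARS_PER_DAY = 78  # assumed: 5 minute bars in a 6.5 hour session


def make_synthetic_days(num_days, seed=1):
    """Synthetic per-bar P&L in dollars; pure noise with no real edge."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 50.0) for _ in range(BARS_PER_DAY)]
            for _ in range(num_days)]


def average_trade(days, flags, bar, held_out):
    """Mean P&L of trading `bar` on days whose dropout flag equals `held_out`."""
    pnl = [day[bar] for day, flag in zip(days, flags) if flag == held_out]
    return sum(pnl) / len(pnl)


def optimize_bar(days, flags):
    """Exhaustive search over bars, scored on the in-sample days only."""
    return max(range(BARS_PER_DAY),
               key=lambda bar: average_trade(days, flags, bar, held_out=False))


flag_rng = random.Random(0)
days = make_synthetic_days(750)
flags = [flag_rng.randint(1, 10) <= 2 for _ in days]  # values 1 and 2 held out

best = optimize_bar(days, flags)
in_sample = average_trade(days, flags, best, held_out=False)
out_of_sample = average_trade(days, flags, best, held_out=True)
print(f"best bar {best}: in-sample {in_sample:.2f}, held-out {out_of_sample:.2f}")
```

Since the synthetic data has no edge, the best in-sample bar looks profitable through selection alone, while its held-out average should typically sit much closer to zero. That gap is the effect the hold out is meant to expose.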

The optimizer finds the 5 minute period starting at 9:40 to be the most bullish. This is not too surprising because the morning is the most volatile part of the session.

Finally, we verify against the OOS data by setting dropout to true, so that trades are taken only on the held-out days.

The difference is stark and somewhat surprising when we compare the trade statistics. Our in-sample optimized trades averaged a profit of $12.70 per trade, while the OOS trades averaged a loss of $15.21 per trade.

The annual returns indicate no trade clustering or grouping that would explain the discrepancy.

A question worth asking is whether or not the bad OOS results could just be due to chance. One way to test is to look at the Monte Carlo analysis of the in-sample data. But, another somewhat novel way is to compare our 20% OOS sample against other 20% samples that we know are in-sample. We can do that by changing the dropout comparison numbers to those that we did not exclude. We can make the dropout variable itself an optimization parameter and let the optimizer generate various combinations for us.
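One way to frame the chance question numerically is a resampling check: draw many random subsets of the in-sample trades, each the size of the OOS sample, and count how often a subset averages as badly as the OOS result. The sketch below is a generic illustration, not the Monte Carlo tool built into any trading platform. The trade list is hypothetical (600 Gaussian trades with an assumed $150 standard deviation); only the $12.70 and -$15.21 averages come from the article.

```python
import random


def chance_of_mean_at_or_below(in_sample_pnl, oos_mean, oos_size,
                               trials=10_000, seed=0):
    """Fraction of random in-sample subsets (size oos_size) with mean <= oos_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        subset = rng.sample(in_sample_pnl, oos_size)  # without replacement
        if sum(subset) / oos_size <= oos_mean:
            hits += 1
    return hits / trials


# Hypothetical per-trade P&L; real values would come from the backtest report.
demo_rng = random.Random(42)
in_sample_pnl = [demo_rng.gauss(12.70, 150.0) for _ in range(600)]

p = chance_of_mean_at_or_below(in_sample_pnl, oos_mean=-15.21, oos_size=150)
print(f"share of in-sample subsets at or below the OOS mean: {p:.4f}")
```

A small share suggests the OOS result would be unusual if it came from the same distribution as the in-sample trades. A large share suggests that chance alone could explain it.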

`Dropout = (RandomList[DailyBar] = Dropstart) or (RandomList[DailyBar] = Dropstart+1);`

We optimize only the dropout parameter, being careful to leave our optimized start time unchanged. The first DropStart value is our OOS data. The second, highlighted row is a mix of in-sample and OOS data, because it selects the values 2 and 3, and the value 2 was part of the original hold out. The rest are all in-sample.
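The overlap between each DropStart setting and the original hold out can be worked out with a few lines of Python. The names `sample_values` and `classify` are hypothetical, and the 1 to 9 range is an assumption, since the range that was actually optimized is not stated.

```python
def sample_values(drop_start):
    """Values selected by (RandomList = DropStart) or (RandomList = DropStart + 1)."""
    return {drop_start, drop_start + 1}


OOS_VALUES = sample_values(1)  # the original hold out: values 1 and 2


def classify(drop_start):
    """Label a DropStart setting by its overlap with the original hold out."""
    overlap = sample_values(drop_start) & OOS_VALUES
    if overlap == OOS_VALUES:
        return "OOS"
    if overlap:
        return "mixed"
    return "in-sample"


# Assumed range 1..9: a DropStart of 10 would pair 10 with 11, a value that
# never occurs in the list, so it would select only about 10% of the days.
labels = {drop_start: classify(drop_start) for drop_start in range(1, 10)}
for drop_start, label in labels.items():
    print(drop_start, sorted(sample_values(drop_start)), label)
```

Under these assumptions, only DropStart 1 is purely OOS, DropStart 2 is mixed, and 3 through 9 are purely in-sample. Neighboring settings also share one value with each other, so the in-sample subsets are not fully independent.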

Even though, surprisingly, one randomly distributed 20% in-sample test was net negative, no sample fared anywhere near as poorly as our OOS data.

The dropped out OOS results provide a sanity check against our optimized results, and in this case they reveal that we should be cautious. Of course, it is possible that our chosen OOS data just by chance happened to be somewhat unique. Barring any evidence of that, it is probably best to just accept the results. But if we did suspect it, we could re-run the optimization using another hold out and see if the new optimum proved any more stable. Of note, there were other profitable values found during the initial optimization, and those non-best values might be more stable. Another question worth asking is whether 5 minutes might be too brief a period to reveal a time of day bias, because even small fluctuations could introduce noise or swamp the effect. Searching over a longer period of time should help to dampen any noise.

Of course, the real purpose and intent of this article was to introduce the concept of random drop out. I was somewhat surprised by the results because the overall market was bullish over the period, and once the optimizer identified the start of the day as most bullish, it made sense to me that the most volatile time in a bullish period might be the most bullish. So I fully expected that the OOS results would be similar.

Additional thoughts:

Since I first wrote this, I have come across the term “stratified” sampling, which is related to random sampling: the data is first divided into subgroups (strata), and random samples are then drawn from within each one. In addition, I have been wondering whether small differences in the uniformity of the random sampling approach, combined with the low profit per trade, might have had some impact on the results. A metric that maximizes consistency might also be more appropriate for a test like this than maximum profit, which was how the optimal parameter was determined. From a mathematical perspective, there aren’t enough parameters for an “overfit”; we may instead be dealing with an “underfit” situation.

Curtis is passionate about markets. He has developed top ranked futures strategies. His core focus is (1) applying machine learning and developing systematic strategies, and (2) solving the toughest problems of discretionary trading by applying quantitative tools, machine learning, and performance discipline. You can contact him at curtis@beyondbacktesting.com.

How do you deal with a signal that is true, but never would have traded because you were still in a trade from a previous signal?

For example, Monday is a randomly selected “skip” day and you have entry signals for both Monday and Tuesday. In normal testing Tuesday would not be traded as we would have entered Monday and be in a position on Tuesday because we normally hold for 3 days (for example); however, now we skip Monday and are flat going into Tuesday. Now do we trade Tuesday as part of the in-sample?

Additionally, how do you go about combining the in-sample and out of sample results? For example, did Tuesday’s trade actually happen? I love this idea of random/selective OOS and use something similar, but this problem of combining results tends to generate a few different solutions and I’d be curious if you’ve settled on the same ones or some that I have.

Thanks and all the best,

Dave

I think there are a few different sorts of problems, and you have to look at the particulars of a system. You could treat the offset as a type of robustness test, similar to sensitivity analysis, but that might not make sense in every case. Something I suggested would be to allow both the random selection and the parameters to vary, which gives you 10 samples with 20% hold outs. The parameter that performs best across all of them would be the most stable. You could also remove the trades retroactively. That’s a bit more work, but it could be used, for example, if you have a no-trade day on an exit.

I plan to run some more theoretical sampling tests on non-normal distributions in the near future to see what I can discover, because when we talk about random sampling here, we are using uniform sampling.

Completely agree. I’d be curious to share some findings down the road. Varying selection (and parameters) or the “Monte Carlo” approach combined with this random OOS has been quite interesting in my own work – nice to see we’re both using this without ever communicating until now.

I am starting to believe this is more part of a robustness/validation technique than part of the discovery process, if that makes sense.