“Quantile Tracking Errors (QuTE)” by Aguilar, Chengan, and Custovic

Epsilon Theory will occasionally publish academic research of merit pertaining to financial and political markets.

You can read about our reasons and our guidelines here: Why Publish Academic Research?

If you have publishable academic research that you think expands our collective understanding of financial or political markets, and you’d like to give it access to our network of 100,000+ investment professionals, asset owners, academics and market enthusiasts, please send it to us at info@epsilontheory.com.

Will making academic journals irrelevant save the world? No.

But it’s a good start.


PDF Download: Quantile Tracking Errors (QuTE)


Authors

Mike Aguilar is a Teaching Associate Professor at the Department of Economics, University of North Carolina at Chapel Hill and the Chief Investment Officer at Cardinal Retirement Planning, Inc. in Durham, NC. Email: maguilar@unc.edu

Ruyang Chengan is an Analyst at Dimensional Fund Advisors in Charlotte, NC. Email: chenganruyang@gmail.com

Anessa Custovic is a Quantitative Research Analyst at Cardinal Retirement Planning, Inc in Durham, NC. Email: anessa@planwithcardinal.com


Abstract

The tracking error is a ubiquitous tool among active and passive portfolio managers, used widely for fund selection, risk management, and manager compensation. In this paper we show that traditional measures of tracking error are incapable of detecting variations in higher order moments (e.g. skewness and kurtosis). As a solution, we introduce a new class of Quantile Tracking Errors (QuTE), which measures differences in the quantiles of return distributions between a tracking portfolio and its benchmark. Through an extensive simulation study we show that QuTE can detect variations in higher order moments. We also offer guidance on the granularity of the quantile grid and weighting schemes for the relative importance of various quantiles. A case study illustrates the benefits of QuTE during the Dot Com Bubble and the Great Recession.


Introduction

Traditional measures of tracking error are inadequate. Although there are several variants, most commonly tracking errors are cast as squared deviations between a tracking portfolio and benchmark over some period of time. However, this type of quadratic structure is inconsistent with the linear performance fees through which most managers are compensated (see Kritzman [16]). Instead, managers are incentivized to avoid extreme return deviations (Rudolf, Wolter, and Zimmermann [22]), which implies that higher order moments, such as kurtosis, are relevant. Moreover, Beasley, Meade and Chang [3] suggest that managers are incentivized to avoid consistently underperforming their benchmark, suggesting that skewness is also relevant.

Dorockov [8] and Blume and Edelen [5] point out that the goal of a tracking error is to measure how closely a portfolio replicates its associated benchmark. There is a preponderance of evidence that asset returns are non-Gaussian. Mills [18] documents excess skewness and kurtosis in daily asset returns, while Chung, Johnson and Schill [7] document it for monthly [1] asset returns as well. Therefore, tracking only the first two moments, as conventional measures do, is insufficient.

Other shortcomings of traditional tracking error measures have been cited. For instance, Pope and Yadav [19] illustrate the bias in tracking error due to serial correlation in returns. Moreover, Ammann and Tobler [1] recognize that tracking error variance is subject to sampling error.

This paper makes two contributions to the literature on portfolio tracking. First, we detail a previously undocumented shortcoming of traditional tracking errors. Through a simulation study we show that traditional tracking errors (such as average tracking error and tracking error volatility) fail to detect situations in which the skewness (and/or kurtosis) of the tracking portfolio differs from that of the associated benchmark.

The second contribution of this paper is to introduce a class of quantile based tracking errors (QuTE). As we will discuss in Section 2.2, there are many variants of tracking error. Some have symmetric loss functions, structured via absolute or squared deviations. Other variants incorporate asymmetries vis-à-vis semi standard deviations, which are aligned with downside risk. Each has an analogue within our quantile based measures. We show that even the most basic of these QuTE measures is able to detect deviations in higher order moments of returns.

We begin with a detailed accounting of the traditional measures of tracking error alongside the newly proposed quantile based measures. We then conduct an extensive simulation study to explore the relative merits of QuTE. Finally, we document historical episodes where QuTE was able to detect important differences between a tracking portfolio and its benchmark, while the traditional measures were unresponsive.


Portfolio Tracking

In this section we detail the lineage of tracking errors and provide a compendium of their variants. We complement this with an introduction to the new QuTE class of tracking errors.

Tracking Errors

Equation (1) first appeared in the academic literature in Franks [10], who defined it simply as “excess of benchmark returns”. Among practitioners, the object in Equation (1) is sometimes referred to as Tracking Difference.[2] Roll [20] refers to this object as “Tracking Error”, which we find to be commonly applied within the subsequent academic literature, and as such we reserve that terminology throughout the balance of this paper. Note that the object in Equation (2) is simply an average of the Tracking Error over a period of time.

The object in Equation (3) is the next most commonly used variant of the term Tracking Error. Franks [10] refers to this object as Tracking Error, whereas Roll [20] refers to this as Tracking Error Volatility (TEV). Many subsequent academic studies (see Jorion [14]) use the TEV terminology. Moreover, Equation (3) is commonly referred to as Tracking Error among practitioners.[3] Often this is reported as an annualized value.[4] Equation (4) is subtly distinct, but is used less often in the literature than Equation (3). Used by Ammann and Tobler [1], it captures the square root of the sum of the squared tracking error. Root Mean Squared Tracking Error (RMSTE) in Equation (5) was used by Chincarini and Kim [6] as a way to capture both the variability and the level of the tracking errors.
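The measures described above can be sketched in a few lines of Python. This is only an illustration built from the verbal descriptions in this section: the function names, the sample data, and the normalizations (for example dividing by T - 1 in TEV and by T in RMSTE) are assumptions rather than the exact forms of Equations (1) through (5).

```python
import math

def tracking_error_series(portfolio, benchmark):
    """Period-by-period return difference, portfolio minus benchmark."""
    if len(portfolio) != len(benchmark):
        raise ValueError("return series must have the same length")
    return [p - b for p, b in zip(portfolio, benchmark)]

def average_tracking_error(te):
    """ATE: the tracking error averaged over time."""
    return sum(te) / len(te)

def tracking_error_volatility(te):
    """TEV: sample standard deviation of the tracking error (T - 1 is assumed)."""
    mean = average_tracking_error(te)
    return math.sqrt(sum((x - mean) ** 2 for x in te) / (len(te) - 1))

def ter(te):
    """TER: square root of the sum of squared tracking errors."""
    return math.sqrt(sum(x * x for x in te))

def rmste(te):
    """RMSTE: square root of the mean squared tracking error."""
    return math.sqrt(sum(x * x for x in te) / len(te))

# Hypothetical monthly returns: the portfolio beats the benchmark by 1% every month.
benchmark = [0.010, -0.020, 0.015, 0.005, -0.010, 0.020]
offset_portfolio = [b + 0.01 for b in benchmark]
te = tracking_error_series(offset_portfolio, benchmark)

summary = {
    "ATE": average_tracking_error(te),
    "TEV": tracking_error_volatility(te),
    "TER": ter(te),
    "RMSTE": rmste(te),
}
```

With a constant 1% offset, TEV is zero up to floating point noise because the tracking error never varies, while ATE and RMSTE both equal 0.01. This is the sense in which RMSTE captures the level of the tracking error as well as its variability.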

As noted by Kritzman [16], portfolio managers are rewarded by linear performance fees based upon the differences between their portfolio and the corresponding benchmark. Rudolf, Wolter and Zimmermann [22] argue that, due to this fact, linear deviations between the portfolio and benchmark give a more accurate description of the investors’ risk attitudes than do squared deviations. As such, tracking measures based on absolute, rather than squared, differences, such as those in Equation (6) and Equation (10), are sometimes advocated.

Both the quadratic and absolute measures described thus far are inconsistent with investor loss aversion. Rudolf, Wolter and Zimmermann [22] advocate the use of semi-variances for downside risk measurement. Equations (7) – (10) reflect this downside risk.
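To make the distinction between symmetric and downside measures concrete, the following sketch contrasts a mean absolute tracking error with two downside variants that only penalize months in which the portfolio trails the benchmark. The function names and the 1/T normalization are illustrative assumptions and are not meant to reproduce Equations (6) through (10) exactly.

```python
import math

def mean_absolute_te(te):
    """Symmetric linear measure: average absolute return deviation."""
    return sum(abs(x) for x in te) / len(te)

def downside_semi_deviation(te):
    """Quadratic downside measure: only negative deviations contribute."""
    return math.sqrt(sum(min(x, 0.0) ** 2 for x in te) / len(te))

def mean_absolute_downside(te):
    """Linear downside measure: average shortfall relative to the benchmark."""
    return sum(abs(min(x, 0.0)) for x in te) / len(te)

# Hypothetical tracking error series and its mirror image.
te = [0.02, -0.01, 0.03, -0.02, 0.01]
te_mirror = [-x for x in te]

symmetric = (mean_absolute_te(te), mean_absolute_te(te_mirror))
downside = (downside_semi_deviation(te), downside_semi_deviation(te_mirror))
shortfall = (mean_absolute_downside(te), mean_absolute_downside(te_mirror))
```

The symmetric measure cannot distinguish the series from its mirror image, whereas both downside measures are larger for the mirrored series, in which the large deviations are underperformance.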

Finally, Beasley, Meade and Chang [3] introduce a generalized tracking error written as

QuTE

Intuitively, QuTE compares two assets via differences in the quantiles of their respective return distributions. This is especially useful in finance given the preponderance of returns with excess skew and kurtosis, and quantile-based methods’ ability to capture these distributions (see Rostek [21]). Moreover, a quantile based approach is consistent with the utility maximization via quantile maximization of Rostek [21], as well as with Giovannetti [12], who builds an asset pricing model consistent with CRRA preferences via quantile maximization.

Since the Value-at-Risk (VaR) is merely a quantile of a return distribution, we can see QuTE as matching on the space of VaRs at various levels. Yamai and Yoshiba [24] show us that portfolio ranking via VaR is consistent with expected utility maximization and is free of tail risk. We adapt the findings of Rostek [21], who characterizes the behavior of an agent who evaluates different (investment) alternatives by a given quantile of the implied (return) distributions and selects the one with the highest quantile payoff. We can represent an investor’s preferences via the quantiles of the associated return distribution. In the context of benchmark tracking, we can then cast the investor’s preferences for deviations from their benchmark via the differences in the quantiles of the portfolio and benchmark. Portfolio construction with VaR based objective functions is increasingly common (see Gaivoronski and Pflug [11] for recent examples). Moreover, a quantile based approach is especially attractive given the prevalence of VaR for portfolio risk management. For instance, Follmer and Leukert [9] use VaR in the context of dynamic hedging.

Note that a natural analogue to QuTE is moment based matching, rather than quantile based. One could use a method of moments type estimator to match a select set of empirical moments between the benchmark and optimal portfolio. Although potentially attractive, a moment based approach lacks the flexibility of a nonparametric quantile based method.

Notice the similarities with the tracking error measures defined in Section 2.1. Importantly, the averaging in the QuTE class is not done over time, but rather across quantile levels. The QuTE measures never force portfolio managers to compare their portfolios to the benchmark on a daily basis. This might mitigate the problem of “short termism” as indicated by Ma, Tang and Gomez [17]. Specifically, short evaluation periods for performance based compensation may damage fund performance by incentivizing managers to engage in such activities as risk shifting and window dressing to boost short-term performance.
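A minimal sketch of a quantile based analogue of TER illustrates the point about aggregation across quantile levels. The evenly spaced grid, the linear interpolation of empirical quantiles, and the absence of any normalization are assumptions made for illustration; they are not necessarily the choices behind the QuTER definition in this paper.

```python
import math
import random

def empirical_quantile(sorted_values, level):
    """Linearly interpolated empirical quantile of pre-sorted data, level in [0, 1]."""
    position = level * (len(sorted_values) - 1)
    lower = int(math.floor(position))
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def quter(portfolio, benchmark, n_levels=99):
    """Square root of summed squared quantile differences over an even grid."""
    p_sorted, b_sorted = sorted(portfolio), sorted(benchmark)
    levels = [j / (n_levels + 1) for j in range(1, n_levels + 1)]
    return math.sqrt(sum(
        (empirical_quantile(p_sorted, a) - empirical_quantile(b_sorted, a)) ** 2
        for a in levels
    ))

def ter(portfolio, benchmark):
    """Square root of summed squared return differences over time."""
    return math.sqrt(sum((p - b) ** 2 for p, b in zip(portfolio, benchmark)))

rng = random.Random(0)
benchmark = [rng.gauss(0.0, 1.0) for _ in range(414)]

# Same returns in a different time order: identical distribution, different path.
shuffled = benchmark[:]
rng.shuffle(shuffled)

quter_shuffled = quter(shuffled, benchmark)
ter_shuffled = ter(shuffled, benchmark)
```

Because the quantile based measure compares sorted returns, it is exactly zero for the reordered series, while the time based TER is strictly positive. The example is deliberately extreme, but it shows that the two classes of measures answer different questions: one compares paths period by period, the other compares distributions.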

Since there is a one-to-one mapping between the quantiles (returns) and the quantile levels (probabilities), portfolio tracking via QuTE can be cast within the wide literature of distribution matching. Cast this way, QuTER falls within the Fidelity Family of similarity measures. These types of measures are used in a wide variety of fields.

Beasley, Meade and Chang [3] expand their tracking error to accommodate the case where someone might want to weigh the importance of the return deviations differently over time. Analogously, we introduce a quantile weighted version of QuTE. We illustrate below for the case of QuTER, but this approach can easily be extended to any of the measures within the QuTE family.
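One way to sketch a quantile weighted QuTER is to multiply each squared quantile difference by a weight before summing. The functional form, the requirement that weights sum to one, and the tail-heavy weights used in the example are all hypothetical choices for illustration, not the weighting schemes studied later in the paper.

```python
import math
import random

def empirical_quantile(sorted_values, level):
    """Linearly interpolated empirical quantile of pre-sorted data, level in [0, 1]."""
    position = level * (len(sorted_values) - 1)
    lower = int(math.floor(position))
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def weighted_quter(portfolio, benchmark, weights):
    """Square root of the weighted sum of squared quantile differences."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("quantile weights must sum to 1")
    n_levels = len(weights)
    p_sorted, b_sorted = sorted(portfolio), sorted(benchmark)
    total = 0.0
    for j, weight in enumerate(weights, start=1):
        level = j / (n_levels + 1)
        gap = empirical_quantile(p_sorted, level) - empirical_quantile(b_sorted, level)
        total += weight * gap ** 2
    return math.sqrt(total)

n_levels = 99
levels = [j / (n_levels + 1) for j in range(1, n_levels + 1)]

# Equal weights versus hypothetical weights that grow with distance from the median.
equal_weights = [1.0 / n_levels] * n_levels
raw_tail = [abs(level - 0.5) for level in levels]
tail_weights = [w / sum(raw_tail) for w in raw_tail]

rng = random.Random(1)
benchmark = [rng.gauss(0.0, 1.0) for _ in range(414)]
wider_portfolio = [1.5 * r for r in benchmark]  # same shape, more dispersion

equal_value = weighted_quter(wider_portfolio, benchmark, equal_weights)
tail_value = weighted_quter(wider_portfolio, benchmark, tail_weights)
```

In this example the portfolio differs from the benchmark mostly in the tails, so the tail-heavy weights produce a larger statistic than equal weights.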

Simulation Study

In this section we explore the differences between QuTE and traditional TE tracking measures. Of particular importance, in subsection 3.1, is the sensitivity of each measure to differences in the empirical distributions of the benchmark and tracking portfolio. Subsections 3.2 and 3.3 focus on robustness of QuTE to various calibrations.

Sensitivity to Differences in Return Distributions

In this subsection we conduct a simulation study to evaluate the traditional tracking error measures of Section 2.1 as well as the QuTE based measures of Section 2.2. We craft a toy exercise that, while simple in nature, permits us to highlight the sensitivity of the tracking errors to differences in the underlying return distributions. Given the preponderance of evidence citing skewness and kurtosis (see Chung, Johnson and Schill [7], Mills [18], among others) in asset returns, coupled with the calls for linear performance measures a la Rudolf, Wolter and Zimmermann [22] and Kritzman [16], we consider deviations in these “higher order” moments.

We begin by creating a benchmark portfolio. For simplicity, we assume the returns of the benchmark follow a standard Normal distribution. We calibrate the length and empirical moments of the benchmark to match those of the monthly returns on the Dow Jones Industrial Average over the period 1985 through 2019. This same index is used in a Case Study detailed in Section 4. Our simulations contain 10,000 paths, each of length 414 months.

Next, we generate a tracking portfolio that follows one of five distinct distributions, which are depicted in Table 1. In Case 0, the tracking portfolio has the same distribution as the benchmark portfolio. In Case 1, they differ only in the mean. Similarly, Case 2 varies in terms of variance, Case 3 in terms of skewness, and Case 4 in terms of kurtosis.[5]

We explore the ability of the various traditional tracking measures to detect differences in the mean (standard deviation, skewness, kurtosis) of the tracking portfolio and benchmark. As noted in Section 2.1, the TEV depicted in Equation (3) is the most commonly used tracking measure among academics and practitioners. We compare the TEV to ATE, TER, and RMSTE.[6]

First, we vary the mean return of the tracking portfolio in excess of the benchmark (i.e. excess mean) in the range.[7] Next, we compute the ATE, TER, RMSTE and TEV for each of these values of excess mean, simulated and averaged over 10,000 paths. Finally, we scale [8] the values for each of the cases for ease of visual comparison. Panel A of Figure 1 depicts the ATE, TER, RMSTE and TEV values over the range of excess mean values. Panels B, C, and D similarly reflect excess standard deviation, skewness, and kurtosis.

A desirable measure of tracking error should achieve a minimum at an excess mean (standard deviation, skewness, kurtosis) of 0, i.e. when there is no difference between the tracking portfolio and benchmark, the tracking error measure should be at its low point. We find that ATE is unable to detect changes in any of the four moments. Meanwhile, TEV performs similarly to TER and RMSTE across Cases 2 through 4. In this sense, TEV is roughly equivalent to TER and RMSTE.

Next, we compare the traditional and quantile based tracking measures in terms of their abilities to detect differences in the underlying statistical distributions of the benchmark and tracking portfolios. Our comparison is centered around the TER of Equation (4) and the QuTER of Equation (12). We note our prior findings that TER is roughly equivalent to the popular TEV, which makes this comparison relevant. Moreover, we note that QuTER is a direct analogue of TER, providing a fair comparison.

In Table 2 we explore these relative sensitivities by computing the percent change in the (Qu)TER statistic relative to Case 0. The greater is the percent change in the (Qu)TER in Case 1 relative to Case 0, the more sensitive is that measure to variations in the means of the two series.

The p-value of 0 for Case 1 in Table 2 implies that the percent change in the QuTER statistic for Case 1 relative to Case 0 is not equal to the percent change in the TER statistic for Case 1 relative to Case 0. In fact, we find that QuTER and TER have unequal sensitivities to differences in each of the first four statistical moments. Moreover, one-tailed t-tests suggest that the QuTER is in fact more sensitive than TER in all Cases.

We explore these findings further by conducting a sensitivity analysis as we did above. Again, we vary the degree of mean returns in the tracking portfolio in excess of the benchmark (i.e. excess mean) in the range.[9] Next, we compute the TER and QuTER for each of these values of excess mean, simulated and averaged over 10,000 paths. Finally, we scale [10] the values for each of the cases for visual comparison. Panel A of Figure 2 depicts the TER and QuTER values over the range of excess mean values. Panels B, C, and D similarly reflect excess standard deviation, skewness, and kurtosis. Again, a desirable measure of tracking error should achieve a minimum at an excess mean (standard deviation, skewness, kurtosis) of 0, i.e. when there is no difference between the tracking portfolio and benchmark, the tracking error measure should be at its low point.

Panel A of Figure 2 suggests that TER and QuTER are both sensitive to variations in the mean return of the tracking portfolio and benchmark. They each reach minimum values near 0 excess mean, and rise at values above and below that amount. Similarly, Panel B illustrates that both TER and QuTER appear sensitive to deviations in excess standard deviation. However, Panels C and D illustrate that TER is not sensitive to deviations in skewness nor kurtosis. Meanwhile QuTER continues to respond to these excess variations. We note that these findings are consistent for ATE/AQuTE, AATE/AAQuTE, and ATR/AQuTER.
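The qualitative pattern in Panels C and D can be reproduced with a small simulation. The sketch below is not the Pearson system calibration used in this study. As a stand-in, it draws the benchmark from a standard normal and the tracking portfolio, independently, from either a standard normal or a centered unit exponential, which has the same mean and variance but positive skewness and excess kurtosis (so the two higher moments are not varied separately). The number of paths, the seed, and the unnormalized 99-point quantile grid are likewise assumptions.

```python
import math
import random

def empirical_quantile(sorted_values, level):
    """Linearly interpolated empirical quantile of pre-sorted data, level in [0, 1]."""
    position = level * (len(sorted_values) - 1)
    lower = int(math.floor(position))
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def quter(portfolio, benchmark, n_levels=99):
    """Square root of summed squared quantile differences over an even grid."""
    p_sorted, b_sorted = sorted(portfolio), sorted(benchmark)
    levels = [j / (n_levels + 1) for j in range(1, n_levels + 1)]
    return math.sqrt(sum(
        (empirical_quantile(p_sorted, a) - empirical_quantile(b_sorted, a)) ** 2
        for a in levels
    ))

def ter(portfolio, benchmark):
    """Square root of summed squared return differences over time."""
    return math.sqrt(sum((p - b) ** 2 for p, b in zip(portfolio, benchmark)))

def simulate(n_paths=200, n_months=414, seed=1):
    """Average TER and QuTER over simulated paths for two tracking portfolios."""
    rng = random.Random(seed)
    totals = {"TER same": 0.0, "TER skewed": 0.0, "QuTER same": 0.0, "QuTER skewed": 0.0}
    for _ in range(n_paths):
        benchmark = [rng.gauss(0.0, 1.0) for _ in range(n_months)]
        same = [rng.gauss(0.0, 1.0) for _ in range(n_months)]
        # Centered exponential: mean 0, variance 1, but skewed and fat tailed.
        skewed = [rng.expovariate(1.0) - 1.0 for _ in range(n_months)]
        totals["TER same"] += ter(same, benchmark)
        totals["TER skewed"] += ter(skewed, benchmark)
        totals["QuTER same"] += quter(same, benchmark)
        totals["QuTER skewed"] += quter(skewed, benchmark)
    return {name: value / n_paths for name, value in totals.items()}

results = simulate()
```

Under these assumptions the average TER is nearly identical in the two cases, since the expected squared return difference depends only on the means and variances of independent series. The average QuTER is markedly larger for the skewed portfolio.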

We can see from Table 3 that the estimated coefficient is positive and statistically significant for Cases 1, 3, and 4. This finding aligns with Figure 2, where QuTER appears to detect changes in the third and fourth moment, while TER is unable to do so. In terms of kurtosis, it appears that QuTER grows at least twice as fast per unit of change in excess kurtosis as TER does. Overall, we find that the sensitivities of the quantile based tracking errors are different, and in most cases larger, than the sensitivities of the traditional tracking errors.

Robustness to Granularity of Quantile Grid

In this subsection we explore whether the granularity of the quantile grid for the QuTE statistics impacts their ability to detect differences in the distributions of the tracking portfolio and the benchmark.

We repeat the exercise of Section 3.1 by simulating the benchmark returns as simple Gaussian noise and then varying the tracking portfolio in four ways: Case 1 alters the mean, Case 2 alters the variance, Case 3 alters the skewness, and Case 4 alters the kurtosis. Figure 3 depicts the percentage change in the QuTER statistic in a given Case relative to Case 0. The x-axis varies the size of the quantile grid. The reported values are the median across 10,000 simulated paths.

We find that the percentage change in the QuTER statistic falls as the number of quantiles in the grid rises. The relationship appears to plateau near 10 quantiles. This stability is important, indicating that the QuTER measure is robust to choice of quantile grid.

Impact of Varying Quantile Weights

In this subsection we explore whether variations in the quantile weighting scheme impact QuTE’s ability to detect deviations between the distributions of the tracking portfolio and benchmark.

Blitz and Hottinga [4] illustrate how to compare various investment strategies via a Tracking Error framework. They consider weighting strategies by several methods of importance, such as tracking error, information ratio, and the like. In a similar vein, we can weight various quantiles by whatever criterion is most important to the investor. In the following, we consider four weighting schemes: equal weight, tail risk weight, downside risk weight, and total return attribution.

Finally, we consider a total return attribution weighting scheme, wherein each quantile is weighted according to its contribution to the portfolio’s total return. Specifically, using the equally spaced 100 quantile grid (i.e. percentiles), we compute the midpoint between each grid point to signify the average return in that return bin. We then compute the relative frequency of return observations that fall within that bin. Notice that the average return in each bin times the relative frequency of observations occurring within that bin, summed across bins, is approximately equal to the total return. To compute the attribution of any given bin, we take the average bin return times relative frequency and divide by the total portfolio return.[11] By design these attributions sum to 1, and thus are viable choices for quantile weights.
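The attribution calculation described above can be sketched as follows. The bin edges are taken to be linearly interpolated empirical percentiles, and the total return in the denominator is taken to be the sum of the binned contributions so that the weights sum to exactly one. Both are assumptions, as are the function names and the simulated data.

```python
import bisect
import math
import random

def empirical_quantile(sorted_values, level):
    """Linearly interpolated empirical quantile of pre-sorted data, level in [0, 1]."""
    position = level * (len(sorted_values) - 1)
    lower = int(math.floor(position))
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def total_return_attribution_weights(returns, n_bins=100):
    """Weight each quantile bin by its share of the (binned) total return."""
    ordered = sorted(returns)
    edges = [empirical_quantile(ordered, j / n_bins) for j in range(n_bins + 1)]
    # Midpoint between adjacent grid points stands in for the bin's average return.
    midpoints = [(edges[j] + edges[j + 1]) / 2.0 for j in range(n_bins)]
    counts = [0] * n_bins
    for value in returns:
        # The maximum observation sits on the last edge, so clamp it into the last bin.
        index = min(bisect.bisect_right(edges, value) - 1, n_bins - 1)
        counts[index] += 1
    frequencies = [count / len(returns) for count in counts]
    contributions = [m * f for m, f in zip(midpoints, frequencies)]
    total = sum(contributions)
    if total == 0.0:
        raise ValueError("total return is zero, attribution weights are undefined")
    # Individual weights can be negative when a bin's return has the opposite sign of the total.
    return [c / total for c in contributions]

rng = random.Random(7)
monthly_returns = [rng.gauss(0.01, 0.04) for _ in range(414)]  # hypothetical data
weights = total_return_attribution_weights(monthly_returns)
```

Dividing by the sum of the binned contributions, rather than by the realized total return, is what guarantees that the weights sum to one in this sketch.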

In Figure 4 we illustrate how the QuTER objective function varies with the four aforementioned weighting schemes. Specifically, we repeat the exercise from Section 3.1 by simulating the tracking portfolio and benchmark. Each case varies one of the first four moments of the return distribution for the tracking portfolio. The height of each bar is the associated QuTER averaged over 10,000 paths. The number above each bar is the gross change of that average QuTER statistic relative to Case 0. For instance, the 1.1 above the first bar in Case 1 implies that the QuTER value for the equal weight scheme in Case 1 is 1.1 times as large as the equal weighting scheme QuTER statistic for Case 0. The legend can be read as follows: EW = Equal Weight, TR = Total Return Attribution, Tail = Tail Risk, and Down = Downside Risk.

Within Case 1, we find that all of the weighting schemes are roughly equally (in)sensitive to excess mean returns. Gross changes are 1.1 for equal weighting, tail risk weighting, total return attribution, and for downside risk weighting. Within Case 2, total return and tail risk are again equally sensitive to variations in excess standard deviation, while downside is slightly more sensitive and equal weight is slightly less sensitive. For excess skewness, we find that tail risk is the most sensitive, downside is the least sensitive, while equal weight and total return have similar sensitivities. For excess kurtosis, equal weight and total return attribution are again similarly sensitive, with tail risk and downside risk being less so. In summary, a quantile weighting scheme of equal weight or total return attribution is robust to a wide array of differences in the underlying return distributions of the benchmark and tracking portfolio.[12]

Case Study

In this section we conduct two small case studies in order to illustrate the behavior of QuTE alongside a traditional measure of tracking error. The first case regards tracking the DJIA, while the second focuses on tracking the MSCI Emerging Markets index. We apply the QuTER and TER measures in both an unconditional and conditional setting.

Tracking the DJIA

In our first case study we use the Dow Jones Industrial Average (DJIA) as a benchmark and the DIA SPDR ETF as a tracking portfolio. The DJIA is a leading index of equity market returns in the U.S., having been launched on May 26, 1896, with approximately 1,876.70 dollars indexed to its performance. The DIA is among the largest of the DJIA ETF tracking portfolios, with an average of 7,102,449 USD in daily volume since the inception date. It is also one of the oldest ETFs to track the DJIA portfolio, with an inception date of January 13, 1998.

Our dataset contains monthly simple returns for both the DJIA (benchmark) and the DIA (tracking portfolio) over the period January 1998 to June 2019. Figure 5 depicts the time variation of the two return series overlaid upon one another. Simple visual inspection suggests they are quite similar. In fact, the correlation between the two return series is 0.99. Table 4 contains basic descriptive statistics such as mean, standard deviation, skewness, and kurtosis, as well as select quantiles of the two series. The last row contains the p-value for tests of equality between these various measures. A standard t-test is used for equal means. A standard F-test is used for equal variances. A two-way Kolmogorov-Smirnov test is used to compare all of the four moments jointly. Finally, to compare the quantiles we employ the Wilcox et al. [23] test with a quantile estimator proposed by Harrell and Davis [13].

Figure 6 complements the comparisons in Table 4 by overlaying histograms of the tracking portfolio and benchmark in Panel A, and presenting a two-way QQ plot in Panel B. In addition, Table 5 presents various measures of (quantile) tracking errors. Note that the TE and QuTE values are not directly comparable given the different scaling of each measure.

Taken together, the above results reveal that the DIA has distributional properties that are remarkably similar to the DJIA, thereby supporting our visual inspection. Each of the moments and quantiles examined are statistically identical across the two portfolios.

Nonetheless, the two series can differ over time in ways that are important to portfolio managers and investors. Figure 7 charts the difference in returns (TE) for each month. Deviations between the two series are particularly visible during the aftermath of the dot-com bubble in 2001 as well as during the Great Recession of 2008-2010. Of particular note is the variability in the TE over time. Figure 8 depicts the time variation in the difference in the first four moments of the tracking portfolio and benchmark. For the benchmark, we compute the mean return over a trailing three year window. We repeat for the tracking portfolio. Then we subtract those two values. That is a single point in Panel A of Figure 8. We then roll each sample forward by one month, recompute the means, and subtract. We continue that process for the rest of the time series, and repeat that exercise for the standard deviation (Panel B), skewness (Panel C), and kurtosis (Panel D).
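The rolling procedure just described can be sketched as below. The 36 month window follows the text. The use of population (divide by n) moments, raw rather than excess kurtosis, and the sign convention of portfolio minus benchmark are assumptions, and the data are simulated placeholders rather than DJIA and DIA returns.

```python
import math
import random

def four_moments(values):
    """Mean, standard deviation, skewness, and kurtosis using population formulas."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    sd = math.sqrt(variance)
    skewness = sum((x - mean) ** 3 for x in values) / n / sd ** 3
    kurtosis = sum((x - mean) ** 4 for x in values) / n / sd ** 4
    return mean, sd, skewness, kurtosis

def rolling_moment_differences(portfolio, benchmark, window=36):
    """For each trailing window, portfolio moments minus benchmark moments."""
    differences = []
    for end in range(window, len(benchmark) + 1):
        p_moments = four_moments(portfolio[end - window:end])
        b_moments = four_moments(benchmark[end - window:end])
        differences.append(tuple(p - b for p, b in zip(p_moments, b_moments)))
    return differences

rng = random.Random(3)
benchmark = [rng.gauss(0.01, 0.04) for _ in range(60)]        # hypothetical returns
portfolio = [r + rng.gauss(0.0, 0.002) for r in benchmark]    # noisy tracker

diffs = rolling_moment_differences(portfolio, benchmark)
mean_gap_path = [d[0] for d in diffs]      # analogue of Panel A
kurtosis_gap_path = [d[3] for d in diffs]  # analogue of Panel D
```

Each element of `diffs` corresponds to one point in each of the four panels, and rolling forward by one month produces the next element.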

In a similar fashion we compute the TER and QuTER statistics between the benchmark and tracking portfolio. Panel A of Figure 9 depicts the rolling tracking measures computed over rolling three year windows, while Panel B depicts the month to month percent change in each tracking measure.
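A sketch of the rolling computation follows, under the same caveats as before: the unnormalized TER and QuTER forms, the nine point quantile grid used within each 36 month window, and the simulated data are assumptions made for illustration only.

```python
import math
import random

def empirical_quantile(sorted_values, level):
    """Linearly interpolated empirical quantile of pre-sorted data, level in [0, 1]."""
    position = level * (len(sorted_values) - 1)
    lower = int(math.floor(position))
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def quter(portfolio, benchmark, n_levels=9):
    """Square root of summed squared quantile differences over an even grid."""
    p_sorted, b_sorted = sorted(portfolio), sorted(benchmark)
    levels = [j / (n_levels + 1) for j in range(1, n_levels + 1)]
    return math.sqrt(sum(
        (empirical_quantile(p_sorted, a) - empirical_quantile(b_sorted, a)) ** 2
        for a in levels
    ))

def ter(portfolio, benchmark):
    """Square root of summed squared return differences over time."""
    return math.sqrt(sum((p - b) ** 2 for p, b in zip(portfolio, benchmark)))

def rolling_measures(portfolio, benchmark, window=36):
    """TER and QuTER over each trailing window, rolled forward one month at a time."""
    rows = []
    for end in range(window, len(benchmark) + 1):
        p_window = portfolio[end - window:end]
        b_window = benchmark[end - window:end]
        rows.append((ter(p_window, b_window), quter(p_window, b_window)))
    return rows

def percent_changes(series):
    """Month to month percent change of a strictly positive series."""
    return [100.0 * (current / previous - 1.0)
            for previous, current in zip(series, series[1:])]

rng = random.Random(5)
benchmark = [rng.gauss(0.01, 0.04) for _ in range(60)]        # hypothetical returns
portfolio = [r + rng.gauss(0.0, 0.002) for r in benchmark]    # noisy tracker

rows = rolling_measures(portfolio, benchmark)
ter_changes = percent_changes([row[0] for row in rows])
quter_changes = percent_changes([row[1] for row in rows])
```

The levels in `rows` correspond to the rolling measures, and the two change series correspond to the month to month percent changes.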

The statistical properties of the tracking portfolio differ from those of the benchmark over time. Our findings from Section 3 suggest that the QuTER statistic might be able to detect these differences when the TER cannot. For instance, as you can see from Figure 7, there is a large spike in the TE during 2001, followed by volatility of the TE until 2004. Figure 8 Panel A shows us that the differences in mean returns between DJIA and DIA were small and steady during this episode, while Panel D shows high differences in kurtosis. The TER is steady near 1.05 during this period, while the QuTER rises from 1 to 1.175, then falls back down to 1 by February 2004. These movements in the QuTER reflect its sensitivity to differences in return distributions that were not detected by TER.

Another episode of interest is the Great Recession. The TE swings wildly from 0.70 to 0.86 over the period 2008 to 2009. The mean return differences, as depicted in Panel A of Figure 8, vary between 0.17 and 0.20, and with it TER rises from 0.70 to 0.86. Notice that skewness changed from -0.01 to 0.11 and kurtosis from -0.05 to -0.10 over that period.[13] QuTER captured these movements, increasing by almost 50 percent over that period, rising from 0.79 to almost 1.20, outpacing the roughly 22% change in TER.

Tracking the MSCI Emerging Markets Index

In our second case study we use the MSCI Emerging Markets Index (MSCI-EM) as a benchmark and the EEM iShares ETF as a tracking portfolio. We focus on a recent episode that exemplifies the differences between TER and QuTER. Our dataset consists of monthly simple returns over the period January 2013 through November 2019.

The correlation between the two return series is 0.97 during this sample period. As depicted in Figure 10, the empirical distributions are similar. Nonetheless, as depicted in Figure 11, there are differences between the two series. Analogous to Figure 8 in Section 4.1, Figure 12 illustrates the time variation in the differences of the first four empirical moments of the benchmark and tracking portfolio. Panels B, C, and D show stark time variation in the differences of standard deviation, skewness, and kurtosis.

TER is little changed during this period, as seen in Figure 13, ranging from approximately 1 to 1.3. Meanwhile, QuTER is able to detect these variations in the series, ranging between 0.8 and 1.7. The relative sensitivity of QuTER is even more stark in Panel B of Figure 13.

Conclusion

In this paper we document a shortcoming of traditional tracking error measures. Cast as a quadratic norm of return differences between a tracking portfolio and benchmark, traditional tracking error measures like TEV and TER are focused on only the first two moments of the underlying return distributions. As such, they are inconsistent with the manner with which most portfolio managers are compensated. If the portfolio and benchmark differ in ways other than the mean or variance, traditional measures are insufficient.

As a remedy, we introduce a new class of tracking errors that are based on the differences in the quantiles of the tracking portfolio and benchmark, namely QuTE. Just as there are myriad variants of tracking error, so too are there variants of QuTE (see Section 2 for a complete listing).

We show via simulation that a simple quadratic summary statistic (QuTER) is more sensitive to differences in higher order moments than is its TER counterpart. We also document in two case studies situations wherein the QuTER statistic is able to detect important differences in tracking portfolios from their benchmarks, which the TER missed.

Our findings are directly relevant for ex-post performance measurement as well as risk evaluation. Differences in higher order moments matter, and quantile based measures of portfolio tracking provide a useful complement to traditional measures.


References

[1] Ammann, M. and Tobler, J. (2000). Measurement and decomposition of tracking error variance. Working paper, University of St. Gallen.

[2] Barro, D. and Canestrelli, E. (2009). Tracking error: a multistage portfolio model. Annals of Operations Research, 165(1):47-66.

[3] Beasley, J. E., Meade, N., and Chang, T. J. (2003). An evolutionary heuristic for the index tracking problem. European Journal of Operational Research, 148(3):621-643.

[4] Blitz, D. and Hottinga, J. (2001). Tracking error allocation. The Journal of Portfolio Management, 27.

[5] Blume, M. and Edelen, R. (2004). S&P 500 indexers, tracking error, and liquidity. The Journal of Portfolio Management, 30:37-46.

[6] Chincarini, L. and Kim, D. (2006). Quantitative Equity Portfolio Management: An Active Approach to Portfolio Construction and Management. McGraw-Hill.

[7] Chung, Y. P., Johnson, H., and Schill, M. J. (2006). Asset pricing when returns are non-normal: Fama-French factors versus higher order systematic comoments. The Journal of Business, 79(2):923-940.

[8] Dorockov, M. (2017). Comparison of ETF’s performance related to the tracking error. Journal of International Studies, 10:154-165.

[9] Follmer, H. and Leukert, P. (1999). Quantile hedging. Finance and Stochastics, 3(3):251-273.

[10] Franks, E. C. (1992). Targeting excess-of-benchmark returns. The Journal of Portfolio Management, 18(4):6-12.

[11] Gaivoronski, A. and Pflug, G. (2005). Value-at-Risk in Portfolio Optimization: Properties and Computational Approach. Journal of Risk, 7(2):1-31.

[12] Giovannetti, B. C. (2013). Asset pricing under quantile utility maximization. Review of Financial Economics, 22(4):169-179.

[13] Harrell, F. E. and Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69(3):635-640.

[14] Jorion, P. (2004). Portfolio optimization with tracking-error constraints. Financial Analysts Journal, 59.

[15] Kahneman, D. and Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263-291.

[16] Kritzman, M. P. (1987). Incentive fees: Some problems and some solutions. Financial Analysts Journal, 43(1):21-26.

[17] Ma, L., Tang, Y., and Gomez, J. (2019). Portfolio manager compensation in the U.S. mutual fund industry. The Journal of Finance, 74(2):587-638.

[18] Mills, T. C. (1995). Modelling skewness and kurtosis in the London Stock Exchange FTSE index return distributions. Journal of the Royal Statistical Society. Series D (The Statistician), 44(3):323-332.

[19] Pope, P. F. and Yadav, P. K. (1994). Discovering errors in tracking error. The Journal of Portfolio Management, 20(2):27-32.

[20] Roll, R. (1992). A mean/variance analysis of tracking error. The Journal of Portfolio Management, 18(4):13-22.

[21] Rostek, M. (2010). Quantile Maximization in Decision Theory. The Review of Economic Studies, 77(1):339-371.

[22] Rudolf, M., Wolter, H.-J., and Zimmermann, H. (1999). A linear model for tracking error minimization. Journal of Banking & Finance, 23(1):85 – 103.

[23] Wilcox, R., Erceg-Hurn, D., Clark, F., and Carlson, M. (2014). Comparing two independent groups via the lower and upper quantiles. Journal of Statistical Computation and Simulation, N/A:9pp.

[24] Yamai, Y. and Yoshiba, T. (2002). On the validity of value-at-risk: Comparative analyses with expected shortfall. Monetary and Economic Studies, 20(1):57-85.


Notes

[1] [7] documents excess skewness and kurtosis for cross-sectional daily, weekly, monthly, quarterly and semi-annual asset returns.

[2] See for example, the ESMA https://www.esma.europa.eu/sites/default/files/library/2015/11/2012-832en_guidelines_on_etfs_and_other_ucits_issues.pdf, Morningstar https://media.morningstar.com/uk/MEDIA/Research_Paper/Morningstar_Report_Measuring_Tracking_Efficiency_in_ETFs_February_2013.pdf, and Vanguard https://www.vanguard.com.hk/documents/understanding-td-and-te-en.pdf

[3] CFA Institute https://www.cfainstitute.org/-/media/documents/support/programs/investment-foundations/19-performance-evaluation.ashx?la=en hash=F7FF3085AAFADE241B73403142AAE0BB1250B311, International Organization of Securities Commissions and European Securities and Markets Authority https://www.iosco.org/library/pubdocs/pdf/IOSCOPD414.pdf

[4] Zephyr https://www.styleadvisor.com/content/tracking-error, Vanguard https://www.vanguard.co.uk/documents/adv/literature/understand-excess.pdf, Envestnet https://www.envestnet.com/sites/default/files/documents/A%20Tracking%20Error%20Primer%20-%20White%20Paper.pdf

[5] Each series was simulated within Matlab using the pearsrnd function for a Pearson system of random numbers with moments calibrated to match the mean, standard deviation, skewness, and kurtosis of the monthly return of the Dow Jones Industrial Average over the period 1985 through 2019.

[6] The measures of absolute and semi tracking error are beyond the scope of this paper.

[7] We also consider excess standard deviation in the range 0.10 to 5, excess skewness in the range -1.4 to 1.4, and excess kurtosis in the range 1 to 7.

[8] We scale as follows: (Tracking Measure Value - min(Tracking Measure Value)) / (max(Tracking Measure Value) - min(Tracking Measure Value))

[9] We also consider excess standard deviation in the range 0.10 to 5, excess skewness in the range -1.4 to 1.4, and excess kurtosis in the range 1 to 7.

[10] We scale as follows: (Tracking Measure Value - min(Tracking Measure Value)) / (max(Tracking Measure Value) - min(Tracking Measure Value))

[11] More precisely, we divide by the sum of the average bin returns times relative frequencies. Due to the averaging across the bins, this value may not be equal to the actual portfolio return in any given dataset, but will approach that value as the distance between the grid points approach 0.

[12] Our findings are similar for AQuTE and AAQuTE.

[13] During this time period, the difference in kurtosis reached a high of -0.50.



“Rebalance Timing Luck: The Dumb (Timing) Luck of Smart Beta” by Hoffstein, Faber and Braun


PDF Download: Rebalance Timing Luck: The Dumb (Timing) Luck of Smart Beta


Authors

Corey Hoffstein is Chief Investment Officer at Newfound Research. 380 Washington Street 2nd Floor, Wellesley, MA 02481. E-mail: corey@thinknewfound.com. [1]

Nathan Faber is a vice president at Newfound Research. 380 Washington Street 2nd Floor, Wellesley, MA 02481. E-mail: nathan@thinknewfound.com.

Steven Braun is a quantitative analyst at Newfound Research. 380 Washington Street 2nd Floor, Wellesley, MA 02481. E-mail: sbraun@thinknewfound.com.


Abstract

Prior research and empirical investment results have shown that portfolio construction choices related to rebalance schedules may have non-trivial impacts on realized performance. We construct long-only indices that provide exposures to popular U.S. equity factors (value, size, momentum, quality, and low volatility) and vary their rebalance schedules to isolate the effects of “rebalance timing luck.” Our constructed indices exhibit high levels of rebalance timing luck, often exceeding 100 basis points annualized, with total impact dependent upon the frequency of rebalancing, portfolio concentration, and the nature of the underlying strategy. As a case study, we replicate popular factor-based index funds and similarly find meaningful performance impacts due to rebalance timing luck. For example, a strategy replicating the S&P Enhanced Value index saw calendar year return differentials above 40% strictly due to the rebalance schedule implemented. Our results suggest substantial problems for analyzing any investment when the strategy, its peer group, or its benchmark is susceptible to performance impacts driven by the choice of rebalance schedule.


INTRODUCTION

The popularization and distribution of equity factor strategies have been a boon to investors, providing low-cost access to a range of systematic investment styles. However, there is no precise method of measuring or executing these strategies. Differences in the approaches to constructing these strategies can lead to significant dispersion in results even for strategies targeting the same investment style (Ciliberti and Gualdi (2018)). While substantial effort is spent researching new factor signals, refining previously discovered signals, and developing portfolio construction techniques, the seemingly innocuous activity of choosing when to rebalance these strategies is largely absent from the existing literature.

Blitz, van der Grient, and van Vliet (2010) first documented this impact for an annually-rebalanced fundamental equity index, finding a large discrepancy in realized results. This fundamental index, as described in Arnott, Hsu, and Moore (2005), weights its constituents in proportion to the companies’ fundamentals (book value, cash-flow, and dividends), in contrast to the conventional approach where the constituent weights are proportional to their market capitalization. Blitz et al (2010) documented that a fundamental index annually rebalanced in March outperformed an identically constructed index rebalanced in September by over 10 percentage points in 2009, despite the two indices being identical in process and rebalance frequency. Further, the authors found that the realized performance dispersion resulting from the different rebalance schedules [2] was not mean-reverting, generating a permanent remnant in the performance of the indices; an effect large enough to influence investment decisions long after the initial dispersion was manifested.

We label the potential performance dispersion between two identically managed strategies with different rebalance schedules rebalance timing luck (RTL). When applied to a single manager or fund, this concept is theoretical in that the effect lies in the investment decisions that could have been made (e.g. annually rebalancing in March rather than September). The realized performance of a fund cannot be changed, and RTL can only be explicitly measured ex-post through the lens of a theoretical universe of identically managed investment strategies with varied rebalance schedules. Importantly, the effects of RTL can also present themselves when comparing a manager’s performance to another manager or even to a benchmark. Given different rebalance schedules, positive and negative RTL impacts can make a given manager appear more or less skilled. [3]

To illustrate these effects, we first construct long-only U.S. equity strategies designed to capture value, momentum, quality, and low volatility tilts, where the universe of eligible securities is obtained from the S&P 500 universe and fundamental data is obtained from Sharadar Fundamentals. For each style, we vary the target number of holdings as well as the rebalance frequency to target specific sensitivities to these explicit decisions. In line with the analytical derivation of RTL from Hoffstein, Faber, and Sibears (2019), we find that the realized RTL is directly influenced by the number of holdings, the portfolio turnover realized by the strategy, and the rebalance frequency. Our results also align with the expectation that strategies with low average turnover tend to exhibit less RTL.

To further illustrate the real-world effects of timing luck, we then replicate popular smart beta indices in the United States Large-Cap equity space. Our findings suggest that the choice of rebalance schedule is material and has affected annualized returns by as much as 200 basis points for higher turnover strategies, with one-year performance discrepancies as high as 40 percentage points.

Through the results in our study, we extend the literature by validating the existence of RTL in indices corresponding to popular equity investment styles. Further, by utilizing the framework identified in Hoffstein et al (2019), our results empirically validate the influence that portfolio concentration, portfolio turnover, and rebalance frequency choices have on the realized results of an investment strategy. By explicitly testing the RTL framework on different equity investment styles, we also show that the analytical derivation of RTL unveils significant insights for analyzing the realized performance of an investment strategy.

Our results suggest significant potential problems for return-based strategy comparisons and analysis. For example, failing to inoculate a benchmark against the effects of RTL can cause a strategy to appear skilled or unskilled by relative comparison when the performance dispersion is actually an artifact of luck. This is a particularly timely topic given the popularization of “smart beta” strategies and other systematic funds over the last decade. Our results show that the specter of RTL is an ongoing influence on portfolio results, and that prioritizing portfolio construction through an overlapping portfolio solution leads to more consistent outcomes for the end investor and mitigates the unpalatable effects of RTL.

CONSTRUCTING EQUITY FACTOR PORTFOLIOS

We begin by constructing long-only, U.S. large-cap factor portfolios, using the S&P 500 as the parent universe. For each factor, securities are first ranked by corresponding characteristics and the top-ranking securities are purchased in equal weight. The characteristics defining our factor strategies are as follows: [4]

To estimate RTL for a given factor, we first construct sub-indexes reflecting the different potential rebalance schedules and then we use those sub-indexes to construct an RTL-neutral benchmark. For the latter, we follow the suggestion of Blitz et al (2010) – proved optimal by Hoffstein et al (2019) – and implement an “overlapping portfolio” solution (also referred to as “staggered rebalancing” or “tranching”) by holding the sub-indexes in equal weight.

By construction, the performance differences that occur between the sub-indexes and the RTL-neutral benchmark are due only to differences in rebalance schedule. Therefore, by calculating the differences in monthly returns between the sub-indexes and the RTL-neutral benchmark, we can empirically measure RTL. Specifically, we measure RTL as the annualized volatility of these differences.
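The measurement described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the RTL-neutral benchmark is approximated as the simple monthly average of sub-index returns (which implicitly assumes the tranches are reset to equal weight each month), and the text does not say how the per-schedule figures are aggregated into a single RTL number, so the sketch reports one value per sub-index.

```python
import numpy as np


def rtl_neutral_benchmark(sub_index_returns: np.ndarray) -> np.ndarray:
    """Equal-weight composite of the sub-indexes.

    sub_index_returns has shape (months, schedules). Averaging returns each
    month is an assumption: it corresponds to resetting tranches to equal
    weight monthly.
    """
    return sub_index_returns.mean(axis=1)


def realized_rtl(sub_index_returns: np.ndarray, periods_per_year: int = 12) -> np.ndarray:
    """Annualized volatility of (sub-index return - benchmark return), per schedule."""
    benchmark = rtl_neutral_benchmark(sub_index_returns)
    differences = sub_index_returns - benchmark[:, None]
    # Sample standard deviation of monthly differences, scaled to annual terms.
    return differences.std(axis=0, ddof=1) * np.sqrt(periods_per_year)


if __name__ == "__main__":
    # Synthetic data only: six hypothetical semi-annual schedules, 20 years of months.
    rng = np.random.default_rng(42)
    common = rng.normal(0.008, 0.04, size=(240, 1))      # shared style return
    schedule_noise = rng.normal(0.0, 0.01, size=(240, 6))  # schedule-specific shocks
    print(realized_rtl(common + schedule_noise))
```

Because the shared style return cancels in the difference, only the schedule-specific component contributes to the estimate, which is the sense in which the measure isolates rebalance timing.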

Hoffstein et al (2019) derived an intuitive closed-form solution for an ex-ante estimate of RTL (Equation 1). From this equation, it becomes clear that RTL (L) is affected by a portfolio’s turnover rate (T), rebalance frequency (f), and the opportunity set allotted to the portfolio (S). [5]
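Equation 1 is not reproduced in this text. As a hedged reconstruction, assumed here from the closed form reported in Hoffstein et al (2019) and not verifiable from the surrounding text alone, the relationship takes the following shape; it should be checked against that paper before use.

```latex
L = \frac{T}{2f}\, S
```

Under this form, RTL rises linearly with the turnover rate (T) and with S, and falls as the rebalance frequency (f) increases, which matches the directional statements made in this section.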

A higher turnover rate implies that the holdings of a portfolio have a higher potential for meaningful divergence across rebalance schedules. Consider a strategy with 100% average annual turnover: a version rebalanced annually in January and a version rebalanced annually in July would have a low level of holdings overlap, increasing the role of RTL in the two portfolios’ performance results. Conversely, a strategy with close to zero turnover would have a high level of holdings overlap between rebalance schedules, implying a lower amount of performance dispersion from RTL alone.

We should think of T as an intrinsic, continuous turnover rate of the strategy driven by the decay speed of the driving signals.  In practice, however, portfolios are typically refreshed at a discrete frequency (f) to balance signal freshness with implementation costs.  For faster moving signals (e.g. momentum which has a particularly short half-life as opposed to a slow signal such as value), the level of signal decay in between rebalance dates can introduce RTL into the portfolio’s performance as the signal begins to decay, favoring more recent information.

With this in mind, we also construct a number of specifications for each factor by varying (1) the number of holdings and (2) the rebalance frequency. Portfolio holdings range between 50 and 400 securities in increments of 50. Rebalance frequency is either annual, semi-annual, or quarterly. [6]
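To make the specification grid concrete, the sketch below enumerates the holdings and frequency combinations and lists the distinct rebalance schedules available at each frequency. The function and variable names are hypothetical; the only inputs taken from the text are the 50-to-400 holdings range in steps of 50 and the three rebalance frequencies.

```python
from itertools import product

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Rebalances per year for each frequency named in the text.
FREQUENCIES = {"annual": 1, "semi-annual": 2, "quarterly": 4}


def rebalance_schedules(times_per_year: int) -> list[tuple[str, ...]]:
    """All distinct month-of-year schedules for an evenly spaced frequency."""
    step = 12 // times_per_year  # months between consecutive rebalances
    return [
        tuple(MONTHS[start + k * step] for k in range(times_per_year))
        for start in range(step)
    ]


holdings_grid = list(range(50, 401, 50))                 # 50, 100, ..., 400
specifications = list(product(holdings_grid, FREQUENCIES))  # per factor

if __name__ == "__main__":
    print(len(specifications), "specifications per factor")
    for name, n in FREQUENCIES.items():
        print(name, len(rebalance_schedules(n)), "schedules")
    print(rebalance_schedules(2))
```

A semi-annual frequency yields six schedules (JAN-JUL through JUN-DEC), an annual frequency twelve, and a quarterly frequency three, so fewer sub-indexes are needed as the rebalance frequency rises.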

Exhibit 1 depicts the calculated RTL of the four factor portfolios for different concentration and rebalance frequency specifications. [7]


Exhibit 1

In this table, we show the empirical estimate of timing luck of Value, Momentum, Quality, and Low Volatility U.S. Large Cap equity factor portfolios for annual, semi-annual, and quarterly rebalance frequencies, varied by the number of holdings in the portfolio. The Momentum portfolio is constructed by sorting on 12-1 month realized returns; the Value portfolio is constructed by sorting on trailing twelve-month earnings yield; the Quality portfolio is constructed by sorting on the average rank of trailing twelve-month return on equity, accruals ratio (negative), and leverage ratio (negative); the Low Volatility portfolio is constructed by sorting on trailing twelve-month realized volatility (negative). The time-period for these results is July 2000 to September 2019.

Source: Sharadar. Past performance is not an indicator of future results. Performance is backtested and hypothetical. Performance figures are gross of all fees, including, but not limited to, manager fees, transaction costs, and taxes. Performance assumes the reinvestment of all distributions.


In line with Equation 1, the empirical results show that higher turnover styles, such as momentum, exhibit higher realized RTL as opposed to lower turnover styles such as low volatility. Further, higher portfolio concentration (i.e. fewer holdings) increases the magnitude of RTL as more concentrated portfolios would reduce the level of holdings overlap between rebalance versions, while more frequent rebalancing tends to reduce it. Surprising, however, is the actual magnitude of RTL; for a semi-annual rebalance schedule, annualized RTL is as high as 2.5%, 4.4%, 1.1% and 2.0% for 100-stock value, momentum, low volatility, and quality portfolios, respectively.

A portfolio that takes a long position in one of these sub-portfolios while being short another could then explicitly capture the relative effect of timing luck between the two portfolios. If we assume that the impacts of RTL are independent of one another, we can calculate the volatility of this long-short portfolio through Equation 2, where vi and vj are the different sub-portfolios of the same strategy. From this, a confidence interval can be generated to capture the potential return range that a strategy would be expected to achieve simply from the rebalance choices the strategy had made. For the 100-stock value, momentum, low-volatility, and quality portfolios, we could therefore infer that a strategy targeting one of these styles could have resulted in performance dispersions of +/- 7.1, 12.5, 3.1, and 5.7 annual percentage points, respectively, due to RTL alone.
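The arithmetic behind these dispersion figures can be sketched as follows. Two assumptions are made that the text does not state explicitly: that both legs of the long-short portfolio carry the same RTL, so that independence gives a combined volatility of the square root of the sum of squares, and that the band is two standard deviations wide on each side. The function names are hypothetical, and the inputs are the rounded semi-annual RTL figures quoted earlier, so the results match the quoted dispersions only to within rounding.

```python
import math

# Annualized RTL (in percent) for 100-stock, semi-annually rebalanced portfolios,
# as quoted in the text.
SEMI_ANNUAL_RTL = {
    "value": 2.5,
    "momentum": 4.4,
    "low_volatility": 1.1,
    "quality": 2.0,
}


def long_short_volatility(rtl_long: float, rtl_short: float) -> float:
    """Volatility of a long-short pair when the two RTL impacts are independent."""
    return math.hypot(rtl_long, rtl_short)  # sqrt(rtl_long**2 + rtl_short**2)


def dispersion_band(rtl: float, z: float = 2.0) -> float:
    """Half-width of the band; z = 2 is an assumed multiplier, not from the text."""
    return z * long_short_volatility(rtl, rtl)


if __name__ == "__main__":
    for style, rtl in SEMI_ANNUAL_RTL.items():
        print(f"{style}: +/- {dispersion_band(rtl):.2f} percentage points")
```

With equal RTL on both legs the long-short volatility reduces to the square root of two times the single-portfolio RTL, which is why the band is roughly 2.8 times the RTL estimate for each style.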

These results complicate the manager selection process as the annual returns of two managers tilting towards the same style could be several hundred basis points apart strictly due to different rebalance schedules and nothing else.  Conversely, the skill of a manager may appear diminished (inflated) when compared to a benchmark that realized positive (negative) RTL. 

To highlight the effects of dispersion caused by RTL, Exhibit 2 depicts the various equity curves of the sub-indexes for a semi-annually rebalanced, 100-stock momentum strategy. We also construct the RTL-neutral benchmark (labeled “Tranche”). Exhibit 3 details the realized performance statistics of the sub-indices as well as their tracking error to the RTL-neutral benchmark. We find that the minimum tracking error realized is 2.9%, which happens to also arise from the best-performing rebalance schedule over the analysis period (MAY-NOV), while the greatest tracking error realized over this period is 4.6%.

While the sub-index rebalanced in May and November had the highest realized returns, the performance difference is not statistically significant and suggests that the realized excess performance of this parameterization is not persistent.  Rather, the May and November rebalance schedule simply benefited from positive RTL shocks relative to its peers.


Exhibit 2

In this figure, we show the equity curves of 100-stock equity momentum portfolios constructed from the S&P 500 universe. These portfolios depict the different rebalance schedules of a semi-annual rebalance frequency. The tranched portfolio is also shown which represents a composite of the different rebalance schedules.

Source: Sharadar. Past performance is not an indicator of future results. Performance is backtested and hypothetical. Performance figures are gross of all fees, including, but not limited to, manager fees, transaction costs, and taxes. Performance assumes the reinvestment of all distributions.


Exhibit 3

In this table, we show the annualized performance statistics of the six rebalance schedules available to a semi-annually rebalanced equity momentum portfolio sorted on 12-1 month realized returns, as well as the tranched composite of these rebalance schedules. Tracking error is calculated relative to the tranched composite.


Constructing portfolios that are long one sub-index and short another for all pairs isolates the relative RTL between the two sub-indices. We find that the overall significance of any persistent outperformance is low, indicating that no rebalance schedule shows significant outperformance over other versions of the strategy. Out of the fifteen pairwise combinations of the six momentum rebalance schedules, none were found to be statistically significant,[8] and similar results were found in the remaining styles (pairwise t-stat tables can be found in Appendix A).
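A pairwise test of this kind can be sketched as below. The exact test statistic used for Appendix A is not spelled out in the text, so this sketch assumes a plain one-sample t-statistic on the mean of the periodic long-short return spread; the function name and the synthetic inputs are hypothetical.

```python
from itertools import combinations

import numpy as np


def pairwise_t_stats(returns: np.ndarray, labels: list[str]) -> dict[tuple[str, str], float]:
    """t-statistic of the mean return spread for every long/short pair of schedules.

    returns has shape (periods, schedules); labels names each column.
    """
    n_periods = returns.shape[0]
    stats = {}
    for i, j in combinations(range(returns.shape[1]), 2):
        spread = returns[:, i] - returns[:, j]              # long i, short j
        std_error = spread.std(ddof=1) / np.sqrt(n_periods)
        stats[(labels[i], labels[j])] = spread.mean() / std_error
    return stats


if __name__ == "__main__":
    # Synthetic data only: six schedules with identical expected returns.
    rng = np.random.default_rng(7)
    labels = ["JAN-JUL", "FEB-AUG", "MAR-SEP", "APR-OCT", "MAY-NOV", "JUN-DEC"]
    monthly = rng.normal(0.008, 0.05, size=(225, len(labels)))
    for pair, t in pairwise_t_stats(monthly, labels).items():
        print(pair, round(float(t), 2))
```

Six schedules produce fifteen unordered pairs, matching the count of long-short portfolios discussed above; reversing the long and short legs only flips the sign of each statistic.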

Importantly, this test of significance serves to assess whether any rebalance schedule is inherently superior to the others. The lack of evidence for schedule superiority suggests that RTL is an uncompensated source of risk in portfolio construction. This risk manifests in the dispersion of terminal wealth achieved, and the RTL shocks that lead to this dispersion are not expected to have mean-reverting characteristics, as shown in Blitz et al (2010).

To further isolate the dispersion due to RTL, Exhibit 4 plots the rolling 252-day performance difference between two different rebalance schedules for a semi-annually rebalanced 100-stock momentum strategy. Shockingly, the seemingly trivial decision to rebalance the portfolio in May and November resulted in a twenty percentage-point return difference when measured against the same strategy, with its rebalance shifted by only one month (April and October).
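The rolling comparison can be sketched as follows. This is an illustrative calculation, not the authors' code: it assumes the rolling figure is the difference between the two strategies' compounded returns over each trailing 252-trading-day window, and the function names and inputs are hypothetical.

```python
import numpy as np


def rolling_return(daily_returns: np.ndarray, window: int = 252) -> np.ndarray:
    """Compounded return over each trailing window of daily returns."""
    # Growth of one unit of capital, with a leading 1.0 for the starting value.
    growth = np.concatenate(([1.0], np.cumprod(1.0 + daily_returns)))
    return growth[window:] / growth[:-window] - 1.0


def rolling_difference(returns_a: np.ndarray, returns_b: np.ndarray,
                       window: int = 252) -> np.ndarray:
    """Rolling-window return of schedule A minus that of schedule B."""
    return rolling_return(returns_a, window) - rolling_return(returns_b, window)


if __name__ == "__main__":
    # Synthetic data only: two schedules sharing most of their daily return.
    rng = np.random.default_rng(3)
    common = rng.normal(0.0004, 0.012, 2000)
    may_nov = common + rng.normal(0.0, 0.003, 2000)
    apr_oct = common + rng.normal(0.0, 0.003, 2000)
    diff = rolling_difference(may_nov, apr_oct)
    print(diff.min(), diff.max())
```

Each output element covers exactly one window, so a series of n daily returns yields n - 251 rolling observations when the window is 252 days.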


Exhibit 4

In this figure, we show the rolling 252-day performance difference between a 100-stock momentum portfolio rebalanced in May/November and a 100-stock momentum portfolio rebalanced in April/October.

Source: Sharadar. Past performance is not an indicator of future results. Performance is backtested and hypothetical. Performance figures are gross of all fees, including, but not limited to, manager fees, transaction costs, and taxes. Performance assumes the reinvestment of all distributions.


REPLICATING EXISTING SMART BETA PRODUCTS

To bridge the gap from hypothetical to use-case, we replicate the process behind the S&P 500 Enhanced Value, Momentum, Low Volatility, and Quality indices. Specifically, we implement the rules disclosed in the index methodology as follows:[9]

Building from these rules, we construct all possible rebalance schedule variations of these four indexes.[10] Exhibit 5 highlights the terminal wealth realized from the portfolios along with the best and worst performing rebalance schedules. The resulting portfolios are shown to exhibit significant amounts of performance dispersion, flowing through to meaningful differences in the terminal wealth accumulated. Again, it is important to emphasize that the only difference in these portfolios is the rebalance schedule: all other aspects of the portfolio construction process are held constant.


Exhibit 5

In this figure, we show the terminal wealth results from a one-dollar investment in different replicated S&P equity factor index variations from January 2001 to September 2019.


For the Enhanced Value, Momentum, Low Volatility, and Quality indices, the annualized return dispersion between the best- and worst-performing rebalance schedules is 100, 192, 25, and 106 basis points, respectively. Importantly, no pattern exists as to which rebalance schedule shows consistent under- or out-performance across factors.

Exhibits 6, 7, 8, and 9 plot the calendar year returns in excess of the average sub-portfolio return for that year, for different rebalance schedules. The annual returns of the factors highlight that periods of elevated market volatility can exacerbate performance dispersion. The S&P 500 Enhanced Value replications, for example, see a highly significant dispersion arising in 2009, whereby the indices rebalanced in FEB-AUG and JAN-JUL significantly outperformed the other versions. Between the JAN-JUL and JUN-DEC rebalance schedules, the performance differential in 2009 is an astounding 41.7 percentage points.
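The calendar-year comparison can be sketched as below. The sketch assumes monthly sub-portfolio returns are compounded within each calendar year, that excess return is measured against the simple cross-schedule average for that year, and that dispersion is the gap between the best and worst schedule in a year; the function names and sample numbers are hypothetical.

```python
import numpy as np


def compound_by_year(monthly_returns: np.ndarray, years) -> tuple[np.ndarray, np.ndarray]:
    """Compound monthly returns (months x schedules) into calendar-year returns."""
    years = np.asarray(years)
    unique_years = np.unique(years)
    annual = np.vstack([
        np.prod(1.0 + monthly_returns[years == y], axis=0) - 1.0
        for y in unique_years
    ])
    return unique_years, annual


def excess_vs_average(annual_returns: np.ndarray) -> np.ndarray:
    """Each schedule's calendar-year return minus the average across schedules."""
    return annual_returns - annual_returns.mean(axis=1, keepdims=True)


def annual_dispersion(annual_returns: np.ndarray) -> np.ndarray:
    """Best-minus-worst schedule return in each calendar year."""
    return annual_returns.max(axis=1) - annual_returns.min(axis=1)


if __name__ == "__main__":
    # Tiny made-up example: two schedules, three months spanning two years.
    monthly = np.array([[0.10, 0.00],
                        [0.10, 0.00],
                        [0.00, 0.05]])
    yrs, annual = compound_by_year(monthly, [2008, 2008, 2009])
    print(yrs)
    print(excess_vs_average(annual))
    print(annual_dispersion(annual))
```

Measuring against the cross-schedule average rather than a single reference schedule keeps the comparison symmetric, since no one rebalance schedule is treated as the default.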


Exhibit 6

In this figure, we show the calendar year excess returns of the replicated S&P 500 Enhanced Value index relative to the average sub-portfolio calendar year return, varied by rebalance schedule.

Source: Sharadar. Past performance is not an indicator of future results. Performance is backtested and hypothetical. Performance figures are gross of all fees, including, but not limited to, manager fees, transaction costs, and taxes. Performance assumes the reinvestment of all distributions.


Exhibit 7

In this figure, we show the calendar year excess returns of the replicated S&P 500 Momentum index relative to the average sub-portfolio calendar year return varied by rebalance schedule.

Source: Sharadar. Past performance is not an indicator of future results. Performance is backtested and hypothetical. Performance figures are gross of all fees, including, but not limited to, manager fees, transaction costs, and taxes. Performance assumes the reinvestment of all distributions.


Exhibit 8

In this figure, we show the calendar year excess returns of the replicated S&P 500 High Quality index relative to the average sub-portfolio calendar year return, varied by rebalance schedule.

Source: Sharadar. Past performance is not an indicator of future results. Performance is backtested and hypothetical. Performance figures are gross of all fees, including, but not limited to, manager fees, transaction costs, and taxes. Performance assumes the reinvestment of all distributions.


Exhibit 9

In this figure, we show the calendar year excess returns of the replicated S&P 500 Low Volatility index relative to the average sub-portfolio calendar year return, varied by rebalance schedule.

Source: Sharadar. Past performance is not an indicator of future results. Performance is backtested and hypothetical. Performance figures are gross of all fees, including, but not limited to, manager fees, transaction costs, and taxes. Performance assumes the reinvestment of all distributions.


The S&P 500 Momentum replications show more consistent performance dispersion throughout the period analyzed, given that the turnover of this strategy tends to remain high: in the majority of years the difference is at least four percentage points.[11] Measured as the absolute difference in calendar year returns, the minimum annual performance dispersion is 1.3, 4.5, 1.8, and 0.1 percentage points for Enhanced Value, Momentum, Quality, and Low Volatility, respectively. The maximum return differences were 41.7, 14.6, 8.6, and 2.9 percentage points, respectively. Elevated bouts of broad market volatility tend to increase the amount of absolute dispersion (e.g. 14.6 percentage points in 2002 and 14.1 percentage points in 2009).

CONCLUSION

While the concept and execution of rebalance schedules have been glossed over in the existing literature, a decision must be made as to when a strategy is measured and executed. This decision does not come without consequence. Empirical evidence has shown that performance results can vary drastically and leave a lasting impact on wealth outcomes.

In this piece, we explored the impact of rebalance timing luck on the results of smart beta / equity style portfolios with varying portfolio characteristics. We empirically tested this impact by designing a variety of portfolio specifications for four different equity styles (Value, Momentum, Low Volatility, and Quality). The specifications were varied by holding concentration as well as rebalance frequency.

We then constructed all possible rebalance variations of each specification to calculate the realized impact of rebalance timing luck over the test period (2001-2019). In line with the mathematical model from Hoffstein et al (2019), we generally find that those strategies with higher turnover are more sensitive to timing luck, while those that rebalance more frequently exhibit less timing luck. Additionally, a higher number of portfolio holdings reduces the impact timing luck has on realized returns, all else equal.

The sheer magnitude of timing luck, however, may come as a surprise to many. For reasonably concentrated portfolios (100 stocks) with semi-annual rebalance frequencies (common in many index definitions), annual timing luck ranged from 1% to 4%, which translated to a 95% confidence interval in annual performance dispersion ranging from +/-1.5% per year for low turnover strategies to +/-12.5% for higher turnover strategies. We also identify periods in which this estimate falls drastically short of empirical results.

These results call into question one’s ability to draw meaningful relative performance conclusions between two strategies, or a strategy and its benchmark, even if other variables such as factor definition and portfolio construction methods are controlled.

We then explored more concrete examples, replicating the S&P 500 Enhanced Value, Momentum, Low Volatility, and Quality indices, which are tracked by live assets. In line with expectations, we find that Momentum (a high turnover strategy) exhibits significantly higher realized timing luck than a lower turnover strategy rebalanced more frequently (e.g. Low Volatility). For these four indices, the amount of rebalance timing luck leads to a staggering level of dispersion in realized terminal wealth.

Given that most of the major equity style benchmarks are managed with annual or semi-annual rebalance schedules, even the benchmarks that investors use for comparison and analysis may be realizing hundreds of basis points of positive or negative performance luck a year. While identifying and testing the impacts of RTL in a systematically managed strategy is certainly feasible, conducting the same exercise with a discretionary, actively managed strategy becomes non-trivial. Given that an active manager would not necessarily operate on a set rebalancing schedule, one might argue that timing is an active decision within an active manager’s process. Nevertheless, while difficult to explicitly measure, the specter of RTL would still play an important role in the manager’s result and therefore comparison against an RTL-neutral benchmark would be prudent.  With such a large emphasis on identifying and quantifying the skill of investment managers, investors should always bear in mind that supposed skill, seemingly beyond passive smart beta investing, might merely be attributable to dumb (timing) luck.


Appendix A

This appendix shows the t-statistics of the annualized realized returns of long-short portfolios for each equity style. The portfolios are constructed by creating a portfolio that is long one rebalance schedule and short another from January 2001 to September 2019.  The t-stats depicted in these tables show the significance of average outperformance of the rebalance schedules, where the existence of statistically significant results would indicate the existence of a superior rebalance schedule over a long timeframe.  Bolded values indicate statistical significance at the 5% level.

Pairwise t-stat table of constructed Long-Short Value portfolios of different rebalance dates. 5% statistical significance is indicated in bold.

Pairwise t-stat table of constructed Long-Short Momentum portfolios of different rebalance dates. 5% statistical significance is indicated in bold.

Pairwise t-stat table of constructed Long-Short Quality portfolios of different rebalance dates. 5% statistical significance is indicated in bold.

Pairwise t-stat table of constructed Long-Short Low-Volatility portfolios of different rebalance dates. 5% statistical significance is indicated in bold.


References

Arnott, R.D., Hsu, J., and Moore, P. (2005), “Fundamental Indexation”, Financial Analysts Journal, Vol. 61, No. 2, 83-89.

Blitz, D., van der Grient, B., and van Vliet, P. (2010). “Fundamental Indexation: Rebalancing Assumptions and Performance,” Journal of Index Investing, Vol. 1, No. 2, 82-88.

Ciliberti, S., and Gualdi, S. “Portfolio Construction Matters.” arXiv.org, October 19, 2018. https://arxiv.org/abs/1810.08384.

Doran, J., Jiang, D., and Peterson, D. (2012). “Gambling Preference and the New Year Effect of Assets with Lottery Features,” Review of Finance, Vol. 16, No. 3, 685-731.

Haugen, R., and Lakonishok, J. (1988). “The Incredible January Effect: The Stock Market’s Unsolved Mystery”. Homewood Ill.: Dow Jones-Irwin.

Hoffstein, C., Faber, N., and Sibears, D. (2019). “Rebalance Timing Luck: The Difference Between Hired and Fired,” Journal of Index Investing, Vol. 10, No. 1, 27-36.

Keim, D. (1983). “Size Related Anomalies and Stock Return Seasonalities,” Journal of Financial Economics, Vol. 12, No. 1, 13-32.

Sias, R. (2007). “Causes and Seasonality of Momentum Profits.” Financial Analysts Journal, Vol. 63, No. 2, 48-54.


Notes

[1] The authors would like to thank (in alphabetical order) Adam Butler, David Cantor, Conrad Ciccotello, Antti Ilmanen, and Pim van Vliet who offered their opinions and insights.

[2] Herein we distinguish between rebalance frequency (e.g. semi-annual or annual) and rebalance schedule (e.g. every June and December or each May). The frequency defines how often the strategy is rebalanced while the schedule determines when, specifically, the rebalances occur within a year.

[3] When analyzing active portfolio managers, it is important to highlight that there is no evidence that managers make deliberate rebalance choices with the objective of maximizing performance, so any rebalance choice from actively managed portfolios is an active decision with unmeasured risk.

[4] The characteristics used to construct our factor portfolios were selected because they generally align with the existing literature and with popular indices tracking each style. They are meant to be representative only, but our research suggests the results hold without loss of generality.

[5] The S variable in Equation 1 is technically the estimated volatility of a long/short portfolio, where the long leg is what the portfolio is invested in and the short leg captures the residual assets that the portfolio could be invested in at a given time. See Hoffstein et al. (2019) for further discussion of this variable.

[6] Data comes from Sharadar and utilizes all available pricing history at the time of writing (2001 to 2019).

[7] All return results presented are gross of transaction fees or advisory expenses, so any increases in portfolio turnover from more frequent rebalances would negatively influence net returns, all else equal.

[8] Existing literature documents a seasonality effect in momentum profits known as the “January Effect.” The anomaly is attributed to window dressing (managers removing losing holdings from a portfolio before holdings are disclosed at year-end), market liquidity conditions, higher investor risk appetite, and tax-loss selling of underperforming stocks. The January Effect has been shown to boost the returns of common factor strategies in January while impairing the returns of momentum strategies. Conversely, the effect originates in December, when institutional buying of recent winners pushes momentum profits higher. See Keim (1983), Haugen and Lakonishok (1988), Sias (2007), and Doran, Jiang, and Peterson (2012) for further descriptions and evidence of this phenomenon.

Within the scope of this study, we found that the MAY-NOV momentum strategy (rebalanced and remeasured at month-end in May and November) outperformed the other rebalance schedules; however, when analyzed through the lens of long-short portfolios, no combinations were found to be significant. Further, in a simulation-based analysis of significance, no pairings produced returns that were statistically different from zero.

[9] These methodologies were referenced from the S&P Dow Jones Indices website in December 2019.

[10] For indices with semi-annual rebalance schedules, there are six unique sub-indices that can be constructed, while there are three sub-indices available for an index that rebalances quarterly.
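
The sub-index counts in note 10 follow from the spacing between rebalances: with k evenly spaced rebalances per year, there are 12/k distinct starting months. A minimal sketch of that enumeration is below; the function name `rebalance_schedules` and the month labels are illustrative assumptions, not taken from any index methodology.

```python
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def rebalance_schedules(rebalances_per_year):
    """List every unique, evenly spaced rebalance schedule for a frequency.

    Assumption: rebalances fall at month-end and are evenly spaced, so the
    frequency must divide 12.
    """
    if rebalances_per_year < 1 or 12 % rebalances_per_year != 0:
        raise ValueError("rebalances_per_year must be a divisor of 12")
    step = 12 // rebalances_per_year  # months between consecutive rebalances
    # Each starting offset within the first `step` months is a distinct schedule.
    return [[MONTHS[m] for m in range(start, 12, step)] for start in range(step)]


for frequency, label in [(2, "semi-annual"), (4, "quarterly")]:
    schedules = rebalance_schedules(frequency)
    print(label, len(schedules), ["-".join(s) for s in schedules])
```

Under these assumptions the semi-annual case yields six schedules (including MAY-NOV) and the quarterly case yields three, matching the counts in the note.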

[11] The minimum performance dispersion across the factor replications, measured by the absolute difference in calendar-year returns, was 1.3, 4.5, 1.8, and 0.1 percentage points for Enhanced Value, Momentum, Quality, and Low Volatility, respectively. The maximum return differences were 41.7, 14.6, 8.6, and 2.9 percentage points.


PDF Download: Rebalance Timing Luck: The Dumb (Timing) Luck of Smart Beta

