Abstract
The ubiquitous spreadsheet can be used to model situations with random values, in what is commonly referred to as Monte Carlo simulation. For simple cases, adding random functions such as Excel^{TM}'s RAND) is enough. In general business models, complex inverse distribution functions, in combination with RAND, are needed to generate the right random values. But first the modeler must determine the appropriate bestfit distribution to use. This can be a daunting process for undergraduates and typical executives. So for expediency, simulation addins (with additional learning time and possible costs) may be employed. The use of addins, however, makes the modeling less transparent. A more direct alternative is to resample the raw data, which in many cases are not sufficient in sample size to establish statistical goodness of fit. This paper reviews the limitations of current spreadsheet resampling methods and proposes new simple yet effective formulations that better accommodate classroom and practical realworld application.
Monte Carlo simulations have been used to evaluate business situations where there are uncertainties and randomness. For each random variable, its values typically are assumed to vary according to common probability distributions; one is selected that best represents the variable's behavior. In spreadsheets, these would simply be cells having formulas that automatically generate random values. Simple cases, like those typically found in academic exercises, can use random functions such as Excel^{TM}'s RAND to generate the needed uniformly distributed random values. In general, complex inverse distribution functions, in combination with RAND, are required.
Modeling random behavior in a spreadsheet, as opposed to writing a computer subroutine using a programming language, is fairly straightforward and intuitive and at the same time powerful and flexible. However, there still are some difficulties. First, to be technically rigorous, field data have to be collected, candidate distribution functions testfitted to this sample, and, for the selected function, the associated population parameters accurately estimated. Often the amount of data that can be collected is insufficient for these tasks to be properly done. Moreover, the procedures involved are challenging for typical undergraduate students and business executives, who have to do the modeling and analysis work, and to their managers, who have to understand and accept the models and analysis results.
For expediency, Excel simulation addins including CrystalBall^{TM} (Decisioneering, 2006), @Risk^{TM} (Palisade Corporation, 2006), XLSim^{TM} (AnalyCorp, 2006), Insight^{TM} (Savage, 2002), Simtools^{TM} (Myerson, 2006), and Resampling Stats^{TM} (Resampling Stats, 2006) have been used. These are comprehensive software packages, use of which usually means additional learning time for the students. Unless the software is applied extensively by a business, our experience suggests that it is really neither cost nor time effective. The packages have features to help modelers select the needed distribution and generate the random values. Though the technical burden is much reduced for the modeler, the process is less transparent, and the responsibility for technical accuracy is subtly relegated to the addin software. For example, a good modeler should not blindly adopt a distribution just because the addin software found it to have the best fit to the data sample. So although addins are valuable in simplifying the work, learning is compromised.
With some ingenuity, native spreadsheet features can provide better learning opportunities for students. They can appreciate and learn about the behavior of random variables, statistical thinking, and business modeling. This is consistent with industry feedback and, as elaborated in Leong and Cheong (2006), quantitative foundation and management science courses, even with the aid of userfriendly software like Excel, can at best teach students to be better consumers of analysis (Powell, 1997). Management education can be made more relevant to the market place by training students to be active modelers, end users building models to address immediate challenges and concerns.
For many years, we have been teaching a course on spreadsheet business modeling; it is mostly taken by secondyear undergraduates from business, accounting, economics, information systems, and social sciences majors. The classes are conducted in an interactive manner, that is, seminar style with about 45 students per class. In this course, Monte Carlo simulation is taught at an introductory level over 2 weeks (i.e., 6 hours of class time). Our experience here is that addins, other than the Analysis Toolpak and Solver (both in the MS Office^{TM} standard package), are really not needed for the short class module on Monte Carlo simulation.
Instead we employ resampling. Resampling eliminates the need to fit distributions to the sample data and the ensuing tests for goodness of fit, for which sufficient data must be present to achieve the needed level of statistical confidence. In contrast, resampling as a method requires less data, as it just uses whatever is available. Thus, it allows us to dive directly into business modeling without spending class time on statistical data fitting and distribution function inversion. These more complex topics are best covered in fullblown statistics and simulation courses, rather than in a business modeling course. Students would also find it easier to collect data for their team projects and future ad hoc modeling work. However, so far, spreadsheet resampling approaches have not been studied much. We therefore seek to review and evaluate past attempts in using spreadsheets to resample data and propose simpler yet effective formulations that better accommodate classroom and practical realworld application.
The rest of the paper is organized as follows. A brief review of the literature in resampling and how it relates to Monte Carlo simulation follows this introductory section. Next we elaborate on the current spreadsheet resampling formulations available. Then we propose new Excel formulations, first for discrete data with small ranges. This is followed by resampling methods with interpolations, for continuous data and discrete data with large ranges. A short discussion is included on bias errors arising from the use of frequency tables in resampling. The paper culminates in a section on resampling of multivariate dependent data, and finally, the paper ends with some concluding comments.
The term bootstrapping is often used in the literature interchangeably with resampling. Strictly, in bootstrapping, data are resampled from an original data sample to draw statistical inference about the population or its sampling distribution. That is, it is used for analyzing the data. This work was theoretically established as a general statistical method by Efron (1979) and Efron and Tibshirani (1993). Resampling as a method, however, is more correctly associated with Monte Carlo simulation, an idea derived from gamblers testing their chances by using repeated simulations to check their probability of winning. Simulations can be done by either sampling from a theoretical distribution or resampling from a data sample. So resampling is used here to generate data, as inputs to a business model.
Willemain (1994) first proposed that bootstrapping be made more accessible to teachers and students by implementing it in a simple spreadsheet. He demonstrated how the LOOKUP function may be used to randomly sample univariate raw data. Woodroof (2000) further elaborated with a more complete multivariate illustration on how Lotus 123^{TM} (a spreadsheet program) can be applied to make bootstrapping easier to understand and implement.
More recently, Yu (2003) added that many resampling tools are available in standard applications, such as SAS^{TM} and SyStat^{TM}, and also refuted the misconception that resampling is not founded on strong theoretical support. He did this by first providing a historical survey of the various resampling methods; he then reviewed popular arguments for and against the use of resampling. In many situations, as when the data population is ill defined and the sample size is small, exploring the empirical distribution by resampling should be favored over unjustified use of theoretical distributions.
Duckworth and Stephenson (2003) deliberated on how others have attempted to use resampling as a tool to teach courses in introductory statistics. They proposed that simulation (i.e., random sampling from a known population), resampling (i.e., further sampling from a given sample of the population), and the use of computing (to do either) are necessary statistical tools for health care professionals. In this paper, we are also interested in resampling, but closer to its origins in Monte Carlo simulations and for a broader purpose: its application in general business modeling.
As improvements over those proposed by other researchers, our resampling formulas are also updated to use Excel, a more commonly used spreadsheet program nowadays. One of our formulas has been applied (Leong, 2007) to simulate parametric multiserver queues. For the discussion in this paper, we assume that resampling is done with replacement and will cover both univariate and multivariate cases. A short note on the interesting resampling without replacement case is provided in Appendix 2. Variables in the multivariate case will be allowed to be dependent. If the variables are independent, we can easily use the univariate model multiple times.
To aid discussion, we will use an example (see Figure 1) to explain the resampling formulas. In this example, we seek to resample historical annual returns to simulate for next year the return rates of treasury bills, treasury bonds, and common stocks. Before we present our formulas, we first collate current available approaches and rewrite them as Excel formulas for ease of comparison.
Figure 1. Data and Basic Model
The Excel equivalent of Willemain's (1994) Lotus 123 univariate formula, applied to our example, to resample treasury bill return rates would be as follows:
(1) M11= LOOKUP(INT(RAND()*n) + 1), B$11:B$110, D$11:D$110),
where n is the number of values in the original sample, B11:B110 is the indexRange and D11:D110 is the valueRange. (Explanations are provided in Appendix 1 for those who are less familiar with Excel functions.) The first argument in (1) is a formula that randomly generates an integer between 1 and n. Cell M11 in Figure 1 would be copied and pasted in the other rows below it, invoking Excel's relative referencing property, to replicate more resampled values.
Woodroof's (2000) multivariable formula in Excel^{ }would be:
(2) I11 = INT(RAND())*n+1)
(3a) M11 = LOOKUP(I11, B$11:B$110, D$11:D$110)
(3b) N11 = LOOKUP(I11, B$11:B$110, E$11:E$110)
(3c) O11 = LOOKUP(I11, B$11:B$110, F$11:F$110).
The essential difference here is that I11 is a common value shared by resampling formulas (3a)(3c). Each time resampling is done, a whole row in the original data sample table is selected. This approach retains the statistical dependency among the different variables.
Instead of resampling from the original sample, an approach commonly encountered in textbooks (recent ones include Moore and Weatherford, 2001, Gips, 2003, Hillier and Hillier, 2004, Winston, 2004, and Powell and Baker, 2004) uses LOOKUP functions to resample from probability tables. Such tables compress the sample data by presenting data values against their relative frequencies. Extra computational work is required to construct the tables (a step often skipped by the textbooks). Relative frequencies are taken to be good estimates of the probabilities of occurrences. Examples of such probability tables are shown in Table 1.
Table 1. Examples of Probability Tables
From the relative frequencies, students can compute the cumulative relative frequencies (as the estimated cumulative probabilities). From this, we can shift the values down one row (to be explained later) to obtain the lookupRange. To generate a random value according to the distribution of the data, we can use
(4) resampledValue = LOOKUP(RAND(), lookupRange, valueRange).
Here, the LOOKUP function seeks out the largest value in lookupRange that is less than or equal to the value generated by RAND and remembers its relative position in the lookupRange. It then returns the value in the same relative position in the valueRange. Formulation (4) works best for discrete data with small finite ranges, as in Table 1(a).
To resample discrete data with small ranges, still referring to Figure 1, we can use
(5) M11= LOOKUP(RANDBETWEEN(1,n), B$11:B$110, D$11:D$110).
This formula is modified from (1), making only a minor changereplacing RAND with RANDBETWEEN. RANDBETWEEN(1,n) also uniformly samples from {1, 2. ..., n}, but this function is easier to comprehend. The Analysis Toolpak addin needs to be activated before RANDBETWEEN can be used.
We further simplify (5) to propose the following:
(6) M11 = INDEX(D$11:D$110, RANDBETWEEN(1,n)).
This formula also directly resamples, with equal probability, the n values of the original sample, stored in valueRange D11:D110. It is simpler than (1) and (5) because it no longer requires the indexRange B11:B110. The values resampled would appear with probabilities that match their relative frequencies in the original sample.
Other possible equivalent formulations are
(7) M11 = SMALL(D$11:D$110, RANDBETWEEN(1,n))
and
(8) M11 = LARGE(D$11:D$110, RANDBETWEEN(1,n)).
Functions SMALL and LARGE in (7) and (8) implicitly sort the values in the valueRange D11:D110 in ascending and descending order, respectively. Sorting may appear to be unnecessary, in contrast to (1), (5), and (6); nonetheless, it costs negligible computation time in Excel and is no more difficult to read than (5). The explanation for (7) goes as follows: assign 1/n probability to each of the sorted data values and then compute their cumulative probabilities. Resampling using (7) is thus equivalent to sampling from the population's inverse empirical distribution. Sampling from an inverse distribution is a wellaccepted approach for random values generation, and in using the empirical distribution (7) does so without needing to justify a theoretical distribution. Formulas (5)(8) are improvements over (4), resampling directly from the original sample data, as the effort to construct the probability table is saved. Overall, the preferred approach is the SMALL formulation (7), as it parallels the PERCENTILE formulation, which (to be explained next) resamples for the continuous data case.
For resampling formulas in sections 3 and 4 to be effective, we should strictly require all the possible data values to be represented in the sample. That is, if the variable ranges from 1 to 25, there must be at least one count of 1, at least one count of 2,..., and at least one count of 25 in the sample. Satisfying this requirement could mean very large samples when the discrete variable has a nonuniform distribution or a large range. Applying these formulas for discrete data with large ranges or even for continuous data will yield subranges of discontinuities in the resampled values.
To mitigate this imperfection, we propose in this section more new formulations. We focus first on continuous variables and later elaborate how this formulation can be modified for discrete variables with large ranges.
The proposed formula for resampling continuous data is
(9) M11 = PERCENTILE(D$11:D$110, RAND()).
Similar to the description made for the SMALL formulation (7), this formula samples from the inverse empirical distribution. The additional advantage here is that although the empirical distribution in (7) is a series of staircase steps, this empirical distribution is piecewise linear (joining consecutive values in the original sample) and thus smoother. The random values it generates include the original sample data values and all values linearly interpolated in between. Again, no probability table needs to be created and there is no loss of information from aggregating the data.
Formulation (9) looks deceptively simple, but it stealthily does all the massive sorting (in ascending order) and interpolation work. With a small modification, it turns into the formula for resampling discrete data with large ranges. This is done by incorporating the ROUND function in (9) as follows:
(10) M11 = ROUND(PERCENTILE(D$11:D$110, RAND()),0).
When (9) is used on discrete data, the generated random value, as a result of the implicit interpolation, can be nondiscrete. Applying the ROUND function converts this continuous value into a discrete value.
By visual inspection (see Appendices 3 and 4), we can see that cumulative relative frequency plots of the interpolated resampled values have smaller maximum absolute deviations from the cumulative relative frequency of the original sample than resampling without interpolation. It is thus an improvement in both form and outcome.
When constructing probability tables for continuous data (e.g., Table 1(b)) or discrete data with large ranges (e.g., Table 1(c)), the relative frequencies are given for data intervals, typically referred to as "bins." In Excel, the binRange contains the upper limits of the intervals. On the other hand, the LOOKUP function needs the lower limits of the intervals as input. The shifting down of the cumulative relative frequency by one row, to give the lookupRange, corrects for this. The same needs to done for the discrete data case. We have found that students are either not adept in doing this or have tremendous difficulty accepting the reasons they should. "N/A" errors will arise sporadically when the cumulative relative frequency data range is used as the lookupRange because LOOKUP cannot find relative positions for RAND values smaller than the first cumulative relative frequency value in the table.
Furthermore, to avoid bias, the binRange should not be used in (4) as the valueRange, a common mistake made by students and textbook writers alike. The bins' midpoint values should be used instead. Using the data in Table 1(c), we will illustrate the bias that can arise for all of these reasons. We first arbitrarily simulate a sample of 16 raw data points (shown as the StemandLeaf table in Figure 2), uniformly distributed within the bin intervals, with frequency occurrences that approximately yield the relative frequencies of Table 1(c). From this raw sample, we can compute its mean and standard deviation. Data are then simulated by resampling according to five other scenarios, described in cells H6:I10 of Figure 2. For each of these scenarios, 126 data values were resampled. We computed the mean and standard deviation of these five data sets and compared them against that of the raw data.
The result of one typical representative sample run shows that the sample mean under Scenario 1 (resampling using (10) directly from the sample data) is only 4% larger than the mean of the raw data. The sample mean for Scenario 2 (using bins, with the correct lookupRange and midRange as valueRange) is 43% larger. It was 79% larger when the binRange was incorrectly used as the valueRange (in Scenario 3). The deviation of sample mean for Scenario 4 (using cumulative relative frequency as lookupRange and midRange as valueRange) at 9% is about the same as that of Scenario 5 (also using cumulative relative frequency as lookupRange, but now with binrange as valueRange) at 15%, both of which are inherently incorrect representations of the data distribution. The standard deviations for all the scenarios are comparable. In all, because data are aggregated into bins, resampling using (4) requires more work and has a large bias, and not using the midpoints of bins as resampled values generally makes it worse.
Figure 2. Bin and Other Errors
The new formulations discussed so far are for univariate cases. We can apply them without change to multivariate cases if we assume that the variables are statistically independent. For the general multivariate case, resampling (without interpolation) can be done by refining fomulations (2)(3c) as follows:
(11) I11 = RANDBETWEEN(1, n)
(12a) M11 = SMALL(D$11:D$110, I11)
(12b) N11 = SMALL(E$11:E$110, I11)
(12c) O11 = SMALL(F$11:F$110, I11).
The data for the variables are stored in columns, one for each variable: D11:D110 for treasury bills, E11:E110 for treasury bonds and F11:F110 for common stocks' annual return rates. In multidimensional space, each row represents a data point, so randomly sampling the rows preserves the correlation among the data values from the different variables. Again, this approach is only appropriate for discrete variables with small ranges.
For continuous data and discrete data with large ranges, we propose the following resampling (with interpolation) formulas:
(13) I11 = RAND()
(14) J11 = RANDBETWEEN(1, n1)
(15a) K11 = MATCH(SMALL(F$11:F$110, J11), F$11:F$110,0)
(15b) L11 = MATCH(SMALL(F$11:F$110,J11+1), F$11:F$110,0)
(16a) M11 = I11*INDEX(D$11:D$110,K11) + (1I11)*INDEX(D$11:D$110,L11)
(16b) N11 = I11*INDEX(E$11:E$110,K11) + (1I11)*INDEX(E$11:E$110,L11)
(16c) O11 = I11*INDEX(F$11:F$110,K11) + (1I11)*INDEX(F$11:F$110,L11).
Value range F11:F110 in (15a) and (15b) could be replaced by any of the other variables' value ranges (i.e., E11:E110 or D11:D110). We call the variable selected for this role the pivot variable. Equation (14) randomly selects an (ascending) ordered position in the pivot variable. Equations (15a) and (15b) compute the relative (nonsorted) positions in the pivot variable that correspond to the selected ordered position J11 and its adjacent ordered position J11 + 1. Adjacency of data points here is defined with respect to the sorted order of the pivot variable. It then takes, in Equations (16a)(16c), a weighted average of the two values in the relative positions K11 and L11 for each variable, using uniform random [0,1) value in I11 as a weight.
These are complex formulas but not too difficult to explain and, if done systematically, are easily constructed in Excel. The end result is shown in Figure 3. The resampled data can be used in a class exercise to determine the optimal portfolio mix of the three groups of investment instruments. Remember to "freeze" the resampled values once they have been computed by the cell formulas (by Copy and PasteSpecial Values to itself in Excel). Otherwise, Excel's Solver will not be able to find the optimal solution, because the data values would be ever changing while Solver is running. The scattergrams and cumulative relative frequency charts of the illustrative example (in Appendices 3 and 4) show that the proposed formulas are effective in resampling data.
In our limited computational experiments, we found that choosing the variable with the largest standard deviation (in this case common stocks) as the pivot variable would give the best variance reduction effect, compared with the cases when the other variables are used as pivot variables. The intuition behind this is that this variable has the flattest marginal cumulative distribution function (CDF). Then randomly choosing any two adjacent data points along this profile will give a relatively larger interval to interpolate in, to generate new data points. These two data points are only adjacent along the pivot variable's marginal CDF but may be far apart with respect to the other variables' marginal CDFs. Because the other variables' marginal CDFs are steeper, any new point created by interpolating between two nonadjacent points along them would not deviate too far from their profiles. Detailed theoretical and computational analyses to support this argument and also to show the effectiveness of the resampling formulas are left for future research.
Figure 3. Multivariate Resampling with Interpolation
The major contributions of this paper are in highlighting bias that arises from both proper and improper use of frequency tables in resampling, providing new streamlined formulations in Excel for resampling (with and without replacement) for both univariate and multivariate cases. As typical of resampling methods, no assumption is made here of the underlying distributional form or its parameter values. Therefore, there is no need to test goodness of fit and accuracy of parameter estimates and no need to keep a ready set of required inverse distribution formulas for generating the random values.
We found that distribution fitting and the use of inverse distribution formulas are really challenging for the undergraduates whom we teach, and we would have required additional class time to discuss them. Though available commercial addins can ease the pain, the use of unfamiliar addins puts new stress on students, and class time is again diverted to learn and develop competency to use them. To use resampling to generate data for our classroom modelbuilding exercises, we need to prepare sample data sets ahead of time. For project work, students do field work to collect the data. Now with resampling, even students with statistics phobia can do Monte Carlo simulation modeling.
Another reason that addins are used is to make it easier to replicate the simulations and collate the results. Excel's DataTable feature (not found in older Excel versions and not to be confused with a table of data) can now be used to easily automate replications (refer to Appendix 1 for details). Because computing DataTables would have been taught in the first few weeks of the term, its application in Monte Carlo simulation reinforces learning. However, DataTables is computationally intensive and thus can completely slow down all the opened spreadsheets. It should therefore be avoided whenever possible. For fast calculations, we recommend that relative referenced formulas be set up as base equations in the top row of the simulation table and copied down to other rows to complete the model. It is, however, appropriate to use DataTable to extract and tabulate the Monte Carlo simulation summary results. Even so, remember to "freeze" the values (i.e., Copy and PasteSpecial Values into itself) after they are collected.
Using only native Excel features, the undergraduates in our businessmodeling course learn to build reasonably sophisticated spreadsheet models of problems in operations, marketing, and finance business areas. Through the problembased pedagogy and studio classroom environment, students develop good intimate knowledge of Excel spreadsheet use, and more importantly, can apply them effectively to business problems. This is assessed by having student teams do realworld projects in the community, using what they have learned in the course. Typically, the Excel workbooks the teams generate are of high quality and win the praise of companies and organizations for which they were developed.
AnalyCorp (2006), XLSim and Insight, (last accessed on March 15, 2006).
Barreto, H. and F. Howland (2005), Introductory Econometrics: Monte Carlo Simulation Using Excel, Cambridge University Press, Cambridge, UK.
Decisioneering (2006), Crystal Ball, (last accessed on March 15, 2006).
Duckworth, W. M. and W. R. Stephenson (2003), "Resampling Methods: Not Just for Statisticians Anymore," 2003 Joint Statistical Meetings, San Francisco, CA.
Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," Annals of Statistics, Vol. 7, pp. 126.
Efron, B. and R. Tibshirani (1993), An Introduction to the Bootstrap, Chapman & Hall Ltd.
Gips, J. (2003), Mastering Excel: A ProblemSolving Approach, 2nd ed., Wiley, New York.
Hillier, F.S. and M.S. Hillier (2004), Introduction to Management Science: A Modeling and Case Studies Approach with Spreadsheets, 2nd ed., McGraw Hill, New York.
Leong, TY. (2007), "Simpler Spreadsheet Simulation of MultiServer Queues," INFORMS Transactions on Education, Vol. 7, No. 1, .
Leong, TY. and M.L.F. Cheong (2006), "Teaching Business Modeling Using Spreadsheets," Working paper, Singapore Management University, Singapore.
Moore, J.H. and L.R. Weatherford (2001), Decision Modeling with Microsoft Excel, 6th ed., Prentice Hall, Upper Saddle River, NJ.
Myerson, R. (2006), Simtools.xla, , (last accessed on March 22, 2006).
Palisade Corporation (2006), @Risk, , (last accessed on March 15, 2006).
Powell, G.S. (1997), "From Intelligent Consumer to Active Modeler, Two MBA Success Stories," Interfaces, Vol. 27, No. 3, pp. 8899.
Powell, S.G. and K.R. Baker (2004), The Art of Modeling with Spreadsheets: Management Science, Spreadsheet Engineering, and Modeling Craft, Wiley, New York.
Resampling Stats (2006), Resampling Stats for Excel 3.2, , (last accessed on March 22, 2006).
Rogers, J.L. (1999), "The Bootstrap, the Jackknife, and the Randomization Test: A Sampling Taxonomy," Multivariate Behavioral Research, Vol. 34, No. 4, pp. 441456.
Savage, S.L. (2002), Decision Making with Insight, Duxbury Press, Pacific Grove, CA.
Willemain, T.R. (1994), "Bootstrap on a Shoestring: Resampling Using Spreadsheets," The American Statistician, Vol. 48, No. 1, pp. 4042.
Winston, W.L. (2004), Microsoft Excel: Data Analysis and Business Modeling, Microsoft Press, Redmond, VA.
Wittwer, J.W. (2004), "Generating Random Numbers in Excel for Monte Carlo Simulation," Vertex42.com June 1, .
Woodroof, J. (2000), "Bootstrapping: As Easy as 124," Journal of Applied Statistics, Vol. 27, No. 4, pp. 509517.
Yu, C.H. (2003), "Resampling Methods: Concepts, Applications, and Justification. Practical Assessment," Research and Evaluation, Vol. 8, No. 19.
