Next: Least Squares and the Up: Simulation of random variables Previous: Demonstration of statistical or

## Practical Guidelines for Monte Carlo simulation

Monte Carlo simulation has become an important tool for analyses of models and comparison of statistical methods. This section discusses some practical issues that should be considered when implementing a monte carlo study to compare methods.

1. Decide which methods should be compared. These methods should all be applicable to the same data. If a method has tuning parameters, then you must consider how the tuning parameters are determined for a particular data set. In some cases, a tuning parameter may be derived from statistical theory. For example, the MAD estimator for scale is given by

where MD is the median of the data. The value for the tuning parameter k=1.4826 is chosen to make this an unbiased estimator of the standard deviation if the data is normally distributed. Another estimator is based on the inter-quartile range, scaled to make it unbiased for normal distributions,

However, some methods such as predictive models may have tuning parameters that are not determined by theory. In such cases optimal values may be obtained by k-fold cross-validation. This requires a measure of optimality such as sum of squared errors. The sample is randomly partitioned into k groups and a grid of values for the tuning parameter is specified. For each value of the tuning parameter, the model is fit using the data with group j removed and then the fitted model is applied to group j. The measure of optimality is obtained from the fitted values in group j and this is repeated for each group. The optimality measures are then averaged to obtain an estimate of optimality for the corresponding value of the tuning parameter. This is repeated for each value of the tuning parameter and the value is chosen that has the highest optimality. Obviously the use of k-fold cross-validation within a large monte carlo study will add considerable computation time. An alternative is to treat each value of the tuning parameter as a separate method. This approach makes possible an analysis of the sensitivity of the method to the tuning parameter.

2. Determine what population distributions should be considered for the comparison. In practice it is often the case that populations are not normally distributed, so you must consider what characteristics of populations are important to include as part of the study. Do we only wish to compare methods for normally distributed populations or do we also wish to consider the effects of asymmetry, heavy-tails, or contamination on the methods?

3. Use a fixed seed and a small number of replications to begin code writing so the results can be checked for correctness. Once you are satisfied that your code is working correctly, generate data for all replications of a particular population distribution and save the data using the R function save(). All methods should be applied to the same data. If you decide later to include other methods in your comporison, you will be able to apply the new methods to the same data used for all of the previous methods withou rerunning the entire monte carlo study.

4. In addition to averaging across replications to estimate mean values of the measure of interest, you should also obtain standard errors associated with monte carlo estimates. This can be treated as a sample size determination problem. For example, if you are estimating a proportion such as the proportion of times a confidence interval covers a specified parameter value, then the s.e. would be

where p is the average proportion and N is the number of replications. Examination standard error can show whether or not the number of replications is sufficient to compare methods with the required level of accuracy.

5. Try to avoid looping over replications of a particular population distribution if possible. R functions apply, tapply, sapply, vapply can be helpful for this purpose.

6. If you must loop over a set of parameter values, it can be helpful to construct functions that perform some of the complex computations within a loop so that memory usage won't increase unnecessarily. R generally releases memory required for those computations when a function returns. Avoid making multiple references to the same elements of an array inside a loop because that causes multiple copies of the entire array to be made, potentially increasing execution times significantly.

7. Think about how you want to display results of the study so you can organize the output of your code efficiently. Consider defining your own functions to give the same list of output components from different methods.

Example. Consider the problem of estimating a location parameter of some population. There are a number of methods to obtain confidence intervals, the most commonly used of which is based on the t-distribution. However, it is well-known that the sample mean and s.d. are not robust, so alternatives have been proposed that purport to be more robust. This study will compare 8 methods.

stdT.
Standard confidence interval for a population mean,

IQR.
Replace mean with median and s.d. with scaled IQR. Use same t-value as standard.

Replace mean with median and s.d. with scaled MAD. Use same t-value as standard.

Gini.
Replace mean with median and s.d. with scaled sum of Gini differences. Use same t-value as standard.

Gini1.
Replace mean with median and s.d. with scaled lower quartile of Gini differences. Use same t-value as standard.

Huber.
Replace sample mean with Huber M-estimator of location and s.d. with MAD scale. Use same t-value as standard.

Hubers.
Replace sample mean and s.d. with Huber M-estimators of location and scale. Use same t-value as standard.

Wilcoxon.
Use Hodges-Lehmann estimator associated with the Wilcoxon signed rank test to obtain confidence interval.
Note that all of the alternative methods exaept Standard, Huber, and Hubers give confidence intervals for the population median, not the mean.

The tasks for this project can be partitioned into four parts.

Define parameters and functions. All of the parameters needed for this simulation project are defined in one file:
http://www.utdallas.edu/~ammann/stat6341scripts/Parms.r
All other scripts use these parameters and attributes of the objects defined here. For example, if the number of methods is needed, then use

nmeths = length(meths)

instead of the literal 8. That way any changes or additions to those parameters or functions only need to be made in one location.

To unify the interface to the confidence interval methods, a separate function is defined for each method. These functions all have the same main arguments and the same format for returning a list with four components: mean miscoverage probability, s.e. of mean miscoverage probability, mean interval width, and s.e. of mean interval width. Since some methods are designed to estimate the mean and others the median, these functions include optional arguments for the population mean and median, but a particular method just uses the appropriate location parameter. After running the scripts initially, I decided to add a pseudomedian argument for the Wilcoxon confidence interval since that is what this method estimates for asymmetric distributions. However, this parameter is used as the reference only if the pmed1 argument is not null. I included this option for huber and hubers just for experimentation since those two methods actually give M-estimates of location, not the median or pseudomedian. By organizing the code in this way, I only needed to make these changes in the parameter file and then make any needed modifications to the function call in the other scripts.

This file needs to be sourced just once before using the other scripts for the project. All of the objects defined here, including the function definitions, are saved in the file Data/Parms.RData. Note that this file also sources the script described below that defines a function to generate contaminated normal data. All other scripts access these objects by putting the line

load("Data/Parms.RData")

at the beginning of the other scripts. The subdirectory Data is used to store this parameter file along with all of the data generated for the project.

Generate data. The distributions used for this study are normal, t-distribution with 3 d.f., asymmetrically contaminated normal, symmetrically contaminated normal, and gamma. R has functions defined for the normal, t, and gamma distributions, so to unify the code, I wrote a new function that generates data for contaminated normal. That code is here:
http://www.utdallas.edu/~ammann/stat6341scripts/rnormcontam.r
and is sourced by the Parms.r script.
Notes:

• The number of observations that are contaminated in each simulated sample is a random variable, not a fixed number. This corresponds to the probability model

where is a Bernoulli r.v. with success probability and are independent. This is not the same as replacing a fixed proportion of observations in each sample.
• The code to identify specific rows in each sample that are replaced by the contaminated values is vectorized so there is no looping over replications to accomplish this task.
• After writing and using this function, an option was added to use the Gamma distribution as the contaminating distribution. This was done by adding the optional argument useGamma with default value FALSE so that code that used the original function could use the modified function without needing any changes. This script is here:
http://www.utdallas.edu/~ammann/stat6341scripts/rnormcontam1.r

All of the generated data objects are saved in the Data subdirectory using self-describing file names generated by the paste() function. These objects are accessed with the load() function. For any monte carlo study all methods must be applied to the same data, so if any methods are modified or new methods added, these can be applied without rerunning everything. The code for data generation is here:
http://www.utdallas.edu/~ammann/stat6341scripts/ConfintMCdata.r

Apply methods. To avoid loops over the methods within this script, Parms.r defines a function All.ci for this purpose. That avoids memory buildup since a function releases memory needed within the function once it returns. The script
http://www.utdallas.edu/~ammann/stat6341scripts/ConfintMC.r
performs the tasks needed here. This script collects each output component (mean miscoverage probability, etc.) within separate arrays for each type of population in the study. These output arrays are saved in the Out subdirectory using self-describing file names for each population type. Dimension names are defined for these arrays to make it easier to generate summaries and graphics of the results.

Summarize results. The file
http://www.utdallas.edu/~ammann/stat6341scripts/ConfintMCplot.r
loads the appropriate output arrays and then generates png graphics to display the results. These graphic files are saved in the Graphics subdirectory using self-describing file names. Since the methods will be compared based on miscoverage probability and interval width, plots for both of these are put on one page. Tabular summaries including standard errors also are generated and saved in the Tables subdirectory. This script uses the xtable library to generate these tables in LaTeX so they can be easily incorporated into a report.

Next: Least Squares and the Up: Simulation of random variables Previous: Demonstration of statistical or
ammann
2017-12-10