Monte Carlo simulation has become an important tool for analyses of models and comparison of statistical methods. This section discusses some practical issues that should be considered when implementing a monte carlo study to compare methods.

- Decide which methods should be compared. These methods should all be applicable
to the same data. If a method has tuning parameters, then you must consider how the
tuning parameters are determined for a particular data set. In some cases, a tuning
parameter may be derived from statistical theory. For example, the MAD estimator for
scale is given by

where*MD*is the median of the data. The value for the tuning parameter*k=1.4826*is chosen to make this an unbiased estimator of the standard deviation if the data is normally distributed. Another estimator is based on the inter-quartile range, scaled to make it unbiased for normal distributions,

However, some methods such as predictive models may have tuning parameters that are not determined by theory. In such cases optimal values may be obtained by k-fold cross-validation. This requires a measure of optimality such as sum of squared errors. The sample is randomly partitioned into*k*groups and a grid of values for the tuning parameter is specified. For each value of the tuning parameter, the model is fit using the data with group*j*removed and then the fitted model is applied to group*j*. The measure of optimality is obtained from the fitted values in group*j*and this is repeated for each group. The optimality measures are then averaged to obtain an estimate of optimality for the corresponding value of the tuning parameter. This is repeated for each value of the tuning parameter and the value is chosen that has the highest optimality. Obviously the use of k-fold cross-validation within a large monte carlo study will add considerable computation time. An alternative is to treat each value of the tuning parameter as a separate method. This approach makes possible an analysis of the sensitivity of the method to the tuning parameter. - Determine what population distributions should be considered for the comparison.
In practice it is often the case that populations are not normally distributed, so
you must consider what characteristics of populations are important to include as
part of the study. Do we only wish to compare methods for normally distributed
populations or do we also wish to consider the effects of asymmetry, heavy-tails,
or contamination on the methods?
- Use a fixed seed and a small number of replications to begin code writing so the
results can be checked for correctness. Once you are satisfied that your code is
working correctly, generate data for all replications of a particular population
distribution and save the data using the
**R**function`save()`. All methods should be applied to the same data. If you decide later to include other methods in your comporison, you will be able to apply the new methods to the same data used for all of the previous methods withou rerunning the entire monte carlo study. - In addition to averaging across replications to estimate mean values of the measure
of interest, you should also obtain standard errors associated with monte carlo estimates.
This can be treated as a sample size determination problem. For example, if you are estimating
a proportion such as the proportion of times a confidence interval covers a specified parameter
value, then the s.e. would be

where*p*is the average proportion and*N*is the number of replications. Examination standard error can show whether or not the number of replications is sufficient to compare methods with the required level of accuracy. - Try to avoid looping over replications of a particular population distribution if
possible.
**R**functions`apply, tapply, sapply, vapply`can be helpful for this purpose. - If you must loop over a set of parameter values, it can be helpful to construct
functions that perform some of the complex computations within a loop so that
memory usage won't increase unnecessarily.
**R**generally releases memory required for those computations when a function returns. Avoid making multiple references to the same elements of an array inside a loop because that causes multiple copies of the entire array to be made, potentially increasing execution times significantly. - Think about how you want to display results of the study so you can organize the output of your code efficiently. Consider defining your own functions to give the same list of output components from different methods.

**Example**. Consider the problem of estimating a location parameter of some population.
There are a number of methods to obtain confidence intervals, the most commonly used of which is
based on the t-distribution. However, it is well-known that the sample mean and s.d. are not
robust, so alternatives have been proposed that purport to be more robust. This study will compare
8 methods.

**stdT**.- Standard confidence interval for a population mean,

**IQR**.- Replace mean with median and s.d. with scaled IQR. Use same t-value as
standard.
**MAD**.- Replace mean with median and s.d. with scaled MAD. Use same t-value as
standard.
**Gini**.- Replace mean with median and s.d. with scaled sum of Gini differences. Use
same t-value as standard.
**Gini1**.- Replace mean with median and s.d. with scaled lower quartile of Gini
differences. Use same t-value as standard.
**Huber**.- Replace sample mean with Huber M-estimator of location and s.d. with MAD
scale. Use same t-value as standard.
**Hubers**.- Replace sample mean and s.d. with Huber M-estimators of location and
scale. Use same t-value as standard.
**Wilcoxon**.- Use Hodges-Lehmann estimator associated with the Wilcoxon signed rank test to obtain confidence interval.

The tasks for this project can be partitioned into four parts.

**Define parameters and functions**. All of the parameters needed for this simulation project
are defined in one file:

http://www.utdallas.edu/~ammann/stat6341scripts/Parms.r

All other scripts use these parameters and attributes of the objects defined here. For example,
if the number of methods is needed, then use

nmeths = length(meths)instead of the literal

To unify the interface to the confidence interval methods, a separate function is defined for
each method. These functions all have the same main arguments and the same format for returning a
list with four components: mean miscoverage probability, s.e. of mean miscoverage probability,
mean interval width, and s.e. of mean interval width. Since some methods are designed to estimate
the mean and others the median, these functions include optional arguments for the population mean
and median, but a particular method just uses the appropriate location parameter. After running the
scripts initially, I decided to add a pseudomedian argument for the Wilcoxon confidence interval
since that is what this method estimates for asymmetric distributions. However, this parameter is
used as the reference only if the *pmed1* argument is not null. I included this option for
*huber* and *hubers* just for experimentation since those two methods actually give
M-estimates of location, not the median or pseudomedian. By organizing the code in this way, I
only needed to make these changes in the parameter file and then make any needed modifications
to the function call in the other scripts.

This file needs to be sourced just once before using the other scripts for the project. All of the
objects defined here, including the function definitions, are saved in the file
*Data/Parms.RData*. Note that this file also sources the script described below that defines
a function to generate contaminated normal data. All other scripts access these objects by putting
the line

load("Data/Parms.RData")at the beginning of the other scripts. The subdirectory

**Generate data**. The distributions used for this study are normal, t-distribution with 3
d.f., asymmetrically contaminated normal, symmetrically contaminated normal, and gamma. **R**
has functions defined for the normal, t, and gamma distributions, so to unify the code, I wrote a
new function that generates data for contaminated normal. That code is here:

http://www.utdallas.edu/~ammann/stat6341scripts/rnormcontam.r

and is sourced by the *Parms.r* script.
**Notes**:

- The number of observations that are contaminated in each simulated sample is a random
variable, not a fixed number. This corresponds to the probability model

where is a Bernoulli r.v. with success probability and are independent. This is not the same as replacing a fixed proportion of observations in each sample. - The code to identify specific rows in each sample that are replaced by the contaminated values is vectorized so there is no looping over replications to accomplish this task.
- After writing and using this function, an option was added to use the Gamma distribution as
the contaminating distribution. This was done by adding the optional argument
*useGamma*with default value*FALSE*so that code that used the original function could use the modified function without needing any changes. This script is here:

http://www.utdallas.edu/~ammann/stat6341scripts/rnormcontam1.r

All of the generated data objects are saved in the *Data* subdirectory using self-describing
file names generated by the *paste()* function. These objects are accessed with the
*load()* function. For any monte carlo study all methods must be applied to the same data,
so if any methods are modified or new methods added, these can be applied without rerunning
everything. The code for data generation is here:

http://www.utdallas.edu/~ammann/stat6341scripts/ConfintMCdata.r

**Apply methods**. To avoid loops over the methods within this script, *Parms.r*
defines a function *All.ci* for this purpose. That avoids memory buildup since a function
releases memory needed within the function once it returns. The script

http://www.utdallas.edu/~ammann/stat6341scripts/ConfintMC.r

performs the tasks needed here. This script collects each output component (mean miscoverage
probability, etc.) within separate arrays for each type of population in the study. These output
arrays are saved in the *Out* subdirectory using self-describing file names for each
population type. Dimension names are defined for these arrays to make it easier to generate
summaries and graphics of the results.

**Summarize results**. The file

http://www.utdallas.edu/~ammann/stat6341scripts/ConfintMCplot.r

loads the appropriate output arrays and then generates *png* graphics to display the results.
These graphic files are saved in the *Graphics* subdirectory using self-describing file
names. Since the methods will be compared based on miscoverage probability and interval width,
plots for both of these are put on one page. Tabular summaries including standard errors also are
generated and saved in the *Tables* subdirectory. This script uses the *xtable*
library to generate these tables in *LaTeX* so they can be easily incorporated into a report.

2017-12-10