SAS Simulations¶

In this lecture, we will discuss random variable generation using a DATA step in SAS and basic simulation experiments. There are several ways to generate random variables in SAS. SAS has individual functions for generating random variables from a specific distribution such as ranuni, rannor, and ranbin. We will use the RAND function. Another way is using PROC IML (Interactive Matrix Language). We will discuss using PROC IML and a DATA step to simulate data. Some distributions that you should be familiar with are

Uniform
Binomial
Chi-square
F
T
Normal
Gamma
Exponential
Beta
Poisson
Weibull

Using the DATA Step to Generate Random Data¶

In SAS, you can use the RAND function to in the SAS DATA step to simulate from an elementary probability distribution such as the normal, binomial, and uniform distributions. The first parameter to the RAND function is a string that gives the name of the distribution. Before we call this function, we should call the STREAMINIT function to set the seed for the random number generation. This allows for repeatable simulations. The following DATA step generates a random sample of size 100 from both the normal and uniform distributions.

Example 1. Sample 100 observations from a Normal population with mean 0 and standard deviation 1 and from the uniform distribution.

data Rand(keep=x u);
call streaminit(4321); /* set seed */
do i = 1 to 100; /* generate 100 random values */
x = rand("Normal"); /* x ~ N(0,1) */
u = rand("Uniform"); /* u ~ U(0,1) */
output;
end;
run;

proc print data=Rand(obs=5);
run;

SAS Connection established. Subprocess id is 8988

We can do the same simulation with PROC IML.

proc iml;
call randseed(4321); /* set seed */
x = j(50, 2); /* allocate 50 x 2 matrix */
call randgen(x, "Normal"); /* fill matrix, x ~ N(0,1) */
u = randfun(100, "Uniform"); /* return 100 x 1 vector, u ~ U(0,1) */

print (x[1:5])[colname="x"];
print (u[1:5])[colname="r"];
QUIT;

To simulate from a discrete distribution, we can use the "TABLE" distribution in RAND. For the next example, we will simulate from a discrete distribution with values 1,2,3 and probabilities 0.5,0.3, and 0.2. For this example, we will label these cells as 1="Easy" 2="Specialized" and 3="Hard". Consider these as samples from a call center, where they classify calls into one of these three categories.

data Categories(keep=Type);
call streaminit(4321);
array p[3] (0.5 0.3 0.2);      /*probabilities*/
do i = 1 to 100;
Type = rand("Table", of p[*]); /*
use OF operator*/
output;
end;
run;

proc format;
value Call  1='Easy' 2='Specialized' 3='Hard';
run;

proc freq data=Categories;
format Type Call.;
tables Type / nocum;
run;

proc iml;
call randseed(4321);
p = {0.5 0.3 0.2};
Type = j(100, 1);                  /*allocate vector*/
call randgen(Type, "Table", p);    /*fill with 1,2,3*/
create categories2 from Type[colname="Type"];
append from Type;
close categories2;
quit;

PROC FREQ DATA=categories2;
tables Type / nocum;
run;

Challenge 1¶

Generate a random sample of size 100 from a chi-square distribution with 7 degrees of freedom using a DATA step and using PROC IML.

Simulating a Sampling Distribution¶

In PHC 6052, we discussed the sampling distribution of $\hat{p}$ and $\bar{x}$. In the next example, we will simulate the sampling distribution of $\bar{x}$ when sampling from a uniform population. Recall, that we get the sampling distribution of $\bar{x}$ from the central limit theorem. We will simulate the distribution by doing the following:

Generate a data set that contains many samples of size n (in this example n=10)
Compute the mean for each sample
Create a histogram of all the $\bar{x}$'s generated to see the (approximate) sampling distribution.

We know from the central limit theorem that if the sample size is large enough, then the sampling distribution will be approximately normal.

/*Step 1: Generate a data set that contains many samples*/
%let N = 10;                      /*sample size*/
%let NumSamples = 1000;           /*number of samples*/

data Sim;
call streaminit(123);
do SampleID = 1 to &NumSamples;   /*ID variable for each sample*/
    do i = 1 to &N;
        x = rand("Uniform");
        output;
    end;
end;
run;

/*Step 2: Compute the mean of each sample*/
proc means data=Sim noprint;
by SampleID;
var x;
output out=OutStats mean=SampleMean;
run;

/*Step 3: Visualize and compute descriptive statistics for the ASD*/
ods select Moments Histogram;
proc univariate data=OutStats;
label SampleMean = "Sample Mean of U(0,1) Data";
var SampleMean;
histogram SampleMean / normal;      /*overlay normal fit*/
run;

/* Using PROC IML*/
%let N = 10;
%let NumSamples = 1000;

proc iml;
call randseed(123);
x = j(&NumSamples,&N);       /*many samples (rows), each of size N*/
call randgen(x, "Uniform");  /*1. Simulate data*/
s = x[,:];                   /*2. Compute statistic for each row*/
Mean = mean(s);              /*3. Summarize and analyze ASD*/
StdDev = std(s);
call qntl(q, s, {0.05 0.95});
print Mean StdDev (q`)[colname={"5th Pctl" "95th Pctl"}];

create samplemean from mean[colname="samplemean"];
append from mean;
close samplemean;
QUIT;

ods select Moments Histogram;
proc univariate data=OutStats;
label SampleMean = "Sample Mean of U(0,1) Data";
var SampleMean;
histogram SampleMean / normal;      /*overlay normal fit*/
run;

Challenge 2¶

Simulate the sampling distribution of $\hat{p}$, when sampling from a binary population with $p=0.1$. To do this generate 1000 samples of size 20, calculate p-hat for each sample (this is just the sample mean of the resulting 0's and 1's), and plot a histogram of the resulting $\hat{p}$.

Simulation for Regression Estimates¶

In the next example, we will simulate data from a linear regression model. We will use the following regression model:

$$Y_i=1 + X_i/2 +Z_i/3 +\varepsilon_i$$

where $\varepsilon_i\overset{i.i.d.}{\sim} N(0,1)$. Normally X and Z will be observed or set for an experiment. Here we will simulate the explanatory variables X and Z. For convenience, we will simulate X and Z from two independent standard normal distributions.

%let N = 50; /* sample size */

data Explanatory(keep=x z);
    call streaminit(12345);
    do i = 1 to &N;
        x = rand("Normal");
        z = rand("Normal");
        output;
    end;
run;

/* Simulate multiple samples from a regression model */
%let NumSamples = 1000; /* number of samples */

data RegSim(drop=eta rmse);
    call streaminit(123);
    rmse = 1; /* scale of error term */
    set Explanatory; /* implicit loop over obs*/
    ObsNum = _N_; /* observation number */
    eta = 1 + X/2 + Z/3; /* linear predictor */
    do SampleID = 1 to &NumSamples;
        Y = eta + rand("Normal", 0, rmse); /* random error term */
        output;
    end;
run;

PROC PRINT DATA = RegSim (OBS=5);
RUN;

proc sort data=RegSim; /* sort for BY-group processing */
by SampleID ObsNum;
run;

proc reg data=RegSim outest=OutEst NOPRINT;
by SampleID;
model y = x z;
quit;

PROC PRINT DATA=OutEst (OBS=5);
RUN;

proc means nolabels data=OutEst Mean Std P5 P95;
var Intercept x z;
run;

ODS SELECT Univariate.x.Histogram.Histogram Univariate.z.Histogram.Histogram Univariate.intercept.Histogram.Histogram;
PROC UNIVARIATE DATA=OutEst;
VAR Intercept X Z;
HISTOGRAM Intercept /NORMAL;
HISTOGRAM X / NORMAL;
HISTOGRAM Z / NORMAL;
RUN;
ODS SELECT ALL;

Power Simulation¶

For the next example, we will simulate a power curve for the F test of $$H_0: \beta_Z=0\;vs\;H_1:\beta_Z\neq 0$$ for different true values of $\beta_Z$.

%let N = 50; /* sample size */
%let NumSamples = 1000; /* number of samples */

data PowerSim(drop=eta);
call streaminit(1);
set Explanatory; /* fixed values for x and z */
do gamma = 0 to 0.6 by 0.05;
    eta = 1 + x/2 + gamma*z; /* linear predictor */
    do SampleID = 1 to &NumSamples;
        y = eta + rand("Normal");
        output;
    end;
end;
run;

proc sort data=PowerSim;
by gamma SampleID;
run;

ODS SELECT NONE;
proc reg data=PowerSim;
by gamma SampleID;
model y = x z;
test z=0;
ods output TestANOVA=TestAnova(where=(Source="Numerator"));
quit;
ODS SELECT ALL;

proc print data=TestANOVA(firstobs=13 obs=17);
var gamma SampleID FValue ProbF;
run;

/* Construct an indicator variable for observations that reject H0 */
data Results;
set TestANOVA;
Reject = (ProbF <= 0.05); /* indicator variable */
run;

/* compute proportion of times H0 is rejected for each gamma value;
compute upper/lower confidence intervals for the proportion */
proc freq data=Results noprint;
by gamma;
tables Reject / nocum binomial(level='1');
output out=Signif binomial;
run;

PROC PRINT DATA=Signif (OBS=5);
RUN;

/* plot the proportion versus the magnitude of gamma */
title "Power of F Test for gamma=0";
title2 "N = &N";
proc sgplot data=Signif noautolegend;
scatter x=gamma y=_BIN_ / yerrorlower=L_Bin yerrorupper=U_Bin;
series x=gamma y=_BIN_;
yaxis values=(0 to 1 by 0.1) grid
label="Power (1 - P[Type II Error])";
xaxis label="Gamma" grid;
run;

Obs	x	u
1	1.24067	0.92960
2	-0.53532	0.20874
3	-1.01394	0.45677
4	0.68965	0.27118
5	-0.67680	0.87254

Type	Frequency	Percent
Easy	48	48.00
Specialized	31	31.00
Hard	21	21.00

Type	Frequency	Percent
1	48	48.00
2	31	31.00
3	21	21.00

Moments
N	1000	Sum Weights	1000
Mean	0.50264072	Sum Observations	502.640718
Std Deviation	0.09254832	Variance	0.00856519
Skewness	-0.019496	Kurtosis	0.28029163
Uncorrected SS	261.204319	Corrected SS	8.55662721
Coeff Variation	18.4124209	Std Error Mean	0.00292663

Moments
N	1000	Sum Weights	1000
Mean	0.50264072	Sum Observations	502.640718
Std Deviation	0.09254832	Variance	0.00856519
Skewness	-0.019496	Kurtosis	0.28029163
Uncorrected SS	261.204319	Corrected SS	8.55662721
Coeff Variation	18.4124209	Std Error Mean	0.00292663

Obs	x	z	ObsNum	SampleID	Y
1	0.26423	1.07473	1	1	1.57430
2	0.26423	1.07473	1	2	1.01326
3	0.26423	1.07473	1	3	0.77788
4	0.26423	1.07473	1	4	1.30092
5	0.26423	1.07473	1	5	0.08179

Obs	SampleID	_MODEL_	_TYPE_	_DEPVAR_	_RMSE_	Intercept	x	z	Y
1	1	MODEL1	PARMS	Y	1.05422	1.13997	0.26757	0.06667	-1
2	2	MODEL1	PARMS	Y	1.03666	0.80959	0.76312	0.58866	-1
3	3	MODEL1	PARMS	Y	0.97669	1.17492	0.26234	0.34080	-1
4	4	MODEL1	PARMS	Y	0.94227	0.92717	0.52793	0.44518	-1
5	5	MODEL1	PARMS	Y	1.07785	0.85295	0.50722	0.31616	-1

Obs	SampleID	FValue	ProbF
13	13	4.45	0.0403
14	14	2.61	0.1127
15	15	0.10	0.7513
16	16	2.52	0.1192
17	17	0.17	0.6805

Obs	gamma	N	_BIN_	E_BIN	L_BIN	U_BIN	XL_BIN	XU_BIN	E0_BIN	Z_BIN	PL_BIN	PR_BIN	P2_BIN
1	0.00	1000	0.048	0.006760	0.03475	0.06125	0.03560	0.06314	0.015811	-28.5870	4.8754E-180	.	9.7508E-180
2	0.05	1000	0.069	0.008015	0.05329	0.08471	0.05408	0.08652	0.015811	-27.2588	6.5293E-164	.	1.3059E-163
3	0.10	1000	0.099	0.009445	0.08049	0.11751	0.08119	0.11921	0.015811	-25.3615	3.3581E-142	.	6.7161E-142
4	0.15	1000	0.194	0.012505	0.16949	0.21851	0.16992	0.21989	0.015811	-19.3531	9.58982E-84	.	1.91796E-83
5	0.20	1000	0.283	0.014245	0.25508	0.31092	0.25525	0.31203	0.015811	-13.7243	3.63211E-43	.	7.26422E-43

x
1.2406739
-0.535325
-1.01394
0.6896477
-0.324583

r
0.8755548
0.1179871
0.7075341
0.7351896
0.2623021

SAS Simulations¶

Using the DATA Step to Generate Random Data¶

SAS Output

The PRINT Procedure

Data Set WORK.RAND

SAS Output

The IML Procedure

_TEM1002

_TEM1002

SAS Output

The FREQ Procedure

Table Type

One-Way Frequencies

SAS Output

The FREQ Procedure

Table Type

One-Way Frequencies

Challenge 1¶

Simulating a Sampling Distribution¶

SAS Output

The UNIVARIATE Procedure

SampleMean

Moments

Histogram 1

Panel 1

SAS Output

The IML Procedure

Mean_StdDev

The UNIVARIATE Procedure

SampleMean

Moments

Histogram 1

Panel 1

Challenge 2¶

Simulation for Regression Estimates¶

SAS Output

The PRINT Procedure

Data Set WORK.REGSIM

The PRINT Procedure

Data Set WORK.OUTEST

The MEANS Procedure

Summary statistics

The UNIVARIATE Procedure

Intercept

Histogram 1

Panel 1

x

Histogram 2

Panel 1

z

Histogram 3

Panel 1

Power Simulation¶

SAS Output

The PRINT Procedure

Data Set WORK.TESTANOVA

SAS Output

The PRINT Procedure

Data Set WORK.SIGNIF

The SGPLOT Procedure

The SGPlot Procedure