SAS Simulations

In this lecture, we will discuss random variable generation using a DATA step in SAS and basic simulation experiments. There are several ways to generate random variables in SAS. SAS has individual functions for generating random variables from a specific distribution such as ranuni, rannor, and ranbin. We will use the RAND function. Another way is using PROC IML (Interactive Matrix Language). We will discuss using PROC IML and a DATA step to simulate data. Some distributions that you should be familiar with are

  • Uniform
  • Binomial
  • Chi-square
  • F
  • T
  • Normal
  • Gamma
  • Exponential
  • Beta
  • Poisson
  • Weibull

Using the DATA Step to Generate Random Data

In SAS, you can use the RAND function to in the SAS DATA step to simulate from an elementary probability distribution such as the normal, binomial, and uniform distributions. The first parameter to the RAND function is a string that gives the name of the distribution. Before we call this function, we should call the STREAMINIT function to set the seed for the random number generation. This allows for repeatable simulations. The following DATA step generates a random sample of size 100 from both the normal and uniform distributions.

Example 1. Sample 100 observations from a Normal population with mean 0 and standard deviation 1 and from the uniform distribution.

In [1]:
data Rand(keep=x u);
call streaminit(4321); /* set seed */
do i = 1 to 100; /* generate 100 random values */
x = rand("Normal"); /* x ~ N(0,1) */
u = rand("Uniform"); /* u ~ U(0,1) */
output;
end;
run;

proc print data=Rand(obs=5);
run;
SAS Connection established. Subprocess id is 8988

Out[1]:
SAS Output

SAS Output

The SAS System

The PRINT Procedure

Data Set WORK.RAND

Obs x u
1 1.24067 0.92960
2 -0.53532 0.20874
3 -1.01394 0.45677
4 0.68965 0.27118
5 -0.67680 0.87254

We can do the same simulation with PROC IML.

In [7]:
proc iml;
call randseed(4321); /* set seed */
x = j(50, 2); /* allocate 50 x 2 matrix */
call randgen(x, "Normal"); /* fill matrix, x ~ N(0,1) */
u = randfun(100, "Uniform"); /* return 100 x 1 vector, u ~ U(0,1) */

print (x[1:5])[colname="x"];
print (u[1:5])[colname="r"];
QUIT;
Out[7]:
SAS Output

SAS Output

The SAS System

The IML Procedure

_TEM1002

x
1.2406739
-0.535325
-1.01394
0.6896477
-0.324583

_TEM1002

r
0.8755548
0.1179871
0.7075341
0.7351896
0.2623021

To simulate from a discrete distribution, we can use the "TABLE" distribution in RAND. For the next example, we will simulate from a discrete distribution with values 1,2,3 and probabilities 0.5,0.3, and 0.2. For this example, we will label these cells as 1="Easy" 2="Specialized" and 3="Hard". Consider these as samples from a call center, where they classify calls into one of these three categories.

In [8]:
data Categories(keep=Type);
call streaminit(4321);
array p[3] (0.5 0.3 0.2);      /*probabilities*/
do i = 1 to 100;
Type = rand("Table", of p[*]); /*
use OF operator*/
output;
end;
run;

proc format;
value Call  1='Easy' 2='Specialized' 3='Hard';
run;

proc freq data=Categories;
format Type Call.;
tables Type / nocum;
run;
Out[8]:
SAS Output

SAS Output

The SAS System

The FREQ Procedure

The FREQ Procedure

Table Type

One-Way Frequencies

Type Frequency Percent
Easy 48 48.00
Specialized 31 31.00
Hard 21 21.00
In [12]:
proc iml;
call randseed(4321);
p = {0.5 0.3 0.2};
Type = j(100, 1);                  /*allocate vector*/
call randgen(Type, "Table", p);    /*fill with 1,2,3*/
create categories2 from Type[colname="Type"];
append from Type;
close categories2;
quit;

PROC FREQ DATA=categories2;
tables Type / nocum;
run;
Out[12]:
SAS Output

SAS Output

The SAS System

The FREQ Procedure

The FREQ Procedure

Table Type

One-Way Frequencies

Type Frequency Percent
1 48 48.00
2 31 31.00
3 21 21.00

Challenge 1

Generate a random sample of size 100 from a chi-square distribution with 7 degrees of freedom using a DATA step and using PROC IML.

Simulating a Sampling Distribution

In PHC 6052, we discussed the sampling distribution of $\hat{p}$ and $\bar{x}$. In the next example, we will simulate the sampling distribution of $\bar{x}$ when sampling from a uniform population. Recall, that we get the sampling distribution of $\bar{x}$ from the central limit theorem. We will simulate the distribution by doing the following:

  1. Generate a data set that contains many samples of size n (in this example n=10)
  2. Compute the mean for each sample
  3. Create a histogram of all the $\bar{x}$'s generated to see the (approximate) sampling distribution.

We know from the central limit theorem that if the sample size is large enough, then the sampling distribution will be approximately normal.

In [13]:
/*Step 1: Generate a data set that contains many samples*/
%let N = 10;                      /*sample size*/
%let NumSamples = 1000;           /*number of samples*/

data Sim;
call streaminit(123);
do SampleID = 1 to &NumSamples;   /*ID variable for each sample*/
    do i = 1 to &N;
        x = rand("Uniform");
        output;
    end;
end;
run;

/*Step 2: Compute the mean of each sample*/
proc means data=Sim noprint;
by SampleID;
var x;
output out=OutStats mean=SampleMean;
run;

/*Step 3: Visualize and compute descriptive statistics for the ASD*/
ods select Moments Histogram;
proc univariate data=OutStats;
label SampleMean = "Sample Mean of U(0,1) Data";
var SampleMean;
histogram SampleMean / normal;      /*overlay normal fit*/
run;
Out[13]:
SAS Output

SAS Output

The SAS System

The UNIVARIATE Procedure

Variable: SampleMean (Sample Mean of U(0,1) Data)

The UNIVARIATE Procedure

SampleMean

Moments

Moments
N 1000 Sum Weights 1000
Mean 0.50264072 Sum Observations 502.640718
Std Deviation 0.09254832 Variance 0.00856519
Skewness -0.019496 Kurtosis 0.28029163
Uncorrected SS 261.204319 Corrected SS 8.55662721
Coeff Variation 18.4124209 Std Error Mean 0.00292663

The SAS System

The UNIVARIATE Procedure

Histogram 1

Panel 1

Histogram for SampleMean
In [16]:
/* Using PROC IML*/
%let N = 10;
%let NumSamples = 1000;

proc iml;
call randseed(123);
x = j(&NumSamples,&N);       /*many samples (rows), each of size N*/
call randgen(x, "Uniform");  /*1. Simulate data*/
s = x[,:];                   /*2. Compute statistic for each row*/
Mean = mean(s);              /*3. Summarize and analyze ASD*/
StdDev = std(s);
call qntl(q, s, {0.05 0.95});
print Mean StdDev (q`)[colname={"5th Pctl" "95th Pctl"}];

create samplemean from mean[colname="samplemean"];
append from mean;
close samplemean;
QUIT;

ods select Moments Histogram;
proc univariate data=OutStats;
label SampleMean = "Sample Mean of U(0,1) Data";
var SampleMean;
histogram SampleMean / normal;      /*overlay normal fit*/
run;
Out[16]:
SAS Output

SAS Output

The SAS System

The IML Procedure

Mean_StdDev

Mean StdDev 5th Pctl 95th Pctl
0.5026407 0.0925483 0.3540121 0.6588903

The SAS System

The UNIVARIATE Procedure

Variable: SampleMean (Sample Mean of U(0,1) Data)

The UNIVARIATE Procedure

SampleMean

Moments

Moments
N 1000 Sum Weights 1000
Mean 0.50264072 Sum Observations 502.640718
Std Deviation 0.09254832 Variance 0.00856519
Skewness -0.019496 Kurtosis 0.28029163
Uncorrected SS 261.204319 Corrected SS 8.55662721
Coeff Variation 18.4124209 Std Error Mean 0.00292663

The SAS System

The UNIVARIATE Procedure

Histogram 1

Panel 1

Histogram for SampleMean

Challenge 2

Simulate the sampling distribution of $\hat{p}$, when sampling from a binary population with $p=0.1$. To do this generate 1000 samples of size 20, calculate p-hat for each sample (this is just the sample mean of the resulting 0's and 1's), and plot a histogram of the resulting $\hat{p}$.

Simulation for Regression Estimates

In the next example, we will simulate data from a linear regression model. We will use the following regression model:

$$Y_i=1 + X_i/2 +Z_i/3 +\varepsilon_i$$

where $\varepsilon_i\overset{i.i.d.}{\sim} N(0,1)$. Normally X and Z will be observed or set for an experiment. Here we will simulate the explanatory variables X and Z. For convenience, we will simulate X and Z from two independent standard normal distributions.

In [14]:
%let N = 50; /* sample size */

data Explanatory(keep=x z);
    call streaminit(12345);
    do i = 1 to &N;
        x = rand("Normal");
        z = rand("Normal");
        output;
    end;
run;

/* Simulate multiple samples from a regression model */
%let NumSamples = 1000; /* number of samples */

data RegSim(drop=eta rmse);
    call streaminit(123);
    rmse = 1; /* scale of error term */
    set Explanatory; /* implicit loop over obs*/
    ObsNum = _N_; /* observation number */
    eta = 1 + X/2 + Z/3; /* linear predictor */
    do SampleID = 1 to &NumSamples;
        Y = eta + rand("Normal", 0, rmse); /* random error term */
        output;
    end;
run;

PROC PRINT DATA = RegSim (OBS=5);
RUN;

proc sort data=RegSim; /* sort for BY-group processing */
by SampleID ObsNum;
run;

proc reg data=RegSim outest=OutEst NOPRINT;
by SampleID;
model y = x z;
quit;

PROC PRINT DATA=OutEst (OBS=5);
RUN;

proc means nolabels data=OutEst Mean Std P5 P95;
var Intercept x z;
run;

ODS SELECT Univariate.x.Histogram.Histogram Univariate.z.Histogram.Histogram Univariate.intercept.Histogram.Histogram;
PROC UNIVARIATE DATA=OutEst;
VAR Intercept X Z;
HISTOGRAM Intercept /NORMAL;
HISTOGRAM X / NORMAL;
HISTOGRAM Z / NORMAL;
RUN;
ODS SELECT ALL;
Out[14]:
SAS Output

SAS Output

Power of F Test for gamma=0

N = 50

The PRINT Procedure

Data Set WORK.REGSIM

Obs x z ObsNum SampleID Y
1 0.26423 1.07473 1 1 1.57430
2 0.26423 1.07473 1 2 1.01326
3 0.26423 1.07473 1 3 0.77788
4 0.26423 1.07473 1 4 1.30092
5 0.26423 1.07473 1 5 0.08179

Power of F Test for gamma=0

N = 50

The PRINT Procedure

Data Set WORK.OUTEST

Obs SampleID _MODEL_ _TYPE_ _DEPVAR_ _RMSE_ Intercept x z Y
1 1 MODEL1 PARMS Y 1.05422 1.13997 0.26757 0.06667 -1
2 2 MODEL1 PARMS Y 1.03666 0.80959 0.76312 0.58866 -1
3 3 MODEL1 PARMS Y 0.97669 1.17492 0.26234 0.34080 -1
4 4 MODEL1 PARMS Y 0.94227 0.92717 0.52793 0.44518 -1
5 5 MODEL1 PARMS Y 1.07785 0.85295 0.50722 0.31616 -1

Power of F Test for gamma=0

N = 50

The MEANS Procedure

The MEANS Procedure

Summary statistics

Variable Mean Std Dev 5th Pctl 95th Pctl
Intercept
x
z
1.0024931
0.5077344
0.3314210
0.1497273
0.1675608
0.1360661
0.7580550
0.2229376
0.1056315
1.2551858
0.7733572
0.5544892

Power of F Test for gamma=0

N = 50

The UNIVARIATE Procedure

The UNIVARIATE Procedure

Intercept

Histogram 1

Panel 1

Histogram for Intercept

Power of F Test for gamma=0

N = 50

The UNIVARIATE Procedure

x

Histogram 2

Panel 1

Histogram for x

Power of F Test for gamma=0

N = 50

The UNIVARIATE Procedure

z

Histogram 3

Panel 1

Histogram for z

Power Simulation

For the next example, we will simulate a power curve for the F test of $$H_0: \beta_Z=0\;vs\;H_1:\beta_Z\neq 0$$ for different true values of $\beta_Z$.

In [10]:
%let N = 50; /* sample size */
%let NumSamples = 1000; /* number of samples */

data PowerSim(drop=eta);
call streaminit(1);
set Explanatory; /* fixed values for x and z */
do gamma = 0 to 0.6 by 0.05;
    eta = 1 + x/2 + gamma*z; /* linear predictor */
    do SampleID = 1 to &NumSamples;
        y = eta + rand("Normal");
        output;
    end;
end;
run;

proc sort data=PowerSim;
by gamma SampleID;
run;

ODS SELECT NONE;
proc reg data=PowerSim;
by gamma SampleID;
model y = x z;
test z=0;
ods output TestANOVA=TestAnova(where=(Source="Numerator"));
quit;
ODS SELECT ALL;

proc print data=TestANOVA(firstobs=13 obs=17);
var gamma SampleID FValue ProbF;
run;
Out[10]:
SAS Output

SAS Output

The SAS System

The PRINT Procedure

Data Set WORK.TESTANOVA

Obs gamma SampleID FValue ProbF
13 0 13 4.45 0.0403
14 0 14 2.61 0.1127
15 0 15 0.10 0.7513
16 0 16 2.52 0.1192
17 0 17 0.17 0.6805
In [12]:
/* Construct an indicator variable for observations that reject H0 */
data Results;
set TestANOVA;
Reject = (ProbF <= 0.05); /* indicator variable */
run;

/* compute proportion of times H0 is rejected for each gamma value;
compute upper/lower confidence intervals for the proportion */
proc freq data=Results noprint;
by gamma;
tables Reject / nocum binomial(level='1');
output out=Signif binomial;
run;

PROC PRINT DATA=Signif (OBS=5);
RUN;

/* plot the proportion versus the magnitude of gamma */
title "Power of F Test for gamma=0";
title2 "N = &N";
proc sgplot data=Signif noautolegend;
scatter x=gamma y=_BIN_ / yerrorlower=L_Bin yerrorupper=U_Bin;
series x=gamma y=_BIN_;
yaxis values=(0 to 1 by 0.1) grid
label="Power (1 - P[Type II Error])";
xaxis label="Gamma" grid;
run;
Out[12]:
SAS Output

SAS Output

Power of F Test for gamma=0

N = 50

The PRINT Procedure

Data Set WORK.SIGNIF

Obs gamma N _BIN_ E_BIN L_BIN U_BIN XL_BIN XU_BIN E0_BIN Z_BIN PL_BIN PR_BIN P2_BIN
1 0.00 1000 0.048 0.006760 0.03475 0.06125 0.03560 0.06314 0.015811 -28.5870 4.8754E-180 . 9.7508E-180
2 0.05 1000 0.069 0.008015 0.05329 0.08471 0.05408 0.08652 0.015811 -27.2588 6.5293E-164 . 1.3059E-163
3 0.10 1000 0.099 0.009445 0.08049 0.11751 0.08119 0.11921 0.015811 -25.3615 3.3581E-142 . 6.7161E-142
4 0.15 1000 0.194 0.012505 0.16949 0.21851 0.16992 0.21989 0.015811 -19.3531 9.58982E-84 . 1.91796E-83
5 0.20 1000 0.283 0.014245 0.25508 0.31092 0.25525 0.31203 0.015811 -13.7243 3.63211E-43 . 7.26422E-43

The SGPLOT Procedure

The SGPlot Procedure

The SGPlot Procedure
In [ ]: