Data Set WORK.RAND
Obs | x | u |
---|---|---|
1 | 1.24067 | 0.92960 |
2 | -0.53532 | 0.20874 |
3 | -1.01394 | 0.45677 |
4 | 0.68965 | 0.27118 |
5 | -0.67680 | 0.87254 |
In this lecture, we will discuss random variable generation using a DATA step in SAS and basic simulation experiments. There are several ways to generate random variables in SAS. SAS has individual functions for generating random variables from a specific distribution such as ranuni, rannor, and ranbin. We will use the RAND function. Another way is using PROC IML (Interactive Matrix Language). We will discuss using PROC IML and a DATA step to simulate data. Some distributions that you should be familiar with are
In SAS, you can use the RAND function to in the SAS DATA step to simulate from an elementary probability distribution such as the normal, binomial, and uniform distributions. The first parameter to the RAND function is a string that gives the name of the distribution. Before we call this function, we should call the STREAMINIT function to set the seed for the random number generation. This allows for repeatable simulations. The following DATA step generates a random sample of size 100 from both the normal and uniform distributions.
Example 1. Sample 100 observations from a Normal population with mean 0 and standard deviation 1 and from the uniform distribution.
data Rand(keep=x u);
call streaminit(4321); /* set seed */
do i = 1 to 100; /* generate 100 random values */
x = rand("Normal"); /* x ~ N(0,1) */
u = rand("Uniform"); /* u ~ U(0,1) */
output;
end;
run;
proc print data=Rand(obs=5);
run;
We can do the same simulation with PROC IML.
proc iml;
call randseed(4321); /* set seed */
x = j(50, 2); /* allocate 50 x 2 matrix */
call randgen(x, "Normal"); /* fill matrix, x ~ N(0,1) */
u = randfun(100, "Uniform"); /* return 100 x 1 vector, u ~ U(0,1) */
print (x[1:5])[colname="x"];
print (u[1:5])[colname="r"];
QUIT;
To simulate from a discrete distribution, we can use the "TABLE" distribution in RAND. For the next example, we will simulate from a discrete distribution with values 1,2,3 and probabilities 0.5,0.3, and 0.2. For this example, we will label these cells as 1="Easy" 2="Specialized" and 3="Hard". Consider these as samples from a call center, where they classify calls into one of these three categories.
data Categories(keep=Type);
call streaminit(4321);
array p[3] (0.5 0.3 0.2); /*probabilities*/
do i = 1 to 100;
Type = rand("Table", of p[*]); /*
use OF operator*/
output;
end;
run;
proc format;
value Call 1='Easy' 2='Specialized' 3='Hard';
run;
proc freq data=Categories;
format Type Call.;
tables Type / nocum;
run;
proc iml;
call randseed(4321);
p = {0.5 0.3 0.2};
Type = j(100, 1); /*allocate vector*/
call randgen(Type, "Table", p); /*fill with 1,2,3*/
create categories2 from Type[colname="Type"];
append from Type;
close categories2;
quit;
PROC FREQ DATA=categories2;
tables Type / nocum;
run;
Generate a random sample of size 100 from a chi-square distribution with 7 degrees of freedom using a DATA step and using PROC IML.
In PHC 6052, we discussed the sampling distribution of $\hat{p}$ and $\bar{x}$. In the next example, we will simulate the sampling distribution of $\bar{x}$ when sampling from a uniform population. Recall, that we get the sampling distribution of $\bar{x}$ from the central limit theorem. We will simulate the distribution by doing the following:
We know from the central limit theorem that if the sample size is large enough, then the sampling distribution will be approximately normal.
/*Step 1: Generate a data set that contains many samples*/
%let N = 10; /*sample size*/
%let NumSamples = 1000; /*number of samples*/
data Sim;
call streaminit(123);
do SampleID = 1 to &NumSamples; /*ID variable for each sample*/
do i = 1 to &N;
x = rand("Uniform");
output;
end;
end;
run;
/*Step 2: Compute the mean of each sample*/
proc means data=Sim noprint;
by SampleID;
var x;
output out=OutStats mean=SampleMean;
run;
/*Step 3: Visualize and compute descriptive statistics for the ASD*/
ods select Moments Histogram;
proc univariate data=OutStats;
label SampleMean = "Sample Mean of U(0,1) Data";
var SampleMean;
histogram SampleMean / normal; /*overlay normal fit*/
run;
/* Using PROC IML*/
%let N = 10;
%let NumSamples = 1000;
proc iml;
call randseed(123);
x = j(&NumSamples,&N); /*many samples (rows), each of size N*/
call randgen(x, "Uniform"); /*1. Simulate data*/
s = x[,:]; /*2. Compute statistic for each row*/
Mean = mean(s); /*3. Summarize and analyze ASD*/
StdDev = std(s);
call qntl(q, s, {0.05 0.95});
print Mean StdDev (q`)[colname={"5th Pctl" "95th Pctl"}];
create samplemean from mean[colname="samplemean"];
append from mean;
close samplemean;
QUIT;
ods select Moments Histogram;
proc univariate data=OutStats;
label SampleMean = "Sample Mean of U(0,1) Data";
var SampleMean;
histogram SampleMean / normal; /*overlay normal fit*/
run;
Simulate the sampling distribution of $\hat{p}$, when sampling from a binary population with $p=0.1$. To do this generate 1000 samples of size 20, calculate p-hat for each sample (this is just the sample mean of the resulting 0's and 1's), and plot a histogram of the resulting $\hat{p}$.
In the next example, we will simulate data from a linear regression model. We will use the following regression model:
$$Y_i=1 + X_i/2 +Z_i/3 +\varepsilon_i$$where $\varepsilon_i\overset{i.i.d.}{\sim} N(0,1)$. Normally X and Z will be observed or set for an experiment. Here we will simulate the explanatory variables X and Z. For convenience, we will simulate X and Z from two independent standard normal distributions.
%let N = 50; /* sample size */
data Explanatory(keep=x z);
call streaminit(12345);
do i = 1 to &N;
x = rand("Normal");
z = rand("Normal");
output;
end;
run;
/* Simulate multiple samples from a regression model */
%let NumSamples = 1000; /* number of samples */
data RegSim(drop=eta rmse);
call streaminit(123);
rmse = 1; /* scale of error term */
set Explanatory; /* implicit loop over obs*/
ObsNum = _N_; /* observation number */
eta = 1 + X/2 + Z/3; /* linear predictor */
do SampleID = 1 to &NumSamples;
Y = eta + rand("Normal", 0, rmse); /* random error term */
output;
end;
run;
PROC PRINT DATA = RegSim (OBS=5);
RUN;
proc sort data=RegSim; /* sort for BY-group processing */
by SampleID ObsNum;
run;
proc reg data=RegSim outest=OutEst NOPRINT;
by SampleID;
model y = x z;
quit;
PROC PRINT DATA=OutEst (OBS=5);
RUN;
proc means nolabels data=OutEst Mean Std P5 P95;
var Intercept x z;
run;
ODS SELECT Univariate.x.Histogram.Histogram Univariate.z.Histogram.Histogram Univariate.intercept.Histogram.Histogram;
PROC UNIVARIATE DATA=OutEst;
VAR Intercept X Z;
HISTOGRAM Intercept /NORMAL;
HISTOGRAM X / NORMAL;
HISTOGRAM Z / NORMAL;
RUN;
ODS SELECT ALL;
For the next example, we will simulate a power curve for the F test of $$H_0: \beta_Z=0\;vs\;H_1:\beta_Z\neq 0$$ for different true values of $\beta_Z$.
%let N = 50; /* sample size */
%let NumSamples = 1000; /* number of samples */
data PowerSim(drop=eta);
call streaminit(1);
set Explanatory; /* fixed values for x and z */
do gamma = 0 to 0.6 by 0.05;
eta = 1 + x/2 + gamma*z; /* linear predictor */
do SampleID = 1 to &NumSamples;
y = eta + rand("Normal");
output;
end;
end;
run;
proc sort data=PowerSim;
by gamma SampleID;
run;
ODS SELECT NONE;
proc reg data=PowerSim;
by gamma SampleID;
model y = x z;
test z=0;
ods output TestANOVA=TestAnova(where=(Source="Numerator"));
quit;
ODS SELECT ALL;
proc print data=TestANOVA(firstobs=13 obs=17);
var gamma SampleID FValue ProbF;
run;
/* Construct an indicator variable for observations that reject H0 */
data Results;
set TestANOVA;
Reject = (ProbF <= 0.05); /* indicator variable */
run;
/* compute proportion of times H0 is rejected for each gamma value;
compute upper/lower confidence intervals for the proportion */
proc freq data=Results noprint;
by gamma;
tables Reject / nocum binomial(level='1');
output out=Signif binomial;
run;
PROC PRINT DATA=Signif (OBS=5);
RUN;
/* plot the proportion versus the magnitude of gamma */
title "Power of F Test for gamma=0";
title2 "N = &N";
proc sgplot data=Signif noautolegend;
scatter x=gamma y=_BIN_ / yerrorlower=L_Bin yerrorupper=U_Bin;
series x=gamma y=_BIN_;
yaxis values=(0 to 1 by 0.1) grid
label="Power (1 - P[Type II Error])";
xaxis label="Gamma" grid;
run;