{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Simulations in SAS\n", "\n", "In this lesson, we'll investigate ways to use some of the random number generators available in SAS:\n", "\n", "* to select a random sample of observations from a larger data set\n", "* to generate a scheme for assigning treatments to subjects in a randomized, controlled experiment\n", "* to generate (i.e., \"simulate\") numbers that follow some underlying probability distribution\n", "\n", "The random generator functions used in the lesson include:\n", "\n", "* **ranuni**, which generates a number from a uniform (0, 1) distribution\n", "* **rannor**, which generates a number from a standard normal distribution\n", "* **ranbin**, which generates a number from a binomial (n, p) distribution\n", "* **ranpoi**, which generates a number from a poisson distribution with mean m.\n", "\n", "Other random functions available in SAS, but not illustrated in this lesson include **rancau** (a Cauchy distribution), **rangam** (a Gamma distribution), **rantri** (a triangular distribution), **ranexp** (an exponential distribution), and **rantbl** (a discrete distribution with user-specified probabilities).\n", "\n", "In accomplishing the goals of the lesson, we'll primarily take advantage of the tools that are available to us in a SAS data step, such as do loops, if-then-else statements, retain, and output statements. By taking such an approach, we have not only the opportunity to review and put into practice these useful data step techniques, but also the opportunity to better understand the processes of random sampling, random assignment and simulation. In the case of random sampling, however, in addition to using the data step, we will also use the SURVEYSELECT procedure just so that you are aware of its basic functionality for your future use. 
Due to time constraints of the course and the complexity of the PLAN procedure, we will not use it to accomplish any of our random assignments. You should be aware, however, of its existence should you want to explore it on your own in the future.\n", "\n", "## Random Sampling without Replacement\n", "\n", "Randomly selecting records from a large data set may be helpful if your data set is so large as to prevent or slow processing, or if one is conducting a survey and needs to select a random sample from some master database. When you select records randomly from a larger data set (or some master database), you can achieve the sampling in a few different ways, including:\n", "\n", "* **sampling without replacement**, in which a subset of the observations are selected randomly, and once an observation is selected it cannot be selected again.\n", "* **sampling with replacement**, in which a subset of observations are selected randomly, and an observation may be selected more than once.\n", "* **selecting a stratified sample**, in which a subset of observations are selected randomly from each group of the observations defined by the value of a stratifying variable, and once an observation is selected it cannot be selected again.\n", "\n", "In this section, we'll investigate sampling without replacement. Then, in the next two sections, we'll investigate sampling with replacement and selecting a stratified sample. Throughout the three sections we'll work with a contrived mailing list. We'll use the list under the guise of being a large catalog mail-order company wanting to conduct a random survey of a subset of our customers. The actual list we'll use is admittedly (much) smaller than what we would be working with in practice. Our teeny-tiny mailing list is, of course, used merely for the purpose of illustrating some random sampling techniques in SAS.\n", "\n", "The mailing list with which we will be working is contained in a permanent SAS data set called mailing. 
The following SAS code simply prints the mailing list:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Sample Dataset: Mailing List

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ObsNumNameStreetCityState
11Jonathon Smothers103 Oak LaneBellefontePA
22Jane Doe845 Main StreetBellefontePA
33Jim Jefferson10101 Allegheny StreetBellefontePA
44Mark Adams312 Oak LaneBellefontePA
55Lisa Brothers89 Elm StreetBellefontePA
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME phc6089 '/folders/myfolders/SAS_Notes/data';\n", " \n", "PROC PRINT data=phc6089.mailing(obs=5);\n", " title 'Sample Dataset: Mailing List';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mailing data set, mailing.sas7bdat, can be found in the data folder available on the course website. Be sure to edit the LIBNAME statement so that it reflects the location in which you saved the data set. Run the program and review the resulting output in order to familiarize yourself with the data set.\n", "\n", "When using a computer program, such as SAS, to randomly select a subset of observations from some larger data set, there are two approaches we can take. We could tell SAS to randomly select a percentage, say 30%, of the observations in the data set. Or, we could tell SAS to randomly select an exact number, say 25, of the observations in the data set. With the former approach, we cannot be guaranteed that the subset data set will achieve a specific size. We therefore call such a sample an \"approximate-sized sample.\" In general, to obtain an approximate-sized sample, one selects k% of the observations from the original data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following program illustrates how to use a SAS data step to obtain an approximate-sized random sample without replacement. Specifically, the program uses the ranuni function and a WHERE= data set option to tell SAS to randomly sample approximately 30% of the 50 observations from the permanent SAS data set mailing:

\n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Sample1A: Approximate-Sized Simple Random Sample

\n", "

without Replacement

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
NumNameStreetCityStaterandom
1Jonathon Smothers103 Oak LaneBellefontePA0.07478
2Jane Doe845 Main StreetBellefontePA0.25203
4Mark Adams312 Oak LaneBellefontePA0.08918
6Delilah Fequa2094 Acorn StreetBellefontePA0.02253
7John Doe812 Main StreetBellefontePA0.15570
8Mamie Davison102 Cherry AvenueBellefontePA0.05460
9Ernest Smith492 Main StreetBellefontePA0.05662
14William Edwards79 Oak LaneBellefontePA0.15432
38Miriam Denders2348 Robin AvenuePort MatildaPA0.16192
41Lou Barr219 Eagle StreetPort MatildaPA0.13033
43Leslie Olin487 Bluebird HavenPort MatildaPA0.23101
44Edwin Hoch389 Dolphin DrivePort MatildaPA0.20708
49Tim Winters95 Dove StreetPort MatildaPA0.03722
20Kristin Jones120 Stratford DriveState CollegePA0.29425
22Roberta Kudla312 Whitehall RoadState CollegePA0.05187
24Mark Mendel256 Fraser StreetState CollegePA0.06246
26Jan Davison201 E. Beaver AvenueState CollegePA0.00799
31Robert Williams156 Straford DriveState CollegePA0.14537
34Mike Dahlberg1201 No. AthertonState CollegePA0.27246
35Doris Alcorn453 Fraser StreetState CollegePA0.24231
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA sample1A (where = (random le 0.30));\n", " set phc6089.mailing;\n", " random = ranuni(43420);\n", "RUN;\n", " \n", "PROC PRINT data=sample1A NOOBS;\n", " title1 'Sample1A: Approximate-Sized Simple Random Sample';\n", " title2 'without Replacement';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Then, review the resulting output to see the random sample that SAS selected from the mailing data set. You should note a couple of things. First, the people who appear in the random sample appear to be fairly uniformly distributed across the 50 possible Num values. Also, the final random sample contains 20 of the 50 observations in the mailing data set. At 40% (20 out of 50), this is a little higher than the 30% sample we were asking for, but it should not be surprising, as it is an artifact of the method used. Finally, note that the variable random contains only values that are no larger than 0.30, as should be expected in light of the WHERE= option attached to the DATA statement.

\n", "

Okay, now how does the program work? Before we answer the question, note that the technique we use is a technique commonly used by statisticians. It will work in any program, not just SAS. Now, for the answer ... the random assignment statement tells SAS to use the ranuni function to generate a (pseudo) random number between 0 and 1 and to assign the resulting number to a variable called random. The number 43420 that appears in the parentheses of the ranuni function is specified by the user and is called the seed. In general:

\n", "
    \n", "
  • The seed must be a nonnegative integer less than 2,147,483,647 (that is, 2^31 - 1).
  • \n", "
  • A given seed always produces the same results. That is, using the same seed, the ranuni function would select the same observations.
  • \n", "
  • If you choose 0 as the seed, then the computer clock time at execution is used. In this case, it is very unlikely that the ranuni function would produce the same results from one run to the next. Note that it is common practice when conducting research to use a non-zero seed, so that the results can be reproduced if necessary.
  • \n", "
  • The ranuni function can be used without assigning it to another variable. We assigned the value to the variable called random just so we could print the results.
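To see the second point in action, here is a minimal sketch that generates five uniform numbers in each of two separate DATA steps using the same seed and then compares them. The data set names (run1, run2, compare) and variable names (u1, u2, same) are made up for this illustration and are not part of the lesson's programs.

```sas
/* First stream of five uniform random numbers, seed 43420 */
DATA run1;
   do i = 1 to 5;
      u1 = ranuni(43420);
      output;
   end;
RUN;

/* Second stream, generated in a separate DATA step with the same seed */
DATA run2;
   do i = 1 to 5;
      u2 = ranuni(43420);
      output;
   end;
RUN;

/* Put the two streams side by side; same = 1 when the values agree */
DATA compare;
   merge run1 run2;
   by i;
   same = (u1 = u2);
RUN;

PROC PRINT data=compare NOOBS;
   title 'Same seed, same stream';
RUN;
```

If you change the seed in just one of the two DATA steps, you should expect the variable same to be 0 in every row.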
  • \n", "
\n", "

Now, because the numbers generated by the ranuni function are uniformly distributed across the numbers between 0 and 1, we should expect about 30% of the random numbers to be less than 0.30. That's where the WHERE= option on the DATA statement comes into play. If the random number generated is less than or equal to 0.30, then the observation is selected for inclusion in the sample. Since the mailing data set has 50 observations, about 30% of the observations should be selected to create a sample of approximately 15 people. Because the selection depends on the values of the numbers generated, the sample cannot be guaranteed to be of a certain size.

\n", "

You might want to change the seed a few times to see how it affects the sample. If you use the seed 1, for example, you'll see that the new random sample contains 15 observations, not 20 as in our first sample. You might also want to change the proportion 0.30 to various other numbers between 0 and 1 to see how it affects the size of the sample.
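Rather than changing the seed by hand, we can get a feel for how much the sample size varies by repeating the 30% selection rule many times. The sketch below does not read the mailing data set at all; it simply draws 50 uniform numbers per replicate and counts how many fall at or below 0.30. The data set name sizes, the variables rep and size, and the choice of 1000 replicates are assumptions made for this illustration.

```sas
/* Repeat the "keep if ranuni le 0.30" rule 1000 times on 50 records */
DATA sizes (keep = rep size);
   do rep = 1 to 1000;
      size = 0;                       /* number selected in this replicate */
      do i = 1 to 50;
         if ranuni(43420) le 0.30 then size = size + 1;
      end;
      output;                         /* one row per replicate */
   end;
RUN;

/* Tabulate how often each sample size occurred */
PROC FREQ data = sizes;
   tables size;
   title 'Sample sizes produced by the 30% selection rule';
RUN;
```

Because each of the 50 records is kept independently with probability 0.30, the sample size follows a binomial (50, 0.30) distribution, with mean 15 and standard deviation of about 3.2. A sample of 20, as in Sample1A, is therefore larger than average but not unusual.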

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following code illustrates an alternative way of randomly selecting an approximate-sized random sample without replacement. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample approximately 30% of the 50 observations from the permanent SAS data set mailing:

\n", "
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SURVEYSELECT Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Selection MethodSimple Random Sampling
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Input Data SetMAILING
Random Number Seed12345678
Sampling Rate0.3
Sample Size15
Selection Probability0.3
Sampling Weight3.333333
Output Data SetSAMPLE1B
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Sample1B: Approximate-Sized Simple Random Sample

\n", "

without Replacement (using PROC SURVEYSELECT)

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
NumNameStreetCityState
1Jonathon Smothers103 Oak LaneBellefontePA
5Lisa Brothers89 Elm StreetBellefontePA
12Fran Cipolla912 Cardinal DriveBellefontePA
14William Edwards79 Oak LaneBellefontePA
38Miriam Denders2348 Robin AvenuePort MatildaPA
39Scott Fitzgerald43 Blue Jay DrivePort MatildaPA
40Jane Smiley298 Cardinal DrivePort MatildaPA
44Edwin Hoch389 Dolphin DrivePort MatildaPA
45Ann Draper72 Lake RoadPort MatildaPA
50George Matre75 Ashwind DrivePort MatildaPA
19Frank Smith238 Waupelani DriveState CollegePA
24Mark Mendel256 Fraser StreetState CollegePA
29Joe White678 S. Allen StreetState CollegePA
34Mike Dahlberg1201 No. AthertonState CollegePA
35Doris Alcorn453 Fraser StreetState CollegePA
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SURVEYSELECT data = phc6089.mailing\n", " out = sample1B\n", " method = SRS\n", " seed = 12345678\n", " samprate = 0.30;\n", " title;\n", "RUN;\n", " \n", "PROC PRINT data = sample1B NOOBS;\n", " title1 'Sample1B: Approximate-Sized Simple Random Sample';\n", " title2 'without Replacement (using PROC SURVEYSELECT)';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select a sample from the mailing data set. As you can see, the SURVEYSELECT procedure produces one page of output that is merely informational, reiterating much of the information that we supplied to SAS in our SURVEYSELECT code:

\n", "
    \n", "
  • The DATA= option tells SAS the name of the input data set (phc6089.mailing) from which observations should be selected.
  • \n", "
  • The OUT= option tells SAS the name of the output data set (sample1B) in which the selected observations should be stored.
  • \n", "
  • The METHOD= option tells SAS the sampling method that should be used. Here, SRS tells SAS to use the simple random sampling method to select observations, that is, with equal probability and without replacement.
  • \n", "
  • The SEED= option tells SAS the initial seed (12345678) for generating the random number. In general, the value of the SEED= option must be an integer, and if you do not specify the SEED= option, or if the SEED= value is negative or zero, the computer's clock is used to obtain the initial seed.
  • \n", "
  • The SAMPRATE= option tells SAS what proportion (0.30) of the input data set should be sampled.
  • \n", "
\n", "

Oh, and the empty title statement that appears in the code is there merely to minimize any confusion its absence may cause you. If it, or another TITLE statement, is not present, the first (informational) page of the SURVEYSELECT output will contain the most recent title, which in this case would concern Sample1A from the previous example. Now that would be confusing!

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thus far, we've produced only approximate-sized random samples without replacement. Now, we'll turn our attention to three examples that illustrate how to produce exact-sized random samples without replacement. We'll start (naturally?!) with the most complicated procedure first (using a DATA step) and end up with the most straightforward procedure last (using the SURVEYSELECT procedure).\n", "\n", "
\n", "

Example

\n", "

The following program illustrates how to use a SAS data step to obtain an exact-sized random sample without replacement. Specifically, the program uses the ranuni function in a DATA step to tell SAS to randomly sample exactly 15 of the 50 observations from the permanent SAS data set mailing:

\n", "
" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Sample2: Exact-Sized Simple Random Sample

\n", "

without Replacement

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
NumNamenkrandompropn
4Mark Adams47150.128290.31915
5Lisa Brothers46140.087990.30435
6Delilah Fequa45130.024460.28889
9Ernest Smith42120.012280.28571
14William Edwards37110.129080.29730
15Harold Harvey36100.031360.27778
41Lou Barr3290.112300.28125
46Linda Nicolson2780.108260.29630
16Linda Edmonds2270.262600.31818
23Greg Pope1560.170210.40000
25Steve Lindhoff1350.363750.38462
28Srabashi Kundu1040.080950.40000
33Steve Ignella530.565560.60000
34Mike Dahlberg420.354890.50000
36Daniel Fremgen210.140880.50000
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA sample2;\n", " set phc6089.mailing nobs=total;\n", " if _N_ = 1 then n=total;\n", " retain k 15 n;\n", " random = ranuni(860244);\n", " propn = k/n;\n", " if random le propn then\n", " do;\n", " output;\n", " k=k-1;\n", " end;\n", " n=n-1;\n", " if k=0 then stop;\n", "RUN;\n", " \n", "PROC PRINT data=sample2 NOOBS;\n", " title1 'Sample2: Exact-Sized Simple Random Sample';\n", " title2 'without Replacement';\n", " var num name n k random propn;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select a sample of 15 observations from the mailing data set.

\n", "

In summary, here's the approach used to select the sample:

\n", "
    \n", "
  • For each observation in the data set, generate a uniform random number.
  • \n", "
  • Select the first observation in the original data set for inclusion in the sample if its random number is less than or equal to the proportion of records needed (15 of 50, or 0.30).
  • \n", "
  • Modify the proportion still needed in the sample. Here, it is 14/49 if the first observation was selected for the sample; and it is 15/49 if it was not. If the random number generated for the second observation is less than or equal to this proportion, include it in the sample.
  • \n", "
  • Continue this process until you have selected exactly 15 observations.
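It may not be obvious that adjusting the proportion in this way still gives every observation the same 0.30 chance of ending up in the sample. As a quick check, consider the second observation. It is selected with probability 14/49 if the first observation was selected (which happens with probability 15/50) and with probability 15/49 if it was not (which happens with probability 35/50):

$$
P(\text{second observation selected}) = \frac{15}{50}\cdot\frac{14}{49} + \frac{35}{50}\cdot\frac{15}{49} = \frac{210 + 525}{2450} = \frac{735}{2450} = 0.30
$$

The same kind of argument can be carried forward to the later observations.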
  • \n", "
\n", "

Now, how to accomplish this approach using the SAS DATA step? Here's how we did it step-by-step:

\n", "

k = the number of observations needed to complete the sample.

\n", "

n = the number of observations left to read from the original data set.

\n", "
    \n", "
  • Define two variables, k and n, with the meanings given just above.
  • \n", "
  • Using the NOBS= option of the SET statement, determine the number of observations in the phc6089.mailing data set and assign the value to a variable called total. In general, the NOBS= option creates and names a temporary variable whose value is the total number of observations in the data set specified in the SET statement.
  • \n", "
  • For the first observation, that is, when the automatic variable _N_ equals 1, set the variable n to the value of the variable total (here, 50). (Recall that automatic variables are created automatically by the DATA step, are added to the program data vector, but are not output to the data set being created. The values of automatic variables are retained from one iteration of the DATA step to the next, rather than set to missing. The automatic variable _N_ is initially set to 1. Each time the DATA step loops past the DATA statement, the variable _N_ increments by 1. The value of _N_ represents the number of times the DATA step has iterated, and often equals the number of observations in the output data set.)
  • \n", "
  • Using the RETAIN statement, initialize k to 15, the number of observations desired in the final sample.
  • \n", "
  • Use the ranuni function (starting with a seed of 860244) to generate a uniform random number between 0 and 1. Use k and n to determine the proportion of observations that still needs to be selected from the mailing data set.
  • \n", "
  • If the random number generated is less than or equal to the proportion of observations still needed, then OUTPUT the observation to the output data set. When an observation is selected, reduce the number of observations still needed in the sample by 1 (that is, k = k-1).
  • \n", "
  • At the end of each iteration of the DATA step:
  • \n", "
      \n", "
    • reduce the number of observations left in the mailing data set by 1 (n = n - 1)
    • \n", "
    • determine if the sample is complete (is k = 0?). If yes, tell SAS to STOP. In general, the STOP statement tells SAS to stop processing the current DATA step immediately and resume processing statements after the end of the current DATA step.
    • \n", "
    \n", "
\n", "

Note that the random = ranuni(860244) and propn = k/n assignments are made here only so that their values can be printed. In another situation, these values would be incorporated directly in the IF statement, as in: if ranuni(860244) le k/n then do; Additionally, k and n could be dropped from the output data set, but they are kept here so that their values can be printed for educational purposes.
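Putting that note into practice, a leaner version of the program might look like the following sketch. It assumes the phc6089 libref has already been assigned as in the earlier code; the output data set name sample2_compact is made up for this illustration.

```sas
/* Same selection logic as Sample2, without the helper variables  */
/* random and propn; k and n are dropped from the output data set */
DATA sample2_compact (drop = k n);
   set phc6089.mailing nobs=total;
   if _N_ = 1 then n = total;        /* n = records left to read       */
   retain k 15 n;                    /* k = records still needed       */
   if ranuni(860244) le k/n then
      do;
         output;                     /* keep this record               */
         k = k - 1;
      end;
   n = n - 1;
   if k = 0 then stop;               /* sample complete                */
RUN;

PROC PRINT data=sample2_compact NOOBS;
   title1 'Exact-Sized Simple Random Sample';
   title2 'without Replacement (compact version)';
RUN;
```

Because the seed and the order of the ranuni calls are unchanged, this version should select the same 15 people as Sample2.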

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following code illustrates an alternative way of using a DATA step to randomly select an exact-sized random sample without replacement. The code, while less efficient (it requires that the data set be processed twice and sorted once), may feel more natural and intuitive to you:

\n", "
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Sample3A: Exact-Sized Simple Random Sample

\n", "

without Replacement

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ObsNumNameStreetCityStaterandom
19Ernest Smith492 Main StreetBellefontePA0.01228
26Delilah Fequa2094 Acorn StreetBellefontePA0.02446
315Harold Harvey480 Main StreetBellefontePA0.03136
428Srabashi Kundu112 E. Beaver AvenueState CollegePA0.08095
55Lisa Brothers89 Elm StreetBellefontePA0.08799
646Linda Nicolson71 Liberty TerracePort MatildaPA0.10826
741Lou Barr219 Eagle StreetPort MatildaPA0.11230
84Mark Adams312 Oak LaneBellefontePA0.12829
914William Edwards79 Oak LaneBellefontePA0.12908
1036Daniel Fremgen103 W. College AvenueState CollegePA0.14088
1123Greg Pope5100 No. AthertonState CollegePA0.17021
1216Linda Edmonds410 College AvenueState CollegePA0.26260
1338Miriam Denders2348 Robin AvenuePort MatildaPA0.32450
1413James Whitney104 Pine Hill DriveBellefontePA0.33555
1534Mike Dahlberg1201 No. AthertonState CollegePA0.35489
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA sample3A;\n", " set phc6089.mailing;\n", " random=ranuni(860244);\n", "RUN;\n", " \n", "PROC SORT data=sample3A;\n", " by random;\n", "RUN;\n", " \n", "DATA sample3A;\n", " set sample3A;\n", " if _N_ le 15;\n", "RUN;\n", " \n", "PROC PRINT data=sample3A;\n", " title1 'Sample3A: Exact-Sized Simple Random Sample';\n", " title2 'without Replacement';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select a sample of 15 observations from the mailing data set.

\n", "

The approach used is very similar to the approach used previously for selecting an approximate-sized sample without replacement. That is:

\n", "
    \n", "
  • For each observation in the data set, use the ranuni function to generate a uniform random number and store it in the variable called random.
  • \n", "
  • Sort the data set by the random number random.
  • \n", "
  • Select the first 15 observations from the sorted data set using the automatic variable _N_ (if _N_ le 15).
  • \n", "
\n", "

By so doing, every observation in the mailing data set has an equal likelihood of being one of the first 15 observations, and therefore an equal likelihood of being selected into the sample.
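If you would like some empirical reassurance of that claim, the following sketch repeats the sort-and-take-the-first-few approach many times on a small made-up list of 10 ids, keeping 3 each time, and then counts how often each id was selected. Everything here (the data set names reps and picked, the variables rep, id and position, and the choice of 2000 replicates) is an assumption made for this illustration rather than part of the lesson's programs.

```sas
/* Attach a uniform random number to each of 10 ids, 2000 times over */
DATA reps;
   do rep = 1 to 2000;
      do id = 1 to 10;
         random = ranuni(860244);
         output;
      end;
   end;
RUN;

/* Within each replicate, put the ids in random order */
PROC SORT data=reps;
   by rep random;
RUN;

/* Keep the first 3 ids of each replicate */
DATA picked;
   set reps;
   by rep;
   if first.rep then position = 0;   /* restart the counter            */
   position + 1;                     /* sum statement: value retained  */
   if position le 3;
RUN;

/* Each id should be chosen in roughly 30% of the 2000 replicates */
PROC FREQ data=picked;
   tables id;
   title 'How often each id was selected';
RUN;
```

With 3 of 10 ids kept per replicate, each id should show up roughly 600 times, give or take random variation.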

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following code illustrates yet another alternative way of randomly selecting an exact-sized random sample without replacement. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample exactly 15 of the 50 observations from the permanent SAS data set mailing:

\n", "
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SURVEYSELECT Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Selection MethodSimple Random Sampling
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Input Data SetMAILING
Random Number Seed12345678
Sample Size15
Selection Probability0.3
Sampling Weight3.333333
Output Data SetSAMPLE3B
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Sample3B: Exact-Sized Simple Random Sample

\n", "

without Replacement (using PROC SURVEYSELECT)

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ObsNumNameStreetCityState
11Jonathon Smothers103 Oak LaneBellefontePA
25Lisa Brothers89 Elm StreetBellefontePA
312Fran Cipolla912 Cardinal DriveBellefontePA
414William Edwards79 Oak LaneBellefontePA
538Miriam Denders2348 Robin AvenuePort MatildaPA
639Scott Fitzgerald43 Blue Jay DrivePort MatildaPA
740Jane Smiley298 Cardinal DrivePort MatildaPA
844Edwin Hoch389 Dolphin DrivePort MatildaPA
945Ann Draper72 Lake RoadPort MatildaPA
1050George Matre75 Ashwind DrivePort MatildaPA
1119Frank Smith238 Waupelani DriveState CollegePA
1224Mark Mendel256 Fraser StreetState CollegePA
1329Joe White678 S. Allen StreetState CollegePA
1434Mike Dahlberg1201 No. AthertonState CollegePA
1535Doris Alcorn453 Fraser StreetState CollegePA
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SURVEYSELECT data = phc6089.mailing\n", " out = sample3B\n", " method = SRS\n", " seed = 12345678\n", " sampsize = 15;\n", " title;\n", "RUN;\n", " \n", "PROC PRINT data = sample3B;\n", " title1 'Sample3B: Exact-Sized Simple Random Sample';\n", " title2 'without Replacement (using PROC SURVEYSELECT)';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select a sample of 15 observations from the mailing data set. Note that the only difference between this code and the previous SURVEYSELECT code is that the sampsize = 15 option here replaces the samprate = 0.30 option there. You might want to change the seed (seed) and sample size (sampsize) a few times to see how they affect the sample.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random Sampling with Replacement\n", "\n", "In the previous section, all of the samples that we selected were without replacement. That is, once an observation was selected from the data set, it could not be selected again. Now, we'll investigate how to take random samples with replacement. That is, if an observation is selected once, it does not prevent it from being selected again.\n", "\n", "
\n", "

Example

\n", "

The following code illustrates how to use the DATA step to randomly select an exact-sized random sample with replacement. Specifically, the program uses the ranuni function in conjunction with the POINT= option of the SET statement to tell SAS to randomly sample exactly 15 of the 50 observations from the permanent SAS data set mailing:

\n", "
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Sample4A: Exact-Sized Unrestricted Random Sample

\n", "

Selects units with equal probabilities & with replacement

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obs | Num | Name | Street | City | State | i
1 | 24 | Mark Mendel | 256 Fraser Street | State College | PA | 1
2 | 14 | William Edwards | 79 Oak Lane | Bellefonte | PA | 2
3 | 10 | Laura Mills | 704 Hill Street | Bellefonte | PA | 3
4 | 3 | Jim Jefferson | 10101 Allegheny Street | Bellefonte | PA | 4
5 | 45 | Ann Draper | 72 Lake Road | Port Matilda | PA | 5
6 | 11 | Linda Bentlager | 1010 Tricia Lane | Bellefonte | PA | 6
7 | 47 | Barb Wyse | 21 Cleveland Drive | Port Matilda | PA | 7
8 | 29 | Joe White | 678 S. Allen Street | State College | PA | 8
9 | 32 | George Ball | 888 Park Avenue | State College | PA | 9
10 | 31 | Robert Williams | 156 Straford Drive | State College | PA | 10
11 | 49 | Tim Winters | 95 Dove Street | Port Matilda | PA | 11
12 | 42 | Casey Spears | 123 Main Street | Port Matilda | PA | 12
13 | 47 | Barb Wyse | 21 Cleveland Drive | Port Matilda | PA | 13
14 | 32 | George Ball | 888 Park Avenue | State College | PA | 14
15 | 48 | Coach Pierce | 74 Main Street | Port Matilda | PA | 15
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA sample4A;\n", " choose=int(ranuni(58)*n)+1;\n", " set phc6089.mailing point=choose nobs=n;\n", " i+1;\n", " if i > 15 then stop;\n", "RUN;\n", "\n", "PROC PRINT data=sample4A;\n", " title1 'Sample4A: Exact-Sized Unrestricted Random Sample';\n", " title2 'Selects units with equal probabilities & with replacement';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

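Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select a sample of 15 observations from the mailing data set. Because the sampling is with replacement, an observation can appear more than once: here, Barb Wyse (Num 47) and George Ball (Num 32) were each selected twice.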

\n", "

The key to understanding how this code works is to understand what the expression:

\n", "
choose = int(ranuni(58)*n) + 1
\n", "

accomplishes. As you know, ranuni(58) tells SAS to use an initial seed of 58 to generate a uniform random number between 0 and 1. For the sake of example, suppose SAS generates the number 0.99. Then, the value of choose becomes 50 as calculated here:

\n", "
choose = int(0.99*50) + 1 = int(49.5) + 1 = 49 + 1 = 50
\n", "

And, if SAS generates the number 0.01, the value of choose becomes 1 as calculated here:

\n", "
choose = int(0.01*50) + 1 = int(0.5) + 1 = 0 + 1 = 1
\n", "

In this way, you can see how the expression always generates a positive integer 1, 2, 3, ..., up to n, the number of observations in your data set. All we need to do then is to tell SAS to generate such a random integer over and over again until we reach our desired sample size.
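
To check the arithmetic above, here is a minimal sketch that evaluates the same expression with n = 50 and writes the results to the log. The values of u are hand-picked stand-ins for what ranuni might return, not actual output of the generator:

```sas
DATA _null_;
   n = 50;                      /* number of observations in mailing */
   do u = 0.01, 0.5, 0.99;      /* illustrative values, not real ranuni output */
      choose = int(u*n) + 1;    /* same expression as in the program */
      put u= choose=;           /* log shows choose = 1, 26, and 50 */
   end;
RUN;
```

Because ranuni returns values strictly between 0 and 1, int(u*n) ranges from 0 to n - 1, so choose never falls outside the range 1 to n.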

\n", "

Here's a summary of the approach:

\n", "
    \n", "
  • Use the NOBS= option of the SET statement to determine n, the number of observations in the original data set.
  • \n", "
  • Use the above choose= assignment statement to generate a random integer between 1 and n. (Note that the choose= assignment statement must be placed before the SET statement. If it is not, SAS would not know which observation to read first.)
  • \n", "
  • Use the POINT= option of the SET statement to select the choose'th observation from the original data set. The POINT= option tells SAS to read the SAS data set using direct access by observation number. In general, with the POINT= option, you name a temporary variable (here, choose) whose value is the number of the observation you want the SET statement to read.
  • \n", "
  • Perform the above two steps repeatedly, keeping count of the number of observations selected. The sum statement i+1; takes care of the counting for us: SAS initializes i to 0 before the first iteration of the DATA step, retains its value from one iteration to the next, and increases i by 1 on every iteration, so i equals 1 the first time through.
  • \n", "
  • Once you've selected the number of observations desired (15, here), tell SAS to STOP. Note that when using the POINT= option, you must use a STOP statement to tell SAS when to stop processing the DATA step.
  • \n", "
\n", "

That's all there is to it! Again, you might want to change the seed (the 58) and the sample size (the 15) a few times to see how it affects the sample.
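
The same idea can also be written with an explicit DO loop, which some readers find easier to follow than the implicit DATA step loop. The following is only a sketch, assuming phc6089.mailing is available; the data set name sample4A_alt is made up for illustration:

```sas
DATA sample4A_alt;
   do i = 1 to 15;                          /* desired sample size */
      choose = int(ranuni(58)*n) + 1;       /* random integer from 1 to n */
      set phc6089.mailing point=choose nobs=n;
      output;                               /* write the selected observation */
   end;
   stop;                                    /* required when using POINT= */
RUN;
```

Here the explicit OUTPUT statement writes each selected record, and the STOP statement ends the DATA step once the loop has finished.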

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following code illustrates an alternative way of randomly selecting an exact-sized random sample with replacement. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample exactly 15 of the 50 observations from the permanent SAS data set mailing:

\n", "
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SURVEYSELECT Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Selection Method | Unrestricted Random Sampling
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Input Data Set | MAILING
Random Number Seed | 12345
Sample Size | 15
Expected Number of Hits | 0.3
Sampling Weight | 3.333333
Output Data Set | SAMPLE4B
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Sample4B: Exact-Sized Unrestricted Random Sample

\n", "

Selects units with equal probabilities & with replacement

\n", "

(using PROC SURVEYSELECT)

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obs | Num | Name | Street | City | State | NumberHits
1 | 10 | Laura Mills | 704 Hill Street | Bellefonte | PA | 1
2 | 14 | William Edwards | 79 Oak Lane | Bellefonte | PA | 1
3 | 15 | Harold Harvey | 480 Main Street | Bellefonte | PA | 1
4 | 42 | Casey Spears | 123 Main Street | Port Matilda | PA | 1
5 | 45 | Ann Draper | 72 Lake Road | Port Matilda | PA | 1
6 | 48 | Coach Pierce | 74 Main Street | Port Matilda | PA | 1
7 | 17 | Rigna Patel | 101 Beaver Avenue | State College | PA | 2
8 | 20 | Kristin Jones | 120 Stratford Drive | State College | PA | 1
9 | 29 | Joe White | 678 S. Allen Street | State College | PA | 1
10 | 30 | Daniel Peterson | 328 Waupelani Drive | State College | PA | 2
11 | 31 | Robert Williams | 156 Straford Drive | State College | PA | 1
12 | 34 | Mike Dahlberg | 1201 No. Atherton | State College | PA | 1
13 | 37 | Scott Henderson | 245 W. Beaver Avenue | State College | PA | 1
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SURVEYSELECT data = phc6089.mailing\n", " out = sample4B\n", " method = URS\n", " seed = 12345\n", " sampsize = 15;\n", " title;\n", "RUN;\n", " \n", "PROC PRINT data = sample4B;\n", " title1 'Sample4B: Exact-Sized Unrestricted Random Sample';\n", " title2 'Selects units with equal probabilities & with replacement';\n", " title3 '(using PROC SURVEYSELECT)';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select a sample of 15 observations from the mailing data set. Note that, by default, the output data set contains one row per distinct selected observation, with the NumberHits variable recording how many times each was drawn; here, there are 13 rows whose NumberHits values sum to 15. The only difference between this code and the previous SURVEYSELECT code is that the method = URS option here replaces the method = SRS option there. Here, URS tells SAS to use the unrestricted random sampling method to select observations, that is, with equal probability and with replacement. (Oh, yeah, I guess the specified seed differs from the previous code, too, but that's no matter.)

\n", "

Again, you might want to change the seed (seed) and sample size (sampsize) a few times to see how it affects the sample.
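
If you would rather have one row per selection (15 rows here) instead of one row per distinct observation, a minimal sketch is to add the OUTHITS option to the PROC SURVEYSELECT statement. The output data set name sample4B_hits is made up for illustration:

```sas
PROC SURVEYSELECT data = phc6089.mailing
                  out = sample4B_hits
                  method = URS
                  seed = 12345
                  sampsize = 15
                  outhits;
RUN;

PROC PRINT data = sample4B_hits;
   title 'Sample4B_hits: one row per selection (OUTHITS)';
RUN;
```

With OUTHITS, an observation that is drawn twice should appear on two rows of the output data set, which makes the result easier to compare with the DATA step version.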

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stratified Random Sampling\n", "\n", "In the two previous sections, we were concerned with taking a random sample from a data set without regard to whether an observation comes from a particular subgroup. When you are conducting a survey, it often behooves you to make sure that your sample contains a certain number of observations from each particular subgroup. We'll concern ourselves with such a restriction here. That is, in this section, we'll focus on ways of using SAS to obtain a **stratified random sample**, in which a subset of observations are selected randomly from each subgroup of observations as determined by the value of a stratifying variable. We'll also go back to sampling without replacement, in which once an observation is selected it cannot be selected again.\n", "\n", "### Selecting a Stratified Sample of Equal-Sized Groups\n", "\n", "We'll first focus on the situation in which an equal number of observations are selected from each subgroup of observations as determined by the value of a stratifying variable.\n", "\n", "
\n", "

Example

\n", "

The following code illustrates how to select a stratified random sample of equal-sized groups. Specifically, the code tells SAS to randomly select 5 observations from each of the three subgroups — State College, Port Matilda, Bellefonte — as determined by the value of the variable city:

\n", "
" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "
\n", "

Count by CITY

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obs | City | COUNT | PERCENT
1 | Bellefonte | 15 | 30
2 | Port Matilda | 13 | 26
3 | State College | 22 | 44
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Sample5: Stratified Random Sample with Equal-Sized Strata

\n", "
\n", "
\n", "
\n", "

F4=Bellefonte

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
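Obs | Num | Name | Street | State | COUNT | k | random | propn
1 | 1 | Jonathon Smothers | 103 Oak Lane | PA | 15 | 5 | 0.16092 | 0.33333
2 | 4 | Mark Adams | 312 Oak Lane | PA | 12 | 4 | 0.27445 | 0.33333
3 | 7 | John Doe | 812 Main Street | PA | 9 | 3 | 0.18473 | 0.33333
4 | 10 | Laura Mills | 704 Hill Street | PA | 6 | 2 | 0.25575 | 0.33333
5 | 12 | Fran Cipolla | 912 Cardinal Drive | PA | 4 | 1 | 0.10189 | 0.25000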
\n", "
\n", "
\n", "
\n", "
\n", "

F4=Port Matilda

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obs | Num | Name | Street | State | COUNT | k | random | propn
6 | 40 | Jane Smiley | 298 Cardinal Drive | PA | 11 | 5 | 0.20233 | 0.45455
7 | 43 | Leslie Olin | 487 Bluebird Haven | PA | 8 | 4 | 0.01778 | 0.50000
8 | 46 | Linda Nicolson | 71 Liberty Terrace | PA | 5 | 3 | 0.30759 | 0.60000
9 | 48 | Coach Pierce | 74 Main Street | PA | 3 | 2 | 0.10358 | 0.66667
10 | 49 | Tim Winters | 95 Dove Street | PA | 2 | 1 | 0.16313 | 0.50000
\n", "
\n", "
\n", "
\n", "
\n", "

F4=State College

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obs | Num | Name | Street | State | COUNT | k | random | propn
11 | 17 | Rigna Patel | 101 Beaver Avenue | PA | 21 | 5 | 0.09446 | 0.23810
12 | 26 | Jan Davison | 201 E. Beaver Avenue | PA | 12 | 4 | 0.06831 | 0.33333
13 | 28 | Srabashi Kundu | 112 E. Beaver Avenue | PA | 10 | 3 | 0.15135 | 0.30000
14 | 32 | George Ball | 888 Park Avenue | PA | 6 | 2 | 0.15132 | 0.33333
15 | 37 | Scott Henderson | 245 W. Beaver Avenue | PA | 1 | 1 | 0.21930 | 1.00000
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.mailing;\n", " table city / out=bycount noprint;\n", "RUN;\n", " \n", "PROC SORT data=phc6089.mailing;\n", " by city;\n", "RUN;\n", " \n", "DATA sample5;\n", " merge phc6089.mailing bycount (drop = percent);\n", " by city;\n", " retain k;\n", " if first.city then k=5;\n", " random = ranuni(109);\n", " propn = k/count;\n", " if random le propn then\n", " do;\n", " output;\n", " k=k-1;\n", " end;\n", " count=count-1;\n", "RUN;\n", " \n", "PROC PRINT data=bycount;\n", " title 'Count by CITY';\n", "RUN;\n", " \n", "PROC PRINT data=sample5;\n", " title 'Sample5: Stratified Random Sample with Equal-Sized Strata';\n", " by city;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, five observations from Port Matilda, and five observations from State College.

\n", "

Now, how does the program work? To select a stratified random sample in SAS, we use code similar to that for selecting exact-sized random samples without replacement, except now we process within each subgroup. More specifically, here's how the program works step-by-step:

\n", "
    \n", "
  • The sole purpose of the FREQ procedure is to determine the number of observations in the phc6089.mailing data set that correspond to each level of the stratification variable city (hence, \"table city\"). The OUT = option tells SAS to create a data set called bycount that contains the variable city and two variables that contain the number (count) and percentage (percent) of records for each level of city.
  • \n", "
  • The SORT procedure merely sorts the phc6089.mailing data set by city so that it can be processed by city in the next DATA step. Because no OUT= option is specified, the sorted data replace the original phc6089.mailing data set.
  • \n", "
  • Merge, by city, the sorted phc6089.mailing data set with the bycount data set, so that the number of observations per subgroup is available. Since the percentage of observations is not needed, drop it from the data set on input.
  • \n", "
  • The rest of the code in the DATA step should look very familiar. That is, once the number of observations per subgroup in the original phc6089.mailing data set is available, you can randomly select records from the subgroup as you would select exact-sized random samples without replacement, except you select within city (hence, \"by city\"). Every time SAS reads in a new city (hence, \"if first.city\"), the number of observations still needed in the subgroup's sample (k) is set to the number of observations desired in each of the subgroups (5, here).
  • \n", "
\n", "

Note that, again, the random = ranuni(109) and propn = k/count assignments are made here only so their values can be printed. In another situation, these values would be incorporated directly in the IF statement: if ranuni(109) le k/count then do; Additionally, k and count could be dropped from the output data set, but are kept here, so their values can be printed for educational purposes.
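
For reference, here is a sketch of what that more compact version might look like, assuming the bycount data set and the sorted phc6089.mailing data set from the program above already exist. The data set name sample5_compact is made up for illustration:

```sas
DATA sample5_compact (drop = k count);
   merge phc6089.mailing bycount (drop = percent);
   by city;
   retain k;
   if first.city then k = 5;              /* observations still needed in this city */
   if ranuni(109) le k/count then do;     /* select with probability k/count */
      output;
      k = k - 1;                          /* one fewer observation still needed */
   end;
   count = count - 1;                     /* one fewer observation left to consider */
RUN;
```

The logic is unchanged; only the helper variables random and propn are gone, and k and count are dropped from the output data set.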

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following code illustrates an alternative way of randomly selecting a stratified random sample of equal-sized groups. The code, while less efficient — because it requires that the data set be processed twice and sorted once — may feel more natural and intuitive to you:

\n", "
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Sample6A: Stratified Random Sample with Equal-Sized Strata

\n", "
\n", "
\n", "
\n", "

F4=Bellefonte

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obs | Num | Name | Street | State | random
1 | 10 | Laura Mills | 704 Hill Street | PA | 0.05728
2 | 4 | Mark Adams | 312 Oak Lane | PA | 0.22701
3 | 13 | James Whitney | 104 Pine Hill Drive | PA | 0.28315
4 | 12 | Fran Cipolla | 912 Cardinal Drive | PA | 0.34773
5 | 5 | Lisa Brothers | 89 Elm Street | PA | 0.42637
\n", "
\n", "
\n", "
\n", "
\n", "

F4=Port Matilda

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obs | Num | Name | Street | State | random
6 | 47 | Barb Wyse | 21 Cleveland Drive | PA | 0.05728
7 | 41 | Lou Barr | 219 Eagle Street | PA | 0.22701
8 | 50 | George Matre | 75 Ashwind Drive | PA | 0.28315
9 | 49 | Tim Winters | 95 Dove Street | PA | 0.34773
10 | 42 | Casey Spears | 123 Main Street | PA | 0.42637
\n", "
\n", "
\n", "
\n", "
\n", "

F4=State College

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obs | Num | Name | Street | State | random
11 | 33 | Steve Ignella | 367 Whitehall Road | PA | 0.00548
12 | 25 | Steve Lindhoff | 130 E. College Avenue | PA | 0.05728
13 | 19 | Frank Smith | 238 Waupelani Drive | PA | 0.22701
14 | 31 | Robert Williams | 156 Straford Drive | PA | 0.26377
15 | 28 | Srabashi Kundu | 112 E. Beaver Avenue | PA | 0.28315
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA scollege pmatilda bellefnt;\n", " set phc6089.mailing;\n", " if city = 'State College' then output scollege;\n", " else if city = 'Port Matilda' then output pmatilda;\n", " else if city = 'Bellefonte' then output bellefnt;\n", "RUN;\n", " \n", "%MACRO select (dsn, num);\n", " DATA &dsn;\n", " set &dsn;\n", " random=ranuni(85329);\n", " RUN;\n", " PROC SORT data=&dsn;\n", " by random;\n", " RUN;\n", " DATA &dsn;\n", " set &dsn;\n", " if _N_ le &num;\n", " RUN;\n", "%MEND select;\n", " \n", "%SELECT(scollege, 5); %SELECT(pmatilda, 5); %SELECT(bellefnt, 5); \n", " \n", "DATA sample6A;\n", " set bellefnt pmatilda scollege;\n", "RUN;\n", " \n", "PROC PRINT data=sample6A;\n", " title 'Sample6A: Stratified Random Sample with Equal-Sized Strata';\n", " by city;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, five observations from Port Matilda, and five observations from State College.

\n", "

Now, how does the program work? In summary, here's how the approach works:

\n", "
    \n", "
  • The first DATA step uses an IF-THEN-ELSE statement in conjunction with OUTPUT statements to divide the original mailing data set up into several data sets based on the value of city. (Here, we create three data sets, one for each city... namely, scollege, pmatilda, and bellefnt.)
  • \n", "
  • Then, the macro select exactly mimics the creation of the sample3A data set in Example 10.5 on the Random Sampling Without Replacement page. That is, the macro generates a random number for each observation, the data set is sorted by the random number, and then the first num observations are selected.
  • \n", "
  • Then, call the macro select three times, once for each of the city data sets (scollege, pmatilda, and bellefnt), selecting five observations from each.
  • \n", "
  • Finally, the last DATA step concatenates the three data sets, bellefnt, pmatilda, and scollege, with 5 observations each, back into one data set called sample6A containing the 15 randomly selected observations.
  • \n", "
\n", "

Lo and behold, when all is said and done, we have another stratified random sample of equal-sized groups! Approach #2 checked off. Now, onto one last approach!
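
You may have noticed in the output that the same random values (0.05728, 0.22701, 0.28315, ...) show up in all three cities. That happens because each call to the macro starts a new DATA step with the same seed, 85329. If you want each city to get its own stream of random numbers, one possible sketch is to pass the seed in as a third parameter. The macro name select2 and the seed values below are made up for illustration:

```sas
%MACRO select2 (dsn, num, seed);
   DATA &dsn;
      set &dsn;
      random = ranuni(&seed);     /* seed now differs from call to call */
   RUN;
   PROC SORT data=&dsn;
      by random;
   RUN;
   DATA &dsn;
      set &dsn;
      if _N_ le &num;             /* keep the first num observations */
   RUN;
%MEND select2;

%SELECT2(scollege, 5, 101); %SELECT2(pmatilda, 5, 202); %SELECT2(bellefnt, 5, 303);
```

Since the macro overwrites the city data sets, you would need to rerun the first DATA step that creates scollege, pmatilda, and bellefnt before calling it.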

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following code illustrates yet another alternative way of randomly selecting a stratified random sample of equal-sized groups. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample exactly five observations from each of the three city subgroups in the permanent SAS data set mailing:

\n", "
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SURVEYSELECT Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Selection Method | Simple Random Sampling
Strata Variable | City
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Input Data Set | MAILING
Random Number Seed | 12345678
Number of Strata | 3
Total Sample Size | 15
Output Data Set | SAMPLE6B
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Sample6B: Stratified Random Sample

\n", "

with Equal-Sized Strata (using PROC SURVEYSELECT)

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obs | City | Num | Name | Street | State | SelectionProb | SamplingWeight
1 | Bellefonte | 5 | Lisa Brothers | 89 Elm Street | PA | 0.33333 | 3.0
2 | Bellefonte | 7 | John Doe | 812 Main Street | PA | 0.33333 | 3.0
3 | Bellefonte | 8 | Mamie Davison | 102 Cherry Avenue | PA | 0.33333 | 3.0
4 | Bellefonte | 11 | Linda Bentlager | 1010 Tricia Lane | PA | 0.33333 | 3.0
5 | Bellefonte | 15 | Harold Harvey | 480 Main Street | PA | 0.33333 | 3.0
6 | Port Matilda | 41 | Lou Barr | 219 Eagle Street | PA | 0.38462 | 2.6
7 | Port Matilda | 42 | Casey Spears | 123 Main Street | PA | 0.38462 | 2.6
8 | Port Matilda | 44 | Edwin Hoch | 389 Dolphin Drive | PA | 0.38462 | 2.6
9 | Port Matilda | 48 | Coach Pierce | 74 Main Street | PA | 0.38462 | 2.6
10 | Port Matilda | 50 | George Matre | 75 Ashwind Drive | PA | 0.38462 | 2.6
11 | State College | 20 | Kristin Jones | 120 Stratford Drive | PA | 0.22727 | 4.4
12 | State College | 30 | Daniel Peterson | 328 Waupelani Drive | PA | 0.22727 | 4.4
13 | State College | 32 | George Ball | 888 Park Avenue | PA | 0.22727 | 4.4
14 | State College | 35 | Doris Alcorn | 453 Fraser Street | PA | 0.22727 | 4.4
15 | State College | 37 | Scott Henderson | 245 W. Beaver Avenue | PA | 0.22727 | 4.4
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SURVEYSELECT data = phc6089.mailing\n", " out = sample6B\n", " method = SRS\n", " seed = 12345678\n", " sampsize = (5 5 5);\n", " strata city notsorted;\n", " title;\n", "RUN;\n", " \n", "PROC PRINT data = sample6B;\n", " title1 'Sample6B: Stratified Random Sample';\n", " title2 'with Equal-Sized Strata (using PROC SURVEYSELECT)';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, five observations from Port Matilda, and five observations from State College.

\n", "

Now, the specifics about the code. The only things that should look new here are the STRATA statement and the form of the SAMPSIZE= option. The STRATA statement tells SAS to partition the input data set phc6089.mailing into nonoverlapping groups defined by the variable city. The NOTSORTED option does not tell SAS that the data set is unsorted. Instead, the NOTSORTED option tells SAS that the observations in the data set are arranged in city groups, but the groups are not necessarily in alphabetical order. The sampsize = (5 5 5) option tells SAS that we are interested in sampling five observations from each of the three city groups.
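
When every stratum gets the same sample size, the list of values can be shortened. As a sketch, a single SAMPSIZE= value is applied to each stratum, so the following should request the same design. The output data set name sample6B_alt is made up for illustration:

```sas
PROC SURVEYSELECT data = phc6089.mailing
                  out = sample6B_alt
                  method = SRS
                  seed = 12345678
                  sampsize = 5;      /* five observations from each stratum */
   strata city notsorted;
RUN;
```

The parenthesized list form is still the one to reach for when the strata need different sample sizes.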

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting a Stratified Sample of Unequal-Sized Groups\n", "\n", "Now, we'll focus on the situation in which an unequal number of observations are selected from each subgroup of observations as determined by the value of a stratifying variable. If there are an unequal number of observations for each subgroup in the original data set, this sampling scheme may be accomplished by selecting the same proportion of observations from each subgroup. Again, we'll sample without replacement, in which once an observation is selected it cannot be re-selected.\n", "\n", "To select a **stratified random sample of unequal-sized groups**, we could use the code from Example 10.10 by passing the different group sample sizes into the macro select. Alternatively, we could create a data set containing two count variables ...one that contains the number of observations in each subgroup (n) ...and the other that contains the number of observations that need to be selected from each subgroup (k). Once the data set is created, we could merge it with the original data set, and select observations randomly as we have done previously for random samples without replacement. That's the strategy that the following example uses.\n", "\n", "
\n", "

Example

\n", "

The following code illustrates how to select a stratified random sample of unequal-sized groups. Specifically, the code tells SAS to randomly select 5, 6, and 8 observations, respectively, from each of the three subgroups — Bellefonte, Port Matilda, and State College — as determined by the value of the variable city:

\n", "
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

NSELECT: Count by CITY

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obs | City | n | k
1 | Bellefonte | 15 | 5
2 | Port Matilda | 13 | 6
3 | State College | 22 | 8
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Sample7: Stratified Random Sample of Unequal-Sized Groups

\n", "
\n", "
\n", "
\n", "

F4=Bellefonte

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obs | Num | Name | Street | State
1 | 1 | Jonathon Smothers | 103 Oak Lane | PA
2 | 3 | Jim Jefferson | 10101 Allegheny Street | PA
3 | 6 | Delilah Fequa | 2094 Acorn Street | PA
4 | 11 | Linda Bentlager | 1010 Tricia Lane | PA
5 | 15 | Harold Harvey | 480 Main Street | PA
\n", "
\n", "
\n", "
\n", "
\n", "

F4=Port Matilda

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
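Obs | Num | Name | Street | State
6 | 38 | Miriam Denders | 2348 Robin Avenue | PA
7 | 42 | Casey Spears | 123 Main Street | PA
8 | 43 | Leslie Olin | 487 Bluebird Haven | PA
9 | 46 | Linda Nicolson | 71 Liberty Terrace | PA
10 | 48 | Coach Pierce | 74 Main Street | PA
11 | 49 | Tim Winters | 95 Dove Street | PA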
1149Tim Winters95 Dove StreetPA
\n", "
\n", "
\n", "
\n", "
\n", "

F4=State College

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obs | Num | Name | Street | State
12 | 16 | Linda Edmonds | 410 College Avenue | PA
13 | 17 | Rigna Patel | 101 Beaver Avenue | PA
14 | 18 | Ade Fequa | 803 Allen Street | PA
15 | 21 | Amy Kuntz | 357 Park Avenue | PA
16 | 24 | Mark Mendel | 256 Fraser Street | PA
17 | 26 | Jan Davison | 201 E. Beaver Avenue | PA
18 | 34 | Mike Dahlberg | 1201 No. Atherton | PA
19 | 35 | Doris Alcorn | 453 Fraser Street | PA
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA nselect;\n", " set phc6089.mailing (keep = city);\n", " by city;\n", " n+1;\n", " if last.city then do;\n", " input k;\n", " output;\n", " n=0;\n", " end;\n", "DATALINES;\n", "5\n", "6\n", "8\n", ";\n", "RUN;\n", " \n", "DATA sample7 (drop = k n);\n", " merge phc6089.mailing nselect;\n", " by city;\n", " if ranuni(7841) le k/n then\n", " do;\n", " output;\n", " k=k-1;\n", " end;\n", " n=n-1;\n", "RUN;\n", "PROC PRINT data=nselect;\n", " title 'NSELECT: Count by CITY';\n", "RUN;\n", "PROC PRINT data=sample7;\n", " title 'Sample7: Stratified Random Sample of Unequal-Sized Groups';\n", " by city;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, six observations from Port Matilda, and eight observations from State College.

\n", "

Now, how does the program work? The key to understanding the program is to understand the first DATA step. The remainder of the program is much like code we've seen before in which a random sample is selected without replacement. The first DATA step creates a temporary data set called nselect that contains three variables (city, n, and k):

\n", "
    \n", "
  • To count the number of observations n from each city, we use a counter variable n in conjunction with the last.city variable. Because n appears in a sum statement (n+1), SAS initializes n to 0 before the first iteration of the DATA step, retains its value from one iteration to the next, and increases n by 1 on every iteration until it has counted all of the observations for one of the levels of city.
  • \n", "
  • To tell SAS the number of observations to select from each city, we use an INPUT statement in conjunction with a DATALINES statement. The numbers k are listed in the order of city, so here we tell SAS we want to randomly select 5 observations from Bellefonte, 6 observations from Port Matilda, and 8 observations from State College.
  • \n", "
  • To write the numbers n and k to the new data set nselect, we test the last.city variable in an IF-THEN statement with a DO group. So here, when SAS finds the last record within a city subgroup, it reads the next value of k, writes n and k to the nselect data set, and resets n to 0 in preparation for counting the number of observations for the next city in the data set.
  • \n", "
\n", "

The second DATA step creates a temporary data set called sample7 by merging the phc6089.mailing data set with the nselect data set. After merging, the code then randomly selects the desired number of observations from each city just as we did previously for random samples without replacement.
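
One quick way to confirm the stratum counts is to tabulate city with PROC FREQ. This is a minimal sketch that assumes the sample7 data set created above is still in the WORK library; the title text is illustrative only.

```sas
/* Sketch: count the sampled observations per city in sample7.       */
/* The program above is meant to select 5 from Bellefonte, 6 from    */
/* Port Matilda, and 8 from State College.                           */
PROC FREQ data=sample7;
   tables city / nocum nopercent;
   title 'Sample7: Number of Sampled Observations by City';
RUN;
```

If the frequencies differ from the values supplied after the DATALINES statement, the first thing to check is whether those values are listed in the same order as the cities appear in the sorted data set.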

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following code illustrates an alternative way of randomly selecting a stratified random sample of unequal-sized groups. In selecting such a sample, rather than specifying the desired number sampled from each group, we could tell SAS to select an equal proportion of observations from each group. Specifically, the code tells SAS to randomly select 25% of the observations from each of the three subgroups (Bellefonte, Port Matilda, and State College):

\n", "
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

NSELECT2: Count by CITY

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
| Obs | City | n | k |
| --- | --- | --- | --- |
| 1 | Bellefonte | 15 | 4 |
| 2 | Port Matilda | 13 | 4 |
| 3 | State College | 22 | 6 |
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Sample8: Stratified Random Sample of Unequal-Sized Groups

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
| Obs | Num | Name | Street | City | State |
| --- | --- | --- | --- | --- | --- |
| 1 | 3 | Jim Jefferson | 10101 Allegheny Street | Bellefonte | PA |
| 2 | 6 | Delilah Fequa | 2094 Acorn Street | Bellefonte | PA |
| 3 | 11 | Linda Bentlager | 1010 Tricia Lane | Bellefonte | PA |
| 4 | 15 | Harold Harvey | 480 Main Street | Bellefonte | PA |
| 5 | 38 | Miriam Denders | 2348 Robin Avenue | Port Matilda | PA |
| 6 | 42 | Casey Spears | 123 Main Street | Port Matilda | PA |
| 7 | 46 | Linda Nicolson | 71 Liberty Terrace | Port Matilda | PA |
| 8 | 49 | Tim Winters | 95 Dove Street | Port Matilda | PA |
| 9 | 16 | Linda Edmonds | 410 College Avenue | State College | PA |
| 10 | 17 | Rigna Patel | 101 Beaver Avenue | State College | PA |
| 11 | 18 | Ade Fequa | 803 Allen Street | State College | PA |
| 12 | 24 | Mark Mendel | 256 Fraser Street | State College | PA |
| 13 | 26 | Jan Davison | 201 E. Beaver Avenue | State College | PA |
| 14 | 35 | Doris Alcorn | 453 Fraser Street | State College | PA |
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA nselect2;\n", " set phc6089.mailing (keep=city);\n", " by city;\n", " n+1;\n", " if last.city then do;\n", " k=ceil(0.25*n);\n", " output;\n", " n=0;\n", " end;\n", "RUN;\n", " \n", "DATA sample8 (drop = k n);\n", " merge phc6089.mailing nselect2;\n", " by city;\n", " if ranuni(7841) le k/n then\n", " do;\n", " output;\n", " k=k-1;\n", " end;\n", " n=n-1;\n", "RUN;\n", " \n", "PROC PRINT data=nselect2;\n", " title 'NSELECT2: Count by CITY';\n", "RUN;\n", " \n", "PROC PRINT data=sample8;\n", " title 'Sample8: Stratified Random Sample of Unequal-Sized Groups';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

In this case, it probably makes most sense to first compare the code here with the code from the previous example. The only difference you should see is that rather than using INPUT and DATALINES statements to read in the number of observations, k, to be selected from each of the subgroups, here we use the ceiling function, ceil( ), to determine k. Specifically, k is calculated using:

\n", "
k=ceil(0.25*n);
\n", "

Now, if you think about it, if I tell you to select 25% of the n = 16 observations in a subgroup, you'd tell me that we need to select 4 observations. But what if I tell you to select 25% of the n = 15 observations in a subgroup? If you calculate 25% of 15, you get 3.75. Hmmm... how can you select 3.75 observations? That's where the ceiling function comes into play. The ceiling function, ceil(argument), returns the smallest integer that is greater than or equal to the argument. So, for example, ceil(3.75) equals 4, as do, of course, ceil(3.1), ceil(3.25), and ceil(3.87). You get the idea.
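
To see the rounding at work, here is a minimal sketch that evaluates the same expression for a few subgroup sizes. The sizes 13, 15, and 22 are the values of n reported in the NSELECT2 listing, and 16 is the value used in the paragraph above; the DATA _NULL_ step itself is only an illustration and is not part of the lesson's program.

```sas
/* Sketch: evaluate k = ceil(0.25*n) for a few subgroup sizes. */
DATA _NULL_;
   do n = 13, 15, 16, 22;
      k = ceil(0.25*n);
      put n= k=;   /* writes the pair to the SAS log */
   end;
RUN;
```

The log should show k=4 for n=13, n=15, and n=16, and k=6 for n=22, since 0.25*n is 3.25, 3.75, 4, and 5.5, respectively.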

\n", "

That's it ... that's all there is to it. Once k is determined using the ceiling function, the rest of the program is identical to the program in the previous example.

\n", "

Now, try it out... launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, 25% of the observations (rounded up to a whole number) from each of Bellefonte, Port Matilda, and State College. In this case, that translates to 4, 4, and 6 observations, respectively.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following code illustrates yet another alternative way... you've gotta be kidding me! ...of randomly selecting a stratified random sample of unequal-sized groups. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample exactly 5, 6, and 8 observations, respectively, from each of the three city subgroups in the permanent SAS data set phc6089.mailing:

\n", "
" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SURVEYSELECT Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Selection Method: Simple Random Sampling
Strata Variable: City
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Input Data Set: MAILING
Random Number Seed: 12345678
Number of Strata: 3
Total Sample Size: 19
Output Data Set: SAMPLE9
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Sample9: Stratified Random Sample

\n", "

with Unequal-Sized Strata (using PROC SURVEYSELECT)

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
| Obs | City | Num | Name | Street | State | SelectionProb | SamplingWeight |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Bellefonte | 5 | Lisa Brothers | 89 Elm Street | PA | 0.33333 | 3.00000 |
| 2 | Bellefonte | 7 | John Doe | 812 Main Street | PA | 0.33333 | 3.00000 |
| 3 | Bellefonte | 8 | Mamie Davison | 102 Cherry Avenue | PA | 0.33333 | 3.00000 |
| 4 | Bellefonte | 11 | Linda Bentlager | 1010 Tricia Lane | PA | 0.33333 | 3.00000 |
| 5 | Bellefonte | 15 | Harold Harvey | 480 Main Street | PA | 0.33333 | 3.00000 |
| 6 | Port Matilda | 40 | Jane Smiley | 298 Cardinal Drive | PA | 0.46154 | 2.16667 |
| 7 | Port Matilda | 41 | Lou Barr | 219 Eagle Street | PA | 0.46154 | 2.16667 |
| 8 | Port Matilda | 42 | Casey Spears | 123 Main Street | PA | 0.46154 | 2.16667 |
| 9 | Port Matilda | 43 | Leslie Olin | 487 Bluebird Haven | PA | 0.46154 | 2.16667 |
| 10 | Port Matilda | 49 | Tim Winters | 95 Dove Street | PA | 0.46154 | 2.16667 |
| 11 | Port Matilda | 50 | George Matre | 75 Ashwind Drive | PA | 0.46154 | 2.16667 |
| 12 | State College | 20 | Kristin Jones | 120 Stratford Drive | PA | 0.36364 | 2.75000 |
| 13 | State College | 24 | Mark Mendel | 256 Fraser Street | PA | 0.36364 | 2.75000 |
| 14 | State College | 25 | Steve Lindhoff | 130 E. College Avenue | PA | 0.36364 | 2.75000 |
| 15 | State College | 28 | Srabashi Kundu | 112 E. Beaver Avenue | PA | 0.36364 | 2.75000 |
| 16 | State College | 30 | Daniel Peterson | 328 Waupelani Drive | PA | 0.36364 | 2.75000 |
| 17 | State College | 32 | George Ball | 888 Park Avenue | PA | 0.36364 | 2.75000 |
| 18 | State College | 34 | Mike Dahlberg | 1201 No. Atherton | PA | 0.36364 | 2.75000 |
| 19 | State College | 37 | Scott Henderson | 245 W. Beaver Avenue | PA | 0.36364 | 2.75000 |
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SURVEYSELECT data = phc6089.mailing\n", " out = sample9\n", " method = SRS\n", " seed = 12345678\n", " sampsize = (5 6 8);\n", " strata city notsorted;\n", " title;\n", "RUN;\n", " \n", "PROC PRINT data = sample9;\n", " title1 'Sample9: Stratified Random Sample';\n", " title2 'with Unequal-Sized Strata (using PROC SURVEYSELECT)';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Straightforward enough! The only difference between this code and the code in the earlier example that sampled equal-sized groups is that here the sample sizes are specified as 5, 6, and 8 rather than 5, 5, and 5. Note that you must list the stratum sample size values in the order in which the strata appear in the input data set. Like I said, straightforward enough.
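
The SelectionProb and SamplingWeight columns in the output can be reproduced by hand. The sketch below assumes the stratum sizes of 15, 13, and 22 reported in the NSELECT2 listing, and computes the selection probability as k/n and the sampling weight as its reciprocal, n/k. The variable names prob and weight are illustrative and are not produced by PROC SURVEYSELECT.

```sas
/* Sketch: selection probability = k/n, sampling weight = n/k.  */
/* n = stratum size, k = stratum sample size (5, 6, and 8).     */
DATA _NULL_;
   input city $ 1-13 n k;
   prob   = k / n;
   weight = n / k;
   format prob weight 8.5;
   put city= prob= weight=;   /* writes the results to the SAS log */
DATALINES;
Bellefonte    15 5
Port Matilda  13 6
State College 22 8
;
RUN;
```

The values written to the log should agree with the printed output: roughly 0.33333 and 3 for Bellefonte, 0.46154 and 2.16667 for Port Matilda, and 0.36364 and 2.75 for State College.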

\n", "

Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, six observations from Port Matilda, and eight observations from State College.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Random Assignments\n", "\n", "We now turn our focus from randomly sampling a subset of observations from a data set to that of generating a random assignment of treatments to experimental units in a randomized, controlled experiment. The good news is that the techniques used to sample without replacement can easily be extended to generate such random assignment plans.\n", "\n", "It's probably a good time to remind you of the existence of the PLAN procedure. As I mentioned earlier, due to time constraints of the course and the complexity of the PLAN procedure, we will not use it to accomplish any of our random assignments. You should be aware, however, of its existence should you want to explore it on your own in the future.\n", "\n", "
\n", "

Example

\n", "

Suppose we are interested in conducting an experiment so that we can compare the effects of two drugs (A and B) and one placebo on headache pain. We have 30 subjects enrolled in our study, but need to determine a plan for randomly assigning 10 of the subjects to treatment A, 10 of the subjects to treatment B, and 10 of the subjects to the placebo. The following program does just that for us. That is, it creates a random assignment for 30 subjects in a completely randomized design with one factor having 3 levels:

\n", "
" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Random Assignment for CRD with One Factor

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
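| Unit | random | group |
| --- | --- | --- |
| 11 | 0.00602 | Placebo |
| 14 | 0.14366 | Placebo |
| 18 | 0.18030 | Placebo |
| 4 | 0.20396 | Placebo |
| 12 | 0.21271 | Placebo |
| 29 | 0.21515 | Placebo |
| 9 | 0.25440 | Placebo |
| 3 | 0.29567 | Placebo |
| 21 | 0.32816 | Placebo |
| 22 | 0.33889 | Placebo |
| 17 | 0.44446 | Drug A |
| 19 | 0.47514 | Drug A |
| 5 | 0.49087 | Drug A |
| 23 | 0.50231 | Drug A |
| 28 | 0.52765 | Drug A |
| 25 | 0.53381 | Drug A |
| 24 | 0.55448 | Drug A |
| 6 | 0.60245 | Drug A |
| 8 | 0.60772 | Drug A |
| 20 | 0.61191 | Drug A |
| 16 | 0.69616 | Drug B |
| 7 | 0.69824 | Drug B |
| 1 | 0.70305 | Drug B |
| 10 | 0.71145 | Drug B |
| 13 | 0.71217 | Drug B |
| 15 | 0.86676 | Drug B |
| 27 | 0.96330 | Drug B |
| 26 | 0.97864 | Drug B |
| 30 | 0.98660 | Drug B |
| 2 | 0.99081 | Drug B |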
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA exper1;\n", " DO Unit = 1 to 30;\n", " OUTPUT;\n", " END;\n", "RUN;\n", " \n", "DATA random1;\n", " set exper1;\n", " random=ranuni(27407349);\n", "RUN;\n", " \n", "PROC SORT data=random1;\n", " by random;\n", "RUN;\n", " \n", "PROC FORMAT;\n", " value trtfmt 1 = 'Placebo'\n", " 2 = 'Drug A'\n", " 3 = 'Drug B';\n", "RUN;\n", " \n", "DATA random1;\n", " set random1;\n", " if _N_ le 10 then group=1;\n", " else if _N_ gt 10 and _N_ le 20 then group=2;\n", " else if _N_ gt 20 then group=3;\n", " format group trtfmt.;\n", "RUN;\n", " \n", "PROC PRINT data = random1 NOOBS;\n", " title 'Random Assignment for CRD with One Factor';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Okay, let's have you first launch and run the SAS program, so you can review the resulting output to convince yourself that the code did indeed generate the desired treatment plan. You should see that 10 of the subjects were randomly assigned to treatment A, 10 to treatment B, and 10 to the placebo.

\n", "

Now, let's walk ourselves through the program to make sure we understand how it works. The first DATA step merely uses a simple DO loop to create a temporary data set called exper1 that contains one observation for each of the experimental units (in our case, the experimental units are subjects). The only variable in the data set, unit, contains an arbitrary label 1, 2, ..., 30 assigned to each of the experimental units.

\n", "

The remainder of the code generates the random assignment. To do so, the code from Example 34.5 is simply extended. That is:

\n", "
    \n", "
  • The second DATA step uses the ranuni function to generate a uniform random number between 0 and 1 for each observation in the exper1 data set. The result is stored in a temporary data set called random1.
  • \n", "
  • The random1 data set is sorted in order of the random number.
  • \n", "
  • The third DATA step uses an IF-THEN-ELSE construct to assign the first ten units in sorted order to group 1, the second ten to group 2, and the last ten to group 3.
  • \n", "
  • A FORMAT, defined in the PROC FORMAT step that precedes the third DATA step, is used to label the groups meaningfully.
  • \n", "
  • The final randomization list is printed.
  • \n", "
\n", "

Note! The randomization list created here contains information that is potentially damaging to the success of the whole study if it ended up in the wrong hands. That is, blinding would be violated. It is better (and more common) practice to keep separate master lists which associate unit with the subject's name, and group number with treatment name. In many national trials, it is common to have statisticians also blinded from the master list, producing a \"triple-blind\" trial. I formatted the treatment here for illustration purposes only.
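
As a minimal sketch of the practice described in the note, the list can be printed with numeric group codes only, so that the treatment names do not appear on it. This assumes the random1 data set from the program above; the title text is illustrative.

```sas
/* Sketch: print the randomization list without treatment names. */
/* A FORMAT statement that names a variable but no format        */
/* removes the trtfmt. format for this PROC step only.           */
PROC PRINT data=random1 NOOBS;
   var Unit group;
   format group;
   title 'Randomization List with Group Codes Only';
RUN;
```

The key linking group codes 1, 2, and 3 to the placebo and the two drugs would then be stored separately from this list.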

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

To create a random assignment for a completely randomized design with two factors, you can just modify the IF statement in the previous example. The following program generates a random assignment of treatments to 30 subjects, in which Factor A has 2 levels and Factor B has 3 levels (and hence 6 treatments). The code is similar to the code from the previous example except the IF statement now divides the 30 subjects into 6 treatment groups and (arbitrarily) assigns the levels of factors A and B to the groups:

\n", "
" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Random Assignment for CRD with Two Factors

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
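| Obs | Unit | random | factorA | factorB |
| --- | --- | --- | --- | --- |
| 1 | 9 | 0.04052 | 1 | 1 |
| 2 | 21 | 0.04733 | 1 | 1 |
| 3 | 17 | 0.07038 | 1 | 1 |
| 4 | 20 | 0.11335 | 1 | 1 |
| 5 | 19 | 0.12459 | 1 | 1 |
| 6 | 7 | 0.14093 | 1 | 2 |
| 7 | 29 | 0.23206 | 1 | 2 |
| 8 | 10 | 0.24267 | 1 | 2 |
| 9 | 14 | 0.27161 | 1 | 2 |
| 10 | 26 | 0.28117 | 1 | 2 |
| 11 | 4 | 0.31276 | 1 | 3 |
| 12 | 18 | 0.34512 | 1 | 3 |
| 13 | 15 | 0.37393 | 1 | 3 |
| 14 | 28 | 0.37724 | 1 | 3 |
| 15 | 2 | 0.40480 | 1 | 3 |
| 16 | 6 | 0.42829 | 2 | 1 |
| 17 | 23 | 0.47371 | 2 | 1 |
| 18 | 11 | 0.48031 | 2 | 1 |
| 19 | 13 | 0.48552 | 2 | 1 |
| 20 | 12 | 0.48943 | 2 | 1 |
| 21 | 1 | 0.50155 | 2 | 2 |
| 22 | 3 | 0.53892 | 2 | 2 |
| 23 | 16 | 0.54762 | 2 | 2 |
| 24 | 24 | 0.69272 | 2 | 2 |
| 25 | 30 | 0.74252 | 2 | 2 |
| 26 | 5 | 0.77423 | 2 | 3 |
| 27 | 27 | 0.80270 | 2 | 3 |
| 28 | 8 | 0.82113 | 2 | 3 |
| 29 | 25 | 0.84338 | 2 | 3 |
| 30 | 22 | 0.95571 | 2 | 3 |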
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA random2;\n", " set exper1;\n", " random=ranuni(4901);\n", "RUN;\n", " \n", "PROC SORT data=random2;\n", " by random;\n", "RUN;\n", " \n", "DATA random2;\n", " set random2;\n", " if _N_ le 5 then \n", " do; factorA = 1; factorB = 1; end;\n", " else if _N_ gt 5 and _N_ le 10 then \n", " do; factorA = 1; factorB = 2; end;\n", " else if _N_ gt 10 and _N_ le 15 then\n", " do; factorA = 1; factorB = 3; end;\n", " else if _N_ gt 15 and _N_ le 20 then\n", " do; factorA = 2; factorB = 1; end;\n", " else if _N_ gt 20 and _N_ le 25 then \n", " do; factorA = 2; factorB = 2; end;\n", " else if _N_ gt 25 and _N_ le 30 then\n", " do; factorA = 2; factorB = 3; end;\n", "RUN;\n", " \n", "PROC PRINT data = random2;\n", " title 'Random Assignment for CRD with Two Factors';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

It's probably best if you first launch and run the SAS program, so you can review the resulting output to convince yourself that the code did indeed generate the desired treatment plan. You should see that five of the subjects were randomly assigned to the A=1, B=1 group, five to the A=1, B=2 group, five to the A=1, B=3 group, and so on.

\n", "

Then, if you compare the code to the code from the previous example, the only substantial difference you should see is the difference between the two IF statements. As previously mentioned, the IF statement here divides the 30 subjects into 6 treatment groups and (arbitrarily) assigns the levels of factors A and B to the groups.
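
The same six groups can also be derived arithmetically from the row number rather than spelled out one branch at a time. The following is only a sketch of that alternative, not the lesson's method: it assumes random2 is still in the sorted order produced above, and the names random2b, cell, factorA_chk, and factorB_chk are hypothetical.

```sas
/* Sketch: derive the factor levels from the row number _N_.     */
DATA random2b;
   set random2;
   cell        = ceil(_N_ / 5);          /* 1..6, five rows each  */
   factorA_chk = ceil(cell / 3);         /* 1,1,1,2,2,2           */
   factorB_chk = mod(cell - 1, 3) + 1;   /* 1,2,3,1,2,3           */
RUN;

/* Each factorA value should pair with one factorA_chk value, */
/* and likewise for factorB.                                  */
PROC FREQ data=random2b;
   tables factorA*factorA_chk factorB*factorB_chk
          / norow nocol nopercent;
RUN;
```

The explicit IF-THEN-ELSE version is easier to read for a small design like this one; the arithmetic version mainly saves typing when the number of treatment groups grows.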

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

Thus far, our random assignments have not involved dealing with a blocking factor. As you know, it is natural in some experiments to block some of the experimental units together in an attempt to reduce unnecessary variability in your measurements that might otherwise prevent you from making good treatment comparisons. Suppose, for example, that your workload would prevent you from making more than nine experimental measurements in a day. Then, it would be a good idea to treat day as a blocking factor. The following program creates a random assignment for 27 subjects in a randomized block design with one factor having three levels.

\n", "
" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

EXPER2: Definition of Experimental Units

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
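| Obs | block | unit |
| --- | --- | --- |
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 3 |
| 4 | 1 | 4 |
| 5 | 1 | 5 |
| 6 | 1 | 6 |
| 7 | 1 | 7 |
| 8 | 1 | 8 |
| 9 | 1 | 9 |
| 10 | 2 | 10 |
| 11 | 2 | 11 |
| 12 | 2 | 12 |
| 13 | 2 | 13 |
| 14 | 2 | 14 |
| 15 | 2 | 15 |
| 16 | 2 | 16 |
| 17 | 2 | 17 |
| 18 | 2 | 18 |
| 19 | 3 | 19 |
| 20 | 3 | 20 |
| 21 | 3 | 21 |
| 22 | 3 | 22 |
| 23 | 3 | 23 |
| 24 | 3 | 24 |
| 25 | 3 | 25 |
| 26 | 3 | 26 |
| 27 | 3 | 27 |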
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Random Assignment for RBD: Sorted in BLOCK-TRT Order

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
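| block | unit | random | k | trt |
| --- | --- | --- | --- | --- |
| 1 | 5 | 0.17083 | 0 | 1 |
| 1 | 8 | 0.18781 | 1 | 1 |
| 1 | 6 | 0.19400 | 2 | 1 |
| 1 | 9 | 0.40043 | 3 | 2 |
| 1 | 7 | 0.58852 | 4 | 2 |
| 1 | 4 | 0.60226 | 5 | 2 |
| 1 | 3 | 0.65488 | 6 | 3 |
| 1 | 2 | 0.79977 | 7 | 3 |
| 1 | 1 | 0.93968 | 8 | 3 |
| 2 | 12 | 0.06810 | 0 | 1 |
| 2 | 13 | 0.08280 | 1 | 1 |
| 2 | 16 | 0.23191 | 2 | 1 |
| 2 | 15 | 0.27690 | 3 | 2 |
| 2 | 11 | 0.38198 | 4 | 2 |
| 2 | 14 | 0.66677 | 5 | 2 |
| 2 | 18 | 0.84177 | 6 | 3 |
| 2 | 10 | 0.91906 | 7 | 3 |
| 2 | 17 | 0.93312 | 8 | 3 |
| 3 | 21 | 0.09791 | 0 | 1 |
| 3 | 23 | 0.11455 | 1 | 1 |
| 3 | 22 | 0.21569 | 2 | 1 |
| 3 | 20 | 0.30461 | 3 | 2 |
| 3 | 26 | 0.30534 | 4 | 2 |
| 3 | 27 | 0.32876 | 5 | 2 |
| 3 | 24 | 0.46627 | 6 | 3 |
| 3 | 25 | 0.74756 | 7 | 3 |
| 3 | 19 | 0.91401 | 8 | 3 |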
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Random Assignment for RBD: Sorted in BLOCK-UNIT Order

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
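| block | unit | random | k | trt |
| --- | --- | --- | --- | --- |
| 1 | 1 | 0.93968 | 8 | 3 |
| 1 | 2 | 0.79977 | 7 | 3 |
| 1 | 3 | 0.65488 | 6 | 3 |
| 1 | 4 | 0.60226 | 5 | 2 |
| 1 | 5 | 0.17083 | 0 | 1 |
| 1 | 6 | 0.19400 | 2 | 1 |
| 1 | 7 | 0.58852 | 4 | 2 |
| 1 | 8 | 0.18781 | 1 | 1 |
| 1 | 9 | 0.40043 | 3 | 2 |
| 2 | 10 | 0.91906 | 7 | 3 |
| 2 | 11 | 0.38198 | 4 | 2 |
| 2 | 12 | 0.06810 | 0 | 1 |
| 2 | 13 | 0.08280 | 1 | 1 |
| 2 | 14 | 0.66677 | 5 | 2 |
| 2 | 15 | 0.27690 | 3 | 2 |
| 2 | 16 | 0.23191 | 2 | 1 |
| 2 | 17 | 0.93312 | 8 | 3 |
| 2 | 18 | 0.84177 | 6 | 3 |
| 3 | 19 | 0.91401 | 8 | 3 |
| 3 | 20 | 0.30461 | 3 | 2 |
| 3 | 21 | 0.09791 | 0 | 1 |
| 3 | 22 | 0.21569 | 2 | 1 |
| 3 | 23 | 0.11455 | 1 | 1 |
| 3 | 24 | 0.46627 | 6 | 3 |
| 3 | 25 | 0.74756 | 7 | 3 |
| 3 | 26 | 0.30534 | 4 | 2 |
| 3 | 27 | 0.32876 | 5 | 2 |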
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA exper2 (drop = j);\n", " DO block = 1 to 3;\n", " DO j = 1 to 9; \n", " if block = 1 then do; \n", " unit = j; \n", " output; \n", " end;\n", " else if block = 2 then do; \n", " unit = j + 9; \n", " output; \n", " end;\n", " else if block = 3 then do; \n", " unit = j + 18; \n", " output; \n", " end;\n", " END;\n", " END;\n", "RUN;\n", "\n", "PROC PRINT data=exper2; \n", " title 'EXPER2: Definition of Experimental Units'; \n", "RUN;\n", "\n", "DATA random3;\n", " set exper2;\n", " random=ranuni(7214508);\n", "RUN;\n", "\n", "PROC SORT data=random3; \n", " by block random; \n", "RUN;\n", "\n", "DATA random3;\n", " set random3;\n", " by block;\n", " \n", " if first.block then k=0;\n", " else k=k+1;\n", " \n", " if k in (0,1,2) then trt=1;\n", " else if k in (3,4,5) then trt=2;\n", " else if k in (6,7,8) then trt=3;\n", " \n", " retain k;\n", "RUN;\n", "\n", "PROC PRINT data=random3 noobs;\n", " title 'Random Assignment for RBD: Sorted in BLOCK-TRT Order';\n", "RUN;\n", "\n", "PROC SORT data=random3; \n", " by block unit; \n", "RUN;\n", "\n", "PROC PRINT data=random3 noobs;\n", " title 'Random Assignment for RBD: Sorted in BLOCK-UNIT Order';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

As you can see, the exper2 data set is created to contain one observation for each of the experimental units (27 subjects here). The variable unit contains an arbitrary label (1, 2, ..., 27) assigned to each of the experimental units. The variable block, which identifies the block number (1, 2, and 3), divides the experimental units up into three equal-sized blocks of nine.

\n", "

Now, to create the random assignment:

\n", "
    \n", "
  • We use the ranuni function to generate a uniform random number between 0 and 1 for each observation.
  • \n", "
  • Then, within each block, we sort the data in order of the random number.
  • \n", "
  • Then, we create a counter variable to count the number of observations within each block: for the first observation within each block (\"if first.block\"), we set the counter (k) to 0; otherwise we increase the counter by 1 for each observation within the block. (For this to work, we must retain k from iteration to iteration).
  • \n", "
  • Using an IF-THEN-ELSE construct, within each block, we assign the first three units in sorted order (k=0,1,2) to group 1, the second three (k=3,4,5) to group 2, and the last three (k=6,7,8) to group 3.
  • \n", "
\n", "
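
Because k runs from 0 to 8 within each block, the treatment number can also be computed directly from k. The sketch below is an alternative to the IF-THEN-ELSE construct, not the lesson's own code; the data set random3_chk and the variable trt_chk are hypothetical names, and random3 is assumed to exist as created above.

```sas
/* Sketch: k = 0,1,2 -> 1;  k = 3,4,5 -> 2;  k = 6,7,8 -> 3.    */
DATA random3_chk;
   set random3;
   trt_chk = floor(k/3) + 1;
RUN;

/* trt and trt_chk should agree on every observation, so only   */
/* the diagonal cells of this table should be nonzero.          */
PROC FREQ data=random3_chk;
   tables trt*trt_chk / norow nocol nopercent;
RUN;
```
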

Depending on how the experiment will be conducted, you can print the random assignment in different orders:

\n", "
    \n", "
  • First, the randomization is printed in order of treatment within each block. This will accommodate experiments for which it is natural to perform the treatments in groups on the randomized experimental units.
  • \n", "
  • Then, the randomization is printed in order of units within block. This will accommodate experiments for which it is natural to perform the treatments in random order on consecutive experimental units.
  • \n", "
\n", "
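
Whichever order the list is printed in, it is worth confirming that the design is balanced. A minimal sketch, assuming the random3 data set created above (the title text is illustrative):

```sas
/* Sketch: each block-by-treatment cell should contain 3 units. */
PROC FREQ data=random3;
   tables block*trt / norow nocol nopercent;
   title 'Check: Units per Treatment within Each Block';
RUN;
```
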
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simulating Random Numbers\n", "\n", "In statistical research, it is a rather common practice to generate (i.e., \"simulate\") numbers that follow some underlying probability distribution. Fortunately, SAS has several random number generator functions available to simulate random phenomena with certain probability distributions. We'll take a quick look at just three of the possible functions here. The others that are available in SAS work similarly.\n", "\n", "
\n", "

Example

\n", "

The following program uses the rannor() function to generate 200 random normal variables with mean 140 and standard deviation 20:
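
Since rannor returns a draw from a standard normal distribution (mean 0, standard deviation 1), a normal variate with mean 140 and standard deviation 20 is obtained by multiplying the draw by 20 and adding 140. The following is only a sketch of that transformation: the data set name simnorm and the seed 2021 are hypothetical and will not reproduce the output shown in this lesson.

```sas
/* Sketch: shift and scale standard normal draws.                */
/* x = mean + sd * z, where z comes from rannor(seed).           */
DATA simnorm;
   do i = 1 to 200;
      x = 140 + 20*rannor(2021);
      output;
   end;
RUN;

/* The sample mean and standard deviation should be near        */
/* 140 and 20, but will not equal them exactly.                  */
PROC MEANS data=simnorm n mean std;
   var x;
RUN;
```
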

\n", "
" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Simulated Normal Variate

\n", "

with Mean 140 and Standard Deviation 20

\n", "
\n", "
\n", "

The UNIVARIATE Procedure

\n", "

Variable: x

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
| Moments | | | |
| --- | --- | --- | --- |
| N | 200 | Sum Weights | 200 |
| Mean | 140.735977 | Sum Observations | 28147.1955 |
| Std Deviation | 17.7951858 | Variance | 316.668638 |
| Skewness | -0.2457294 | Kurtosis | -0.0197441 |
| Uncorrected SS | 4024340.13 | Corrected SS | 63017.0591 |
| Coeff Variation | 12.6443758 | Std Error Mean | 1.25830966 |
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Basic Statistical Measures
Location               Variability
Mean      140.7360     Std Deviation          17.79519
Median    140.6105     Variance               316.66864
Mode      .            Range                  97.68601
                       Interquartile Range    23.73450
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Tests for Location: Mu0=0
Test           Statistic        p Value
Student's t    t    111.8453    Pr > |t|     <.0001
Sign           M    100         Pr >= |M|    <.0001
Signed Rank    S    10050       Pr >= |S|    <.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Quantiles (Definition 5)
Level         Quantile
100% Max      187.6195
99%           177.7766
95%           168.5732
90%           164.1271
75% Q3        153.7170
50% Median    140.6105
25% Q1        129.9825
10%           118.6743
5%            109.5800
1%            95.7932
0% Min        89.9334
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Extreme Observations
Lowest              Highest
Value      Obs      Value      Obs
89.9334    175      170.514    110
95.4514    47       176.143    4
96.1350    133      176.355    127
96.7819    35       179.198    199
96.8844    62       187.619    181
\n", "
\n", "
\n", "
\n", "\"Plots\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA simula1;\n", " do i = 1 to 200;\n", " x = 140 + 20*rannor(3452083);\n", " output;\n", " end;\n", "RUN;\n", " \n", "PROC UNIVARIATE data=simula1 plot;\n", " title1 'Simulated Normal Variate';\n", " title2 'with Mean 140 and Standard Deviation 20';\n", " var x;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The rannor( ) function returns a (pseudo) random number from a standard normal distribution with mean 0 and standard deviation 1. The x= assignment statement rescales and shifts that number so that it comes from a normal distribution with mean 140 and standard deviation 20. The OUTPUT statement is needed to write the random number to the data set after each iteration of the DO loop. If the OUTPUT statement is not present, you end up with only one random number, namely the last one generated. Incidentally, the rannor( ) function is an alias for the normal( ) function.
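To see the role of the OUTPUT statement for yourself, here is a minimal sketch that repeats the data step with the OUTPUT statement left out. The data set name no_output is a hypothetical name chosen for illustration; everything else mirrors the program above.

```sas
/* Sketch only: same loop as simula1, but with no OUTPUT statement. */
/* The data set name no_output is hypothetical.                     */
DATA no_output;
    do i = 1 to 200;
        x = 140 + 20*rannor(3452083);
        /* nothing is written to the data set inside the loop */
    end;
    /* the implicit output at the end of the data step writes one row */
RUN;

PROC PRINT data=no_output;
    title 'DO loop without an OUTPUT statement';
RUN;
```

The printed data set should contain a single observation: the last value of x generated, along with i = 201, the value the index variable holds once the loop has finished.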

\n", "

Launch and run the SAS program so you can review the output from the UNIVARIATE procedure. You should see a stem-and-leaf plot, a boxplot, and a normal probability plot that make it believable that the data arose from a normal distribution. You might also want to check the sample mean and sample standard deviation to see how close they are to 140 and 20, respectively, with a sample of just 200 observations.
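The shift-and-scale step in the x= assignment can be written out as a short derivation. The last line, the standard error of the sample mean, is an added yardstick for judging "close" and is not part of the original lesson.

```latex
\begin{align*}
Z &\sim N(0, 1), \qquad X = 140 + 20Z \\
E[X] &= 140 + 20\,E[Z] = 140 \\
\operatorname{SD}(X) &= 20\,\operatorname{SD}(Z) = 20 \\
\operatorname{SE}(\bar{X}) &= \frac{20}{\sqrt{200}} \approx 1.41
\end{align*}
```

A sample mean within a standard error or two of 140 is therefore about what we should expect from 200 observations.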

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following program uses the ranbin(seed, n, p) function to generate a random sample of 20 observations from a binomial distribution with n = 20 and p = 0.5:

\n", "
" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Simulated Binomial Variate

\n", "

with n = 20 and p = 0.5

\n", "
\n", "
\n", "

The UNIVARIATE Procedure

\n", "

Variable: b

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Moments
N                  20            Sum Weights         20
Mean               10.1          Sum Observations    202
Std Deviation      1.51830931    Variance            2.30526316
Skewness           -0.9884422    Kurtosis            1.25977357
Uncorrected SS     2084          Corrected SS        43.8
Coeff Variation    15.0327654    Std Error Mean      0.33950428
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Basic Statistical Measures
Location               Variability
Mean      10.10000     Std Deviation          1.51831
Median    10.50000     Variance               2.30526
Mode      11.00000     Range                  6.00000
                       Interquartile Range    2.00000
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Tests for Location: Mu0=0
Test           Statistic        p Value
Student's t    t    29.74926    Pr > |t|     <.0001
Sign           M    10          Pr >= |M|    <.0001
Signed Rank    S    105         Pr >= |S|    <.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Quantiles (Definition 5)
Level         Quantile
100% Max      12.0
99%           12.0
95%           12.0
90%           12.0
75% Q3        11.0
50% Median    10.5
25% Q1        9.0
10%           8.5
5%            7.0
1%            6.0
0% Min        6.0
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Extreme Observations
Lowest            Highest
Value    Obs      Value    Obs
6        8        11       14
8        11       11       20
9        19       12       6
9        17       12       15
9        13       12       18
\n", "
\n", "
\n", "
\n", "\"Plots\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA simula2;\n", " do i = 1 to 20;\n", " b = ranbin(2340234,20,0.5);\n", " output;\n", " end;\n", "RUN;\n", " \n", "PROC UNIVARIATE data=simula2 plot;\n", " title1 'Simulated Binomial Variate';\n", " title2 'with n = 20 and p = 0.5';\n", " var b;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program so you can review the output from the UNIVARIATE procedure. Compare the sample mean and sample standard deviation with their theoretical values of 10 (np) and 2.24 (the square root of np(1-p)), respectively. With a sample of just 20 observations, the sample mean (10.1) lands close to 10, while the sample standard deviation (1.52) falls noticeably below 2.24.
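For reference, the theoretical values quoted above follow from the usual binomial formulas. Here B denotes a single binomial variate (the variable b in the program); the standard error of the sample mean in the last line is an added yardstick, not part of the original lesson.

```latex
\begin{align*}
E[B] &= np = 20(0.5) = 10 \\
\operatorname{SD}(B) &= \sqrt{np(1-p)} = \sqrt{20(0.5)(0.5)} = \sqrt{5} \approx 2.24 \\
\operatorname{SE}(\bar{B}) &= \frac{\sqrt{5}}{\sqrt{20}} = 0.5
\end{align*}
```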

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following program uses the ranpoi(seed, mean) function to generate a random sample of 200 observations from a Poisson distribution with a mean of 4:

\n", "
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Simulated Poisson Variate with Mean 4

\n", "
\n", "
\n", "

The UNIVARIATE Procedure

\n", "

Variable: p

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Moments
N                  200           Sum Weights         200
Mean               3.84          Sum Observations    768
Std Deviation      2.0943571     Variance            4.38633166
Skewness           0.77122877    Kurtosis            0.7413239
Uncorrected SS     3822          Corrected SS        872.88
Coeff Variation    54.5405495    Std Error Mean      0.14809341
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Basic Statistical Measures
Location               Variability
Mean      3.840000     Std Deviation          2.09436
Median    3.000000     Variance               4.38633
Mode      3.000000     Range                  12.00000
                       Interquartile Range    3.00000
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Tests for Location: Mu0=0
Test           Statistic        p Value
Student's t    t    25.92958    Pr > |t|     <.0001
Sign           M    97.5        Pr >= |M|    <.0001
Signed Rank    S    9555        Pr >= |S|    <.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Quantiles (Definition 5)
Level         Quantile
100% Max      12.0
99%           9.5
95%           8.0
90%           7.0
75% Q3        5.0
50% Median    3.0
25% Q1        2.0
10%           1.5
5%            1.0
1%            0.0
0% Min        0.0
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Extreme Observations
Lowest            Highest
Value    Obs      Value    Obs
0        199      9        133
0        177      9        148
0        132      9        197
0        71       10       2
0        62       12       61
\n", "
\n", "
\n", "
\n", "\"Plots\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA simula3;\n", " do i = 1 to 200;\n", " p = ranpoi(67, 4);\n", " output;\n", " end;\n", "RUN;\n", " \n", "PROC UNIVARIATE data=simula3 plot;\n", " title 'Simulated Poisson Variate with Mean 4';\n", " var p;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program so you can review the output from the UNIVARIATE procedure. You might want to check the sample mean and sample standard deviation to see how close they are to 4 and 2 (the square root of the mean), respectively, with a sample of just 200 observations.
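A Poisson distribution has its variance equal to its mean, which is why the target standard deviation is the square root of 4. If you only want the handful of statistics needed for that comparison, a shorter check than the full UNIVARIATE output might look like the following sketch. It assumes the simula3 data set created above is still available in the session; the title text is our own.

```sas
/* Sketch: compare sample mean, standard deviation, and variance */
/* of the simulated Poisson values with 4, 2, and 4.             */
PROC MEANS data=simula3 n mean std var maxdec=3;
    title 'Poisson check: mean and variance should both be near 4';
    var p;
RUN;
```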

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises\n", "\n", "1. In this example, we will explore the sampling distribution of a sample proportion.\n", " a) Generate 500 sample of size 50 from a Binomial distribution with n = 1 and p = 0.3. Do this by generating a dataset with 500 rows and 50 columns all filled with random variates from this binomial distribution. Hint: Use an array and a nested DO loop.\n", " b) Calculate the mean of each of these 500 rows using mean(of ) in a DATA step. Save this mean as a new column.\n", " c) Plot a histogram of the 500 means calculated in part b and use PROC MEANS to calculate the MEAN and standard deviation of the 500 means calculated in part b. This describes the sampling distribution of p-hat when the population distiribution has probability of success p = 0.3 when the sample size is 30." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "SAS", "language": "sas", "name": "sas" }, "language_info": { "codemirror_mode": "sas", "file_extension": ".sas", "mimetype": "text/x-sas", "name": "sas" } }, "nbformat": 4, "nbformat_minor": 2 }