{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Manipulation\n", "\n", "In this section, we will cover three main topics:\n", "\n", "1. Reshaping data from wide (fat) to long (tall) formats\n", "2. Reshaping data from long (tall) to wide (fat) formats\n", "3. Merging datasets\n", "\n", "To reshape datasets, we will cover two methods:\n", "\n", "* PROC TRANSPOSE\n", "* Using a DATA step with arrays\n", "\n", "In order to understand the second, more general method, we will first need to learn about a few SAS programming keywords and structures, such as:\n", "\n", "* The OUTPUT and RETAIN statements\n", "* Loops in SAS\n", "* SAS Arrays\n", "* FIRST. and LAST. SAS variables\n", "\n", "## The OUTPUT and RETAIN Statements\n", "\n", "When processing any DATA step, SAS follows two default procedures:\n", "\n", "1. When SAS reads the DATA statement at the beginning of each iteration of the DATA step, SAS places missing values in the program data vector for variables that were assigned by either an INPUT statement or an assignment statement within the DATA step. (SAS does not reset variables to missing if they were created by a SUM statement, or if the values came from a SAS data set via a SET or MERGE statement.)\n", "2. 
At the end of each iteration of the DATA step, SAS outputs the values of the variables in the program data vector to the SAS data set being created.\n", "\n", "In this lesson, we'll learn how to modify these default processes by using the OUTPUT and RETAIN statements:\n", "\n", "* The **OUTPUT** statement allows you to control when and to which data set you want an observation written.\n", "* The **RETAIN** statement causes a variable created in the DATA step to retain its value from the current observation into the next observation rather than being set to missing at the beginning of each iteration of the DATA step.\n", "\n", "### The OUTPUT Statement\n", "\n", "An OUTPUT statement overrides the default process by telling SAS to output the current observation when the OUTPUT statement is processed, not at the end of the DATA step. The OUTPUT statement takes the form:\n", "\n", "`OUTPUT dataset1 dataset2 ... datasetn;`\n", "\n", "where you may name as few or as many data sets as you like. If you use an OUTPUT statement without specifying a data set name, SAS writes the current observation to each of the data sets named in the DATA step. Any data set name appearing in the OUTPUT statement must also appear in the DATA statement.\n", "\n", "The OUTPUT statement is pretty powerful in that, among other things, it gives us a way:\n", "\n", "* to write observations to multiple data sets\n", "* to control output of observations to data sets based on certain conditions\n", "* to transpose datasets using the OUTPUT statement in conjunction with the RETAIN statement, BY group processing and the LAST.variable temporary variable\n", "\n", "Throughout the rest of this section, we'll look at examples that illustrate how to use OUTPUT statements correctly. 
We'll work with the following subset of the ICDB Study's log data set (see the course website for icdblog.sas7bdat):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SAS Connection established. Subprocess id is 2262\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "

| Obs | SUBJ | V_TYPE | V_DATE | FORM |
|---|---|---|---|---|
| 1 | 210006 | 12 | 05/06/94 | cmed |
| 2 | 210006 | 12 | 05/06/94 | diet |
| 3 | 210006 | 12 | 05/06/94 | med |
| 4 | 210006 | 12 | 05/06/94 | phytrt |
| 5 | 210006 | 12 | 05/06/94 | purg |

This example uses the OUTPUT statement to tell SAS to write observations to data sets based on certain conditions. Specifically, the following program uses the OUTPUT statement to create three SAS data sets — s210006, s310032, and s410010 — based on whether the subject identification numbers in the icdblog data set meet a certain condition:

| SUBJ | V_TYPE | V_DATE | FORM |
|---|---|---|---|
| 210006 | 12 | 05/06/94 | cmed |
| 210006 | 12 | 05/06/94 | diet |
| 210006 | 12 | 05/06/94 | med |
| 210006 | 12 | 05/06/94 | phytrt |
| 210006 | 12 | 05/06/94 | purg |

| SUBJ | V_TYPE | V_DATE | FORM |
|---|---|---|---|
| 310032 | 24 | 09/19/95 | backf |
| 310032 | 24 | 09/19/95 | cmed |
| 310032 | 24 | 09/19/95 | diet |
| 310032 | 24 | 09/19/95 | med |
| 310032 | 24 | 09/19/95 | medhxf |

| SUBJ | V_TYPE | V_DATE | FORM |
|---|---|---|---|
| 410010 | 6 | 05/12/94 | cmed |
| 410010 | 6 | 05/12/94 | diet |
| 410010 | 6 | 05/12/94 | med |
| 410010 | 6 | 05/12/94 | phytrt |
| 410010 | 6 | 05/12/94 | purg |

As you can see, the DATA statement contains three data set names: s210006, s310032, and s410010. That tells SAS that we want to create three data sets with the given names. The SET statement, of course, tells SAS to read observations from the permanent data set called stat481.icdblog. Then come the IF-THEN-ELSE and OUTPUT statements that make it all work. The first IF-THEN tells SAS to output any observations pertaining to subject 210006 to the s210006 data set; the second IF-THEN tells SAS to output any observations pertaining to subject 310032 to the s310032 data set; and the third IF-THEN statement tells SAS to output any observations pertaining to subject 410010 to the s410010 data set. SAS will report an error if a data set name appears in an OUTPUT statement without also appearing in the DATA statement.

The PRINT procedures, of course, tell SAS to print the three newly created data sets. Note that the last PRINT procedure does not have a DATA= option. That's because when you name more than one data set in a single DATA statement, the last name on the DATA statement is the most recently created data set, and the one that subsequent procedures use by default. Therefore, the last PRINT procedure will print the s410010 data set by default.

Note that the IF-THEN-ELSE construct used here in conjunction with the OUTPUT statement is comparable to attaching the WHERE= option to each of the data sets appearing in the DATA statement.

Before running the code, be sure that you have saved the icdblog dataset and changed the LIBNAME statement to point to the folder where you saved it.
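
The program described above is not reproduced in this section, so here is a minimal sketch of what it might look like. To stay self-contained, it builds a small temporary data set named icdblog from a few of the rows printed above instead of reading the permanent stat481.icdblog data set. That stand-in, and the explicit DATA= option on the last PROC PRINT, are assumptions of this sketch rather than the course's own code.

```sas
/* Hypothetical stand-in for stat481.icdblog, built from rows printed above */
data icdblog;
  input subj v_type v_date :mmddyy8. form $;
  format v_date mmddyy8.;
  datalines;
210006 12 05/06/94 cmed
210006 12 05/06/94 diet
310032 24 09/19/95 backf
310032 24 09/19/95 cmed
410010 6 05/12/94 cmed
410010 6 05/12/94 diet
;
run;

/* Route each observation to the data set that matches its subject id */
data s210006 s310032 s410010;
  set icdblog;
  if subj = 210006 then output s210006;
  else if subj = 310032 then output s310032;
  else if subj = 410010 then output s410010;
run;

proc print data = s210006 noobs;
  title 'The s210006 data set';
run;

proc print data = s310032 noobs;
  title 'The s310032 data set';
run;

/* DATA= is spelled out here so the sketch does not rely on the default */
proc print data = s410010 noobs;
  title 'The s410010 data set';
run;
```

Every data set named in an OUTPUT statement also appears in the DATA statement, as the text requires.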
Using an OUTPUT statement suppresses the automatic output of observations at the end of the DATA step. Therefore, if you plan to use any OUTPUT statements in a DATA step, you must use OUTPUT statements to program all of the output for that step. The following SAS program illustrates what happens if you fail to direct all of the observations to output:

| SUBJ | V_TYPE | V_DATE | FORM |
|---|---|---|---|
| 210006 | 12 | 05/06/94 | cmed |
| 210006 | 12 | 05/06/94 | diet |
| 210006 | 12 | 05/06/94 | med |
| 210006 | 12 | 05/06/94 | phytrt |
| 210006 | 12 | 05/06/94 | purg |
| 210006 | 12 | 05/06/94 | qul |
| 210006 | 12 | 05/06/94 | sympts |
| 210006 | 12 | 05/06/94 | urn |
| 210006 | 12 | 05/06/94 | void |


    PROC PRINT data = subj310032 NOOBS;
       title 'The subj310032 data set';
    RUN;
    NOTE: No observations in data set WORK.SUBJ310032.
    NOTE: PROCEDURE PRINT used (Total process time):
          real time           0.00 seconds
          cpu time            0.00 seconds

The DATA statement contains two data set names, subj210006 and subj310032, telling SAS that we intend to create two data sets. However, as you can see, the IF statement contains an OUTPUT statement that directs output to the subj210006 data set, but no OUTPUT statement directs output to the subj310032 data set. Launch and run the SAS program to convince yourself that the subj210006 data set contains data for subject 210006, while the subj310032 data set contains 0 observations. You should see a message in the log window like the one shown above as well as see that no output for the subj310032 data set appears in the output window.
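
A minimal sketch of a DATA step with this behavior follows. The input data set is a hypothetical temporary stand-in built from a few rows of the log data shown earlier, not the permanent data set used in the course.

```sas
/* Hypothetical stand-in for the icdblog data */
data icdblog;
  input subj v_type v_date :mmddyy8. form $;
  format v_date mmddyy8.;
  datalines;
210006 12 05/06/94 cmed
210006 12 05/06/94 diet
310032 24 09/19/95 backf
310032 24 09/19/95 cmed
;
run;

data subj210006 subj310032;
  set icdblog;
  /* Only subj210006 is ever named in an OUTPUT statement. Because an   */
  /* OUTPUT statement is present, the automatic output at the end of    */
  /* each iteration is suppressed, so subj310032 gets no observations.  */
  if subj = 210006 then output subj210006;
run;

proc print data = subj210006 noobs;
  title 'The subj210006 data set';
run;

proc print data = subj310032 noobs;
  title 'The subj310032 data set';
run;
```

Adding a second statement such as `else if subj = 310032 then output subj310032;` would direct the remaining observations to the second data set.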
If you use an assignment statement to create a new variable in a DATA step in the presence of OUTPUT statements, you have to make sure that you place the assignment statement before the OUTPUT statements. Otherwise, SAS will have already written the observation to the SAS data set, and the newly created variable will be set to missing. The following SAS program illustrates an example of how two variables, current and days_vis, get set to missing in the output data sets because their values get calculated after SAS has already written the observation to the SAS data set:

| SUBJ | V_TYPE | V_DATE | FORM | current | days_vis |
|---|---|---|---|---|---|
| 310032 | 24 | 09/19/95 | backf | . | . |
| 310032 | 24 | 09/19/95 | cmed | . | . |
| 310032 | 24 | 09/19/95 | diet | . | . |
| 310032 | 24 | 09/19/95 | med | . | . |
| 310032 | 24 | 09/19/95 | medhxf | . | . |
| 310032 | 24 | 09/19/95 | phs | . | . |
| 310032 | 24 | 09/19/95 | phytrt | . | . |
| 310032 | 24 | 09/19/95 | preg | . | . |
| 310032 | 24 | 09/19/95 | purg | . | . |
| 310032 | 24 | 09/19/95 | qul | . | . |
| 310032 | 24 | 09/19/95 | sympts | . | . |
| 310032 | 24 | 09/19/95 | urn | . | . |
| 310032 | 24 | 09/19/95 | void | . | . |

The main thing to note in this program is that the current and days_vis assignment statements appear after the IF-THEN-ELSE and OUTPUT statements. That means that each observation will be written to one of the three output data sets before the current and days_vis values are even calculated. Because SAS sets variables created in the DATA step to missing at the beginning of each iteration of the DATA step, the values of current and days_vis will remain missing for each observation.
By the way, the today( ) function, which is assigned to the variable current, creates a date variable containing today's date. Therefore, the variable days_vis is meant to contain the number of days since the subject's recorded visit v_date. However, as described above, the values of current and days_vis get set to missing. Launch and run the SAS program to convince yourself that the current and days_vis variables in the subj310032 data set contain only missing values. If we were to print the subj210006 and subj410010 data sets, we would see the same thing.

The following SAS program illustrates the corrected code for the previous DATA step, that is, for creating new variables with assignment statements in the presence of OUTPUT statements:

| SUBJ | V_TYPE | V_DATE | FORM | current | days_vis |
|---|---|---|---|---|---|
| 310032 | 24 | 09/19/95 | backf | 09/30/20 | 9143 |
| 310032 | 24 | 09/19/95 | cmed | 09/30/20 | 9143 |
| 310032 | 24 | 09/19/95 | diet | 09/30/20 | 9143 |
| 310032 | 24 | 09/19/95 | med | 09/30/20 | 9143 |
| 310032 | 24 | 09/19/95 | medhxf | 09/30/20 | 9143 |
| 310032 | 24 | 09/19/95 | phs | 09/30/20 | 9143 |
| 310032 | 24 | 09/19/95 | phytrt | 09/30/20 | 9143 |
| 310032 | 24 | 09/19/95 | preg | 09/30/20 | 9143 |
| 310032 | 24 | 09/19/95 | purg | 09/30/20 | 9143 |
| 310032 | 24 | 09/19/95 | qul | 09/30/20 | 9143 |
| 310032 | 24 | 09/19/95 | sympts | 09/30/20 | 9143 |
| 310032 | 24 | 09/19/95 | urn | 09/30/20 | 9143 |
| 310032 | 24 | 09/19/95 | void | 09/30/20 | 9143 |

Now, since the assignment statements precede the OUTPUT statements, the variables are correctly written to the output data sets. That is, the variable current now contains the date on which the program was run, and the variable days_vis contains the number of days between the subject's visit and that date. Launch and run the SAS program to convince yourself that the current and days_vis variables are properly written to the subj310032 data set. If we were to print the subj210006 and subj410010 data sets, we would see similar results.
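
A minimal sketch of the corrected ordering is shown below. The input data set is a hypothetical stand-in built from rows printed earlier, and the output data set names are assumed to follow the subject ids. The values of current and days_vis depend on the day the program runs.

```sas
/* Hypothetical stand-in for the icdblog data */
data icdblog;
  input subj v_type v_date :mmddyy8. form $;
  format v_date mmddyy8.;
  datalines;
210006 12 05/06/94 cmed
310032 24 09/19/95 backf
310032 24 09/19/95 cmed
410010 6 05/12/94 cmed
;
run;

data subj210006 subj310032 subj410010;
  set icdblog;
  /* Compute the new variables BEFORE any OUTPUT statement runs */
  current = today();
  days_vis = current - v_date;
  format current mmddyy8.;
  if subj = 210006 then output subj210006;
  else if subj = 310032 then output subj310032;
  else if subj = 410010 then output subj410010;
run;

proc print data = subj310032 noobs;
  title 'The subj310032 data set';
run;
```

Moving the two assignment statements below the IF-THEN-ELSE block reproduces the missing values shown in the earlier output.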
After SAS processes an OUTPUT statement within a DATA step, the observation remains in the program data vector and you can continue programming with it. You can even output the observation again to the same SAS data set or to a different one! The following SAS program illustrates how you can create different data sets with some of the same observations. That is, the data sets created in your DATA statement do not have to be mutually exclusive:

| SUBJ | V_TYPE | V_DATE | FORM |
|---|---|---|---|
| 210006 | 12 | 05/06/94 | sympts |
| 310032 | 24 | 09/19/95 | sympts |
| 410010 | 6 | 05/12/94 | sympts |

| SUBJ | V_TYPE | V_DATE | FORM |
|---|---|---|---|
| 410010 | 6 | 05/12/94 | cmed |
| 410010 | 6 | 05/12/94 | diet |
| 410010 | 6 | 05/12/94 | med |
| 410010 | 6 | 05/12/94 | phytrt |
| 410010 | 6 | 05/12/94 | purg |
| 410010 | 6 | 05/12/94 | qul |
| 410010 | 6 | 05/12/94 | sympts |
| 410010 | 6 | 05/12/94 | urn |
| 410010 | 6 | 05/12/94 | void |

The DATA step creates two temporary data sets, symptoms and visitsix. The symptoms data set contains only those observations containing a form code of sympts. The visitsix data set, on the other hand, contains observations for which v_type equals 6. The observations in the two data sets are therefore not necessarily mutually exclusive. In fact, launch and run the SAS program and review the output from the PRINT procedures. Note that the observation for subject 410010 in which form = sympts is contained in both the symptoms and visitsix data sets.
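
A minimal sketch of such a DATA step follows, using a hypothetical stand-in for the icdblog data built from rows printed above. The key point is that the two IF statements are independent (there is no ELSE), so one observation can satisfy both conditions and be written twice.

```sas
/* Hypothetical stand-in for the icdblog data */
data icdblog;
  input subj v_type v_date :mmddyy8. form $;
  format v_date mmddyy8.;
  datalines;
210006 12 05/06/94 sympts
310032 24 09/19/95 sympts
410010 6 05/12/94 cmed
410010 6 05/12/94 sympts
;
run;

data symptoms visitsix;
  set icdblog;
  /* Two separate tests: an observation may pass both */
  if form = 'sympts' then output symptoms;
  if v_type = 6 then output visitsix;
run;

proc print data = symptoms noobs;
  title 'The symptoms data set';
run;

proc print data = visitsix noobs;
  title 'The visitsix data set';
run;
```

With this input, the observation for subject 410010 with form sympts lands in both data sets.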

| idno | l_name | gtype | grade |
|---|---|---|---|
| 10 | Smith | E1 | 78 |
| 10 | Smith | E2 | 82 |
| 10 | Smith | E3 | 86 |
| 10 | Smith | E4 | 69 |
| 10 | Smith | P1 | 97 |

The following SAS program illustrates the SAS variables FIRST. and LAST. that can be obtained when using the BY statement on a sorted dataset in a DATA step to identify the first and last grade record for each student in the dataset.

| Obs | idno | l_name | gtype | grade | firstGrade | lastGrade |
|---|---|---|---|---|---|---|
| 1 | 10 | Smith | E1 | 78 | 1 | 0 |
| 2 | 10 | Smith | E2 | 82 | 0 | 0 |
| 3 | 10 | Smith | E3 | 86 | 0 | 0 |
| 4 | 10 | Smith | E4 | 69 | 0 | 0 |
| 5 | 10 | Smith | P1 | 97 | 0 | 0 |
| 6 | 10 | Smith | F1 | 160 | 0 | 1 |
| 7 | 11 | Simon | E1 | 88 | 1 | 0 |
| 8 | 11 | Simon | E2 | 72 | 0 | 0 |
| 9 | 11 | Simon | E3 | 86 | 0 | 0 |
| 10 | 11 | Simon | E4 | 99 | 0 | 0 |
| 11 | 11 | Simon | P1 | 100 | 0 | 0 |
| 12 | 11 | Simon | F1 | 170 | 0 | 1 |
| 13 | 12 | Jones | E1 | 98 | 1 | 0 |
| 14 | 12 | Jones | E2 | 92 | 0 | 0 |
| 15 | 12 | Jones | E3 | 92 | 0 | 0 |
| 16 | 12 | Jones | E4 | 99 | 0 | 0 |
| 17 | 12 | Jones | P1 | 99 | 0 | 0 |
| 18 | 12 | Jones | F1 | 185 | 0 | 1 |

Because we are doing BY group processing on the variable idno, we must have the dataset sorted by idno. In this case the dataset was actually already sorted by idno, but I added the PROC SORT anyway to emphasize that the dataset must be sorted first.
The SET and BY statements tell SAS to process the data by grouping observations with the same idno together. To do this, SAS automatically creates two temporary variables for each variable named in the BY statement. One of the temporary variables is called FIRST.variable, where variable is the variable name appearing in the BY statement. The other temporary variable is called LAST.variable. Both take the values 0 or 1: FIRST.variable is 1 for the first observation in a BY group and 0 otherwise, and LAST.variable is 1 for the last observation in a BY group and 0 otherwise.

SAS uses the values of the FIRST.variable and LAST.variable temporary variables to identify the first and last observations in a group, and therefore the group itself. Oh, a comment about that adjective temporary ... SAS places FIRST.variable and LAST.variable in the program data vector and they are therefore available for DATA step programming, but SAS does not add them to the SAS data set being created. It is in that sense that they are temporary.
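
A minimal sketch of BY-group processing with FIRST. and LAST. follows. The grades data here is a subset of the rows printed above, and the output data set name grades_flags is hypothetical.

```sas
/* Subset of the grades data shown above */
data grades;
  input idno l_name $ gtype $ grade;
  datalines;
10 Smith E1 78
10 Smith E2 82
10 Smith F1 160
11 Simon E1 88
11 Simon E2 72
11 Simon F1 170
;
run;

/* BY-group processing requires the data to be sorted by the BY variable */
proc sort data = grades;
  by idno;
run;

data grades_flags;
  set grades;
  by idno;
  /* Copy the temporary variables into variables that are written out */
  firstGrade = FIRST.idno;
  lastGrade = LAST.idno;
run;

proc print data = grades_flags;
  title 'Grades with FIRST.idno and LAST.idno flags';
run;
```

In the printed result, firstGrade is 1 only on each student's E1 row and lastGrade is 1 only on each student's F1 row.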
Because SAS does not write FIRST.variables and LAST.variables to output data sets, we have to do some finagling to see their contents. The two assignment statements:

    firstGrade = FIRST.idno;
    lastGrade = LAST.idno;

simply tell SAS to assign the values of the temporary variables, FIRST.idno and LAST.idno, to permanent variables, firstGrade and lastGrade, respectively. The PRINT procedure tells SAS to print the resulting data set so that we can take an inside peek at the values of the FIRST.variables and LAST.variables.
One of the most powerful uses of a RETAIN statement is to compare values across observations. The following program uses the RETAIN statement to compare values across observations, and in doing so determines each student's lowest grade of the four semester exams:

| Obs | idno | l_name | grade | lowgrade | gtype |
|---|---|---|---|---|---|
| 1 | 10 | Smith | 69 | 69 | E4 |
| 2 | 11 | Simon | 99 | 72 | E2 |
| 3 | 12 | Jones | 99 | 92 | E3 |

Because the instructor only wants to drop the lowest exam grade, the first DATA step tells SAS to create a data set called exams by selecting only the exam grades (E1, E2, E3, and E4) from the data set grades.
It's the second DATA step that is the meat of the program and the challenging one to understand. The DATA step searches through the exams data set for each subject ("by idno") and looks for the lowest grade ("min(lowgrade, grade)"). Because SAS would otherwise set the variables lowgrade and lowtype to missing for each new iteration, the RETAIN statement is used to keep track of the observation that contains the lowest grade. When SAS reads the last observation of the student ("last.idno") it outputs the data corresponding to the lowest exam type (lowtype) and grade (lowgrade) to the lowest data set. (Note that the statement "if last.idno then output;" effectively collapses multiple observations per student into one observation per student.) So that we can merge the lowest data set back into the grades data set, by idno and gtype, the variable lowtype is renamed back to gtype.
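
The two DATA steps described above might be sketched as follows. This is a reconstruction from the description and the printed output, not necessarily the course's exact code; the grades rows are a subset of those shown earlier, and the tie-breaking rule (a later exam with an equal grade replaces the earlier one) is an assumption chosen to match the E3 result for Jones.

```sas
/* Subset of the grades data shown earlier */
data grades;
  input idno l_name $ gtype $ grade;
  datalines;
10 Smith E1 78
10 Smith E2 82
10 Smith E3 86
10 Smith E4 69
10 Smith P1 97
12 Jones E1 98
12 Jones E2 92
12 Jones E3 92
12 Jones E4 99
12 Jones P1 99
;
run;

/* Step 1: keep only the four exam grades */
data exams;
  set grades;
  if gtype in ('E1', 'E2', 'E3', 'E4');
run;

/* Step 2: carry the running minimum across each student's observations */
data lowest (rename = (lowtype = gtype));
  set exams;
  by idno;
  retain lowgrade lowtype;
  if first.idno then lowgrade = grade;       /* restart for each student */
  lowgrade = min(lowgrade, grade);
  if grade = lowgrade then lowtype = gtype;  /* remember which exam it was */
  if last.idno then output;                  /* one observation per student */
  drop gtype;
run;

proc print data = lowest;
  title 'Lowest exam grade for each student';
run;
```

Without the RETAIN statement, lowgrade and lowtype would be reset to missing at the top of every iteration, and the comparison with earlier observations would be lost.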
The following program uses a DO loop to tell SAS to determine what four times three (4 × 3) equals:

| answer | i |
|---|---|
| 12 | 5 |

Okay... admittedly, we could accomplish our goal of determining four times three in a much simpler way, but then we wouldn't have the pleasure of seeing how we can accomplish it using an iterative DO loop! The key to understanding the DATA step here is to recall that multiplication is just repeated addition. That is, four times three (4 × 3) is the same as adding three together four times, that is, 3 + 3 + 3 + 3. That's all that the iterative DO loop in the DATA step is telling SAS to do. After having initialized answer to 0, add 3 to answer, then add 3 to answer again, and add 3 to answer again, and add 3 to answer again. After SAS has added 3 to the answer variable four times, SAS exits the DO loop, and since that's the end of the DATA step, SAS moves on to the next procedure and prints the result.

The other thing you might want to notice about the DATA step is that there is no input data set or input data file. We are generating data from scratch here, rather than from some input source. Now, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that our code properly calculates four times three.
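
A minimal sketch of a DATA step matching this description is shown below. The data set name multiply comes from the text; the title is hypothetical.

```sas
data multiply;
  answer = 0;          /* initialize the running total */
  do i = 1 to 4;       /* repeat the addition four times */
    answer + 3;        /* sum statement: adds 3 to answer */
  end;
run;

proc print data = multiply noobs;
  title 'Four times three';
run;
```

The single observation written at the end of the DATA step holds answer = 12 and i = 5.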
Ahhh, what about that i variable that shows up in our multiply data set? If you look at our DATA step again, you can see that it comes from the DO loop. It is what is called the index variable (or counter variable). Most often, you'll want to drop it from your output data set, but its presence here is educational. As you can see, its final value is 5. That's what allows SAS to exit the DO loop: we tell SAS to take the actions inside the loop only while i is at most 4. Once i becomes greater than 4, SAS jumps out of the loop and moves on to the next statements in the DATA step. Let's take a look at the general form of iterative DO loops:

    DO index-variable = start TO stop BY increment;
        action statements;
    END;

where

* DO, index-variable, start, TO, stop, and END are required in every iterative DO loop
* index-variable, which stores the value of the current iteration of the DO loop, can be any valid SAS variable name. It is common, however, to use a single letter, with i and j being the most used.
* start is the value of the index variable at which you want SAS to start the loop
* stop is the value of the index variable at which you want SAS to stop the loop
* increment is by how much you want SAS to change the index variable after each iteration. The most commonly used increment is 1. In fact, if you don't specify a BY clause, SAS uses the default increment of 1.

For example,

`do jack = 1 to 5;`

tells SAS to create an index variable called jack, start at 1, increment by 1, and end at 5, so that the values of jack from iteration to iteration are 1, 2, 3, 4, and 5. And, this DO statement:

`do jill = 2 to 12 by 2;`

tells SAS to create an index variable called jill, start at 2, increment by 2, and end at 12, so that the values of jill from iteration to iteration are 2, 4, 6, 8, 10, and 12.

The following SAS program uses an iterative DO loop to count backwards by 1:

| i |
|---|
| 20 |
| 19 |
| 18 |
| 17 |
| 16 |
| 15 |
| 14 |
| 13 |
| 12 |
| 11 |
| 10 |
| 9 |
| 8 |
| 7 |
| 6 |
| 5 |
| 4 |
| 3 |
| 2 |
| 1 |

As you can see in this DO statement, you can decrement a DO loop's index variable by specifying a negative value for the BY clause. Here, we tell SAS to start at 20, and decrease the index variable by 1, until it reaches 1. The OUTPUT statement tells SAS to output the value of the index variable i for each iteration of the DO loop. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that our code properly counts backwards from 20 to 1.
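
A minimal sketch of such a DATA step follows; the data set name backwards is hypothetical.

```sas
data backwards;
  do i = 20 to 1 by -1;   /* negative increment counts down */
    output;               /* write one observation per iteration */
  end;
run;

proc print data = backwards noobs;
  title 'Counting backwards by 1';
run;
```

Because the OUTPUT statement sits inside the loop, the data set holds 20 observations rather than the single observation that the automatic end-of-step output would produce.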

    DO index-variable = value1, value2, value3, ...;
        action statements;
    END;

where the values can be character or numeric. When the DO loop executes, it executes once for each item in the series. The index variable equals the value of the current item. You must use commas to separate items in a series. To list items in a series, you must specify

either all numeric values:

`DO i = 1, 2, 3, 4, 5;`

all character values, with each value enclosed in quotation marks:

`DO j = 'winter', 'spring', 'summer', 'fall';`

or all variable names:

`DO k = first, second, third;`

In this case, the index variable takes on the values of the specified variables. Note that the variable names are not enclosed in quotation marks, while quotation marks are required for character values.

### Nested DO Loops

Just as in other programming languages, we can nest loops within each other.

Suppose you are interested in conducting an experiment with two factors A and B. Suppose factor A is, say, the amount of water with levels 1, 2, 3, and 4; and factor B is, say, the amount of sunlight, say with levels 1, 2, 3, 4, and 5. Then, the following SAS code uses nested iterative DO loops to generate the 4 by 5 factorial design:

| Obs | i | j |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 3 |
| 4 | 1 | 4 |
| 5 | 1 | 5 |
| 6 | 2 | 1 |
| 7 | 2 | 2 |
| 8 | 2 | 3 |
| 9 | 2 | 4 |
| 10 | 2 | 5 |
| 11 | 3 | 1 |
| 12 | 3 | 2 |
| 13 | 3 | 3 |
| 14 | 3 | 4 |
| 15 | 3 | 5 |
| 16 | 4 | 1 |
| 17 | 4 | 2 |
| 18 | 4 | 3 |
| 19 | 4 | 4 |
| 20 | 4 | 5 |

First, launch and run the SAS program. Then, review the output from the PRINT procedure to see the contents of the design data set. By doing so, you can get a good feel for how the nested DO loops work. First, SAS sets the value of the index variable i to 1, then proceeds to the next step, which happens to be another iterative DO loop. While i is 1, SAS steps the index variable j through the values 1 to 5, writing one observation to the design data set for each value; once j reaches 6, SAS exits the inside DO loop.

SAS then sets the value of the index variable i to 2, then proceeds through the inside DO loop again just as described above. This process continues until SAS sets the value of index variable i to 5, jumps out of the outside DO loop, and ends the DATA step.
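
A minimal sketch of nested DO loops that would generate this design follows. The data set name design comes from the text; placing the OUTPUT statement in the inner loop is inferred from the printed result.

```sas
data design;
  do i = 1 to 4;       /* factor A: four levels */
    do j = 1 to 5;     /* factor B: five levels */
      output;          /* one observation per (i, j) combination */
    end;
  end;
run;

proc print data = design;
  title '4 by 5 factorial design';
run;
```

The inner loop runs to completion for each value of i, which is why j cycles through 1 to 5 before i changes.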

    DO UNTIL (expression);
        action statements;
    END;

where expression is any valid SAS expression enclosed in parentheses. The key thing to remember is that the expression is not evaluated until the bottom of the loop. Therefore, a DO UNTIL loop always executes at least once. As soon as the expression is determined to be true, the DO loop does not execute again.

Suppose you want to know how many years it would take to accumulate \$50,000 if you deposit \$1200 each year into an account that earns 5% interest. The following program uses a DO UNTIL loop to perform the calculation for us:

| value | year |
|---|---|
| 1260.00 | 1 |
| 2583.00 | 2 |
| 3972.15 | 3 |
| 5430.76 | 4 |
| 6962.30 | 5 |
| 8570.41 | 6 |
| 10258.93 | 7 |
| 12031.88 | 8 |
| 13893.47 | 9 |
| 15848.14 | 10 |
| 17900.55 | 11 |
| 20055.58 | 12 |
| 22318.36 | 13 |
| 24694.28 | 14 |
| 27188.99 | 15 |
| 29808.44 | 16 |
| 32558.86 | 17 |
| 35446.80 | 18 |
| 38479.14 | 19 |
| 41663.10 | 20 |
| 45006.26 | 21 |
| 48516.57 | 22 |
| 52202.40 | 23 |

Recall that the expression in the DO UNTIL statement is not evaluated until the bottom of the loop. Therefore, the DO UNTIL loop executes at least once. On the first iteration, the value variable is increased by 1200, or in this case, set to 1200. Then, the value variable is updated by calculating 1200 + 1200*0.05 to get 1260. Then, the year variable is increased by 1, or in this case, set to 1. The first observation, for which year = 1 and value = 1260, is then written to the output data set called investment. Having reached the bottom of the DO UNTIL loop, the expression (value >= 50000) is evaluated to determine if it is true. Since value is just 1260, the expression is not true, and so the DO UNTIL loop is executed once again. The process continues as described until SAS determines that value is at least 50000 and therefore stops executing the DO UNTIL loop.
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that it would take 23 years to accumulate at least \$50,000.
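
A minimal sketch of the DO UNTIL program described above follows. The data set name investment comes from the text; the title is hypothetical. The sum statements initialize value and year to 0 and retain them across loop iterations.

```sas
data investment;
  do until (value >= 50000);    /* condition is checked at the bottom */
    value + 1200;               /* yearly deposit */
    value + value * 0.05;       /* 5% interest on the balance */
    year + 1;
    output;                     /* one observation per year */
  end;
run;

proc print data = investment noobs;
  title 'Years until at least 50,000';
run;
```

The last observation written is the first one for which value is at least 50000, so its year value is the answer to the question.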

    DO WHILE (expression);
        action statements;
    END;

where expression is any valid SAS expression enclosed in parentheses. An important difference between the DO UNTIL and DO WHILE statements is that the DO WHILE expression is evaluated at the top of the DO loop. If the expression is false the first time it is evaluated, then the DO loop doesn't even execute once.

The following program attempts to use a DO WHILE loop to accomplish the same goal as the program above, namely to determine how many years it would take to accumulate \$50,000 if you deposit \$1200 each year into an account that earns 5% interest:

| value | year |
|---|---|
| 1260.00 | 1 |
| 2583.00 | 2 |
| 3972.15 | 3 |
| 5430.76 | 4 |
| 6962.30 | 5 |
| 8570.41 | 6 |
| 10258.93 | 7 |
| 12031.88 | 8 |
| 13893.47 | 9 |
| 15848.14 | 10 |
| 17900.55 | 11 |
| 20055.58 | 12 |
| 22318.36 | 13 |
| 24694.28 | 14 |
| 27188.99 | 15 |
| 29808.44 | 16 |
| 32558.86 | 17 |
| 35446.80 | 18 |
| 38479.14 | 19 |
| 41663.10 | 20 |
| 45006.26 | 21 |
| 48516.57 | 22 |
| 52202.40 | 23 |

The calculations proceed as before. First, the value variable is updated to by calculating 0 + 1200, to get 1200. Then, the value variable is updated by calculating 1200 + 1200*0.05 to get 1260. Then, the year variable is increased by 1, or in this case, set to 1. The first observation, for which year = 1 and value = 1260, is then written to the output data set called investthree. SAS then returns to the top of the DO WHILE loop, to determine if the expression (value < 50000) is true. Since value is just 1260, the expression is true, and so the DO WHILE loop executes once again. The process continues as described until SAS determines that value is as least 50000 and therefore stops executing the DO WHILE loop.
\n", "Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that this program also determines that it would take 23 years to accumulate at least \\$50,000.
\n", "You should also try changing the WHILE condition from value < 50000 to value ≥ 50000 to see what happens. (Hint: you will get no output. Why?)
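A minimal sketch of a DO WHILE program consistent with the walkthrough above. It assumes the accumulation is done with sum statements (which start value and year at 0 and retain them across iterations); the PROC PRINT options and the 9.2 display format are assumptions chosen to resemble the output shown, not taken from the original listing.

```sas
/* Sketch: years needed to reach $50,000 with $1200 yearly deposits at 5%. */
data investthree;
   /* The WHILE condition is checked at the TOP of each pass. */
   do while (value < 50000);
      value + 1200;            /* sum statement: add the yearly deposit   */
      value + value * 0.05;    /* sum statement: add 5% interest          */
      year + 1;                /* sum statement: count the year           */
      output;                  /* write one observation per year          */
   end;
run;

proc print data = investthree noobs;
   var value year;
   format value 9.2;           /* assumed display format */
run;
```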
\n", "Suppose again that you want to know how many years it would take to accumulate \\$50,000 if you deposit \\$1200 each year into an account that earns 5% interest. But this time, suppose you also want to limit the number of years that you invest to 15 years. The following program uses a conditional iterative DO loop to accumulate our investment until we reach 15 years or until the value of our investment exceeds \\$50,000, whichever comes first:
\n", "
| value | year |
|---|---|
| 1260.00 | 1 |
| 2583.00 | 2 |
| 3972.15 | 3 |
| 5430.76 | 4 |
| 6962.30 | 5 |
| 8570.41 | 6 |
| 10258.93 | 7 |
| 12031.88 | 8 |
| 13893.47 | 9 |
| 15848.14 | 10 |
| 17900.55 | 11 |
| 20055.58 | 12 |
| 22318.36 | 13 |
| 24694.28 | 14 |
| 27188.99 | 15 |
\n", "
Note that there are just two differences between this program and the program in the previous example that uses the DO UNTIL loop: i) the iteration i = 1 to 15 has been inserted into the DO UNTIL statement; and ii) because the index variable i is created for the DO loop, it is dropped before writing the contents of the program data vector to the output data set investfour.
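A minimal sketch of such a conditional iterative DO loop, assuming the same sum statements as in the earlier investment examples. The DROP= data set option is one assumed way of dropping i; a DROP statement would work as well.

```sas
/* Sketch: stop after 15 years OR once value reaches 50000, whichever is first. */
data investfour (drop = i);
   do i = 1 to 15 until (value >= 50000);
      value + 1200;            /* yearly deposit  */
      value + value * 0.05;    /* 5% interest     */
      year + 1;
      output;                  /* one observation per year */
   end;
run;

proc print data = investfour noobs;
   var value year;
   format value 9.2;           /* assumed display format */
run;
```

Because the balance is still below 50000 after 15 years, the index limit ends the loop here rather than the UNTIL condition.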
\n", "The following program simply reads in the average monthly temperatures (in Celsius) for ten different cities in the United States into a temporary SAS data set called avgcelsius:
\n", "
| City | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State College, PA | -2 | -2 | 2 | 8 | 14 | 19 | 21 | 20 | 16 | 10 | 4 | -1 |
| Miami, FL | 20 | 20 | 22 | 23 | 26 | 27 | 28 | 28 | 27 | 26 | 23 | 20 |
| St. Louis, MO | -1 | 1 | 6 | 13 | 18 | 23 | 26 | 25 | 21 | 15 | 7 | 1 |
| New Orleans, LA | 11 | 13 | 16 | 20 | 23 | 27 | 27 | 27 | 26 | 21 | 16 | 12 |
| Madison, WI | -8 | -5 | 0 | 7 | 14 | 19 | 22 | 20 | 16 | 10 | 2 | -5 |
| Houston, TX | 10 | 12 | 16 | 20 | 23 | 27 | 28 | 28 | 26 | 21 | 16 | 12 |
| Phoenix, AZ | 12 | 14 | 16 | 21 | 26 | 31 | 33 | 32 | 30 | 23 | 16 | 12 |
| Seattle, WA | 5 | 6 | 7 | 10 | 13 | 16 | 18 | 18 | 16 | 12 | 8 | 6 |
| San Francisco, CA | 10 | 12 | 12 | 13 | 14 | 15 | 15 | 16 | 17 | 16 | 14 | 11 |
| San Diego, CA | 13 | 14 | 15 | 16 | 17 | 19 | 21 | 22 | 21 | 19 | 16 | 14 |
\n", "
Now, suppose that we don't feel particularly comfortable with understanding Celsius temperatures, and therefore, we want to convert the Celsius temperatures into Fahrenheit temperatures for which we have a better feel. The following SAS program uses the standard conversion formula:
\n", "Fahrenheit temperature = 1.8*Celsius temperature + 32
\n",
" to convert the Celsius temperatures in the avgcelsius data set to Fahrenheit temperatures stored in a new data set called avgfahrenheit:
\n", "
| City | janf | febf | marf | aprf | mayf | junf | julf | augf | sepf | octf | novf | decf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State College, PA | 28.4 | 28.4 | 35.6 | 46.4 | 57.2 | 66.2 | 69.8 | 68.0 | 60.8 | 50.0 | 39.2 | 30.2 |
| Miami, FL | 68.0 | 68.0 | 71.6 | 73.4 | 78.8 | 80.6 | 82.4 | 82.4 | 80.6 | 78.8 | 73.4 | 68.0 |
| St. Louis, MO | 30.2 | 33.8 | 42.8 | 55.4 | 64.4 | 73.4 | 78.8 | 77.0 | 69.8 | 59.0 | 44.6 | 33.8 |
| New Orleans, LA | 51.8 | 55.4 | 60.8 | 68.0 | 73.4 | 80.6 | 80.6 | 80.6 | 78.8 | 69.8 | 60.8 | 53.6 |
| Madison, WI | 17.6 | 23.0 | 32.0 | 44.6 | 57.2 | 66.2 | 71.6 | 68.0 | 60.8 | 50.0 | 35.6 | 23.0 |
| Houston, TX | 50.0 | 53.6 | 60.8 | 68.0 | 73.4 | 80.6 | 82.4 | 82.4 | 78.8 | 69.8 | 60.8 | 53.6 |
| Phoenix, AZ | 53.6 | 57.2 | 60.8 | 69.8 | 78.8 | 87.8 | 91.4 | 89.6 | 86.0 | 73.4 | 60.8 | 53.6 |
| Seattle, WA | 41.0 | 42.8 | 44.6 | 50.0 | 55.4 | 60.8 | 64.4 | 64.4 | 60.8 | 53.6 | 46.4 | 42.8 |
| San Francisco, CA | 50.0 | 53.6 | 53.6 | 55.4 | 57.2 | 59.0 | 59.0 | 60.8 | 62.6 | 60.8 | 57.2 | 51.8 |
| San Diego, CA | 55.4 | 57.2 | 59.0 | 60.8 | 62.6 | 66.2 | 69.8 | 71.6 | 69.8 | 66.2 | 60.8 | 57.2 |
\n", "
As you can see by the number of assignment statements necessary to make the conversions, the exercise becomes one of patience. Because there are twelve average monthly temperatures, we must write twelve assignment statements. Each assignment statement performs the same calculation. Only the name of the variable changes in each statement. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were properly converted to Fahrenheit temperatures.
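For reference, a sketch of what the two steps described above could look like: one DATA step that reads avgcelsius and one that converts each month with its own assignment statement. The column positions for City (1-18) and the layout of the data lines are assumptions; the values are the ones shown in the avgcelsius table.

```sas
/* Sketch: read the Celsius data. City is assumed to occupy columns 1-18. */
data avgcelsius;
   input City $ 1-18 jan feb mar apr may jun
                     jul aug sep oct nov dec;
   datalines;
State College, PA  -2 -2 2 8 14 19 21 20 16 10 4 -1
Miami, FL          20 20 22 23 26 27 28 28 27 26 23 20
St. Louis, MO      -1 1 6 13 18 23 26 25 21 15 7 1
New Orleans, LA    11 13 16 20 23 27 27 27 26 21 16 12
Madison, WI        -8 -5 0 7 14 19 22 20 16 10 2 -5
Houston, TX        10 12 16 20 23 27 28 28 26 21 16 12
Phoenix, AZ        12 14 16 21 26 31 33 32 30 23 16 12
Seattle, WA        5 6 7 10 13 16 18 18 16 12 8 6
San Francisco, CA  10 12 12 13 14 15 15 16 17 16 14 11
San Diego, CA      13 14 15 16 17 19 21 22 21 19 16 14
;
run;

/* Sketch: convert without arrays, one assignment statement per month. */
data avgfahrenheit;
   set avgcelsius;
   janf = 1.8*jan + 32;
   febf = 1.8*feb + 32;
   marf = 1.8*mar + 32;
   aprf = 1.8*apr + 32;
   mayf = 1.8*may + 32;
   junf = 1.8*jun + 32;
   julf = 1.8*jul + 32;
   augf = 1.8*aug + 32;
   sepf = 1.8*sep + 32;
   octf = 1.8*oct + 32;
   novf = 1.8*nov + 32;
   decf = 1.8*dec + 32;
run;

proc print data = avgfahrenheit noobs;
   var City janf febf marf aprf mayf junf julf augf sepf octf novf decf;
run;
```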
\n", "The above program is crying out for the use of an array. One of the primary arguments for using an array is to reduce the number of statements that are required for processing variables. Let's take a look at an example.
\n", "The following program uses a one-dimensional array called fahr to convert the average Celsius temperatures in the avgcelsius data set to average Fahrenheit temperatures stored in a new data set called avgfahrenheit:
\n", "
| City | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State College, PA | 28.4 | 28.4 | 35.6 | 46.4 | 57.2 | 66.2 | 69.8 | 68.0 | 60.8 | 50.0 | 39.2 | 30.2 |
| Miami, FL | 68.0 | 68.0 | 71.6 | 73.4 | 78.8 | 80.6 | 82.4 | 82.4 | 80.6 | 78.8 | 73.4 | 68.0 |
| St. Louis, MO | 30.2 | 33.8 | 42.8 | 55.4 | 64.4 | 73.4 | 78.8 | 77.0 | 69.8 | 59.0 | 44.6 | 33.8 |
| New Orleans, LA | 51.8 | 55.4 | 60.8 | 68.0 | 73.4 | 80.6 | 80.6 | 80.6 | 78.8 | 69.8 | 60.8 | 53.6 |
| Madison, WI | 17.6 | 23.0 | 32.0 | 44.6 | 57.2 | 66.2 | 71.6 | 68.0 | 60.8 | 50.0 | 35.6 | 23.0 |
| Houston, TX | 50.0 | 53.6 | 60.8 | 68.0 | 73.4 | 80.6 | 82.4 | 82.4 | 78.8 | 69.8 | 60.8 | 53.6 |
| Phoenix, AZ | 53.6 | 57.2 | 60.8 | 69.8 | 78.8 | 87.8 | 91.4 | 89.6 | 86.0 | 73.4 | 60.8 | 53.6 |
| Seattle, WA | 41.0 | 42.8 | 44.6 | 50.0 | 55.4 | 60.8 | 64.4 | 64.4 | 60.8 | 53.6 | 46.4 | 42.8 |
| San Francisco, CA | 50.0 | 53.6 | 53.6 | 55.4 | 57.2 | 59.0 | 59.0 | 60.8 | 62.6 | 60.8 | 57.2 | 51.8 |
| San Diego, CA | 55.4 | 57.2 | 59.0 | 60.8 | 62.6 | 66.2 | 69.8 | 71.6 | 69.8 | 66.2 | 60.8 | 57.2 |
\n", "
If you compare this program with the previous program, you can see the statements that replaced the twelve assignment statements. The ARRAY statement defines an array called fahr. It tells SAS that you want to group the twelve month variables, jan, feb, ..., dec, into an array called fahr. The (12) that appears in parentheses is a required part of the array declaration. Called the dimension of the array, it tells SAS how many elements, that is, variables, you want to group together. When specifying the variable names to be grouped in the array, we simply list the elements, separating each element with a space. As with all SAS statements, the ARRAY statement is closed with a semicolon (;).
\n", "Once we've defined the array fahr, we can use it in our code instead of the individual variable names. We refer to the individual elements of the array using its name and an index, such as fahr(i). The order in which the variables appear in the ARRAY statement determines each variable's position in the array. For example, fahr(1) corresponds to the jan variable, fahr(2) corresponds to the feb variable, and fahr(12) corresponds to the dec variable. It's when you use an array like fahr, in conjunction with an iterative DO loop, that you can really simplify your code, as we did in this program.
\n", "The DO loop tells SAS to process through the elements of the fahr array, each time converting the Celsius temperature to a Fahrenheit temperature. For example, when the index variable i is 1, the assignment statement becomes:
\n", "fahr(1) = 1.8*fahr(1) + 32;
\n",
" which you could think of as saying:
\n", "jan = 1.8*jan + 32;
\n",
" The value of jan on the right side of the equal sign is the Celsius temperature. After the assignment statement is executed, the value of jan on the left side of the equal sign is updated to reflect the Fahrenheit temperature.
\n", "Now, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were again properly converted to Fahrenheit temperatures. Oh, one more thing to point out! Note that the variables listed in the PRINT procedure's VAR statement are the original variable names jan, feb, ..., dec, not the variables as they were grouped into an array, fahr(1), fahr(2), ..., fahr(12). That's because an array exists only for the duration of the DATA step. If in the PRINT procedure, you instead tell SAS to print fahr(1), fahr(2), ... you'll see that SAS will hiccup. Let's summarize!
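A minimal sketch of the array version described above, assuming the avgcelsius data set already exists in the WORK library. The DROP statement for i is an assumption; the description only requires that the VAR statement list the original month names.

```sas
/* Sketch: convert Celsius to Fahrenheit in place using a 12-element array. */
data avgfahrenheit;
   set avgcelsius;
   array fahr(12) jan feb mar apr may jun jul aug sep oct nov dec;
   do i = 1 to 12;
      fahr(i) = 1.8*fahr(i) + 32;   /* fahr(1) is jan, ..., fahr(12) is dec */
   end;
   drop i;                          /* assumed: keep the index out of the output */
run;

proc print data = avgfahrenheit noobs;
   /* The array name is unknown outside the DATA step, so list the variables. */
   var City jan feb mar apr may jun jul aug sep oct nov dec;
run;
```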
\n", "The following program is identical to the program in the previous example, except the 12 in the ARRAY statement has been changed to an asterisk (*) and we use a SAS list to grab the variables for the array:
\n", "
| City | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State College, PA | 28.4 | 28.4 | 35.6 | 46.4 | 57.2 | 66.2 | 69.8 | 68.0 | 60.8 | 50.0 | 39.2 | 30.2 |
| Miami, FL | 68.0 | 68.0 | 71.6 | 73.4 | 78.8 | 80.6 | 82.4 | 82.4 | 80.6 | 78.8 | 73.4 | 68.0 |
| St. Louis, MO | 30.2 | 33.8 | 42.8 | 55.4 | 64.4 | 73.4 | 78.8 | 77.0 | 69.8 | 59.0 | 44.6 | 33.8 |
| New Orleans, LA | 51.8 | 55.4 | 60.8 | 68.0 | 73.4 | 80.6 | 80.6 | 80.6 | 78.8 | 69.8 | 60.8 | 53.6 |
| Madison, WI | 17.6 | 23.0 | 32.0 | 44.6 | 57.2 | 66.2 | 71.6 | 68.0 | 60.8 | 50.0 | 35.6 | 23.0 |
| Houston, TX | 50.0 | 53.6 | 60.8 | 68.0 | 73.4 | 80.6 | 82.4 | 82.4 | 78.8 | 69.8 | 60.8 | 53.6 |
| Phoenix, AZ | 53.6 | 57.2 | 60.8 | 69.8 | 78.8 | 87.8 | 91.4 | 89.6 | 86.0 | 73.4 | 60.8 | 53.6 |
| Seattle, WA | 41.0 | 42.8 | 44.6 | 50.0 | 55.4 | 60.8 | 64.4 | 64.4 | 60.8 | 53.6 | 46.4 | 42.8 |
| San Francisco, CA | 50.0 | 53.6 | 53.6 | 55.4 | 57.2 | 59.0 | 59.0 | 60.8 | 62.6 | 60.8 | 57.2 | 51.8 |
| San Diego, CA | 55.4 | 57.2 | 59.0 | 60.8 | 62.6 | 66.2 | 69.8 | 71.6 | 69.8 | 66.2 | 60.8 | 57.2 |
\n", "
Simple enough! Rather than telling SAS how many variables you are grouping and listing each one of them, you can let SAS do the dirty work of counting the elements and expanding your variable list. To do so, you simply define the dimension using an asterisk (*) and use the SAS list shortcut. You might find this strategy particularly helpful if you are grouping so many variables together into an array that you don't want to spend the time counting and listing them individually. Incidentally, throughout this lesson, we enclose the array's dimension (or index variable) in parentheses ( ). We could alternatively use braces { } or brackets [ ].
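A minimal sketch of the asterisk form, assuming avgcelsius stores the month variables contiguously from jan through dec so that the name range list jan--dec picks up exactly those twelve. The choice of a name range list (rather than another kind of SAS variable list) is an assumption.

```sas
/* Sketch: let SAS count the elements and expand the variable list. */
data avgfahrenheit;
   set avgcelsius;
   array fahr(*) jan--dec;          /* (*) : SAS works out the dimension */
   do i = 1 to 12;
      fahr(i) = 1.8*fahr(i) + 32;
   end;
   drop i;
run;

proc print data = avgfahrenheit noobs;
   var City jan--dec;
run;
```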
\n", "The above program used a SAS list to shorten the list of variable names grouped into the fahr array. In some cases, you could also consider using the special name lists _ALL_, _CHARACTER_ and _NUMERIC_:
\n", "In this case, we could have used the _NUMERIC_ keyword instead as shown in the following program.
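A minimal sketch of such a program, assuming that the twelve month variables are the only numeric variables in avgcelsius. Because the ARRAY statement appears before the DO statement that creates i, the index variable is not yet defined when _NUMERIC_ is expanded and so is not swept into the array.

```sas
/* Sketch: group every numeric variable defined so far into the array. */
data avgfahrenheit;
   set avgcelsius;
   array fahr(*) _numeric_;         /* jan, feb, ..., dec (City is character) */
   do i = 1 to 12;
      fahr(i) = 1.8*fahr(i) + 32;
   end;
   drop i;
run;

proc print data = avgfahrenheit noobs;
   var City jan feb mar apr may jun jul aug sep oct nov dec;
run;
```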
\n", "
| City | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State College, PA | 28.4 | 28.4 | 35.6 | 46.4 | 57.2 | 66.2 | 69.8 | 68.0 | 60.8 | 50.0 | 39.2 | 30.2 |
| Miami, FL | 68.0 | 68.0 | 71.6 | 73.4 | 78.8 | 80.6 | 82.4 | 82.4 | 80.6 | 78.8 | 73.4 | 68.0 |
| St. Louis, MO | 30.2 | 33.8 | 42.8 | 55.4 | 64.4 | 73.4 | 78.8 | 77.0 | 69.8 | 59.0 | 44.6 | 33.8 |
| New Orleans, LA | 51.8 | 55.4 | 60.8 | 68.0 | 73.4 | 80.6 | 80.6 | 80.6 | 78.8 | 69.8 | 60.8 | 53.6 |
| Madison, WI | 17.6 | 23.0 | 32.0 | 44.6 | 57.2 | 66.2 | 71.6 | 68.0 | 60.8 | 50.0 | 35.6 | 23.0 |
| Houston, TX | 50.0 | 53.6 | 60.8 | 68.0 | 73.4 | 80.6 | 82.4 | 82.4 | 78.8 | 69.8 | 60.8 | 53.6 |
| Phoenix, AZ | 53.6 | 57.2 | 60.8 | 69.8 | 78.8 | 87.8 | 91.4 | 89.6 | 86.0 | 73.4 | 60.8 | 53.6 |
| Seattle, WA | 41.0 | 42.8 | 44.6 | 50.0 | 55.4 | 60.8 | 64.4 | 64.4 | 60.8 | 53.6 | 46.4 | 42.8 |
| San Francisco, CA | 50.0 | 53.6 | 53.6 | 55.4 | 57.2 | 59.0 | 59.0 | 60.8 | 62.6 | 60.8 | 57.2 | 51.8 |
| San Diego, CA | 55.4 | 57.2 | 59.0 | 60.8 | 62.6 | 66.2 | 69.8 | 71.6 | 69.8 | 66.2 | 60.8 | 57.2 |
\n", "
The following program again converts the average monthly Celsius temperatures in ten cities to average monthly Fahrenheit temperatures. To do so, the already existing Celsius temperatures, jan, feb, ..., and dec, are grouped into an array called celsius, and the resulting Fahrenheit temperatures are stored in new variables janf, febf, ..., decf, which are grouped into an array called fahr:
\n", "
| City | jan | janf | feb | febf | mar | marf | apr | aprf | may | mayf | jun | junf | jul | julf | aug | augf | sep | sepf | oct | octf | nov | novf | dec | decf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State College, PA | -2 | 28.4 | -2 | 28.4 | 2 | 35.6 | 8 | 46.4 | 14 | 57.2 | 19 | 66.2 | 21 | 69.8 | 20 | 68.0 | 16 | 60.8 | 10 | 50.0 | 4 | 39.2 | -1 | 30.2 |
| Miami, FL | 20 | 68.0 | 20 | 68.0 | 22 | 71.6 | 23 | 73.4 | 26 | 78.8 | 27 | 80.6 | 28 | 82.4 | 28 | 82.4 | 27 | 80.6 | 26 | 78.8 | 23 | 73.4 | 20 | 68.0 |
| St. Louis, MO | -1 | 30.2 | 1 | 33.8 | 6 | 42.8 | 13 | 55.4 | 18 | 64.4 | 23 | 73.4 | 26 | 78.8 | 25 | 77.0 | 21 | 69.8 | 15 | 59.0 | 7 | 44.6 | 1 | 33.8 |
| New Orleans, LA | 11 | 51.8 | 13 | 55.4 | 16 | 60.8 | 20 | 68.0 | 23 | 73.4 | 27 | 80.6 | 27 | 80.6 | 27 | 80.6 | 26 | 78.8 | 21 | 69.8 | 16 | 60.8 | 12 | 53.6 |
| Madison, WI | -8 | 17.6 | -5 | 23.0 | 0 | 32.0 | 7 | 44.6 | 14 | 57.2 | 19 | 66.2 | 22 | 71.6 | 20 | 68.0 | 16 | 60.8 | 10 | 50.0 | 2 | 35.6 | -5 | 23.0 |
| Houston, TX | 10 | 50.0 | 12 | 53.6 | 16 | 60.8 | 20 | 68.0 | 23 | 73.4 | 27 | 80.6 | 28 | 82.4 | 28 | 82.4 | 26 | 78.8 | 21 | 69.8 | 16 | 60.8 | 12 | 53.6 |
| Phoenix, AZ | 12 | 53.6 | 14 | 57.2 | 16 | 60.8 | 21 | 69.8 | 26 | 78.8 | 31 | 87.8 | 33 | 91.4 | 32 | 89.6 | 30 | 86.0 | 23 | 73.4 | 16 | 60.8 | 12 | 53.6 |
| Seattle, WA | 5 | 41.0 | 6 | 42.8 | 7 | 44.6 | 10 | 50.0 | 13 | 55.4 | 16 | 60.8 | 18 | 64.4 | 18 | 64.4 | 16 | 60.8 | 12 | 53.6 | 8 | 46.4 | 6 | 42.8 |
| San Francisco, CA | 10 | 50.0 | 12 | 53.6 | 12 | 53.6 | 13 | 55.4 | 14 | 57.2 | 15 | 59.0 | 15 | 59.0 | 16 | 60.8 | 17 | 62.6 | 16 | 60.8 | 14 | 57.2 | 11 | 51.8 |
| San Diego, CA | 13 | 55.4 | 14 | 57.2 | 15 | 59.0 | 16 | 60.8 | 17 | 62.6 | 19 | 66.2 | 21 | 69.8 | 22 | 71.6 | 21 | 69.8 | 19 | 66.2 | 16 | 60.8 | 14 | 57.2 |
\n", "
The DATA step should look eerily similar to that of Example 7.6. The only difference is that, rather than overwriting the Celsius temperatures, the program preserves them by storing the calculated Fahrenheit temperatures in new variables called janf, febf, ..., and decf. The first ARRAY statement tells SAS to group the jan, feb, ..., dec variables in the avgcelsius data set into a one-dimensional array called celsius. The second ARRAY statement tells SAS to create twelve new variables called janf, febf, ..., and decf and to group them into an array called fahr. The DO loop processes through the twelve elements of the celsius array, converts the Celsius temperatures to Fahrenheit temperatures, and stores the results in the fahr array. The PRINT procedure then tells SAS to print the twelve Celsius temperatures and twelve Fahrenheit temperatures side-by-side. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were properly converted to Fahrenheit temperatures.
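A minimal sketch of the two-array version described above, assuming avgcelsius already exists. The output data set name avgtemps is hypothetical, since the text does not name it.

```sas
/* Sketch: keep the Celsius values and store Fahrenheit values in new variables. */
data avgtemps;                      /* hypothetical data set name */
   set avgcelsius;
   array celsius(12) jan feb mar apr may jun jul aug sep oct nov dec;
   array fahr(12) janf febf marf aprf mayf junf
                  julf augf sepf octf novf decf;   /* created by this statement */
   do i = 1 to 12;
      fahr(i) = 1.8*celsius(i) + 32;
   end;
   drop i;
run;

proc print data = avgtemps noobs;
   var City jan janf feb febf mar marf apr aprf may mayf jun junf
       jul julf aug augf sep sepf oct octf nov novf dec decf;
run;
```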
\n", "Alternatively, we could let SAS do the naming for us in the fahr array.
\n", "
| City | fahr1 | fahr2 | fahr3 | fahr4 | fahr5 | fahr6 | fahr7 | fahr8 | fahr9 | fahr10 | fahr11 | fahr12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State College, PA | 28.4 | 28.4 | 35.6 | 46.4 | 57.2 | 66.2 | 69.8 | 68.0 | 60.8 | 50.0 | 39.2 | 30.2 |
| Miami, FL | 68.0 | 68.0 | 71.6 | 73.4 | 78.8 | 80.6 | 82.4 | 82.4 | 80.6 | 78.8 | 73.4 | 68.0 |
| St. Louis, MO | 30.2 | 33.8 | 42.8 | 55.4 | 64.4 | 73.4 | 78.8 | 77.0 | 69.8 | 59.0 | 44.6 | 33.8 |
| New Orleans, LA | 51.8 | 55.4 | 60.8 | 68.0 | 73.4 | 80.6 | 80.6 | 80.6 | 78.8 | 69.8 | 60.8 | 53.6 |
| Madison, WI | 17.6 | 23.0 | 32.0 | 44.6 | 57.2 | 66.2 | 71.6 | 68.0 | 60.8 | 50.0 | 35.6 | 23.0 |
| Houston, TX | 50.0 | 53.6 | 60.8 | 68.0 | 73.4 | 80.6 | 82.4 | 82.4 | 78.8 | 69.8 | 60.8 | 53.6 |
| Phoenix, AZ | 53.6 | 57.2 | 60.8 | 69.8 | 78.8 | 87.8 | 91.4 | 89.6 | 86.0 | 73.4 | 60.8 | 53.6 |
| Seattle, WA | 41.0 | 42.8 | 44.6 | 50.0 | 55.4 | 60.8 | 64.4 | 64.4 | 60.8 | 53.6 | 46.4 | 42.8 |
| San Francisco, CA | 50.0 | 53.6 | 53.6 | 55.4 | 57.2 | 59.0 | 59.0 | 60.8 | 62.6 | 60.8 | 57.2 | 51.8 |
| San Diego, CA | 55.4 | 57.2 | 59.0 | 60.8 | 62.6 | 66.2 | 69.8 | 71.6 | 69.8 | 66.2 | 60.8 | 57.2 |
\n", "
Note that when we define the fahr array in the second ARRAY statement, we specify how many elements the fahr array should contain (12), but we do not specify any variables to group into the array. That tells SAS two things: i) we want to create twelve new variables, and ii) we want to leave the naming of the variables to SAS. In this situation, SAS creates default names by concatenating the array name and the numbers 1, 2, 3, and so on, up to the array dimension. Here, for example, SAS creates the names fahr1, fahr2, fahr3, ..., up to fahr12. That's why we refer to the Fahrenheit temperatures as fahr1 to fahr12 in the PRINT procedure's VAR statement. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were again properly converted to Fahrenheit temperatures.
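A minimal sketch of the default-naming variant, assuming avgcelsius already exists. As in the previous sketch, the output data set name avgtemps is hypothetical.

```sas
/* Sketch: give only a dimension and let SAS name the new variables. */
data avgtemps;                      /* hypothetical data set name */
   set avgcelsius;
   array celsius(12) jan feb mar apr may jun jul aug sep oct nov dec;
   array fahr(12);                  /* SAS creates fahr1, fahr2, ..., fahr12 */
   do i = 1 to 12;
      fahr(i) = 1.8*celsius(i) + 32;
   end;
   drop i;
run;

proc print data = avgtemps noobs;
   var City fahr1-fahr12;           /* numbered range list of the default names */
run;
```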
\n", "The following program first reads a subset of Quality of Life data (variables qul3a, qul3b, ..., and qul3j) into a SAS data set called qul. Then, the program checks to make sure that the values for each variable have been recorded as either a 1, 2, or 3 as would be expected from the data form. If a value for one of the variables does not equal 1, 2, or 3, then that observation is output to a data set called errors. Otherwise, the observation is output to the qul data set. Because the error checking takes place without using arrays, the program contains a series of ten if/then statements, one for each of the ten variables concerned:
\n", "
| Obs | subj | qul3a | qul3b | qul3c | qul3d | qul3e | qul3f | qul3g | qul3h | qul3i | qul3j |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 110011 | 1 | 2 | 3 | 3 | 3 | 3 | 2 | 1 | 1 | 3 |
| 2 | 211011 | 1 | 2 | 3 | 2 | 1 | 2 | 3 | 2 | 1 | 3 |
| 3 | 310017 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 1 |
| 4 | 510001 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 |
\n", "
| Obs | subj | qul3a | qul3b | qul3c | qul3d | qul3e | qul3f | qul3g | qul3h | qul3i | qul3j |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 210012 | 2 | 3 | 4 | 1 | 2 | 2 | 3 | 3 | 1 | 1 |
| 2 | 411020 | 4 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 |
\n", "
The INPUT statement first reads an observation of data containing one subject's quality of life data. An observation is assumed to be error-free (flag is initially set to 0) until it is found to be in error (flag is set to 1 if any of the ten values are out of range). If an observation is deemed to contain an error (flag = 1) after looking at each of the ten values, it is output to the errors data set. Otherwise (flag = 0), it is output to the qul data set.
\n", "First, note that two of the observations in the input data set contain data recording errors. The qul3c value for subject 210012 was recorded as 4, as was the qul3a value for subject 411020. Then, launch and run the SAS program. Review the output to convince yourself that qul contains the four observations with clean data, and errors contains the two observations with bad data.
\n", "You should also appreciate that this is a classic situation that cries out for using arrays. If you aren't yet convinced, imagine how long the above program would be if you had to write similar if/then statements to check for errors in, say, a hundred such variables.
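A minimal sketch of the array-free error check described above. The six data lines are taken from the two output tables; their order in the input, the use of NOT IN, and the DROP statement for flag are assumptions.

```sas
/* Sketch: route each observation to qul or errors using ten IF/THEN statements. */
data qul errors;
   input subj qul3a qul3b qul3c qul3d qul3e
              qul3f qul3g qul3h qul3i qul3j;
   flag = 0;                                    /* assume clean until proven otherwise */
   if qul3a not in (1, 2, 3) then flag = 1;
   if qul3b not in (1, 2, 3) then flag = 1;
   if qul3c not in (1, 2, 3) then flag = 1;
   if qul3d not in (1, 2, 3) then flag = 1;
   if qul3e not in (1, 2, 3) then flag = 1;
   if qul3f not in (1, 2, 3) then flag = 1;
   if qul3g not in (1, 2, 3) then flag = 1;
   if qul3h not in (1, 2, 3) then flag = 1;
   if qul3i not in (1, 2, 3) then flag = 1;
   if qul3j not in (1, 2, 3) then flag = 1;
   if flag = 1 then output errors;
   else output qul;
   drop flag;
   datalines;
110011 1 2 3 3 3 3 2 1 1 3
210012 2 3 4 1 2 2 3 3 1 1
211011 1 2 3 2 1 2 3 2 1 3
310017 1 2 3 3 3 3 3 2 2 1
411020 4 3 3 3 3 2 2 2 2 2
510001 1 1 1 1 1 1 2 1 2 2
;
run;

proc print data = qul;
run;

proc print data = errors;
run;
```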
\n", "The following program performs the same error checking as the previous program except here the error checking is accomplished using two arrays, bounds and quldata:
\n", "
| Obs | subj | qul3a | qul3b | qul3c | qul3d | qul3e | qul3f | qul3g | qul3h | qul3i | qul3j | error1 | error2 | error3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 110011 | 1 | 2 | 3 | 3 | 3 | 3 | 2 | 1 | 1 | 3 | 1 | 2 | 3 |
| 2 | 211011 | 1 | 2 | 3 | 2 | 1 | 2 | 3 | 2 | 1 | 3 | 1 | 2 | 3 |
| 3 | 310017 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 1 | 1 | 2 | 3 |
| 4 | 510001 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 3 |
\n", "
| Obs | subj | qul3a | qul3b | qul3c | qul3d | qul3e | qul3f | qul3g | qul3h | qul3i | qul3j | error1 | error2 | error3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 210012 | 2 | 3 | 4 | 1 | 2 | 2 | 3 | 3 | 1 | 1 | 1 | 2 | 3 |
| 2 | 411020 | 4 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 3 |
\n", "
If you compare this program to the previous program, you'll see that the only differences here are the presence of two ARRAY definition statements and the IF/THEN statement within the iterative DO loop that does the error checking.
\n", "The first ARRAY statement uses a numbered range list to define an array called bounds that contains three new variables: error1, error2, and error3. The \"(1 2 3)\" that appears after the variable list error1-error3 tells SAS to set, or initialize, the elements of the array to equal 1, 2, and 3. In general, you initialize an array in this manner, namely by listing as many values as there are elements of the array and separating each pair of values with a space. If you intend for your array to contain character constants, you must put the values in single quotes. For example, the following ARRAY statement tells SAS to define a character array (hence the dollar sign \\$) called weekdays:
\n", "ARRAY weekdays(5) $ ('M' 'T' 'W' 'R' 'F');
\n",
" and to initialize the elements of the array as M, T, W, R, and F.
\n", "The second ARRAY statement uses a name range list to define an array called quldata that contains the ten quality of life variables. The IF/THEN statement uses slightly different logic than the previous program to tell SAS to compare the elements of the quldata array to the elements of the bounds array to determine whether any of the values are out of range.
\n", "Now, launch and run the SAS program. Review the output to convince yourself that just as before qul contains the four observations with clean data, and errors contains the two observations with bad data. Also, note that the three new error variables error1, error2, and error3 remain present in the data set.
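A minimal sketch of the two-array version, with the same assumed data lines as in the array-free check. The exact form of the comparison inside the loop is an assumption; any test that flags a value matching none of the three bounds would do.

```sas
/* Sketch: compare each response with the valid values held in the bounds array. */
data qul errors;
   input subj qul3a qul3b qul3c qul3d qul3e
              qul3f qul3g qul3h qul3i qul3j;
   array bounds(3) error1-error3 (1 2 3);   /* new variables, initialized to 1, 2, 3 */
   array quldata(10) qul3a--qul3j;          /* name range list of the ten responses   */
   flag = 0;
   do i = 1 to 10;
      if quldata(i) ne bounds(1) and
         quldata(i) ne bounds(2) and
         quldata(i) ne bounds(3) then flag = 1;
   end;
   if flag = 1 then output errors;
   else output qul;
   drop i flag;                             /* error1-error3 stay in the output */
   datalines;
110011 1 2 3 3 3 3 2 1 1 3
210012 2 3 4 1 2 2 3 3 1 1
211011 1 2 3 2 1 2 3 2 1 3
310017 1 2 3 3 3 3 3 2 2 1
411020 4 3 3 3 3 2 2 2 2 2
510001 1 1 1 1 1 1 2 1 2 2
;
run;

proc print data = qul;
run;

proc print data = errors;
run;
```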
\n", "The valid values 1, 2, and 3 are needed only temporarily in the previous program. Therefore, we alternatively could have used temporary array elements in defining the bounds array. The following program does just that. It is identical to the previous program except here the bounds array is defined using temporary array elements rather than using three new variables error1, error2, and error3:
\n", "
| Obs | subj | qul3a | qul3b | qul3c | qul3d | qul3e | qul3f | qul3g | qul3h | qul3i | qul3j |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 110011 | 1 | 2 | 3 | 3 | 3 | 3 | 2 | 1 | 1 | 3 |
| 2 | 211011 | 1 | 2 | 3 | 2 | 1 | 2 | 3 | 2 | 1 | 3 |
| 3 | 310017 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 1 |
| 4 | 510001 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 |
\n", "
| Obs | subj | qul3a | qul3b | qul3c | qul3d | qul3e | qul3f | qul3g | qul3h | qul3i | qul3j |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 210012 | 2 | 3 | 4 | 1 | 2 | 2 | 3 | 3 | 1 | 1 |
| 2 | 411020 | 4 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 |
\n", "
If you compare this program to the previous program, you'll see that the only difference here is the presence of the _TEMPORARY_ argument in the definition of the bounds array. The bounds array is again initialized to the three valid values \"(1 2 3)\".
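
As a rough illustration of that one change, here is a shortened sketch that assumes the same layout as the listings above and uses only three of the data lines. The flag variable is again a hypothetical helper.

```sas
/* Sketch only: same check as before, but bounds creates no variables */
data qul errors;
    input subj qul3a qul3b qul3c qul3d qul3e qul3f qul3g qul3h qul3i qul3j;
    /* temporary elements are initialized but never written to a data set */
    array bounds(3) _temporary_ (1 2 3);
    array quldata(10) qul3a--qul3j;
    flag = 0;                          /* hypothetical out-of-range indicator */
    do i = 1 to 10;
        if quldata(i) ne bounds(1) and
           quldata(i) ne bounds(2) and
           quldata(i) ne bounds(3) then flag = 1;
    end;
    if flag = 1 then output errors;
    else output qul;
    drop i flag;
    datalines;
110011 1 2 3 3 3 3 2 1 1 3
210012 2 3 4 1 2 2 3 3 1 1
510001 1 1 1 1 1 1 2 1 2 2
;
run;
```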
\n", "Launch and run the SAS program. Review the output to convince yourself that just as before qul contains the four observations with clean data, and errors contains the two observations with bad data. Also, note that the temporary array elements do not appear in the data set.
\n", "The following program reads the yes/no responses of five subjects to six survey questions (q1, q2, ..., q6) into a temporary SAS data set called survey. A yes response is coded and entered as a 2, while a no response is coded and entered as a 1. Just four of the variables (q3, q4, q5, and q6) are stored in a one-dimensional array called qxs. Then, a DO loop, in conjunction with the DIM function, is used to recode the responses to the four variables so that a 2 is changed to a 1, and a 1 is changed to a 0:
\n", "| Obs | subj | q1 | q2 | q3 | q4 | q5 | q6 |
|---|---|---|---|---|---|---|---|
| 1 | 1001 | 1 | 2 | 0 | 1 | 0 | 0 |
| 2 | 1002 | 2 | 1 | 1 | 1 | 1 | 0 |
| 3 | 1003 | 2 | 2 | 1 | 0 | . | 1 |
| 4 | 1004 | 1 | . | 0 | 0 | 0 | 1 |
| 5 | 1005 | 2 | 1 | 1 | 1 | 1 | 0 |

First, note that although all of the survey variables (q1, ..., q6) are read into the survey data set, the ARRAY statement groups only 4 of the variables (q3, q4, q5, q6) into the one-dimensional array qxs. For example, qxs(1) corresponds to the q3 variable, qxs(2) corresponds to the q4 variable, and so on. Then, rather than telling SAS to process the array from element 1 to element 4, the DO loop tells SAS to process the array from element 1 to the more general DIM(qxs). In general, the DIM function returns the number of elements in the array, which in this case is 4. The DO loop tells SAS to recode the values by simply subtracting 1 from each value. Finally, because the index variable i would otherwise be written to the survey data set by default, it is dropped.
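
A minimal sketch of such a program follows. The raw q3 through q6 values are not shown in this text; they are inferred here by adding 1 back to the recoded values in the listing, so treat the data lines as an assumption.

```sas
/* Sketch only: raw responses inferred from the recoded listing */
data survey;
    input subj q1 q2 q3 q4 q5 q6;
    array qxs(4) q3-q6;            /* qxs(1) is q3, ..., qxs(4) is q6 */
    do i = 1 to dim(qxs);          /* DIM(qxs) returns 4 */
        qxs(i) = qxs(i) - 1;       /* 2 becomes 1, 1 becomes 0 */
    end;
    drop i;
    datalines;
1001 1 2 1 2 1 1
1002 2 1 2 2 2 1
1003 2 2 2 1 . 2
1004 1 . 1 1 1 2
1005 2 1 2 2 2 1
;
run;

proc print data=survey;
run;
```

A missing response stays missing after the subtraction, which is why q5 for subject 1003 still prints as a period.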
\n", "As previously discussed and illustrated, if you do not specifically tell SAS the lower bound of an array, SAS assumes that the lower bound is 1. For most arrays, 1 is a convenient lower bound and the number of elements is a convenient upper bound, so you usually don't need to specify both the lower and upper bounds. However, in cases where it is more convenient, you can modify both bounds for any array dimension.
\n", "In the previous example, perhaps you find it a little awkward that the array element qxs(1) corresponds to the q3 variable, the array element qxs(2) corresponds to the q4 variable, and so on. Perhaps you would find it clearer for the array element qxs(3) to correspond to the q3 variable, the array element qxs(4) to correspond to the q4 variable, ..., and the array element qxs(6) to correspond to the q6 variable. The following program is similar in function to the previous program, except here the task of recoding is accomplished by defining the lower bound of the qxs array to be 3 and the upper bound to be 6:
\n", "| Obs | subj | q1 | q2 | q3 | q4 | q5 | q6 |
|---|---|---|---|---|---|---|---|
| 1 | 1001 | 1 | 2 | 0 | 1 | 0 | 0 |
| 2 | 1002 | 2 | 1 | 1 | 1 | 1 | 0 |
| 3 | 1003 | 2 | 2 | 1 | 0 | . | 1 |
| 4 | 1004 | 1 | . | 0 | 0 | 0 | 1 |
| 5 | 1005 | 2 | 1 | 1 | 1 | 1 | 0 |

If you compare this program with the previous program, you'll see that only two things differ. The first difference is that the ARRAY statement here defines the lower bound of the qxs array to be 3 and the upper bound to be 6. In general, you can always define the lower and upper bounds of any array dimension in this way, namely by specifying the lower bound, then a colon (:), and then the upper bound. The second difference is that, for the DO loop, the bounds on the index variable i are specifically defined here to be between 3 and 6 rather than 1 to DIM(qxs) (which in this case is 4).
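
The two differences can be sketched as follows, using two assumed data lines to keep the example short.

```sas
/* Sketch only: explicit lower and upper bounds of 3 and 6 */
data survey;
    input subj q1 q2 q3 q4 q5 q6;
    array qxs(3:6) q3-q6;          /* qxs(3) is q3, ..., qxs(6) is q6 */
    do i = 3 to 6;
        qxs(i) = qxs(i) - 1;
    end;
    drop i;
    datalines;
1001 1 2 1 2 1 1
1002 2 1 2 2 2 1
;
run;
```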
\n", "Now, there's still a little bit more that we can do to automate the handling of the bounds of an array dimension. The following program again uses a one-dimensional array qxs to recode four survey variables as did the previous two programs. Here, though, an asterisk (*) is used to tell SAS to determine the dimension of the qxs array, and the LBOUND and HBOUND functions are used to tell SAS to determine, respectively, the lower and upper bounds of the DO loop's index variable dynamically:
\n", "| Obs | subj | q1 | q2 | q3 | q4 | q5 | q6 |
|---|---|---|---|---|---|---|---|
| 1 | 1001 | 1 | 2 | 0 | 1 | 0 | 0 |
| 2 | 1002 | 2 | 1 | 1 | 1 | 1 | 0 |
| 3 | 1003 | 2 | 2 | 1 | 0 | . | 1 |
| 4 | 1004 | 1 | . | 0 | 0 | 0 | 1 |
| 5 | 1005 | 2 | 1 | 1 | 1 | 1 | 0 |

If you compare this program with the previous program, you'll see that only two things differ. The first difference is that the asterisk (*) that appears in the ARRAY statement tells SAS to determine the bounds on the dimensions of the array during the declaration of qxs. SAS counts the number of elements in the array and determines that the dimension of qxs is 4. The second difference is that, for the DO loop, the bounds on the index variable i are determined dynamically to be between LBOUND(qxs) and HBOUND(qxs).
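
A short sketch of this version, again with two assumed data lines:

```sas
/* Sketch only: SAS sizes the array and supplies the loop bounds */
data survey;
    input subj q1 q2 q3 q4 q5 q6;
    array qxs(*) q3-q6;                     /* SAS counts four elements */
    do i = lbound(qxs) to hbound(qxs);      /* here, 1 to 4 */
        qxs(i) = qxs(i) - 1;
    end;
    drop i;
    datalines;
1001 1 2 1 2 1 1
1002 2 1 2 2 2 1
;
run;
```

With this form, adding a variable to the array's variable list requires no change to the DO loop.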
\n", "\n",
"dog1 dog2 dog3 dog4\n",
"cat1 cat2 cat3 cat4\n",
"
\n",
"\n",
"As the previous ARRAY statement suggests, to define a two-dimensional array, you specify the number of elements in each dimension, separated by a comma. In general, the first dimension number tells SAS how many rows your array needs, while the second dimension number tells SAS how many columns your array needs.\n",
"\n",
"When you define a two-dimensional array, the array elements are grouped in the order in which they appear in the ARRAY statement. For example, SAS assigns the elements of the array horse:\n",
"\n",
"`ARRAY horse(3,5) x1-x15;`\n",
"\n",
"as follows:\n",
"\n",
"\n",
"x1 x2 x3 x4 x5\n",
"x6 x7 x8 x9 x10\n",
"x11 x12 x13 x14 x15\n",
"
\n",
"\n",
"In this section, we'll look at two examples that involve checking a subset of Family History data for missing values. We'll use one two-dimensional array — the first dimension to store the actual data and the second dimension to store binary status variables that indicate whether a particular data value is missing or not.\n",
"\n",
"This program searches a subset of the family history data for missing values. Here, we use one two-dimensional array called edit. The first row contains the actual data and the second row contains a 0/1 indicator of missingness for the observed data in the corresponding column of the two-dimensional array.
\n", "| Obs | fhx1 | fhx2 | fhx3 | fhx4 | fhx5 | fhx6 | fhx7 | fhx8 | fhx9 | fhx10 | fhx11 | fhx12 | fhx13 | fhx14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 0 | 0 | 0 | . | 8 | 0 | 0 | 1 | 1 | 1 | . | 1 | 0 |
| 2 | 3 | 0 | 0 | 0 | . | 0 | 0 | 0 | 0 | 0 | 0 | . | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 | . | 0 | 0 | 0 | 0 | 0 | 0 | . | 0 | 0 |
| 4 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 5 | 5 | 0 | 0 | 0 | 0 | 1 | 8 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |

| Obs | stat1 | stat2 | stat3 | stat4 | stat5 | stat6 | stat7 | stat8 | stat9 | stat10 | stat11 | stat12 | stat13 | stat14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

We have just one ARRAY statement that defines the two-dimensional array edit containing 2 rows and 14 columns. The ARRAY statement tells SAS to group the family history variables (fhx1, ..., fhx14) into the first dimension and to group the status variables (stat1, ..., stat14) into the second dimension. Then, the DO loop tells SAS to review the contents of the 14 variables and to assign each element of the status dimension a value of 0 (\"edit(2,i) = 0;\"). If the element of the edit dimension is missing, however, then SAS is told to change the element of the status dimension from a 0 to a 1 (\"if edit(1,i) = . then edit(2,i) = 1\").
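
The statements quoted above fit together roughly as in this sketch. The data set name fhx is hypothetical, and only two of the listed observations are used.

```sas
/* Sketch only: row 1 of edit holds the data, row 2 the 0/1 status flags */
data fhx;
    input fhx1-fhx14;
    array edit(2,14) fhx1-fhx14 stat1-stat14;
    do i = 1 to 14;
        edit(2,i) = 0;                          /* assume not missing */
        if edit(1,i) = . then edit(2,i) = 1;    /* flag a missing value */
    end;
    drop i;
    datalines;
3 0 0 0 . 8 0 0 1 1 1 . 1 0
5 0 0 0 0 0 1 1 0 0 1 1 0 1
;
run;
```

SAS fills the array row by row, so the first 14 variables in the list land in row 1 and the next 14 in row 2.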
\n", "| Obs | id | visit1 | visit2 | visit3 |
|---|---|---|---|---|
| 1 | 1 | 10 | 4 | 3 |
| 2 | 2 | 5 | 6 | . |

| Obs | id | visit | value |
|---|---|---|---|
| 1 | 1 | 1 | 10 |
| 2 | 1 | 2 | 4 |
| 3 | 1 | 3 | 3 |
| 4 | 2 | 1 | 5 |
| 5 | 2 | 2 | 6 |
| 6 | 2 | 3 | . |

| idno | l_name | gtype | grade |
|---|---|---|---|
| 10 | Smith | E1 | 78 |
| 10 | Smith | E2 | 82 |
| 10 | Smith | E3 | 86 |
| 10 | Smith | E4 | 69 |
| 10 | Smith | P1 | 97 |
| 10 | Smith | F1 | 160 |
| 11 | Simon | E1 | 88 |
| 11 | Simon | E2 | 72 |
| 11 | Simon | E3 | 86 |
| 11 | Simon | E4 | 99 |
| 11 | Simon | P1 | 100 |
| 11 | Simon | F1 | 170 |
| 12 | Jones | E1 | 98 |
| 12 | Jones | E2 | 92 |
| 12 | Jones | E3 | 92 |
| 12 | Jones | E4 | 99 |
| 12 | Jones | P1 | 99 |
| 12 | Jones | F1 | 185 |

In this example, we will transpose the tallgrades dataset from long to wide format by using a DATA step. This will require the use of an array and both the RETAIN and OUTPUT statements, and the FIRST. and LAST. SAS variables.
\n", "| Obs | idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
|---|---|---|---|---|---|---|---|---|
| 1 | 10 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |
| 2 | 11 | Simon | 88 | 72 | 86 | 99 | 100 | 170 |
| 3 | 12 | Jones | 98 | 92 | 92 | 99 | 99 | 185 |

Yikes! This code looks scary! Let's dissect it a bit. First, the tallgrades data set is processed BY idno. Doing so makes the first.idno and last.idno variables available for us to use. The ARRAY statement defines an array called allgrades and, using a numbered range list, associates the array with six (uninitialized) variables E1, E2, E3, E4, P1, and F1. The allgrades array is used to hold the six grades for each student before they are output in their transposed direction to the fatgrades data set. Because the elements of any array, and therefore allgrades, must be assigned using an index variable, this is how the transposition takes place:
\n", "The program would keep cycling through the above five steps until it encountered the last observation in the data set. Then, the variables i, gtype, and grade would be dropped from the output fatgrades data set.
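
Since neither the program nor the five steps are reproduced here, the following is only one way such a DATA step could be written, assuming the data set and variable names mentioned above. The variables are listed explicitly in the ARRAY statement, and the CALL MISSING reset is an added safeguard rather than something the text describes.

```sas
/* Stand-in copy of part of tallgrades, taken from the listing above */
data tallgrades;
    input idno l_name $ gtype $ grade @@;
    datalines;
10 Smith E1 78 10 Smith E2 82 10 Smith E3 86
10 Smith E4 69 10 Smith P1 97 10 Smith F1 160
11 Simon E1 88 11 Simon E2 72 11 Simon E3 86
11 Simon E4 99 11 Simon P1 100 11 Simon F1 170
;
run;

/* Sketch only: collect each student's six grades, then write one row */
data fatgrades;
    set tallgrades;
    by idno;                              /* creates first.idno and last.idno */
    array allgrades(6) E1 E2 E3 E4 P1 F1;
    retain E1 E2 E3 E4 P1 F1;             /* keep values across rows */
    if first.idno then do;
        i = 0;                            /* restart the index for a new student */
        call missing(of allgrades(*));    /* clear the previous student's grades */
    end;
    i + 1;                                /* sum statement: i is retained */
    allgrades(i) = grade;
    if last.idno then output;             /* one wide row per student */
    drop i gtype grade;
run;
```

This version places grades by position, so it relies on every student's rows arriving in the order E1, E2, E3, E4, P1, F1.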
\n", "SAS also has a procedure called PROC TRANSPOSE that can be used to transpose datasets between wide and tall formats. Personally, I find this procedure somewhat unintuitive compared to the DATA step method (at least once you get used to using the DATA step), so I tend to always use the DATA step. However, in this next example we will show how to perform the same transposition using PROC TRANSPOSE and leave it to the reader to decide which method is preferred.
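
As a hedged sketch of the PROC TRANSPOSE route, the step below assumes the tallgrades layout shown earlier and recreates part of it so the example stands alone.

```sas
/* Stand-in copy of part of tallgrades, taken from the listing above */
data tallgrades;
    input idno l_name $ gtype $ grade @@;
    datalines;
10 Smith E1 78 10 Smith E2 82 10 Smith E3 86
10 Smith E4 69 10 Smith P1 97 10 Smith F1 160
11 Simon E1 88 11 Simon E2 72 11 Simon E3 86
11 Simon E4 99 11 Simon P1 100 11 Simon F1 170
;
run;

/* Sketch only: long to wide */
proc transpose data=tallgrades out=fatgrades(drop=_name_);
    by idno l_name;    /* one output row per student */
    id gtype;          /* values of gtype become the new column names */
    var grade;         /* values of grade fill the new columns */
run;
```

The input must already be sorted by the BY variables, which holds for the listing above.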
\n", "| Obs | idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
|---|---|---|---|---|---|---|---|---|
| 1 | 10 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |
| 2 | 11 | Simon | 88 | 72 | 86 | 99 | 100 | 170 |
| 3 | 12 | Jones | 98 | 92 | 92 | 99 | 99 | 185 |

In PROC TRANSPOSE,
\n", "In this example, we will use a DATA step to transpose the grades dataset from wide back to tall format.
\n", "| idno | l_name | gtype | grade |
|---|---|---|---|
| 10 | Smith | E1 | 78 |
| 10 | Smith | E2 | 82 |
| 10 | Smith | E3 | 86 |
| 10 | Smith | E4 | 69 |
| 10 | Smith | P1 | 97 |
| 10 | Smith | F1 | 160 |
| 11 | Simon | E1 | 88 |
| 11 | Simon | E2 | 72 |
| 11 | Simon | E3 | 86 |
| 11 | Simon | E4 | 99 |
| 11 | Simon | P1 | 100 |
| 11 | Simon | F1 | 170 |
| 12 | Jones | E1 | 98 |
| 12 | Jones | E2 | 92 |
| 12 | Jones | E3 | 92 |
| 12 | Jones | E4 | 99 |
| 12 | Jones | P1 | 99 |
| 12 | Jones | F1 | 185 |

We create two arrays: gtypes is a temporary character array that supplies the column names (the assessment types) for the variable gtype, and grades holds the grades on the current row (i.e., the current student) to be assigned to the grade variable. Each iteration of the DO loop executes an OUTPUT statement to put each grade in its own row. Note that idno and l_name are carried over from row to row until we run out of grades to OUTPUT. Then we exit the DO loop and move on to the next row (student). We drop the (wide) columns E1, E2, E3, E4, P1, F1, and the index i from the final dataset to obtain the original tall dataset.
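
A minimal sketch of a DATA step along these lines is shown below. The wide data set is assumed to be named grades and is rebuilt from the listing so the example stands alone.

```sas
/* Stand-in wide data set, taken from the listing earlier in this section */
data grades;
    input idno l_name $ E1 E2 E3 E4 P1 F1;
    datalines;
10 Smith 78 82 86 69 97 160
11 Simon 88 72 86 99 100 170
12 Jones 98 92 92 99 99 185
;
run;

/* Sketch only: wide to long with two parallel arrays */
data tallgrades;
    set grades;
    /* temporary character array holding the assessment names */
    array gtypes(6) $ 2 _temporary_ ('E1' 'E2' 'E3' 'E4' 'P1' 'F1');
    array grades(6) E1 E2 E3 E4 P1 F1;
    do i = 1 to 6;
        gtype = gtypes(i);     /* former column name */
        grade = grades(i);     /* value from that column */
        output;                /* one row per grade */
    end;
    drop E1 E2 E3 E4 P1 F1 i;
run;
```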
\n", "An alternative way to get the column names is to use the VNAME function instead of manually listing the names in a character array. The VNAME function, when applied to a variable, returns the variable name as a character value.
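
Here is a sketch of that alternative under the same assumptions about the wide grades data set. The LENGTH statement is an added detail so that gtype does not take the long default length that VNAME would otherwise give it.

```sas
/* Stand-in wide data set, taken from the listing earlier in this section */
data grades;
    input idno l_name $ E1 E2 E3 E4 P1 F1;
    datalines;
10 Smith 78 82 86 69 97 160
11 Simon 88 72 86 99 100 170
12 Jones 98 92 92 99 99 185
;
run;

/* Sketch only: VNAME supplies the column names */
data tallgrades;
    set grades;
    length gtype $ 32;
    array grades(6) E1 E2 E3 E4 P1 F1;
    do i = 1 to dim(grades);
        gtype = vname(grades(i));    /* name of the i-th array variable */
        grade = grades(i);
        output;
    end;
    drop E1 E2 E3 E4 P1 F1 i;
run;
```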
\n", "| idno | l_name | gtype | grade |
|---|---|---|---|
| 10 | Smith | E1 | 78 |
| 10 | Smith | E2 | 82 |
| 10 | Smith | E3 | 86 |
| 10 | Smith | E4 | 69 |
| 10 | Smith | P1 | 97 |
| 10 | Smith | F1 | 160 |
| 11 | Simon | E1 | 88 |
| 11 | Simon | E2 | 72 |
| 11 | Simon | E3 | 86 |
| 11 | Simon | E4 | 99 |
| 11 | Simon | P1 | 100 |
| 11 | Simon | F1 | 170 |
| 12 | Jones | E1 | 98 |
| 12 | Jones | E2 | 92 |
| 12 | Jones | E3 | 92 |
| 12 | Jones | E4 | 99 |
| 12 | Jones | P1 | 99 |
| 12 | Jones | F1 | 185 |

Now we will do the same operation as the previous example but using PROC TRANSPOSE.
\n", "| Obs | idno | l_name | gtype | grade |
|---|---|---|---|---|
| 1 | 10 | Smith | E1 | 78 |
| 2 | 10 | Smith | E2 | 82 |
| 3 | 10 | Smith | E3 | 86 |
| 4 | 10 | Smith | E4 | 69 |
| 5 | 10 | Smith | P1 | 97 |
| 6 | 10 | Smith | F1 | 160 |
| 7 | 11 | Simon | E1 | 88 |
| 8 | 11 | Simon | E2 | 72 |
| 9 | 11 | Simon | E3 | 86 |
| 10 | 11 | Simon | E4 | 99 |
| 11 | 11 | Simon | P1 | 100 |
| 12 | 11 | Simon | F1 | 170 |
| 13 | 12 | Jones | E1 | 98 |
| 14 | 12 | Jones | E2 | 92 |
| 15 | 12 | Jones | E3 | 92 |
| 16 | 12 | Jones | E4 | 99 |
| 17 | 12 | Jones | P1 | 99 |
| 18 | 12 | Jones | F1 | 185 |

The BY statement names the grouping variables that identify a single observation; they are copied to each new row when going from wide to long. The VAR statement names all the columns that should be gathered into multiple rows. PROC TRANSPOSE creates two new columns: _NAME_, which holds the former column name, and COL1, which holds the data value from that former column in the current row. Typically, we will want to change these default names by using the RENAME= data set option.
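
A sketch of such a step, assuming the wide data set is named grades and rebuilding it from the listing so the example stands alone:

```sas
/* Stand-in wide data set, taken from the listing earlier in this section */
data grades;
    input idno l_name $ E1 E2 E3 E4 P1 F1;
    datalines;
10 Smith 78 82 86 69 97 160
11 Simon 88 72 86 99 100 170
12 Jones 98 92 92 99 99 185
;
run;

/* Sketch only: wide to long, renaming the default output columns */
proc transpose data=grades
               out=tallgrades(rename=(_name_=gtype col1=grade));
    by idno l_name;           /* copied onto every new row */
    var E1 E2 E3 E4 P1 F1;    /* columns gathered into rows */
run;
```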
\n", "| Store | Day | Sales |
|---|---|---|
| 1 | M | 1200 |
| 1 | T | 1435 |
| 1 | W | 1712 |
| 1 | R | 1529 |
| 1 | F | 1920 |
| 1 | S | 2325 |

| Store | Day | Sales |
|---|---|---|
| 2 | M | 2215 |
| 2 | T | 2458 |
| 2 | W | 1789 |
| 2 | R | 1692 |
| 2 | F | 2105 |
| 2 | S | 2847 |

| Store | Day | Sales |
|---|---|---|
| 1 | M | 1200 |
| 1 | T | 1435 |
| 1 | W | 1712 |
| 1 | R | 1529 |
| 1 | F | 1920 |
| 1 | S | 2325 |
| 2 | M | 2215 |
| 2 | T | 2458 |
| 2 | W | 1789 |
| 2 | R | 1692 |
| 2 | F | 2105 |
| 2 | S | 2847 |

The following program concatenates the store1 and store2 data sets to create a new \"tall\" data set called bothstores:
\n", "| Store | Day | Sales |
|---|---|---|
| 1 | M | 1200 |
| 1 | T | 1435 |
| 1 | W | 1712 |
| 1 | R | 1529 |
| 1 | F | 1920 |
| 1 | S | 2325 |
| 2 | M | 2215 |
| 2 | T | 2458 |
| 2 | W | 1798 |
| 2 | R | 1692 |
| 2 | F | 2105 |
| 2 | S | 2847 |

Note that the input data sets — store1 and store2 — contain the same variables — Store, Day, and Sales — with identical attributes. In the third DATA step, the DATA statement tells SAS to create a new data set called bothstores, and the SET statement tells SAS that the data set should contain first the observations from store1 and then the observations from store2. Note that although we have specified only two input data sets here, the SET statement can contain any number of input data sets.
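
The three DATA steps described here might look like the following sketch, with the data lines copied from the store1 and store2 tables above.

```sas
/* Sketch only: two input data sets with identical variables */
data store1;
    input Store Day $ Sales;
    datalines;
1 M 1200
1 T 1435
1 W 1712
1 R 1529
1 F 1920
1 S 2325
;
run;

data store2;
    input Store Day $ Sales;
    datalines;
2 M 2215
2 T 2458
2 W 1789
2 R 1692
2 F 2105
2 S 2847
;
run;

/* Concatenate: all of store1 first, then all of store2 */
data bothstores;
    set store1 store2;
run;

proc print data=bothstores noobs;
run;
```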
\n", "Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that SAS did indeed concatenate the store1 and store2 data sets to make one \"tall\" data set called bothstores. You might then want to edit the SET statement so that store1 follows store2, and re-run the SAS program to see that then the contents of store1 follow the contents of store2 in the bothstores data set.
\n", "In general, a data set that is created by concatenating data sets contains all of the variables and all of the observations from all of the input data sets. Therefore, the number of variables the new data set contains always equals the total number of unique variables among all of the input data sets. And, the number of observations in the new data set is the sum of the numbers of observations in the input data sets. Let's return to the contrived example we've used throughout this lesson.
\n", "The following program concatenates the one and two data sets to create a new \"tall\" data set called onetopstwo:
\n", "| ID | VarA | VarB | VarC |
|---|---|---|---|
| 10 | A1 | B1 |  |
| 20 | A2 | B2 |  |
| 30 | A3 | B3 |  |
| 40 |  | B4 | C1 |
| 50 |  | B5 | C2 |

As you review the first two DATA steps, in which SAS reads in the respective one and two data sets, note that the total number of unique variables is four — ID, VarA, VarB, and VarC. The total number of observations among the two input data sets is 3 + 2 = 5. Therefore, we can expect the concatenated data set onetopstwo to contain four variables and five observations. Launch and run the SAS program, and review the output to convince yourself that SAS did grab first all of the variables and all of the observations from the one data set and then all of the variables and all of the observations from the two data set. As you can see, to make it all work out okay, observations arising from the one data set have missing values for VarC, and observations from the two data set have missing values for VarA.
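
A sketch of this example, with the values taken from the output table above and the variables assumed to be character except for ID:

```sas
/* Sketch only: the two inputs share ID and VarB but not VarA or VarC */
data one;
    input ID VarA $ VarB $;
    datalines;
10 A1 B1
20 A2 B2
30 A3 B3
;
run;

data two;
    input ID VarB $ VarC $;
    datalines;
40 B4 C1
50 B5 C2
;
run;

/* Four unique variables and 3 + 2 = 5 observations */
data onetopstwo;
    set one two;
run;

proc print data=onetopstwo noobs;
run;
```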
\n", "In the following SAS program, we will perform an outer join of the base and visits datasets by merging on the patient id variable.
\n", "| Obs | id | age |
|---|---|---|
| 1 | 1 | 50 |
| 2 | 2 | 51 |
| 3 | 3 | 52 |
| 4 | 4 | 53 |
| 5 | 5 | 54 |
| 6 | 6 | 55 |
| 7 | 7 | 56 |
| 8 | 8 | 57 |
| 9 | 9 | 58 |
| 10 | 10 | 59 |

| Obs | id | visit | outcome |
|---|---|---|---|
| 1 | 1 | 1 | 11 |
| 2 | 1 | 2 | 12 |
| 3 | 1 | 3 | 13 |
| 4 | 2 | 1 | 21 |
| 5 | 2 | 2 | 22 |
| 6 | 2 | 3 | 23 |
| 7 | 3 | 1 | 31 |
| 8 | 3 | 2 | 32 |
| 9 | 3 | 3 | 33 |
| 10 | 4 | 1 | 41 |
| 11 | 4 | 2 | 42 |
| 12 | 4 | 3 | 43 |
| 13 | 5 | 1 | 51 |
| 14 | 5 | 2 | 52 |
| 15 | 5 | 3 | 53 |
| 16 | 6 | 1 | 61 |
| 17 | 6 | 2 | 62 |
| 18 | 6 | 3 | 63 |
| 19 | 7 | 1 | 71 |
| 20 | 7 | 2 | 72 |
| 21 | 7 | 3 | 73 |
| 22 | 8 | 1 | 81 |
| 23 | 8 | 2 | 82 |
| 24 | 8 | 3 | 83 |
| 25 | 11 | 3 | 50 |

To perform an outer join between the base and visits datasets, we simply use the MERGE statement with these two datasets and a BY statement with the common variable id. Note that these two datasets are already sorted by id. If they were not, we would need to sort both datasets by id first.
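
A minimal sketch of this merge follows. The two input data sets are rebuilt with DO loops that reproduce the values in the listings above, and the output name merged is hypothetical.

```sas
/* Stand-in inputs that reproduce the base and visits listings */
data base;
    do id = 1 to 10;
        age = 49 + id;
        output;
    end;
run;

data visits;
    do id = 1 to 8;
        do visit = 1 to 3;
            outcome = 10 * id + visit;
            output;
        end;
    end;
    id = 11; visit = 3; outcome = 50; output;
run;

/* Sketch only: outer join, keeping every id from either data set */
data merged;
    merge base visits;
    by id;
run;

proc print data=merged;
run;
```

If the inputs were not already in id order, a PROC SORT with BY id on each would have to come first.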
\n", "| Obs | id | age | visit | outcome |
|---|---|---|---|---|
| 1 | 1 | 50 | 1 | 11 |
| 2 | 1 | 50 | 2 | 12 |
| 3 | 1 | 50 | 3 | 13 |
| 4 | 2 | 51 | 1 | 21 |
| 5 | 2 | 51 | 2 | 22 |
| 6 | 2 | 51 | 3 | 23 |
| 7 | 3 | 52 | 1 | 31 |
| 8 | 3 | 52 | 2 | 32 |
| 9 | 3 | 52 | 3 | 33 |
| 10 | 4 | 53 | 1 | 41 |
| 11 | 4 | 53 | 2 | 42 |
| 12 | 4 | 53 | 3 | 43 |
| 13 | 5 | 54 | 1 | 51 |
| 14 | 5 | 54 | 2 | 52 |
| 15 | 5 | 54 | 3 | 53 |
| 16 | 6 | 55 | 1 | 61 |
| 17 | 6 | 55 | 2 | 62 |
| 18 | 6 | 55 | 3 | 63 |
| 19 | 7 | 56 | 1 | 71 |
| 20 | 7 | 56 | 2 | 72 |
| 21 | 7 | 56 | 3 | 73 |
| 22 | 8 | 57 | 1 | 81 |
| 23 | 8 | 57 | 2 | 82 |
| 24 | 8 | 57 | 3 | 83 |
| 25 | 9 | 58 | . | . |
| 26 | 10 | 59 | . | . |
| 27 | 11 | . | 3 | 50 |

In this merge, all variables and rows from both datasets are kept, with the id columns from both datasets merged into a single column. Note that for ids that appeared in base but not in visits, the variables found only in visits (visit and outcome) were set to missing (see ids 9 and 10). Likewise, for ids in visits that were not in base, the variables found only in base (age in this case) were set to missing (see id 11).
\n", "Let's merge base and visit again, but this time create the IN= variables for each dataset and store their values in two permanent variables in_base and in_visit:
\n", "| Obs | id | age | visit | outcome | in_base | in_visit |
|---|---|---|---|---|---|---|
| 22 | 8 | 57 | 1 | 81 | 1 | 1 |
| 23 | 8 | 57 | 2 | 82 | 1 | 1 |
| 24 | 8 | 57 | 3 | 83 | 1 | 1 |
| 25 | 9 | 58 | . | . | 1 | 0 |
| 26 | 10 | 59 | . | . | 1 | 0 |
| 27 | 11 | . | 3 | 50 | 0 | 1 |

The ids for patients 1 through 8 appear in both datasets, so for all of the rows corresponding to these ids both in_base and in_visit have the value 1, since we combined data from both datasets to create these rows. Ids 9 and 10 appeared only in the base dataset, so in_base is 1 and in_visit is 0, since there was no data from visit to use to make these rows. Similarly, for id 11 the only data was contained in the visit dataset, so in_visit is 1 and in_base is 0.
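
A sketch of how the IN= variables can be copied into permanent variables is shown below. The temporary names b and v and the output name merged are hypothetical, and the inputs are rebuilt to match the listings.

```sas
/* Stand-in inputs that reproduce the base and visits listings */
data base;
    do id = 1 to 10;
        age = 49 + id;
        output;
    end;
run;

data visits;
    do id = 1 to 8;
        do visit = 1 to 3;
            outcome = 10 * id + visit;
            output;
        end;
    end;
    id = 11; visit = 3; outcome = 50; output;
run;

/* Sketch only: IN= variables are temporary, so copy them to keep them */
data merged;
    merge base(in=b) visits(in=v);
    by id;
    in_base  = b;     /* 1 if base contributed to this row */
    in_visit = v;     /* 1 if visits contributed to this row */
run;
```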
\n", "The following SAS program performs an inner join between the base and visit datasets. This means that we will only keep rows in which the merging variable, id, appears in both datasets.
\n", "| Obs | id | age | visit | outcome |
|---|---|---|---|---|
| 1 | 1 | 50 | 1 | 11 |
| 2 | 1 | 50 | 2 | 12 |
| 3 | 1 | 50 | 3 | 13 |
| 4 | 2 | 51 | 1 | 21 |
| 5 | 2 | 51 | 2 | 22 |
| 6 | 2 | 51 | 3 | 23 |
| 7 | 3 | 52 | 1 | 31 |
| 8 | 3 | 52 | 2 | 32 |
| 9 | 3 | 52 | 3 | 33 |
| 10 | 4 | 53 | 1 | 41 |
| 11 | 4 | 53 | 2 | 42 |
| 12 | 4 | 53 | 3 | 43 |
| 13 | 5 | 54 | 1 | 51 |
| 14 | 5 | 54 | 2 | 52 |
| 15 | 5 | 54 | 3 | 53 |
| 16 | 6 | 55 | 1 | 61 |
| 17 | 6 | 55 | 2 | 62 |
| 18 | 6 | 55 | 3 | 63 |
| 19 | 7 | 56 | 1 | 71 |
| 20 | 7 | 56 | 2 | 72 |
| 21 | 7 | 56 | 3 | 73 |
| 22 | 8 | 57 | 1 | 81 |
| 23 | 8 | 57 | 2 | 82 |
| 24 | 8 | 57 | 3 | 83 |

Now only the rows made from ids 1 to 8 are in the merged dataset because these were the only ids that occurred in both the base and visit datasets.
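
One common way to express this, sketched here with hypothetical IN= names b and v and a hypothetical output name inner, is a subsetting IF on both IN= variables:

```sas
/* Stand-in inputs that reproduce the base and visits listings */
data base;
    do id = 1 to 10;
        age = 49 + id;
        output;
    end;
run;

data visits;
    do id = 1 to 8;
        do visit = 1 to 3;
            outcome = 10 * id + visit;
            output;
        end;
    end;
    id = 11; visit = 3; outcome = 50; output;
run;

/* Sketch only: inner join keeps rows found in both inputs */
data inner;
    merge base(in=b) visits(in=v);
    by id;
    if b and v;       /* subsetting IF: drop ids 9, 10, and 11 */
run;
```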
\n", "In this example, we will perform left and right joins using the IN= variables. A left join keeps all records from the left dataset (i.e., the first dataset in the MERGE statement), joins matching records from the right dataset, and drops the rest. The reverse happens for a right join.
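
Both joins can be sketched with a subsetting IF on a single IN= variable. The names b, v, left, and right are hypothetical, and the inputs are rebuilt to match the earlier listings.

```sas
/* Stand-in inputs that reproduce the base and visits listings */
data base;
    do id = 1 to 10;
        age = 49 + id;
        output;
    end;
run;

data visits;
    do id = 1 to 8;
        do visit = 1 to 3;
            outcome = 10 * id + visit;
            output;
        end;
    end;
    id = 11; visit = 3; outcome = 50; output;
run;

/* Sketch only: left join keeps every id found in base */
data left;
    merge base(in=b) visits(in=v);
    by id;
    if b;
run;

/* Sketch only: right join keeps every id found in visits */
data right;
    merge base(in=b) visits(in=v);
    by id;
    if v;
run;
```

Under these assumptions, left keeps ids 1 through 10 and right keeps ids 1 through 8 plus id 11.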
\n", "| Obs | id | age | visit | outcome |
|---|---|---|---|---|
| 1 | 1 | 50 | 1 | 11 |
| 2 | 1 | 50 | 2 | 12 |
| 3 | 1 | 50 | 3 | 13 |
| 4 | 2 | 51 | 1 | 21 |
| 5 | 2 | 51 | 2 | 22 |
| 6 | 2 | 51 | 3 | 23 |
| 7 | 3 | 52 | 1 | 31 |
| 8 | 3 | 52 | 2 | 32 |
| 9 | 3 | 52 | 3 | 33 |
| 10 | 4 | 53 | 1 | 41 |
| 11 | 4 | 53 | 2 | 42 |
| 12 | 4 | 53 | 3 | 43 |
| 13 | 5 | 54 | 1 | 51 |
| 14 | 5 | 54 | 2 | 52 |
| 15 | 5 | 54 | 3 | 53 |
| 16 | 6 | 55 | 1 | 61 |
| 17 | 6 | 55 | 2 | 62 |
| 18 | 6 | 55 | 3 | 63 |
| 19 | 7 | 56 | 1 | 71 |
| 20 | 7 | 56 | 2 | 72 |
| 21 | 7 | 56 | 3 | 73 |
| 22 | 8 | 57 | 1 | 81 |
| 23 | 8 | 57 | 2 | 82 |
| 24 | 8 | 57 | 3 | 83 |
| 25 | 9 | 58 | . | . |
| 26 | 10 | 59 | . | . |

| Obs | id | age | visit | outcome |
|---|---|---|---|---|
| 1 | 1 | 50 | 1 | 11 |
| 2 | 1 | 50 | 2 | 12 |
| 3 | 1 | 50 | 3 | 13 |
| 4 | 2 | 51 | 1 | 21 |
| 5 | 2 | 51 | 2 | 22 |
| 6 | 2 | 51 | 3 | 23 |
| 7 | 3 | 52 | 1 | 31 |
| 8 | 3 | 52 | 2 | 32 |
| 9 | 3 | 52 | 3 | 33 |
| 10 | 4 | 53 | 1 | 41 |
| 11 | 4 | 53 | 2 | 42 |
| 12 | 4 | 53 | 3 | 43 |
| 13 | 5 | 54 | 1 | 51 |
| 14 | 5 | 54 | 2 | 52 |
| 15 | 5 | 54 | 3 | 53 |
| 16 | 6 | 55 | 1 | 61 |
| 17 | 6 | 55 | 2 | 62 |
| 18 | 6 | 55 | 3 | 63 |
| 19 | 7 | 56 | 1 | 71 |
| 20 | 7 | 56 | 2 | 72 |
| 21 | 7 | 56 | 3 | 73 |
| 22 | 8 | 57 | 1 | 81 |
| 23 | 8 | 57 | 2 | 82 |
| 24 | 8 | 57 | 3 | 83 |
| 25 | 11 | . | 3 | 50 |