{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Manipulation\n", "\n", "In this section, we will cover three main topics:\n", "\n", "1. Reshaping data from wide (fat) to long (tall) formats\n", "2. Reshaping data from long (tall) to wide (fat) formats\n", "3. Merging datasets\n", "\n", "To reshape datasets, we will cover two methods:\n", "\n", "* PROC TRANSPOSE\n", "* Using a DATA step with arrays\n", "\n", "In order to understand the second, more general method, we will first need to learn about a few SAS programming keywords and structures, such as:\n", "\n", "* The OUTPUT and RETAIN statements\n", "* Loops in SAS\n", "* SAS Arrays\n", "* FIRST. and LAST. SAS variables\n", "\n", "## The OUTPUT and RETAIN Statements\n", "\n", "When processing any DATA step, SAS follows two default procedures:\n", "\n", "1. When SAS reads the DATA statement at the beginning of each iteration of the DATA step, SAS places missing values in the program data vector for variables that were assigned by either an INPUT statement or an assignment statement within the DATA step. (SAS does not reset variables to missing if they were created by a SUM statement, or if the values came from a SAS data set via a SET or MERGE statement.)\n", "2. 
At the end of each iteration of the DATA step, SAS outputs the values of the variables in the program data vector to the SAS data set being created.\n", "\n", "In this lesson, we'll learn how to modify these default processes by using the OUTPUT and RETAIN statements:\n", "\n", "* The **OUTPUT** statement allows you to control when and to which data set you want an observation written.\n", "* The **RETAIN** statement causes a variable created in the DATA step to retain its value from the current observation into the next observation rather than being set to missing at the beginning of each iteration of the DATA step.\n", "\n", "### The OUTPUT Statement\n", "\n", "An OUTPUT statement overrides the default process by telling SAS to output the current observation when the OUTPUT statement is processed, not at the end of the DATA step. The OUTPUT statement takes the form:\n", "\n", "`OUTPUT dataset1 dataset2 ... datasetn;`\n", "\n", "where you may name as few or as many data sets as you like. If you use an OUTPUT statement without specifying a data set name, SAS writes the current observation to each of the data sets named in the DATA step. Any data set name appearing in the OUTPUT statement must also appear in the DATA statement.\n", "\n", "The OUTPUT statement is pretty powerful in that, among other things, it gives us a way:\n", "\n", "* to write observations to multiple data sets\n", "* to control output of observations to data sets based on certain conditions\n", "* to transpose datasets using the OUTPUT statement in conjunction with the RETAIN statement, BY group processing and the LAST.variable temporary variable\n", "\n", "Throughout the rest of this section, we'll look at examples that illustrate how to use OUTPUT statements correctly. 
We'll work with the following subset of the ICDB Study's log data set (see the course website for icdblog.sas7bdat):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SAS Connection established. Subprocess id is 2262\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ObsSUBJV_TYPEV_DATEFORM
12100061205/06/94cmed
22100061205/06/94diet
32100061205/06/94med
42100061205/06/94phytrt
52100061205/06/94purg
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 \"/folders/myfolders/SAS_Notes/data\";\n", "\n", "PROC PRINT data = phc6089.icdblog (obs=5);\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, this log data set contains four variables:\n", "\n", "* `subj`: the subject's identification number\n", "* `v_type`: the type of clinic visit, which means the number of months since the subject was first seen in the clinic\n", "* `v_date`: the date of the clinic visit\n", "* `form`: codes that indicate the data forms that were completed during the subject's clinic visit\n", "\n", "The log data set is a rather typical data set that arises from large national clinical studies in which there are a number of sites around the country where data are collected. Typically, the clinical sites collect the data on data forms and then \"ship\" the data forms either electronically or by mail to a centralized location called a Data Coordinating Center (DCC). As you can well imagine, keeping track of the data forms at the DCC is a monumental task. For the ICDB Study, for example, the DCC received more than 68,000 data forms over the course of the study.\n", "\n", "In order to keep track of the data forms that arrive at the DCC, they are \"logged\" into a data base and subsequently tracked as they are processed at the DCC. In reality, a log data base will contain many more variables than we have in our subset, such as dates the data on the forms were entered into the data base, who entered the data, the dates the entered data were verified, who verified the data, and so on. To keep our life simple, we'll just use the four variables described above.\n", "\n", "
\n", "

Example

\n", "

This example uses the OUTPUT statement to tell SAS to write observations to data sets based on certain conditions. Specifically, the following program uses the OUTPUT statement to create three SAS data sets — s210006, s310032, and s410010 — based on whether the subject identification numbers in the icdblog data set meet a certain condition:

\n", "
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The s210006 data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SUBJV_TYPEV_DATEFORM
2100061205/06/94cmed
2100061205/06/94diet
2100061205/06/94med
2100061205/06/94phytrt
2100061205/06/94purg
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

The s310032 data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SUBJV_TYPEV_DATEFORM
3100322409/19/95backf
3100322409/19/95cmed
3100322409/19/95diet
3100322409/19/95med
3100322409/19/95medhxf
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

The s410010 data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SUBJV_TYPEV_DATEFORM
410010605/12/94cmed
410010605/12/94diet
410010605/12/94med
410010605/12/94phytrt
410010605/12/94purg
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 \"/folders/myfolders/SAS_Notes/data\";\n", "\n", "DATA s210006 s310032 s410010;\n", " set phc6089.icdblog;\n", " if (subj = 210006) then output s210006;\n", " else if (subj = 310032) then output s310032;\n", " else if (subj = 410010) then output s410010;\n", "RUN;\n", " \n", "PROC PRINT data = s210006 (obs=5) NOOBS;\n", " title 'The s210006 data set';\n", "RUN;\n", " \n", "PROC PRINT data = s310032 (obs=5) NOOBS;\n", " title 'The s310032 data set';\n", "RUN;\n", " \n", "PROC PRINT data = s410010 (obs=5) NOOBS;\n", " title 'The s410010 data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

As you can see, the DATA statement contains three data set names: s210006, s310032, and s410010. That tells SAS that we want to create three data sets with the given names. The SET statement, of course, tells SAS to read observations from the permanent data set called phc6089.icdblog. Then come the IF-THEN-ELSE and OUTPUT statements that make it all work. The first IF-THEN statement tells SAS to output any observations pertaining to subject 210006 to the s210006 data set; the second tells SAS to output any observations pertaining to subject 310032 to the s310032 data set; and the third tells SAS to output any observations pertaining to subject 410010 to the s410010 data set. SAS will report an error if a data set name appears in an OUTPUT statement without also appearing in the DATA statement.

\n", "

The PRINT procedures, of course, tell SAS to print the three newly created data sets. Each PRINT procedure here names its data set explicitly with the DATA= option. If the DATA= option were omitted, the procedure would print the most recently created data set, and when you name more than one data set in a single DATA statement, the last name on the DATA statement is treated as the most recently created one. A final PRINT procedure without a DATA= option would therefore print the s410010 data set by default.

\n", "

Note that the IF-THEN-ELSE construct used here in conjunction with the OUTPUT statement is comparable to attaching the WHERE= option to each of the data sets appearing in the DATA statement.
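
To make that comparison concrete, here is a minimal sketch of the WHERE= alternative. It assumes the same phc6089 library reference and icdblog data set used above. Because there are no OUTPUT statements, SAS writes every observation to all three data sets at the end of each iteration, and each WHERE= data set option then keeps only the matching subject:

```sas
/* Sketch: same three data sets, filtered with WHERE= instead of IF-THEN-ELSE */
DATA s210006 (where = (subj = 210006))
     s310032 (where = (subj = 310032))
     s410010 (where = (subj = 410010));
    set phc6089.icdblog;
RUN;
```

One difference worth keeping in mind: the IF-THEN-ELSE version stops checking conditions as soon as one is true, while each WHERE= option is evaluated separately for its own data set.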

\n", "

Before running the code, be sure that you have saved the icdblog data set and changed the LIBNAME statement to point to the folder where you saved it.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

Using an OUTPUT statement suppresses the automatic output of observations at the end of the DATA step. Therefore, if you plan to use any OUTPUT statements in a DATA step, you must use OUTPUT statements to program all of the output for that step. The following SAS program illustrates what happens if you fail to direct all of the observations to output:

\n", "
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The subj210006 data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SUBJV_TYPEV_DATEFORM
2100061205/06/94cmed
2100061205/06/94diet
2100061205/06/94med
2100061205/06/94phytrt
2100061205/06/94purg
2100061205/06/94qul
2100061205/06/94sympts
2100061205/06/94urn
2100061205/06/94void
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA subj210006 subj310032;\n", " set phc6089.icdblog;\n", " if (subj = 210006) then output subj210006;\n", "RUN;\n", " \n", "PROC PRINT data = subj210006 NOOBS;\n", " title 'The subj210006 data set';\n", "RUN;" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "

\n", "\n", "
198  ods listing close;ods html5 (id=saspy_internal) file=stdout options(bitmap_mode='inline') device=svg style=HTMLBlue; ods
198! graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: STDOUT
199
200 PROC PRINT data = subj310032 NOOBS;
201 title 'The subj310032 data set';
202 RUN;
NOTE: No observations in data set WORK.SUBJ310032.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

203
204 ods html5 (id=saspy_internal) close;ods listing;

205
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC PRINT data = subj310032 NOOBS;\n", " title 'The subj310032 data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The DATA statement contains two data set names, subj210006 and subj310032, telling SAS that we intend to create two data sets. However, as you can see, the IF-THEN statement contains an OUTPUT statement that directs output to the subj210006 data set, but no OUTPUT statement directs output to the subj310032 data set. Launch and run the SAS program to convince yourself that the subj210006 data set contains data for subject 210006, while the subj310032 data set contains 0 observations. You should see a message in the log like the one shown above, and no output for the subj310032 data set should appear in the output window.
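
One way to avoid an accidentally empty data set is to give every observation a destination. The following sketch is not part of the original program; the data set name others is just an illustrative choice. It sends subject 210006 to one data set and everything else to a catch-all data set:

```sas
/* Sketch: every observation is routed by an explicit OUTPUT statement */
DATA subj210006 others;
    set phc6089.icdblog;
    if (subj = 210006) then output subj210006;   /* subject of interest */
    else output others;                          /* all remaining observations */
RUN;
```

With the ELSE branch in place, no observation is silently dropped, which makes it easier to check that the row counts of the output data sets add up to the row count of the input data set.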

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

If you use an assignment statement to create a new variable in a DATA step in the presence of OUTPUT statements, you have to make sure that you place the assignment statement before the OUTPUT statements. Otherwise, SAS will have already written the observation to the SAS data set, and the newly created variable will be set to missing. The following SAS program illustrates an example of how two variables, current and days_vis, get set to missing in the output data sets because their values get calculated after SAS has already written the observation to the SAS data set:

\n", "
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The subj310032 data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SUBJV_TYPEV_DATEFORMcurrentdays_vis
3100322409/19/95backf..
3100322409/19/95cmed..
3100322409/19/95diet..
3100322409/19/95med..
3100322409/19/95medhxf..
3100322409/19/95phs..
3100322409/19/95phytrt..
3100322409/19/95preg..
3100322409/19/95purg..
3100322409/19/95qul..
3100322409/19/95sympts..
3100322409/19/95urn..
3100322409/19/95void..
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA subj210006 subj310032 subj410010;\n", " set phc6089.icdblog;\n", " if (subj = 210006) then output subj210006;\n", " else if (subj = 310032) then output subj310032;\n", " else if (subj = 410010) then output subj410010;\n", " current = today();\n", " days_vis = current - v_date;\n", " format current mmddyy8.;\n", "RUN;\n", " \n", "PROC PRINT data = subj310032 NOOBS;\n", " title 'The subj310032 data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The main thing to note in this program is that the current and days_vis assignment statements appear after the IF-THEN-ELSE and OUTPUT statements. That means that each observation will be written to one of the three output data sets before the current and days_vis values are even calculated. Because SAS sets variables created in the DATA step to missing at the beginning of each iteration of the DATA step, the values of current and days_vis will remain missing for each observation.

\n", "

By the way, the today( ) function, whose result is assigned to the variable current, returns today's date as a SAS date value. Therefore, the variable days_vis is meant to contain the number of days since the subject's recorded visit v_date. However, as described above, the values of current and days_vis get set to missing. Launch and run the SAS program to convince yourself that the current and days_vis variables in the subj310032 data set contain only missing values. If we were to print the subj210006 and subj410010 data sets, we would see the same thing.

\n", "

The following SAS program illustrates the corrected code for the previous DATA step, that is, for creating new variables with assignment statements in the presence of OUTPUT statements:

\n", "
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The subj310032 data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SUBJV_TYPEV_DATEFORMcurrentdays_vis
3100322409/19/95backf09/30/209143
3100322409/19/95cmed09/30/209143
3100322409/19/95diet09/30/209143
3100322409/19/95med09/30/209143
3100322409/19/95medhxf09/30/209143
3100322409/19/95phs09/30/209143
3100322409/19/95phytrt09/30/209143
3100322409/19/95preg09/30/209143
3100322409/19/95purg09/30/209143
3100322409/19/95qul09/30/209143
3100322409/19/95sympts09/30/209143
3100322409/19/95urn09/30/209143
3100322409/19/95void09/30/209143
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA subj210006 subj310032 subj410010;\n", " set phc6089.icdblog;\n", " current = today();\n", " days_vis = current - v_date;\n", " format current mmddyy8.;\n", " if (subj = 210006) then output subj210006;\n", " else if (subj = 310032) then output subj310032;\n", " else if (subj = 410010) then output subj410010;\n", "RUN;\n", " \n", "PROC PRINT data = subj310032 NOOBS;\n", " title 'The subj310032 data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Now, since the assignment statements precede the OUTPUT statements, the variables are correctly written to the output data sets. That is, the variable current now contains the date on which the program was run, and the variable days_vis contains the number of days between the subject's visit and that date. Launch and run the SAS program to convince yourself that the current and days_vis variables are properly written to the subj310032 data set. If we were to print the subj210006 and subj410010 data sets, we would see similar results.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

After SAS processes an OUTPUT statement within a DATA step, the observation remains in the program data vector and you can continue programming with it. You can even output the observation again to the same SAS data set or to a different one! The following SAS program illustrates how you can create different data sets with some of the same observations. That is, the data sets created in your DATA statement do not have to be mutually exclusive:

\n", "
" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The symptoms data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SUBJV_TYPEV_DATEFORM
2100061205/06/94sympts
3100322409/19/95sympts
410010605/12/94sympts
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

The visitsix data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SUBJV_TYPEV_DATEFORM
410010605/12/94cmed
410010605/12/94diet
410010605/12/94med
410010605/12/94phytrt
410010605/12/94purg
410010605/12/94qul
410010605/12/94sympts
410010605/12/94urn
410010605/12/94void
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA symptoms visitsix;\n", " set phc6089.icdblog;\n", " if form = 'sympts' then output symptoms;\n", " if v_type = 6 then output visitsix;\n", "RUN;\n", " \n", "PROC PRINT data = symptoms NOOBS;\n", " title 'The symptoms data set';\n", "RUN;\n", " \n", "PROC PRINT data = visitsix NOOBS;\n", " title 'The visitsix data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The DATA step creates two temporary data sets, symptoms and visitsix. The symptoms data set contains only those observations containing a form code of sympts. The visitsix data set, on the other hand, contains observations for which v_type equals 6. The observations in the two data sets are therefore not necessarily mutually exclusive. In fact, launch and run the SAS program and review the output from the PRINT procedures. Note that the observation for subject 410010 in which form = sympts is contained in both the symptoms and visitsix data sets.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The RETAIN Statement\n", "\n", "When SAS reads the DATA statement at the beginning of each iteration of the DATA step, SAS places missing values in the program data vector for variables that were assigned by either an INPUT statement or an assignment statement within the DATA step. A RETAIN statement effectively overrides this default. That is, a RETAIN statement tells SAS not to set variables whose values are assigned by an INPUT or assignment statement to missing when going from the current iteration of the DATA step to the next. Instead, SAS retains the values. The RETAIN statement takes the generic form:\n", "\n", "`RETAIN variable1 variable2 ... variablen;`\n", "\n", "You can specify as few or as many variables as you want. If you specify no variable names, then SAS retains the values of all of the variables created in an INPUT or assignment statement. You may initialize the values of variables within a RETAIN statement. For example, in the statement:\n", "\n", "`RETAIN var1 0 var2 3 a b c 'XYZ'`\n", "\n", "the variable var1 is assigned the value 0; the variable var2 is assigned the value 3, and the variables a, b, and c are all assigned the character value 'XYZ'. If you do not specify an initial value, SAS sets the initial value of a variable to be retained to missing.\n", "\n", "Finally, since the RETAIN statement is not an executable statement, it can appear anywhere in the DATA step.\n", "\n", "Throughout the remainder of the lesson, we will work with the grades data set that is created in the following DATA step:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The grades data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
idnol_namegtypegrade
10SmithE178
10SmithE282
10SmithE386
10SmithE469
10SmithP197
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA grades;\n", " input idno 1-2 l_name $ 5-9 gtype $ 12-13 grade 15-17;\n", " cards;\n", "10 Smith E1 78\n", "10 Smith E2 82\n", "10 Smith E3 86\n", "10 Smith E4 69\n", "10 Smith P1 97\n", "10 Smith F1 160\n", "11 Simon E1 88\n", "11 Simon E2 72\n", "11 Simon E3 86\n", "11 Simon E4 99\n", "11 Simon P1 100\n", "11 Simon F1 170\n", "12 Jones E1 98\n", "12 Jones E2 92\n", "12 Jones E3 92\n", "12 Jones E4 99\n", "12 Jones P1 99\n", "12 Jones F1 185\n", ";\n", "RUN;\n", " \n", "PROC PRINT data = grades (obs=5) NOOBS;\n", " title 'The grades data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The grades data set is what we call a \"subject- and grade-specific\" data set. That is, there is one observation for each grade for each student. Students are identified by their id number (idno) and last name (l_name). The data set contains six different types of grades: exam 1 (E1), exam 2 (E2), exam 3 (E3), exam 4 (E4), each worth 100 points; one project (P1) worth 100 points; and a final exam (F1) worth 200 points. We'll suppose that the instructor agreed to drop the students' lowest exam grades (E1, E2, E3, E4) not including the final exam. Launch and run the SAS program so that we can work with the grades data set in the following examples. Review the output from the PRINT procedure to convince yourself that the data were properly read into the grades data set.\n", "\n", "Before we look at an example using the RETAIN statement, let's look at the SAS variables FIRST. and LAST." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following SAS program illustrates the SAS variables FIRST. and LAST. that can be obtained when using the BY statement on a sorted dataset in a DATA step to identify the first and last grade record for each student in the dataset.

\n", "
" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The grades data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obsidnol_namegtypegradefirstGradelastGrade
110SmithE17810
210SmithE28200
310SmithE38600
410SmithE46900
510SmithP19700
610SmithF116001
711SimonE18810
811SimonE27200
911SimonE38600
1011SimonE49900
1111SimonP110000
1211SimonF117001
1312JonesE19810
1412JonesE29200
1512JonesE39200
1612JonesE49900
1712JonesP19900
1812JonesF118501
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SORT data = grades out = srt_grades;\n", " BY idno;\n", "RUN;\n", "\n", "DATA grades_first_last;\n", " SET srt_grades;\n", " BY idno;\n", " firstGrade = FIRST.idno;\n", " lastGrade = LAST.idno;\n", "RUN;\n", "\n", "PROC PRINT data = grades_first_last;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Because we are doing BY group processing on the variable idno, we must have the dataset sorted by idno. In this case the dataset was actually already sorted by idno, but I added the PROC SORT anyway to emphasize that the dataset must be sorted first.

\n", "

The SET and BY statements tell SAS to process the data by grouping observations with the same idno together. To do this, SAS automatically creates two temporary variables for each variable name in the BY statement. One of the temporary variables is called FIRST.variable, where variable is the variable name appearing in the BY statement. The other temporary variable is called LAST.variable. Both take the values 0 or 1:

\n", "
    \n", "
  • FIRST.variable = 1 when an observation is the first observation in a BY group
  • \n", "
  • FIRST.variable = 0 when an observation is not the first observation in a BY group
  • \n", "
  • LAST.variable = 1 when an observation is the last observation in a BY group
  • \n", "
  • LAST.variable = 0 when an observation is not the last observation in a BY group
  • \n", "
\n", "

SAS uses the values of the FIRST.variable and LAST.variable temporary variables to identify the first and last observations in a group, and therefore the group itself. Oh, a comment about that adjective temporary ... SAS places FIRST.variable and LAST.variable in the program data vector and they are therefore available for DATA step programming, but SAS does not add them to the SAS data set being created. It is in that sense that they are temporary.

\n", "

Because SAS does not write FIRST.variables and LAST.variables to output data sets, we have to do some finagling to see their contents. The two assignment statements:

\n", "
\n",
    "    firstGrade = FIRST.idno;\n",
    "    lastGrade = LAST.idno;\n",
    "    
\n", "

simply tell SAS to assign the values of the temporary variables, FIRST.idno and LAST.idno, to permanent variables, firstGrade and lastGrade, respectively. The PRINT procedure tells SAS to print the resulting data set so that we can take an inside peek at the values of the FIRST.variables and LAST.variables.
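
FIRST.variable and LAST.variable are more commonly used to control processing than to be printed. As a minimal sketch, assuming the srt_grades data set created above, the following DATA step counts the grade records for each student. The names grade_counts and n_grades are hypothetical and chosen only for illustration:

```sas
/* Sketch: one output observation per student with a count of grade records */
DATA grade_counts;
    set srt_grades;
    by idno;
    if first.idno then n_grades = 0;   /* reset the counter at the start of each student */
    n_grades + 1;                      /* sum statement: n_grades keeps its value across iterations */
    if last.idno then output;          /* write only the last record of each student */
    keep idno l_name n_grades;
RUN;
```

Since the grades data set has six records per student, n_grades should equal 6 for each of the three students.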

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

One of the most powerful uses of a RETAIN statement is to compare values across observations. The following program uses the RETAIN statement to compare values across observations, and in doing so determines each student's lowest grade of the four semester exams:

\n", "
" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: LOWEST

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
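| Obs | idno | l_name | grade | lowgrade | gtype |
|---:|---:|:---|---:|---:|:---|
| 1 | 10 | Smith | 69 | 69 | E4 |
| 2 | 11 | Simon | 99 | 72 | E2 |
| 3 | 12 | Jones | 99 | 92 | E3 |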
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA exams;\n", " set grades (where = (gtype in ('E1', 'E2', 'E3', 'E4')));\n", "RUN;\n", " \n", "DATA lowest (rename = (lowtype = gtype));\n", " set exams;\n", " by idno;\n", " retain lowgrade lowtype;\n", " if first.idno then lowgrade = grade;\n", " lowgrade = min(lowgrade, grade);\n", " if grade = lowgrade then lowtype = gtype;\n", " if last.idno then output;\n", " drop gtype;\n", "RUN;\n", " \n", "PROC PRINT data=lowest;\n", " title 'Output Dataset: LOWEST';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Because the instructor only wants to drop the lowest exam grade, the first DATA step tells SAS to create a data set called exams by selecting only the exam grades (E1, E2, E3, and E4) from the data set grades.

\n", "

It's the second DATA step that is the meat of the program and the challenging one to understand. The DATA step searches through the exams data set for each subject (\"by idno\") and looks for the lowest grade (\"min(lowgrade, grade)\"). Because SAS would otherwise set the variables lowgrade and lowtype to missing for each new iteration, the RETAIN statement is used to keep track of the observation that contains the lowest grade. When SAS reads the last observation of the student (\"last.idno\") it outputs the data corresponding to the lowest exam type (lowtype) and grade (lowgrade) to the lowest data set. (Note that the statement \"if last.idno then output;\" effectively collapses multiple observations per student into one observation per student.) So that we can merge the lowest data set back into the grades data set, by idno and gtype, the variable lowtype is renamed back to gtype.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DO Loops\n", "\n", "When programming, you can find yourself needing to tell SAS to execute the same statements over and over again. That's when a DO loop can come in and save your day. The actions of some DO loops are unconditional in that if you tell SAS to do something 20 times, SAS will do it 20 times regardless. We call those kinds of loops **iterative DO loops**. On the other hand, actions of some DO loops are conditional in that you tell SAS to do something until a particular condition is met or to do something while a particular condition is met. We call the former a **DO UNTIL** loop and the latter a **DO WHILE** loop. In this lesson, we'll explore the ins and outs of these three different kinds of loops, as well as take a look at lots of examples in which they are used. Then, in the next section, we'll use DO loops to help us process arrays.\n", "\n", "### Iterative DO Loops\n", "\n", "In this section, we'll explore the use of iterative DO loops, in which you tell SAS to execute a statement or a group of statements a certain number of times. Let's take a look at some examples.\n", "\n", "
\n", "

Example

\n", "

The following program uses a DO loop to tell SAS to determine what four times three (4 × 3) equals:

\n", "
" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Four Times Three Equals...

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
| answer | i |
|---:|---:|
| 12 | 5 |
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA multiply;\n", " answer = 0;\n", " do i = 1 to 4;\n", " answer + 3;\n", " end;\n", "RUN;\n", " \n", "PROC PRINT NOOBS;\n", " title 'Four Times Three Equals...';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Okay... admittedly, we could accomplish our goal of determining four times three in a much simpler way, but then we wouldn't have the pleasure of seeing how we can accomplish it using an iterative DO loop! The key to understanding the DATA step here is to recall that multiplication is just repeated addition. That is, four times three (4 × 3) is the same as adding three together four times, that is, 3 + 3 + 3 + 3. That's all that the iterative DO loop in the DATA step is telling SAS to do. After having initialized answer to 0, add 3 to answer, then add 3 to answer again, and add 3 to answer again, and add 3 to answer again. After SAS has added 3 to the answer variable four times, SAS exits the DO loop, and since that's the end of the DATA step, SAS moves onto the next procedure and prints the result.

\n", "

The other thing you might want to notice about the DATA step is that there is no input data set or input data file. We are generating data from scratch here, rather than from some input source. Now, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that our code properly calculates four times three.

\n", "

Ahhh, what about that i variable that shows up in our multiply data set? If you look at our DATA step again, you can see that it comes from the DO loop. It is what is called the index variable (or counter variable). Most often, you'll want to drop it from your output data set, but its presence here is educational. As you can see, its final value is 5. That's what allows SAS to exit the DO loop: we tell SAS to take the actions inside the loop only while i is no greater than 4. After the fourth iteration, SAS increments i to 5, finds that it is greater than 4, jumps out of the loop, and moves on to the next statements in the DATA step. Let's take a look at the general form of iterative DO loops.
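
As a minimal sketch of dropping the index variable, the program below repeats the multiplication example with the DROP= data set option on the DATA statement. The data set name multiply2 is made up for illustration and is not part of the original program:

```sas
/* Hypothetical variant of the multiply program: same loop, */
/* but the index variable i is not written to the data set. */
DATA multiply2 (drop = i);
    answer = 0;
    do i = 1 to 4;
        answer + 3;      /* sum statement: adds 3 to answer on each pass */
    end;
RUN;

PROC PRINT data = multiply2 NOOBS;
    title 'Four Times Three Equals...';
RUN;
```

With the index variable dropped, the printed data set should contain only the answer variable.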

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To construct an iterative DO loop, you need to start with a DO statement, then include some action statements, and then end with an END statement. Here's what a simple iterative DO loop should look like:\n", "\n", "
\n",
    "DO index-variable = start TO stop BY increment;\n",
    "    action statements;\n",
    "END;\n",
    "
\n", "\n", "where\n", "\n", "* DO, index-variable, start, TO, stop, and END are required in every iterative DO loop\n", "* index-variable, which stores the value of the current iteration of the DO loop, can be any valid SAS variable name. It is common, however, to use a single letter, with i and j being the most used.\n", "* start is the value of the index variable at which you want SAS to start the loop\n", "* stop is the value of the index variable at which you want SAS to stop the loop\n", "* increment is by how much you want SAS to change the index variable after each iteration. The most commonly used increment is 1. In fact, if you don't specify a BY clause, SAS uses the default increment of 1.\n", "\n", "For example,\n", "\n", "`do jack = 1 to 5;`\n", "\n", "tells SAS to create an index variable called jack, start at 1, increment by 1, and end at 5, so that the values of jack from iteration to iteration are 1, 2, 3, 4, and 5. And, this DO statement:\n", "\n", "`do jill = 2 to 12 by 2;`\n", "\n", "tells SAS to create an index variable called jill, start at 2, increment by 2, and end at 12, so that the values of jill from iteration to iteration are 2, 4, 6, 8, 10, and 12.\n", "\n", "
\n", "

Example

\n", "

The following SAS program uses an iterative DO loop to count backwards by 1:

\n", "
" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Counting Backwards by 1

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
i
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA backwardsbyone;\n", " do i = 20 to 1 by -1;\n", " output;\n", " end;\n", "RUN;\n", " \n", "PROC PRINT data = backwardsbyone NOOBS;\n", " title 'Counting Backwards by 1';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

As you can see in this DO statement, you can decrement a DO loop's index variable by specifying a negative value for the BY clause. Here, we tell SAS to start at 20, and decrease the index variable by 1, until it reaches 1. The OUTPUT statement tells SAS to output the value of the index variable i for each iteration of the DO loop. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that our code properly counts backwards from 20 to 1.
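
The BY clause works the same way with a positive increment other than 1. As a sketch, the program below wraps the `do jill = 2 to 12 by 2;` statement discussed earlier in a complete DATA step. The data set name evens and the title are assumptions chosen for illustration:

```sas
/* Hypothetical example: count forwards by 2 from 2 to 12. */
DATA evens;
    do jill = 2 to 12 by 2;   /* jill takes the values 2, 4, 6, 8, 10, 12 */
        output;               /* write one observation per iteration */
    end;
RUN;

PROC PRINT data = evens NOOBS;
    title 'Counting Forwards by 2';
RUN;
```

The resulting data set should have six observations, one for each value of jill.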

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than specifying start, stop and increment values in a DO statement, you can tell SAS how many times to execute a DO loop by listing items in a series. In this case, the general form of the iterative DO loop looks like this:\n", "\n", "
\n",
    "DO index-variable = value1, value2, value3, ...;\n",
    "    action statements;\n",
    "END;\n",
    "
\n", "\n", "where the values can be character or numeric. When the DO loop executes, it executes once for each item in the series. The index variable equals the value of the current item. You must use commas to separate items in a series. To list items in a series, you must specify\n", "\n", "either all numeric values: \n", "\n", "`DO i = 1, 2, 3, 4, 5;`\n", "\n", "all character values, with each value enclosed in quotation marks \n", "\n", "`DO j = 'winter', 'spring', 'summer', 'fall';`\n", "\n", "or all variable names: \n", "\n", "`DO k = first, second, third;`\n", "\n", "In this case, the index variable takes on the values of the specified variables. Note that the variable names are not enclosed in quotation marks, while quotation marks are required for character values.\n", "\n", "### Nested DO Loops\n", "\n", "Just as in other programming languages, we can nest DO loops within each other.\n", "\n", "
\n", "

Example

\n", "

Suppose you are interested in conducting an experiment with two factors, A and B. Suppose factor A is the amount of water, with levels 1, 2, 3, and 4, and factor B is the amount of sunlight, with levels 1, 2, 3, 4, and 5. Then, the following SAS code uses nested iterative DO loops to generate the 4 by 5 factorial design:

\n", "
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

4 by 5 Factorial Design

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
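| Obs | i | j |
|---:|---:|---:|
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 3 |
| 4 | 1 | 4 |
| 5 | 1 | 5 |
| 6 | 2 | 1 |
| 7 | 2 | 2 |
| 8 | 2 | 3 |
| 9 | 2 | 4 |
| 10 | 2 | 5 |
| 11 | 3 | 1 |
| 12 | 3 | 2 |
| 13 | 3 | 3 |
| 14 | 3 | 4 |
| 15 | 3 | 5 |
| 16 | 4 | 1 |
| 17 | 4 | 2 |
| 18 | 4 | 3 |
| 19 | 4 | 4 |
| 20 | 4 | 5 |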
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA design;\n", "DO i = 1 to 4;\n", " DO j = 1 to 5;\n", " output;\n", " END;\n", " END;\n", "RUN;\n", " \n", "PROC PRINT data = design;\n", " TITLE '4 by 5 Factorial Design';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

First, launch and run the SAS program. Then, review the output from the PRINT procedure to see the contents of the design data set. By doing so, you can get a good feel for how the nested DO loops work. First, SAS sets the value of the index variable i to 1, then proceeds to the next step which happens to be another iterative DO loop. While i is 1:

\n", "
    \n", "
  • SAS sets the value of j to 1, and outputs the observation in which i = 1 and j = 1.
  • \n", "
  • Then, SAS sets the value j to 2, and outputs the observation in which i = 1 and j = 2.
  • \n", "
  • Then, SAS sets the value j to 3, and outputs the observation in which i = 1 and j = 3.
  • \n", "
  • Then, SAS sets the value j to 4, and outputs the observation in which i = 1 and j = 4.
  • \n", "
  • Then, SAS sets the value j to 5, and outputs the observation in which i = 1 and j = 5.
  • \n", "
  • Then, SAS sets the value j to 6, and jumps out of the inside DO loop and proceeds to the next statement, which happens to be the end of the outside DO loop.
  • \n", "
\n", "

SAS then sets the value of the index variable i to 2, then proceeds through the inside DO loop again just as described above. This process continues until SAS sets the value of index variable i to 5, jumps out of the outside DO loop, and ends the DATA step.
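
To make the order of execution easier to trace, here is a smaller sketch of the same idea: a 2 by 3 design in which each observation also records a running cell number. The data set name smalldesign and the variable cell are hypothetical and do not appear in the original program:

```sas
/* Hypothetical 2 by 3 design: the inner loop runs to completion */
/* for each value of the outer index variable.                   */
DATA smalldesign;
    DO i = 1 to 2;                  /* outer loop: 2 levels */
        DO j = 1 to 3;              /* inner loop: 3 levels */
            cell = (i - 1)*3 + j;   /* running cell number, 1 through 6 */
            output;                 /* one observation per (i, j) pair */
        END;
    END;
RUN;

PROC PRINT data = smalldesign NOOBS;
    title '2 by 3 Factorial Design';
RUN;
```

Because j cycles through all of its values before i changes, cell increases by 1 from one observation to the next, giving six observations in total.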

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DO UNITL and DO WHILE Loops\n", "\n", "As you now know, the iterative DO loop requires that you specify the number of iterations for the DO loop. However, there are times when you want to execute a DO loop until a condition is reached or while a condition exists, but you don't know how many iterations are needed. That's when the DO UNTIL loop and the DO WHILE loop can help save the day!\n", "\n", "In this section, we'll first learn about the DO UNTIL and DO WHILE loops. Then, we'll look at another form of the iterative DO loop that combines features of both conditional and unconditional DO loops.\n", "\n", "When you use a DO UNTIL loop, SAS executes the DO loop until the expression you've specified is true. Here's the general form of a DO UNTIL loop:\n", "\n", "
\n",
    "DO UNTIL (expression);\n",
    "    action statements;\n",
    "END;\n",
    "
\n", "\n", "where expression is any valid SAS expression enclosed in parentheses. The key thing to remember is that the expression is not evaluated until the bottom of the loop. Therefore, a DO UNTIL loop always executes at least once. As soon as the expression is determined to be true, the DO loop does not execute again.\n", "\n", "
\n", "

Example

\n", "

Suppose you want to know how many years it would take to accumulate \\$50,000 if you deposit \\$1200 each year into an account that earns 5% interest. The following program uses a DO UNTIL loop to perform the calculation for us:

\n", "
" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Years until at least $50,000

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
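| value | year |
|---:|---:|
| 1260.00 | 1 |
| 2583.00 | 2 |
| 3972.15 | 3 |
| 5430.76 | 4 |
| 6962.30 | 5 |
| 8570.41 | 6 |
| 10258.93 | 7 |
| 12031.88 | 8 |
| 13893.47 | 9 |
| 15848.14 | 10 |
| 17900.55 | 11 |
| 20055.58 | 12 |
| 22318.36 | 13 |
| 24694.28 | 14 |
| 27188.99 | 15 |
| 29808.44 | 16 |
| 32558.86 | 17 |
| 35446.80 | 18 |
| 38479.14 | 19 |
| 41663.10 | 20 |
| 45006.26 | 21 |
| 48516.57 | 22 |
| 52202.40 | 23 |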
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA investment;\n", " RETAIN value 0 year 0;\n", " DO UNTIL (value >= 50000);\n", " value = value + 1200;\n", " value = value + value * 0.05;\n", " year = year + 1;\n", " OUTPUT;\n", " END;\n", "RUN;\n", " \n", "PROC PRINT data = investment NOOBS;\n", " title 'Years until at least $50,000';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Recall that the expression in the DO UNTIL statement is not evaluated until the bottom of the loop. Therefore, the DO UNTIL loop executes at least once. On the first iteration, the value variable is increased by 1200, or in this case, set to 1200. Then, the value variable is updated by calculating 1200 + 1200*0.05 to get 1260. Then, the year variable is increased by 1, or in this case, set to 1. The first observation, for which year = 1 and value = 1260, is then written to the output data set called investment. Having reached the bottom of the DO UNTIL loop, the expression (value >= 50000) is evaluated to determine if it is true. Since value is just 1260, the expression is not true, and so the DO UNTIL loop is executed once again. The process continues as described until SAS determines that value is at least 50000 and therefore stops executing the DO UNTIL loop.

\n", "

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that it would take 23 years to accumulate at least $50,000.
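
If only the final answer is of interest, one possible variation is to move the OUTPUT statement below the END statement so that a single observation is written after the loop finishes. This is a sketch rather than part of the original program, and the data set name finalyear is made up for illustration:

```sas
/* Hypothetical variant: same calculation, but only the last */
/* year is written to the output data set.                   */
DATA finalyear;
    RETAIN value 0 year 0;
    DO UNTIL (value >= 50000);
        value = value + 1200;            /* yearly deposit */
        value = value + value * 0.05;    /* add 5% interest */
        year = year + 1;
    END;
    OUTPUT;                              /* executes once, after the loop ends */
RUN;

PROC PRINT data = finalyear NOOBS;
    title 'Years until at least $50,000';
RUN;
```

Since the calculation is unchanged, the single observation should match the last row of the table above, with year equal to 23.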

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you use a DO WHILE loop, SAS executes the DO loop while the expression you've specified is true. Here's the general form of a DO WHILE loop:\n", "\n", "
\n",
    "DO WHILE (expression);\n",
    "      action statements;\n",
    "END;\n",
    "
\n", "\n", "where expression is any valid SAS expression enclosed in parentheses. An important difference between the DO UNTIL and DO WHILE statements is that the DO WHILE expression is evaluated at the top of the DO loop. If the expression is false the first time it is evaluated, then the DO loop doesn't even execute once.\n", "\n", "
\n", "

Example

\n", "

The following program attempts to use a DO WHILE loop to accomplish the same goal as the program above, namely to determine how many years it would take to accumulate \\$50,000 if you deposit \\$1200 each year into an account that earns 5% interest:

\n", "
" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Years until at least $50,000

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
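| value | year |
|---:|---:|
| 1260.00 | 1 |
| 2583.00 | 2 |
| 3972.15 | 3 |
| 5430.76 | 4 |
| 6962.30 | 5 |
| 8570.41 | 6 |
| 10258.93 | 7 |
| 12031.88 | 8 |
| 13893.47 | 9 |
| 15848.14 | 10 |
| 17900.55 | 11 |
| 20055.58 | 12 |
| 22318.36 | 13 |
| 24694.28 | 14 |
| 27188.99 | 15 |
| 29808.44 | 16 |
| 32558.86 | 17 |
| 35446.80 | 18 |
| 38479.14 | 19 |
| 41663.10 | 20 |
| 45006.26 | 21 |
| 48516.57 | 22 |
| 52202.40 | 23 |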
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA investtwo;\n", " RETAIN value 0 year 0;\n", " DO WHILE (value < 50000);\n", " value = value + 1200;\n", " value = value + value * 0.05;\n", " year = year + 1;\n", " OUTPUT;\n", " END;\n", "RUN;\n", " \n", "PROC PRINT data = investtwo NOOBS;\n", " title 'Years until at least $50,000';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The calculations proceed as before, except that the expression is now checked at the top of the loop. Since value starts at 0, the expression (value < 50000) is true and SAS enters the DO WHILE loop. First, the value variable is updated by calculating 0 + 1200, to get 1200. Then, the value variable is updated by calculating 1200 + 1200*0.05 to get 1260. Then, the year variable is increased by 1, or in this case, set to 1. The first observation, for which year = 1 and value = 1260, is then written to the output data set called investtwo. SAS then returns to the top of the DO WHILE loop to determine if the expression (value < 50000) is true. Since value is just 1260, the expression is true, and so the DO WHILE loop executes once again. The process continues as described until SAS determines that value is at least 50000 and therefore stops executing the DO WHILE loop.

\n", "

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that this program also determines that it would take 23 years to accumulate at least \\$50,000.

\n", "

You should also try changing the WHILE condition from value < 50000 to value ≥ 50000 to see what happens. (Hint: you will get no output. Why?)
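
The difference between checking at the top and checking at the bottom can also be seen in a much smaller sketch. The two DATA steps below use the same expression, which is false on entry to the loop. The data set names whilecheck and untilcheck and the variable count are hypothetical:

```sas
/* DO WHILE checks the expression at the top of the loop, */
/* so the loop body never executes here.                  */
DATA whilecheck;
    count = 0;
    DO WHILE (count > 0);
        count = count + 1;
    END;
RUN;   /* no OUTPUT statement, so SAS writes one observation: count = 0 */

/* DO UNTIL checks the expression at the bottom of the loop, */
/* so the loop body executes exactly once here.              */
DATA untilcheck;
    count = 0;
    DO UNTIL (count > 0);
        count = count + 1;
    END;
RUN;   /* one observation: count = 1 */
```

Neither step contains an OUTPUT statement, so each writes a single observation at the end of the DATA step, which makes the final value of count easy to inspect with PROC PRINT.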

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have now seen how the DO WHILE and DO UNTIL loops enable you to execute statements repeatedly, but conditionally so. You have also seen how the iterative DO loop enables you to execute statements a set number of times unconditionally. Now, we'll put the two together to create a form of the iterative DO loop that executes DO loops conditionally as well as unconditionally.\n", "\n", "
\n", "

Example

\n", "

Suppose again that you want to know how many years it would take to accumulate \\$50,000 if you deposit \\$1200 each year into an account that earns 5% interest. But this time, suppose you also want to limit the number of years that you invest to 15 years. The following program uses a conditional iterative DO loop to accumulate our investment until we reach 15 years or until the value of our investment is at least 50000, whichever comes first:

\n", "
" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Value of Investment

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
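| value | year |
|---:|---:|
| 1260.00 | 1 |
| 2583.00 | 2 |
| 3972.15 | 3 |
| 5430.76 | 4 |
| 6962.30 | 5 |
| 8570.41 | 6 |
| 10258.93 | 7 |
| 12031.88 | 8 |
| 13893.47 | 9 |
| 15848.14 | 10 |
| 17900.55 | 11 |
| 20055.58 | 12 |
| 22318.36 | 13 |
| 24694.28 | 14 |
| 27188.99 | 15 |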
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA investfour (drop = i);\n", " RETAIN value 0 year 0;\n", " DO i = 1 to 15 UNTIL (value >= 50000);\n", " value = value + 1200;\n", " value = value + value * 0.05;\n", " year = year + 1;\n", " OUTPUT;\n", " END;\n", "RUN;\n", " \n", "PROC PRINT data = investfour NOOBS;\n", " title 'Value of Investment';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note that there are just two differences between this program and the program in the earlier example that uses the DO UNTIL loop: i) the iteration i = 1 to 15 has been inserted into the DO UNTIL statement; and ii) because the index variable i is created for the DO loop, it is dropped before writing the contents from the program data vector to the output data set investfour. Because the value of the investment has not yet reached 50000 after 15 years, it is the 15-year limit that ends the loop here, and the output data set contains 15 observations.
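
An iterative DO statement can carry a WHILE clause in the same position as the UNTIL clause. The sketch below rewrites the program that way, with the expression reversed so that the loop continues while the value is still below 50000. The data set name investfive is made up for illustration:

```sas
/* Hypothetical variant of investfour using a WHILE clause. */
DATA investfive (drop = i);
    RETAIN value 0 year 0;
    DO i = 1 to 15 WHILE (value < 50000);   /* expression checked at the top of each pass */
        value = value + 1200;               /* yearly deposit */
        value = value + value * 0.05;       /* add 5% interest */
        year = year + 1;
        OUTPUT;
    END;
RUN;

PROC PRINT data = investfive NOOBS;
    title 'Value of Investment';
RUN;
```

For these inputs the value stays below 50000 through year 15, so the loop stops when the index variable passes 15 and the output should match the 15 observations shown above.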

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SAS Arrays\n", "\n", "In this section, we'll learn about basic array processing in SAS. In DATA step programming, you often need to perform the same action on more than one variable at a time. Although you can process the variables individually, it is typically easier to handle the variables as a group. Arrays offer you that option. For example, until now, if you wanted to take the square root of the 50 numeric variables in your SAS data set, you'd have to write 50 SAS assignment statements to accomplish the task. Instead, you can use an array to simplify your task.\n", "\n", "Arrays can be used to simplify your code when you need to:\n", "\n", "* perform repetitive calculations\n", "* create many variables that have the same attributes\n", "* read data\n", "* transpose \"fat\" data sets to \"tall\" data sets, that is, change the variables in a data set to observations\n", "* transpose \"tall\" data sets to \"fat\" data sets, that is, change the observations in a data set to variables\n", "* compare variables\n", "\n", "In this lesson, we'll learn how to accomplish such tasks using arrays. Using arrays in appropriate situations can seriously simplify and shorten your SAS programs!\n", "\n", "### One-Dimensional Arrays\n", "\n", "A SAS **array** is a temporary grouping of SAS variables under a single name. For example, suppose you have four variables named winter, spring, summer, and, fall. Rather than referring to the variables by their four different names, you could associate the variables with an array name, say seasons, and refer to the variables as seasons(1), seasons(2), seasons(3), and seasons(4). When you pair an array up with an iterative DO loop, you create a powerful and efficient way of writing your computer programs. Let's take a look at an example!\n", "\n", "
\n", "

Example

\n", "

The following program simply reads the average monthly temperatures (in Celsius) for ten different cities in the United States into a temporary SAS data set called avgcelsius:

\n", "
" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Average Monthly Temperatures in Celsius

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
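| City | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| State College, PA | -2 | -2 | 2 | 8 | 14 | 19 | 21 | 20 | 16 | 10 | 4 | -1 |
| Miami, FL | 20 | 20 | 22 | 23 | 26 | 27 | 28 | 28 | 27 | 26 | 23 | 20 |
| St. Louis, MO | -1 | 1 | 6 | 13 | 18 | 23 | 26 | 25 | 21 | 15 | 7 | 1 |
| New Orleans, LA | 11 | 13 | 16 | 20 | 23 | 27 | 27 | 27 | 26 | 21 | 16 | 12 |
| Madison, WI | -8 | -5 | 0 | 7 | 14 | 19 | 22 | 20 | 16 | 10 | 2 | -5 |
| Houston, TX | 10 | 12 | 16 | 20 | 23 | 27 | 28 | 28 | 26 | 21 | 16 | 12 |
| Phoenix, AZ | 12 | 14 | 16 | 21 | 26 | 31 | 33 | 32 | 30 | 23 | 16 | 12 |
| Seattle, WA | 5 | 6 | 7 | 10 | 13 | 16 | 18 | 18 | 16 | 12 | 8 | 6 |
| San Francisco, CA | 10 | 12 | 12 | 13 | 14 | 15 | 15 | 16 | 17 | 16 | 14 | 11 |
| San Diego, CA | 13 | 14 | 15 | 16 | 17 | 19 | 21 | 22 | 21 | 19 | 16 | 14 |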
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA avgcelsius;\n", " input City $ 1-18 jan feb mar apr may jun\n", " jul aug sep oct nov dec;\n", " DATALINES;\n", "State College, PA -2 -2 2 8 14 19 21 20 16 10 4 -1\n", "Miami, FL 20 20 22 23 26 27 28 28 27 26 23 20\n", "St. Louis, MO -1 1 6 13 18 23 26 25 21 15 7 1\n", "New Orleans, LA 11 13 16 20 23 27 27 27 26 21 16 12\n", "Madison, WI -8 -5 0 7 14 19 22 20 16 10 2 -5\n", "Houston, TX 10 12 16 20 23 27 28 28 26 21 16 12\n", "Phoenix, AZ 12 14 16 21 26 31 33 32 30 23 16 12\n", "Seattle, WA 5 6 7 10 13 16 18 18 16 12 8 6\n", "San Francisco, CA 10 12 12 13 14 15 15 16 17 16 14 11\n", "San Diego, CA 13 14 15 16 17 19 21 22 21 19 16 14\n", ";\n", "RUN;\n", " \n", "PROC PRINT data = avgcelsius;\n", " title 'Average Monthly Temperatures in Celsius';\n", " id City;\n", " var jan feb mar apr may jun \n", " jul aug sep oct nov dec;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Now, suppose that we don't feel particularly comfortable with understanding Celsius temperatures, and therefore, we want to convert the Celsius temperatures into Fahrenheit temperatures for which we have a better feel. The following SAS program uses the standard conversion formula:

\n", "
Fahrenheit temperature = 1.8*Celsius temperature + 32
\n", "

to convert the Celsius temperatures in the avgcelsius data set to Fahrenheit temperatures stored in a new data set called avgfahrenheit:

\n", "
" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Average Monthly Temperatures in Fahrenheit

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
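| City | janf | febf | marf | aprf | mayf | junf | julf | augf | sepf | octf | novf | decf |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| State College, PA | 28.4 | 28.4 | 35.6 | 46.4 | 57.2 | 66.2 | 69.8 | 68.0 | 60.8 | 50.0 | 39.2 | 30.2 |
| Miami, FL | 68.0 | 68.0 | 71.6 | 73.4 | 78.8 | 80.6 | 82.4 | 82.4 | 80.6 | 78.8 | 73.4 | 68.0 |
| St. Louis, MO | 30.2 | 33.8 | 42.8 | 55.4 | 64.4 | 73.4 | 78.8 | 77.0 | 69.8 | 59.0 | 44.6 | 33.8 |
| New Orleans, LA | 51.8 | 55.4 | 60.8 | 68.0 | 73.4 | 80.6 | 80.6 | 80.6 | 78.8 | 69.8 | 60.8 | 53.6 |
| Madison, WI | 17.6 | 23.0 | 32.0 | 44.6 | 57.2 | 66.2 | 71.6 | 68.0 | 60.8 | 50.0 | 35.6 | 23.0 |
| Houston, TX | 50.0 | 53.6 | 60.8 | 68.0 | 73.4 | 80.6 | 82.4 | 82.4 | 78.8 | 69.8 | 60.8 | 53.6 |
| Phoenix, AZ | 53.6 | 57.2 | 60.8 | 69.8 | 78.8 | 87.8 | 91.4 | 89.6 | 86.0 | 73.4 | 60.8 | 53.6 |
| Seattle, WA | 41.0 | 42.8 | 44.6 | 50.0 | 55.4 | 60.8 | 64.4 | 64.4 | 60.8 | 53.6 | 46.4 | 42.8 |
| San Francisco, CA | 50.0 | 53.6 | 53.6 | 55.4 | 57.2 | 59.0 | 59.0 | 60.8 | 62.6 | 60.8 | 57.2 | 51.8 |
| San Diego, CA | 55.4 | 57.2 | 59.0 | 60.8 | 62.6 | 66.2 | 69.8 | 71.6 | 69.8 | 66.2 | 60.8 | 57.2 |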
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA avgfahrenheit;\n", " set avgcelsius;\n", " janf = 1.8*jan + 32;\n", " febf = 1.8*feb + 32;\n", " marf = 1.8*mar + 32;\n", " aprf = 1.8*apr + 32;\n", " mayf = 1.8*may + 32;\n", " junf = 1.8*jun + 32;\n", " julf = 1.8*jul + 32;\n", " augf = 1.8*aug + 32;\n", " sepf = 1.8*sep + 32;\n", " octf = 1.8*oct + 32;\n", " novf = 1.8*nov + 32;\n", " decf = 1.8*dec + 32;\n", " drop jan feb mar apr may jun\n", " jul aug sep oct nov dec;\n", "RUN;\n", " \n", "PROC PRINT data = avgfahrenheit;\n", " title 'Average Monthly Temperatures in Fahrenheit';\n", " id City;\n", " var janf febf marf aprf mayf junf \n", " julf augf sepf octf novf decf;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

As you can see by the number of assignment statements necessary to make the conversions, the exercise becomes one of patience. Because there are twelve average monthly temperatures, we must write twelve assignment statements. Each assignment statement performs the same calculation. Only the name of the variable changes in each statement. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were properly converted to Fahrenheit temperatures.

\n", "

The above program is crying out for the use of an array. One of the primary arguments for using an array is to reduce the number of statements that are required for processing variables. Let's take a look at an example.

\n", "

The following program uses a one-dimensional array called fahr to convert the average Celsius temperatures in the avgcelsius data set to average Fahrenheit temperatures stored in a new data set called avgfahrenheit:

\n", "
" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Average Monthly Temperatures in Fahrenheit

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
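| City | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State College, PA | 28.4 | 28.4 | 35.6 | 46.4 | 57.2 | 66.2 | 69.8 | 68.0 | 60.8 | 50.0 | 39.2 | 30.2 |
| Miami, FL | 68.0 | 68.0 | 71.6 | 73.4 | 78.8 | 80.6 | 82.4 | 82.4 | 80.6 | 78.8 | 73.4 | 68.0 |
| St. Louis, MO | 30.2 | 33.8 | 42.8 | 55.4 | 64.4 | 73.4 | 78.8 | 77.0 | 69.8 | 59.0 | 44.6 | 33.8 |
| New Orleans, LA | 51.8 | 55.4 | 60.8 | 68.0 | 73.4 | 80.6 | 80.6 | 80.6 | 78.8 | 69.8 | 60.8 | 53.6 |
| Madison, WI | 17.6 | 23.0 | 32.0 | 44.6 | 57.2 | 66.2 | 71.6 | 68.0 | 60.8 | 50.0 | 35.6 | 23.0 |
| Houston, TX | 50.0 | 53.6 | 60.8 | 68.0 | 73.4 | 80.6 | 82.4 | 82.4 | 78.8 | 69.8 | 60.8 | 53.6 |
| Phoenix, AZ | 53.6 | 57.2 | 60.8 | 69.8 | 78.8 | 87.8 | 91.4 | 89.6 | 86.0 | 73.4 | 60.8 | 53.6 |
| Seattle, WA | 41.0 | 42.8 | 44.6 | 50.0 | 55.4 | 60.8 | 64.4 | 64.4 | 60.8 | 53.6 | 46.4 | 42.8 |
| San Francisco, CA | 50.0 | 53.6 | 53.6 | 55.4 | 57.2 | 59.0 | 59.0 | 60.8 | 62.6 | 60.8 | 57.2 | 51.8 |
| San Diego, CA | 55.4 | 57.2 | 59.0 | 60.8 | 62.6 | 66.2 | 69.8 | 71.6 | 69.8 | 66.2 | 60.8 | 57.2 |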
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA avgfahrenheit;\n", " set avgcelsius;\n", " array fahr(12) jan feb mar apr may jun\n", " jul aug sep oct nov dec;\n", " do i = 1 to 12;\n", " fahr(i) = 1.8*fahr(i) + 32;\n", " end;\n", "RUN;\n", " \n", "PROC PRINT data = avgfahrenheit;\n", " title 'Average Monthly Temperatures in Fahrenheit';\n", " id City;\n", " var jan feb mar apr may jun \n", " jul aug sep oct nov dec;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

If you compare this program with the previous one, you can see the statements that replaced the twelve assignment statements. The ARRAY statement defines an array called fahr. It tells SAS that you want to group the twelve month variables, jan, feb, ..., dec, into an array called fahr. The (12) that appears in parentheses is a required part of the array declaration. Called the dimension of the array, it tells SAS how many elements, that is, variables, you want to group together. When specifying the variable names to be grouped in the array, we simply list the elements, separating each element with a space. As with all SAS statements, the ARRAY statement is closed with a semicolon (;).

\n", "

Once we've defined the array fahr, we can use it in our code instead of the individual variable names. We refer to an individual element of the array using the array name and an index, such as fahr(i). The order in which the variables appear in the ARRAY statement determines each variable's position in the array. For example, fahr(1) corresponds to the jan variable, fahr(2) corresponds to the feb variable, and fahr(12) corresponds to the dec variable. It's when you use an array like fahr in conjunction with an iterative DO loop that you can really simplify your code, as we did in this program.

\n", "

The DO loop tells SAS to process through the elements of the fahr array, each time converting the Celsius temperature to a Fahrenheit temperature. For example, when the index variable i is 1, the assignment statement becomes:

\n", "
fahr(1) = 1.8*fahr(1) + 32;
\n", "

which you could think of as saying:

\n", "
jan = 1.8*jan + 32;
\n", "

The value of jan on the right side of the equal sign is the Celsius temperature. After the assignment statement is executed, the value of jan on the left side of the equal sign is updated to reflect the Fahrenheit temperature.
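
Because fahr(i) appears on both sides of the assignment, the Celsius values are overwritten. As a minimal sketch, assuming you would rather keep both sets of temperatures, you could define two arrays: one over the existing Celsius variables and one over new Fahrenheit variables. The data set name avgboth, the array name cels, and the variable names janf, ..., decf are chosen here for illustration only.

```sas
/* Sketch only: avgboth, cels, and janf ... decf are illustrative names. */
DATA avgboth;
    set avgcelsius;
    /* Existing Celsius variables read from avgcelsius */
    array cels(12) jan feb mar apr may jun
                   jul aug sep oct nov dec;
    /* These variables do not exist yet, so SAS creates them as numeric */
    array fahr(12) janf febf marf aprf mayf junf
                   julf augf sepf octf novf decf;
    do i = 1 to 12;
        fahr(i) = 1.8*cels(i) + 32;
    end;
    drop i;   /* keep the loop index out of the output data set */
RUN;
```

The discussion that follows continues with the original program, which updates the month variables in place.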

\n", "

Now, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were again properly converted to Fahrenheit temperatures. Oh, one more thing to point out! Note that the variables listed in the PRINT procedure's VAR statement are the original variable names jan, feb, ..., dec, not the variables as they were grouped into an array, fahr(1), fahr(2), ..., fahr(12). That's because an array exists only for the duration of the DATA step. If you instead tell SAS to print fahr(1), fahr(2), ... in the PRINT procedure, you'll see that SAS will hiccup. Let's summarize!

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To define an array, you must use an ARRAY statement having the following general form in order to group previously defined data set variables into an array:\n", "\n", "`ARRAY array-name(dimension) ;`\n", "\n", "where:\n", "\n", "* array-name must be a valid SAS name that specifies the name of the array\n", "* dimension describes the number and arrangement of array elements. The default dimension is one.\n", "* elements list the variables to be grouped together to form the array. The array elements must be either all numeric or all character. Using standard SAS Help notation, the term elements appears in <> brackets to indicate that they are optional. That is, you do not have to specify elements in the ARRAY statement. If no elements are listed, new variables are created with default names.\n", "\n", "A few more points must be made about the array-name. Unless you are interested in confusing SAS, you should not give an array the same name as a variable that appears in the same DATA step. You should also avoid giving an array the same name as a valid SAS function. SAS allows you to do so, but then you won't be able to use the function in the same DATA step. For example, if you named an array mean in a DATA step, you would not be able to use the mean function in the DATA step. SAS will print a warning message in your log window to let you know such. Finally, array names cannot be used in LABEL, FORMAT, DROP, KEEP, or LENGTH statements.\n", "\n", "Let's look at another example to see a different way to define the array used to convert degrees Celsius to Farenheit.\n", "\n", "
\n", "

Example

\n", "

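The following program does the same thing as the program in the previous example, except that the 12 in the ARRAY statement has been changed to an asterisk (*) and we use a SAS variable list (jan -- dec) to grab the variables for the array. The double-dash list selects every variable from jan through dec in the order in which they are defined in the data set, so it works here only because the twelve month variables are stored next to each other: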

\n", "
" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Average Monthly Temperatures in Fahrenheit

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
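| City | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| State College, PA | 28.4 | 28.4 | 35.6 | 46.4 | 57.2 | 66.2 | 69.8 | 68.0 | 60.8 | 50.0 | 39.2 | 30.2 |
| Miami, FL | 68.0 | 68.0 | 71.6 | 73.4 | 78.8 | 80.6 | 82.4 | 82.4 | 80.6 | 78.8 | 73.4 | 68.0 |
| St. Louis, MO | 30.2 | 33.8 | 42.8 | 55.4 | 64.4 | 73.4 | 78.8 | 77.0 | 69.8 | 59.0 | 44.6 | 33.8 |
| New Orleans, LA | 51.8 | 55.4 | 60.8 | 68.0 | 73.4 | 80.6 | 80.6 | 80.6 | 78.8 | 69.8 | 60.8 | 53.6 |
| Madison, WI | 17.6 | 23.0 | 32.0 | 44.6 | 57.2 | 66.2 | 71.6 | 68.0 | 60.8 | 50.0 | 35.6 | 23.0 |
| Houston, TX | 50.0 | 53.6 | 60.8 | 68.0 | 73.4 | 80.6 | 82.4 | 82.4 | 78.8 | 69.8 | 60.8 | 53.6 |
| Phoenix, AZ | 53.6 | 57.2 | 60.8 | 69.8 | 78.8 | 87.8 | 91.4 | 89.6 | 86.0 | 73.4 | 60.8 | 53.6 |
| Seattle, WA | 41.0 | 42.8 | 44.6 | 50.0 | 55.4 | 60.8 | 64.4 | 64.4 | 60.8 | 53.6 | 46.4 | 42.8 |
| San Francisco, CA | 50.0 | 53.6 | 53.6 | 55.4 | 57.2 | 59.0 | 59.0 | 60.8 | 62.6 | 60.8 | 57.2 | 51.8 |
| San Diego, CA | 55.4 | 57.2 | 59.0 | 60.8 | 62.6 | 66.2 | 69.8 | 71.6 | 69.8 | 66.2 | 60.8 | 57.2 |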
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA avgfahrenheittwo;\n", " set avgcelsius;\n", " array fahr(*) jan -- dec;\n", " do i = 1 to 12;\n", " fahr(i) = 1.8*fahr(i) + 32;\n", " end;\n", "RUN;\n", " \n", "PROC PRINT data = avgfahrenheittwo;\n", " title 'Average Monthly Temperatures in Fahrenheit';\n", " id City;\n", " var jan feb mar apr may jun \n", " jul aug sep oct nov dec;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Simple enough! Rather than telling SAS how many variables you are grouping and listing each one of them, you can let SAS do the dirty work of counting the elements and expanding your variable list. To do so, you simply define the dimension using an asterisk (*) and use the SAS list shortcut. You might find this strategy particularly helpful if you are grouping so many variables together into an array that you don't want to spend the time counting and listing them individually. Incidentally, throughout this lesson, we enclose the array's dimension (or index variable) in parentheses ( ). We could alternatively use braces { } or brackets [ ].
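
Since SAS counts the elements for you when the dimension is an asterisk, the upper bound of the DO loop need not be hard-coded either. The following is a minimal sketch, assuming the same avgcelsius data set; the output data set name avgfahrenheitdim is made up for illustration, and DIM is the SAS function that returns the number of elements in an array.

```sas
/* Sketch only: avgfahrenheitdim is an illustrative data set name. */
DATA avgfahrenheitdim;
    set avgcelsius;
    array fahr(*) jan -- dec;
    /* dim(fahr) returns however many variables the list expanded to */
    do i = 1 to dim(fahr);
        fahr(i) = 1.8*fahr(i) + 32;
    end;
    drop i;   /* keep the loop index out of the output data set */
RUN;
```

Written this way, the loop keeps working without modification if the variable list later grows or shrinks.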

\n", "

The above program used a SAS list to shorten the list of variable names grouped into the fahr array. In some cases, you could also consider using the special name lists _ALL_, _CHARACTER_ and _NUMERIC_:

\n", "
    \n", "
  • Use _ALL_ when every variable in your SAS data set is of the same type (all numeric or all character) and you want SAS to use all of them.
\n", "
  • Use _CHARACTER_ when you want SAS to use all of the character variables in your data set.\n", "
\n", "
  • Use _NUMERIC_ when you want SAS to use all of the numeric variables in your data set.
\n", "
\n", "

In this case, we could have used the _NUMERIC_ keyword instead, as shown in the following program.

\n", "
" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "