{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with Your Data\n", "\n", "Now that we know how to read data into SAS and work with SAS variables in the DATA step, we want to learn the basics of working with our dataset. In this section, we will see how to\n", " \n", "1. Subset rows of a SAS dataset using WHERE, IF, OBS, and FIRSTOBS.\n", "2. Subset columns of a SAS dataset by using DROP or KEEP with SAS variable lists.\n", "3. Add a column/new variable to a SAS dataset.\n", "4. Modify attributes of a variable by renaming, adding a label to a SAS variable, and changing the variable's length.\n", "6. Order the rows of a SAS dataset using PROC SORT." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Subsetting Rows of a SAS Dataset\n", "\n", "Working with subsets created from an existing SAS data set can make more efficient use of computer resources than working with the original, larger data set. Reading fewer observations means that fewer iterations of the DATA step must occur. \n", "\n", "There are a few ways to subset rows in SAS\n", "\n", "* The FIRSTOBS= and OBS= set statement options.\n", "* A subsetting IF or WHERE statement.\n", "\n", "### The FIRSTOBS= and OBS= Options\n", "\n", "Let's begin by working with the FIRSTOBS and OBS set statement options.\n", "\n", "* The SET statement's **FIRSTOBS=** option tells SAS to begin reading the data from the input SAS data set at the line number specified by FIRSTOBS.\n", "* The SET statement's **OBS=** option tells SAS to stop reading the data from the input SAS data set at the line number specified by OBS.\n", "\n", "We'll start by using the OBS= option to create the data set that we'll be working with throughout this lesson. The data set we'll use contains demographic (or \"background\") information collected on 638 subjects once enrolled in the National Institute of Health's Interstitial Cystitis Data Base (ICDB) Study. Not surprisingly, the ICDB Study collected data on people who were diagnosed as having interstitial cystitis! The primary reason for conducting the study was that interstitial cystitis is a poorly understood condition that causes severe bladder and pelvic pain, urinary frequency, and painful urination in the absence of any identifiable cause. Although the disease is more prevalent in women, it affects both men and women of all ages. (If you want to learn more about the ICDB Study, I refer you to one of the National Institute of Health's web sites in which a general description is given.\n", "\n", "Given that we'll use the ICDB Study's background data, it would probably be helpful for you to take a peek at the background data form on which the data were collected. In order to run the SAS programs in this lesson, you'll need to download and save the background data set, back.sas7bdat, in a folder on your computer (see the DATA in CANVAS).\n", "\n", "
\n", "

Example

\n", "

The DATA step in the following program uses the OBS= option to tell SAS to create a temporary data set called back by selecting the first 25 observations from the permanent background data set icdb.back:

\n", "
" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

A Subset of the Background Data Set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_typev_dater_idb_datesexstatecountryraceethnicreligmar_sted_levelemp_stjob_chngincome
1110027010/05/93206807/05/62251400131.2
2110029010/05/93206809/07/26251402158.2
3110039012/07/93206807/24/242221403138.2
4110040011/30/93206810/20/672321407151.2
5110045001/11/94206804/18/25136140311802
6110049001/25/94206810/05/23237140115802
7110051001/25/94206812/02/42242140313102
8110052001/27/94180801/04/2525140014802
9110053002/22/94180803/15/2225141311801
10110055003/15/94180803/31/4125140013102
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", " \n", "DATA back;\n", " set PHC6089.back (obs=25);\n", "RUN;\n", " \n", "PROC PRINT data=back (obs = 10);\n", " title 'A Subset of the Background Data Set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The program is pretty straightforward. The main thing to keep in mind is that you have to enclose the OBS= option in parentheses.

\n", "

If you haven't already done so, download and save the background data set (see CANVAS) to a convenient location on your computer. Then, launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you've saved the data set. Then, run the SAS program, and review the output from the PRINT procedure to familiarize yourself with the data set.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following program uses the SET statement's FIRSTOBS= and OBS= options to tell SAS to include fourteen observations — observations 7, 8, 9, ..., and 20 — from the permanent icdb.back data set in the temporary back data set:

\n", "
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: BACK1

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_typev_dater_idb_datesexstatecountryraceethnicreligmar_sted_levelemp_stjob_chngincome
1110051001/25/94206812/02/42242140313102
2110052001/27/94180801/04/2525140014802
3110053002/22/94180803/15/2225141311801
4110055003/15/94180803/31/4125140013102
5110057003/15/94206807/10/4425140314202
6110058003/18/94180809/09/502.1341013101
7110059003/18/94180807/25/34213140113801
8110060006/14/94180805/29/36213140313102
9110062003/31/94180804/21/3623140114502
10110065004/04/94180810/12/5225140314411
11110066004/12/94180808/28/6225140014602
12110067004/26/94180802/22/7225140164202
13110068006/13/94180809/10/552251401113702
14110069005/31/94180808/17/38232140013102
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA back1;\n", " set phc6089.back (FIRSTOBS=7 OBS=20);\n", "RUN;\n", " \n", "PROC PRINT data=back1;\n", " title 'Output Dataset: BACK1';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the background data set. Then, run the SAS program, and review the output from the PRINT procedure. Compare the output to the output of that from the previous example to convince yourself that the temporary data set back1 indeed contains fourteen observations — observations 7, 8, ..., 20 in the original background data set.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### WHERE= Option, WHERE clause and Subsetting IF\n", "\n", "The WHERE= option allows one to select only those observations from a SAS data set that meet a certain condition. The WHERE= option can be attached to the SET statement or the DATA statement. If the WHERE= option is attached to the SET statement, SAS selects the observations that meet the condition as it reads in the data. If the WHERE= option is attached to the DATA statement, SAS selects the observations as it writes the data from the program data vector to the output data set.\n", "\n", "
\n", "

Example

\n", "

The following program illustrates the use of the WHERE= option to select observations from a SAS data set that meet a certain condition. Because the WHERE= option is attached to the DATA statement, the selection process takes place as SAS writes the data from the program data vector to the output data set:

\n", "
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: TEMPLE

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_typev_dater_idb_datesexstatecountryraceethnicreligmar_sted_levelemp_stjob_chngincome
1230003007/09/93539610/25/472221401431.1
2230004001/04/94539608/15/23238140153501
3230005001/06/94539605/25/49210130324102
4230006001/06/94539604/24/49221140315512
5230008010/03/96539608/09/60238140114202
6230009010/31/96539611/13/50138140113102
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Output Dataset: OKLA

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_typev_dater_idb_datesexstatecountryraceethnicreligmar_sted_levelemp_stjob_chngincome
1310020006/18/93191608/09/432431403151.2
2310032008/03/93191604/02/342131401118.1
3310037011/08/93191604/25/392361401137.2
4310041010/27/93191611/01/262431403145.2
5310049002/04/94191610/03/31225140314812
6310055007/12/94191612/11/56236140314111
7310056004/01/94191611/16/61236140314102
8310059006/13/94191608/22/34243140114802
9310065006/30/94191607/27/56236140014102
10310069009/23/94191607/14/42236140333711
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA temple (where = (int(subj/10000)=23)) \n", " okla (where = (int(subj/10000)=31));\n", " set phc6089.back;\n", "RUN;\n", " \n", "PROC PRINT data=temple;\n", " title 'Output Dataset: TEMPLE';\n", "RUN;\n", " \n", "PROC PRINT data=okla (obs = 10);\n", " title 'Output Dataset: OKLA';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Well, it's not really that simple. This program also illustrates the use of the WHERE= option to divide a large data set up into smaller data sets, based on the values of a certain variable. As you can see, there are actually two data set names — temple and okla — appearing in the DATA statement. That tells SAS that we want to create two data sets within this one DATA step. One data set, temple, should contain only those subjects enrolled at Temple University, while the other data set, okla, should contain only those subjects enrolled at the University of Oklahoma. Accomplishing that requires a bit of understanding about the subject number used in the ICDB Study, as well as the functionality of the INT function:

\n", "
    \n", "
  • The variable subj contains a six-digit subject number. The first two digits of the subject number indicate the location where the subject was enrolled. If the first two digits of the subject id number is 23, then the subject was enrolled at Temple University. If the first two digits of the subject id number is 31, the subject was enrolled at the University of Oklahoma.
  • \n", "
  • You might recall that the INT function — that is, the integer function— returns the integer part of the expression contained within parentheses. So, if the subject number is 230007, then int(subj/10000) = int(23.0007) = 23, the location id for Temple University.
  • \n", "
\n", "

Pay particular attention to the syntax of the WHERE= option ... it can trip you up if you aren't careful. The entire WHERE= option must be contained in parentheses immediately following the data set to which you want the condition to apply. The condition itself is also placed in parentheses. So, in general, the syntax, when applied to a DATA statement, should look like this:

\n", "
DATA dsname (WHERE = (condition));
\n", "

where dsname is the data set name and condition is the condition that you want SAS to evaluate when selecting the observations.

\n", "

Now, before running the program, change the directory in the LIBNAME statement so that it reflects the location in which you have saved the background data set. After you've edited the LIBNAME statement, run the SAS program. Review the output to convince yourself that the temporary data set temple contains the observations in the ICDB background data set that correspond to subjects enrolled at Temple University, while the temporary data set okla contains the observations that correspond to subjects enrolled at the University of Oklahoma.

\n", "

A few more comments. When you use the WHERE= option, the condition applies only to the data set which immediately precedes it. Alternatively, you can use a WHERE statement, in which the condition applies to all of the input data sets. Also beware that the WHERE= option cannot be used at the same time as the FIRSTOBS= or OBS= options.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following program illustrates efficient use of the WHERE= option in the SET statement. Because the WHERE= option appears in the SET statement, the selection process takes places as SAS reads in the observations from the icdb.back data set:

\n", "
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: TEMPLE2

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_typev_dater_idb_datesexstatecountryraceethnicreligmar_sted_levelemp_stjob_chngincome
1230003007/09/93539610/25/472221401431.1
2230004001/04/94539608/15/23238140153501
3230005001/06/94539605/25/49210130324102
4230006001/06/94539604/24/49221140315512
5230008010/03/96539608/09/60238140114202
6230009010/31/96539611/13/50138140113102
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA temple2;\n", " set phc6089.back (where = (int(subj/10000)=23));\n", "RUN;\n", " \n", "PROC PRINT data = temple2;\n", " title 'Output Dataset: TEMPLE2';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Before running the program, change the directory in the LIBNAME statement so that it reflects the location in which you have saved the background data set. After you've edited the LIBNAME statement, run the SAS program. Review the output from the PRINT procedure. Convince yourself that the temple2 data set contains the observations in the ICDB background data set that correspond to subjects enrolled at Temple University.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, if you want to exclude certain observations from all output datasets, you can use\n", "\n", "* WHERE clause or subsetting IF to include rows that meet a certain set of conditions\n", "* If-Then-Delete to exclude rows that meet a certain condition.\n", "\n", "
\n", "

Example

\n", "

The following program illustrates using the WHERE statement to only keep the records whose Subject ID starts with 23.

\n", "
" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: TEMPLE3

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_typev_dater_idb_datesexstatecountryraceethnicreligmar_sted_levelemp_stjob_chngincome
1230003007/09/93539610/25/472221401431.1
2230004001/04/94539608/15/23238140153501
3230005001/06/94539605/25/49210130324102
4230006001/06/94539604/24/49221140315512
5230008010/03/96539608/09/60238140114202
6230009010/31/96539611/13/50138140113102
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA temple3;\n", " set phc6089.back;\n", " where (int(subj/10000)=23);\n", "RUN;\n", " \n", "PROC PRINT data = temple3;\n", " title 'Output Dataset: TEMPLE3';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note that there is no = sign with the WHERE statement as opposed to using the WHERE= dataset option. Alternatively, we could have done the same thing using a subsetting If.

\n", "
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: TEMPLE4

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_typev_dater_idb_datesexstatecountryraceethnicreligmar_sted_levelemp_stjob_chngincome
1230003007/09/93539610/25/472221401431.1
2230004001/04/94539608/15/23238140153501
3230005001/06/94539605/25/49210130324102
4230006001/06/94539604/24/49221140315512
5230008010/03/96539608/09/60238140114202
6230009010/31/96539611/13/50138140113102
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA temple4;\n", " set phc6089.back;\n", " if (int(subj/10000)=23);\n", "RUN;\n", " \n", "PROC PRINT data = temple4;\n", " title 'Output Dataset: TEMPLE4';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

If instead we want to exclude observations that meet a certain condition, then we would need to use an If-Then-Delete statement. The following program removes rows where the subject id does not start with 23.

\n", "
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: TEMPLE5

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_typev_dater_idb_datesexstatecountryraceethnicreligmar_sted_levelemp_stjob_chngincome
1230003007/09/93539610/25/472221401431.1
2230004001/04/94539608/15/23238140153501
3230005001/06/94539605/25/49210130324102
4230006001/06/94539604/24/49221140315512
5230008010/03/96539608/09/60238140114202
6230009010/31/96539611/13/50138140113102
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA temple5;\n", " set phc6089.back;\n", " if (int(subj/10000) ne 23) Then DELETE;\n", "RUN;\n", " \n", "PROC PRINT data = temple5;\n", " title 'Output Dataset: TEMPLE5';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Subsetting Columns of a SAS Dataset\n", "\n", "There may be situations in which you'd like to reduce the number of variables in a data set in order to work with a data set of a much more manageable size. I like to think of doing such a thing as \"making a data set thinner.\" We can select certain variables from our dataset by either\n", "\n", "* Using a DROP statement or DROP= dataset option to select which variables to *exclude*.\n", "* Or by using a KEEP statement or KEEP=dataset option to select which variables in *include*.\n", "\n", "The advantage is that, when you can get away with reading in fewer variables, SAS uses a smaller program data vector and therefore runs much more efficiently.\n", "\n", "Which option/statement should you use? The choice of whether to use the DROP or the KEEP in a given situation is personal. Or rather, let's say that most people will choose the option that requires the least amount of typing! I still try though, whenever possible, to use the KEEP, as then when I read my programs I know exactly which variables my data sets contain. Using DROP, on the other hand, requires one to think about what variables were in the data set before the dropping took place.\n", "\n", "
\n", "

Example

\n", "

The following program tells SAS to keep just three variables — subj, v_date, and b_date — when reading from the back1 data set in order to create the back2 data set:

\n", "
" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: BACK2

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_dateb_dateage
111002710/05/9307/05/6231.3
211002910/05/9309/07/2667.1
311003912/07/9307/24/2469.4
411004011/30/9310/20/6726.1
511004501/11/9404/18/2568.8
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA back2;\n", " set phc6089.back (keep = subj v_date b_date);\n", " age = (v_date - b_date)/365; * Calculate AGE in years;\n", " format age 4.1;\n", "RUN;\n", " \n", "PROC PRINT data=back2 (obs=5);\n", " title 'Output Dataset: BACK2';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The SET statement's KEEP= option is used to tell SAS to read three variables — subject (subj), visit date (v_date), and birth date (b_date) — from the data set back and to store them in a new temporary data set back2. Note that the KEEP= option must be placed in parentheses and follow the name of the data set from you which you want SAS to select variables. The list of variables that you specify to keep must be separated by at least one blank space.

\n", "

Then, the subject's age (age) at the time of the visit is calculated by subtracting the difference between visit date (v_date) and birth date (b_date) and dividing by 365 to get (an approximate) age in years.

\n", "

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the temporary data set back2 contains the three kept variables (subj, v_date, and b_date), as well as the newly calculated variable (age).

\n", "

Alternative, we could have used the keep statement. The benefit to using the SET option keep= in the example above, is that we have only read in the variables that we need. The following version reads in all the variables to the DATA step and then only keeps the selected varaibles to output to the dataset back3.

\n", "

Note that now, we need to also put the new variable age in the keep statement. Before, the keep= option in the SET statement only applied to the input dataset in the SET statement, but the keep statement applies to all variables in the DATA step.

\n", "
" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: BACK3

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_dateb_dateage
111002710/05/9307/05/6231.3
211002910/05/9309/07/2667.1
311003912/07/9307/24/2469.4
411004011/30/9310/20/6726.1
511004501/11/9404/18/2568.8
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA back3;\n", " set phc6089.back; \n", " keep subj v_date b_date age;\n", " age = (v_date - b_date)/365; * Calculate AGE in years;\n", " format age 4.1;\n", "RUN;\n", " \n", "PROC PRINT data=back3 (obs=5);\n", " title 'Output Dataset: BACK3';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

We can also use the keep= option as a DATA statement option.

\n", "
" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: BACK3a

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_dateb_dateage
111002710/05/9307/05/6231.3
211002910/05/9309/07/2667.1
311003912/07/9307/24/2469.4
411004011/30/9310/20/6726.1
511004501/11/9404/18/2568.8
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA back3a (keep= subj v_date b_date age);\n", " set phc6089.back; \n", " age = (v_date - b_date)/365; * Calculate AGE in years;\n", " format age 4.1;\n", "RUN;\n", " \n", "PROC PRINT data=back3a (obs=5);\n", " title 'Output Dataset: BACK3a';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Another way to acheive the same result is to use DROP to exclude all the variables that we don't want

\n", "
" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: BACK4

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_dateb_dateage
111002710/05/9307/05/6231.3
211002910/05/9309/07/2667.1
311003912/07/9307/24/2469.4
411004011/30/9310/20/6726.1
511004501/11/9404/18/2568.8
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA back4;\n", " set phc6089.back (drop = v_type r_id sex state country \n", " race ethnic relig mar_st\n", " ed_level emp_st job_chng income);\n", " age = (v_date - b_date)/365; * Calculate AGE in years;\n", " format age 4.1;\n", "RUN;\n", " \n", "PROC PRINT data=back4 (obs = 5);\n", " title 'Output Dataset: BACK4';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The SET statement's DROP= option tells SAS to drop thirteen variables — the visit type (v_type), the id number of the reviewer of the data form (r_id), as well as the subject's gender (sex), state (state) and country (country) of residence, race (race), ethnicity (ethnic), religion (relig), marital status (mar_st), education level (ed_level), employment status (emp_st), job change status (job_chng), and income (income) — from the data set back and to store the remaining variables in a new temporary data set back3. Therefore, the program data vector used to create the SAS data set back3 contains only three variables — subject id (subj), visit date (v_date), and birth date (b_date).

\n", "

Note that, just like the KEEP= option, the DROP= option must be placed in parentheses and follow the name of the data set from you which you want SAS to exclude variables. The list of variables that you specify to drop must be separated by at least one blank space.

\n", "

We could also have accomplished this by using a DROP statement. This however will result in SAS reading in all variables to the SAS data vector, and then only outputting the variables we decide not to drop to the output dataset.

\n", "
" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: BACK5

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_dateb_dateage
111002710/05/9307/05/6231.3
211002910/05/9309/07/2667.1
311003912/07/9307/24/2469.4
411004011/30/9310/20/6726.1
511004501/11/9404/18/2568.8
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA back5;\n", " set phc6089.back;\n", " drop v_type r_id sex state country \n", " race ethnic relig mar_st\n", " ed_level emp_st job_chng income;\n", " age = (v_date - b_date)/365; * Calculate AGE in years;\n", " format age 4.1;\n", "RUN;\n", " \n", "PROC PRINT data=back5 (obs = 5);\n", " title 'Output Dataset: BACK5';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Alternatively, we could have also used the DATA statement's drop= option.

\n", "
" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: BACK5a

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_dateb_dateage
111002710/05/9307/05/6231.3
211002910/05/9309/07/2667.1
311003912/07/9307/24/2469.4
411004011/30/9310/20/6726.1
511004501/11/9404/18/2568.8
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA back5a (drop = v_type r_id sex state country \n", " race ethnic relig mar_st\n", " ed_level emp_st job_chng income);\n", " set phc6089.back;\n", " age = (v_date - b_date)/365; * Calculate AGE in years;\n", " format age 4.1;\n", "RUN;\n", " \n", "PROC PRINT data=back5a (obs = 5);\n", " title 'Output Dataset: BACK5a';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Listing out all of these variables can be very tedious, so SAS provides a shortcut to create variable lists. We can specify a list of the form var1 -- var2 where var1 and var2 are two variable names in our SAS dataset. The -- tells SAS to \"fill in\" all variables that occur in the dataset inbetween these two columns.

\n", "

If we look at the order of the columns in our dataset by using PROC CONTENTS with the VARNUM option, then we can see that we all the columns from sex to income are in order sequentially. This will allow us to shorten our DROP statement to

\n", "
DROP v_type r_id sex -- income;
\n", "
" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: BACK6

\n", "
\n", "
\n", "

The CONTENTS Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Variables in Creation Order
#VariableTypeLenFormat
1subjNum8 
2v_typeNum8 
3v_dateNum8MMDDYY8.
4r_idNum8 
5b_dateNum8MMDDYY8.
6sexNum8 
7stateNum8 
8countryNum8 
9raceNum8 
10ethnicNum8 
11religNum8 
12mar_stNum8 
13ed_levelNum8 
14emp_stNum8 
15job_chngNum8 
16incomeNum8 
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ods select Position;\n", "PROC contents data = phc6089.back order=varnum;\n", "RUN;\n", "ods select ALL;" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: BACK6

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_dateb_dateage
111002710/05/9307/05/6231.3
211002910/05/9309/07/2667.1
311003912/07/9307/24/2469.4
411004011/30/9310/20/6726.1
511004501/11/9404/18/2568.8
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA back6;\n", " set phc6089.back;\n", " drop v_type r_id sex -- income;\n", " age = (v_date - b_date)/365; * Calculate AGE in years;\n", " format age 4.1;\n", "RUN;\n", " \n", "PROC PRINT data=back6 (obs = 5);\n", " title 'Output Dataset: BACK6';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

What to Use When

\n", "\n", "After reviewing the previous four examples, your head might be spinning and you might be wondering how you would know which to use — the KEEP= option on the SET statement? the DROP= option on the SET statement? the KEEP= option on the DATA statement? or the DROP= option on the DATA statement? Here are the key facts that will help us devise a strategy: \n", "\n", "* As stated earlier, the choice between KEEP and DROP is a matter of personal choice. It's whether you place the options on the SET statement, the DATA statement, or use the individual statement that makes a difference. \n", "* In the SET statement, the DROP= and KEEP= options determine which variables are read from the input SAS dataset. That is, when you specify the DROP= or KEEP= option in the SET statement, the excluded variables are never even read into the program data vector. \n", "* In the DATA statement, the DROP= and KEEP= options determine which variables are written from the program data vector to the resulting SAS data set. That is, when you specify the DROP= or KEEP= option in the DATA statement, all of the variables in the input data set are read in to the program data vector, but only the selected variables are written from the program data vector to the output data set. This also allows you to create multiple datasets in a single DATA step by keeping/dropping different sets of variables in each of the output dataset.\n", "* The individual DROP and KEEP statements work similarly to the DATA statement options except that they will apply to all datasets listed in the DATA statement.\n", "\n", "Those facts stated, it becomes obvious that we can construct a more efficient DATA step by not reading unneeded variables from the input data set. That said, we might have a working strategy: \n", "\n", "* use the KEEP= option (or DROP= option) in the SET statement to tell SAS only those variables that you need from the input data set read into the program data vector in order to achieve the goals of your program\n", "* use the KEEP= option (or DROP= option) in the DATA statement or the individual statements to tell SAS all of the variables that you want written from the program data vector into the output data set. You must use the DATA statement options if you want to select different variables for different datasets listed in the DATA statement.\n", "\n", "Let's try our working strategy out on an example.\n", "\n", "
\n", "

Example

\n", "

The following SAS program illustrates our working strategy of when to use the KEEP= option in the SET statement and when to use the KEEP= option in the DATA statement:

\n", "
" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: BACK6

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjage
111002731.3
211002967.1
311003969.4
411004026.1
511004568.8
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA back6 (keep = subj age);\n", " set phc6089.back (keep= subj v_date b_date);\n", " age = (v_date - b_date)/365;\n", " format age 4.1;\n", "RUN;\n", " \n", "PROC PRINT data=back6 (obs=5);\n", " title 'Output Dataset: BACK6';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Here, the KEEP= option on the SET statement tells SAS to keep only three variables — subj, v_date, and b_date — from the input data set. The v_date and b_date variables are kept because we need them in order to calculate the age variable. The subject's age is calculated by subtracting b_date from v_date and dividing by 365. Therefore, the program data vector contains four variables — subj, v_date, b_date, and age —all, none or any of which can be written to the output back6 data set. The KEEP= option of the DATA statement tells SAS which variables to write from the program data vector to the output data set. Here, SAS is told to write only the subject's id (subj) and age (age) to back6.

\n", "

Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the temporary data set back6 does indeed contain only thesubject's id (subj) and age (age).

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modifying Variable Attributes: RENAME= and LABEL options\n", "\n", "Recall that the descriptor portion of a data set containing information about the attributes of each variable in the data set. The attribute information includes the variable's name, type, length, format, informat, and label. In this section, we will learn how to \n", "\n", "* Use the LENGTH statement to adjust the default length of a variable.\n", "* Use the RENAME= option to change the name of a variable.\n", "* Use the LABEL statement to add a descriptive label to a variable.\n", "\n", "We will cover formats in a later section.\n", "\n", "### The RENAME= Option\n", "\n", "There may be occasions in which you want to change some of the variable names in your SAS data set. To do so, you'll want to use the RENAME= option. As its name suggests, the RENAME= option allows you to change the variable names within a SAS data set.\n", "\n", "The format of the RENAME= option is:\n", "\n", "`RENAME = (old1=new1 old2=new2 .... oldk=newk);`\n", "\n", "where old1, old2, ... oldk are the variable names as they appear in the data set that precedes the RENAME= option, and new1, new2, ..., newk are the corresponding new variable names.\n", "\n", "The effect of the RENAME= option depends on where it appears:\n", "\n", "* If the RENAME= option appears in the SET statement, then the new variable name takes effect when the program data vector is created. Therefore, all programming statements within the DATA step must refer to the new variable name.\n", "* If the RENAME= option appears in the DATA statement, then the new variable name takes effect only when the data are written to the SAS data set. Therefore, all programming statements within the DATA step must refer to the old variable name.\n", "\n", "
\n", "

Example

\n", "

The following program illustrates the use of the RENAME= option in the SET statement. Specifically, the variable sex is changed to gender, and b_date is changed to birth, when the program data vector is created:

\n", "
" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: BACK7

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_datebirthgenderage
111002710/05/9307/05/62231.2740
211002910/05/9309/07/26267.1233
311003912/07/9307/24/24269.4192
411004011/30/9310/20/67226.1315
511004501/11/9404/18/25168.7808
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA back7 (keep = subj gender v_date birth age);\n", " set phc6089.back (rename=(sex=gender b_date=birth));\n", " age = (v_date - birth)/365; *MUST use NEW name for date of birth;\n", "RUN;\n", " \n", "PROC PRINT data=back7 (obs=5);\n", " title 'Output Dataset: BACK7';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Because the RENAME= option appears in the SET statement, SAS no longer recognizes the variable name sex as the gender, nor b_date as the birth date, of the subject. Instead, SAS recognizes the variable names gender and birth. Hence, when we subsequently calculate the subjects' ages (age) in the DATA step, we must refer to the new variable name birth.

\n", "

Again, pay particular attention to the syntax of the RENAME= option ... it too can be tricky. The entire RENAME= option must be contained in parentheses immediately following the data set to which you want the name changes to apply. The variable names must also be placed in parentheses. So, in general, the syntax, when applied to a DATA statement, should look like this:

\n", "
DATA dsname (RENAME = (o1=n1 o2=n2 ...));
\n", "

where dsname is the data set name and o1 and o2 are the old variable names, and n1 and n2 are the new variable names.

\n", "

Launch and run the SAS program. Review the output from the PRINT procedure. Convince yourself that the variable names sex and b_date have been changed as advertised to gender and birth, respectively. Also, verify that the ages of the subjects have been calculated appropriately. Then, in the SAS program, change the variable name birth back to the variable name b_date, and re-run the program. Does SAS indeed hiccup?

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following program illustrates use of the RENAME= option, when it appears in the DATA statement. Specifically, the variable sex is changed to gender, and b_date is changed to birth when SAS writes the data to the output data set:

\n", "
" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: BACK8

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjv_datebirthgenderage
111002710/05/9307/05/62231.2740
211002910/05/9309/07/26267.1233
311003912/07/9307/24/24269.4192
411004011/30/9310/20/67226.1315
511004501/11/9404/18/25168.7808
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA back8 (rename=(sex=gender b_date=birth)\n", " keep = subj sex v_date b_date age);\n", " set phc6089.back;\n", " age = (v_date - b_date)/365; *MUST use OLD name \n", " for date of birth;\n", "RUN;\n", " \n", "PROC PRINT data=back8 (obs=5);\n", " title 'Output Dataset: BACK8';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Because the RENAME= option appears in the DATA statement, SAS only recognizes the variable names as they appear in the input data set back3. That is, for example, SAS recognizes the variable name b_date as the birth date of the subjects. Hence, when we subsequently calculate the subjects' ages in the DATA step, we must refer to the old variable name b_date. Also note that the KEEP= option in the DATA statement must refer to the original variable names as they appear in the back3 data set.

\n", "

This program also illustrates how to use more than one DATA step option at a time. Specifically, the RENAME= and KEEP= options are used to modify the back8 data set. As such, both options are placed within one set of parentheses immediately following the data set to which you want the changes to apply. Then, within those parentheses, the basic syntax for each option is followed.

\n", "

Launch and run the SAS program, and review the output from the PRINT procedure. Convince yourself that the variable names sex and b_date have been changed as advertised to gender and birth, respectively. Also, verify that the ages of the subjects have been calculated appropriately.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The LABEL Statement\n", "\n", "The LABEL statement associates a descriptive label to a variable. This label is used in place of the variable name when printing such as when creating plots, the label is used as the axis label.\n", "\n", "
\n", "

Example

\n", "

The following program uses a LABEL statement to associate descriptive labels to the variables subj, sex, v_date, b_date, and age.

\n", "
" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The CONTENTS Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Variables in Creation Order
#VariableTypeLenFormatLabel
1subjNum8 Subject ID
2v_dateNum8MMDDYY8.Visit Date
3birthNum8MMDDYY8.Birth Date
4genderNum8 Gender
5ageNum8 Age in Years
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';\n", "\n", "DATA back8a (rename=(sex=gender b_date=birth)\n", " keep = subj sex v_date b_date age);\n", " set phc6089.back;\n", " age = (v_date - b_date)/365; *MUST use OLD name \n", " for date of birth;\n", " LABEL subj = \"Subject ID\"\n", " sex = \"Gender\"\n", " v_date = \"Visit Date\"\n", " b_date = \"Birth Date\"\n", " age = \"Age in Years\";\n", "RUN;\n", "\n", "title ;\n", "ODS SELECT Position;\n", "PROC contents data = back8a order = varnum;\n", "RUN;\n", "ODS SELECT ALL;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

We can see that the labels were added by viewing the descriptor portion of the dataset via PROC CONTENTS. A label can be as long as 256 characters.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The LENGTH Statement\n", "\n", "We learned earlier that by default, character variables are 8 characters long. To get around this before, we used a character informat or column input to read character strings that were longer than 8 characters. Another way is the use the LENGTH statement to set the variable length attribute to be longer than 8 characters.\n", "\n", "
\n", "

Example

\n", "

The following simple program illustrates creating a character value with the default length versus setting the length to be longer by using the LENGTH statement.

\n", "
" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Character Length Data Example

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obschar_var2char_var1
1A longer character sA longer
2shortCharshortCha
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA char_length;\n", " LENGTH char_var2 $20;\n", " INFILE datalines delimiter = \",\";\n", " INPUT char_var1 $ char_var2 $;\n", " DATALINES;\n", "A longer character string, A longer character string\n", "shortChar, shortChar\n", ";\n", "RUN;\n", "\n", "PROC print data = char_length;\n", " title \"Character Length Data Example\";\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The LENGTH statement sets the length of the variable char_var2 to be up to 20 characters, while char_var1 gets the default length of 8 characters. When reading the same string value, char_var1 can only store up to the first 8 characters and discards the rest, while char_var2 can store up to 20. This results in the strings being only partially read in for char_var1.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sorting the Rows of a Dataset: PROC SORT\n", "\n", "By default, the PRINT procedure displays observations in the order in which they appear in your data set. Alternatively, you can use the SORT procedure to first sort your data set based on the values of one or more variables. Then, when you use the PRINT procedure, SAS will display the observations in the order in which you sorted the data.\n", "\n", "
\n", "

Example

\n", "

The following SAS program uses the SORT procedure to sort the temporary dataset back8 that we created earlier by gender and then by age. Note that it is sorted by gender first. Then, if there are any ties in gender, these are sorted by age.

\n", "
" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

First 5 sorted observations for gender = 1

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obsgenderagesubjv_datebirth
11-18.301411012502/06/9605/22/14
21-17.369911015207/19/9611/27/13
3123.498641007306/13/9512/19/71
4123.682252005109/26/9401/26/71
5124.863041007105/30/9507/25/70
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

First 5 sorted observations for gender = 2

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obsgenderagesubjv_datebirth
572-23.153441009704/24/9606/14/19
582-22.679542007409/07/9505/07/18
592-22.646652000708/13/9303/31/16
602-22.243831010202/12/9605/06/18
612-22.178142011212/17/9602/15/19
622-20.304151001009/21/9301/05/14
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SORT data = back8 out = srtd_back8;\n", " by gender age;\n", "RUN;\n", "\n", "title \"First 5 sorted observations for gender = 1\";\n", "PROC PRINT data = srtd_back8 (obs = 5);\n", " var gender age subj v_date birth;\n", "RUN;\n", "\n", "title \"First 5 sorted observations for gender = 2\";\n", "PROC PRINT data = srtd_back8 (firstobs = 57 obs = 62);\n", " var gender age subj v_date birth;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch the SAS program, but print the whole dataset by removing the FIRSTOBS and OBS dataset options to verify that the datset is sorted first by gender and then by age.

\n", "

Then, note that while the SORT procedure's BY statement is required, it's OUT= option is optional. If you don't use it, however, then the SORT procedure permanently sorts the data set that is specified in the DATA= option. Therefore, if you need your data to be sorted just to produce output temporarily, then you should use the OUT= option in conjunction with a temporary SAS data set name.

\n", "

By default, SAS sorts the values of the variables appearing in the BY statement in ascending order. If you want them sorted in descending order, you need to use the BY statement's DESCENDING option.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following SAS program uses the BY statement's DESCENDING option to tell SAS to sort the temporary dataset back8 first by gender in descending order, and then by age in ascending order.

\n", "
" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

First 5 sorted observations for gender = 1

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obsgenderagesubjv_datebirth
12-23.153441009704/24/9606/14/19
22-22.679542007409/07/9505/07/18
32-22.646652000708/13/9303/31/16
42-22.243831010202/12/9605/06/18
52-22.178142011212/17/9602/15/19
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

First 5 sorted observations for gender = 2

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obsgenderagesubjv_datebirth
57226.547941000609/09/9302/28/67
58226.586351005101/20/9506/26/68
59226.611041009102/14/9607/12/69
60226.745221010005/06/9608/14/69
61226.750721000605/13/9308/19/66
62226.843822003608/08/9510/10/68
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SORT data = back8 out = srtd_back8;\n", " by descending gender age;\n", "RUN;\n", "\n", "title \"First 5 sorted observations for gender = 1\";\n", "PROC PRINT data = srtd_back8 (obs = 5);\n", " var gender age subj v_date birth;\n", "RUN;\n", "\n", "title \"First 5 sorted observations for gender = 2\";\n", "PROC PRINT data = srtd_back8 (firstobs = 57 obs = 62);\n", " var gender age subj v_date birth;\n", "RUN;\n", "title ;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch the SAS program, but print the whole dataset by removing the FIRSTOBS and OBS dataset options to see that the dataset is sorted in the desired order. Note that if your BY statement contains more than one variable, then the DESCENDING option applies only to the variable that immediately follows it. You might want to sandwich another DESCENDING gender and age in the BY statement and then rerun the program to see the effect.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Create a temporary SAS dataset called `heart` from the sashelp.heart dataset. The library sashelp if a permanent library that comes with SAS.\n", "2. View the description of the `heart` dataset by running PROC CONTENTS. How many observations (rows) and variables (columns) are in the dataset?\n", "3. Use the RENAME= option to rename `Status` to `condition`. Show the variables table from PROC CONTENTS to see that the name has changed.\n", "4. Create a subset of `heart` called `heart_sub` that contains only the columns `height`, `weight`, and `diastolic`. What are the dimensions of `heart_sub`? Print the first five rows of `heart_sub`.\n", "5. Subset the patients (rows) of `heart` that have a diastolic blood pressure (diastolic) more than 90 into a dataset called `heart_sub2`. How many are there? Print the first five rows of `heart_sub2`.\n", "6. Subset the patients (rows) of `heart` that have a diastolic blood pressure less than or equal to 90 and have a weight more than 180 pounds into a dataset called `heart_sub3`. How many are there? Print the first five rows of `heart_sub3`.\n", "7. Create a subset of `heart` that only contains the columns `weight`, `systolic`, and `cholesterol` for those that are heavy smokers smoker (Smoking_Status = \"Heavy (16-25)\"). Save this to `heart_sub`. What are the dimensions of the dataset? Print the first five rows of `heart_sub`.\n", "8. Reorder the rows of `heart_sub` in decreasing order by cholesterol. Print the first 5 rows of the sorted dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "SAS", "language": "sas", "name": "sas" }, "language_info": { "codemirror_mode": "sas", "file_extension": ".sas", "mimetype": "text/x-sas", "name": "sas" } }, "nbformat": 4, "nbformat_minor": 2 }