6. Working with Your Data

Now that we know how to read data into SAS and work with SAS variables in the DATA step, we want to learn the basics of working with our dataset. In this section, we will see how to

  1. Subset rows of a SAS dataset using WHERE, IF, OBS, and FIRSTOBS.

  2. Subset columns of a SAS dataset by using DROP or KEEP with SAS variable lists.

  3. Add a column/new variable to a SAS dataset.

  4. Modify attributes of a variable by renaming, adding a label to a SAS variable, and changing the variable’s length.

  5. Order the rows of a SAS dataset using PROC SORT.

6.1. Subsetting Rows of a SAS Dataset

Working with subsets created from an existing SAS data set can make more efficient use of computer resources than working with the original, larger data set. Reading fewer observations means that fewer iterations of the DATA step must occur.

There are a few ways to subset rows in SAS

  • The FIRSTOBS= and OBS= set statement options.

  • A subsetting IF or WHERE statement.

6.1.1. The FIRSTOBS= and OBS= Options

Let’s begin by working with the FIRSTOBS and OBS set statement options.

  • The SET statement’s FIRSTOBS= option tells SAS to begin reading the data from the input SAS data set at the line number specified by FIRSTOBS.

  • The SET statement’s OBS= option tells SAS to stop reading the data from the input SAS data set at the line number specified by OBS.

We’ll start by using the OBS= option to create the data set that we’ll be working with throughout this lesson. The data set we’ll use contains demographic (or “background”) information collected on 638 subjects once enrolled in the National Institute of Health’s Interstitial Cystitis Data Base (ICDB) Study. Not surprisingly, the ICDB Study collected data on people who were diagnosed as having interstitial cystitis! The primary reason for conducting the study was that interstitial cystitis is a poorly understood condition that causes severe bladder and pelvic pain, urinary frequency, and painful urination in the absence of any identifiable cause. Although the disease is more prevalent in women, it affects both men and women of all ages. (If you want to learn more about the ICDB Study, I refer you to one of the National Institute of Health’s web sites in which a general description is given.

Given that we’ll use the ICDB Study’s background data, it would probably be helpful for you to take a peek at the background data form on which the data were collected. In order to run the SAS programs in this lesson, you’ll need to download and save the background data set, back.sas7bdat, in a folder on your computer (see the DATA in CANVAS).

Example

The DATA step in the following program uses the OBS= option to tell SAS to create a temporary data set called back by selecting the first 25 observations from the permanent background data set icdb.back:

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
 
DATA back;
    set PHC6089.back (obs=25);
RUN;
 
PROC PRINT data=back (obs = 10);
    title 'A Subset of the Background Data Set';
RUN;
SAS Output

A Subset of the Background Data Set

Obs subj v_type v_date r_id b_date sex state country race ethnic relig mar_st ed_level emp_st job_chng income
1 110027 0 10/05/93 2068 07/05/62 2 5 1 4 0 0 1 3 1 . 2
2 110029 0 10/05/93 2068 09/07/26 2 5 1 4 0 2 1 5 8 . 2
3 110039 0 12/07/93 2068 07/24/24 2 22 1 4 0 3 1 3 8 . 2
4 110040 0 11/30/93 2068 10/20/67 2 32 1 4 0 7 1 5 1 . 2
5 110045 0 01/11/94 2068 04/18/25 1 36 1 4 0 3 1 1 8 0 2
6 110049 0 01/25/94 2068 10/05/23 2 37 1 4 0 1 1 5 8 0 2
7 110051 0 01/25/94 2068 12/02/42 2 42 1 4 0 3 1 3 1 0 2
8 110052 0 01/27/94 1808 01/04/25 2 5 1 4 0 0 1 4 8 0 2
9 110053 0 02/22/94 1808 03/15/22 2 5 1 4 1 3 1 1 8 0 1
10 110055 0 03/15/94 1808 03/31/41 2 5 1 4 0 0 1 3 1 0 2

The program is pretty straightforward. The main thing to keep in mind is that you have to enclose the OBS= option in parentheses.

If you haven't already done so, download and save the background data set (see CANVAS) to a convenient location on your computer. Then, launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you've saved the data set. Then, run the SAS program, and review the output from the PRINT procedure to familiarize yourself with the data set.

Example

The following program uses the SET statement's FIRSTOBS= and OBS= options to tell SAS to include fourteen observations — observations 7, 8, 9, ..., and 20 — from the permanent icdb.back data set in the temporary back data set:

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA back1;
    set phc6089.back (FIRSTOBS=7 OBS=20);
RUN;
 
PROC PRINT data=back1;
    title 'Output Dataset: BACK1';
RUN;
SAS Output

Output Dataset: BACK1

Obs subj v_type v_date r_id b_date sex state country race ethnic relig mar_st ed_level emp_st job_chng income
1 110051 0 01/25/94 2068 12/02/42 2 42 1 4 0 3 1 3 1 0 2
2 110052 0 01/27/94 1808 01/04/25 2 5 1 4 0 0 1 4 8 0 2
3 110053 0 02/22/94 1808 03/15/22 2 5 1 4 1 3 1 1 8 0 1
4 110055 0 03/15/94 1808 03/31/41 2 5 1 4 0 0 1 3 1 0 2
5 110057 0 03/15/94 2068 07/10/44 2 5 1 4 0 3 1 4 2 0 2
6 110058 0 03/18/94 1808 09/09/50 2 . 13 4 1 0 1 3 1 0 1
7 110059 0 03/18/94 1808 07/25/34 2 13 1 4 0 1 1 3 8 0 1
8 110060 0 06/14/94 1808 05/29/36 2 13 1 4 0 3 1 3 1 0 2
9 110062 0 03/31/94 1808 04/21/36 2 3 1 4 0 1 1 4 5 0 2
10 110065 0 04/04/94 1808 10/12/52 2 5 1 4 0 3 1 4 4 1 1
11 110066 0 04/12/94 1808 08/28/62 2 5 1 4 0 0 1 4 6 0 2
12 110067 0 04/26/94 1808 02/22/72 2 5 1 4 0 1 6 4 2 0 2
13 110068 0 06/13/94 1808 09/10/55 2 25 1 4 0 11 1 3 7 0 2
14 110069 0 05/31/94 1808 08/17/38 2 32 1 4 0 0 1 3 1 0 2

Launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the background data set. Then, run the SAS program, and review the output from the PRINT procedure. Compare the output to the output of that from the previous example to convince yourself that the temporary data set back1 indeed contains fourteen observations — observations 7, 8, ..., 20 in the original background data set.

6.1.2. WHERE= Option, WHERE clause and Subsetting IF

The WHERE= option allows one to select only those observations from a SAS data set that meet a certain condition. The WHERE= option can be attached to the SET statement or the DATA statement. If the WHERE= option is attached to the SET statement, SAS selects the observations that meet the condition as it reads in the data. If the WHERE= option is attached to the DATA statement, SAS selects the observations as it writes the data from the program data vector to the output data set.

Example

The following program illustrates the use of the WHERE= option to select observations from a SAS data set that meet a certain condition. Because the WHERE= option is attached to the DATA statement, the selection process takes place as SAS writes the data from the program data vector to the output data set:

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA temple (where = (int(subj/10000)=23)) 
       okla (where = (int(subj/10000)=31));
    set phc6089.back;
RUN;
 
PROC PRINT data=temple;
    title 'Output Dataset: TEMPLE';
RUN;
 
PROC PRINT data=okla (obs = 10);
    title 'Output Dataset: OKLA';
RUN;
SAS Output

Output Dataset: TEMPLE

Obs subj v_type v_date r_id b_date sex state country race ethnic relig mar_st ed_level emp_st job_chng income
1 230003 0 07/09/93 5396 10/25/47 2 22 1 4 0 1 4 3 1 . 1
2 230004 0 01/04/94 5396 08/15/23 2 38 1 4 0 1 5 3 5 0 1
3 230005 0 01/06/94 5396 05/25/49 2 10 1 3 0 3 2 4 1 0 2
4 230006 0 01/06/94 5396 04/24/49 2 21 1 4 0 3 1 5 5 1 2
5 230008 0 10/03/96 5396 08/09/60 2 38 1 4 0 1 1 4 2 0 2
6 230009 0 10/31/96 5396 11/13/50 1 38 1 4 0 1 1 3 1 0 2

Output Dataset: OKLA

Obs subj v_type v_date r_id b_date sex state country race ethnic relig mar_st ed_level emp_st job_chng income
1 310020 0 06/18/93 1916 08/09/43 2 43 1 4 0 3 1 5 1 . 2
2 310032 0 08/03/93 1916 04/02/34 2 13 1 4 0 1 1 1 8 . 1
3 310037 0 11/08/93 1916 04/25/39 2 36 1 4 0 1 1 3 7 . 2
4 310041 0 10/27/93 1916 11/01/26 2 43 1 4 0 3 1 4 5 . 2
5 310049 0 02/04/94 1916 10/03/31 2 25 1 4 0 3 1 4 8 1 2
6 310055 0 07/12/94 1916 12/11/56 2 36 1 4 0 3 1 4 1 1 1
7 310056 0 04/01/94 1916 11/16/61 2 36 1 4 0 3 1 4 1 0 2
8 310059 0 06/13/94 1916 08/22/34 2 43 1 4 0 1 1 4 8 0 2
9 310065 0 06/30/94 1916 07/27/56 2 36 1 4 0 0 1 4 1 0 2
10 310069 0 09/23/94 1916 07/14/42 2 36 1 4 0 3 3 3 7 1 1

Well, it's not really that simple. This program also illustrates the use of the WHERE= option to divide a large data set up into smaller data sets, based on the values of a certain variable. As you can see, there are actually two data set names — temple and okla — appearing in the DATA statement. That tells SAS that we want to create two data sets within this one DATA step. One data set, temple, should contain only those subjects enrolled at Temple University, while the other data set, okla, should contain only those subjects enrolled at the University of Oklahoma. Accomplishing that requires a bit of understanding about the subject number used in the ICDB Study, as well as the functionality of the INT function:

  • The variable subj contains a six-digit subject number. The first two digits of the subject number indicate the location where the subject was enrolled. If the first two digits of the subject id number is 23, then the subject was enrolled at Temple University. If the first two digits of the subject id number is 31, the subject was enrolled at the University of Oklahoma.
  • You might recall that the INT function — that is, the integer function— returns the integer part of the expression contained within parentheses. So, if the subject number is 230007, then int(subj/10000) = int(23.0007) = 23, the location id for Temple University.

Pay particular attention to the syntax of the WHERE= option ... it can trip you up if you aren't careful. The entire WHERE= option must be contained in parentheses immediately following the data set to which you want the condition to apply. The condition itself is also placed in parentheses. So, in general, the syntax, when applied to a DATA statement, should look like this:

DATA dsname (WHERE = (condition));

where dsname is the data set name and condition is the condition that you want SAS to evaluate when selecting the observations.

Now, before running the program, change the directory in the LIBNAME statement so that it reflects the location in which you have saved the background data set. After you've edited the LIBNAME statement, run the SAS program. Review the output to convince yourself that the temporary data set temple contains the observations in the ICDB background data set that correspond to subjects enrolled at Temple University, while the temporary data set okla contains the observations that correspond to subjects enrolled at the University of Oklahoma.

A few more comments. When you use the WHERE= option, the condition applies only to the data set which immediately precedes it. Alternatively, you can use a WHERE statement, in which the condition applies to all of the input data sets. Also beware that the WHERE= option cannot be used at the same time as the FIRSTOBS= or OBS= options.

Example

The following program illustrates efficient use of the WHERE= option in the SET statement. Because the WHERE= option appears in the SET statement, the selection process takes places as SAS reads in the observations from the icdb.back data set:

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA temple2;
    set phc6089.back (where = (int(subj/10000)=23));
RUN;
 
PROC PRINT data = temple2;
    title 'Output Dataset: TEMPLE2';
RUN;
SAS Output

Output Dataset: TEMPLE2

Obs subj v_type v_date r_id b_date sex state country race ethnic relig mar_st ed_level emp_st job_chng income
1 230003 0 07/09/93 5396 10/25/47 2 22 1 4 0 1 4 3 1 . 1
2 230004 0 01/04/94 5396 08/15/23 2 38 1 4 0 1 5 3 5 0 1
3 230005 0 01/06/94 5396 05/25/49 2 10 1 3 0 3 2 4 1 0 2
4 230006 0 01/06/94 5396 04/24/49 2 21 1 4 0 3 1 5 5 1 2
5 230008 0 10/03/96 5396 08/09/60 2 38 1 4 0 1 1 4 2 0 2
6 230009 0 10/31/96 5396 11/13/50 1 38 1 4 0 1 1 3 1 0 2

Before running the program, change the directory in the LIBNAME statement so that it reflects the location in which you have saved the background data set. After you've edited the LIBNAME statement, run the SAS program. Review the output from the PRINT procedure. Convince yourself that the temple2 data set contains the observations in the ICDB background data set that correspond to subjects enrolled at Temple University.

Alternatively, if you want to exclude certain observations from all output datasets, you can use

  • WHERE clause or subsetting IF to include rows that meet a certain set of conditions

  • If-Then-Delete to exclude rows that meet a certain condition.

Example

The following program illustrates using the WHERE statement to only keep the records whose Subject ID starts with 23.

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA temple3;
    set phc6089.back;
    where (int(subj/10000)=23);
RUN;
 
PROC PRINT data = temple3;
    title 'Output Dataset: TEMPLE3';
RUN;
SAS Output

Output Dataset: TEMPLE3

Obs subj v_type v_date r_id b_date sex state country race ethnic relig mar_st ed_level emp_st job_chng income
1 230003 0 07/09/93 5396 10/25/47 2 22 1 4 0 1 4 3 1 . 1
2 230004 0 01/04/94 5396 08/15/23 2 38 1 4 0 1 5 3 5 0 1
3 230005 0 01/06/94 5396 05/25/49 2 10 1 3 0 3 2 4 1 0 2
4 230006 0 01/06/94 5396 04/24/49 2 21 1 4 0 3 1 5 5 1 2
5 230008 0 10/03/96 5396 08/09/60 2 38 1 4 0 1 1 4 2 0 2
6 230009 0 10/31/96 5396 11/13/50 1 38 1 4 0 1 1 3 1 0 2

Note that there is no = sign with the WHERE statement as opposed to using the WHERE= dataset option. Alternatively, we could have done the same thing using a subsetting If.

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA temple4;
    set phc6089.back;
    if (int(subj/10000)=23);
RUN;
 
PROC PRINT data = temple4;
    title 'Output Dataset: TEMPLE4';
RUN;
SAS Output

Output Dataset: TEMPLE4

Obs subj v_type v_date r_id b_date sex state country race ethnic relig mar_st ed_level emp_st job_chng income
1 230003 0 07/09/93 5396 10/25/47 2 22 1 4 0 1 4 3 1 . 1
2 230004 0 01/04/94 5396 08/15/23 2 38 1 4 0 1 5 3 5 0 1
3 230005 0 01/06/94 5396 05/25/49 2 10 1 3 0 3 2 4 1 0 2
4 230006 0 01/06/94 5396 04/24/49 2 21 1 4 0 3 1 5 5 1 2
5 230008 0 10/03/96 5396 08/09/60 2 38 1 4 0 1 1 4 2 0 2
6 230009 0 10/31/96 5396 11/13/50 1 38 1 4 0 1 1 3 1 0 2

If instead we want to exclude observations that meet a certain condition, then we would need to use an If-Then-Delete statement. The following program removes rows where the subject id does not start with 23.

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA temple5;
    set phc6089.back;
    if (int(subj/10000) ne 23) Then DELETE;
RUN;
 
PROC PRINT data = temple5;
    title 'Output Dataset: TEMPLE5';
RUN;
SAS Output

Output Dataset: TEMPLE5

Obs subj v_type v_date r_id b_date sex state country race ethnic relig mar_st ed_level emp_st job_chng income
1 230003 0 07/09/93 5396 10/25/47 2 22 1 4 0 1 4 3 1 . 1
2 230004 0 01/04/94 5396 08/15/23 2 38 1 4 0 1 5 3 5 0 1
3 230005 0 01/06/94 5396 05/25/49 2 10 1 3 0 3 2 4 1 0 2
4 230006 0 01/06/94 5396 04/24/49 2 21 1 4 0 3 1 5 5 1 2
5 230008 0 10/03/96 5396 08/09/60 2 38 1 4 0 1 1 4 2 0 2
6 230009 0 10/31/96 5396 11/13/50 1 38 1 4 0 1 1 3 1 0 2

6.2. Subsetting Columns of a SAS Dataset

There may be situations in which you’d like to reduce the number of variables in a data set in order to work with a data set of a much more manageable size. I like to think of doing such a thing as “making a data set thinner.” We can select certain variables from our dataset by either

  • Using a DROP statement or DROP= dataset option to select which variables to exclude.

  • Or by using a KEEP statement or KEEP=dataset option to select which variables in include.

The advantage is that, when you can get away with reading in fewer variables, SAS uses a smaller program data vector and therefore runs much more efficiently.

Which option/statement should you use? The choice of whether to use the DROP or the KEEP in a given situation is personal. Or rather, let’s say that most people will choose the option that requires the least amount of typing! I still try though, whenever possible, to use the KEEP, as then when I read my programs I know exactly which variables my data sets contain. Using DROP, on the other hand, requires one to think about what variables were in the data set before the dropping took place.

Example

The following program tells SAS to keep just three variables — subj, v_date, and b_date — when reading from the back1 data set in order to create the back2 data set:

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA back2;
    set phc6089.back (keep = subj v_date b_date);
    age = (v_date - b_date)/365;  * Calculate AGE in years;
    format age 4.1;
RUN;
 
PROC PRINT data=back2 (obs=5);
    title 'Output Dataset: BACK2';
RUN;
SAS Output

Output Dataset: BACK2

Obs subj v_date b_date age
1 110027 10/05/93 07/05/62 31.3
2 110029 10/05/93 09/07/26 67.1
3 110039 12/07/93 07/24/24 69.4
4 110040 11/30/93 10/20/67 26.1
5 110045 01/11/94 04/18/25 68.8

The SET statement's KEEP= option is used to tell SAS to read three variables — subject (subj), visit date (v_date), and birth date (b_date) — from the data set back and to store them in a new temporary data set back2. Note that the KEEP= option must be placed in parentheses and follow the name of the data set from you which you want SAS to select variables. The list of variables that you specify to keep must be separated by at least one blank space.

Then, the subject's age (age) at the time of the visit is calculated by subtracting the difference between visit date (v_date) and birth date (b_date) and dividing by 365 to get (an approximate) age in years.

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the temporary data set back2 contains the three kept variables (subj, v_date, and b_date), as well as the newly calculated variable (age).

Alternative, we could have used the keep statement. The benefit to using the SET option keep= in the example above, is that we have only read in the variables that we need. The following version reads in all the variables to the DATA step and then only keeps the selected varaibles to output to the dataset back3.

Note that now, we need to also put the new variable age in the keep statement. Before, the keep= option in the SET statement only applied to the input dataset in the SET statement, but the keep statement applies to all variables in the DATA step.

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA back3;
    set phc6089.back; 
    keep subj v_date b_date age;
    age = (v_date - b_date)/365;  * Calculate AGE in years;
    format age 4.1;
RUN;
 
PROC PRINT data=back3 (obs=5);
    title 'Output Dataset: BACK3';
RUN;
SAS Output

Output Dataset: BACK3

Obs subj v_date b_date age
1 110027 10/05/93 07/05/62 31.3
2 110029 10/05/93 09/07/26 67.1
3 110039 12/07/93 07/24/24 69.4
4 110040 11/30/93 10/20/67 26.1
5 110045 01/11/94 04/18/25 68.8

We can also use the keep= option as a DATA statement option.

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA back3a (keep= subj v_date b_date age);
    set phc6089.back; 
    age = (v_date - b_date)/365;  * Calculate AGE in years;
    format age 4.1;
RUN;
 
PROC PRINT data=back3a (obs=5);
    title 'Output Dataset: BACK3a';
RUN;
SAS Output

Output Dataset: BACK3a

Obs subj v_date b_date age
1 110027 10/05/93 07/05/62 31.3
2 110029 10/05/93 09/07/26 67.1
3 110039 12/07/93 07/24/24 69.4
4 110040 11/30/93 10/20/67 26.1
5 110045 01/11/94 04/18/25 68.8

Another way to acheive the same result is to use DROP to exclude all the variables that we don't want

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA back4;
    set phc6089.back (drop = v_type r_id sex state country 
                             race ethnic relig mar_st
                             ed_level emp_st job_chng income);
    age = (v_date - b_date)/365;  * Calculate AGE in years;
    format age 4.1;
RUN;
 
PROC PRINT data=back4 (obs = 5);
    title 'Output Dataset: BACK4';
RUN;
SAS Output

Output Dataset: BACK4

Obs subj v_date b_date age
1 110027 10/05/93 07/05/62 31.3
2 110029 10/05/93 09/07/26 67.1
3 110039 12/07/93 07/24/24 69.4
4 110040 11/30/93 10/20/67 26.1
5 110045 01/11/94 04/18/25 68.8

The SET statement's DROP= option tells SAS to drop thirteen variables — the visit type (v_type), the id number of the reviewer of the data form (r_id), as well as the subject's gender (sex), state (state) and country (country) of residence, race (race), ethnicity (ethnic), religion (relig), marital status (mar_st), education level (ed_level), employment status (emp_st), job change status (job_chng), and income (income) — from the data set back and to store the remaining variables in a new temporary data set back3. Therefore, the program data vector used to create the SAS data set back3 contains only three variables — subject id (subj), visit date (v_date), and birth date (b_date).

Note that, just like the KEEP= option, the DROP= option must be placed in parentheses and follow the name of the data set from you which you want SAS to exclude variables. The list of variables that you specify to drop must be separated by at least one blank space.

We could also have accomplished this by using a DROP statement. This however will result in SAS reading in all variables to the SAS data vector, and then only outputting the variables we decide not to drop to the output dataset.

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA back5;
    set phc6089.back;
    drop v_type r_id sex state country 
         race ethnic relig mar_st
         ed_level emp_st job_chng income;
    age = (v_date - b_date)/365;  * Calculate AGE in years;
    format age 4.1;
RUN;
 
PROC PRINT data=back5 (obs = 5);
    title 'Output Dataset: BACK5';
RUN;
SAS Output

Output Dataset: BACK5

Obs subj v_date b_date age
1 110027 10/05/93 07/05/62 31.3
2 110029 10/05/93 09/07/26 67.1
3 110039 12/07/93 07/24/24 69.4
4 110040 11/30/93 10/20/67 26.1
5 110045 01/11/94 04/18/25 68.8

Alternatively, we could have also used the DATA statement's drop= option.

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA back5a (drop = v_type r_id sex state country 
         race ethnic relig mar_st
         ed_level emp_st job_chng income);
    set phc6089.back;
    age = (v_date - b_date)/365;  * Calculate AGE in years;
    format age 4.1;
RUN;
 
PROC PRINT data=back5a (obs = 5);
    title 'Output Dataset: BACK5a';
RUN;
SAS Output

Output Dataset: BACK5a

Obs subj v_date b_date age
1 110027 10/05/93 07/05/62 31.3
2 110029 10/05/93 09/07/26 67.1
3 110039 12/07/93 07/24/24 69.4
4 110040 11/30/93 10/20/67 26.1
5 110045 01/11/94 04/18/25 68.8

Listing out all of these variables can be very tedious, so SAS provides a shortcut to create variable lists. We can specify a list of the form var1 -- var2 where var1 and var2 are two variable names in our SAS dataset. The -- tells SAS to "fill in" all variables that occur in the dataset inbetween these two columns.

If we look at the order of the columns in our dataset by using PROC CONTENTS with the VARNUM option, then we can see that we all the columns from sex to income are in order sequentially. This will allow us to shorten our DROP statement to

DROP v_type r_id sex -- income;
ods select Position;
PROC contents data = phc6089.back order=varnum;
RUN;
ods select ALL;
SAS Output

Output Dataset: BACK6

The CONTENTS Procedure

Variables in Creation Order
# Variable Type Len Format
1 subj Num 8  
2 v_type Num 8  
3 v_date Num 8 MMDDYY8.
4 r_id Num 8  
5 b_date Num 8 MMDDYY8.
6 sex Num 8  
7 state Num 8  
8 country Num 8  
9 race Num 8  
10 ethnic Num 8  
11 relig Num 8  
12 mar_st Num 8  
13 ed_level Num 8  
14 emp_st Num 8  
15 job_chng Num 8  
16 income Num 8  
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA back6;
    set phc6089.back;
    drop v_type r_id sex -- income;
    age = (v_date - b_date)/365;  * Calculate AGE in years;
    format age 4.1;
RUN;
 
PROC PRINT data=back6 (obs = 5);
    title 'Output Dataset: BACK6';
RUN;
SAS Output

Output Dataset: BACK6

Obs subj v_date b_date age
1 110027 10/05/93 07/05/62 31.3
2 110029 10/05/93 09/07/26 67.1
3 110039 12/07/93 07/24/24 69.4
4 110040 11/30/93 10/20/67 26.1
5 110045 01/11/94 04/18/25 68.8

What to Use When

After reviewing the previous four examples, your head might be spinning and you might be wondering how you would know which to use — the KEEP= option on the SET statement? the DROP= option on the SET statement? the KEEP= option on the DATA statement? or the DROP= option on the DATA statement? Here are the key facts that will help us devise a strategy:

  • As stated earlier, the choice between KEEP and DROP is a matter of personal choice. It’s whether you place the options on the SET statement, the DATA statement, or use the individual statement that makes a difference.

  • In the SET statement, the DROP= and KEEP= options determine which variables are read from the input SAS dataset. That is, when you specify the DROP= or KEEP= option in the SET statement, the excluded variables are never even read into the program data vector.

  • In the DATA statement, the DROP= and KEEP= options determine which variables are written from the program data vector to the resulting SAS data set. That is, when you specify the DROP= or KEEP= option in the DATA statement, all of the variables in the input data set are read in to the program data vector, but only the selected variables are written from the program data vector to the output data set. This also allows you to create multiple datasets in a single DATA step by keeping/dropping different sets of variables in each of the output dataset.

  • The individual DROP and KEEP statements work similarly to the DATA statement options except that they will apply to all datasets listed in the DATA statement.

Those facts stated, it becomes obvious that we can construct a more efficient DATA step by not reading unneeded variables from the input data set. That said, we might have a working strategy:

  • use the KEEP= option (or DROP= option) in the SET statement to tell SAS only those variables that you need from the input data set read into the program data vector in order to achieve the goals of your program

  • use the KEEP= option (or DROP= option) in the DATA statement or the individual statements to tell SAS all of the variables that you want written from the program data vector into the output data set. You must use the DATA statement options if you want to select different variables for different datasets listed in the DATA statement.

Let’s try our working strategy out on an example.

Example

The following SAS program illustrates our working strategy of when to use the KEEP= option in the SET statement and when to use the KEEP= option in the DATA statement:

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA back6 (keep = subj age);
    set phc6089.back (keep= subj v_date b_date);
    age = (v_date - b_date)/365;
    format age 4.1;
RUN;
 
PROC PRINT data=back6 (obs=5);
    title 'Output Dataset: BACK6';
RUN;
SAS Output

Output Dataset: BACK6

Obs subj age
1 110027 31.3
2 110029 67.1
3 110039 69.4
4 110040 26.1
5 110045 68.8

Here, the KEEP= option on the SET statement tells SAS to keep only three variables — subj, v_date, and b_date — from the input data set. The v_date and b_date variables are kept because we need them in order to calculate the age variable. The subject's age is calculated by subtracting b_date from v_date and dividing by 365. Therefore, the program data vector contains four variables — subj, v_date, b_date, and age —all, none or any of which can be written to the output back6 data set. The KEEP= option of the DATA statement tells SAS which variables to write from the program data vector to the output data set. Here, SAS is told to write only the subject's id (subj) and age (age) to back6.

Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the temporary data set back6 does indeed contain only thesubject's id (subj) and age (age).

6.3. Modifying Variable Attributes: RENAME= and LABEL options

Recall that the descriptor portion of a data set containing information about the attributes of each variable in the data set. The attribute information includes the variable’s name, type, length, format, informat, and label. In this section, we will learn how to

  • Use the LENGTH statement to adjust the default length of a variable.

  • Use the RENAME= option to change the name of a variable.

  • Use the LABEL statement to add a descriptive label to a variable.

We will cover formats in a later section.

6.3.1. The RENAME= Option

There may be occasions in which you want to change some of the variable names in your SAS data set. To do so, you’ll want to use the RENAME= option. As its name suggests, the RENAME= option allows you to change the variable names within a SAS data set.

The format of the RENAME= option is:

RENAME = (old1=new1 old2=new2 .... oldk=newk);

where old1, old2, … oldk are the variable names as they appear in the data set that precedes the RENAME= option, and new1, new2, …, newk are the corresponding new variable names.

The effect of the RENAME= option depends on where it appears:

  • If the RENAME= option appears in the SET statement, then the new variable name takes effect when the program data vector is created. Therefore, all programming statements within the DATA step must refer to the new variable name.

  • If the RENAME= option appears in the DATA statement, then the new variable name takes effect only when the data are written to the SAS data set. Therefore, all programming statements within the DATA step must refer to the old variable name.

Example

The following program illustrates the use of the RENAME= option in the SET statement. Specifically, the variable sex is changed to gender, and b_date is changed to birth, when the program data vector is created:

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA back7 (keep = subj gender v_date birth age);
    set phc6089.back (rename=(sex=gender b_date=birth));
    age = (v_date - birth)/365;   *MUST use NEW name for date of birth;
RUN;
 
PROC PRINT data=back7 (obs=5);
    title 'Output Dataset: BACK7';
RUN;
SAS Output

Output Dataset: BACK7

Obs subj v_date birth gender age
1 110027 10/05/93 07/05/62 2 31.2740
2 110029 10/05/93 09/07/26 2 67.1233
3 110039 12/07/93 07/24/24 2 69.4192
4 110040 11/30/93 10/20/67 2 26.1315
5 110045 01/11/94 04/18/25 1 68.7808

Because the RENAME= option appears in the SET statement, SAS no longer recognizes the variable name sex as the gender, nor b_date as the birth date, of the subject. Instead, SAS recognizes the variable names gender and birth. Hence, when we subsequently calculate the subjects' ages (age) in the DATA step, we must refer to the new variable name birth.

Again, pay particular attention to the syntax of the RENAME= option ... it too can be tricky. The entire RENAME= option must be contained in parentheses immediately following the data set to which you want the name changes to apply. The variable names must also be placed in parentheses. So, in general, the syntax, when applied to a DATA statement, should look like this:

DATA dsname (RENAME = (o1=n1 o2=n2 ...));

where dsname is the data set name and o1 and o2 are the old variable names, and n1 and n2 are the new variable names.

Launch and run the SAS program. Review the output from the PRINT procedure. Convince yourself that the variable names sex and b_date have been changed as advertised to gender and birth, respectively. Also, verify that the ages of the subjects have been calculated appropriately. Then, in the SAS program, change the variable name birth back to the variable name b_date, and re-run the program. Does SAS indeed hiccup?

Example

The following program illustrates use of the RENAME= option, when it appears in the DATA statement. Specifically, the variable sex is changed to gender, and b_date is changed to birth when SAS writes the data to the output data set:

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA back8 (rename=(sex=gender b_date=birth)
            keep = subj sex v_date b_date age);
    set phc6089.back;
    age = (v_date - b_date)/365; *MUST use OLD name 
                                  for date of birth;
RUN;
 
PROC PRINT data=back8 (obs=5);
    title 'Output Dataset: BACK8';
RUN;
SAS Output

Output Dataset: BACK8

Obs subj v_date birth gender age
1 110027 10/05/93 07/05/62 2 31.2740
2 110029 10/05/93 09/07/26 2 67.1233
3 110039 12/07/93 07/24/24 2 69.4192
4 110040 11/30/93 10/20/67 2 26.1315
5 110045 01/11/94 04/18/25 1 68.7808

Because the RENAME= option appears in the DATA statement, SAS only recognizes the variable names as they appear in the input data set back3. That is, for example, SAS recognizes the variable name b_date as the birth date of the subjects. Hence, when we subsequently calculate the subjects' ages in the DATA step, we must refer to the old variable name b_date. Also note that the KEEP= option in the DATA statement must refer to the original variable names as they appear in the back3 data set.

This program also illustrates how to use more than one DATA step option at a time. Specifically, the RENAME= and KEEP= options are used to modify the back8 data set. As such, both options are placed within one set of parentheses immediately following the data set to which you want the changes to apply. Then, within those parentheses, the basic syntax for each option is followed.

Launch and run the SAS program, and review the output from the PRINT procedure. Convince yourself that the variable names sex and b_date have been changed as advertised to gender and birth, respectively. Also, verify that the ages of the subjects have been calculated appropriately.

6.3.2. The LABEL Statement

The LABEL statement associates a descriptive label to a variable. This label is used in place of the variable name when printing such as when creating plots, the label is used as the axis label.

Example

The following program uses a LABEL statement to associate descriptive labels to the variables subj, sex, v_date, b_date, and age.

LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';

DATA back8a (rename=(sex=gender b_date=birth)
            keep = subj sex v_date b_date age);
    set phc6089.back;
    age = (v_date - b_date)/365; *MUST use OLD name 
                                  for date of birth;
    LABEL subj = "Subject ID"
          sex = "Gender"
          v_date = "Visit Date"
          b_date = "Birth Date"
          age = "Age in Years";
RUN;

title ;
ODS SELECT Position;
PROC contents data = back8a order = varnum;
RUN;
ODS SELECT ALL;
SAS Output

The CONTENTS Procedure

Variables in Creation Order
# Variable Type Len Format Label
1 subj Num 8   Subject ID
2 v_date Num 8 MMDDYY8. Visit Date
3 birth Num 8 MMDDYY8. Birth Date
4 gender Num 8   Gender
5 age Num 8   Age in Years

We can see that the labels were added by viewing the descriptor portion of the dataset via PROC CONTENTS. A label can be as long as 256 characters.

6.3.3. The LENGTH Statement

We learned earlier that by default, character variables are 8 characters long. To get around this before, we used a character informat or column input to read character strings that were longer than 8 characters. Another way is the use the LENGTH statement to set the variable length attribute to be longer than 8 characters.

Example

The following simple program illustrates creating a character value with the default length versus setting the length to be longer by using the LENGTH statement.

DATA char_length;
    LENGTH char_var2 $20;
    INFILE datalines delimiter = ",";
    INPUT char_var1 $ char_var2 $;
    DATALINES;
A longer character string, A longer character string
shortChar, shortChar
;
RUN;

PROC print data = char_length;
    title "Character Length Data Example";
RUN;
SAS Output

Character Length Data Example

Obs char_var2 char_var1
1 A longer character s A longer
2 shortChar shortCha

The LENGTH statement sets the length of the variable char_var2 to be up to 20 characters, while char_var1 gets the default length of 8 characters. When reading the same string value, char_var1 can only store up to the first 8 characters and discards the rest, while char_var2 can store up to 20. This results in the strings being only partially read in for char_var1.

6.4. Sorting the Rows of a Dataset: PROC SORT

By default, the PRINT procedure displays observations in the order in which they appear in your data set. Alternatively, you can use the SORT procedure to first sort your data set based on the values of one or more variables. Then, when you use the PRINT procedure, SAS will display the observations in the order in which you sorted the data.

Example

The following SAS program uses the SORT procedure to sort the temporary dataset back8 that we created earlier by gender and then by age. Note that it is sorted by gender first. Then, if there are any ties in gender, these are sorted by age.

PROC SORT data = back8 out = srtd_back8;
    by gender age;
RUN;

title "First 5 sorted observations for gender  = 1";
PROC PRINT data = srtd_back8 (obs = 5);
    var gender age subj v_date birth;
RUN;

title "First 5 sorted observations for gender  = 2";
PROC PRINT data = srtd_back8 (firstobs = 57 obs = 62);
    var gender age subj v_date birth;
RUN;
SAS Output

First 5 sorted observations for gender = 1

Obs gender age subj v_date birth
1 1 -18.3014 110125 02/06/96 05/22/14
2 1 -17.3699 110152 07/19/96 11/27/13
3 1 23.4986 410073 06/13/95 12/19/71
4 1 23.6822 520051 09/26/94 01/26/71
5 1 24.8630 410071 05/30/95 07/25/70

First 5 sorted observations for gender = 2

Obs gender age subj v_date birth
57 2 -23.1534 410097 04/24/96 06/14/19
58 2 -22.6795 420074 09/07/95 05/07/18
59 2 -22.6466 520007 08/13/93 03/31/16
60 2 -22.2438 310102 02/12/96 05/06/18
61 2 -22.1781 420112 12/17/96 02/15/19
62 2 -20.3041 510010 09/21/93 01/05/14

Launch the SAS program, but print the whole dataset by removing the FIRSTOBS and OBS dataset options to verify that the datset is sorted first by gender and then by age.

Then, note that while the SORT procedure's BY statement is required, it's OUT= option is optional. If you don't use it, however, then the SORT procedure permanently sorts the data set that is specified in the DATA= option. Therefore, if you need your data to be sorted just to produce output temporarily, then you should use the OUT= option in conjunction with a temporary SAS data set name.

By default, SAS sorts the values of the variables appearing in the BY statement in ascending order. If you want them sorted in descending order, you need to use the BY statement's DESCENDING option.

Example

The following SAS program uses the BY statement's DESCENDING option to tell SAS to sort the temporary dataset back8 first by gender in descending order, and then by age in ascending order.

PROC SORT data = back8 out = srtd_back8;
    by descending gender age;
RUN;

title "First 5 sorted observations for gender  = 1";
PROC PRINT data = srtd_back8 (obs = 5);
    var gender age subj v_date birth;
RUN;

title "First 5 sorted observations for gender  = 2";
PROC PRINT data = srtd_back8 (firstobs = 57 obs = 62);
    var gender age subj v_date birth;
RUN;
title ;
SAS Output

First 5 sorted observations for gender = 1

Obs gender age subj v_date birth
1 2 -23.1534 410097 04/24/96 06/14/19
2 2 -22.6795 420074 09/07/95 05/07/18
3 2 -22.6466 520007 08/13/93 03/31/16
4 2 -22.2438 310102 02/12/96 05/06/18
5 2 -22.1781 420112 12/17/96 02/15/19

First 5 sorted observations for gender = 2

Obs gender age subj v_date birth
57 2 26.5479 410006 09/09/93 02/28/67
58 2 26.5863 510051 01/20/95 06/26/68
59 2 26.6110 410091 02/14/96 07/12/69
60 2 26.7452 210100 05/06/96 08/14/69
61 2 26.7507 210006 05/13/93 08/19/66
62 2 26.8438 220036 08/08/95 10/10/68

Launch the SAS program, but print the whole dataset by removing the FIRSTOBS and OBS dataset options to see that the dataset is sorted in the desired order. Note that if your BY statement contains more than one variable, then the DESCENDING option applies only to the variable that immediately follows it. You might want to sandwich another DESCENDING gender and age in the BY statement and then rerun the program to see the effect.

6.5. Exercises

  1. Create a temporary SAS dataset called heart from the sashelp.heart dataset. The library sashelp if a permanent library that comes with SAS.

  2. View the description of the heart dataset by running PROC CONTENTS. How many observations (rows) and variables (columns) are in the dataset?

  3. Use the RENAME= option to rename Status to condition. Show the variables table from PROC CONTENTS to see that the name has changed.

  4. Create a subset of heart called heart_sub that contains only the columns height, weight, and diastolic. What are the dimensions of heart_sub? Print the first five rows of heart_sub.

  5. Subset the patients (rows) of heart that have a diastolic blood pressure (diastolic) more than 90 into a dataset called heart_sub2. How many are there? Print the first five rows of heart_sub2.

  6. Subset the patients (rows) of heart that have a diastolic blood pressure less than or equal to 90 and have a weight more than 180 pounds into a dataset called heart_sub3. How many are there? Print the first five rows of heart_sub3.

  7. Create a subset of heart that only contains the columns weight, systolic, and cholesterol for those that are heavy smokers smoker (Smoking_Status = “Heavy (16-25)”). Save this to heart_sub. What are the dimensions of the dataset? Print the first five rows of heart_sub.

  8. Reorder the rows of heart_sub in decreasing order by cholesterol. Print the first 5 rows of the sorted dataset.