6. Working with Your Data¶
Now that we know how to read data into SAS and work with SAS variables in the DATA step, we want to learn the basics of working with our dataset. In this section, we will see how to
Subset rows of a SAS dataset using WHERE, IF, OBS, and FIRSTOBS.
Subset columns of a SAS dataset by using DROP or KEEP with SAS variable lists.
Add a column/new variable to a SAS dataset.
Modify attributes of a variable by renaming, adding a label to a SAS variable, and changing the variable’s length.
Order the rows of a SAS dataset using PROC SORT.
6.1. Subsetting Rows of a SAS Dataset¶
Working with subsets created from an existing SAS data set can make more efficient use of computer resources than working with the original, larger data set. Reading fewer observations means that fewer iterations of the DATA step must occur.
There are a few ways to subset rows in SAS
The FIRSTOBS= and OBS= set statement options.
A subsetting IF or WHERE statement.
6.1.1. The FIRSTOBS= and OBS= Options¶
Let’s begin by working with the FIRSTOBS and OBS set statement options.
The SET statement’s FIRSTOBS= option tells SAS to begin reading the data from the input SAS data set at the line number specified by FIRSTOBS.
The SET statement’s OBS= option tells SAS to stop reading the data from the input SAS data set at the line number specified by OBS.
We’ll start by using the OBS= option to create the data set that we’ll be working with throughout this lesson. The data set we’ll use contains demographic (or “background”) information collected on 638 subjects once enrolled in the National Institute of Health’s Interstitial Cystitis Data Base (ICDB) Study. Not surprisingly, the ICDB Study collected data on people who were diagnosed as having interstitial cystitis! The primary reason for conducting the study was that interstitial cystitis is a poorly understood condition that causes severe bladder and pelvic pain, urinary frequency, and painful urination in the absence of any identifiable cause. Although the disease is more prevalent in women, it affects both men and women of all ages. (If you want to learn more about the ICDB Study, I refer you to one of the National Institute of Health’s web sites in which a general description is given.
Given that we’ll use the ICDB Study’s background data, it would probably be helpful for you to take a peek at the background data form on which the data were collected. In order to run the SAS programs in this lesson, you’ll need to download and save the background data set, back.sas7bdat, in a folder on your computer (see the DATA in CANVAS).
Example
The DATA step in the following program uses the OBS= option to tell SAS to create a temporary data set called back by selecting the first 25 observations from the permanent background data set icdb.back:
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back;
set PHC6089.back (obs=25);
RUN;
PROC PRINT data=back (obs = 10);
title 'A Subset of the Background Data Set';
RUN;
Obs | subj | v_type | v_date | r_id | b_date | sex | state | country | race | ethnic | relig | mar_st | ed_level | emp_st | job_chng | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 110027 | 0 | 10/05/93 | 2068 | 07/05/62 | 2 | 5 | 1 | 4 | 0 | 0 | 1 | 3 | 1 | . | 2 |
2 | 110029 | 0 | 10/05/93 | 2068 | 09/07/26 | 2 | 5 | 1 | 4 | 0 | 2 | 1 | 5 | 8 | . | 2 |
3 | 110039 | 0 | 12/07/93 | 2068 | 07/24/24 | 2 | 22 | 1 | 4 | 0 | 3 | 1 | 3 | 8 | . | 2 |
4 | 110040 | 0 | 11/30/93 | 2068 | 10/20/67 | 2 | 32 | 1 | 4 | 0 | 7 | 1 | 5 | 1 | . | 2 |
5 | 110045 | 0 | 01/11/94 | 2068 | 04/18/25 | 1 | 36 | 1 | 4 | 0 | 3 | 1 | 1 | 8 | 0 | 2 |
6 | 110049 | 0 | 01/25/94 | 2068 | 10/05/23 | 2 | 37 | 1 | 4 | 0 | 1 | 1 | 5 | 8 | 0 | 2 |
7 | 110051 | 0 | 01/25/94 | 2068 | 12/02/42 | 2 | 42 | 1 | 4 | 0 | 3 | 1 | 3 | 1 | 0 | 2 |
8 | 110052 | 0 | 01/27/94 | 1808 | 01/04/25 | 2 | 5 | 1 | 4 | 0 | 0 | 1 | 4 | 8 | 0 | 2 |
9 | 110053 | 0 | 02/22/94 | 1808 | 03/15/22 | 2 | 5 | 1 | 4 | 1 | 3 | 1 | 1 | 8 | 0 | 1 |
10 | 110055 | 0 | 03/15/94 | 1808 | 03/31/41 | 2 | 5 | 1 | 4 | 0 | 0 | 1 | 3 | 1 | 0 | 2 |
The program is pretty straightforward. The main thing to keep in mind is that you have to enclose the OBS= option in parentheses.
If you haven't already done so, download and save the background data set (see CANVAS) to a convenient location on your computer. Then, launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you've saved the data set. Then, run the SAS program, and review the output from the PRINT procedure to familiarize yourself with the data set.
Example
The following program uses the SET statement's FIRSTOBS= and OBS= options to tell SAS to include fourteen observations — observations 7, 8, 9, ..., and 20 — from the permanent icdb.back data set in the temporary back data set:
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back1;
set phc6089.back (FIRSTOBS=7 OBS=20);
RUN;
PROC PRINT data=back1;
title 'Output Dataset: BACK1';
RUN;
Obs | subj | v_type | v_date | r_id | b_date | sex | state | country | race | ethnic | relig | mar_st | ed_level | emp_st | job_chng | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 110051 | 0 | 01/25/94 | 2068 | 12/02/42 | 2 | 42 | 1 | 4 | 0 | 3 | 1 | 3 | 1 | 0 | 2 |
2 | 110052 | 0 | 01/27/94 | 1808 | 01/04/25 | 2 | 5 | 1 | 4 | 0 | 0 | 1 | 4 | 8 | 0 | 2 |
3 | 110053 | 0 | 02/22/94 | 1808 | 03/15/22 | 2 | 5 | 1 | 4 | 1 | 3 | 1 | 1 | 8 | 0 | 1 |
4 | 110055 | 0 | 03/15/94 | 1808 | 03/31/41 | 2 | 5 | 1 | 4 | 0 | 0 | 1 | 3 | 1 | 0 | 2 |
5 | 110057 | 0 | 03/15/94 | 2068 | 07/10/44 | 2 | 5 | 1 | 4 | 0 | 3 | 1 | 4 | 2 | 0 | 2 |
6 | 110058 | 0 | 03/18/94 | 1808 | 09/09/50 | 2 | . | 13 | 4 | 1 | 0 | 1 | 3 | 1 | 0 | 1 |
7 | 110059 | 0 | 03/18/94 | 1808 | 07/25/34 | 2 | 13 | 1 | 4 | 0 | 1 | 1 | 3 | 8 | 0 | 1 |
8 | 110060 | 0 | 06/14/94 | 1808 | 05/29/36 | 2 | 13 | 1 | 4 | 0 | 3 | 1 | 3 | 1 | 0 | 2 |
9 | 110062 | 0 | 03/31/94 | 1808 | 04/21/36 | 2 | 3 | 1 | 4 | 0 | 1 | 1 | 4 | 5 | 0 | 2 |
10 | 110065 | 0 | 04/04/94 | 1808 | 10/12/52 | 2 | 5 | 1 | 4 | 0 | 3 | 1 | 4 | 4 | 1 | 1 |
11 | 110066 | 0 | 04/12/94 | 1808 | 08/28/62 | 2 | 5 | 1 | 4 | 0 | 0 | 1 | 4 | 6 | 0 | 2 |
12 | 110067 | 0 | 04/26/94 | 1808 | 02/22/72 | 2 | 5 | 1 | 4 | 0 | 1 | 6 | 4 | 2 | 0 | 2 |
13 | 110068 | 0 | 06/13/94 | 1808 | 09/10/55 | 2 | 25 | 1 | 4 | 0 | 11 | 1 | 3 | 7 | 0 | 2 |
14 | 110069 | 0 | 05/31/94 | 1808 | 08/17/38 | 2 | 32 | 1 | 4 | 0 | 0 | 1 | 3 | 1 | 0 | 2 |
Launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the background data set. Then, run the SAS program, and review the output from the PRINT procedure. Compare the output to the output of that from the previous example to convince yourself that the temporary data set back1 indeed contains fourteen observations — observations 7, 8, ..., 20 in the original background data set.
6.1.2. WHERE= Option, WHERE clause and Subsetting IF¶
The WHERE= option allows one to select only those observations from a SAS data set that meet a certain condition. The WHERE= option can be attached to the SET statement or the DATA statement. If the WHERE= option is attached to the SET statement, SAS selects the observations that meet the condition as it reads in the data. If the WHERE= option is attached to the DATA statement, SAS selects the observations as it writes the data from the program data vector to the output data set.
Example
The following program illustrates the use of the WHERE= option to select observations from a SAS data set that meet a certain condition. Because the WHERE= option is attached to the DATA statement, the selection process takes place as SAS writes the data from the program data vector to the output data set:
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA temple (where = (int(subj/10000)=23))
okla (where = (int(subj/10000)=31));
set phc6089.back;
RUN;
PROC PRINT data=temple;
title 'Output Dataset: TEMPLE';
RUN;
PROC PRINT data=okla (obs = 10);
title 'Output Dataset: OKLA';
RUN;
Obs | subj | v_type | v_date | r_id | b_date | sex | state | country | race | ethnic | relig | mar_st | ed_level | emp_st | job_chng | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 230003 | 0 | 07/09/93 | 5396 | 10/25/47 | 2 | 22 | 1 | 4 | 0 | 1 | 4 | 3 | 1 | . | 1 |
2 | 230004 | 0 | 01/04/94 | 5396 | 08/15/23 | 2 | 38 | 1 | 4 | 0 | 1 | 5 | 3 | 5 | 0 | 1 |
3 | 230005 | 0 | 01/06/94 | 5396 | 05/25/49 | 2 | 10 | 1 | 3 | 0 | 3 | 2 | 4 | 1 | 0 | 2 |
4 | 230006 | 0 | 01/06/94 | 5396 | 04/24/49 | 2 | 21 | 1 | 4 | 0 | 3 | 1 | 5 | 5 | 1 | 2 |
5 | 230008 | 0 | 10/03/96 | 5396 | 08/09/60 | 2 | 38 | 1 | 4 | 0 | 1 | 1 | 4 | 2 | 0 | 2 |
6 | 230009 | 0 | 10/31/96 | 5396 | 11/13/50 | 1 | 38 | 1 | 4 | 0 | 1 | 1 | 3 | 1 | 0 | 2 |
Obs | subj | v_type | v_date | r_id | b_date | sex | state | country | race | ethnic | relig | mar_st | ed_level | emp_st | job_chng | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 310020 | 0 | 06/18/93 | 1916 | 08/09/43 | 2 | 43 | 1 | 4 | 0 | 3 | 1 | 5 | 1 | . | 2 |
2 | 310032 | 0 | 08/03/93 | 1916 | 04/02/34 | 2 | 13 | 1 | 4 | 0 | 1 | 1 | 1 | 8 | . | 1 |
3 | 310037 | 0 | 11/08/93 | 1916 | 04/25/39 | 2 | 36 | 1 | 4 | 0 | 1 | 1 | 3 | 7 | . | 2 |
4 | 310041 | 0 | 10/27/93 | 1916 | 11/01/26 | 2 | 43 | 1 | 4 | 0 | 3 | 1 | 4 | 5 | . | 2 |
5 | 310049 | 0 | 02/04/94 | 1916 | 10/03/31 | 2 | 25 | 1 | 4 | 0 | 3 | 1 | 4 | 8 | 1 | 2 |
6 | 310055 | 0 | 07/12/94 | 1916 | 12/11/56 | 2 | 36 | 1 | 4 | 0 | 3 | 1 | 4 | 1 | 1 | 1 |
7 | 310056 | 0 | 04/01/94 | 1916 | 11/16/61 | 2 | 36 | 1 | 4 | 0 | 3 | 1 | 4 | 1 | 0 | 2 |
8 | 310059 | 0 | 06/13/94 | 1916 | 08/22/34 | 2 | 43 | 1 | 4 | 0 | 1 | 1 | 4 | 8 | 0 | 2 |
9 | 310065 | 0 | 06/30/94 | 1916 | 07/27/56 | 2 | 36 | 1 | 4 | 0 | 0 | 1 | 4 | 1 | 0 | 2 |
10 | 310069 | 0 | 09/23/94 | 1916 | 07/14/42 | 2 | 36 | 1 | 4 | 0 | 3 | 3 | 3 | 7 | 1 | 1 |
Well, it's not really that simple. This program also illustrates the use of the WHERE= option to divide a large data set up into smaller data sets, based on the values of a certain variable. As you can see, there are actually two data set names — temple and okla — appearing in the DATA statement. That tells SAS that we want to create two data sets within this one DATA step. One data set, temple, should contain only those subjects enrolled at Temple University, while the other data set, okla, should contain only those subjects enrolled at the University of Oklahoma. Accomplishing that requires a bit of understanding about the subject number used in the ICDB Study, as well as the functionality of the INT function:
- The variable subj contains a six-digit subject number. The first two digits of the subject number indicate the location where the subject was enrolled. If the first two digits of the subject id number is 23, then the subject was enrolled at Temple University. If the first two digits of the subject id number is 31, the subject was enrolled at the University of Oklahoma.
- You might recall that the INT function — that is, the integer function— returns the integer part of the expression contained within parentheses. So, if the subject number is 230007, then int(subj/10000) = int(23.0007) = 23, the location id for Temple University.
Pay particular attention to the syntax of the WHERE= option ... it can trip you up if you aren't careful. The entire WHERE= option must be contained in parentheses immediately following the data set to which you want the condition to apply. The condition itself is also placed in parentheses. So, in general, the syntax, when applied to a DATA statement, should look like this:
DATA dsname (WHERE = (condition));
where dsname
is the data set name and condition
is the condition that you want SAS to evaluate when selecting the observations.
Now, before running the program, change the directory in the LIBNAME statement so that it reflects the location in which you have saved the background data set. After you've edited the LIBNAME statement, run the SAS program. Review the output to convince yourself that the temporary data set temple contains the observations in the ICDB background data set that correspond to subjects enrolled at Temple University, while the temporary data set okla contains the observations that correspond to subjects enrolled at the University of Oklahoma.
A few more comments. When you use the WHERE= option, the condition applies only to the data set which immediately precedes it. Alternatively, you can use a WHERE statement, in which the condition applies to all of the input data sets. Also beware that the WHERE= option cannot be used at the same time as the FIRSTOBS= or OBS= options.
Example
The following program illustrates efficient use of the WHERE= option in the SET statement. Because the WHERE= option appears in the SET statement, the selection process takes places as SAS reads in the observations from the icdb.back data set:
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA temple2;
set phc6089.back (where = (int(subj/10000)=23));
RUN;
PROC PRINT data = temple2;
title 'Output Dataset: TEMPLE2';
RUN;
Obs | subj | v_type | v_date | r_id | b_date | sex | state | country | race | ethnic | relig | mar_st | ed_level | emp_st | job_chng | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 230003 | 0 | 07/09/93 | 5396 | 10/25/47 | 2 | 22 | 1 | 4 | 0 | 1 | 4 | 3 | 1 | . | 1 |
2 | 230004 | 0 | 01/04/94 | 5396 | 08/15/23 | 2 | 38 | 1 | 4 | 0 | 1 | 5 | 3 | 5 | 0 | 1 |
3 | 230005 | 0 | 01/06/94 | 5396 | 05/25/49 | 2 | 10 | 1 | 3 | 0 | 3 | 2 | 4 | 1 | 0 | 2 |
4 | 230006 | 0 | 01/06/94 | 5396 | 04/24/49 | 2 | 21 | 1 | 4 | 0 | 3 | 1 | 5 | 5 | 1 | 2 |
5 | 230008 | 0 | 10/03/96 | 5396 | 08/09/60 | 2 | 38 | 1 | 4 | 0 | 1 | 1 | 4 | 2 | 0 | 2 |
6 | 230009 | 0 | 10/31/96 | 5396 | 11/13/50 | 1 | 38 | 1 | 4 | 0 | 1 | 1 | 3 | 1 | 0 | 2 |
Before running the program, change the directory in the LIBNAME statement so that it reflects the location in which you have saved the background data set. After you've edited the LIBNAME statement, run the SAS program. Review the output from the PRINT procedure. Convince yourself that the temple2 data set contains the observations in the ICDB background data set that correspond to subjects enrolled at Temple University.
Alternatively, if you want to exclude certain observations from all output datasets, you can use
WHERE clause or subsetting IF to include rows that meet a certain set of conditions
If-Then-Delete to exclude rows that meet a certain condition.
Example
The following program illustrates using the WHERE statement to only keep the records whose Subject ID starts with 23.
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA temple3;
set phc6089.back;
where (int(subj/10000)=23);
RUN;
PROC PRINT data = temple3;
title 'Output Dataset: TEMPLE3';
RUN;
Obs | subj | v_type | v_date | r_id | b_date | sex | state | country | race | ethnic | relig | mar_st | ed_level | emp_st | job_chng | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 230003 | 0 | 07/09/93 | 5396 | 10/25/47 | 2 | 22 | 1 | 4 | 0 | 1 | 4 | 3 | 1 | . | 1 |
2 | 230004 | 0 | 01/04/94 | 5396 | 08/15/23 | 2 | 38 | 1 | 4 | 0 | 1 | 5 | 3 | 5 | 0 | 1 |
3 | 230005 | 0 | 01/06/94 | 5396 | 05/25/49 | 2 | 10 | 1 | 3 | 0 | 3 | 2 | 4 | 1 | 0 | 2 |
4 | 230006 | 0 | 01/06/94 | 5396 | 04/24/49 | 2 | 21 | 1 | 4 | 0 | 3 | 1 | 5 | 5 | 1 | 2 |
5 | 230008 | 0 | 10/03/96 | 5396 | 08/09/60 | 2 | 38 | 1 | 4 | 0 | 1 | 1 | 4 | 2 | 0 | 2 |
6 | 230009 | 0 | 10/31/96 | 5396 | 11/13/50 | 1 | 38 | 1 | 4 | 0 | 1 | 1 | 3 | 1 | 0 | 2 |
Note that there is no = sign with the WHERE statement as opposed to using the WHERE= dataset option. Alternatively, we could have done the same thing using a subsetting If.
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA temple4;
set phc6089.back;
if (int(subj/10000)=23);
RUN;
PROC PRINT data = temple4;
title 'Output Dataset: TEMPLE4';
RUN;
Obs | subj | v_type | v_date | r_id | b_date | sex | state | country | race | ethnic | relig | mar_st | ed_level | emp_st | job_chng | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 230003 | 0 | 07/09/93 | 5396 | 10/25/47 | 2 | 22 | 1 | 4 | 0 | 1 | 4 | 3 | 1 | . | 1 |
2 | 230004 | 0 | 01/04/94 | 5396 | 08/15/23 | 2 | 38 | 1 | 4 | 0 | 1 | 5 | 3 | 5 | 0 | 1 |
3 | 230005 | 0 | 01/06/94 | 5396 | 05/25/49 | 2 | 10 | 1 | 3 | 0 | 3 | 2 | 4 | 1 | 0 | 2 |
4 | 230006 | 0 | 01/06/94 | 5396 | 04/24/49 | 2 | 21 | 1 | 4 | 0 | 3 | 1 | 5 | 5 | 1 | 2 |
5 | 230008 | 0 | 10/03/96 | 5396 | 08/09/60 | 2 | 38 | 1 | 4 | 0 | 1 | 1 | 4 | 2 | 0 | 2 |
6 | 230009 | 0 | 10/31/96 | 5396 | 11/13/50 | 1 | 38 | 1 | 4 | 0 | 1 | 1 | 3 | 1 | 0 | 2 |
If instead we want to exclude observations that meet a certain condition, then we would need to use an If-Then-Delete statement. The following program removes rows where the subject id does not start with 23.
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA temple5;
set phc6089.back;
if (int(subj/10000) ne 23) Then DELETE;
RUN;
PROC PRINT data = temple5;
title 'Output Dataset: TEMPLE5';
RUN;
Obs | subj | v_type | v_date | r_id | b_date | sex | state | country | race | ethnic | relig | mar_st | ed_level | emp_st | job_chng | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 230003 | 0 | 07/09/93 | 5396 | 10/25/47 | 2 | 22 | 1 | 4 | 0 | 1 | 4 | 3 | 1 | . | 1 |
2 | 230004 | 0 | 01/04/94 | 5396 | 08/15/23 | 2 | 38 | 1 | 4 | 0 | 1 | 5 | 3 | 5 | 0 | 1 |
3 | 230005 | 0 | 01/06/94 | 5396 | 05/25/49 | 2 | 10 | 1 | 3 | 0 | 3 | 2 | 4 | 1 | 0 | 2 |
4 | 230006 | 0 | 01/06/94 | 5396 | 04/24/49 | 2 | 21 | 1 | 4 | 0 | 3 | 1 | 5 | 5 | 1 | 2 |
5 | 230008 | 0 | 10/03/96 | 5396 | 08/09/60 | 2 | 38 | 1 | 4 | 0 | 1 | 1 | 4 | 2 | 0 | 2 |
6 | 230009 | 0 | 10/31/96 | 5396 | 11/13/50 | 1 | 38 | 1 | 4 | 0 | 1 | 1 | 3 | 1 | 0 | 2 |
6.2. Subsetting Columns of a SAS Dataset¶
There may be situations in which you’d like to reduce the number of variables in a data set in order to work with a data set of a much more manageable size. I like to think of doing such a thing as “making a data set thinner.” We can select certain variables from our dataset by either
Using a DROP statement or DROP= dataset option to select which variables to exclude.
Or by using a KEEP statement or KEEP=dataset option to select which variables in include.
The advantage is that, when you can get away with reading in fewer variables, SAS uses a smaller program data vector and therefore runs much more efficiently.
Which option/statement should you use? The choice of whether to use the DROP or the KEEP in a given situation is personal. Or rather, let’s say that most people will choose the option that requires the least amount of typing! I still try though, whenever possible, to use the KEEP, as then when I read my programs I know exactly which variables my data sets contain. Using DROP, on the other hand, requires one to think about what variables were in the data set before the dropping took place.
Example
The following program tells SAS to keep just three variables — subj, v_date, and b_date — when reading from the back1 data set in order to create the back2 data set:
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back2;
set phc6089.back (keep = subj v_date b_date);
age = (v_date - b_date)/365; * Calculate AGE in years;
format age 4.1;
RUN;
PROC PRINT data=back2 (obs=5);
title 'Output Dataset: BACK2';
RUN;
Obs | subj | v_date | b_date | age |
---|---|---|---|---|
1 | 110027 | 10/05/93 | 07/05/62 | 31.3 |
2 | 110029 | 10/05/93 | 09/07/26 | 67.1 |
3 | 110039 | 12/07/93 | 07/24/24 | 69.4 |
4 | 110040 | 11/30/93 | 10/20/67 | 26.1 |
5 | 110045 | 01/11/94 | 04/18/25 | 68.8 |
The SET statement's KEEP= option is used to tell SAS to read three variables — subject (subj), visit date (v_date), and birth date (b_date) — from the data set back and to store them in a new temporary data set back2. Note that the KEEP= option must be placed in parentheses and follow the name of the data set from you which you want SAS to select variables. The list of variables that you specify to keep must be separated by at least one blank space.
Then, the subject's age (age) at the time of the visit is calculated by subtracting the difference between visit date (v_date) and birth date (b_date) and dividing by 365 to get (an approximate) age in years.
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the temporary data set back2 contains the three kept variables (subj, v_date, and b_date), as well as the newly calculated variable (age).
Alternative, we could have used the keep statement. The benefit to using the SET option keep= in the example above, is that we have only read in the variables that we need. The following version reads in all the variables to the DATA step and then only keeps the selected varaibles to output to the dataset back3.
Note that now, we need to also put the new variable age in the keep statement. Before, the keep= option in the SET statement only applied to the input dataset in the SET statement, but the keep statement applies to all variables in the DATA step.
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back3;
set phc6089.back;
keep subj v_date b_date age;
age = (v_date - b_date)/365; * Calculate AGE in years;
format age 4.1;
RUN;
PROC PRINT data=back3 (obs=5);
title 'Output Dataset: BACK3';
RUN;
Obs | subj | v_date | b_date | age |
---|---|---|---|---|
1 | 110027 | 10/05/93 | 07/05/62 | 31.3 |
2 | 110029 | 10/05/93 | 09/07/26 | 67.1 |
3 | 110039 | 12/07/93 | 07/24/24 | 69.4 |
4 | 110040 | 11/30/93 | 10/20/67 | 26.1 |
5 | 110045 | 01/11/94 | 04/18/25 | 68.8 |
We can also use the keep= option as a DATA statement option.
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back3a (keep= subj v_date b_date age);
set phc6089.back;
age = (v_date - b_date)/365; * Calculate AGE in years;
format age 4.1;
RUN;
PROC PRINT data=back3a (obs=5);
title 'Output Dataset: BACK3a';
RUN;
Obs | subj | v_date | b_date | age |
---|---|---|---|---|
1 | 110027 | 10/05/93 | 07/05/62 | 31.3 |
2 | 110029 | 10/05/93 | 09/07/26 | 67.1 |
3 | 110039 | 12/07/93 | 07/24/24 | 69.4 |
4 | 110040 | 11/30/93 | 10/20/67 | 26.1 |
5 | 110045 | 01/11/94 | 04/18/25 | 68.8 |
Another way to acheive the same result is to use DROP to exclude all the variables that we don't want
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back4;
set phc6089.back (drop = v_type r_id sex state country
race ethnic relig mar_st
ed_level emp_st job_chng income);
age = (v_date - b_date)/365; * Calculate AGE in years;
format age 4.1;
RUN;
PROC PRINT data=back4 (obs = 5);
title 'Output Dataset: BACK4';
RUN;
Obs | subj | v_date | b_date | age |
---|---|---|---|---|
1 | 110027 | 10/05/93 | 07/05/62 | 31.3 |
2 | 110029 | 10/05/93 | 09/07/26 | 67.1 |
3 | 110039 | 12/07/93 | 07/24/24 | 69.4 |
4 | 110040 | 11/30/93 | 10/20/67 | 26.1 |
5 | 110045 | 01/11/94 | 04/18/25 | 68.8 |
The SET statement's DROP= option tells SAS to drop thirteen variables — the visit type (v_type), the id number of the reviewer of the data form (r_id), as well as the subject's gender (sex), state (state) and country (country) of residence, race (race), ethnicity (ethnic), religion (relig), marital status (mar_st), education level (ed_level), employment status (emp_st), job change status (job_chng), and income (income) — from the data set back and to store the remaining variables in a new temporary data set back3. Therefore, the program data vector used to create the SAS data set back3 contains only three variables — subject id (subj), visit date (v_date), and birth date (b_date).
Note that, just like the KEEP= option, the DROP= option must be placed in parentheses and follow the name of the data set from you which you want SAS to exclude variables. The list of variables that you specify to drop must be separated by at least one blank space.
We could also have accomplished this by using a DROP statement. This however will result in SAS reading in all variables to the SAS data vector, and then only outputting the variables we decide not to drop to the output dataset.
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back5;
set phc6089.back;
drop v_type r_id sex state country
race ethnic relig mar_st
ed_level emp_st job_chng income;
age = (v_date - b_date)/365; * Calculate AGE in years;
format age 4.1;
RUN;
PROC PRINT data=back5 (obs = 5);
title 'Output Dataset: BACK5';
RUN;
Obs | subj | v_date | b_date | age |
---|---|---|---|---|
1 | 110027 | 10/05/93 | 07/05/62 | 31.3 |
2 | 110029 | 10/05/93 | 09/07/26 | 67.1 |
3 | 110039 | 12/07/93 | 07/24/24 | 69.4 |
4 | 110040 | 11/30/93 | 10/20/67 | 26.1 |
5 | 110045 | 01/11/94 | 04/18/25 | 68.8 |
Alternatively, we could have also used the DATA statement's drop= option.
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back5a (drop = v_type r_id sex state country
race ethnic relig mar_st
ed_level emp_st job_chng income);
set phc6089.back;
age = (v_date - b_date)/365; * Calculate AGE in years;
format age 4.1;
RUN;
PROC PRINT data=back5a (obs = 5);
title 'Output Dataset: BACK5a';
RUN;
Obs | subj | v_date | b_date | age |
---|---|---|---|---|
1 | 110027 | 10/05/93 | 07/05/62 | 31.3 |
2 | 110029 | 10/05/93 | 09/07/26 | 67.1 |
3 | 110039 | 12/07/93 | 07/24/24 | 69.4 |
4 | 110040 | 11/30/93 | 10/20/67 | 26.1 |
5 | 110045 | 01/11/94 | 04/18/25 | 68.8 |
Listing out all of these variables can be very tedious, so SAS provides a shortcut to create variable lists. We can specify a list of the form var1 -- var2
where var1 and var2 are two variable names in our SAS dataset. The --
tells SAS to "fill in" all variables that occur in the dataset inbetween these two columns.
If we look at the order of the columns in our dataset by using PROC CONTENTS with the VARNUM option, then we can see that we all the columns from sex to income are in order sequentially. This will allow us to shorten our DROP statement to
DROP v_type r_id sex -- income;
ods select Position;
PROC contents data = phc6089.back order=varnum;
RUN;
ods select ALL;
The CONTENTS Procedure
Variables in Creation Order | ||||
---|---|---|---|---|
# | Variable | Type | Len | Format |
1 | subj | Num | 8 | |
2 | v_type | Num | 8 | |
3 | v_date | Num | 8 | MMDDYY8. |
4 | r_id | Num | 8 | |
5 | b_date | Num | 8 | MMDDYY8. |
6 | sex | Num | 8 | |
7 | state | Num | 8 | |
8 | country | Num | 8 | |
9 | race | Num | 8 | |
10 | ethnic | Num | 8 | |
11 | relig | Num | 8 | |
12 | mar_st | Num | 8 | |
13 | ed_level | Num | 8 | |
14 | emp_st | Num | 8 | |
15 | job_chng | Num | 8 | |
16 | income | Num | 8 |
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back6;
set phc6089.back;
drop v_type r_id sex -- income;
age = (v_date - b_date)/365; * Calculate AGE in years;
format age 4.1;
RUN;
PROC PRINT data=back6 (obs = 5);
title 'Output Dataset: BACK6';
RUN;
Obs | subj | v_date | b_date | age |
---|---|---|---|---|
1 | 110027 | 10/05/93 | 07/05/62 | 31.3 |
2 | 110029 | 10/05/93 | 09/07/26 | 67.1 |
3 | 110039 | 12/07/93 | 07/24/24 | 69.4 |
4 | 110040 | 11/30/93 | 10/20/67 | 26.1 |
5 | 110045 | 01/11/94 | 04/18/25 | 68.8 |
What to Use When
After reviewing the previous four examples, your head might be spinning and you might be wondering how you would know which to use — the KEEP= option on the SET statement? the DROP= option on the SET statement? the KEEP= option on the DATA statement? or the DROP= option on the DATA statement? Here are the key facts that will help us devise a strategy:
As stated earlier, the choice between KEEP and DROP is a matter of personal choice. It’s whether you place the options on the SET statement, the DATA statement, or use the individual statement that makes a difference.
In the SET statement, the DROP= and KEEP= options determine which variables are read from the input SAS dataset. That is, when you specify the DROP= or KEEP= option in the SET statement, the excluded variables are never even read into the program data vector.
In the DATA statement, the DROP= and KEEP= options determine which variables are written from the program data vector to the resulting SAS data set. That is, when you specify the DROP= or KEEP= option in the DATA statement, all of the variables in the input data set are read in to the program data vector, but only the selected variables are written from the program data vector to the output data set. This also allows you to create multiple datasets in a single DATA step by keeping/dropping different sets of variables in each of the output dataset.
The individual DROP and KEEP statements work similarly to the DATA statement options except that they will apply to all datasets listed in the DATA statement.
Those facts stated, it becomes obvious that we can construct a more efficient DATA step by not reading unneeded variables from the input data set. That said, we might have a working strategy:
use the KEEP= option (or DROP= option) in the SET statement to tell SAS only those variables that you need from the input data set read into the program data vector in order to achieve the goals of your program
use the KEEP= option (or DROP= option) in the DATA statement or the individual statements to tell SAS all of the variables that you want written from the program data vector into the output data set. You must use the DATA statement options if you want to select different variables for different datasets listed in the DATA statement.
Let’s try our working strategy out on an example.
Example
The following SAS program illustrates our working strategy of when to use the KEEP= option in the SET statement and when to use the KEEP= option in the DATA statement:
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back6 (keep = subj age);
set phc6089.back (keep= subj v_date b_date);
age = (v_date - b_date)/365;
format age 4.1;
RUN;
PROC PRINT data=back6 (obs=5);
title 'Output Dataset: BACK6';
RUN;
Obs | subj | age |
---|---|---|
1 | 110027 | 31.3 |
2 | 110029 | 67.1 |
3 | 110039 | 69.4 |
4 | 110040 | 26.1 |
5 | 110045 | 68.8 |
Here, the KEEP= option on the SET statement tells SAS to keep only three variables — subj, v_date, and b_date — from the input data set. The v_date and b_date variables are kept because we need them in order to calculate the age variable. The subject's age is calculated by subtracting b_date from v_date and dividing by 365. Therefore, the program data vector contains four variables — subj, v_date, b_date, and age —all, none or any of which can be written to the output back6 data set. The KEEP= option of the DATA statement tells SAS which variables to write from the program data vector to the output data set. Here, SAS is told to write only the subject's id (subj) and age (age) to back6.
Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the temporary data set back6 does indeed contain only thesubject's id (subj) and age (age).
6.3. Modifying Variable Attributes: RENAME= and LABEL options¶
Recall that the descriptor portion of a data set containing information about the attributes of each variable in the data set. The attribute information includes the variable’s name, type, length, format, informat, and label. In this section, we will learn how to
Use the LENGTH statement to adjust the default length of a variable.
Use the RENAME= option to change the name of a variable.
Use the LABEL statement to add a descriptive label to a variable.
We will cover formats in a later section.
6.3.1. The RENAME= Option¶
There may be occasions in which you want to change some of the variable names in your SAS data set. To do so, you’ll want to use the RENAME= option. As its name suggests, the RENAME= option allows you to change the variable names within a SAS data set.
The format of the RENAME= option is:
RENAME = (old1=new1 old2=new2 .... oldk=newk);
where old1, old2, … oldk are the variable names as they appear in the data set that precedes the RENAME= option, and new1, new2, …, newk are the corresponding new variable names.
The effect of the RENAME= option depends on where it appears:
If the RENAME= option appears in the SET statement, then the new variable name takes effect when the program data vector is created. Therefore, all programming statements within the DATA step must refer to the new variable name.
If the RENAME= option appears in the DATA statement, then the new variable name takes effect only when the data are written to the SAS data set. Therefore, all programming statements within the DATA step must refer to the old variable name.
Example
The following program illustrates the use of the RENAME= option in the SET statement. Specifically, the variable sex
is changed to gender
, and b_date
is changed to birth
, when the program data vector is created:
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back7 (keep = subj gender v_date birth age);
set phc6089.back (rename=(sex=gender b_date=birth));
age = (v_date - birth)/365; *MUST use NEW name for date of birth;
RUN;
PROC PRINT data=back7 (obs=5);
title 'Output Dataset: BACK7';
RUN;
Obs | subj | v_date | birth | gender | age |
---|---|---|---|---|---|
1 | 110027 | 10/05/93 | 07/05/62 | 2 | 31.2740 |
2 | 110029 | 10/05/93 | 09/07/26 | 2 | 67.1233 |
3 | 110039 | 12/07/93 | 07/24/24 | 2 | 69.4192 |
4 | 110040 | 11/30/93 | 10/20/67 | 2 | 26.1315 |
5 | 110045 | 01/11/94 | 04/18/25 | 1 | 68.7808 |
Because the RENAME= option appears in the SET statement, SAS no longer recognizes the variable name sex
as the gender, nor b_date
as the birth date, of the subject. Instead, SAS recognizes the variable names gender and birth. Hence, when we subsequently calculate the subjects' ages (age
) in the DATA step, we must refer to the new variable name birth.
Again, pay particular attention to the syntax of the RENAME= option ... it too can be tricky. The entire RENAME= option must be contained in parentheses immediately following the data set to which you want the name changes to apply. The variable names must also be placed in parentheses. So, in general, the syntax, when applied to a DATA statement, should look like this:
DATA dsname (RENAME = (o1=n1 o2=n2 ...));
where dsname
is the data set name and o1
and o2
are the old variable names, and n1
and n2
are the new variable names.
Launch and run the SAS program. Review the output from the PRINT procedure. Convince yourself that the variable names sex and b_date have been changed as advertised to gender and birth, respectively. Also, verify that the ages of the subjects have been calculated appropriately. Then, in the SAS program, change the variable name birth back to the variable name b_date, and re-run the program. Does SAS indeed hiccup?
Example
The following program illustrates use of the RENAME= option, when it appears in the DATA statement. Specifically, the variable sex
is changed to gender, and b_date
is changed to birth when SAS writes the data to the output data set:
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back8 (rename=(sex=gender b_date=birth)
keep = subj sex v_date b_date age);
set phc6089.back;
age = (v_date - b_date)/365; *MUST use OLD name
for date of birth;
RUN;
PROC PRINT data=back8 (obs=5);
title 'Output Dataset: BACK8';
RUN;
Obs | subj | v_date | birth | gender | age |
---|---|---|---|---|---|
1 | 110027 | 10/05/93 | 07/05/62 | 2 | 31.2740 |
2 | 110029 | 10/05/93 | 09/07/26 | 2 | 67.1233 |
3 | 110039 | 12/07/93 | 07/24/24 | 2 | 69.4192 |
4 | 110040 | 11/30/93 | 10/20/67 | 2 | 26.1315 |
5 | 110045 | 01/11/94 | 04/18/25 | 1 | 68.7808 |
Because the RENAME= option appears in the DATA statement, SAS only recognizes the variable names as they appear in the input data set back3. That is, for example, SAS recognizes the variable name b_date as the birth date of the subjects. Hence, when we subsequently calculate the subjects' ages in the DATA step, we must refer to the old variable name b_date. Also note that the KEEP= option in the DATA statement must refer to the original variable names as they appear in the back3 data set.
This program also illustrates how to use more than one DATA step option at a time. Specifically, the RENAME= and KEEP= options are used to modify the back8 data set. As such, both options are placed within one set of parentheses immediately following the data set to which you want the changes to apply. Then, within those parentheses, the basic syntax for each option is followed.
Launch and run the SAS program, and review the output from the PRINT procedure. Convince yourself that the variable names sex
and b_date
have been changed as advertised to gender
and birth
, respectively. Also, verify that the ages of the subjects have been calculated appropriately.
6.3.2. The LABEL Statement¶
The LABEL statement associates a descriptive label to a variable. This label is used in place of the variable name when printing such as when creating plots, the label is used as the axis label.
Example
The following program uses a LABEL statement to associate descriptive labels to the variables subj, sex, v_date, b_date, and age.
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA back8a (rename=(sex=gender b_date=birth)
keep = subj sex v_date b_date age);
set phc6089.back;
age = (v_date - b_date)/365; *MUST use OLD name
for date of birth;
LABEL subj = "Subject ID"
sex = "Gender"
v_date = "Visit Date"
b_date = "Birth Date"
age = "Age in Years";
RUN;
title ;
ODS SELECT Position;
PROC contents data = back8a order = varnum;
RUN;
ODS SELECT ALL;
The CONTENTS Procedure
Variables in Creation Order | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Label |
1 | subj | Num | 8 | Subject ID | |
2 | v_date | Num | 8 | MMDDYY8. | Visit Date |
3 | birth | Num | 8 | MMDDYY8. | Birth Date |
4 | gender | Num | 8 | Gender | |
5 | age | Num | 8 | Age in Years |
We can see that the labels were added by viewing the descriptor portion of the dataset via PROC CONTENTS. A label can be as long as 256 characters.
6.3.3. The LENGTH Statement¶
We learned earlier that by default, character variables are 8 characters long. To get around this before, we used a character informat or column input to read character strings that were longer than 8 characters. Another way is the use the LENGTH statement to set the variable length attribute to be longer than 8 characters.
Example
The following simple program illustrates creating a character value with the default length versus setting the length to be longer by using the LENGTH statement.
DATA char_length;
LENGTH char_var2 $20;
INFILE datalines delimiter = ",";
INPUT char_var1 $ char_var2 $;
DATALINES;
A longer character string, A longer character string
shortChar, shortChar
;
RUN;
PROC print data = char_length;
title "Character Length Data Example";
RUN;
Obs | char_var2 | char_var1 |
---|---|---|
1 | A longer character s | A longer |
2 | shortChar | shortCha |
The LENGTH statement sets the length of the variable char_var2 to be up to 20 characters, while char_var1 gets the default length of 8 characters. When reading the same string value, char_var1 can only store up to the first 8 characters and discards the rest, while char_var2 can store up to 20. This results in the strings being only partially read in for char_var1.
6.4. Sorting the Rows of a Dataset: PROC SORT¶
By default, the PRINT procedure displays observations in the order in which they appear in your data set. Alternatively, you can use the SORT procedure to first sort your data set based on the values of one or more variables. Then, when you use the PRINT procedure, SAS will display the observations in the order in which you sorted the data.
Example
The following SAS program uses the SORT procedure to sort the temporary dataset back8 that we created earlier by gender and then by age. Note that it is sorted by gender first. Then, if there are any ties in gender, these are sorted by age.
PROC SORT data = back8 out = srtd_back8;
by gender age;
RUN;
title "First 5 sorted observations for gender = 1";
PROC PRINT data = srtd_back8 (obs = 5);
var gender age subj v_date birth;
RUN;
title "First 5 sorted observations for gender = 2";
PROC PRINT data = srtd_back8 (firstobs = 57 obs = 62);
var gender age subj v_date birth;
RUN;
Obs | gender | age | subj | v_date | birth |
---|---|---|---|---|---|
1 | 1 | -18.3014 | 110125 | 02/06/96 | 05/22/14 |
2 | 1 | -17.3699 | 110152 | 07/19/96 | 11/27/13 |
3 | 1 | 23.4986 | 410073 | 06/13/95 | 12/19/71 |
4 | 1 | 23.6822 | 520051 | 09/26/94 | 01/26/71 |
5 | 1 | 24.8630 | 410071 | 05/30/95 | 07/25/70 |
Obs | gender | age | subj | v_date | birth |
---|---|---|---|---|---|
57 | 2 | -23.1534 | 410097 | 04/24/96 | 06/14/19 |
58 | 2 | -22.6795 | 420074 | 09/07/95 | 05/07/18 |
59 | 2 | -22.6466 | 520007 | 08/13/93 | 03/31/16 |
60 | 2 | -22.2438 | 310102 | 02/12/96 | 05/06/18 |
61 | 2 | -22.1781 | 420112 | 12/17/96 | 02/15/19 |
62 | 2 | -20.3041 | 510010 | 09/21/93 | 01/05/14 |
Launch the SAS program, but print the whole dataset by removing the FIRSTOBS and OBS dataset options to verify that the datset is sorted first by gender and then by age.
Then, note that while the SORT procedure's BY statement is required, it's OUT= option is optional. If you don't use it, however, then the SORT procedure permanently sorts the data set that is specified in the DATA= option. Therefore, if you need your data to be sorted just to produce output temporarily, then you should use the OUT= option in conjunction with a temporary SAS data set name.
By default, SAS sorts the values of the variables appearing in the BY statement in ascending order. If you want them sorted in descending order, you need to use the BY statement's DESCENDING option.
Example
The following SAS program uses the BY statement's DESCENDING option to tell SAS to sort the temporary dataset back8 first by gender in descending order, and then by age in ascending order.
PROC SORT data = back8 out = srtd_back8;
by descending gender age;
RUN;
title "First 5 sorted observations for gender = 1";
PROC PRINT data = srtd_back8 (obs = 5);
var gender age subj v_date birth;
RUN;
title "First 5 sorted observations for gender = 2";
PROC PRINT data = srtd_back8 (firstobs = 57 obs = 62);
var gender age subj v_date birth;
RUN;
title ;
Obs | gender | age | subj | v_date | birth |
---|---|---|---|---|---|
1 | 2 | -23.1534 | 410097 | 04/24/96 | 06/14/19 |
2 | 2 | -22.6795 | 420074 | 09/07/95 | 05/07/18 |
3 | 2 | -22.6466 | 520007 | 08/13/93 | 03/31/16 |
4 | 2 | -22.2438 | 310102 | 02/12/96 | 05/06/18 |
5 | 2 | -22.1781 | 420112 | 12/17/96 | 02/15/19 |
Obs | gender | age | subj | v_date | birth |
---|---|---|---|---|---|
57 | 2 | 26.5479 | 410006 | 09/09/93 | 02/28/67 |
58 | 2 | 26.5863 | 510051 | 01/20/95 | 06/26/68 |
59 | 2 | 26.6110 | 410091 | 02/14/96 | 07/12/69 |
60 | 2 | 26.7452 | 210100 | 05/06/96 | 08/14/69 |
61 | 2 | 26.7507 | 210006 | 05/13/93 | 08/19/66 |
62 | 2 | 26.8438 | 220036 | 08/08/95 | 10/10/68 |
Launch the SAS program, but print the whole dataset by removing the FIRSTOBS and OBS dataset options to see that the dataset is sorted in the desired order. Note that if your BY statement contains more than one variable, then the DESCENDING option applies only to the variable that immediately follows it. You might want to sandwich another DESCENDING gender and age in the BY statement and then rerun the program to see the effect.
6.5. Exercises¶
Create a temporary SAS dataset called
heart
from the sashelp.heart dataset. The library sashelp if a permanent library that comes with SAS.View the description of the
heart
dataset by running PROC CONTENTS. How many observations (rows) and variables (columns) are in the dataset?Use the RENAME= option to rename
Status
tocondition
. Show the variables table from PROC CONTENTS to see that the name has changed.Create a subset of
heart
calledheart_sub
that contains only the columnsheight
,weight
, anddiastolic
. What are the dimensions ofheart_sub
? Print the first five rows ofheart_sub
.Subset the patients (rows) of
heart
that have a diastolic blood pressure (diastolic) more than 90 into a dataset calledheart_sub2
. How many are there? Print the first five rows ofheart_sub2
.Subset the patients (rows) of
heart
that have a diastolic blood pressure less than or equal to 90 and have a weight more than 180 pounds into a dataset calledheart_sub3
. How many are there? Print the first five rows ofheart_sub3
.Create a subset of
heart
that only contains the columnsweight
,systolic
, andcholesterol
for those that are heavy smokers smoker (Smoking_Status = “Heavy (16-25)”). Save this toheart_sub
. What are the dimensions of the dataset? Print the first five rows ofheart_sub
.Reorder the rows of
heart_sub
in decreasing order by cholesterol. Print the first 5 rows of the sorted dataset.