17. Solutions to Exercises¶
17.1. SAS Windowing Environment¶
Use the sample SAS program code provided above and the Charm City Circulator Ridership dataset to experiment with using SAS by making the following changes to the provided code.
Go through and change the colors using the attribute map dataset
palette
to something other than what they originally were. See https://support.sas.com/content/dam/SAS/support/en/books/pro-template-made-easy-a-guide-for-sas-users/62007_Appendix.pdf for a large list of colors in SAS.Change the days you are keeping to show
"Sunday"
instead of “Saturday"
.Change the plot from a line plot to a scatterplot by changing
series
toscatter
inproc sgpanel
.
%let working_directory = data; *Replace data with your path to your
directory containing the dataset;
proc import datafile = "&working_directory/Charm_City_Circulator_Ridership.csv"
out = circ dbms = csv replace;
getnames = yes;
guessingrows = max;
run;
proc sort data = circ;
by day date daily;
run;
proc transpose data = circ out = long (rename=(_NAME_ = var col1 = number));
var orangeboardings -- banneraverage;
by day date daily;
run;
data long2;
set long;
var = tranwrd(var, 'Board', '_Board');
var = tranwrd(var, 'Alight', '_Alight');
var = tranwrd(var, 'Average', '_Average');
line = scan(var, 1, '_');
type = scan(var, 2, '_');
drop var;
run;
data avg;
set long2;
where (type = "Average");
run;
data palette;
input id $ value $ linecolor $;
markercolor = linecolor;
datalines;
lc banner red
lc green yellow
lc orange darkblue
lc purple brown
;
run;
proc sgpanel data = avg dattrmap=palette;
where day in ("Monday" "Tuesday" "Friday" "Sunday");
panelby day;
scatter Y = number X = date / group = line attrid = lc;
run;
17.2. Basic SAS Programming¶
Submit the following simple SAS program. What does it do?
PROC PRINT DATA = sashelp.heart (OBS=10); RUN; PROC MEANS DATA = sashelp.heart; VAR height weight; RUN;
PROC PRINT DATA = sashelp.heart (OBS=10);
RUN;
PROC MEANS DATA = sashelp.heart;
VAR height weight;
RUN;
Obs | Status | DeathCause | AgeCHDdiag | Sex | AgeAtStart | Height | Weight | Diastolic | Systolic | MRW | Smoking | AgeAtDeath | Cholesterol | Chol_Status | BP_Status | Weight_Status | Smoking_Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Dead | Other | . | Female | 29 | 62.50 | 140 | 78 | 124 | 121 | 0 | 55 | . | Normal | Overweight | Non-smoker | |
2 | Dead | Cancer | . | Female | 41 | 59.75 | 194 | 92 | 144 | 183 | 0 | 57 | 181 | Desirable | High | Overweight | Non-smoker |
3 | Alive | . | Female | 57 | 62.25 | 132 | 90 | 170 | 114 | 10 | . | 250 | High | High | Overweight | Moderate (6-15) | |
4 | Alive | . | Female | 39 | 65.75 | 158 | 80 | 128 | 123 | 0 | . | 242 | High | Normal | Overweight | Non-smoker | |
5 | Alive | . | Male | 42 | 66.00 | 156 | 76 | 110 | 116 | 20 | . | 281 | High | Optimal | Overweight | Heavy (16-25) | |
6 | Alive | . | Female | 58 | 61.75 | 131 | 92 | 176 | 117 | 0 | . | 196 | Desirable | High | Overweight | Non-smoker | |
7 | Alive | . | Female | 36 | 64.75 | 136 | 80 | 112 | 110 | 15 | . | 196 | Desirable | Normal | Overweight | Moderate (6-15) | |
8 | Dead | Other | . | Male | 53 | 65.50 | 130 | 80 | 114 | 99 | 0 | 77 | 276 | High | Normal | Normal | Non-smoker |
9 | Alive | . | Male | 35 | 71.00 | 194 | 68 | 132 | 124 | 0 | . | 211 | Borderline | Normal | Overweight | Non-smoker | |
10 | Dead | Cerebral Vascular Disease | . | Male | 52 | 62.50 | 129 | 78 | 124 | 106 | 5 | 82 | 284 | High | Normal | Normal | Light (1-5) |
The MEANS Procedure
Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
Height
Weight
|
5203
5203
|
64.8131847
153.0866808
|
3.5827074
28.9154261
|
51.5000000
67.0000000
|
76.5000000
300.0000000
|
This program prints out the first 10 rows of the sashelp.heart dataset and the summary statistics N (sample size), mean, standard deviation, minimum and maxium for the Height and Weight variables.
Submit the following small program. How many DATA steps are there? How man PROC steps? Which steps are executed?
DATA work.quant; SET sashelp.heart; /* Read in from the heart datset */ KEEP height weight diastolic systolic; *Only keep these variable in the new dataset; RUN; PROC PRINT DATA = work.quant (OBS=10); RUN; /* PROC MEANS DATA = work.quant MIN Q1 Median Q3 MAX MEAN STD; RUN; */
DATA work.quant;
SET sashelp.heart; /* Read in from the heart datset */
KEEP height weight diastolic systolic; *Only keep these variable
in the new dataset;
RUN;
PROC PRINT DATA = work.quant (OBS=10);
RUN;
/*
PROC MEANS DATA = work.quant MIN Q1 Median Q3 MAX MEAN STD;
RUN;
*/
Obs | Height | Weight | Diastolic | Systolic |
---|---|---|---|---|
1 | 62.50 | 140 | 78 | 124 |
2 | 59.75 | 194 | 92 | 144 |
3 | 62.25 | 132 | 90 | 170 |
4 | 65.75 | 158 | 80 | 128 |
5 | 66.00 | 156 | 76 | 110 |
6 | 61.75 | 131 | 92 | 176 |
7 | 64.75 | 136 | 80 | 112 |
8 | 65.50 | 130 | 80 | 114 |
9 | 71.00 | 194 | 68 | 132 |
10 | 62.50 | 129 | 78 | 124 |
There is one DATA step and there are two PROC steps; however, the last PROC MEANS is not run since it is within a paragraph comment.
Run PROC CONTENTS on the dataset sashelp.heart. How many observations are in this datset?
PROC CONTENTS data = sashelp.heart;
RUN;
The CONTENTS Procedure
Data Set Name | SASHELP.HEART | Observations | 5209 |
---|---|---|---|
Member Type | DATA | Variables | 17 |
Engine | V9 | Indexes | 0 |
Created | 10/25/2018 02:21:26 | Observation Length | 168 |
Last Modified | 10/25/2018 02:21:26 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | Framingham Heart Study | ||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | us-ascii ASCII (ANSI) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 14 |
First Data Page | 1 |
Max Obs per Page | 389 |
Obs in First Data Page | 365 |
Number of Data Set Repairs | 0 |
Filename | /opt/sasinside/SASHome/SASFoundation/9.4/sashelp/heart.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 6235 |
Access Permission | rw-r--r-- |
Owner Name | sas |
File Size | 960KB |
File Size (bytes) | 983040 |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Label |
12 | AgeAtDeath | Num | 8 | Age at Death |
5 | AgeAtStart | Num | 8 | Age at Start |
3 | AgeCHDdiag | Num | 8 | Age CHD Diagnosed |
15 | BP_Status | Char | 7 | Blood Pressure Status |
14 | Chol_Status | Char | 10 | Cholesterol Status |
13 | Cholesterol | Num | 8 | |
2 | DeathCause | Char | 26 | Cause of Death |
8 | Diastolic | Num | 8 | |
6 | Height | Num | 8 | |
10 | MRW | Num | 8 | Metropolitan Relative Weight |
4 | Sex | Char | 6 | |
11 | Smoking | Num | 8 | |
17 | Smoking_Status | Char | 17 | Smoking Status |
1 | Status | Char | 5 | |
9 | Systolic | Num | 8 | |
7 | Weight | Num | 8 | |
16 | Weight_Status | Char | 11 | Weight Status |
There are 5209 observations (rows) in the dataset.
Submit the following program to SAS. What is the value of CurrentDate? (This value represents the number of days since Januar 1, 1960, the reference date for the SAS system).
DATA work.date; CurrentDate = Today(); RUN; PROC PRINT DATA = work.date; RUN;
DATA work.date;
CurrentDate = Today();
RUN;
PROC PRINT DATA = work.date;
RUN;
PROC PRINT DATA = work.date;
format currentdate mmddyy10.;
RUN;
Obs | CurrentDate |
---|---|
1 | 22205 |
Obs | CurrentDate |
---|---|
1 | 10/17/2020 |
The value of current date will change depending on when you submit this program. I submitted this program in 10/17/2020 and received the value 22205 or 22205 days since January 1, 1960.
17.3. Data I/O¶
Use PROC IMPORT to read in the Youth Tobacco study, Youth_Tobacco_Survey_YTS_Data.csv and name it youth.
PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Youth_Tobacco_Survey_YTS_Data.csv"
out = youth dbms = csv replace;
getnames = yes;
guessingrows = 500;
RUN;
PROC PRINT data = youth (OBS=5);
RUN;
Obs | YEAR | LocationAbbr | LocationDesc | TopicType | TopicDesc | MeasureDesc | DataSource | Response | Data_Value_Unit | Data_Value_Type | Data_Value | Data_Value_Footnote_Symbol | Data_Value_Footnote | Data_Value_Std_Err | Low_Confidence_Limit | High_Confidence_Limit | Sample_Size | Gender | Race | Age | Education | GeoLocation | TopicTypeId | TopicId | MeasureId | StratificationID1 | StratificationID2 | StratificationID3 | StratificationID4 | SubMeasureID | DisplayOrder |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2015 | AZ | Arizona | Tobacco Use - Survey Data | Cessation (Youth) | Percent of Current Smokers Who Want to Quit | YTS | % | Percentage | . | * | Data in these cells have been suppressed because of a small sample size | . | . | . | . | Overall | All Races | All Ages | Middle School | (34.865970280000454, -111.76381127699972) | BEH | 105BEH | 170CES | 1GEN | 8AGE | 6RAC | 1EDU | YTS01 | 1 | |
2 | 2015 | AZ | Arizona | Tobacco Use - Survey Data | Cessation (Youth) | Percent of Current Smokers Who Want to Quit | YTS | % | Percentage | . | * | Data in these cells have been suppressed because of a small sample size | . | . | . | . | Male | All Races | All Ages | Middle School | (34.865970280000454, -111.76381127699972) | BEH | 105BEH | 170CES | 2GEN | 8AGE | 6RAC | 1EDU | YTS02 | 2 | |
3 | 2015 | AZ | Arizona | Tobacco Use - Survey Data | Cessation (Youth) | Percent of Current Smokers Who Want to Quit | YTS | % | Percentage | . | * | Data in these cells have been suppressed because of a small sample size | . | . | . | . | Female | All Races | All Ages | Middle School | (34.865970280000454, -111.76381127699972) | BEH | 105BEH | 170CES | 3GEN | 8AGE | 6RAC | 1EDU | YTS03 | 3 | |
4 | 2015 | AZ | Arizona | Tobacco Use - Survey Data | Cessation (Youth) | Quit Attempt in Past Year Among Current Cigarette Smokers | YTS | % | Percentage | . | * | Data in these cells have been suppressed because of a small sample size | . | . | . | . | Overall | All Races | All Ages | Middle School | (34.865970280000454, -111.76381127699972) | BEH | 105BEH | 169QUA | 1GEN | 8AGE | 6RAC | 1EDU | YTS04 | 4 | |
5 | 2015 | AZ | Arizona | Tobacco Use - Survey Data | Cessation (Youth) | Quit Attempt in Past Year Among Current Cigarette Smokers | YTS | % | Percentage | . | * | Data in these cells have been suppressed because of a small sample size | . | . | . | . | Male | All Races | All Ages | Middle School | (34.865970280000454, -111.76381127699972) | BEH | 105BEH | 169QUA | 2GEN | 8AGE | 6RAC | 1EDU | YTS05 | 5 |
Download an Excel version of the Monuments dataset, Monuments.xlsx, from CANVAS. Use PROC IMPORT to read in the dataset and call the output mon.
PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Monuments.xlsx"
out = mon dbms = xlsx replace;
getnames = yes;
RUN;
PROC PRINT data = mon (OBS=5);
RUN;
Obs | name | zipCode | neighborhood | councilDistrict | policeDistrict | Location 1 |
---|---|---|---|---|---|---|
1 | James Cardinal Gibbons | 21201 | Downtown | 11 | CENTRAL | 408 CHARLES ST Baltimore, MD |
2 | The Battle Monument | 21202 | Downtown | 11 | CENTRAL | |
3 | Negro Heroes of the U.S Monument | 21202 | Downtown | 11 | CENTRAL | |
4 | Star Bangled Banner | 21202 | Downtown | 11 | CENTRAL | 100 HOLLIDAY ST Baltimore, MD |
5 | Flame at the Holocaust Monument | 21202 | Downtown | 11 | CENTRAL | 50 MARKET PL Baltimore, MD |
Write a DATA step that will read in the following column formatted data using DATALINES and print the results using PROC PRINT. This dataset contains information on esophageal cancer in Ille-et-Vilaine, France. The variables are
Age group
Alcohol consumption
Tobacco consumption
Number of Cases
Number of Controls
Write a DATA step to read in the data and print the results. Note: the first line is to help you count the columns. Each number corresponds to a multiple of 5.
----1----2----3----4----5----6----7----8 75+ 0-39g/day 0-9g/day 1 18 75+ 0-39g/day 10-19 2 6 75+ 0-39g/day 30+ 1 3 75+ 40-79 0-9g/day 2 5 75+ 40-79 10-19 1 3 75+ 40-79 20-29 0 3 75+ 40-79 30+ 1 1 75+ 80-119 0-9g/day 1 1 75+ 80-119 10-19 1 1 75+ 120+ 0-9g/day 2 2 75+ 120+ 10-19 1 1
DATA eso;
INPUT aggrp $ 1-3 alc $ 5-13 smoke $ 15-22 ncases 29 ncontrl 38-39;
DATALINES;
75+ 0-39g/day 0-9g/day 1 18
75+ 0-39g/day 10-19 2 6
75+ 0-39g/day 30+ 1 3
75+ 40-79 0-9g/day 2 5
75+ 40-79 10-19 1 3
75+ 40-79 20-29 0 3
75+ 40-79 30+ 1 1
75+ 80-119 0-9g/day 1 1
75+ 80-119 10-19 1 1
75+ 120+ 0-9g/day 2 2
75+ 120+ 10-19 1 1
;
RUN;
PROC PRINT DATA=eso;
RUN;
SAS Connection established. Subprocess id is 2198
Obs | aggrp | alc | smoke | ncases | ncontrl |
---|---|---|---|---|---|
1 | 75+ | 0-39g/day | 0-9g/day | 1 | 18 |
2 | 75+ | 0-39g/day | 10-19 | 2 | 6 |
3 | 75+ | 0-39g/day | 30+ | 1 | 3 |
4 | 75+ | 40-79 | 0-9g/day | 2 | 5 |
5 | 75+ | 40-79 | 10-19 | 1 | 3 |
6 | 75+ | 40-79 | 20-29 | 0 | 3 |
7 | 75+ | 40-79 | 30+ | 1 | 1 |
8 | 75+ | 80-119 | 0-9g/day | 1 | 1 |
9 | 75+ | 80-119 | 10-19 | 1 | 1 |
10 | 75+ | 120+ | 0-9g/day | 2 | 2 |
11 | 75+ | 120+ | 10-19 | 1 | 1 |
17.4. SAS Variables and Assignment Statements¶
Use the following SAS data step to create the new variables
Grade
andCourse
defined below:
data school; input Age Quiz : $1. Midterm Final; /* Add you statements here */ datalines; 12 A 92 95 12 B 88 88 13 C 78 75 13 A 92 93 12 F 55 62 13 B 88 82 ;
Using If-Then-Else statements, compute two new variables as follows:
Grade (numeric), with a value of 6 if Age is 12 and a value of 8 if Age is 13.
The quiz grades have numerical equivalents as follows: A = 95, B = 85, C = 75, D = 70, and F = 65. Using this information, compute a course grade (Course) as a weighted average of the Quiz (20%), Midterm (30%) and Final (50%).
data school;
input Age Quiz : $1. Midterm Final;
IF Age = 12 THEN Grade = 6;
ELSE IF AGE = 13 THEN Grade = 8;
IF quiz = "A" THEN Course = 0.2*95 + 0.3*midterm + 0.5*final;
ELSE IF quiz = "B" THEN Course = 0.2*85 + 0.3*midterm + 0.5*final;
ELSE IF quiz = "C" THEN Course = 0.2*75 + 0.3*midterm + 0.5*final;
ELSE IF quiz = "D" THEN Course = 0.2*70 + 0.3*midterm + 0.5*final;
ELSE IF quiz = "F" THEN Course = 0.2*65 + 0.3*midterm + 0.5*final;
datalines;
12 A 92 95
12 B 88 88
13 C 78 75
13 A 92 93
12 F 55 62
13 B 88 82
;
RUN;
PROC PRINT DATA = school;
RUN;
Obs | Age | Quiz | Midterm | Final | Grade | Course |
---|---|---|---|---|---|---|
1 | 12 | A | 92 | 95 | 6 | 94.1 |
2 | 12 | B | 88 | 88 | 6 | 87.4 |
3 | 13 | C | 78 | 75 | 8 | 75.9 |
4 | 13 | A | 92 | 93 | 8 | 93.1 |
5 | 12 | F | 55 | 62 | 6 | 60.5 |
6 | 13 | B | 88 | 82 | 8 | 84.4 |
17.5. Working with Your Data¶
Create a temporary SAS dataset called
heart
from the sashelp.heart dataset. The library sashelp if a permanent library that comes with SAS.
DATA heart;
SET sashelp.heart;
RUN;
1348 ods listing close;ods html5 (id=saspy_internal) file=stdout options(bitmap_mode='inline') device=svg style=HTMLBlue; ods
1348! graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: STDOUT
1349
1350 DATA heart;
1351 SET sashelp.heart;
1352 RUN;
NOTE: There were 5209 observations read from the data set SASHELP.HEART.
NOTE: The data set WORK.HEART has 5209 observations and 17 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
1353
1354 ods html5 (id=saspy_internal) close;ods listing;
1355
View the description of the
heart
dataset by running PROC CONTENTS. How many observations (rows) and variables (columns) are in the dataset?
PROC CONTENTS DATA = heart;
RUN;
The CONTENTS Procedure
Data Set Name | WORK.HEART | Observations | 5209 |
---|---|---|---|
Member Type | DATA | Variables | 17 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 21:35:04 | Observation Length | 168 |
Last Modified | 10/17/2020 21:35:04 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 14 |
First Data Page | 1 |
Max Obs per Page | 389 |
Obs in First Data Page | 365 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671720 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 960KB |
File Size (bytes) | 983040 |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Label |
12 | AgeAtDeath | Num | 8 | Age at Death |
5 | AgeAtStart | Num | 8 | Age at Start |
3 | AgeCHDdiag | Num | 8 | Age CHD Diagnosed |
15 | BP_Status | Char | 7 | Blood Pressure Status |
14 | Chol_Status | Char | 10 | Cholesterol Status |
13 | Cholesterol | Num | 8 | |
2 | DeathCause | Char | 26 | Cause of Death |
8 | Diastolic | Num | 8 | |
6 | Height | Num | 8 | |
10 | MRW | Num | 8 | Metropolitan Relative Weight |
4 | Sex | Char | 6 | |
11 | Smoking | Num | 8 | |
17 | Smoking_Status | Char | 17 | Smoking Status |
9 | Systolic | Num | 8 | |
7 | Weight | Num | 8 | |
16 | Weight_Status | Char | 11 | Weight Status |
1 | condition | Char | 5 |
There are 5209 observations (rows) and 17 variables (columns).
Use the RENAME= option to rename
Status
tocondition
. Show the variables table from PROC CONTENTS to see that the name has changed.
DATA heart2 (RENAME = (status = condition));
SET heart;
RUN;
PROC CONTENTS data = heart2;
RUN;
The CONTENTS Procedure
Data Set Name | WORK.HEART2 | Observations | 5209 |
---|---|---|---|
Member Type | DATA | Variables | 17 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 21:39:47 | Observation Length | 168 |
Last Modified | 10/17/2020 21:39:47 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 14 |
First Data Page | 1 |
Max Obs per Page | 389 |
Obs in First Data Page | 365 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart2.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671720 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 960KB |
File Size (bytes) | 983040 |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Label |
12 | AgeAtDeath | Num | 8 | Age at Death |
5 | AgeAtStart | Num | 8 | Age at Start |
3 | AgeCHDdiag | Num | 8 | Age CHD Diagnosed |
15 | BP_Status | Char | 7 | Blood Pressure Status |
14 | Chol_Status | Char | 10 | Cholesterol Status |
13 | Cholesterol | Num | 8 | |
2 | DeathCause | Char | 26 | Cause of Death |
8 | Diastolic | Num | 8 | |
6 | Height | Num | 8 | |
10 | MRW | Num | 8 | Metropolitan Relative Weight |
4 | Sex | Char | 6 | |
11 | Smoking | Num | 8 | |
17 | Smoking_Status | Char | 17 | Smoking Status |
9 | Systolic | Num | 8 | |
7 | Weight | Num | 8 | |
16 | Weight_Status | Char | 11 | Weight Status |
1 | condition | Char | 5 |
Create a subset of
heart
calledheart_sub
that contains only the columnsheight
,weight
, anddiastolic
. What are the dimensions ofheart_sub
? Print the first five rows ofheart_sub
.
DATA heart_sub;
SET heart;
KEEP height weight diastolic;
RUN;
PROC CONTENTS data = heart_sub;
RUN;
PROC PRINT data = heart_sub(OBS=5);
RUN;
The CONTENTS Procedure
Data Set Name | WORK.HEART_SUB | Observations | 5209 |
---|---|---|---|
Member Type | DATA | Variables | 3 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 21:39:50 | Observation Length | 24 |
Last Modified | 10/17/2020 21:39:50 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 2 |
First Data Page | 1 |
Max Obs per Page | 2714 |
Obs in First Data Page | 2625 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671721 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 192KB |
File Size (bytes) | 196608 |
Alphabetic List of Variables and Attributes | |||
---|---|---|---|
# | Variable | Type | Len |
3 | Diastolic | Num | 8 |
1 | Height | Num | 8 |
2 | Weight | Num | 8 |
Obs | Height | Weight | Diastolic |
---|---|---|---|
1 | 62.50 | 140 | 78 |
2 | 59.75 | 194 | 92 |
3 | 62.25 | 132 | 90 |
4 | 65.75 | 158 | 80 |
5 | 66.00 | 156 | 76 |
The new dataset heart_sub
contains 5209 rows and 3 columns.
Subset the patients (rows) of
heart
that have a diastolic blood pressure (diastolic) more than 90 into a dataset calledheart_sub2
. How many are there? Print the first five rows ofheart_sub2
.
DATA heart_sub2;
SET heart;
WHERE diastolic > 90;
RUN;
PROC CONTENTS data = heart_sub2;
RUN;
PROC PRINT data = heart_sub2(OBS=5);
RUN;
The CONTENTS Procedure
Data Set Name | WORK.HEART_SUB2 | Observations | 1366 |
---|---|---|---|
Member Type | DATA | Variables | 17 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 21:39:55 | Observation Length | 168 |
Last Modified | 10/17/2020 21:39:55 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 4 |
First Data Page | 1 |
Max Obs per Page | 389 |
Obs in First Data Page | 365 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub2.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671723 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 320KB |
File Size (bytes) | 327680 |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Label |
12 | AgeAtDeath | Num | 8 | Age at Death |
5 | AgeAtStart | Num | 8 | Age at Start |
3 | AgeCHDdiag | Num | 8 | Age CHD Diagnosed |
15 | BP_Status | Char | 7 | Blood Pressure Status |
14 | Chol_Status | Char | 10 | Cholesterol Status |
13 | Cholesterol | Num | 8 | |
2 | DeathCause | Char | 26 | Cause of Death |
8 | Diastolic | Num | 8 | |
6 | Height | Num | 8 | |
10 | MRW | Num | 8 | Metropolitan Relative Weight |
4 | Sex | Char | 6 | |
11 | Smoking | Num | 8 | |
17 | Smoking_Status | Char | 17 | Smoking Status |
1 | Status | Char | 5 | |
9 | Systolic | Num | 8 | |
7 | Weight | Num | 8 | |
16 | Weight_Status | Char | 11 | Weight Status |
Obs | Status | DeathCause | AgeCHDdiag | Sex | AgeAtStart | Height | Weight | Diastolic | Systolic | MRW | Smoking | AgeAtDeath | Cholesterol | Chol_Status | BP_Status | Weight_Status | Smoking_Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Dead | Cancer | . | Female | 41 | 59.75 | 194 | 92 | 144 | 183 | 0 | 57 | 181 | Desirable | High | Overweight | Non-smoker |
2 | Alive | . | Female | 58 | 61.75 | 131 | 92 | 176 | 117 | 0 | . | 196 | Desirable | High | Overweight | Non-smoker | |
3 | Alive | . | Female | 42 | 67.75 | 162 | 96 | 138 | 119 | 1 | . | 200 | Borderline | High | Overweight | Light (1-5) | |
4 | Dead | Coronary Heart Disease | 71 | Female | 49 | 60.50 | 153 | 110 | 196 | 140 | 5 | 73 | 221 | Borderline | High | Overweight | Light (1-5) |
5 | Alive | 43 | Male | 33 | 66.50 | 172 | 106 | 146 | 127 | 0 | . | 247 | High | High | Overweight | Non-smoker |
There are 1366 patients with a diastolic blood pressure above 90.
Subset the patients (rows) of
heart
that have a diastolic blood pressure less than or equal to 90 and have a weight more than 180 pounds into a dataset calledheart_sub3
. How many are there? Print the first five rows ofheart_sub3
.
DATA heart_sub3;
SET heart;
WHERE diastolic <= 90 AND weight > 180;
RUN;
PROC CONTENTS data = heart_sub3;
RUN;
PROC PRINT data = heart_sub3(OBS=5);
RUN;
The CONTENTS Procedure
Data Set Name | WORK.HEART_SUB3 | Observations | 481 |
---|---|---|---|
Member Type | DATA | Variables | 17 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 21:43:36 | Observation Length | 168 |
Last Modified | 10/17/2020 21:43:36 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 2 |
First Data Page | 1 |
Max Obs per Page | 389 |
Obs in First Data Page | 365 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub3.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671722 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 192KB |
File Size (bytes) | 196608 |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Label |
12 | AgeAtDeath | Num | 8 | Age at Death |
5 | AgeAtStart | Num | 8 | Age at Start |
3 | AgeCHDdiag | Num | 8 | Age CHD Diagnosed |
15 | BP_Status | Char | 7 | Blood Pressure Status |
14 | Chol_Status | Char | 10 | Cholesterol Status |
13 | Cholesterol | Num | 8 | |
2 | DeathCause | Char | 26 | Cause of Death |
8 | Diastolic | Num | 8 | |
6 | Height | Num | 8 | |
10 | MRW | Num | 8 | Metropolitan Relative Weight |
4 | Sex | Char | 6 | |
11 | Smoking | Num | 8 | |
17 | Smoking_Status | Char | 17 | Smoking Status |
1 | Status | Char | 5 | |
9 | Systolic | Num | 8 | |
7 | Weight | Num | 8 | |
16 | Weight_Status | Char | 11 | Weight Status |
Obs | Status | DeathCause | AgeCHDdiag | Sex | AgeAtStart | Height | Weight | Diastolic | Systolic | MRW | Smoking | AgeAtDeath | Cholesterol | Chol_Status | BP_Status | Weight_Status | Smoking_Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Alive | . | Male | 35 | 71.00 | 194 | 68 | 132 | 124 | 0 | . | 211 | Borderline | Normal | Overweight | Non-smoker | |
2 | Alive | . | Female | 50 | 67.50 | 185 | 88 | 150 | 136 | 15 | . | 228 | Borderline | High | Overweight | Moderate (6-15) | |
3 | Alive | . | Male | 42 | 72.25 | 182 | 78 | 136 | 113 | 0 | . | 221 | Borderline | Normal | Overweight | Non-smoker | |
4 | Alive | 68 | Male | 40 | 70.00 | 189 | 78 | 124 | 124 | 0 | . | 319 | High | Normal | Overweight | Non-smoker | |
5 | Alive | 68 | Male | 40 | 70.00 | 195 | 76 | 132 | 128 | 20 | . | 205 | Borderline | Normal | Overweight | Heavy (16-25) |
There are 481 patients with a diastolic blood pressure less than or equal to 90 and with a weight above 180 pounds.
Create a subset of
heart
that only contains the columnsweight
,systolic
, andcholesterol
for those that are heavy smokers smoker (Smoking_Status = “Heavy (16-25)”). Save this toheart_sub
. What are the dimensions of the dataset? Print the first five rows ofheart_sub
.
DATA heart_sub;
SET heart;
WHERE Smoking_Status = "Heavy (16-25)";
KEEP weight systolic cholesterol;
RUN;
PROC CONTENTS data = heart_sub;
RUN;
PROC PRINT data = heart_sub(OBS=5);
RUN;
The CONTENTS Procedure
Data Set Name | WORK.HEART_SUB | Observations | 1046 |
---|---|---|---|
Member Type | DATA | Variables | 3 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 21:47:19 | Observation Length | 24 |
Last Modified | 10/17/2020 21:47:19 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 2714 |
Obs in First Data Page | 1046 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671725 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 128KB |
File Size (bytes) | 131072 |
Alphabetic List of Variables and Attributes | |||
---|---|---|---|
# | Variable | Type | Len |
3 | Cholesterol | Num | 8 |
2 | Systolic | Num | 8 |
1 | Weight | Num | 8 |
Obs | Weight | Systolic | Cholesterol |
---|---|---|---|
1 | 156 | 110 | 281 |
2 | 195 | 132 | 205 |
3 | 181 | 144 | 223 |
4 | 152 | 144 | 263 |
5 | 196 | 128 | . |
The new heart_sub dataset contains 1046 rows and three columns.
Reorder the rows of
heart_sub
in decreasing order by cholesterol. Print the first 5 rows of the sorted dataset.
PROC SORT data = heart_sub out = heart_sub_srt;
BY descending cholesterol;
RUN;
PROC PRINT data = heart_sub_srt(OBS=5);
RUN;
Obs | Weight | Systolic | Cholesterol |
---|---|---|---|
1 | 169 | 155 | 568 |
2 | 205 | 154 | 534 |
3 | 194 | 150 | 429 |
4 | 104 | 122 | 392 |
5 | 137 | 185 | 368 |
17.6. Data Summarization¶
How many bike lanes are currently in Baltimore? You can assume that each observation/row is a different bike lane.
PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
out = bike dbms = csv replace;
getnames = yes;
guessingrows = max;
RUN;
PROC CONTENTS data = bike;
RUN;
The CONTENTS Procedure
Data Set Name | WORK.BIKE | Observations | 1631 |
---|---|---|---|
Member Type | DATA | Variables | 9 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 22:01:39 | Observation Length | 136 |
Last Modified | 10/17/2020 22:01:39 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 4 |
First Data Page | 1 |
Max Obs per Page | 481 |
Obs in First Data Page | 459 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/bike.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671721 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 320KB |
File Size (bytes) | 327680 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
3 | block | Char | 29 | $29. | $29. |
9 | dateInstalled | Num | 8 | BEST12. | BEST32. |
8 | length | Num | 8 | BEST12. | BEST32. |
2 | name | Char | 20 | $20. | $20. |
5 | numLanes | Num | 8 | BEST12. | BEST32. |
6 | project | Char | 25 | $25. | $25. |
7 | route | Char | 13 | $13. | $13. |
1 | subType | Char | 6 | $6. | $6. |
4 | type | Char | 15 | $15. | $15. |
There are 1631 rows.
How many (a) feet and (b) miles of total bike lanes are currently in Baltimore? (The length variable provides the length in feet.)
PROC MEANS data = bike sum;
VAR length;
RUN;
The MEANS Procedure
Analysis Variable : length |
---|
Sum |
439447.59 |
DATA miles;
length_mi = 439447.59 / 5280;
RUN;
PROC PRINT data = miles;
RUN;
Obs | length_mi |
---|---|
1 | 83.2287 |
How many types (type) bike lanes are there? Which type (a) occurs the most and (b) has the longest average bike lane length?
PROC FREQ data = bike;
TABLE type;
RUN;
PROC MEANS data = bike noprint;
CLASS type;
VAR length;
OUTPUT out = type_mean mean = length_mean;
RUN;
PROC SORT data = type_mean;
BY descending length_mean;
RUN;
PROC PRINT DATA = type_mean;
RUN;
The FREQ Procedure
type | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Frequency Missing = 9 | ||||
BIKE BOULEVARD | 49 | 3.02 | 49 | 3.02 |
BIKE LANE | 621 | 38.29 | 670 | 41.31 |
CONTRAFLOW | 13 | 0.80 | 683 | 42.11 |
SHARED BUS BIKE | 39 | 2.40 | 722 | 44.51 |
SHARROW | 589 | 36.31 | 1311 | 80.83 |
SIDEPATH | 7 | 0.43 | 1318 | 81.26 |
SIGNED ROUTE | 304 | 18.74 | 1622 | 100.00 |
Obs | type | _TYPE_ | _FREQ_ | length_mean |
---|---|---|---|---|
1 | SIDEPATH | 1 | 7 | 665.87751019 |
2 | BIKE LANE | 1 | 621 | 300.24183502 |
3 | SHARED BUS BIKE | 1 | 39 | 276.66575695 |
4 | 0 | 1622 | 269.48876975 | |
5 | SIGNED ROUTE | 1 | 304 | 263.57986593 |
6 | SHARROW | 1 | 589 | 243.8718592 |
7 | BIKE BOULEVARD | 1 | 49 | 197.27815495 |
8 | CONTRAFLOW | 1 | 13 | 136.46646088 |
There are 7 bike lane types (8 if you include missing as a category). The lane type that occurrs the most often is bike lane with a count of 621. The lane type sidepath has the longest average length at 665.88 ft.
How many different projects (project) do the bike lanes fall into? Which project category has the longest average bike lane length?
PROC FREQ data = bike;
TABLE project;
RUN;
PROC MEANS data = bike noprint;
CLASS project;
VAR length;
OUTPUT out = project_mean mean = length_mean;
RUN;
PROC SORT data = project_mean;
BY descending length_mean;
RUN;
PROC PRINT DATA = project_mean;
RUN;
The FREQ Procedure
project | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Frequency Missing = 74 | ||||
CHARM CITY CIRCULATOR | 39 | 2.50 | 39 | 2.50 |
COLLEGETOWN | 339 | 21.77 | 378 | 24.28 |
COLLEGETOWN NETWORK | 13 | 0.83 | 391 | 25.11 |
ENGINEERING CONSTRUCTION | 12 | 0.77 | 403 | 25.88 |
GUILFORD AVE BIKE BLVD | 49 | 3.15 | 452 | 29.03 |
MAINTENANCE | 4 | 0.26 | 456 | 29.29 |
OPERATION ORANGE CONE | 458 | 29.42 | 914 | 58.70 |
PARK HEIGHTS BIKE NETWORK | 172 | 11.05 | 1086 | 69.75 |
PLANNING TRAFFIC | 18 | 1.16 | 1104 | 70.91 |
SOUTHEAST BIKE NETWORK | 323 | 20.75 | 1427 | 91.65 |
TRAFFIC | 51 | 3.28 | 1478 | 94.93 |
TRAFFIC CALMING | 79 | 5.07 | 1557 | 100.00 |
Obs | project | _TYPE_ | _FREQ_ | length_mean |
---|---|---|---|---|
1 | MAINTENANCE | 1 | 4 | 1942.1522856 |
2 | ENGINEERING CONSTRUCTION | 1 | 12 | 512.09762483 |
3 | TRAFFIC | 1 | 51 | 419.52878809 |
4 | COLLEGETOWN | 1 | 339 | 320.68363126 |
5 | PARK HEIGHTS BIKE NETWORK | 1 | 172 | 283.22519367 |
6 | CHARM CITY CIRCULATOR | 1 | 39 | 276.66575695 |
7 | 0 | 1557 | 272.0534698 | |
8 | TRAFFIC CALMING | 1 | 79 | 268.53137627 |
9 | OPERATION ORANGE CONE | 1 | 458 | 250.07840273 |
10 | COLLEGETOWN NETWORK | 1 | 13 | 213.63728443 |
11 | SOUTHEAST BIKE NETWORK | 1 | 323 | 210.82827388 |
12 | PLANNING TRAFFIC | 1 | 18 | 209.42893827 |
13 | GUILFORD AVE BIKE BLVD | 1 | 49 | 197.27815495 |
There are 12 project types (13 if you include missing as a project type). The maintenance project has the longest average bike lane length at 1942.15 ft.
What was the average bike lane length per year that they were installed? (Be sure to first set dateInstalled to . if it is equal to zero.)
DATA bike;
SET bike;
IF dateinstalled = 0 THEN dateinstalled = .;
RUN;
PROC MEANS data = bike mean;
CLASS dateinstalled;
VAR length;
RUN;
The MEANS Procedure
Analysis Variable : length | ||
---|---|---|
dateInstalled | N Obs | Mean |
2006 | 2 | 1469.35 |
2007 | 368 | 309.8903287 |
2008 | 206 | 249.4810027 |
2009 | 86 | 407.2947463 |
2010 | 625 | 245.6015777 |
2011 | 101 | 232.8127593 |
2012 | 107 | 271.4321825 |
2013 | 10 | 290.4263702 |
(a) Numerically and (b) graphically describe the distribution of bike lane lengths (length).
PROC UNIVARIATE data = bike plots;
VAR length;
RUN;
The UNIVARIATE Procedure
Variable: length
Moments | |||
---|---|---|---|
N | 1631 | Sum Weights | 1631 |
Mean | 269.434449 | Sum Observations | 439447.587 |
Std Deviation | 277.706646 | Variance | 77120.9813 |
Skewness | 5.1297861 | Kurtosis | 43.3888947 |
Uncorrected SS | 244109518 | Corrected SS | 125707200 |
Coeff Variation | 103.070208 | Std Error Mean | 6.87637089 |
Basic Statistical Measures | |||
---|---|---|---|
Location | Variability | ||
Mean | 269.4344 | Std Deviation | 277.70665 |
Median | 200.3027 | Variance | 77121 |
Mode | . | Range | 3749 |
Interquartile Range | 217.16479 |
Tests for Location: Mu0=0 | ||||
---|---|---|---|---|
Test | Statistic | p Value | ||
Student's t | t | 39.18265 | Pr > |t| | <.0001 |
Sign | M | 815 | Pr >= |M| | <.0001 |
Signed Rank | S | 664632.5 | Pr >= |S| | <.0001 |
Quantiles (Definition 5) | |
---|---|
Level | Quantile |
100% Max | 3749.3226 |
99% | 1341.8291 |
95% | 664.2305 |
90% | 502.2232 |
75% Q3 | 341.1493 |
50% Median | 200.3027 |
25% Q1 | 123.9845 |
10% | 67.8206 |
5% | 45.3655 |
1% | 14.8383 |
0% Min | 0.0000 |
Extreme Observations | |||
---|---|---|---|
Lowest | Highest | ||
Value | Obs | Value | Obs |
0.00000 | 4 | 2536.73 | 1284 |
5.51372 | 523 | 2715.20 | 1344 |
6.85712 | 724 | 2916.53 | 903 |
8.16243 | 490 | 3467.68 | 904 |
9.71914 | 595 | 3749.32 | 3 |
Describe the distribution of bike lane lengths numerically and graphically after stratifying them by (a) type and then by (b) number of lanes (numLanes).
PROC MEANS data=bike mean std;
CLASS type;
VAR length;
RUN;
PROC SGPLOT data = bike;
VBOX length / category = type;
RUN;
PROC MEANS data=bike mean std;
CLASS numlanes;
VAR length;
RUN;
PROC SGPLOT data = bike;
VBOX length / category = numlanes;
RUN;
The MEANS Procedure
Analysis Variable : length | |||
---|---|---|---|
type | N Obs | Mean | Std Dev |
BIKE BOULEVARD | 49 | 197.2781549 | 113.1318019 |
BIKE LANE | 621 | 300.2418350 | 314.6605599 |
CONTRAFLOW | 13 | 136.4664609 | 48.5165725 |
SHARED BUS BIKE | 39 | 276.6657570 | 140.1811711 |
SHARROW | 589 | 243.8718592 | 188.4222254 |
SIDEPATH | 7 | 665.8775102 | 619.3894227 |
SIGNED ROUTE | 304 | 263.5798659 | 347.8501541 |
The MEANS Procedure
Analysis Variable : length | |||
---|---|---|---|
numLanes | N Obs | Mean | Std Dev |
0 | 21 | 308.3767969 | 201.5548096 |
1 | 900 | 277.9016282 | 317.8281529 |
2 | 710 | 257.5495894 | 218.6179448 |