17. Solutions to Exercises¶
17.1. SAS Windowing Environment¶
Use the sample SAS program code provided above and the Charm City Circulator Ridership dataset to experiment with using SAS by making the following changes to the provided code.
Go through and change the colors using the attribute map dataset
palette
to something other than what they originally were. See https://support.sas.com/content/dam/SAS/support/en/books/pro-template-made-easy-a-guide-for-sas-users/62007_Appendix.pdf for a large list of colors in SAS.Change the days you are keeping to show
"Sunday"
instead of “Saturday"
.Change the plot from a line plot to a scatterplot by changing
series
toscatter
inproc sgpanel
.
%let working_directory = data; *Replace data with your path to your
directory containing the dataset;
proc import datafile = "&working_directory/Charm_City_Circulator_Ridership.csv"
out = circ dbms = csv replace;
getnames = yes;
guessingrows = max;
run;
proc sort data = circ;
by day date daily;
run;
proc transpose data = circ out = long (rename=(_NAME_ = var col1 = number));
var orangeboardings -- banneraverage;
by day date daily;
run;
data long2;
set long;
var = tranwrd(var, 'Board', '_Board');
var = tranwrd(var, 'Alight', '_Alight');
var = tranwrd(var, 'Average', '_Average');
line = scan(var, 1, '_');
type = scan(var, 2, '_');
drop var;
run;
data avg;
set long2;
where (type = "Average");
run;
data palette;
input id $ value $ linecolor $;
markercolor = linecolor;
datalines;
lc banner red
lc green yellow
lc orange darkblue
lc purple brown
;
run;
proc sgpanel data = avg dattrmap=palette;
where day in ("Monday" "Tuesday" "Friday" "Sunday");
panelby day;
scatter Y = number X = date / group = line attrid = lc;
run;
17.2. Basic SAS Programming¶
Submit the following simple SAS program. What does it do?
PROC PRINT DATA = sashelp.heart (OBS=10); RUN; PROC MEANS DATA = sashelp.heart; VAR height weight; RUN;
PROC PRINT DATA = sashelp.heart (OBS=10);
RUN;
PROC MEANS DATA = sashelp.heart;
VAR height weight;
RUN;
Obs | Status | DeathCause | AgeCHDdiag | Sex | AgeAtStart | Height | Weight | Diastolic | Systolic | MRW | Smoking | AgeAtDeath | Cholesterol | Chol_Status | BP_Status | Weight_Status | Smoking_Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Dead | Other | . | Female | 29 | 62.50 | 140 | 78 | 124 | 121 | 0 | 55 | . | Normal | Overweight | Non-smoker | |
2 | Dead | Cancer | . | Female | 41 | 59.75 | 194 | 92 | 144 | 183 | 0 | 57 | 181 | Desirable | High | Overweight | Non-smoker |
3 | Alive | . | Female | 57 | 62.25 | 132 | 90 | 170 | 114 | 10 | . | 250 | High | High | Overweight | Moderate (6-15) | |
4 | Alive | . | Female | 39 | 65.75 | 158 | 80 | 128 | 123 | 0 | . | 242 | High | Normal | Overweight | Non-smoker | |
5 | Alive | . | Male | 42 | 66.00 | 156 | 76 | 110 | 116 | 20 | . | 281 | High | Optimal | Overweight | Heavy (16-25) | |
6 | Alive | . | Female | 58 | 61.75 | 131 | 92 | 176 | 117 | 0 | . | 196 | Desirable | High | Overweight | Non-smoker | |
7 | Alive | . | Female | 36 | 64.75 | 136 | 80 | 112 | 110 | 15 | . | 196 | Desirable | Normal | Overweight | Moderate (6-15) | |
8 | Dead | Other | . | Male | 53 | 65.50 | 130 | 80 | 114 | 99 | 0 | 77 | 276 | High | Normal | Normal | Non-smoker |
9 | Alive | . | Male | 35 | 71.00 | 194 | 68 | 132 | 124 | 0 | . | 211 | Borderline | Normal | Overweight | Non-smoker | |
10 | Dead | Cerebral Vascular Disease | . | Male | 52 | 62.50 | 129 | 78 | 124 | 106 | 5 | 82 | 284 | High | Normal | Normal | Light (1-5) |
The MEANS Procedure
Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
Height
Weight
|
5203
5203
|
64.8131847
153.0866808
|
3.5827074
28.9154261
|
51.5000000
67.0000000
|
76.5000000
300.0000000
|
This program prints out the first 10 rows of the sashelp.heart dataset and the summary statistics N (sample size), mean, standard deviation, minimum and maxium for the Height and Weight variables.
Submit the following small program. How many DATA steps are there? How man PROC steps? Which steps are executed?
DATA work.quant; SET sashelp.heart; /* Read in from the heart datset */ KEEP height weight diastolic systolic; *Only keep these variable in the new dataset; RUN; PROC PRINT DATA = work.quant (OBS=10); RUN; /* PROC MEANS DATA = work.quant MIN Q1 Median Q3 MAX MEAN STD; RUN; */
DATA work.quant;
SET sashelp.heart; /* Read in from the heart datset */
KEEP height weight diastolic systolic; *Only keep these variable
in the new dataset;
RUN;
PROC PRINT DATA = work.quant (OBS=10);
RUN;
/*
PROC MEANS DATA = work.quant MIN Q1 Median Q3 MAX MEAN STD;
RUN;
*/
Obs | Height | Weight | Diastolic | Systolic |
---|---|---|---|---|
1 | 62.50 | 140 | 78 | 124 |
2 | 59.75 | 194 | 92 | 144 |
3 | 62.25 | 132 | 90 | 170 |
4 | 65.75 | 158 | 80 | 128 |
5 | 66.00 | 156 | 76 | 110 |
6 | 61.75 | 131 | 92 | 176 |
7 | 64.75 | 136 | 80 | 112 |
8 | 65.50 | 130 | 80 | 114 |
9 | 71.00 | 194 | 68 | 132 |
10 | 62.50 | 129 | 78 | 124 |
There is one DATA step and there are two PROC steps; however, the last PROC MEANS is not run since it is within a paragraph comment.
Run PROC CONTENTS on the dataset sashelp.heart. How many observations are in this datset?
PROC CONTENTS data = sashelp.heart;
RUN;
The CONTENTS Procedure
Data Set Name | SASHELP.HEART | Observations | 5209 |
---|---|---|---|
Member Type | DATA | Variables | 17 |
Engine | V9 | Indexes | 0 |
Created | 10/25/2018 02:21:26 | Observation Length | 168 |
Last Modified | 10/25/2018 02:21:26 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | Framingham Heart Study | ||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | us-ascii ASCII (ANSI) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 14 |
First Data Page | 1 |
Max Obs per Page | 389 |
Obs in First Data Page | 365 |
Number of Data Set Repairs | 0 |
Filename | /opt/sasinside/SASHome/SASFoundation/9.4/sashelp/heart.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 6235 |
Access Permission | rw-r--r-- |
Owner Name | sas |
File Size | 960KB |
File Size (bytes) | 983040 |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Label |
12 | AgeAtDeath | Num | 8 | Age at Death |
5 | AgeAtStart | Num | 8 | Age at Start |
3 | AgeCHDdiag | Num | 8 | Age CHD Diagnosed |
15 | BP_Status | Char | 7 | Blood Pressure Status |
14 | Chol_Status | Char | 10 | Cholesterol Status |
13 | Cholesterol | Num | 8 | |
2 | DeathCause | Char | 26 | Cause of Death |
8 | Diastolic | Num | 8 | |
6 | Height | Num | 8 | |
10 | MRW | Num | 8 | Metropolitan Relative Weight |
4 | Sex | Char | 6 | |
11 | Smoking | Num | 8 | |
17 | Smoking_Status | Char | 17 | Smoking Status |
1 | Status | Char | 5 | |
9 | Systolic | Num | 8 | |
7 | Weight | Num | 8 | |
16 | Weight_Status | Char | 11 | Weight Status |
There are 5209 observations (rows) in the dataset.
Submit the following program to SAS. What is the value of CurrentDate? (This value represents the number of days since Januar 1, 1960, the reference date for the SAS system).
DATA work.date; CurrentDate = Today(); RUN; PROC PRINT DATA = work.date; RUN;
DATA work.date;
CurrentDate = Today();
RUN;
PROC PRINT DATA = work.date;
RUN;
PROC PRINT DATA = work.date;
format currentdate mmddyy10.;
RUN;
Obs | CurrentDate |
---|---|
1 | 22205 |
Obs | CurrentDate |
---|---|
1 | 10/17/2020 |
The value of current date will change depending on when you submit this program. I submitted this program in 10/17/2020 and received the value 22205 or 22205 days since January 1, 1960.
17.3. Data I/O¶
Use PROC IMPORT to read in the Youth Tobacco study, Youth_Tobacco_Survey_YTS_Data.csv and name it youth.
PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Youth_Tobacco_Survey_YTS_Data.csv"
out = youth dbms = csv replace;
getnames = yes;
guessingrows = 500;
RUN;
PROC PRINT data = youth (OBS=5);
RUN;
Obs | YEAR | LocationAbbr | LocationDesc | TopicType | TopicDesc | MeasureDesc | DataSource | Response | Data_Value_Unit | Data_Value_Type | Data_Value | Data_Value_Footnote_Symbol | Data_Value_Footnote | Data_Value_Std_Err | Low_Confidence_Limit | High_Confidence_Limit | Sample_Size | Gender | Race | Age | Education | GeoLocation | TopicTypeId | TopicId | MeasureId | StratificationID1 | StratificationID2 | StratificationID3 | StratificationID4 | SubMeasureID | DisplayOrder |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2015 | AZ | Arizona | Tobacco Use - Survey Data | Cessation (Youth) | Percent of Current Smokers Who Want to Quit | YTS | % | Percentage | . | * | Data in these cells have been suppressed because of a small sample size | . | . | . | . | Overall | All Races | All Ages | Middle School | (34.865970280000454, -111.76381127699972) | BEH | 105BEH | 170CES | 1GEN | 8AGE | 6RAC | 1EDU | YTS01 | 1 | |
2 | 2015 | AZ | Arizona | Tobacco Use - Survey Data | Cessation (Youth) | Percent of Current Smokers Who Want to Quit | YTS | % | Percentage | . | * | Data in these cells have been suppressed because of a small sample size | . | . | . | . | Male | All Races | All Ages | Middle School | (34.865970280000454, -111.76381127699972) | BEH | 105BEH | 170CES | 2GEN | 8AGE | 6RAC | 1EDU | YTS02 | 2 | |
3 | 2015 | AZ | Arizona | Tobacco Use - Survey Data | Cessation (Youth) | Percent of Current Smokers Who Want to Quit | YTS | % | Percentage | . | * | Data in these cells have been suppressed because of a small sample size | . | . | . | . | Female | All Races | All Ages | Middle School | (34.865970280000454, -111.76381127699972) | BEH | 105BEH | 170CES | 3GEN | 8AGE | 6RAC | 1EDU | YTS03 | 3 | |
4 | 2015 | AZ | Arizona | Tobacco Use - Survey Data | Cessation (Youth) | Quit Attempt in Past Year Among Current Cigarette Smokers | YTS | % | Percentage | . | * | Data in these cells have been suppressed because of a small sample size | . | . | . | . | Overall | All Races | All Ages | Middle School | (34.865970280000454, -111.76381127699972) | BEH | 105BEH | 169QUA | 1GEN | 8AGE | 6RAC | 1EDU | YTS04 | 4 | |
5 | 2015 | AZ | Arizona | Tobacco Use - Survey Data | Cessation (Youth) | Quit Attempt in Past Year Among Current Cigarette Smokers | YTS | % | Percentage | . | * | Data in these cells have been suppressed because of a small sample size | . | . | . | . | Male | All Races | All Ages | Middle School | (34.865970280000454, -111.76381127699972) | BEH | 105BEH | 169QUA | 2GEN | 8AGE | 6RAC | 1EDU | YTS05 | 5 |
Download an Excel version of the Monuments dataset, Monuments.xlsx, from CANVAS. Use PROC IMPORT to read in the dataset and call the output mon.
PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Monuments.xlsx"
out = mon dbms = xlsx replace;
getnames = yes;
RUN;
PROC PRINT data = mon (OBS=5);
RUN;
Obs | name | zipCode | neighborhood | councilDistrict | policeDistrict | Location 1 |
---|---|---|---|---|---|---|
1 | James Cardinal Gibbons | 21201 | Downtown | 11 | CENTRAL | 408 CHARLES ST Baltimore, MD |
2 | The Battle Monument | 21202 | Downtown | 11 | CENTRAL | |
3 | Negro Heroes of the U.S Monument | 21202 | Downtown | 11 | CENTRAL | |
4 | Star Bangled Banner | 21202 | Downtown | 11 | CENTRAL | 100 HOLLIDAY ST Baltimore, MD |
5 | Flame at the Holocaust Monument | 21202 | Downtown | 11 | CENTRAL | 50 MARKET PL Baltimore, MD |
Write a DATA step that will read in the following column formatted data using DATALINES and print the results using PROC PRINT. This dataset contains information on esophageal cancer in Ille-et-Vilaine, France. The variables are
Age group
Alcohol consumption
Tobacco consumption
Number of Cases
Number of Controls
Write a DATA step to read in the data and print the results. Note: the first line is to help you count the columns. Each number corresponds to a multiple of 5.
----1----2----3----4----5----6----7----8 75+ 0-39g/day 0-9g/day 1 18 75+ 0-39g/day 10-19 2 6 75+ 0-39g/day 30+ 1 3 75+ 40-79 0-9g/day 2 5 75+ 40-79 10-19 1 3 75+ 40-79 20-29 0 3 75+ 40-79 30+ 1 1 75+ 80-119 0-9g/day 1 1 75+ 80-119 10-19 1 1 75+ 120+ 0-9g/day 2 2 75+ 120+ 10-19 1 1
DATA eso;
INPUT aggrp $ 1-3 alc $ 5-13 smoke $ 15-22 ncases 29 ncontrl 38-39;
DATALINES;
75+ 0-39g/day 0-9g/day 1 18
75+ 0-39g/day 10-19 2 6
75+ 0-39g/day 30+ 1 3
75+ 40-79 0-9g/day 2 5
75+ 40-79 10-19 1 3
75+ 40-79 20-29 0 3
75+ 40-79 30+ 1 1
75+ 80-119 0-9g/day 1 1
75+ 80-119 10-19 1 1
75+ 120+ 0-9g/day 2 2
75+ 120+ 10-19 1 1
;
RUN;
PROC PRINT DATA=eso;
RUN;
SAS Connection established. Subprocess id is 2198
Obs | aggrp | alc | smoke | ncases | ncontrl |
---|---|---|---|---|---|
1 | 75+ | 0-39g/day | 0-9g/day | 1 | 18 |
2 | 75+ | 0-39g/day | 10-19 | 2 | 6 |
3 | 75+ | 0-39g/day | 30+ | 1 | 3 |
4 | 75+ | 40-79 | 0-9g/day | 2 | 5 |
5 | 75+ | 40-79 | 10-19 | 1 | 3 |
6 | 75+ | 40-79 | 20-29 | 0 | 3 |
7 | 75+ | 40-79 | 30+ | 1 | 1 |
8 | 75+ | 80-119 | 0-9g/day | 1 | 1 |
9 | 75+ | 80-119 | 10-19 | 1 | 1 |
10 | 75+ | 120+ | 0-9g/day | 2 | 2 |
11 | 75+ | 120+ | 10-19 | 1 | 1 |
17.4. SAS Variables and Assignment Statements¶
Use the following SAS data step to create the new variables
Grade
andCourse
defined below:
data school; input Age Quiz : $1. Midterm Final; /* Add you statements here */ datalines; 12 A 92 95 12 B 88 88 13 C 78 75 13 A 92 93 12 F 55 62 13 B 88 82 ;
Using If-Then-Else statements, compute two new variables as follows:
Grade (numeric), with a value of 6 if Age is 12 and a value of 8 if Age is 13.
The quiz grades have numerical equivalents as follows: A = 95, B = 85, C = 75, D = 70, and F = 65. Using this information, compute a course grade (Course) as a weighted average of the Quiz (20%), Midterm (30%) and Final (50%).
data school;
input Age Quiz : $1. Midterm Final;
IF Age = 12 THEN Grade = 6;
ELSE IF AGE = 13 THEN Grade = 8;
IF quiz = "A" THEN Course = 0.2*95 + 0.3*midterm + 0.5*final;
ELSE IF quiz = "B" THEN Course = 0.2*85 + 0.3*midterm + 0.5*final;
ELSE IF quiz = "C" THEN Course = 0.2*75 + 0.3*midterm + 0.5*final;
ELSE IF quiz = "D" THEN Course = 0.2*70 + 0.3*midterm + 0.5*final;
ELSE IF quiz = "F" THEN Course = 0.2*65 + 0.3*midterm + 0.5*final;
datalines;
12 A 92 95
12 B 88 88
13 C 78 75
13 A 92 93
12 F 55 62
13 B 88 82
;
RUN;
PROC PRINT DATA = school;
RUN;
Obs | Age | Quiz | Midterm | Final | Grade | Course |
---|---|---|---|---|---|---|
1 | 12 | A | 92 | 95 | 6 | 94.1 |
2 | 12 | B | 88 | 88 | 6 | 87.4 |
3 | 13 | C | 78 | 75 | 8 | 75.9 |
4 | 13 | A | 92 | 93 | 8 | 93.1 |
5 | 12 | F | 55 | 62 | 6 | 60.5 |
6 | 13 | B | 88 | 82 | 8 | 84.4 |
17.5. Working with Your Data¶
Create a temporary SAS dataset called
heart
from the sashelp.heart dataset. The library sashelp if a permanent library that comes with SAS.
DATA heart;
SET sashelp.heart;
RUN;
1348 ods listing close;ods html5 (id=saspy_internal) file=stdout options(bitmap_mode='inline') device=svg style=HTMLBlue; ods
1348! graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: STDOUT
1349
1350 DATA heart;
1351 SET sashelp.heart;
1352 RUN;
NOTE: There were 5209 observations read from the data set SASHELP.HEART.
NOTE: The data set WORK.HEART has 5209 observations and 17 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
1353
1354 ods html5 (id=saspy_internal) close;ods listing;
1355
View the description of the
heart
dataset by running PROC CONTENTS. How many observations (rows) and variables (columns) are in the dataset?
PROC CONTENTS DATA = heart;
RUN;
The CONTENTS Procedure
Data Set Name | WORK.HEART | Observations | 5209 |
---|---|---|---|
Member Type | DATA | Variables | 17 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 21:35:04 | Observation Length | 168 |
Last Modified | 10/17/2020 21:35:04 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 14 |
First Data Page | 1 |
Max Obs per Page | 389 |
Obs in First Data Page | 365 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671720 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 960KB |
File Size (bytes) | 983040 |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Label |
12 | AgeAtDeath | Num | 8 | Age at Death |
5 | AgeAtStart | Num | 8 | Age at Start |
3 | AgeCHDdiag | Num | 8 | Age CHD Diagnosed |
15 | BP_Status | Char | 7 | Blood Pressure Status |
14 | Chol_Status | Char | 10 | Cholesterol Status |
13 | Cholesterol | Num | 8 | |
2 | DeathCause | Char | 26 | Cause of Death |
8 | Diastolic | Num | 8 | |
6 | Height | Num | 8 | |
10 | MRW | Num | 8 | Metropolitan Relative Weight |
4 | Sex | Char | 6 | |
11 | Smoking | Num | 8 | |
17 | Smoking_Status | Char | 17 | Smoking Status |
9 | Systolic | Num | 8 | |
7 | Weight | Num | 8 | |
16 | Weight_Status | Char | 11 | Weight Status |
1 | condition | Char | 5 |
There are 5209 observations (rows) and 17 variables (columns).
Use the RENAME= option to rename
Status
tocondition
. Show the variables table from PROC CONTENTS to see that the name has changed.
DATA heart2 (RENAME = (status = condition));
SET heart;
RUN;
PROC CONTENTS data = heart2;
RUN;
The CONTENTS Procedure
Data Set Name | WORK.HEART2 | Observations | 5209 |
---|---|---|---|
Member Type | DATA | Variables | 17 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 21:39:47 | Observation Length | 168 |
Last Modified | 10/17/2020 21:39:47 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 14 |
First Data Page | 1 |
Max Obs per Page | 389 |
Obs in First Data Page | 365 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart2.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671720 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 960KB |
File Size (bytes) | 983040 |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Label |
12 | AgeAtDeath | Num | 8 | Age at Death |
5 | AgeAtStart | Num | 8 | Age at Start |
3 | AgeCHDdiag | Num | 8 | Age CHD Diagnosed |
15 | BP_Status | Char | 7 | Blood Pressure Status |
14 | Chol_Status | Char | 10 | Cholesterol Status |
13 | Cholesterol | Num | 8 | |
2 | DeathCause | Char | 26 | Cause of Death |
8 | Diastolic | Num | 8 | |
6 | Height | Num | 8 | |
10 | MRW | Num | 8 | Metropolitan Relative Weight |
4 | Sex | Char | 6 | |
11 | Smoking | Num | 8 | |
17 | Smoking_Status | Char | 17 | Smoking Status |
9 | Systolic | Num | 8 | |
7 | Weight | Num | 8 | |
16 | Weight_Status | Char | 11 | Weight Status |
1 | condition | Char | 5 |
Create a subset of
heart
calledheart_sub
that contains only the columnsheight
,weight
, anddiastolic
. What are the dimensions ofheart_sub
? Print the first five rows ofheart_sub
.
DATA heart_sub;
SET heart;
KEEP height weight diastolic;
RUN;
PROC CONTENTS data = heart_sub;
RUN;
PROC PRINT data = heart_sub(OBS=5);
RUN;
The CONTENTS Procedure
Data Set Name | WORK.HEART_SUB | Observations | 5209 |
---|---|---|---|
Member Type | DATA | Variables | 3 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 21:39:50 | Observation Length | 24 |
Last Modified | 10/17/2020 21:39:50 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 2 |
First Data Page | 1 |
Max Obs per Page | 2714 |
Obs in First Data Page | 2625 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671721 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 192KB |
File Size (bytes) | 196608 |
Alphabetic List of Variables and Attributes | |||
---|---|---|---|
# | Variable | Type | Len |
3 | Diastolic | Num | 8 |
1 | Height | Num | 8 |
2 | Weight | Num | 8 |
Obs | Height | Weight | Diastolic |
---|---|---|---|
1 | 62.50 | 140 | 78 |
2 | 59.75 | 194 | 92 |
3 | 62.25 | 132 | 90 |
4 | 65.75 | 158 | 80 |
5 | 66.00 | 156 | 76 |
The new dataset heart_sub
contains 5209 rows and 3 columns.
Subset the patients (rows) of
heart
that have a diastolic blood pressure (diastolic) more than 90 into a dataset calledheart_sub2
. How many are there? Print the first five rows ofheart_sub2
.
DATA heart_sub2;
SET heart;
WHERE diastolic > 90;
RUN;
PROC CONTENTS data = heart_sub2;
RUN;
PROC PRINT data = heart_sub2(OBS=5);
RUN;
The CONTENTS Procedure
Data Set Name | WORK.HEART_SUB2 | Observations | 1366 |
---|---|---|---|
Member Type | DATA | Variables | 17 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 21:39:55 | Observation Length | 168 |
Last Modified | 10/17/2020 21:39:55 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 4 |
First Data Page | 1 |
Max Obs per Page | 389 |
Obs in First Data Page | 365 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub2.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671723 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 320KB |
File Size (bytes) | 327680 |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Label |
12 | AgeAtDeath | Num | 8 | Age at Death |
5 | AgeAtStart | Num | 8 | Age at Start |
3 | AgeCHDdiag | Num | 8 | Age CHD Diagnosed |
15 | BP_Status | Char | 7 | Blood Pressure Status |
14 | Chol_Status | Char | 10 | Cholesterol Status |
13 | Cholesterol | Num | 8 | |
2 | DeathCause | Char | 26 | Cause of Death |
8 | Diastolic | Num | 8 | |
6 | Height | Num | 8 | |
10 | MRW | Num | 8 | Metropolitan Relative Weight |
4 | Sex | Char | 6 | |
11 | Smoking | Num | 8 | |
17 | Smoking_Status | Char | 17 | Smoking Status |
1 | Status | Char | 5 | |
9 | Systolic | Num | 8 | |
7 | Weight | Num | 8 | |
16 | Weight_Status | Char | 11 | Weight Status |
Obs | Status | DeathCause | AgeCHDdiag | Sex | AgeAtStart | Height | Weight | Diastolic | Systolic | MRW | Smoking | AgeAtDeath | Cholesterol | Chol_Status | BP_Status | Weight_Status | Smoking_Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Dead | Cancer | . | Female | 41 | 59.75 | 194 | 92 | 144 | 183 | 0 | 57 | 181 | Desirable | High | Overweight | Non-smoker |
2 | Alive | . | Female | 58 | 61.75 | 131 | 92 | 176 | 117 | 0 | . | 196 | Desirable | High | Overweight | Non-smoker | |
3 | Alive | . | Female | 42 | 67.75 | 162 | 96 | 138 | 119 | 1 | . | 200 | Borderline | High | Overweight | Light (1-5) | |
4 | Dead | Coronary Heart Disease | 71 | Female | 49 | 60.50 | 153 | 110 | 196 | 140 | 5 | 73 | 221 | Borderline | High | Overweight | Light (1-5) |
5 | Alive | 43 | Male | 33 | 66.50 | 172 | 106 | 146 | 127 | 0 | . | 247 | High | High | Overweight | Non-smoker |
There are 1366 patients with a diastolic blood pressure above 90.
Subset the patients (rows) of
heart
that have a diastolic blood pressure less than or equal to 90 and have a weight more than 180 pounds into a dataset calledheart_sub3
. How many are there? Print the first five rows ofheart_sub3
.
DATA heart_sub3;
SET heart;
WHERE diastolic <= 90 AND weight > 180;
RUN;
PROC CONTENTS data = heart_sub3;
RUN;
PROC PRINT data = heart_sub3(OBS=5);
RUN;
The CONTENTS Procedure
Data Set Name | WORK.HEART_SUB3 | Observations | 481 |
---|---|---|---|
Member Type | DATA | Variables | 17 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 21:43:36 | Observation Length | 168 |
Last Modified | 10/17/2020 21:43:36 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 2 |
First Data Page | 1 |
Max Obs per Page | 389 |
Obs in First Data Page | 365 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub3.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671722 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 192KB |
File Size (bytes) | 196608 |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Label |
12 | AgeAtDeath | Num | 8 | Age at Death |
5 | AgeAtStart | Num | 8 | Age at Start |
3 | AgeCHDdiag | Num | 8 | Age CHD Diagnosed |
15 | BP_Status | Char | 7 | Blood Pressure Status |
14 | Chol_Status | Char | 10 | Cholesterol Status |
13 | Cholesterol | Num | 8 | |
2 | DeathCause | Char | 26 | Cause of Death |
8 | Diastolic | Num | 8 | |
6 | Height | Num | 8 | |
10 | MRW | Num | 8 | Metropolitan Relative Weight |
4 | Sex | Char | 6 | |
11 | Smoking | Num | 8 | |
17 | Smoking_Status | Char | 17 | Smoking Status |
1 | Status | Char | 5 | |
9 | Systolic | Num | 8 | |
7 | Weight | Num | 8 | |
16 | Weight_Status | Char | 11 | Weight Status |
Obs | Status | DeathCause | AgeCHDdiag | Sex | AgeAtStart | Height | Weight | Diastolic | Systolic | MRW | Smoking | AgeAtDeath | Cholesterol | Chol_Status | BP_Status | Weight_Status | Smoking_Status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Alive | . | Male | 35 | 71.00 | 194 | 68 | 132 | 124 | 0 | . | 211 | Borderline | Normal | Overweight | Non-smoker | |
2 | Alive | . | Female | 50 | 67.50 | 185 | 88 | 150 | 136 | 15 | . | 228 | Borderline | High | Overweight | Moderate (6-15) | |
3 | Alive | . | Male | 42 | 72.25 | 182 | 78 | 136 | 113 | 0 | . | 221 | Borderline | Normal | Overweight | Non-smoker | |
4 | Alive | 68 | Male | 40 | 70.00 | 189 | 78 | 124 | 124 | 0 | . | 319 | High | Normal | Overweight | Non-smoker | |
5 | Alive | 68 | Male | 40 | 70.00 | 195 | 76 | 132 | 128 | 20 | . | 205 | Borderline | Normal | Overweight | Heavy (16-25) |
There are 481 patients with a diastolic blood pressure less than or equal to 90 and with a weight above 180 pounds.
Create a subset of
heart
that only contains the columnsweight
,systolic
, andcholesterol
for those that are heavy smokers smoker (Smoking_Status = “Heavy (16-25)”). Save this toheart_sub
. What are the dimensions of the dataset? Print the first five rows ofheart_sub
.
DATA heart_sub;
SET heart;
WHERE Smoking_Status = "Heavy (16-25)";
KEEP weight systolic cholesterol;
RUN;
PROC CONTENTS data = heart_sub;
RUN;
PROC PRINT data = heart_sub(OBS=5);
RUN;
The CONTENTS Procedure
Data Set Name | WORK.HEART_SUB | Observations | 1046 |
---|---|---|---|
Member Type | DATA | Variables | 3 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 21:47:19 | Observation Length | 24 |
Last Modified | 10/17/2020 21:47:19 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 2714 |
Obs in First Data Page | 1046 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671725 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 128KB |
File Size (bytes) | 131072 |
Alphabetic List of Variables and Attributes | |||
---|---|---|---|
# | Variable | Type | Len |
3 | Cholesterol | Num | 8 |
2 | Systolic | Num | 8 |
1 | Weight | Num | 8 |
Obs | Weight | Systolic | Cholesterol |
---|---|---|---|
1 | 156 | 110 | 281 |
2 | 195 | 132 | 205 |
3 | 181 | 144 | 223 |
4 | 152 | 144 | 263 |
5 | 196 | 128 | . |
The new heart_sub dataset contains 1046 rows and three columns.
Reorder the rows of
heart_sub
in decreasing order by cholesterol. Print the first 5 rows of the sorted dataset.
PROC SORT data = heart_sub out = heart_sub_srt;
BY descending cholesterol;
RUN;
PROC PRINT data = heart_sub_srt(OBS=5);
RUN;
Obs | Weight | Systolic | Cholesterol |
---|---|---|---|
1 | 169 | 155 | 568 |
2 | 205 | 154 | 534 |
3 | 194 | 150 | 429 |
4 | 104 | 122 | 392 |
5 | 137 | 185 | 368 |
17.6. Data Summarization¶
How many bike lanes are currently in Baltimore? You can assume that each observation/row is a different bike lane.
PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
out = bike dbms = csv replace;
getnames = yes;
guessingrows = max;
RUN;
PROC CONTENTS data = bike;
RUN;
The CONTENTS Procedure
Data Set Name | WORK.BIKE | Observations | 1631 |
---|---|---|---|
Member Type | DATA | Variables | 9 |
Engine | V9 | Indexes | 0 |
Created | 10/17/2020 22:01:39 | Observation Length | 136 |
Last Modified | 10/17/2020 22:01:39 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 4 |
First Data Page | 1 |
Max Obs per Page | 481 |
Obs in First Data Page | 459 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/bike.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671721 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 320KB |
File Size (bytes) | 327680 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
3 | block | Char | 29 | $29. | $29. |
9 | dateInstalled | Num | 8 | BEST12. | BEST32. |
8 | length | Num | 8 | BEST12. | BEST32. |
2 | name | Char | 20 | $20. | $20. |
5 | numLanes | Num | 8 | BEST12. | BEST32. |
6 | project | Char | 25 | $25. | $25. |
7 | route | Char | 13 | $13. | $13. |
1 | subType | Char | 6 | $6. | $6. |
4 | type | Char | 15 | $15. | $15. |
There are 1631 rows.
How many (a) feet and (b) miles of total bike lanes are currently in Baltimore? (The length variable provides the length in feet.)
PROC MEANS data = bike sum;
VAR length;
RUN;
The MEANS Procedure
Analysis Variable : length |
---|
Sum |
439447.59 |
DATA miles;
length_mi = 439447.59 / 5280;
RUN;
PROC PRINT data = miles;
RUN;
Obs | length_mi |
---|---|
1 | 83.2287 |
How many types (type) bike lanes are there? Which type (a) occurs the most and (b) has the longest average bike lane length?
PROC FREQ data = bike;
TABLE type;
RUN;
PROC MEANS data = bike noprint;
CLASS type;
VAR length;
OUTPUT out = type_mean mean = length_mean;
RUN;
PROC SORT data = type_mean;
BY descending length_mean;
RUN;
PROC PRINT DATA = type_mean;
RUN;
The FREQ Procedure
type | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Frequency Missing = 9 | ||||
BIKE BOULEVARD | 49 | 3.02 | 49 | 3.02 |
BIKE LANE | 621 | 38.29 | 670 | 41.31 |
CONTRAFLOW | 13 | 0.80 | 683 | 42.11 |
SHARED BUS BIKE | 39 | 2.40 | 722 | 44.51 |
SHARROW | 589 | 36.31 | 1311 | 80.83 |
SIDEPATH | 7 | 0.43 | 1318 | 81.26 |
SIGNED ROUTE | 304 | 18.74 | 1622 | 100.00 |
Obs | type | _TYPE_ | _FREQ_ | length_mean |
---|---|---|---|---|
1 | SIDEPATH | 1 | 7 | 665.87751019 |
2 | BIKE LANE | 1 | 621 | 300.24183502 |
3 | SHARED BUS BIKE | 1 | 39 | 276.66575695 |
4 | 0 | 1622 | 269.48876975 | |
5 | SIGNED ROUTE | 1 | 304 | 263.57986593 |
6 | SHARROW | 1 | 589 | 243.8718592 |
7 | BIKE BOULEVARD | 1 | 49 | 197.27815495 |
8 | CONTRAFLOW | 1 | 13 | 136.46646088 |
There are 7 bike lane types (8 if you include missing as a category). The lane type that occurrs the most often is bike lane with a count of 621. The lane type sidepath has the longest average length at 665.88 ft.
How many different projects (project) do the bike lanes fall into? Which project category has the longest average bike lane length?
PROC FREQ data = bike;
TABLE project;
RUN;
PROC MEANS data = bike noprint;
CLASS project;
VAR length;
OUTPUT out = project_mean mean = length_mean;
RUN;
PROC SORT data = project_mean;
BY descending length_mean;
RUN;
PROC PRINT DATA = project_mean;
RUN;
The FREQ Procedure
project | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Frequency Missing = 74 | ||||
CHARM CITY CIRCULATOR | 39 | 2.50 | 39 | 2.50 |
COLLEGETOWN | 339 | 21.77 | 378 | 24.28 |
COLLEGETOWN NETWORK | 13 | 0.83 | 391 | 25.11 |
ENGINEERING CONSTRUCTION | 12 | 0.77 | 403 | 25.88 |
GUILFORD AVE BIKE BLVD | 49 | 3.15 | 452 | 29.03 |
MAINTENANCE | 4 | 0.26 | 456 | 29.29 |
OPERATION ORANGE CONE | 458 | 29.42 | 914 | 58.70 |
PARK HEIGHTS BIKE NETWORK | 172 | 11.05 | 1086 | 69.75 |
PLANNING TRAFFIC | 18 | 1.16 | 1104 | 70.91 |
SOUTHEAST BIKE NETWORK | 323 | 20.75 | 1427 | 91.65 |
TRAFFIC | 51 | 3.28 | 1478 | 94.93 |
TRAFFIC CALMING | 79 | 5.07 | 1557 | 100.00 |
Obs | project | _TYPE_ | _FREQ_ | length_mean |
---|---|---|---|---|
1 | MAINTENANCE | 1 | 4 | 1942.1522856 |
2 | ENGINEERING CONSTRUCTION | 1 | 12 | 512.09762483 |
3 | TRAFFIC | 1 | 51 | 419.52878809 |
4 | COLLEGETOWN | 1 | 339 | 320.68363126 |
5 | PARK HEIGHTS BIKE NETWORK | 1 | 172 | 283.22519367 |
6 | CHARM CITY CIRCULATOR | 1 | 39 | 276.66575695 |
7 | 0 | 1557 | 272.0534698 | |
8 | TRAFFIC CALMING | 1 | 79 | 268.53137627 |
9 | OPERATION ORANGE CONE | 1 | 458 | 250.07840273 |
10 | COLLEGETOWN NETWORK | 1 | 13 | 213.63728443 |
11 | SOUTHEAST BIKE NETWORK | 1 | 323 | 210.82827388 |
12 | PLANNING TRAFFIC | 1 | 18 | 209.42893827 |
13 | GUILFORD AVE BIKE BLVD | 1 | 49 | 197.27815495 |
There are 12 project types (13 if you include missing as a project type). The maintenance project has the longest average bike lane length at 1942.15 ft.
What was the average bike lane length per year that they were installed? (Be sure to first set dateInstalled to . if it is equal to zero.)
DATA bike;
SET bike;
IF dateinstalled = 0 THEN dateinstalled = .;
RUN;
PROC MEANS data = bike mean;
CLASS dateinstalled;
VAR length;
RUN;
The MEANS Procedure
Analysis Variable : length | ||
---|---|---|
dateInstalled | N Obs | Mean |
2006 | 2 | 1469.35 |
2007 | 368 | 309.8903287 |
2008 | 206 | 249.4810027 |
2009 | 86 | 407.2947463 |
2010 | 625 | 245.6015777 |
2011 | 101 | 232.8127593 |
2012 | 107 | 271.4321825 |
2013 | 10 | 290.4263702 |
(a) Numerically and (b) graphically describe the distribution of bike lane lengths (length).
PROC UNIVARIATE data = bike plots;
VAR length;
RUN;
The UNIVARIATE Procedure
Variable: length
Moments | |||
---|---|---|---|
N | 1631 | Sum Weights | 1631 |
Mean | 269.434449 | Sum Observations | 439447.587 |
Std Deviation | 277.706646 | Variance | 77120.9813 |
Skewness | 5.1297861 | Kurtosis | 43.3888947 |
Uncorrected SS | 244109518 | Corrected SS | 125707200 |
Coeff Variation | 103.070208 | Std Error Mean | 6.87637089 |
Basic Statistical Measures | |||
---|---|---|---|
Location | Variability | ||
Mean | 269.4344 | Std Deviation | 277.70665 |
Median | 200.3027 | Variance | 77121 |
Mode | . | Range | 3749 |
Interquartile Range | 217.16479 |
Tests for Location: Mu0=0 | ||||
---|---|---|---|---|
Test | Statistic | p Value | ||
Student's t | t | 39.18265 | Pr > |t| | <.0001 |
Sign | M | 815 | Pr >= |M| | <.0001 |
Signed Rank | S | 664632.5 | Pr >= |S| | <.0001 |
Quantiles (Definition 5) | |
---|---|
Level | Quantile |
100% Max | 3749.3226 |
99% | 1341.8291 |
95% | 664.2305 |
90% | 502.2232 |
75% Q3 | 341.1493 |
50% Median | 200.3027 |
25% Q1 | 123.9845 |
10% | 67.8206 |
5% | 45.3655 |
1% | 14.8383 |
0% Min | 0.0000 |
Extreme Observations | |||
---|---|---|---|
Lowest | Highest | ||
Value | Obs | Value | Obs |
0.00000 | 4 | 2536.73 | 1284 |
5.51372 | 523 | 2715.20 | 1344 |
6.85712 | 724 | 2916.53 | 903 |
8.16243 | 490 | 3467.68 | 904 |
9.71914 | 595 | 3749.32 | 3 |
Describe the distribution of bike lane lengths numerically and graphically after stratifying them by (a) type and then by (b) number of lanes (numLanes).
PROC MEANS data=bike mean std;
CLASS type;
VAR length;
RUN;
PROC SGPLOT data = bike;
VBOX length / category = type;
RUN;
PROC MEANS data=bike mean std;
CLASS numlanes;
VAR length;
RUN;
PROC SGPLOT data = bike;
VBOX length / category = numlanes;
RUN;
The MEANS Procedure
Analysis Variable : length | |||
---|---|---|---|
type | N Obs | Mean | Std Dev |
BIKE BOULEVARD | 49 | 197.2781549 | 113.1318019 |
BIKE LANE | 621 | 300.2418350 | 314.6605599 |
CONTRAFLOW | 13 | 136.4664609 | 48.5165725 |
SHARED BUS BIKE | 39 | 276.6657570 | 140.1811711 |
SHARROW | 589 | 243.8718592 | 188.4222254 |
SIDEPATH | 7 | 665.8775102 | 619.3894227 |
SIGNED ROUTE | 304 | 263.5798659 | 347.8501541 |
The MEANS Procedure
Analysis Variable : length | |||
---|---|---|---|
numLanes | N Obs | Mean | Std Dev |
0 | 21 | 308.3767969 | 201.5548096 |
1 | 900 | 277.9016282 | 317.8281529 |
2 | 710 | 257.5495894 | 218.6179448 |
17.7. SAS Formats and Dates¶
Use proc freq to make a table of the different bike lane types.
PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
out = bike dbms = csv replace;
getnames = yes;
guessingrows = max;
RUN;
PROC FREQ data = bike;
TABLE type;
RUN;
The FREQ Procedure
type | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Frequency Missing = 9 | ||||
BIKE BOULEVARD | 49 | 3.02 | 49 | 3.02 |
BIKE LANE | 621 | 38.29 | 670 | 41.31 |
CONTRAFLOW | 13 | 0.80 | 683 | 42.11 |
SHARED BUS BIKE | 39 | 2.40 | 722 | 44.51 |
SHARROW | 589 | 36.31 | 1311 | 80.83 |
SIDEPATH | 7 | 0.43 | 1318 | 81.26 |
SIGNED ROUTE | 304 | 18.74 | 1622 | 100.00 |
Create a format that changes changes the bike lane types by mapping “SIDEPATH”, “BIKE BOULEVARD”, and “BIKE LANE” to itself but does not provide a value for the other types. Apply this format to type and make a table of type using PROC FREQ. What happens to the other bike lane types?
PROC FORMAT;
VALUE $typefmt "SIDEPATH" = "SIDEPATH"
"BIKE BOULEVARD" = "BIKE BOULEVARD"
"BIKE LANE" = "BIKE LANE"
OTHER = " ";
RUN;
DATA bike2;
SET bike;
FORMAT type $typefmt.;
RUN;
PROC FREQ data = bike2;
TABLE type;
RUN;
The FREQ Procedure
type | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Frequency Missing = 954 | ||||
BIKE BOULEVARD | 49 | 7.24 | 49 | 7.24 |
BIKE LANE | 621 | 91.73 | 670 | 98.97 |
SIDEPATH | 7 | 1.03 | 677 | 100.00 |
We have translated all but BIKE BOULEVARD, BIKE LANE, and SIDEPATE to missing, so we only get these three listed as categories in the table.
Create a format that changes the bike lane types to be “CONTRAFLOW”, “SHARED BUS BIKE”, “SHARROW”, “SIGNED ROUTE”, or “OTHER” if the lane type is anything else.
PROC FORMAT;
VALUE $type2fmt "CONTRAFLOW" = "CONTRAFLOW"
"SHARED BUS BIKE" = "SHARED BUS BIKE"
"SHARROW" = "SHARROW"
"SIGNED ROUTE" = "SIGNED ROUTE"
"SIDEPATH", "BIKE BOULEVARD", "BIKE LANE" = "OTHER";
RUN;
DATA bike3;
SET bike;
FORMAT type $type2fmt.;
RUN;
PROC FREQ data = bike3;
TABLE type;
RUN;
The FREQ Procedure
type | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Frequency Missing = 9 | ||||
OTHER | 677 | 41.74 | 677 | 41.74 |
CONTRAFLOW | 13 | 0.80 | 690 | 42.54 |
SHARED BUS BIKE | 39 | 2.40 | 729 | 44.94 |
SHARROW | 589 | 36.31 | 1318 | 81.26 |
SIGNED ROUTE | 304 | 18.74 | 1622 | 100.00 |
Write a datatep that can be used to read in the following small dataset with three dates and one time. Then print the dataset using readable date formats.
DATA temp;
INPUT date1 /*informat1*/ +1 date2 /*informat2*/ +1
date3 /*informat3*/ @32 time /*informat4*/;
DATALINES;
2014/02/14 06Jan2018 4/5/2016 03:2:22
;
RUN;
DATA temp;
INPUT date1 yymmdd10. +1 date2 DATE9. +1
date3 mmddyy10. @32 time time8.;
DATALINES;
2014/02/14 06Jan2018 4/5/2016 03:2:22
;
RUN;
PROC PRINT data = temp;
FORMAT date1 date2 date3 mmddyy10. time time8.;
RUN;
Obs | date1 | date2 | date3 | time |
---|---|---|---|---|
1 | 02/14/2014 | 01/06/2018 | 04/05/2016 | 3:02:22 |
17.8. Character Functions¶
Read in the Baltimore city employee salaries dataset, Baltimore_City_Employee_Salaries_FY2015.csv, using PROC IMPORT.
PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Baltimore_City_Employee_Salaries_FY2015.csv"
out = sal dbms = csv replace;
getnames = yes;
guessingrows = max;
RUN;
PROC PRINT data = sal(OBS=5);
RUN;
Obs | name | JobTitle | AgencyID | Agency | HireDate | AnnualSalary | GrossPay |
---|---|---|---|---|---|---|---|
1 | Aaron,Patricia G | Facilities/Office Services II | A03031 | OED-Employment Dev (031) | 10/24/1979 | 55,314 | 53,626 |
2 | Aaron,Petra L | ASSISTANT STATE'S ATTORNEY | A29045 | States Attorneys Office (045) | 09/25/2006 | 74,000 | 73,000 |
3 | Abaineh,Yohannes T | EPIDEMIOLOGIST | A65026 | HLTH-Health Department (026) | 07/23/2009 | 64,500 | 64,404 |
4 | Abbene,Anthony M | POLICE OFFICER | A99005 | Police Department (005) | 07/24/2013 | 46,309 | 59,620 |
5 | Abbey,Emmanuel | CONTRACT SERV SPEC II | A40001 | M-R Info Technology (001) | 05/01/2013 | 60,060 | 54,060 |
Make a temporary SAS dataset called
health_sal
using the salaries dataset, with only the agencies (JobTitle
) of those with “fire” (anywhere in the job title), if any, in the name. Be sure to ignore case in the string matching for ‘fire’.
DATA health_sal;
SET sal;
WHERE index(lowcase(JobTitle), 'fire') > 0;
RUN;
PROC PRINT data=health_sal(OBS=5);
VAR JobTitle;
RUN;
Obs | JobTitle |
---|---|
1 | EMT Firefighter Suppression |
2 | Fire Pump Operator Supp ALS |
3 | Firefighter/Paramedic Suppress |
4 | Fire Pump Operator Supp ALS |
5 | Firefighter Suppression |
Make a temporary data set called
trans
which contains only agencies that contain “TRANS” (case sensitive).
DATA trans;
SET sal;
WHERE index(JobTitle, 'TRANS') > 0;
RUN;
PROC PRINT data=trans(OBS=5);
VAR JobTitle;
RUN;
Obs | JobTitle |
---|---|
1 | TRANSPORTATION ASSOC II |
2 | TAX TRANSFER CLERK I |
3 | GENL SUPT TRANSPORTATION MAINT |
4 | TAX TRANSFER CLERK I |
5 | TAX TRANSFER CLERK I |
What is/are the profession(s) of people who have “abra” in their name for Baltimore’s Salaries? Case should be ignored.
PROC PRINT data=sal;
VAR name JobTitle;
WHERE index(lowcase(name), 'abra') > 0;
RUN;
Obs | name | JobTitle |
---|---|---|
26 | Abraham,Donta D | LABORER (Hourly) |
27 | Abraham,Santhosh | ACCOUNTANT I |
28 | Abraham,Sharon M | HOUSING INSPECTOR |
29 | Abrahams,Brandon A | POLICE OFFICER TRAINEE |
30 | Abrams,Maria | OFFICE SERVICES ASSISTANT II S |
31 | Abrams,Maxine | COLLECTIONS REPRESENTATIVE I |
32 | Abrams,Terry | RECREATION ARTS INSTRUCTOR |
958 | Bey,Abraham | PROCUREMENT SPECIALIST II |
3761 | Elgamil,Abraham D | AUDITOR SUPV |
4545 | Gatto,Abraham M | POLICE OFFICER |
11107 | Schwartz,Abraham M | GENERAL COUNSEL |
12789 | Velez,Abraham L | POLICE OFFICER (EID) |
Create a subset of the salaries dataset that only includes the Police Department, Fire Department and Sheriff’s Office. Use the Agency variable with (case sensitive) string matching. Call this
emer
. How many employees are in this new dataset?
DATA emer;
SET sal;
WHERE index(agency, "Sheriff's Office") > 0 OR
index(agency, "Police Department") > 0 OR
index(agency, "Fire Department") > 0;
RUN;
PROC PRINT data=emer(OBS=5);
VAR agency;
RUN;
Obs | Agency |
---|---|
1 | Police Department (005) |
2 | Fire Department (120) |
3 | Police Department (127) |
4 | Sheriff's Office (410) |
5 | Fire Department (465) |
Create a variable called
dept
in theemer
data set by extracting the the part of theAgency
variable up until ‘ment’ or ‘ice’. E.g. we want to extract all characters up until ment or ice (we can group in regex using parentheses) and then discard the rest. Use:
re = prxparse('/.*(ment|ice)/');
call prxsubstr(re, Agency, pos, len);
dept = substr(Agency, pos, len);
DATA emer;
SET emer;
re = prxparse('/.*(ment|ice)/');
call prxsubstr(re, Agency, pos, len);
dept = substr(Agency, pos, len);
RUN;
PROC FREQ data = emer;
TABLE dept;
RUN;
The FREQ Procedure
dept | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Fire Department | 1615 | 32.81 | 1615 | 32.81 |
Police Department | 3097 | 62.92 | 4712 | 95.73 |
Sheriff's Office | 210 | 4.27 | 4922 | 100.00 |
17.9. Data Cleaning¶
Read in the bike lanes dataset Bike_Lanes.csv using PROC IMPORT and call it bike.
PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
out = bike dbms = csv replace;
getnames = yes;
guessingrows = max;
RUN;
PROC PRINT DATA = bike(OBS=5);
RUN;
Obs | subType | name | block | type | numLanes | project | route | length | dateInstalled |
---|---|---|---|---|---|---|---|---|---|
1 | BIKE BOULEVARD | 1 | GUILFORD AVE BIKE BLVD | 435.73787702 | 0 | ||||
2 | SIDEPATH | 1 | NORTHERN | 1024.6746249 | 2010 | ||||
3 | SIGNED ROUTE | 1 | SOUTHEAST BIKE NETWORK | 3749.3226377 | 2010 | ||||
4 | HUNTINGDON PATH | SIDEPATH | 1 | 0 | 0 | ||||
5 | STCLN | EDMONDSON AVE | 5300 BLK EDMONDSON AVE | BIKE LANE | 1 | OPERATION ORANGE CONE | 180.92546489 | 2011 |
How many rows are are in the bike dataset? How many are complete cases? Hint: Use cmiss(of ALL) in a dataset to create a 0/1 variable to indicate if it complete or not and then use PROC MEANS to sum this 0/1 variable.
PROC CONTENTS data = bike;
RUN;
The CONTENTS Procedure
Data Set Name | WORK.BIKE | Observations | 1631 |
---|---|---|---|
Member Type | DATA | Variables | 9 |
Engine | V9 | Indexes | 0 |
Created | 10/18/2020 11:55:08 | Observation Length | 136 |
Last Modified | 10/18/2020 11:55:08 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 4 |
First Data Page | 1 |
Max Obs per Page | 481 |
Obs in First Data Page | 459 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/bike.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671716 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 320KB |
File Size (bytes) | 327680 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
3 | block | Char | 29 | $29. | $29. |
9 | dateInstalled | Num | 8 | BEST12. | BEST32. |
8 | length | Num | 8 | BEST12. | BEST32. |
2 | name | Char | 20 | $20. | $20. |
5 | numLanes | Num | 8 | BEST12. | BEST32. |
6 | project | Char | 25 | $25. | $25. |
7 | route | Char | 13 | $13. | $13. |
1 | subType | Char | 6 | $6. | $6. |
4 | type | Char | 15 | $15. | $15. |
There are 1631 rows in the bike data set.
DATA bike;
SET bike;
IF cmiss(of _ALL_) > 0 THEN complete = 0;
ELSE complete = 1;
RUN;
PROC PRINT data = bike(OBS=5);
RUN;
PROC MEANS data = bike sum;
VAR complete;
RUN;
Obs | subType | name | block | type | numLanes | project | route | length | dateInstalled | complete |
---|---|---|---|---|---|---|---|---|---|---|
1 | BIKE BOULEVARD | 1 | GUILFORD AVE BIKE BLVD | 435.73787702 | 0 | 0 | ||||
2 | SIDEPATH | 1 | NORTHERN | 1024.6746249 | 2010 | 0 | ||||
3 | SIGNED ROUTE | 1 | SOUTHEAST BIKE NETWORK | 3749.3226377 | 2010 | 0 | ||||
4 | HUNTINGDON PATH | SIDEPATH | 1 | 0 | 0 | 0 | ||||
5 | STCLN | EDMONDSON AVE | 5300 BLK EDMONDSON AVE | BIKE LANE | 1 | OPERATION ORANGE CONE | 180.92546489 | 2011 | 0 |
The MEANS Procedure
Analysis Variable : complete |
---|
Sum |
257.0000000 |
Filter rows of bike that are NOT missing the
route
variable, assign this to the objecthave_route
. Create a frequency table of thesubType
variable using PROC FREQ, including the missingsubType
s.
DATA have_route;
SET bike;
WHERE NOT MISSING(route);
RUN;
PROC FREQ data = have_route;
TABLE subType / missing;
RUN;
The FREQ Procedure
subType | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
1 | 0.28 | 1 | 0.28 | |
STRALY | 3 | 0.83 | 4 | 1.10 |
STRPRD | 358 | 98.90 | 362 | 100.00 |
17.10. Data Manipulation¶
Read in the Bike_Lanes_Wide.csv dataset and call is
wide
using PROC IMPORT. Print the first few rows of the dataset.
PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes_Wide.csv"
out = wide dbms = csv replace;
getnames = yes;
guessingrows = max;
RUN;
PROC CONTENTS data = wide;
RUN;
PROC PRINT DATA = wide(OBS=5);
RUN;
SAS Connection established. Subprocess id is 8974
The CONTENTS Procedure
Data Set Name | WORK.WIDE | Observations | 134 |
---|---|---|---|
Member Type | DATA | Variables | 9 |
Engine | V9 | Indexes | 0 |
Created | 10/29/2020 14:57:01 | Observation Length | 141 |
Last Modified | 10/29/2020 14:57:01 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 464 |
Obs in First Data Page | 134 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workFB360000230E_localhost.localdomain/wide.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671855 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 128KB |
File Size (bytes) | 131072 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
2 | BIKE BOULEVARD | Char | 11 | $11. | $11. |
3 | BIKE LANE | Char | 11 | $11. | $11. |
4 | CONTRAFLOW | Char | 11 | $11. | $11. |
9 | NA | Char | 11 | $11. | $11. |
5 | SHARED BUS BIKE | Char | 11 | $11. | $11. |
6 | SHARROW | Char | 11 | $11. | $11. |
7 | SIDEPATH | Char | 11 | $11. | $11. |
8 | SIGNED ROUTE | Char | 11 | $11. | $11. |
1 | name | Char | 53 | $53. | $53. |
Obs | name | BIKE BOULEVARD | BIKE LANE | CONTRAFLOW | SHARED BUS BIKE | SHARROW | SIDEPATH | SIGNED ROUTE | NA |
---|---|---|---|---|---|---|---|---|---|
1 | ALBEMARLE ST | NA | NA | NA | NA | 109.9141727 | NA | 234.9190031 | NA |
2 | ALICEANNA ST | NA | NA | NA | NA | 250.7103035 | NA | NA | NA |
3 | ARGONNE DR | NA | 1312.608373 | NA | NA | NA | NA | NA | NA |
4 | ART MUSEUM DR | NA | NA | NA | NA | NA | NA | 277.6579871 | NA |
5 | AUCHENTOROLY TERR | NA | 1341.82908 | NA | NA | NA | NA | NA | NA |
Reshape
wide
using either PROC TRANSPOSE or a DATA step. You will need to gather all columns except the name column. Transform into a long dataset with two new columnslanetype
(the former column names) andthe_length
the data values. In the variablethe_length
, replace ‘NA’ values with . and convert it to a numeric column.
DATA long;
SET wide;
array col{*} 'BIKE BOULEVARD'n -- NA; /*BIKE_BOULEVARD -- NA*/
DO i = 1 TO dim(col);
lanetype = vname(col(i));
IF col(i) = 'NA' THEN the_length = .;
ELSE the_length = INPUT(col(i), 20.7);
OUTPUT;
END;
DROP 'BIKE BOULEVARD'n -- NA i; /*BIKE_BOULEVARD -- NA*/
RUN;
PROC PRINT data = long(obs=10);
TITLE 'Using a DATA step';
RUN;
PROC SORT data=wide;
BY name;
RUN;
PROC TRANSPOSE data = wide
out = long2 (RENAME = (COL1 = the_length
_NAME_ = lanetype));
BY name;
VAR 'BIKE BOULEVARD'n -- NA; /*BIKE_BOULEVARD -- NA*/
RUN;
DATA long2;
SET long2(RENAME = (the_length = length));
IF length = 'NA' THEN the_length = .;
ELSE the_length = INPUT(length, 20.7);
DROP length;
RUN;
PROC PRINT data = long2(OBS=10);
TITLE 'Using PROC TRANSPOSE';
RUN;
TITLE;
Obs | name | lanetype | the_length |
---|---|---|---|
1 | ALBEMARLE ST | BIKE BOULEVARD | . |
2 | ALBEMARLE ST | BIKE LANE | . |
3 | ALBEMARLE ST | CONTRAFLOW | . |
4 | ALBEMARLE ST | SHARED BUS BIKE | . |
5 | ALBEMARLE ST | SHARROW | 109.914 |
6 | ALBEMARLE ST | SIDEPATH | . |
7 | ALBEMARLE ST | SIGNED ROUTE | 234.919 |
8 | ALBEMARLE ST | NA | . |
9 | ALICEANNA ST | BIKE BOULEVARD | . |
10 | ALICEANNA ST | BIKE LANE | . |
Obs | name | lanetype | the_length |
---|---|---|---|
1 | ALBEMARLE ST | BIKE BOULEVARD | . |
2 | ALBEMARLE ST | BIKE LANE | . |
3 | ALBEMARLE ST | CONTRAFLOW | . |
4 | ALBEMARLE ST | SHARED BUS BIKE | . |
5 | ALBEMARLE ST | SHARROW | 109.914 |
6 | ALBEMARLE ST | SIDEPATH | . |
7 | ALBEMARLE ST | SIGNED ROUTE | 234.919 |
8 | ALBEMARLE ST | NA | . |
9 | ALICEANNA ST | BIKE BOULEVARD | . |
10 | ALICEANNA ST | BIKE LANE | . |
Read in the roads and crashes .csv files and call them
road
andcrash
.
PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/roads.csv"
out = road dbms = csv replace;
getnames = yes;
guessingrows = max;
RUN;
PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/crashes.csv"
out = crash dbms = csv replace;
getnames = yes;
guessingrows = max;
RUN;
PROC PRINT data = road;
RUN;
PROC PRINT data = crash(OBS=10);
RUN;
Obs | Road | District | Length |
---|---|---|---|
1 | Interstate 65 | Greenfield | 262 |
2 | Interstate 70 | Vincennes | 156 |
3 | US-36 | Crawfordsville | 139 |
4 | US-40 | Greenfield | 150 |
5 | US-52 | Crawfordsville | 172 |
Obs | Year | Road | N_Crashes | Volume |
---|---|---|---|---|
1 | 1991 | Interstate 65 | 25 | 40000 |
2 | 1992 | Interstate 65 | 37 | 41000 |
3 | 1993 | Interstate 65 | 45 | 45000 |
4 | 1994 | Interstate 65 | 46 | 45600 |
5 | 1995 | Interstate 65 | 46 | 49000 |
6 | 1996 | Interstate 65 | 59 | 51000 |
7 | 1997 | Interstate 65 | 76 | 52000 |
8 | 1998 | Interstate 65 | 90 | 58000 |
9 | 1999 | Interstate 65 | 95 | 65000 |
10 | 2000 | Interstate 65 | 95 | 74000 |
Replace (using
tranwrd
) any hyphens (-
) with a space inRoad
variable ofcrash
. Call this datacrash2
. Table theRoad
variable with PROC FREQ.
DATA crash2;
SET crash;
road = tranwrd(road, '-', ' ');
RUN;
PROC FREQ data = crash2;
TABLE road;
RUN;
The FREQ Procedure
Road | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Interstate 275 | 22 | 20.00 | 22 | 20.00 |
Interstate 65 | 22 | 20.00 | 44 | 40.00 |
Interstate 70 | 22 | 20.00 | 66 | 60.00 |
US 36 | 22 | 20.00 | 88 | 80.00 |
US 40 | 22 | 20.00 | 110 | 100.00 |
How many observations are in each of the
crash
androad
datasets?
PROC CONTENTS data = crash;
RUN;
PROC CONTENTS data = road;
RUN;
The CONTENTS Procedure
Data Set Name | WORK.CRASH | Observations | 110 |
---|---|---|---|
Member Type | DATA | Variables | 4 |
Engine | V9 | Indexes | 0 |
Created | 10/18/2020 17:00:48 | Observation Length | 40 |
Last Modified | 10/18/2020 17:00:48 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 1632 |
Obs in First Data Page | 110 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/crash.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671741 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 128KB |
File Size (bytes) | 131072 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
3 | N_Crashes | Num | 8 | BEST12. | BEST32. |
2 | Road | Char | 14 | $14. | $14. |
4 | Volume | Num | 8 | BEST12. | BEST32. |
1 | Year | Num | 8 | BEST12. | BEST32. |
The CONTENTS Procedure
Data Set Name | WORK.ROAD | Observations | 5 |
---|---|---|---|
Member Type | DATA | Variables | 3 |
Engine | V9 | Indexes | 0 |
Created | 10/18/2020 17:00:48 | Observation Length | 40 |
Last Modified | 10/18/2020 17:00:48 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 1632 |
Obs in First Data Page | 5 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/road.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671709 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 128KB |
File Size (bytes) | 131072 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
2 | District | Char | 14 | $14. | $14. |
3 | Length | Num | 8 | BEST12. | BEST32. |
1 | Road | Char | 13 | $13. | $13. |
The crash dataset has 110 observations and the road datset has 5.
Separate the
Road
column (usingscan
) into (type
andnumber
) incrash2
. Reassign this tocrash2
. Tabletype
fromcrash2
using PROC FREQ. Then create a new variable calling itroad_hyphen
using one of the concatenate functions (such as CAT). Unite thetype
andnumber
columns using a hyphen (-
) and then tableroad_hyphen
using PROC FREQ.
DATA crash2;
SET crash2;
type = scan(road, 1);
number = scan(road, 2);
RUN;
PROC FREQ data=crash2;
TABLE type;
RUN;
DATA crash2;
SET crash2;
road_hyphen = CATX('-', type, number);
RUN;
PROC FREQ data=crash2;
TABLE road_hyphen;
RUN;
The FREQ Procedure
type | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Interstate | 66 | 60.00 | 66 | 60.00 |
US | 44 | 40.00 | 110 | 100.00 |
The FREQ Procedure
road_hyphen | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
Interstate-275 | 22 | 20.00 | 22 | 20.00 |
Interstate-65 | 22 | 20.00 | 44 | 40.00 |
Interstate-70 | 22 | 20.00 | 66 | 60.00 |
US-36 | 22 | 20.00 | 88 | 80.00 |
US-40 | 22 | 20.00 | 110 | 100.00 |
Which and how many years were data collected in the
crash
dataset?
PROC FREQ data=crash;
TABLE year;
RUN;
The FREQ Procedure
Year | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
1991 | 5 | 4.55 | 5 | 4.55 |
1992 | 5 | 4.55 | 10 | 9.09 |
1993 | 5 | 4.55 | 15 | 13.64 |
1994 | 5 | 4.55 | 20 | 18.18 |
1995 | 5 | 4.55 | 25 | 22.73 |
1996 | 5 | 4.55 | 30 | 27.27 |
1997 | 5 | 4.55 | 35 | 31.82 |
1998 | 5 | 4.55 | 40 | 36.36 |
1999 | 5 | 4.55 | 45 | 40.91 |
2000 | 5 | 4.55 | 50 | 45.45 |
2001 | 5 | 4.55 | 55 | 50.00 |
2002 | 5 | 4.55 | 60 | 54.55 |
2003 | 5 | 4.55 | 65 | 59.09 |
2004 | 5 | 4.55 | 70 | 63.64 |
2005 | 5 | 4.55 | 75 | 68.18 |
2006 | 5 | 4.55 | 80 | 72.73 |
2007 | 5 | 4.55 | 85 | 77.27 |
2008 | 5 | 4.55 | 90 | 81.82 |
2009 | 5 | 4.55 | 95 | 86.36 |
2010 | 5 | 4.55 | 100 | 90.91 |
2011 | 5 | 4.55 | 105 | 95.45 |
2012 | 5 | 4.55 | 110 | 100.00 |
The crash dataset contains data on the years from 1991 to 2012 for a total of 22 years.
Read in the dataset Bike_Lanes.csv and call it
bike
.
PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
out = bike dbms = csv replace;
getnames = yes;
guessingrows = max;
RUN;
PROC PRINT data=bike(obs=5);
RUN;
Obs | subType | name | block | type | numLanes | project | route | length | dateInstalled |
---|---|---|---|---|---|---|---|---|---|
1 | BIKE BOULEVARD | 1 | GUILFORD AVE BIKE BLVD | 435.73787702 | 0 | ||||
2 | SIDEPATH | 1 | NORTHERN | 1024.6746249 | 2010 | ||||
3 | SIGNED ROUTE | 1 | SOUTHEAST BIKE NETWORK | 3749.3226377 | 2010 | ||||
4 | HUNTINGDON PATH | SIDEPATH | 1 | 0 | 0 | ||||
5 | STCLN | EDMONDSON AVE | 5300 BLK EDMONDSON AVE | BIKE LANE | 1 | OPERATION ORANGE CONE | 180.92546489 | 2011 |
Keep rows where the record is not missing
type
and not missingname
and re-assign the output tobike
.
DATA bike;
SET bike;
WHERE NOT MISSING(type);
RUN;
PROC PRINT data=bike(obs=5);
RUN;
Obs | subType | name | block | type | numLanes | project | route | length | dateInstalled |
---|---|---|---|---|---|---|---|---|---|
1 | BIKE BOULEVARD | 1 | GUILFORD AVE BIKE BLVD | 435.73787702 | 0 | ||||
2 | SIDEPATH | 1 | NORTHERN | 1024.6746249 | 2010 | ||||
3 | SIGNED ROUTE | 1 | SOUTHEAST BIKE NETWORK | 3749.3226377 | 2010 | ||||
4 | HUNTINGDON PATH | SIDEPATH | 1 | 0 | 0 | ||||
5 | STCLN | EDMONDSON AVE | 5300 BLK EDMONDSON AVE | BIKE LANE | 1 | OPERATION ORANGE CONE | 180.92546489 | 2011 |
Using PROC MEANS with a BY statement grouping
name
andtype
(i.e for each type within each name), find thesum
of thelength
. Use an OUTPUT statement to get this summary dataset and only keep thename
,type
and sum of thelength
column (after renaming this columnlength
). Call this data setsub
.
PROC SORT data = bike;
BY name type;
RUN;
PROC MEANS data = bike noprint;
BY name type;
VAR length;
OUTPUT out=sub sum=length;
RUN;
DATA sub;
SET sub;
keep name type length;
RUN;
PROC PRINT data=sub(obs=8);
RUN;
Obs | name | type | length |
---|---|---|---|
1 | BIKE BOULEVARD | 435.73787702 | |
2 | BIKE LANE | 695.69831669 | |
3 | SHARROW | 186.96669187 | |
4 | SIDEPATH | 1024.6746249 | |
5 | SIGNED ROUTE | 6527.883024 | |
6 | ALBEMARLE ST | SHARROW | 439.65669098 |
7 | ALBEMARLE ST | SIGNED ROUTE | 1879.3520247 |
8 | ALICEANNA ST | SHARROW | 4763.495767 |
PROC FREQ data=sub;
TABLE type;
RUN;
The FREQ Procedure
type | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
BIKE BOULEVARD | 3 | 1.80 | 3 | 1.80 |
BIKE LANE | 54 | 32.34 | 57 | 34.13 |
CONTRAFLOW | 4 | 2.40 | 61 | 36.53 |
SHARED BUS BIKE | 4 | 2.40 | 65 | 38.92 |
SHARROW | 50 | 29.94 | 115 | 68.86 |
SIDEPATH | 5 | 2.99 | 120 | 71.86 |
SIGNED ROUTE | 47 | 28.14 | 167 | 100.00 |
Reshape
sub
from long to wide by taking thetype
to be the new columns andlength
to be the value in those columns. (NOTE: thenames
have spaces in them. Do we need to replace the spaces with a character before changing them to column names?)
DATA new_wide;
SET sub;
BY name;
RETAIN BIKE_BOULEVARD BIKE_LANE CONTRAFLOW SHARED_BUS_BIKE
SHARROW SIDEPATH SIGNED_ROUTE;
array atype (*) BIKE_BOULEVARD BIKE_LANE CONTRAFLOW SHARED_BUS_BIKE
SHARROW SIDEPATH SIGNED_ROUTE;
IF first.name THEN call missing(of atype(*));
IF type = 'BIKE BOULEVARD' THEN atype(1) = length;
ELSE IF type = 'BIKE LANE' THEN atype(2) = length;
ELSE IF type = 'CONTRAFLOW' THEN atype(3) = length;
ELSE IF type = 'SHARED BUS BIKE' THEN atype(4) = length;
ELSE IF type = 'SHARROW' THEN atype(5) = length;
ELSE IF type = 'SIDEPATH' THEN atype(6) = length;
ELSE IF type = 'SIGNED ROUTE' THEN atype(7) = length;
IF last.name THEN OUTPUT;
DROP type length;
RUN;
PROC PRINT data=new_wide(OBS=5);
TITLE 'Wide Dataset Using DATA Step';
RUN;
PROC TRANSPOSE data = sub out = new_wide2(drop=_NAME_);
BY name;
VAR length;
ID type;
RUN;
PROC PRINT data = new_wide2(OBS=5);
TITLE 'Wide Dataset Using PROC TRANSPOSE';
RUN;
TITLE ;
Obs | name | BIKE_BOULEVARD | BIKE_LANE | CONTRAFLOW | SHARED_BUS_BIKE | SHARROW | SIDEPATH | SIGNED_ROUTE |
---|---|---|---|---|---|---|---|---|
1 | 435.738 | 695.70 | . | . | 186.97 | 1024.67 | 6527.88 | |
2 | ALBEMARLE ST | . | . | . | . | 439.66 | . | 1879.35 |
3 | ALICEANNA ST | . | . | . | . | 4763.50 | . | . |
4 | ARGONNE DR | . | 1312.61 | . | . | . | . | . |
5 | ART MUSEUM DR | . | . | . | . | . | . | 832.97 |
Obs | name | BIKE BOULEVARD | BIKE LANE | SHARROW | SIDEPATH | SIGNED ROUTE | CONTRAFLOW | SHARED BUS BIKE |
---|---|---|---|---|---|---|---|---|
1 | 435.73787702 | 695.69831669 | 186.96669187 | 1024.6746249 | 6527.883024 | . | . | |
2 | ALBEMARLE ST | . | . | 439.65669098 | . | 1879.3520247 | . | . |
3 | ALICEANNA ST | . | . | 4763.495767 | . | . | . | . |
4 | ARGONNE DR | . | 1312.6083729 | . | . | . | . | . |
5 | ART MUSEUM DR | . | . | . | . | 832.97396136 | . | . |
Join data in the crash and road datasets to retain only complete data, (using an inner join on road) Merge by the variable
Road
. Call the output merged. How many observations are there?
PROC SORT data = road;
BY road;
RUN;
PROC SORT data = crash;
BY road;
RUN;
DATA merged;
MERGE road (IN = in1)
crash (IN = in2);
BY road;
IF in1 = 1 and in2 = 1;
RUN;
PROC CONTENTS data = merged;
RUN;
The CONTENTS Procedure
Data Set Name | WORK.MERGED | Observations | 88 |
---|---|---|---|
Member Type | DATA | Variables | 6 |
Engine | V9 | Indexes | 0 |
Created | 10/18/2020 20:34:03 | Observation Length | 64 |
Last Modified | 10/18/2020 20:34:03 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 1021 |
Obs in First Data Page | 88 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/merged.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671747 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 128KB |
File Size (bytes) | 131072 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
2 | District | Char | 14 | $14. | $14. |
3 | Length | Num | 8 | BEST12. | BEST32. |
5 | N_Crashes | Num | 8 | BEST12. | BEST32. |
1 | Road | Char | 13 | $13. | $13. |
6 | Volume | Num | 8 | BEST12. | BEST32. |
4 | Year | Num | 8 | BEST12. | BEST32. |
There are 88 rows.
Join data using a
full_join
. Call the outputfull
. How many observations are there?
DATA full_join;
MERGE road crash;
BY road;
RUN;
PROC PRINT data = full_join(OBS=10);
RUN;
Obs | Road | District | Length | Year | N_Crashes | Volume |
---|---|---|---|---|---|---|
1 | Interstate 27 | . | 1991 | 27 | 20350 | |
2 | Interstate 27 | . | 1992 | 26 | 21200 | |
3 | Interstate 27 | . | 1993 | 22 | 23200 | |
4 | Interstate 27 | . | 1994 | 21 | 21200 | |
5 | Interstate 27 | . | 1995 | 28 | 23200 | |
6 | Interstate 27 | . | 1996 | 22 | 20000 | |
7 | Interstate 27 | . | 1997 | 27 | 18000 | |
8 | Interstate 27 | . | 1998 | 21 | 19500 | |
9 | Interstate 27 | . | 1999 | 22 | 21000 | |
10 | Interstate 27 | . | 2000 | 29 | 20700 |
Do a left join of the
road
andcrash
. ORDER matters here! How many observations are there?
DATA left_join;
MERGE road (in = in1) crash;
BY road;
IF in1 = 1;
RUN;
PROC CONTENTS data = left_join;
RUN;
The CONTENTS Procedure
Data Set Name | WORK.LEFT_JOIN | Observations | 89 |
---|---|---|---|
Member Type | DATA | Variables | 6 |
Engine | V9 | Indexes | 0 |
Created | 10/18/2020 20:38:16 | Observation Length | 64 |
Last Modified | 10/18/2020 20:38:16 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 1021 |
Obs in First Data Page | 89 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/left_join.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671749 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 128KB |
File Size (bytes) | 131072 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
2 | District | Char | 14 | $14. | $14. |
3 | Length | Num | 8 | BEST12. | BEST32. |
5 | N_Crashes | Num | 8 | BEST12. | BEST32. |
1 | Road | Char | 13 | $13. | $13. |
6 | Volume | Num | 8 | BEST12. | BEST32. |
4 | Year | Num | 8 | BEST12. | BEST32. |
There are 89 rows.
Repeat above with a
right_join
with the same order of the arguments. How many observations are there?
DATA right_join;
MERGE road crash (in = in1);
BY road;
IF in1 = 1;
RUN;
PROC CONTENTS data = right_join;
RUN;
The CONTENTS Procedure
Data Set Name | WORK.RIGHT_JOIN | Observations | 110 |
---|---|---|---|
Member Type | DATA | Variables | 6 |
Engine | V9 | Indexes | 0 |
Created | 10/18/2020 20:39:48 | Observation Length | 64 |
Last Modified | 10/18/2020 20:39:48 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
Encoding | utf-8 Unicode (UTF-8) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 1021 |
Obs in First Data Page | 110 |
Number of Data Set Repairs | 0 |
Filename | /tmp/SAS_workA5B9000010B1_localhost.localdomain/right_join.sas7bdat |
Release Created | 9.0401M6 |
Host Created | Linux |
Inode Number | 671750 |
Access Permission | rw-r--r-- |
Owner Name | sasdemo |
File Size | 128KB |
File Size (bytes) | 131072 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
2 | District | Char | 14 | $14. | $14. |
3 | Length | Num | 8 | BEST12. | BEST32. |
5 | N_Crashes | Num | 8 | BEST12. | BEST32. |
1 | Road | Char | 13 | $13. | $13. |
6 | Volume | Num | 8 | BEST12. | BEST32. |
4 | Year | Num | 8 | BEST12. | BEST32. |
There are 110 observations in the right joined dataset.
17.11. The Output Delivery System and Graphics¶
PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Charm_City_Circulator_Ridership.csv"
out = circ dbms = csv replace;
getnames = yes;
guessingrows = max;
RUN;
DATA long;
SET circ;
ARRAY larray(*) orangeBoardings -- bannerAverage;
DO i = 1 TO dim(larray);
var = vname(larray(i));
number = larray(i);
var = tranwrd(var, 'Board', ' Board');
var = tranwrd(var, 'Alight', ' Alight');
var = tranwrd(var, 'Average', ' Average');
route = scan(var, 1);
type = scan(var, 2);
OUTPUT;
END;
DROP i var orangeBoardings -- bannerAverage;
RUN;
DATA avg;
SET long;
WHERE type = 'Average' and number ne .;
RUN;
PROC PRINT data = long(obs = 10);
RUN;
PROC PRINT data = avg(obs=10);
RUN;
Obs | day | date | daily | number | route | type |
---|---|---|---|---|---|---|
1 | Monday | 01/11/2010 | 952 | 877 | orange | Boardings |
2 | Monday | 01/11/2010 | 952 | 1027 | orange | Alightings |
3 | Monday | 01/11/2010 | 952 | 952 | orange | Average |
4 | Monday | 01/11/2010 | 952 | . | purple | Boardings |
5 | Monday | 01/11/2010 | 952 | . | purple | Alightings |
6 | Monday | 01/11/2010 | 952 | . | purple | Average |
7 | Monday | 01/11/2010 | 952 | . | green | Boardings |
8 | Monday | 01/11/2010 | 952 | . | green | Alightings |
9 | Monday | 01/11/2010 | 952 | . | green | Average |
10 | Monday | 01/11/2010 | 952 | . | banner | Boardings |
Obs | day | date | daily | number | route | type |
---|---|---|---|---|---|---|
1 | Monday | 01/11/2010 | 952 | 952.0 | orange | Average |
2 | Tuesday | 01/12/2010 | 796 | 796.0 | orange | Average |
3 | Wednesday | 01/13/2010 | 1211.5 | 1211.5 | orange | Average |
4 | Thursday | 01/14/2010 | 1213.5 | 1213.5 | orange | Average |
5 | Friday | 01/15/2010 | 1644 | 1644.0 | orange | Average |
6 | Saturday | 01/16/2010 | 1490.5 | 1490.5 | orange | Average |
7 | Sunday | 01/17/2010 | 888.5 | 888.5 | orange | Average |
8 | Monday | 01/18/2010 | 999.5 | 999.5 | orange | Average |
9 | Tuesday | 01/19/2010 | 1035 | 1035.0 | orange | Average |
10 | Wednesday | 01/20/2010 | 1395.5 | 1395.5 | orange | Average |
Plot average ridership (avg data set) by date using a scatterplot. a. Color the points by route (orange, purple, green, banner) b. Add black smoothed curves for each route c. Color the points by day of the week
PROC SGPLOT data = avg;
TITLE '1a';
SCATTER Y = number X = date / group = route;
RUN;
PROC SGPLOT data = avg;
TITLE '1b';
SCATTER Y = number X = date / group = route;
LOESS Y = number X = date / group = route lineattrs=(color = black);
RUN;
PROC SGPLOT data = avg;
TITLE '1c';
SCATTER Y = number X = date / group = day;
RUN;
Replot 1a where the colors of the points are the name of the route (with banner –> blue). Note: you will need to make a data attribute map.
DATA marker_map;
INPUT id $2. +1 VALUE $6. +1 MARKERCOLOR $6.;
DATALINES;
mc banner blue
mc green green
mc orange orange
mc purple purple
;
RUN;
PROC SGPLOT data = avg dattrmap = marker_map;
TITLE '2';
SCATTER Y = number X = date / group = route attrid = mc;
RUN;
Plot a scatterplot of average ridership by date with one panel per route.
PROC SGPANEL data = avg;
TITLE '3';
PANELBY route;
SCATTER Y = number X = date;
RUN;
Plot a scatterplpot of average ridership by date with separate panels by day of the week, colored by route.
*Default colors are fine here to, but we can also use a data attribute map like we did before;
PROC SGPANEL data = avg dattrmap = marker_map;
TITLE '4';
PANELBY day;
SCATTER Y = number X = date / group = route attrid = mc;
RUN;
Plot a scatterplot of average ridership (avg) by date, colored by route (same as 1a). (do not take an average, use the average column for each route). Make the x-label “Year”. Make the y-label “Number of People”.
PROC SGPLOT data = avg dattrmap = marker_map;
TITLE '5';
SCATTER Y = number X = date / group = route attrid = mc;
XAXIS LABEL = "Year";
YAXIS LABEL = "Number of People";
RUN;
Plot average ridership on the orange route versus date as a solid line, and add dashed “error” lines based on the boardings and alightings. The line colors should be orange.
DATA orange;
SET long;
WHERE route = "orange";
RUN;
DATA line_map;
INPUT id $2. +1 VALUE $10. +1 linepattern $5.;
DATALINES;
lp Alightings dash
lp Average solid
lp Boardings dash
;
RUN;
PROC SGPLOT data = orange dattrmap = line_map;
TITLE '6';
SERIES Y = number X = date / group = type attrid = lp lineattrs=(color = orange);
RUN;
TITLE;
17.12. Statistical Analysis in SAS¶
FILENAME cardata '/folders/myfolders/SAS_Notes/data/kaggleCarAuction.csv';
PROC IMPORT datafile = cardata out = cars dbms = CSV replace;
getnames = yes;
guessingrows = 1000;
RUN;
FILENAME mortdat '/folders/myfolders/SAS_Notes/data/indicatordeadkids35.csv';
PROC IMPORT datafile = mortdat out = mort dbms = CSV replace;
getnames = yes;
guessingrows = 500;
RUN;
Compute the correlation between the
1980
,1990
,2000
, and2010
mortality data. Just display the result to the screen. Then compute using the NOMMISS option. (Note: The column names are numbers, which are invalid standard SAS names, so to refer to the variable 1980 in your code use ‘1980’n.)
/*
PROC CORR data = mort;
VAR '1980'n '1990'n '2000'n '2010'n;
RUN;
PROC CORR data = mort nomiss;
VAR '1980'n '1990'n '2000'n '2010'n;
RUN;
*/
PROC CORR data = mort;
VAR _1980 _1990 _2000 _2010;
RUN;
PROC CORR data = mort nomiss;
VAR _1980 _1990 _2000 _2010;
RUN;
The CORR Procedure
4 Variables: | 1980 1990 2000 2010 |
---|
Simple Statistics | ||||||
---|---|---|---|---|---|---|
Variable | N | Mean | Std Dev | Sum | Minimum | Maximum |
1980 | 197 | 0.94731 | 0.88694 | 186.62013 | 0.07049 | 3.49035 |
1990 | 197 | 0.70911 | 0.77929 | 139.69428 | 0.04922 | 3.28983 |
2000 | 197 | 0.55004 | 0.68922 | 108.35786 | 0.03232 | 2.81186 |
2010 | 193 | 0.39601 | 0.50021 | 76.42946 | 0.03173 | 2.02857 |
Pearson Correlation Coefficients Prob > |r| under H0: Rho=0 Number of Observations |
||||
---|---|---|---|---|
1980 | 1990 | 2000 | 2010 | |
1980 |
1.00000
197
|
0.96015
<.0001
197
|
0.88884
<.0001
197
|
0.84683
<.0001
193
|
1990 |
0.96015
<.0001
197
|
1.00000
197
|
0.96138
<.0001
197
|
0.92472
<.0001
193
|
2000 |
0.88884
<.0001
197
|
0.96138
<.0001
197
|
1.00000
197
|
0.98623
<.0001
193
|
2010 |
0.84683
<.0001
193
|
0.92472
<.0001
193
|
0.98623
<.0001
193
|
1.00000
193
|
The CORR Procedure
4 Variables: | 1980 1990 2000 2010 |
---|
Simple Statistics | ||||||
---|---|---|---|---|---|---|
Variable | N | Mean | Std Dev | Sum | Minimum | Maximum |
1980 | 193 | 0.96250 | 0.88966 | 185.76175 | 0.07049 | 3.49035 |
1990 | 193 | 0.72132 | 0.78263 | 139.21550 | 0.05512 | 3.28983 |
2000 | 193 | 0.55976 | 0.69297 | 108.03459 | 0.03271 | 2.81186 |
2010 | 193 | 0.39601 | 0.50021 | 76.42946 | 0.03173 | 2.02857 |
Pearson Correlation Coefficients, N = 193 Prob > |r| under H0: Rho=0 |
||||
---|---|---|---|---|
1980 | 1990 | 2000 | 2010 | |
1980 |
1.00000
|
0.95968
<.0001
|
0.88774
<.0001
|
0.84683
<.0001
|
1990 |
0.95968
<.0001
|
1.00000
|
0.96103
<.0001
|
0.92472
<.0001
|
2000 |
0.88774
<.0001
|
0.96103
<.0001
|
1.00000
|
0.98623
<.0001
|
2010 |
0.84683
<.0001
|
0.92472
<.0001
|
0.98623
<.0001
|
1.00000
|
a. Compute the correlation between the
Myanmar
,China
, andUnited States
mortality data. Store this correlation matrix in an object calledcountry_cor
using ODS OUTPUT. b. Extract the Myanmar-US correlation from the correlation matrix.
DATA mort_sub;
SET mort(RENAME=(Var1 = Country));
WHERE country in ('Myanmar' 'China' 'United States');
RUN;
/*
PROC TRANSPOSE data = mort_sub out = mort_sub_transposed;
VAR '1760'n -- '2099'n;
ID country;
RUN;
*/
PROC TRANSPOSE data = mort_sub out = mort_sub_transposed;
VAR _1760 -- _2099;
ID country;
RUN;
/*
PROC CORR DATA = mort_sub_transposed;
VAR China Myanmar 'United States'n;
ODS OUTPUT PearsonCorr = country_cor;
RUN;
*/
*ODS TRACE ON;
PROC CORR DATA = mort_sub_transposed;
VAR China Myanmar United_States;
ODS OUTPUT PearsonCorr = country_cor;
RUN;
*ODS TRACE OFF;
/*
PROC PRINT data = country_cor;
Var Myanmar;
WHERE Variable = 'United States';
RUN;
*/
PROC PRINT data = country_cor;
Var Myanmar;
WHERE Variable = 'United_States';
RUN;
The CORR Procedure
3 Variables: | China Myanmar United States |
---|
Simple Statistics | ||||||
---|---|---|---|---|---|---|
Variable | N | Mean | Std Dev | Sum | Minimum | Maximum |
China | 214 | 2.37608 | 1.15555 | 508.48153 | 0.04325 | 3.61453 |
Myanmar | 214 | 2.82215 | 1.12767 | 603.94031 | 0.05150 | 3.76418 |
United States | 214 | 1.36080 | 1.14407 | 291.21132 | 0.02867 | 3.17018 |
Pearson Correlation Coefficients, N = 214 Prob > |r| under H0: Rho=0 |
|||
---|---|---|---|
China | Myanmar | United States | |
China |
1.00000
|
0.97436
<.0001
|
0.69093
<.0001
|
Myanmar |
0.97436
<.0001
|
1.00000
|
0.62926
<.0001
|
United States |
0.69093
<.0001
|
0.62926
<.0001
|
1.00000
|
Obs | Myanmar |
---|---|
3 | 0.62926 |
Is there a difference between mortality information from
1990
and2000
? Run a paired t-test and a Wilcoxon signed rank test to assess this. Hint: to extract the column of information for1990
, use ‘1990’n.
/*
PROC TTEST data = mort;
PAIRED '1990'n*'2000'n;
RUN;
DATA diff;
SET mort;
diff = '2000'n - '1990'n;
RUN;
*/
PROC TTEST data = mort;
PAIRED _1990*_2000;
RUN;
DATA diff;
SET mort;
diff = _2000 - _1990;
RUN;
PROC UNIVARIATE data = diff;
VAR diff;
RUN;
The TTEST Procedure
Difference: 1990 - 2000
N | Mean | Std Dev | Std Err | Minimum | Maximum |
---|---|---|---|---|---|
197 | 0.1591 | 0.2227 | 0.0159 | -0.5456 | 0.8726 |
Mean | 95% CL Mean | Std Dev | 95% CL Std Dev | ||
---|---|---|---|---|---|
0.1591 | 0.1278 | 0.1904 | 0.2227 | 0.2027 | 0.2472 |
DF | t Value | Pr > |t| |
---|---|---|
196 | 10.03 | <.0001 |
The UNIVARIATE Procedure
Variable: diff
Moments | |||
---|---|---|---|
N | 197 | Sum Weights | 197 |
Mean | -0.1590681 | Sum Observations | -31.336415 |
Std Deviation | 0.22269839 | Variance | 0.04959457 |
Skewness | -0.9205453 | Kurtosis | 2.01791651 |
Uncorrected SS | 14.7051599 | Corrected SS | 9.72053615 |
Coeff Variation | -140.00192 | Std Error Mean | 0.0158666 |
Basic Statistical Measures | |||
---|---|---|---|
Location | Variability | ||
Mean | -0.15907 | Std Deviation | 0.22270 |
Median | -0.09007 | Variance | 0.04959 |
Mode | . | Range | 1.41823 |
Interquartile Range | 0.21805 |
Tests for Location: Mu0=0 | ||||
---|---|---|---|---|
Test | Statistic | p Value | ||
Student's t | t | -10.0253 | Pr > |t| | <.0001 |
Sign | M | -84.5 | Pr >= |M| | <.0001 |
Signed Rank | S | -8245.5 | Pr >= |S| | <.0001 |
Quantiles (Definition 5) | |
---|---|
Level | Quantile |
100% Max | 0.5456100 |
99% | 0.3990751 |
95% | 0.0830986 |
90% | -0.0104786 |
75% Q3 | -0.0286619 |
50% Median | -0.0900725 |
25% Q1 | -0.2467130 |
10% | -0.4798594 |
5% | -0.6188205 |
1% | -0.8710244 |
0% Min | -0.8726242 |
Extreme Observations | |||
---|---|---|---|
Lowest | Highest | ||
Value | Obs | Value | Obs |
-0.872624 | 19 | 0.272621 | 32 |
-0.871024 | 131 | 0.321500 | 196 |
-0.808735 | 96 | 0.329956 | 22 |
-0.804626 | 54 | 0.399075 | 99 |
-0.794268 | 106 | 0.545610 | 197 |
Using the cars dataset, fit a linear regression model with vehicle cost (
VehBCost
) as the outcome and vehicle age (VehicleAge
) and whether it’s an online sale (IsOnlineSale
) as predictors as well as their interaction.
PROC GLM data = cars;
MODEL VehBCost = VehicleAge IsOnlineSale VehicleAge*IsOnlineSale;
RUN;
The GLM Procedure
Number of Observations Read | 72983 |
---|---|
Number of Observations Used | 72983 |
The GLM Procedure
Dependent Variable: VehBCost
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 3 | 22422071293 | 7474023764.4 | 2652.08 | <.0001 |
Error | 72979 | 205667188003 | 2818169.4461 | ||
Corrected Total | 72982 | 228089259296 |
R-Square | Coeff Var | Root MSE | VehBCost Mean |
---|---|---|---|
0.098304 | 24.94067 | 1678.740 | 6730.934 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
VehicleAge | 1 | 22249980049 | 22249980049 | 7895.19 | <.0001 |
IsOnlineSale | 1 | 158865621 | 158865621 | 56.37 | <.0001 |
VehicleAg*IsOnlineSa | 1 | 13225623 | 13225623 | 4.69 | 0.0303 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
VehicleAge | 1 | 21533779839 | 21533779839 | 7641.05 | <.0001 |
IsOnlineSale | 1 | 64252535 | 64252535 | 22.80 | <.0001 |
VehicleAg*IsOnlineSa | 1 | 13225623 | 13225623 | 4.69 | 0.0303 |
Parameter | Estimate | Standard Error |
t Value | Pr > |t| |
---|---|---|---|---|
Intercept | 8062.701999 | 16.5870226 | 486.08 | <.0001 |
VehicleAge | -320.662108 | 3.6683525 | -87.41 | <.0001 |
IsOnlineSale | 514.308268 | 107.7114631 | 4.77 | <.0001 |
VehicleAg*IsOnlineSa | -55.372926 | 25.5607123 | -2.17 | 0.0303 |
Create a variable called
expensive
in thecars
data that indicates if the vehicle cost is over$10,000
. Use a chi-squared test to assess if there is a relationship between a car being expensive and it being labeled as a “bad buy” (IsBadBuy
).
DATA cars;
SET cars;
IF VehBCost > 10000 THEN expensive = 1;
ELSE expensive = 0;
RUN;
PROC FREQ data = cars;
TABLES expensive*IsBadBuy / chisq;
RUN;
The FREQ Procedure
|
|
Statistics for Table of expensive by IsBadBuy
Statistic | DF | Value | Prob |
---|---|---|---|
Chi-Square | 1 | 1.8948 | 0.1687 |
Likelihood Ratio Chi-Square | 1 | 1.9314 | 0.1646 |
Continuity Adj. Chi-Square | 1 | 1.8152 | 0.1779 |
Mantel-Haenszel Chi-Square | 1 | 1.8948 | 0.1687 |
Phi Coefficient | -0.0051 | ||
Contingency Coefficient | 0.0051 | ||
Cramer's V | -0.0051 |
Fisher's Exact Test | |
---|---|
Cell (1,1) Frequency (F) | 61505 |
Left-sided Pr <= F | 0.0880 |
Right-sided Pr >= F | 0.9211 |
Table Probability (P) | 0.0091 |
Two-sided Pr <= P | 0.1790 |
Sample Size = 72983
Fit a logistic regression model where the outcome is “bad buy” status and predictors are the
expensive
status and vehicle age (VehicleAge
). Request confidence intervals for the odds ratios.
PROC LOGISTIC data = cars;
MODEL IsBadBuy(event='1') = expensive VehicleAge / CLODDS = WALD;
RUN;
The LOGISTIC Procedure
Model Information | |
---|---|
Data Set | WORK.CARS |
Response Variable | IsBadBuy |
Number of Response Levels | 2 |
Model | binary logit |
Optimization Technique | Fisher's scoring |
Number of Observations Read | 72983 |
---|---|
Number of Observations Used | 72983 |
Response Profile | ||
---|---|---|
Ordered Value |
IsBadBuy | Total Frequency |
1 | 0 | 64007 |
2 | 1 | 8976 |
Probability modeled is IsBadBuy='1'.
Model Convergence Status |
---|
Convergence criterion (GCONV=1E-8) satisfied. |
Model Fit Statistics | ||
---|---|---|
Criterion | Intercept Only | Intercept and Covariates |
AIC | 54423.307 | 52447.788 |
SC | 54432.505 | 52475.381 |
-2 Log L | 54421.307 | 52441.788 |
Testing Global Null Hypothesis: BETA=0 | |||
---|---|---|---|
Test | Chi-Square | DF | Pr > ChiSq |
Likelihood Ratio | 1979.5198 | 2 | <.0001 |
Score | 2042.5760 | 2 | <.0001 |
Wald | 1961.8662 | 2 | <.0001 |
Analysis of Maximum Likelihood Estimates | |||||
---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error |
Wald Chi-Square |
Pr > ChiSq |
Intercept | 1 | -3.2495 | 0.0332 | 9554.1782 | <.0001 |
expensive | 1 | -0.0804 | 0.0607 | 1.7542 | 0.1853 |
VehicleAge | 1 | 0.2866 | 0.00648 | 1958.7086 | <.0001 |
Association of Predicted Probabilities and Observed Responses | |||
---|---|---|---|
Percent Concordant | 56.9 | Somers' D | 0.279 |
Percent Discordant | 29.1 | Gamma | 0.324 |
Percent Tied | 14.0 | Tau-a | 0.060 |
Pairs | 574526832 | c | 0.639 |
Odds Ratio Estimates and Wald Confidence Intervals | ||||
---|---|---|---|---|
Effect | Unit | Estimate | 95% Confidence Limits | |
expensive | 1.0000 | 0.923 | 0.819 | 1.039 |
VehicleAge | 1.0000 | 1.332 | 1.315 | 1.349 |
17.13. Macros¶
Write a macro that discretizes a quantitative variable into four categories based on quantiles. That is:
if X < Q1 then group = 1
if Q1 < X < M then group = 2
if M < X < Q3 then group = 3
if Q3 < X then group = 4
The macro should have the following defnition - %quartilesmacro(mydata, qvar, round, out);
mydata: dataset containing the quantitative variable
qvar: name of quantitative variable
round: integer representing number of decimal places to round to
out: name of output dataset which contains the categorized variable
We will explore this by using the Framingham dataset as an example to apply the macro to.
LIBNAME phc6089 "/folders/myfolders/SAS_Notes/data/";
*Create a temporary data set, so that we don't save changes to the original data set.;
DATA fghmTemp;
SET phc6089.fghm113;
RUN;
/*Now let's code some variables with some more descriptive values.
SEX (Gender): 1=Men
2=Women
Period (Examination cycle): 1=Period1
2=Period2
3=Period3
BPMEDS (Use of anti-hypertensive meds): 0=Not currently
1=Currently use
CURSMOKE (Currently smoke?): 0=No
1=Yes
DIABETES: 0=Not diabetic
1=Diabetic
PREVAP (Have angina pectoric?): 0=No
1=Yes
PREVCHD (Coronary heart disease?): 0=No
1=Yes
PREVMI (Myocardial infarction?): 0=No
1=Yes
PREVSTRK (Had a stroke?): 0=No
1=Yes
PREVHYP (Hypertensive? sys bp >=140 or dyas bp >= 90): 0=no
1=yes
*/
PROC FORMAT;
VALUE YNfmt 0="No"
1="Yes";
VALUE perfmt 1="Period 1"
2="Period 2"
3="Period 3";
VALUE gndrfmt 1="Men"
2="Women";
RUN;
DATA fghmtemp;
SET fghmtemp;
FORMAT prevap ynfmt.
diabetes ynfmt.
cursmoke ynfmt.
bpmeds ynfmt.
prevchd ynfmt.
prevmi ynfmt.
prevstrk ynfmt.
prevhyp ynfmt.
sex gndrfmt.;
RUN;
5880 ods listing close;ods html5 (id=saspy_internal) file=stdout options(bitmap_mode='inline') device=svg style=HTMLBlue; ods
5880! graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: STDOUT
5881
5882 LIBNAME phc6089 "/folders/myfolders/SAS_Notes/data/";
NOTE: Libref PHC6089 was successfully assigned as follows:
Engine: V9
Physical Name: /folders/myfolders/SAS_Notes/data
5883
5884 *Create a temporary data set, so that we don't save changes to the original data set.;
5885 DATA fghmTemp;
NOTE: Data file PHC6089.FGHM113.DATA is in a format that is native to another host, or the file encoding does not match the session
encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce
performance.
5886 SET phc6089.fghm113;
5887 RUN;
NOTE: There were 500 observations read from the data set PHC6089.FGHM113.
NOTE: The data set WORK.FGHMTEMP has 500 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds
5888
5889 /*Now let's code some variables with some more descriptive values.
5890 SEX (Gender): 1=Men
5891 2=Women
5892 Period (Examination cycle): 1=Period1
5893 2=Period2
5894 3=Period3
5895 BPMEDS (Use of anti-hypertensive meds): 0=Not currently
5896 1=Currently use
5897 CURSMOKE (Currently smoke?): 0=No
5898 1=Yes
5899 DIABETES: 0=Not diabetic
5900 1=Diabetic
5901 PREVAP (Have angina pectoric?): 0=No
5902 1=Yes
5903 PREVCHD (Coronary heart disease?): 0=No
5904 1=Yes
5905 PREVMI (Myocardial infarction?): 0=No
5906 1=Yes
5907 PREVSTRK (Had a stroke?): 0=No
5908 1=Yes
5909 PREVHYP (Hypertensive? sys bp >=140 or dyas bp >= 90): 0=no
5910 1=yes
5911 */
5912
5913 PROC FORMAT;
5914 VALUE YNfmt 0="No"
NOTE: Format YNFMT has been output.
5915 1="Yes";
5916 VALUE perfmt 1="Period 1"
5917 2="Period 2"
NOTE: Format PERFMT has been output.
5918 3="Period 3";
5919 VALUE gndrfmt 1="Men"
NOTE: Format GNDRFMT has been output.
5920 2="Women";
5921 RUN;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
5922
5923 DATA fghmtemp;
5924 SET fghmtemp;
5925 FORMAT prevap ynfmt.
5926 diabetes ynfmt.
5927 cursmoke ynfmt.
5928 bpmeds ynfmt.
5929 prevchd ynfmt.
5930 prevmi ynfmt.
5931 prevstrk ynfmt.
5932 prevhyp ynfmt.
5933 sex gndrfmt.;
5934 RUN;
NOTE: There were 500 observations read from the data set WORK.FGHMTEMP.
NOTE: The data set WORK.FGHMTEMP has 500 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
5935
5936 ods html5 (id=saspy_internal) close;ods listing;
5937
* Step 1 - write code with only macros variables - Let's categorize BMI by quartiles;
%let qvar = bmi;
* Find table name;
ODS trace on;
proc means data = fghmtemp Q1 Median Q3;
var &qvar;
run;
ODS trace off;
proc means data = fghmtemp Q1 Median Q3;
var &qvar;
ODS OUTPUT summary = quartiles;
run;
* See the contents of the ODS table;
proc print data = quartiles;
run;
data _null_;
set quartiles;
call symput("Q1", strip(&qvar._Q1));
call symput("M", strip(&qvar._Median));
call symput("Q3", strip(&qvar._Q3));
run;
%put _user_; * see the value of our macro variables in the log file;
data fghmtemp;
set fghmtemp;
if not missing(&qvar) and &qvar < &Q1 then &qvar._cat = 1;
else if &qvar < &M then &qvar._cat = 2;
else if &qvar < &Q3 then &qvar._cat = 3;
else &qvar._cat = 4;
run;
proc freq data = fghmtemp;
tables &qvar._cat;
run;
The MEANS Procedure
Analysis Variable : BMI Body Mass Index (kr/(M*M) | ||
---|---|---|
Lower Quartile | Median | Upper Quartile |
23.1000000 | 25.2800000 | 27.6300000 |
The MEANS Procedure
Analysis Variable : BMI Body Mass Index (kr/(M*M) | ||
---|---|---|
Lower Quartile | Median | Upper Quartile |
23.1000000 | 25.2800000 | 27.6300000 |
Obs | BMI_Q1 | BMI_Median | BMI_Q3 |
---|---|---|---|
1 | 23.1 | 25.28 | 27.63 |
The FREQ Procedure
bmi_cat | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
1 | 120 | 24.00 | 120 | 24.00 |
2 | 130 | 26.00 | 250 | 50.00 |
3 | 125 | 25.00 | 375 | 75.00 |
4 | 125 | 25.00 | 500 | 100.00 |
Now that we have the basic code working, let’s turn this into a MACRO.
%macro quartilesmacro(mydata, qvar, round, out = out_dat);
ODS select none;
proc means data = &mydata Q1 Median Q3 maxdec = &round;
var &qvar;
ODS OUTPUT summary = quartiles;
run;
ods select all;
data _null_;
set quartiles;
call symput("Q1", strip(&qvar._Q1));
call symput("M", strip(&qvar._Median));
call symput("Q3", strip(&qvar._Q3));
run;
data &out;
set &mydata;
if not missing(&qvar) and &qvar < &Q1 then &qvar._cat = 1;
else if &qvar < &M then &qvar._cat = 2;
else if &qvar < &Q3 then &qvar._cat = 3;
else &qvar._cat = 4;
run;
%mend;
%quartilesmacro(fghmtemp, bmi, round = 2, out = out_bmi);
proc freq data = out_bmi;
tables bmi_cat;
run;
The FREQ Procedure
bmi_cat | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
1 | 120 | 24.00 | 120 | 24.00 |
2 | 130 | 26.00 | 250 | 50.00 |
3 | 125 | 25.00 | 375 | 75.00 |
4 | 125 | 25.00 | 500 | 100.00 |
17.14. Simulations¶
In this example, we will explore the sampling distribution of a sample proportion. a) Generate 500 sample of size 50 from a Binomial distribution with n = 1 and p = 0.3. Do this by generating a dataset with 500 rows and 50 columns all filled with random variates from this binomial distribution. Hint: Use an array and a nested DO loop. b) Calculate the mean of each of these 500 rows using mean(of ) in a DATA step. Save this mean as a new column. c) Plot a histogram of the 500 means calculated in part b and use PROC MEANS to calculate the MEAN and standard deviation of the 500 means calculated in part b. This describes the sampling distribution of p-hat when the population distiribution has probability of success p = 0.3 when the sample size is 30.
DATA sim;
ARRAY data{*} x1-x50;
DO numsim = 1 TO 500;
DO n = 1 to 50;
data(n) = ranbin(1234, 1, 0.3);
END;
p_hat = mean(of x1-x50);
OUTPUT;
END;
RUN;
PROC SGPLOT data = sim;
HISTOGRAM p_hat;
RUN;
PROC MEANS data = sim mean std;
VAR p_hat;
RUN;
The MEANS Procedure
Analysis Variable : p_hat | |
---|---|
Mean | Std Dev |
0.2980400 | 0.0635376 |