17. Solutions to Exercises

17.1. SAS Windowing Environment

Use the sample SAS program code provided above and the Charm City Circulator Ridership dataset to experiment with using SAS by making the following changes to the provided code.

  1. Go through and change the colors using the attribute map dataset palette to something other than what they originally were. See https://support.sas.com/content/dam/SAS/support/en/books/pro-template-made-easy-a-guide-for-sas-users/62007_Appendix.pdf for a large list of colors in SAS.

  2. Change the days you are keeping to show "Sunday" instead of “Saturday".

  3. Change the plot from a line plot to a scatterplot by changing series to scatter in proc sgpanel.

%let working_directory = data; *Replace data with your path to your 
                                directory containing the dataset;

proc import datafile = "&working_directory/Charm_City_Circulator_Ridership.csv" 
    out = circ dbms = csv replace;
    getnames = yes;
    guessingrows = max;
run;

proc sort data = circ;
  by day date daily;
run;

proc transpose data = circ out = long (rename=(_NAME_ = var col1 = number));
  var orangeboardings -- banneraverage;
  by day date daily;
run;

data long2;
  set long;
  var = tranwrd(var, 'Board', '_Board');
  var = tranwrd(var, 'Alight', '_Alight');
  var = tranwrd(var, 'Average', '_Average');
  line = scan(var, 1, '_');
  type = scan(var, 2, '_');
  drop var;
run;

data avg;
  set long2;
  where (type = "Average");
run;

data palette;
  input id $ value $ linecolor $;
  markercolor = linecolor;
  datalines;
lc banner red
lc green yellow
lc orange darkblue
lc purple brown
;
run;

proc sgpanel data = avg dattrmap=palette; 
  where day in ("Monday" "Tuesday" "Friday" "Sunday");
  panelby day;
  scatter Y = number X = date / group = line attrid = lc;
run;
SAS Output
The SGPanel Procedure

17.2. Basic SAS Programming

  1. Submit the following simple SAS program. What does it do?

    PROC PRINT DATA = sashelp.heart (OBS=10);
    RUN;
    
    PROC MEANS DATA = sashelp.heart;
      VAR height weight;
    RUN;
    
PROC PRINT DATA = sashelp.heart (OBS=10);
RUN;

PROC MEANS DATA = sashelp.heart;
  VAR height weight;
RUN;
SAS Output

The SAS System

Obs Status DeathCause AgeCHDdiag Sex AgeAtStart Height Weight Diastolic Systolic MRW Smoking AgeAtDeath Cholesterol Chol_Status BP_Status Weight_Status Smoking_Status
1 Dead Other . Female 29 62.50 140 78 124 121 0 55 .   Normal Overweight Non-smoker
2 Dead Cancer . Female 41 59.75 194 92 144 183 0 57 181 Desirable High Overweight Non-smoker
3 Alive   . Female 57 62.25 132 90 170 114 10 . 250 High High Overweight Moderate (6-15)
4 Alive   . Female 39 65.75 158 80 128 123 0 . 242 High Normal Overweight Non-smoker
5 Alive   . Male 42 66.00 156 76 110 116 20 . 281 High Optimal Overweight Heavy (16-25)
6 Alive   . Female 58 61.75 131 92 176 117 0 . 196 Desirable High Overweight Non-smoker
7 Alive   . Female 36 64.75 136 80 112 110 15 . 196 Desirable Normal Overweight Moderate (6-15)
8 Dead Other . Male 53 65.50 130 80 114 99 0 77 276 High Normal Normal Non-smoker
9 Alive   . Male 35 71.00 194 68 132 124 0 . 211 Borderline Normal Overweight Non-smoker
10 Dead Cerebral Vascular Disease . Male 52 62.50 129 78 124 106 5 82 284 High Normal Normal Light (1-5)

The SAS System

The MEANS Procedure

Variable N Mean Std Dev Minimum Maximum
Height
Weight
5203
5203
64.8131847
153.0866808
3.5827074
28.9154261
51.5000000
67.0000000
76.5000000
300.0000000

This program prints out the first 10 rows of the sashelp.heart dataset and the summary statistics N (sample size), mean, standard deviation, minimum and maxium for the Height and Weight variables.

  1. Submit the following small program. How many DATA steps are there? How man PROC steps? Which steps are executed?

    DATA work.quant;
      SET sashelp.heart; /* Read in from the heart datset */
      KEEP height weight diastolic systolic; *Only keep these variable 
                                          in the new dataset;
    RUN;
    
    PROC PRINT DATA = work.quant (OBS=10);
    RUN;
    
    /*
    PROC MEANS DATA = work.quant MIN Q1 Median Q3 MAX MEAN STD;
    RUN;
    */
    
DATA work.quant;
  SET sashelp.heart; /* Read in from the heart datset */
  KEEP height weight diastolic systolic; *Only keep these variable 
                                       in the new dataset;
RUN;

PROC PRINT DATA = work.quant (OBS=10);
RUN;

/*
PROC MEANS DATA = work.quant MIN Q1 Median Q3 MAX MEAN STD;
RUN;
*/
SAS Output

The SAS System

Obs Height Weight Diastolic Systolic
1 62.50 140 78 124
2 59.75 194 92 144
3 62.25 132 90 170
4 65.75 158 80 128
5 66.00 156 76 110
6 61.75 131 92 176
7 64.75 136 80 112
8 65.50 130 80 114
9 71.00 194 68 132
10 62.50 129 78 124

There is one DATA step and there are two PROC steps; however, the last PROC MEANS is not run since it is within a paragraph comment.

  1. Run PROC CONTENTS on the dataset sashelp.heart. How many observations are in this datset?

PROC CONTENTS data = sashelp.heart;
RUN;
SAS Output

The SAS System

The CONTENTS Procedure

Data Set Name SASHELP.HEART Observations 5209
Member Type DATA Variables 17
Engine V9 Indexes 0
Created 10/25/2018 02:21:26 Observation Length 168
Last Modified 10/25/2018 02:21:26 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label Framingham Heart Study    
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding us-ascii ASCII (ANSI)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 14
First Data Page 1
Max Obs per Page 389
Obs in First Data Page 365
Number of Data Set Repairs 0
Filename /opt/sasinside/SASHome/SASFoundation/9.4/sashelp/heart.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 6235
Access Permission rw-r--r--
Owner Name sas
File Size 960KB
File Size (bytes) 983040
Alphabetic List of Variables and Attributes
# Variable Type Len Label
12 AgeAtDeath Num 8 Age at Death
5 AgeAtStart Num 8 Age at Start
3 AgeCHDdiag Num 8 Age CHD Diagnosed
15 BP_Status Char 7 Blood Pressure Status
14 Chol_Status Char 10 Cholesterol Status
13 Cholesterol Num 8  
2 DeathCause Char 26 Cause of Death
8 Diastolic Num 8  
6 Height Num 8  
10 MRW Num 8 Metropolitan Relative Weight
4 Sex Char 6  
11 Smoking Num 8  
17 Smoking_Status Char 17 Smoking Status
1 Status Char 5  
9 Systolic Num 8  
7 Weight Num 8  
16 Weight_Status Char 11 Weight Status

There are 5209 observations (rows) in the dataset.

  1. Submit the following program to SAS. What is the value of CurrentDate? (This value represents the number of days since Januar 1, 1960, the reference date for the SAS system).

    DATA work.date;
       CurrentDate = Today();
    RUN;
    
    PROC PRINT DATA = work.date;
    RUN;
    
DATA work.date;
  CurrentDate = Today();
RUN;
   
PROC PRINT DATA = work.date;
RUN;

PROC PRINT DATA = work.date;
   format currentdate mmddyy10.;
RUN;
SAS Output

The SAS System

Obs CurrentDate
1 22205

The SAS System

Obs CurrentDate
1 10/17/2020

The value of current date will change depending on when you submit this program. I submitted this program in 10/17/2020 and received the value 22205 or 22205 days since January 1, 1960.

17.3. Data I/O

  1. Use PROC IMPORT to read in the Youth Tobacco study, Youth_Tobacco_Survey_YTS_Data.csv and name it youth.

PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Youth_Tobacco_Survey_YTS_Data.csv"
            out = youth dbms = csv replace;
    getnames = yes;
    guessingrows = 500;
RUN;

PROC PRINT data = youth (OBS=5);
RUN;
SAS Output

The SAS System

Obs YEAR LocationAbbr LocationDesc TopicType TopicDesc MeasureDesc DataSource Response Data_Value_Unit Data_Value_Type Data_Value Data_Value_Footnote_Symbol Data_Value_Footnote Data_Value_Std_Err Low_Confidence_Limit High_Confidence_Limit Sample_Size Gender Race Age Education GeoLocation TopicTypeId TopicId MeasureId StratificationID1 StratificationID2 StratificationID3 StratificationID4 SubMeasureID DisplayOrder
1 2015 AZ Arizona Tobacco Use - Survey Data Cessation (Youth) Percent of Current Smokers Who Want to Quit YTS   % Percentage . * Data in these cells have been suppressed because of a small sample size . . . . Overall All Races All Ages Middle School (34.865970280000454, -111.76381127699972) BEH 105BEH 170CES 1GEN 8AGE 6RAC 1EDU YTS01 1
2 2015 AZ Arizona Tobacco Use - Survey Data Cessation (Youth) Percent of Current Smokers Who Want to Quit YTS   % Percentage . * Data in these cells have been suppressed because of a small sample size . . . . Male All Races All Ages Middle School (34.865970280000454, -111.76381127699972) BEH 105BEH 170CES 2GEN 8AGE 6RAC 1EDU YTS02 2
3 2015 AZ Arizona Tobacco Use - Survey Data Cessation (Youth) Percent of Current Smokers Who Want to Quit YTS   % Percentage . * Data in these cells have been suppressed because of a small sample size . . . . Female All Races All Ages Middle School (34.865970280000454, -111.76381127699972) BEH 105BEH 170CES 3GEN 8AGE 6RAC 1EDU YTS03 3
4 2015 AZ Arizona Tobacco Use - Survey Data Cessation (Youth) Quit Attempt in Past Year Among Current Cigarette Smokers YTS   % Percentage . * Data in these cells have been suppressed because of a small sample size . . . . Overall All Races All Ages Middle School (34.865970280000454, -111.76381127699972) BEH 105BEH 169QUA 1GEN 8AGE 6RAC 1EDU YTS04 4
5 2015 AZ Arizona Tobacco Use - Survey Data Cessation (Youth) Quit Attempt in Past Year Among Current Cigarette Smokers YTS   % Percentage . * Data in these cells have been suppressed because of a small sample size . . . . Male All Races All Ages Middle School (34.865970280000454, -111.76381127699972) BEH 105BEH 169QUA 2GEN 8AGE 6RAC 1EDU YTS05 5
  1. Download an Excel version of the Monuments dataset, Monuments.xlsx, from CANVAS. Use PROC IMPORT to read in the dataset and call the output mon.

PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Monuments.xlsx"
            out = mon dbms = xlsx replace;
    getnames = yes;
RUN;

PROC PRINT data = mon (OBS=5);
RUN;
SAS Output

The SAS System

Obs name zipCode neighborhood councilDistrict policeDistrict Location 1
1 James Cardinal Gibbons 21201 Downtown 11 CENTRAL 408 CHARLES ST Baltimore, MD
2 The Battle Monument 21202 Downtown 11 CENTRAL  
3 Negro Heroes of the U.S Monument 21202 Downtown 11 CENTRAL  
4 Star Bangled Banner 21202 Downtown 11 CENTRAL 100 HOLLIDAY ST Baltimore, MD
5 Flame at the Holocaust Monument 21202 Downtown 11 CENTRAL 50 MARKET PL Baltimore, MD
  1. Write a DATA step that will read in the following column formatted data using DATALINES and print the results using PROC PRINT. This dataset contains information on esophageal cancer in Ille-et-Vilaine, France. The variables are

    • Age group

    • Alcohol consumption

    • Tobacco consumption

    • Number of Cases

    • Number of Controls

Write a DATA step to read in the data and print the results. Note: the first line is to help you count the columns. Each number corresponds to a multiple of 5.

----1----2----3----4----5----6----7----8
75+ 0-39g/day 0-9g/day      1        18
75+ 0-39g/day    10-19      2         6
75+ 0-39g/day      30+      1         3
75+     40-79 0-9g/day      2         5
75+     40-79    10-19      1         3
75+     40-79    20-29      0         3
75+     40-79      30+      1         1
75+    80-119 0-9g/day      1         1
75+    80-119    10-19      1         1
75+      120+ 0-9g/day      2         2
75+      120+    10-19      1         1
DATA eso;
   INPUT aggrp $ 1-3 alc $ 5-13 smoke $ 15-22 ncases 29 ncontrl 38-39;
   DATALINES;
75+ 0-39g/day 0-9g/day      1        18
75+ 0-39g/day    10-19      2         6
75+ 0-39g/day      30+      1         3
75+     40-79 0-9g/day      2         5
75+     40-79    10-19      1         3
75+     40-79    20-29      0         3
75+     40-79      30+      1         1
75+    80-119 0-9g/day      1         1
75+    80-119    10-19      1         1
75+      120+ 0-9g/day      2         2
75+      120+    10-19      1         1
;
RUN;

PROC PRINT DATA=eso;
RUN;
SAS Connection established. Subprocess id is 2198
SAS Output

The SAS System

Obs aggrp alc smoke ncases ncontrl
1 75+ 0-39g/day 0-9g/day 1 18
2 75+ 0-39g/day 10-19 2 6
3 75+ 0-39g/day 30+ 1 3
4 75+ 40-79 0-9g/day 2 5
5 75+ 40-79 10-19 1 3
6 75+ 40-79 20-29 0 3
7 75+ 40-79 30+ 1 1
8 75+ 80-119 0-9g/day 1 1
9 75+ 80-119 10-19 1 1
10 75+ 120+ 0-9g/day 2 2
11 75+ 120+ 10-19 1 1

17.4. SAS Variables and Assignment Statements

  1. Use the following SAS data step to create the new variables Grade and Course defined below:

data school;
   input Age Quiz : $1. Midterm Final;
   /* Add you statements here */
datalines;
12 A 92 95
12 B 88 88
13 C 78 75
13 A 92 93
12 F 55 62
13 B 88 82
;

Using If-Then-Else statements, compute two new variables as follows:

  • Grade (numeric), with a value of 6 if Age is 12 and a value of 8 if Age is 13.

  • The quiz grades have numerical equivalents as follows: A = 95, B = 85, C = 75, D = 70, and F = 65. Using this information, compute a course grade (Course) as a weighted average of the Quiz (20%), Midterm (30%) and Final (50%).

data school;
   input Age Quiz : $1. Midterm Final;
   IF Age = 12 THEN Grade = 6;
   ELSE IF AGE = 13 THEN Grade = 8;
   IF quiz = "A" THEN Course = 0.2*95 + 0.3*midterm + 0.5*final;
   ELSE IF quiz = "B" THEN Course = 0.2*85 + 0.3*midterm + 0.5*final;
   ELSE IF quiz = "C" THEN Course = 0.2*75 + 0.3*midterm + 0.5*final;
   ELSE IF quiz = "D" THEN Course = 0.2*70 + 0.3*midterm + 0.5*final;
   ELSE IF quiz = "F" THEN Course = 0.2*65 + 0.3*midterm + 0.5*final;
datalines;
12 A 92 95
12 B 88 88
13 C 78 75
13 A 92 93
12 F 55 62
13 B 88 82
;
RUN;

PROC PRINT DATA = school;
RUN;
SAS Output

The SAS System

Obs Age Quiz Midterm Final Grade Course
1 12 A 92 95 6 94.1
2 12 B 88 88 6 87.4
3 13 C 78 75 8 75.9
4 13 A 92 93 8 93.1
5 12 F 55 62 6 60.5
6 13 B 88 82 8 84.4

17.5. Working with Your Data

  1. Create a temporary SAS dataset called heart from the sashelp.heart dataset. The library sashelp if a permanent library that comes with SAS.

DATA heart;
  SET sashelp.heart;
RUN;

1348  ods listing close;ods html5 (id=saspy_internal) file=stdout options(bitmap_mode='inline') device=svg style=HTMLBlue; ods
1348! graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: STDOUT
1349
1350 DATA heart;
1351 SET sashelp.heart;
1352 RUN;
NOTE: There were 5209 observations read from the data set SASHELP.HEART.
NOTE: The data set WORK.HEART has 5209 observations and 17 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

1353
1354 ods html5 (id=saspy_internal) close;ods listing;

1355
  1. View the description of the heart dataset by running PROC CONTENTS. How many observations (rows) and variables (columns) are in the dataset?

PROC CONTENTS DATA = heart;
RUN;
SAS Output

The SAS System

The CONTENTS Procedure

Data Set Name WORK.HEART Observations 5209
Member Type DATA Variables 17
Engine V9 Indexes 0
Created 10/17/2020 21:35:04 Observation Length 168
Last Modified 10/17/2020 21:35:04 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 14
First Data Page 1
Max Obs per Page 389
Obs in First Data Page 365
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671720
Access Permission rw-r--r--
Owner Name sasdemo
File Size 960KB
File Size (bytes) 983040
Alphabetic List of Variables and Attributes
# Variable Type Len Label
12 AgeAtDeath Num 8 Age at Death
5 AgeAtStart Num 8 Age at Start
3 AgeCHDdiag Num 8 Age CHD Diagnosed
15 BP_Status Char 7 Blood Pressure Status
14 Chol_Status Char 10 Cholesterol Status
13 Cholesterol Num 8  
2 DeathCause Char 26 Cause of Death
8 Diastolic Num 8  
6 Height Num 8  
10 MRW Num 8 Metropolitan Relative Weight
4 Sex Char 6  
11 Smoking Num 8  
17 Smoking_Status Char 17 Smoking Status
9 Systolic Num 8  
7 Weight Num 8  
16 Weight_Status Char 11 Weight Status
1 condition Char 5  

There are 5209 observations (rows) and 17 variables (columns).

  1. Use the RENAME= option to rename Status to condition. Show the variables table from PROC CONTENTS to see that the name has changed.

DATA heart2 (RENAME = (status = condition));
   SET heart;
RUN;

PROC CONTENTS data = heart2;
RUN;
SAS Output

The SAS System

The CONTENTS Procedure

Data Set Name WORK.HEART2 Observations 5209
Member Type DATA Variables 17
Engine V9 Indexes 0
Created 10/17/2020 21:39:47 Observation Length 168
Last Modified 10/17/2020 21:39:47 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 14
First Data Page 1
Max Obs per Page 389
Obs in First Data Page 365
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart2.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671720
Access Permission rw-r--r--
Owner Name sasdemo
File Size 960KB
File Size (bytes) 983040
Alphabetic List of Variables and Attributes
# Variable Type Len Label
12 AgeAtDeath Num 8 Age at Death
5 AgeAtStart Num 8 Age at Start
3 AgeCHDdiag Num 8 Age CHD Diagnosed
15 BP_Status Char 7 Blood Pressure Status
14 Chol_Status Char 10 Cholesterol Status
13 Cholesterol Num 8  
2 DeathCause Char 26 Cause of Death
8 Diastolic Num 8  
6 Height Num 8  
10 MRW Num 8 Metropolitan Relative Weight
4 Sex Char 6  
11 Smoking Num 8  
17 Smoking_Status Char 17 Smoking Status
9 Systolic Num 8  
7 Weight Num 8  
16 Weight_Status Char 11 Weight Status
1 condition Char 5  
  1. Create a subset of heart called heart_sub that contains only the columns height, weight, and diastolic. What are the dimensions of heart_sub? Print the first five rows of heart_sub.

DATA heart_sub;
   SET heart;
   KEEP height weight diastolic;
RUN;

PROC CONTENTS data = heart_sub;
RUN;

PROC PRINT data = heart_sub(OBS=5);
RUN;
SAS Output

The SAS System

The CONTENTS Procedure

Data Set Name WORK.HEART_SUB Observations 5209
Member Type DATA Variables 3
Engine V9 Indexes 0
Created 10/17/2020 21:39:50 Observation Length 24
Last Modified 10/17/2020 21:39:50 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 2
First Data Page 1
Max Obs per Page 2714
Obs in First Data Page 2625
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671721
Access Permission rw-r--r--
Owner Name sasdemo
File Size 192KB
File Size (bytes) 196608
Alphabetic List of Variables and Attributes
# Variable Type Len
3 Diastolic Num 8
1 Height Num 8
2 Weight Num 8

The SAS System

Obs Height Weight Diastolic
1 62.50 140 78
2 59.75 194 92
3 62.25 132 90
4 65.75 158 80
5 66.00 156 76

The new dataset heart_sub contains 5209 rows and 3 columns.

  1. Subset the patients (rows) of heart that have a diastolic blood pressure (diastolic) more than 90 into a dataset called heart_sub2. How many are there? Print the first five rows of heart_sub2.

DATA heart_sub2;
   SET heart;
   WHERE diastolic > 90;
RUN;

PROC CONTENTS data = heart_sub2;
RUN;

PROC PRINT data = heart_sub2(OBS=5);
RUN;
SAS Output

The SAS System

The CONTENTS Procedure

Data Set Name WORK.HEART_SUB2 Observations 1366
Member Type DATA Variables 17
Engine V9 Indexes 0
Created 10/17/2020 21:39:55 Observation Length 168
Last Modified 10/17/2020 21:39:55 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 4
First Data Page 1
Max Obs per Page 389
Obs in First Data Page 365
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub2.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671723
Access Permission rw-r--r--
Owner Name sasdemo
File Size 320KB
File Size (bytes) 327680
Alphabetic List of Variables and Attributes
# Variable Type Len Label
12 AgeAtDeath Num 8 Age at Death
5 AgeAtStart Num 8 Age at Start
3 AgeCHDdiag Num 8 Age CHD Diagnosed
15 BP_Status Char 7 Blood Pressure Status
14 Chol_Status Char 10 Cholesterol Status
13 Cholesterol Num 8  
2 DeathCause Char 26 Cause of Death
8 Diastolic Num 8  
6 Height Num 8  
10 MRW Num 8 Metropolitan Relative Weight
4 Sex Char 6  
11 Smoking Num 8  
17 Smoking_Status Char 17 Smoking Status
1 Status Char 5  
9 Systolic Num 8  
7 Weight Num 8  
16 Weight_Status Char 11 Weight Status

The SAS System

Obs Status DeathCause AgeCHDdiag Sex AgeAtStart Height Weight Diastolic Systolic MRW Smoking AgeAtDeath Cholesterol Chol_Status BP_Status Weight_Status Smoking_Status
1 Dead Cancer . Female 41 59.75 194 92 144 183 0 57 181 Desirable High Overweight Non-smoker
2 Alive   . Female 58 61.75 131 92 176 117 0 . 196 Desirable High Overweight Non-smoker
3 Alive   . Female 42 67.75 162 96 138 119 1 . 200 Borderline High Overweight Light (1-5)
4 Dead Coronary Heart Disease 71 Female 49 60.50 153 110 196 140 5 73 221 Borderline High Overweight Light (1-5)
5 Alive   43 Male 33 66.50 172 106 146 127 0 . 247 High High Overweight Non-smoker

There are 1366 patients with a diastolic blood pressure above 90.

  1. Subset the patients (rows) of heart that have a diastolic blood pressure less than or equal to 90 and have a weight more than 180 pounds into a dataset called heart_sub3. How many are there? Print the first five rows of heart_sub3.

DATA heart_sub3;
   SET heart;
   WHERE diastolic <= 90 AND weight > 180;
RUN;

PROC CONTENTS data = heart_sub3;
RUN;

PROC PRINT data = heart_sub3(OBS=5);
RUN;
SAS Output

The SAS System

The CONTENTS Procedure

Data Set Name WORK.HEART_SUB3 Observations 481
Member Type DATA Variables 17
Engine V9 Indexes 0
Created 10/17/2020 21:43:36 Observation Length 168
Last Modified 10/17/2020 21:43:36 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 2
First Data Page 1
Max Obs per Page 389
Obs in First Data Page 365
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub3.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671722
Access Permission rw-r--r--
Owner Name sasdemo
File Size 192KB
File Size (bytes) 196608
Alphabetic List of Variables and Attributes
# Variable Type Len Label
12 AgeAtDeath Num 8 Age at Death
5 AgeAtStart Num 8 Age at Start
3 AgeCHDdiag Num 8 Age CHD Diagnosed
15 BP_Status Char 7 Blood Pressure Status
14 Chol_Status Char 10 Cholesterol Status
13 Cholesterol Num 8  
2 DeathCause Char 26 Cause of Death
8 Diastolic Num 8  
6 Height Num 8  
10 MRW Num 8 Metropolitan Relative Weight
4 Sex Char 6  
11 Smoking Num 8  
17 Smoking_Status Char 17 Smoking Status
1 Status Char 5  
9 Systolic Num 8  
7 Weight Num 8  
16 Weight_Status Char 11 Weight Status

The SAS System

Obs Status DeathCause AgeCHDdiag Sex AgeAtStart Height Weight Diastolic Systolic MRW Smoking AgeAtDeath Cholesterol Chol_Status BP_Status Weight_Status Smoking_Status
1 Alive   . Male 35 71.00 194 68 132 124 0 . 211 Borderline Normal Overweight Non-smoker
2 Alive   . Female 50 67.50 185 88 150 136 15 . 228 Borderline High Overweight Moderate (6-15)
3 Alive   . Male 42 72.25 182 78 136 113 0 . 221 Borderline Normal Overweight Non-smoker
4 Alive   68 Male 40 70.00 189 78 124 124 0 . 319 High Normal Overweight Non-smoker
5 Alive   68 Male 40 70.00 195 76 132 128 20 . 205 Borderline Normal Overweight Heavy (16-25)

There are 481 patients with a diastolic blood pressure less than or equal to 90 and with a weight above 180 pounds.

  1. Create a subset of heart that only contains the columns weight, systolic, and cholesterol for those that are heavy smokers smoker (Smoking_Status = “Heavy (16-25)”). Save this to heart_sub. What are the dimensions of the dataset? Print the first five rows of heart_sub.

DATA heart_sub;
   SET heart;
   WHERE Smoking_Status = "Heavy (16-25)";
   KEEP weight systolic cholesterol;
RUN;

PROC CONTENTS data = heart_sub;
RUN;

PROC PRINT data = heart_sub(OBS=5);
RUN;
SAS Output

The SAS System

The CONTENTS Procedure

Data Set Name WORK.HEART_SUB Observations 1046
Member Type DATA Variables 3
Engine V9 Indexes 0
Created 10/17/2020 21:47:19 Observation Length 24
Last Modified 10/17/2020 21:47:19 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 1
First Data Page 1
Max Obs per Page 2714
Obs in First Data Page 1046
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671725
Access Permission rw-r--r--
Owner Name sasdemo
File Size 128KB
File Size (bytes) 131072
Alphabetic List of Variables and Attributes
# Variable Type Len
3 Cholesterol Num 8
2 Systolic Num 8
1 Weight Num 8

The SAS System

Obs Weight Systolic Cholesterol
1 156 110 281
2 195 132 205
3 181 144 223
4 152 144 263
5 196 128 .

The new heart_sub dataset contains 1046 rows and three columns.

  1. Reorder the rows of heart_sub in decreasing order by cholesterol. Print the first 5 rows of the sorted dataset.

PROC SORT data = heart_sub out = heart_sub_srt;
  BY descending cholesterol;
RUN;

PROC PRINT data = heart_sub_srt(OBS=5);
RUN;
SAS Output

The SAS System

Obs Weight Systolic Cholesterol
1 169 155 568
2 205 154 534
3 194 150 429
4 104 122 392
5 137 185 368

17.6. Data Summarization

  1. How many bike lanes are currently in Baltimore? You can assume that each observation/row is a different bike lane.

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
            out = bike dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC CONTENTS data = bike;
RUN;
SAS Output

The SAS System

The CONTENTS Procedure

Data Set Name WORK.BIKE Observations 1631
Member Type DATA Variables 9
Engine V9 Indexes 0
Created 10/17/2020 22:01:39 Observation Length 136
Last Modified 10/17/2020 22:01:39 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 4
First Data Page 1
Max Obs per Page 481
Obs in First Data Page 459
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/bike.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671721
Access Permission rw-r--r--
Owner Name sasdemo
File Size 320KB
File Size (bytes) 327680
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
3 block Char 29 $29. $29.
9 dateInstalled Num 8 BEST12. BEST32.
8 length Num 8 BEST12. BEST32.
2 name Char 20 $20. $20.
5 numLanes Num 8 BEST12. BEST32.
6 project Char 25 $25. $25.
7 route Char 13 $13. $13.
1 subType Char 6 $6. $6.
4 type Char 15 $15. $15.

There are 1631 rows.

  1. How many (a) feet and (b) miles of total bike lanes are currently in Baltimore? (The length variable provides the length in feet.)

PROC MEANS data = bike sum;
   VAR length;
RUN;
SAS Output

The SAS System

The MEANS Procedure

Analysis Variable : length
Sum
439447.59
DATA miles;
   length_mi = 439447.59 / 5280;
RUN;

PROC PRINT data = miles;
RUN;
SAS Output

The SAS System

Obs length_mi
1 83.2287
  1. How many types (type) bike lanes are there? Which type (a) occurs the most and (b) has the longest average bike lane length?

PROC FREQ data = bike;
  TABLE type;
RUN;

PROC MEANS data = bike noprint;
   CLASS type;
   VAR length;
   OUTPUT out = type_mean mean = length_mean;
RUN;

PROC SORT data = type_mean;
  BY descending length_mean;
RUN;

PROC PRINT DATA = type_mean;
RUN;
SAS Output

The SAS System

The FREQ Procedure

type Frequency Percent Cumulative
Frequency
Cumulative
Percent
Frequency Missing = 9
BIKE BOULEVARD 49 3.02 49 3.02
BIKE LANE 621 38.29 670 41.31
CONTRAFLOW 13 0.80 683 42.11
SHARED BUS BIKE 39 2.40 722 44.51
SHARROW 589 36.31 1311 80.83
SIDEPATH 7 0.43 1318 81.26
SIGNED ROUTE 304 18.74 1622 100.00

The SAS System

Obs type _TYPE_ _FREQ_ length_mean
1 SIDEPATH 1 7 665.87751019
2 BIKE LANE 1 621 300.24183502
3 SHARED BUS BIKE 1 39 276.66575695
4   0 1622 269.48876975
5 SIGNED ROUTE 1 304 263.57986593
6 SHARROW 1 589 243.8718592
7 BIKE BOULEVARD 1 49 197.27815495
8 CONTRAFLOW 1 13 136.46646088

There are 7 bike lane types (8 if you include missing as a category). The lane type that occurrs the most often is bike lane with a count of 621. The lane type sidepath has the longest average length at 665.88 ft.

  1. How many different projects (project) do the bike lanes fall into? Which project category has the longest average bike lane length?

PROC FREQ data = bike;
  TABLE project;
RUN;

PROC MEANS data = bike noprint;
   CLASS project;
   VAR length;
   OUTPUT out = project_mean mean = length_mean;
RUN;

PROC SORT data = project_mean;
  BY descending length_mean;
RUN;

PROC PRINT DATA = project_mean;
RUN;
SAS Output

The SAS System

The FREQ Procedure

project Frequency Percent Cumulative
Frequency
Cumulative
Percent
Frequency Missing = 74
CHARM CITY CIRCULATOR 39 2.50 39 2.50
COLLEGETOWN 339 21.77 378 24.28
COLLEGETOWN NETWORK 13 0.83 391 25.11
ENGINEERING CONSTRUCTION 12 0.77 403 25.88
GUILFORD AVE BIKE BLVD 49 3.15 452 29.03
MAINTENANCE 4 0.26 456 29.29
OPERATION ORANGE CONE 458 29.42 914 58.70
PARK HEIGHTS BIKE NETWORK 172 11.05 1086 69.75
PLANNING TRAFFIC 18 1.16 1104 70.91
SOUTHEAST BIKE NETWORK 323 20.75 1427 91.65
TRAFFIC 51 3.28 1478 94.93
TRAFFIC CALMING 79 5.07 1557 100.00

The SAS System

Obs project _TYPE_ _FREQ_ length_mean
1 MAINTENANCE 1 4 1942.1522856
2 ENGINEERING CONSTRUCTION 1 12 512.09762483
3 TRAFFIC 1 51 419.52878809
4 COLLEGETOWN 1 339 320.68363126
5 PARK HEIGHTS BIKE NETWORK 1 172 283.22519367
6 CHARM CITY CIRCULATOR 1 39 276.66575695
7   0 1557 272.0534698
8 TRAFFIC CALMING 1 79 268.53137627
9 OPERATION ORANGE CONE 1 458 250.07840273
10 COLLEGETOWN NETWORK 1 13 213.63728443
11 SOUTHEAST BIKE NETWORK 1 323 210.82827388
12 PLANNING TRAFFIC 1 18 209.42893827
13 GUILFORD AVE BIKE BLVD 1 49 197.27815495

There are 12 project types (13 if you include missing as a project type). The maintenance project has the longest average bike lane length at 1942.15 ft.

  1. What was the average bike lane length per year that they were installed? (Be sure to first set dateInstalled to . if it is equal to zero.)

DATA bike;
   SET bike;
   IF dateinstalled = 0 THEN dateinstalled = .;
RUN;

PROC MEANS data = bike mean;
  CLASS dateinstalled;
  VAR length;
RUN;
SAS Output

The SAS System

The MEANS Procedure

Analysis Variable : length
dateInstalled N Obs Mean
2006 2 1469.35
2007 368 309.8903287
2008 206 249.4810027
2009 86 407.2947463
2010 625 245.6015777
2011 101 232.8127593
2012 107 271.4321825
2013 10 290.4263702
  1. (a) Numerically and (b) graphically describe the distribution of bike lane lengths (length).

PROC UNIVARIATE data = bike plots;
  VAR length;
RUN;
SAS Output

The SAS System

The UNIVARIATE Procedure

Variable: length

Moments
N 1631 Sum Weights 1631
Mean 269.434449 Sum Observations 439447.587
Std Deviation 277.706646 Variance 77120.9813
Skewness 5.1297861 Kurtosis 43.3888947
Uncorrected SS 244109518 Corrected SS 125707200
Coeff Variation 103.070208 Std Error Mean 6.87637089
Basic Statistical Measures
Location Variability
Mean 269.4344 Std Deviation 277.70665
Median 200.3027 Variance 77121
Mode . Range 3749
    Interquartile Range 217.16479
Tests for Location: Mu0=0
Test Statistic p Value
Student's t t 39.18265 Pr > |t| <.0001
Sign M 815 Pr >= |M| <.0001
Signed Rank S 664632.5 Pr >= |S| <.0001
Quantiles (Definition 5)
Level Quantile
100% Max 3749.3226
99% 1341.8291
95% 664.2305
90% 502.2232
75% Q3 341.1493
50% Median 200.3027
25% Q1 123.9845
10% 67.8206
5% 45.3655
1% 14.8383
0% Min 0.0000
Extreme Observations
Lowest Highest
Value Obs Value Obs
0.00000 4 2536.73 1284
5.51372 523 2715.20 1344
6.85712 724 2916.53 903
8.16243 490 3467.68 904
9.71914 595 3749.32 3
Plots for length
  1. Describe the distribution of bike lane lengths numerically and graphically after stratifying them by (a) type and then by (b) number of lanes (numLanes).

PROC MEANS data=bike mean std;
   CLASS type;
   VAR length;
RUN;

PROC SGPLOT data = bike;
   VBOX length / category = type;
RUN;

PROC MEANS data=bike mean std;
   CLASS numlanes;
   VAR length;
RUN;

PROC SGPLOT data = bike;
   VBOX length / category = numlanes;
RUN;
SAS Output

The SAS System

The MEANS Procedure

Analysis Variable : length
type N Obs Mean Std Dev
BIKE BOULEVARD 49 197.2781549 113.1318019
BIKE LANE 621 300.2418350 314.6605599
CONTRAFLOW 13 136.4664609 48.5165725
SHARED BUS BIKE 39 276.6657570 140.1811711
SHARROW 589 243.8718592 188.4222254
SIDEPATH 7 665.8775102 619.3894227
SIGNED ROUTE 304 263.5798659 347.8501541

The SGPlot Procedure

The SAS System

The MEANS Procedure

Analysis Variable : length
numLanes N Obs Mean Std Dev
0 21 308.3767969 201.5548096
1 900 277.9016282 317.8281529
2 710 257.5495894 218.6179448

The SGPlot Procedure

17.7. SAS Formats and Dates

  1. Use proc freq to make a table of the different bike lane types.

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
            out = bike dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC FREQ data = bike;
   TABLE type;
RUN;
SAS Output

The SAS System

The FREQ Procedure

type Frequency Percent Cumulative
Frequency
Cumulative
Percent
Frequency Missing = 9
BIKE BOULEVARD 49 3.02 49 3.02
BIKE LANE 621 38.29 670 41.31
CONTRAFLOW 13 0.80 683 42.11
SHARED BUS BIKE 39 2.40 722 44.51
SHARROW 589 36.31 1311 80.83
SIDEPATH 7 0.43 1318 81.26
SIGNED ROUTE 304 18.74 1622 100.00
  1. Create a format that changes changes the bike lane types by mapping “SIDEPATH”, “BIKE BOULEVARD”, and “BIKE LANE” to itself but does not provide a value for the other types. Apply this format to type and make a table of type using PROC FREQ. What happens to the other bike lane types?

PROC FORMAT;
   VALUE $typefmt "SIDEPATH" = "SIDEPATH"
                   "BIKE BOULEVARD" = "BIKE BOULEVARD"
                   "BIKE LANE" = "BIKE LANE"
                   OTHER = " ";
RUN;

DATA bike2;
  SET bike;
  FORMAT type $typefmt.;
RUN;

PROC FREQ data = bike2;
   TABLE type;
RUN;
SAS Output

The SAS System

The FREQ Procedure

type Frequency Percent Cumulative
Frequency
Cumulative
Percent
Frequency Missing = 954
BIKE BOULEVARD 49 7.24 49 7.24
BIKE LANE 621 91.73 670 98.97
SIDEPATH 7 1.03 677 100.00

We have translated all but BIKE BOULEVARD, BIKE LANE, and SIDEPATE to missing, so we only get these three listed as categories in the table.

  1. Create a format that changes the bike lane types to be “CONTRAFLOW”, “SHARED BUS BIKE”, “SHARROW”, “SIGNED ROUTE”, or “OTHER” if the lane type is anything else.

PROC FORMAT;
   VALUE $type2fmt "CONTRAFLOW" = "CONTRAFLOW"
                   "SHARED BUS BIKE" = "SHARED BUS BIKE"
                   "SHARROW" = "SHARROW"
                   "SIGNED ROUTE" = "SIGNED ROUTE"
                   "SIDEPATH", "BIKE BOULEVARD", "BIKE LANE" = "OTHER";
RUN;

DATA bike3;
  SET bike;
  FORMAT type $type2fmt.;
RUN;

PROC FREQ data = bike3;
   TABLE type;
RUN;
SAS Output

The SAS System

The FREQ Procedure

type Frequency Percent Cumulative
Frequency
Cumulative
Percent
Frequency Missing = 9
OTHER 677 41.74 677 41.74
CONTRAFLOW 13 0.80 690 42.54
SHARED BUS BIKE 39 2.40 729 44.94
SHARROW 589 36.31 1318 81.26
SIGNED ROUTE 304 18.74 1622 100.00
  1. Write a datatep that can be used to read in the following small dataset with three dates and one time. Then print the dataset using readable date formats.


DATA temp;
  INPUT date1 /*informat1*/ +1 date2 /*informat2*/ +1
        date3 /*informat3*/ @32 time /*informat4*/;
  DATALINES;
2014/02/14 06Jan2018 4/5/2016   03:2:22
;
RUN;

DATA temp;
INPUT date1 yymmdd10. +1 date2 DATE9. +1
     date3 mmddyy10. @32 time time8.;
DATALINES;
2014/02/14 06Jan2018 4/5/2016   03:2:22
;
RUN;

PROC PRINT data = temp;
  FORMAT date1 date2 date3 mmddyy10. time time8.;
RUN;
SAS Output

The SAS System

Obs date1 date2 date3 time
1 02/14/2014 01/06/2018 04/05/2016 3:02:22

17.8. Character Functions

  1. Read in the Baltimore city employee salaries dataset, Baltimore_City_Employee_Salaries_FY2015.csv, using PROC IMPORT.

PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Baltimore_City_Employee_Salaries_FY2015.csv"
            out = sal dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC PRINT data = sal(OBS=5);
RUN;
SAS Output

The SAS System

Obs name JobTitle AgencyID Agency HireDate AnnualSalary GrossPay
1 Aaron,Patricia G Facilities/Office Services II A03031 OED-Employment Dev (031) 10/24/1979 55,314 53,626
2 Aaron,Petra L ASSISTANT STATE'S ATTORNEY A29045 States Attorneys Office (045) 09/25/2006 74,000 73,000
3 Abaineh,Yohannes T EPIDEMIOLOGIST A65026 HLTH-Health Department (026) 07/23/2009 64,500 64,404
4 Abbene,Anthony M POLICE OFFICER A99005 Police Department (005) 07/24/2013 46,309 59,620
5 Abbey,Emmanuel CONTRACT SERV SPEC II A40001 M-R Info Technology (001) 05/01/2013 60,060 54,060
  1. Make a temporary SAS dataset called health_sal using the salaries dataset, with only the agencies (JobTitle) of those with “fire” (anywhere in the job title), if any, in the name. Be sure to ignore case in the string matching for ‘fire’.

DATA health_sal;
    SET sal;
    WHERE index(lowcase(JobTitle), 'fire') > 0;
RUN;

PROC PRINT data=health_sal(OBS=5);
    VAR JobTitle;
RUN;
SAS Output

The SAS System

Obs JobTitle
1 EMT Firefighter Suppression
2 Fire Pump Operator Supp ALS
3 Firefighter/Paramedic Suppress
4 Fire Pump Operator Supp ALS
5 Firefighter Suppression
  1. Make a temporary data set called trans which contains only agencies that contain “TRANS” (case sensitive).

DATA trans;
    SET sal;
    WHERE index(JobTitle, 'TRANS') > 0;
RUN;

PROC PRINT data=trans(OBS=5);
    VAR JobTitle;
RUN;
SAS Output

The SAS System

Obs JobTitle
1 TRANSPORTATION ASSOC II
2 TAX TRANSFER CLERK I
3 GENL SUPT TRANSPORTATION MAINT
4 TAX TRANSFER CLERK I
5 TAX TRANSFER CLERK I
  1. What is/are the profession(s) of people who have “abra” in their name for Baltimore’s Salaries? Case should be ignored.

PROC PRINT data=sal;
    VAR name JobTitle;
    WHERE index(lowcase(name), 'abra') > 0;
RUN;
SAS Output

The SAS System

Obs name JobTitle
26 Abraham,Donta D LABORER (Hourly)
27 Abraham,Santhosh ACCOUNTANT I
28 Abraham,Sharon M HOUSING INSPECTOR
29 Abrahams,Brandon A POLICE OFFICER TRAINEE
30 Abrams,Maria OFFICE SERVICES ASSISTANT II S
31 Abrams,Maxine COLLECTIONS REPRESENTATIVE I
32 Abrams,Terry RECREATION ARTS INSTRUCTOR
958 Bey,Abraham PROCUREMENT SPECIALIST II
3761 Elgamil,Abraham D AUDITOR SUPV
4545 Gatto,Abraham M POLICE OFFICER
11107 Schwartz,Abraham M GENERAL COUNSEL
12789 Velez,Abraham L POLICE OFFICER (EID)
  1. Create a subset of the salaries dataset that only includes the Police Department, Fire Department and Sheriff’s Office. Use the Agency variable with (case sensitive) string matching. Call this emer. How many employees are in this new dataset?

DATA emer;
   SET sal;
   WHERE index(agency, "Sheriff's Office") > 0 OR
         index(agency, "Police Department") > 0 OR
         index(agency, "Fire Department") > 0;
RUN;

PROC PRINT data=emer(OBS=5);
   VAR agency;
RUN;
SAS Output

The SAS System

Obs Agency
1 Police Department (005)
2 Fire Department (120)
3 Police Department (127)
4 Sheriff's Office (410)
5 Fire Department (465)
  1. Create a variable called dept in the emer data set by extracting the the part of the Agency variable up until ‘ment’ or ‘ice’. E.g. we want to extract all characters up until ment or ice (we can group in regex using parentheses) and then discard the rest. Use:


re = prxparse('/.*(ment|ice)/');
call prxsubstr(re, Agency, pos, len);
dept = substr(Agency, pos, len);
DATA emer;
   SET emer;
   re = prxparse('/.*(ment|ice)/');
   call prxsubstr(re, Agency, pos, len);
   dept = substr(Agency, pos, len);
RUN;

PROC FREQ data = emer;
   TABLE dept;
RUN;
SAS Output

The SAS System

The FREQ Procedure

dept Frequency Percent Cumulative
Frequency
Cumulative
Percent
Fire Department 1615 32.81 1615 32.81
Police Department 3097 62.92 4712 95.73
Sheriff's Office 210 4.27 4922 100.00

17.9. Data Cleaning

  1. Read in the bike lanes dataset Bike_Lanes.csv using PROC IMPORT and call it bike.

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
            out = bike dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC PRINT DATA = bike(OBS=5);
RUN;
SAS Output

The SAS System

Obs subType name block type numLanes project route length dateInstalled
1       BIKE BOULEVARD 1 GUILFORD AVE BIKE BLVD   435.73787702 0
2       SIDEPATH 1   NORTHERN 1024.6746249 2010
3       SIGNED ROUTE 1 SOUTHEAST BIKE NETWORK   3749.3226377 2010
4   HUNTINGDON PATH   SIDEPATH 1     0 0
5 STCLN EDMONDSON AVE 5300 BLK EDMONDSON AVE BIKE LANE 1 OPERATION ORANGE CONE   180.92546489 2011
  1. How many rows are are in the bike dataset? How many are complete cases? Hint: Use cmiss(of ALL) in a dataset to create a 0/1 variable to indicate if it complete or not and then use PROC MEANS to sum this 0/1 variable.

PROC CONTENTS data = bike;
RUN;
SAS Output

The SAS System

The CONTENTS Procedure

Data Set Name WORK.BIKE Observations 1631
Member Type DATA Variables 9
Engine V9 Indexes 0
Created 10/18/2020 11:55:08 Observation Length 136
Last Modified 10/18/2020 11:55:08 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 4
First Data Page 1
Max Obs per Page 481
Obs in First Data Page 459
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/bike.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671716
Access Permission rw-r--r--
Owner Name sasdemo
File Size 320KB
File Size (bytes) 327680
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
3 block Char 29 $29. $29.
9 dateInstalled Num 8 BEST12. BEST32.
8 length Num 8 BEST12. BEST32.
2 name Char 20 $20. $20.
5 numLanes Num 8 BEST12. BEST32.
6 project Char 25 $25. $25.
7 route Char 13 $13. $13.
1 subType Char 6 $6. $6.
4 type Char 15 $15. $15.

There are 1631 rows in the bike data set.

DATA bike;
  SET bike;
  IF cmiss(of _ALL_) > 0 THEN complete = 0;
  ELSE complete = 1;
RUN;

PROC PRINT data = bike(OBS=5);
RUN;

PROC MEANS data = bike sum;
   VAR complete;
RUN;
SAS Output

The SAS System

Obs subType name block type numLanes project route length dateInstalled complete
1       BIKE BOULEVARD 1 GUILFORD AVE BIKE BLVD   435.73787702 0 0
2       SIDEPATH 1   NORTHERN 1024.6746249 2010 0
3       SIGNED ROUTE 1 SOUTHEAST BIKE NETWORK   3749.3226377 2010 0
4   HUNTINGDON PATH   SIDEPATH 1     0 0 0
5 STCLN EDMONDSON AVE 5300 BLK EDMONDSON AVE BIKE LANE 1 OPERATION ORANGE CONE   180.92546489 2011 0

The SAS System

The MEANS Procedure

Analysis Variable : complete
Sum
257.0000000
  1. Filter rows of bike that are NOT missing the route variable, assign this to the object have_route. Create a frequency table of the subType variable using PROC FREQ, including the missing subTypes.

DATA have_route;
   SET bike;
   WHERE NOT MISSING(route);
RUN;

PROC FREQ data = have_route;
   TABLE subType / missing;
RUN;
SAS Output

The SAS System

The FREQ Procedure

subType Frequency Percent Cumulative
Frequency
Cumulative
Percent
  1 0.28 1 0.28
STRALY 3 0.83 4 1.10
STRPRD 358 98.90 362 100.00

17.10. Data Manipulation

  1. Read in the Bike_Lanes_Wide.csv dataset and call is wide using PROC IMPORT. Print the first few rows of the dataset.

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes_Wide.csv"
            out = wide dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC CONTENTS data = wide;
RUN;

PROC PRINT DATA = wide(OBS=5);
RUN;
SAS Connection established. Subprocess id is 8974
SAS Output

The SAS System

The CONTENTS Procedure

Data Set Name WORK.WIDE Observations 134
Member Type DATA Variables 9
Engine V9 Indexes 0
Created 10/29/2020 14:57:01 Observation Length 141
Last Modified 10/29/2020 14:57:01 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 1
First Data Page 1
Max Obs per Page 464
Obs in First Data Page 134
Number of Data Set Repairs 0
Filename /tmp/SAS_workFB360000230E_localhost.localdomain/wide.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671855
Access Permission rw-r--r--
Owner Name sasdemo
File Size 128KB
File Size (bytes) 131072
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
2 BIKE BOULEVARD Char 11 $11. $11.
3 BIKE LANE Char 11 $11. $11.
4 CONTRAFLOW Char 11 $11. $11.
9 NA Char 11 $11. $11.
5 SHARED BUS BIKE Char 11 $11. $11.
6 SHARROW Char 11 $11. $11.
7 SIDEPATH Char 11 $11. $11.
8 SIGNED ROUTE Char 11 $11. $11.
1 name Char 53 $53. $53.

The SAS System

Obs name BIKE BOULEVARD BIKE LANE CONTRAFLOW SHARED BUS BIKE SHARROW SIDEPATH SIGNED ROUTE NA
1 ALBEMARLE ST NA NA NA NA 109.9141727 NA 234.9190031 NA
2 ALICEANNA ST NA NA NA NA 250.7103035 NA NA NA
3 ARGONNE DR NA 1312.608373 NA NA NA NA NA NA
4 ART MUSEUM DR NA NA NA NA NA NA 277.6579871 NA
5 AUCHENTOROLY TERR NA 1341.82908 NA NA NA NA NA NA
  1. Reshape wide using either PROC TRANSPOSE or a DATA step. You will need to gather all columns except the name column. Transform into a long dataset with two new columns lanetype (the former column names) and the_length the data values. In the variable the_length, replace ‘NA’ values with . and convert it to a numeric column.

DATA long;
   SET wide;
   array col{*} 'BIKE BOULEVARD'n -- NA; /*BIKE_BOULEVARD -- NA*/
   DO i = 1 TO dim(col);
      lanetype = vname(col(i));
      IF col(i) = 'NA' THEN the_length = .;
      ELSE the_length = INPUT(col(i), 20.7);
      OUTPUT;
   END;
   DROP 'BIKE BOULEVARD'n -- NA i; /*BIKE_BOULEVARD -- NA*/
RUN;

PROC PRINT data = long(obs=10);
   TITLE 'Using a DATA step';
RUN;

PROC SORT data=wide;
   BY name;
RUN;

PROC TRANSPOSE data = wide
               out = long2 (RENAME = (COL1 = the_length 
                                      _NAME_ = lanetype));
   BY name;
   VAR 'BIKE BOULEVARD'n -- NA;  /*BIKE_BOULEVARD -- NA*/
RUN;

DATA long2;
   SET long2(RENAME = (the_length = length));
   IF length = 'NA' THEN the_length = .;
   ELSE the_length = INPUT(length, 20.7);
   DROP length;
RUN;

PROC PRINT data = long2(OBS=10);
   TITLE 'Using PROC TRANSPOSE';
RUN;

TITLE;
SAS Output

Using a DATA step

Obs name lanetype the_length
1 ALBEMARLE ST BIKE BOULEVARD .
2 ALBEMARLE ST BIKE LANE .
3 ALBEMARLE ST CONTRAFLOW .
4 ALBEMARLE ST SHARED BUS BIKE .
5 ALBEMARLE ST SHARROW 109.914
6 ALBEMARLE ST SIDEPATH .
7 ALBEMARLE ST SIGNED ROUTE 234.919
8 ALBEMARLE ST NA .
9 ALICEANNA ST BIKE BOULEVARD .
10 ALICEANNA ST BIKE LANE .

Using PROC TRANSPOSE

Obs name lanetype the_length
1 ALBEMARLE ST BIKE BOULEVARD .
2 ALBEMARLE ST BIKE LANE .
3 ALBEMARLE ST CONTRAFLOW .
4 ALBEMARLE ST SHARED BUS BIKE .
5 ALBEMARLE ST SHARROW 109.914
6 ALBEMARLE ST SIDEPATH .
7 ALBEMARLE ST SIGNED ROUTE 234.919
8 ALBEMARLE ST NA .
9 ALICEANNA ST BIKE BOULEVARD .
10 ALICEANNA ST BIKE LANE .
  1. Read in the roads and crashes .csv files and call them road and crash.

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/roads.csv"
            out = road dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/crashes.csv"
            out = crash dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC PRINT data = road;
RUN;

PROC PRINT data = crash(OBS=10);
RUN;
SAS Output
Obs Road District Length
1 Interstate 65 Greenfield 262
2 Interstate 70 Vincennes 156
3 US-36 Crawfordsville 139
4 US-40 Greenfield 150
5 US-52 Crawfordsville 172

Obs Year Road N_Crashes Volume
1 1991 Interstate 65 25 40000
2 1992 Interstate 65 37 41000
3 1993 Interstate 65 45 45000
4 1994 Interstate 65 46 45600
5 1995 Interstate 65 46 49000
6 1996 Interstate 65 59 51000
7 1997 Interstate 65 76 52000
8 1998 Interstate 65 90 58000
9 1999 Interstate 65 95 65000
10 2000 Interstate 65 95 74000
  1. Replace (using tranwrd) any hyphens (-) with a space in Road variable of crash. Call this data crash2. Table the Road variable with PROC FREQ.

DATA crash2;
   SET crash;
   road = tranwrd(road, '-', ' ');
RUN;

PROC FREQ data = crash2;
   TABLE road;
RUN;
SAS Output

The FREQ Procedure

Road Frequency Percent Cumulative
Frequency
Cumulative
Percent
Interstate 275 22 20.00 22 20.00
Interstate 65 22 20.00 44 40.00
Interstate 70 22 20.00 66 60.00
US 36 22 20.00 88 80.00
US 40 22 20.00 110 100.00
  1. How many observations are in each of the crash and road datasets?

PROC CONTENTS data = crash;
RUN;

PROC CONTENTS data = road;
RUN;
SAS Output

The CONTENTS Procedure

Data Set Name WORK.CRASH Observations 110
Member Type DATA Variables 4
Engine V9 Indexes 0
Created 10/18/2020 17:00:48 Observation Length 40
Last Modified 10/18/2020 17:00:48 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 1
First Data Page 1
Max Obs per Page 1632
Obs in First Data Page 110
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/crash.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671741
Access Permission rw-r--r--
Owner Name sasdemo
File Size 128KB
File Size (bytes) 131072
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
3 N_Crashes Num 8 BEST12. BEST32.
2 Road Char 14 $14. $14.
4 Volume Num 8 BEST12. BEST32.
1 Year Num 8 BEST12. BEST32.

The CONTENTS Procedure

Data Set Name WORK.ROAD Observations 5
Member Type DATA Variables 3
Engine V9 Indexes 0
Created 10/18/2020 17:00:48 Observation Length 40
Last Modified 10/18/2020 17:00:48 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 1
First Data Page 1
Max Obs per Page 1632
Obs in First Data Page 5
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/road.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671709
Access Permission rw-r--r--
Owner Name sasdemo
File Size 128KB
File Size (bytes) 131072
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
2 District Char 14 $14. $14.
3 Length Num 8 BEST12. BEST32.
1 Road Char 13 $13. $13.

The crash dataset has 110 observations and the road datset has 5.

  1. Separate the Road column (using scan) into (type and number) in crash2. Reassign this to crash2. Table type from crash2 using PROC FREQ. Then create a new variable calling it road_hyphen using one of the concatenate functions (such as CAT). Unite the type and number columns using a hyphen (-) and then table road_hyphen using PROC FREQ.

DATA crash2;
   SET crash2;
   type = scan(road, 1);
   number = scan(road, 2);
RUN;

PROC FREQ data=crash2;
   TABLE type;
RUN;

DATA crash2;
   SET crash2;
   road_hyphen = CATX('-', type, number);
RUN;

PROC FREQ data=crash2;
   TABLE road_hyphen;
RUN;
SAS Output

The FREQ Procedure

type Frequency Percent Cumulative
Frequency
Cumulative
Percent
Interstate 66 60.00 66 60.00
US 44 40.00 110 100.00

The FREQ Procedure

road_hyphen Frequency Percent Cumulative
Frequency
Cumulative
Percent
Interstate-275 22 20.00 22 20.00
Interstate-65 22 20.00 44 40.00
Interstate-70 22 20.00 66 60.00
US-36 22 20.00 88 80.00
US-40 22 20.00 110 100.00
  1. Which and how many years were data collected in the crash dataset?

PROC FREQ data=crash;
   TABLE year;
RUN;
SAS Output

The FREQ Procedure

Year Frequency Percent Cumulative
Frequency
Cumulative
Percent
1991 5 4.55 5 4.55
1992 5 4.55 10 9.09
1993 5 4.55 15 13.64
1994 5 4.55 20 18.18
1995 5 4.55 25 22.73
1996 5 4.55 30 27.27
1997 5 4.55 35 31.82
1998 5 4.55 40 36.36
1999 5 4.55 45 40.91
2000 5 4.55 50 45.45
2001 5 4.55 55 50.00
2002 5 4.55 60 54.55
2003 5 4.55 65 59.09
2004 5 4.55 70 63.64
2005 5 4.55 75 68.18
2006 5 4.55 80 72.73
2007 5 4.55 85 77.27
2008 5 4.55 90 81.82
2009 5 4.55 95 86.36
2010 5 4.55 100 90.91
2011 5 4.55 105 95.45
2012 5 4.55 110 100.00

The crash dataset contains data on the years from 1991 to 2012 for a total of 22 years.

  1. Read in the dataset Bike_Lanes.csv and call it bike.

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
            out = bike dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC PRINT data=bike(obs=5);
RUN;
SAS Output

The SAS System

Obs subType name block type numLanes project route length dateInstalled
1       BIKE BOULEVARD 1 GUILFORD AVE BIKE BLVD   435.73787702 0
2       SIDEPATH 1   NORTHERN 1024.6746249 2010
3       SIGNED ROUTE 1 SOUTHEAST BIKE NETWORK   3749.3226377 2010
4   HUNTINGDON PATH   SIDEPATH 1     0 0
5 STCLN EDMONDSON AVE 5300 BLK EDMONDSON AVE BIKE LANE 1 OPERATION ORANGE CONE   180.92546489 2011
  1. Keep rows where the record is not missing type and not missing name and re-assign the output to bike.

DATA bike;
   SET bike;
   WHERE NOT MISSING(type);
RUN;

PROC PRINT data=bike(obs=5);
RUN;
SAS Output

The SAS System

Obs subType name block type numLanes project route length dateInstalled
1       BIKE BOULEVARD 1 GUILFORD AVE BIKE BLVD   435.73787702 0
2       SIDEPATH 1   NORTHERN 1024.6746249 2010
3       SIGNED ROUTE 1 SOUTHEAST BIKE NETWORK   3749.3226377 2010
4   HUNTINGDON PATH   SIDEPATH 1     0 0
5 STCLN EDMONDSON AVE 5300 BLK EDMONDSON AVE BIKE LANE 1 OPERATION ORANGE CONE   180.92546489 2011
  1. Using PROC MEANS with a BY statement grouping name and type (i.e for each type within each name), find the sum of the length. Use an OUTPUT statement to get this summary dataset and only keep the name, type and sum of the length column (after renaming this column length). Call this data set sub.

PROC SORT data = bike;
  BY name type;
RUN;

PROC MEANS data = bike noprint;
  BY name type;
  VAR length;
  OUTPUT out=sub sum=length;
RUN;

DATA sub;
   SET sub;
   keep name type length;
RUN;

PROC PRINT data=sub(obs=8);
RUN;
SAS Output

The SAS System

Obs name type length
1   BIKE BOULEVARD 435.73787702
2   BIKE LANE 695.69831669
3   SHARROW 186.96669187
4   SIDEPATH 1024.6746249
5   SIGNED ROUTE 6527.883024
6 ALBEMARLE ST SHARROW 439.65669098
7 ALBEMARLE ST SIGNED ROUTE 1879.3520247
8 ALICEANNA ST SHARROW 4763.495767
PROC FREQ data=sub;
   TABLE type;
RUN;
SAS Output

The SAS System

The FREQ Procedure

type Frequency Percent Cumulative
Frequency
Cumulative
Percent
BIKE BOULEVARD 3 1.80 3 1.80
BIKE LANE 54 32.34 57 34.13
CONTRAFLOW 4 2.40 61 36.53
SHARED BUS BIKE 4 2.40 65 38.92
SHARROW 50 29.94 115 68.86
SIDEPATH 5 2.99 120 71.86
SIGNED ROUTE 47 28.14 167 100.00
  1. Reshape sub from long to wide by taking the type to be the new columns and length to be the value in those columns. (NOTE: the names have spaces in them. Do we need to replace the spaces with a character before changing them to column names?)

DATA new_wide;
   SET sub;
   BY name;
   RETAIN BIKE_BOULEVARD BIKE_LANE CONTRAFLOW SHARED_BUS_BIKE 
          SHARROW SIDEPATH SIGNED_ROUTE;
   array atype (*) BIKE_BOULEVARD BIKE_LANE CONTRAFLOW SHARED_BUS_BIKE 
                   SHARROW SIDEPATH SIGNED_ROUTE;
                   
   IF first.name THEN call missing(of atype(*));
   
   IF type = 'BIKE BOULEVARD' THEN atype(1) = length;
   ELSE IF type = 'BIKE LANE' THEN atype(2) = length;
   ELSE IF type = 'CONTRAFLOW' THEN atype(3) = length;
   ELSE IF type = 'SHARED BUS BIKE' THEN atype(4) = length;
   ELSE IF type = 'SHARROW' THEN atype(5) = length;
   ELSE IF type = 'SIDEPATH' THEN atype(6) = length;
   ELSE IF type = 'SIGNED ROUTE' THEN atype(7) = length;
   
   IF last.name THEN OUTPUT;
   DROP type length;
RUN;

PROC PRINT data=new_wide(OBS=5);
   TITLE 'Wide Dataset Using DATA Step';
RUN;

PROC TRANSPOSE data = sub out = new_wide2(drop=_NAME_);
   BY name;
   VAR length;
   ID type;
RUN;

PROC PRINT data = new_wide2(OBS=5);
   TITLE 'Wide Dataset Using PROC TRANSPOSE';
RUN;

TITLE ;
SAS Output

Wide Dataset Using DATA Step

Obs name BIKE_BOULEVARD BIKE_LANE CONTRAFLOW SHARED_BUS_BIKE SHARROW SIDEPATH SIGNED_ROUTE
1   435.738 695.70 . . 186.97 1024.67 6527.88
2 ALBEMARLE ST . . . . 439.66 . 1879.35
3 ALICEANNA ST . . . . 4763.50 . .
4 ARGONNE DR . 1312.61 . . . . .
5 ART MUSEUM DR . . . . . . 832.97

Wide Dataset Using PROC TRANSPOSE

Obs name BIKE BOULEVARD BIKE LANE SHARROW SIDEPATH SIGNED ROUTE CONTRAFLOW SHARED BUS BIKE
1   435.73787702 695.69831669 186.96669187 1024.6746249 6527.883024 . .
2 ALBEMARLE ST . . 439.65669098 . 1879.3520247 . .
3 ALICEANNA ST . . 4763.495767 . . . .
4 ARGONNE DR . 1312.6083729 . . . . .
5 ART MUSEUM DR . . . . 832.97396136 . .
  1. Join data in the crash and road datasets to retain only complete data, (using an inner join on road) Merge by the variable Road. Call the output merged. How many observations are there?

PROC SORT data = road;
  BY road;
RUN;

PROC SORT data = crash;
  BY road;
RUN;

DATA merged;
  MERGE road (IN = in1) 
        crash (IN = in2);
  BY road;
  IF in1 = 1 and in2 = 1;
RUN;

PROC CONTENTS data = merged;
RUN;
SAS Output

The CONTENTS Procedure

Data Set Name WORK.MERGED Observations 88
Member Type DATA Variables 6
Engine V9 Indexes 0
Created 10/18/2020 20:34:03 Observation Length 64
Last Modified 10/18/2020 20:34:03 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 1
First Data Page 1
Max Obs per Page 1021
Obs in First Data Page 88
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/merged.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671747
Access Permission rw-r--r--
Owner Name sasdemo
File Size 128KB
File Size (bytes) 131072
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
2 District Char 14 $14. $14.
3 Length Num 8 BEST12. BEST32.
5 N_Crashes Num 8 BEST12. BEST32.
1 Road Char 13 $13. $13.
6 Volume Num 8 BEST12. BEST32.
4 Year Num 8 BEST12. BEST32.

There are 88 rows.

  1. Join data using a full_join. Call the output full. How many observations are there?

DATA full_join;
  MERGE road crash;
  BY road;
RUN;

PROC PRINT data = full_join(OBS=10);
RUN;
SAS Output
Obs Road District Length Year N_Crashes Volume
1 Interstate 27   . 1991 27 20350
2 Interstate 27   . 1992 26 21200
3 Interstate 27   . 1993 22 23200
4 Interstate 27   . 1994 21 21200
5 Interstate 27   . 1995 28 23200
6 Interstate 27   . 1996 22 20000
7 Interstate 27   . 1997 27 18000
8 Interstate 27   . 1998 21 19500
9 Interstate 27   . 1999 22 21000
10 Interstate 27   . 2000 29 20700
  1. Do a left join of the road and crash. ORDER matters here! How many observations are there?

DATA left_join;
   MERGE road (in = in1) crash;
   BY road;
   IF in1 = 1;
RUN;

PROC CONTENTS data = left_join;
RUN;
SAS Output

The CONTENTS Procedure

Data Set Name WORK.LEFT_JOIN Observations 89
Member Type DATA Variables 6
Engine V9 Indexes 0
Created 10/18/2020 20:38:16 Observation Length 64
Last Modified 10/18/2020 20:38:16 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 1
First Data Page 1
Max Obs per Page 1021
Obs in First Data Page 89
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/left_join.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671749
Access Permission rw-r--r--
Owner Name sasdemo
File Size 128KB
File Size (bytes) 131072
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
2 District Char 14 $14. $14.
3 Length Num 8 BEST12. BEST32.
5 N_Crashes Num 8 BEST12. BEST32.
1 Road Char 13 $13. $13.
6 Volume Num 8 BEST12. BEST32.
4 Year Num 8 BEST12. BEST32.

There are 89 rows.

  1. Repeat above with a right_join with the same order of the arguments. How many observations are there?

DATA right_join;
   MERGE road crash (in = in1);
   BY road;
   IF in1 = 1;
RUN;

PROC CONTENTS data = right_join;
RUN;
SAS Output

The CONTENTS Procedure

Data Set Name WORK.RIGHT_JOIN Observations 110
Member Type DATA Variables 6
Engine V9 Indexes 0
Created 10/18/2020 20:39:48 Observation Length 64
Last Modified 10/18/2020 20:39:48 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 1
First Data Page 1
Max Obs per Page 1021
Obs in First Data Page 110
Number of Data Set Repairs 0
Filename /tmp/SAS_workA5B9000010B1_localhost.localdomain/right_join.sas7bdat
Release Created 9.0401M6
Host Created Linux
Inode Number 671750
Access Permission rw-r--r--
Owner Name sasdemo
File Size 128KB
File Size (bytes) 131072
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
2 District Char 14 $14. $14.
3 Length Num 8 BEST12. BEST32.
5 N_Crashes Num 8 BEST12. BEST32.
1 Road Char 13 $13. $13.
6 Volume Num 8 BEST12. BEST32.
4 Year Num 8 BEST12. BEST32.

There are 110 observations in the right joined dataset.

17.11. The Output Delivery System and Graphics

PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Charm_City_Circulator_Ridership.csv"
            out = circ dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

DATA long;
   SET circ;
   ARRAY larray(*) orangeBoardings -- bannerAverage;
   DO i = 1 TO dim(larray);
      var = vname(larray(i));
      number = larray(i);
      var = tranwrd(var, 'Board', ' Board');
      var = tranwrd(var, 'Alight', ' Alight');
      var = tranwrd(var, 'Average', ' Average');
      route = scan(var, 1);
      type = scan(var, 2);
      OUTPUT;
    END;
    
    DROP i var orangeBoardings -- bannerAverage;
RUN;

DATA avg;
  SET long;
  WHERE type = 'Average' and number ne .;
RUN;

PROC PRINT data = long(obs = 10);
RUN;

PROC PRINT data = avg(obs=10);
RUN;
SAS Output
Obs day date daily number route type
1 Monday 01/11/2010 952 877 orange Boardings
2 Monday 01/11/2010 952 1027 orange Alightings
3 Monday 01/11/2010 952 952 orange Average
4 Monday 01/11/2010 952 . purple Boardings
5 Monday 01/11/2010 952 . purple Alightings
6 Monday 01/11/2010 952 . purple Average
7 Monday 01/11/2010 952 . green Boardings
8 Monday 01/11/2010 952 . green Alightings
9 Monday 01/11/2010 952 . green Average
10 Monday 01/11/2010 952 . banner Boardings

Obs day date daily number route type
1 Monday 01/11/2010 952 952.0 orange Average
2 Tuesday 01/12/2010 796 796.0 orange Average
3 Wednesday 01/13/2010 1211.5 1211.5 orange Average
4 Thursday 01/14/2010 1213.5 1213.5 orange Average
5 Friday 01/15/2010 1644 1644.0 orange Average
6 Saturday 01/16/2010 1490.5 1490.5 orange Average
7 Sunday 01/17/2010 888.5 888.5 orange Average
8 Monday 01/18/2010 999.5 999.5 orange Average
9 Tuesday 01/19/2010 1035 1035.0 orange Average
10 Wednesday 01/20/2010 1395.5 1395.5 orange Average
  1. Plot average ridership (avg data set) by date using a scatterplot. a. Color the points by route (orange, purple, green, banner) b. Add black smoothed curves for each route c. Color the points by day of the week

PROC SGPLOT data = avg;
   TITLE '1a';
   SCATTER Y = number X = date / group = route;
RUN;

PROC SGPLOT data = avg;
   TITLE '1b';
   SCATTER Y = number X = date / group = route;
   LOESS Y = number X = date / group = route lineattrs=(color = black);
RUN;

PROC SGPLOT data = avg;
   TITLE '1c';
   SCATTER Y = number X = date / group = day;
RUN;
SAS Output
The SGPlot Procedure

The SGPlot Procedure

The SGPlot Procedure
  1. Replot 1a where the colors of the points are the name of the route (with banner –> blue). Note: you will need to make a data attribute map.

DATA marker_map;
  INPUT id $2. +1 VALUE $6. +1 MARKERCOLOR $6.;
DATALINES;
mc banner blue
mc green  green
mc orange orange
mc purple purple
;
RUN;

PROC SGPLOT data = avg dattrmap = marker_map;
   TITLE '2';
   SCATTER Y = number X = date / group = route attrid = mc;
RUN;
SAS Output
The SGPlot Procedure
  1. Plot a scatterplot of average ridership by date with one panel per route.

PROC SGPANEL data = avg;
   TITLE '3';
   PANELBY route;
   SCATTER Y = number X = date;
RUN;
SAS Output
The SGPanel Procedure
  1. Plot a scatterplpot of average ridership by date with separate panels by day of the week, colored by route.

*Default colors are fine here to, but we can also use a data attribute map like we did before;
PROC SGPANEL data = avg dattrmap = marker_map;
   TITLE '4';
   PANELBY day;
   SCATTER Y = number X = date / group = route attrid = mc;
RUN;
SAS Output
The SGPanel Procedure
The SGPanel Procedure
The SGPanel Procedure
The SGPanel Procedure
  1. Plot a scatterplot of average ridership (avg) by date, colored by route (same as 1a). (do not take an average, use the average column for each route). Make the x-label “Year”. Make the y-label “Number of People”.

PROC SGPLOT data = avg dattrmap = marker_map;
   TITLE '5';
   SCATTER Y = number X = date / group = route attrid = mc;
   XAXIS LABEL = "Year";
   YAXIS LABEL = "Number of People";
RUN;
SAS Output
The SGPlot Procedure
  1. Plot average ridership on the orange route versus date as a solid line, and add dashed “error” lines based on the boardings and alightings. The line colors should be orange.

DATA orange;
  SET long;
  WHERE route = "orange";
RUN;

DATA line_map;
  INPUT id $2. +1 VALUE $10. +1 linepattern $5.;
DATALINES;
lp Alightings dash
lp Average    solid
lp Boardings  dash
;
RUN;

PROC SGPLOT data = orange dattrmap = line_map;
  TITLE '6';
  SERIES Y = number X = date / group = type attrid = lp lineattrs=(color = orange);
RUN;

TITLE;
SAS Output
The SGPlot Procedure

17.12. Statistical Analysis in SAS

FILENAME cardata '/folders/myfolders/SAS_Notes/data/kaggleCarAuction.csv';

PROC IMPORT datafile = cardata out = cars dbms = CSV replace;
   getnames = yes;
   guessingrows = 1000;
RUN;

FILENAME mortdat '/folders/myfolders/SAS_Notes/data/indicatordeadkids35.csv';

PROC IMPORT datafile = mortdat out = mort dbms = CSV replace;
   getnames = yes;
   guessingrows = 500;
RUN;
  1. Compute the correlation between the 1980, 1990, 2000, and 2010 mortality data. Just display the result to the screen. Then compute using the NOMMISS option. (Note: The column names are numbers, which are invalid standard SAS names, so to refer to the variable 1980 in your code use ‘1980’n.)

/*
PROC CORR data = mort;
   VAR '1980'n '1990'n '2000'n '2010'n;
RUN;

PROC CORR data = mort nomiss;
   VAR '1980'n '1990'n '2000'n '2010'n;
RUN;
*/

PROC CORR data = mort;
   VAR _1980 _1990 _2000 _2010;
RUN;

PROC CORR data = mort nomiss;
   VAR _1980 _1990 _2000 _2010;
RUN;
SAS Output

The CORR Procedure

4 Variables: 1980 1990 2000 2010
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
1980 197 0.94731 0.88694 186.62013 0.07049 3.49035
1990 197 0.70911 0.77929 139.69428 0.04922 3.28983
2000 197 0.55004 0.68922 108.35786 0.03232 2.81186
2010 193 0.39601 0.50021 76.42946 0.03173 2.02857
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations
  1980 1990 2000 2010
1980
1.00000
 
197
0.96015
<.0001
197
0.88884
<.0001
197
0.84683
<.0001
193
1990
0.96015
<.0001
197
1.00000
 
197
0.96138
<.0001
197
0.92472
<.0001
193
2000
0.88884
<.0001
197
0.96138
<.0001
197
1.00000
 
197
0.98623
<.0001
193
2010
0.84683
<.0001
193
0.92472
<.0001
193
0.98623
<.0001
193
1.00000
 
193

The CORR Procedure

4 Variables: 1980 1990 2000 2010
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
1980 193 0.96250 0.88966 185.76175 0.07049 3.49035
1990 193 0.72132 0.78263 139.21550 0.05512 3.28983
2000 193 0.55976 0.69297 108.03459 0.03271 2.81186
2010 193 0.39601 0.50021 76.42946 0.03173 2.02857
Pearson Correlation Coefficients, N = 193
Prob > |r| under H0: Rho=0
  1980 1990 2000 2010
1980
1.00000
 
0.95968
<.0001
0.88774
<.0001
0.84683
<.0001
1990
0.95968
<.0001
1.00000
 
0.96103
<.0001
0.92472
<.0001
2000
0.88774
<.0001
0.96103
<.0001
1.00000
 
0.98623
<.0001
2010
0.84683
<.0001
0.92472
<.0001
0.98623
<.0001
1.00000
 
  1. a. Compute the correlation between the Myanmar, China, and United States mortality data. Store this correlation matrix in an object called country_cor using ODS OUTPUT. b. Extract the Myanmar-US correlation from the correlation matrix.

DATA mort_sub;
   SET mort(RENAME=(Var1 = Country));
   WHERE country in ('Myanmar' 'China' 'United States');
RUN;

/*
PROC TRANSPOSE data = mort_sub out = mort_sub_transposed;
   VAR '1760'n -- '2099'n;
   ID country;
RUN;
*/

PROC TRANSPOSE data = mort_sub out = mort_sub_transposed;
   VAR _1760 -- _2099;
   ID country;
RUN;

/*
PROC CORR DATA = mort_sub_transposed;
   VAR China Myanmar 'United States'n;
   ODS OUTPUT PearsonCorr = country_cor;
RUN;
*/

*ODS TRACE ON;
PROC CORR DATA = mort_sub_transposed;
   VAR China Myanmar United_States;
   ODS OUTPUT PearsonCorr = country_cor;
RUN;
*ODS TRACE OFF;

/*
PROC PRINT data = country_cor;
   Var Myanmar;
   WHERE Variable = 'United States';
RUN;
*/

PROC PRINT data = country_cor;
   Var Myanmar;
   WHERE Variable = 'United_States';
RUN;   
SAS Output

The CORR Procedure

3 Variables: China Myanmar United States
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
China 214 2.37608 1.15555 508.48153 0.04325 3.61453
Myanmar 214 2.82215 1.12767 603.94031 0.05150 3.76418
United States 214 1.36080 1.14407 291.21132 0.02867 3.17018
Pearson Correlation Coefficients, N = 214
Prob > |r| under H0: Rho=0
  China Myanmar United States
China
1.00000
 
0.97436
<.0001
0.69093
<.0001
Myanmar
0.97436
<.0001
1.00000
 
0.62926
<.0001
United States
0.69093
<.0001
0.62926
<.0001
1.00000
 

Obs Myanmar
3 0.62926
  1. Is there a difference between mortality information from 1990 and 2000? Run a paired t-test and a Wilcoxon signed rank test to assess this. Hint: to extract the column of information for 1990, use ‘1990’n.

/*
PROC TTEST data = mort;
   PAIRED '1990'n*'2000'n;
RUN;

DATA diff;
   SET mort;
   diff = '2000'n - '1990'n;
RUN;
*/

PROC TTEST data = mort;
   PAIRED _1990*_2000;
RUN;

DATA diff;
   SET mort;
   diff = _2000 - _1990;
RUN;

PROC UNIVARIATE data = diff;
  VAR diff;
RUN;
SAS Output

The TTEST Procedure

 

Difference: 1990 - 2000

N Mean Std Dev Std Err Minimum Maximum
197 0.1591 0.2227 0.0159 -0.5456 0.8726
Mean 95% CL Mean Std Dev 95% CL Std Dev
0.1591 0.1278 0.1904 0.2227 0.2027 0.2472
DF t Value Pr > |t|
196 10.03 <.0001
Summary Panel for Difference of 1990 and 2000
Profiles Plot for 1990 and 2000
Agreement Plot for 1990 and 2000
Q-Q Plot for Difference of 1990 and 2000

The UNIVARIATE Procedure

Variable: diff

Moments
N 197 Sum Weights 197
Mean -0.1590681 Sum Observations -31.336415
Std Deviation 0.22269839 Variance 0.04959457
Skewness -0.9205453 Kurtosis 2.01791651
Uncorrected SS 14.7051599 Corrected SS 9.72053615
Coeff Variation -140.00192 Std Error Mean 0.0158666
Basic Statistical Measures
Location Variability
Mean -0.15907 Std Deviation 0.22270
Median -0.09007 Variance 0.04959
Mode . Range 1.41823
    Interquartile Range 0.21805
Tests for Location: Mu0=0
Test Statistic p Value
Student's t t -10.0253 Pr > |t| <.0001
Sign M -84.5 Pr >= |M| <.0001
Signed Rank S -8245.5 Pr >= |S| <.0001
Quantiles (Definition 5)
Level Quantile
100% Max 0.5456100
99% 0.3990751
95% 0.0830986
90% -0.0104786
75% Q3 -0.0286619
50% Median -0.0900725
25% Q1 -0.2467130
10% -0.4798594
5% -0.6188205
1% -0.8710244
0% Min -0.8726242
Extreme Observations
Lowest Highest
Value Obs Value Obs
-0.872624 19 0.272621 32
-0.871024 131 0.321500 196
-0.808735 96 0.329956 22
-0.804626 54 0.399075 99
-0.794268 106 0.545610 197
  1. Using the cars dataset, fit a linear regression model with vehicle cost (VehBCost) as the outcome and vehicle age (VehicleAge) and whether it’s an online sale (IsOnlineSale) as predictors as well as their interaction.

PROC GLM data = cars;
   MODEL VehBCost = VehicleAge IsOnlineSale VehicleAge*IsOnlineSale;
RUN;
SAS Output

The GLM Procedure

Number of Observations Read 72983
Number of Observations Used 72983

The GLM Procedure

 

Dependent Variable: VehBCost

Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 22422071293 7474023764.4 2652.08 <.0001
Error 72979 205667188003 2818169.4461    
Corrected Total 72982 228089259296      
R-Square Coeff Var Root MSE VehBCost Mean
0.098304 24.94067 1678.740 6730.934
Source DF Type I SS Mean Square F Value Pr > F
VehicleAge 1 22249980049 22249980049 7895.19 <.0001
IsOnlineSale 1 158865621 158865621 56.37 <.0001
VehicleAg*IsOnlineSa 1 13225623 13225623 4.69 0.0303
Source DF Type III SS Mean Square F Value Pr > F
VehicleAge 1 21533779839 21533779839 7641.05 <.0001
IsOnlineSale 1 64252535 64252535 22.80 <.0001
VehicleAg*IsOnlineSa 1 13225623 13225623 4.69 0.0303
Parameter Estimate Standard
Error
t Value Pr > |t|
Intercept 8062.701999 16.5870226 486.08 <.0001
VehicleAge -320.662108 3.6683525 -87.41 <.0001
IsOnlineSale 514.308268 107.7114631 4.77 <.0001
VehicleAg*IsOnlineSa -55.372926 25.5607123 -2.17 0.0303
  1. Create a variable called expensive in the cars data that indicates if the vehicle cost is over $10,000. Use a chi-squared test to assess if there is a relationship between a car being expensive and it being labeled as a “bad buy” (IsBadBuy).

DATA cars;
  SET cars;
  IF VehBCost > 10000 THEN expensive = 1;
  ELSE expensive = 0;
RUN;

PROC FREQ data = cars;
   TABLES expensive*IsBadBuy / chisq;
RUN;
SAS Output

The FREQ Procedure

Frequency
Percent
Row Pct
Col Pct
Table of expensive by IsBadBuy
expensive IsBadBuy
0 1 Total
0
61505
84.27
87.67
96.09
8652
11.85
12.33
96.39
70157
96.13
 
 
1
2502
3.43
88.54
3.91
324
0.44
11.46
3.61
2826
3.87
 
 
Total
64007
87.70
8976
12.30
72983
100.00

Statistics for Table of expensive by IsBadBuy

Statistic DF Value Prob
Chi-Square 1 1.8948 0.1687
Likelihood Ratio Chi-Square 1 1.9314 0.1646
Continuity Adj. Chi-Square 1 1.8152 0.1779
Mantel-Haenszel Chi-Square 1 1.8948 0.1687
Phi Coefficient   -0.0051  
Contingency Coefficient   0.0051  
Cramer's V   -0.0051  
Fisher's Exact Test
Cell (1,1) Frequency (F) 61505
Left-sided Pr <= F 0.0880
Right-sided Pr >= F 0.9211
   
Table Probability (P) 0.0091
Two-sided Pr <= P 0.1790

Sample Size = 72983

  1. Fit a logistic regression model where the outcome is “bad buy” status and predictors are the expensive status and vehicle age (VehicleAge). Request confidence intervals for the odds ratios.

PROC LOGISTIC data = cars;
   MODEL IsBadBuy(event='1') = expensive VehicleAge / CLODDS = WALD;
RUN;
SAS Output

The LOGISTIC Procedure

Model Information
Data Set WORK.CARS
Response Variable IsBadBuy
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Number of Observations Read 72983
Number of Observations Used 72983
Response Profile
Ordered
Value
IsBadBuy Total
Frequency
1 0 64007
2 1 8976

Probability modeled is IsBadBuy='1'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 54423.307 52447.788
SC 54432.505 52475.381
-2 Log L 54421.307 52441.788
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 1979.5198 2 <.0001
Score 2042.5760 2 <.0001
Wald 1961.8662 2 <.0001
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept 1 -3.2495 0.0332 9554.1782 <.0001
expensive 1 -0.0804 0.0607 1.7542 0.1853
VehicleAge 1 0.2866 0.00648 1958.7086 <.0001
Association of Predicted Probabilities and Observed Responses
Percent Concordant 56.9 Somers' D 0.279
Percent Discordant 29.1 Gamma 0.324
Percent Tied 14.0 Tau-a 0.060
Pairs 574526832 c 0.639
Odds Ratio Estimates and Wald Confidence Intervals
Effect Unit Estimate 95% Confidence Limits
expensive 1.0000 0.923 0.819 1.039
VehicleAge 1.0000 1.332 1.315 1.349
Plot of Odds Ratios with 95% Wald Confidence Limits

17.13. Macros

  1. Write a macro that discretizes a quantitative variable into four categories based on quantiles. That is:

    • if X < Q1 then group = 1

    • if Q1 < X < M then group = 2

    • if M < X < Q3 then group = 3

    • if Q3 < X then group = 4

The macro should have the following defnition - %quartilesmacro(mydata, qvar, round, out);

  • mydata: dataset containing the quantitative variable

  • qvar: name of quantitative variable

  • round: integer representing number of decimal places to round to

  • out: name of output dataset which contains the categorized variable

We will explore this by using the Framingham dataset as an example to apply the macro to.

LIBNAME phc6089 "/folders/myfolders/SAS_Notes/data/";

*Create a temporary data set, so that we don't save changes to the original data set.;
DATA fghmTemp;
   SET phc6089.fghm113;
RUN;

/*Now let's code some variables with some more descriptive values.
  SEX (Gender): 1=Men 
				2=Women
  Period (Examination cycle): 1=Period1 
							  2=Period2
							  3=Period3
  BPMEDS (Use of anti-hypertensive meds): 0=Not currently
										  1=Currently use
  CURSMOKE (Currently smoke?): 0=No
							   1=Yes
  DIABETES: 0=Not diabetic
			1=Diabetic
  PREVAP (Have angina pectoric?): 0=No
								  1=Yes
  PREVCHD (Coronary heart disease?): 0=No
									 1=Yes
  PREVMI (Myocardial infarction?): 0=No
								   1=Yes
  PREVSTRK (Had a stroke?): 0=No
				   			1=Yes
  PREVHYP (Hypertensive? sys bp >=140 or dyas bp >= 90): 0=no
														 1=yes
*/

PROC FORMAT;
VALUE YNfmt 0="No"
			1="Yes";
VALUE perfmt 1="Period 1"
			 2="Period 2"
			 3="Period 3";
VALUE gndrfmt 1="Men"
			  2="Women";
RUN;

DATA fghmtemp;
SET fghmtemp;
FORMAT prevap ynfmt.
	   diabetes ynfmt.
	   cursmoke ynfmt.
	   bpmeds ynfmt.
	   prevchd ynfmt.
	   prevmi ynfmt.
	   prevstrk ynfmt.
	   prevhyp ynfmt.
	   sex gndrfmt.;
RUN;

5880  ods listing close;ods html5 (id=saspy_internal) file=stdout options(bitmap_mode='inline') device=svg style=HTMLBlue; ods
5880! graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: STDOUT
5881
5882 LIBNAME phc6089 "/folders/myfolders/SAS_Notes/data/";
NOTE: Libref PHC6089 was successfully assigned as follows:
Engine: V9
Physical Name: /folders/myfolders/SAS_Notes/data
5883
5884 *Create a temporary data set, so that we don't save changes to the original data set.;
5885 DATA fghmTemp;
NOTE: Data file PHC6089.FGHM113.DATA is in a format that is native to another host, or the file encoding does not match the session
encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce
performance.
5886 SET phc6089.fghm113;
5887 RUN;
NOTE: There were 500 observations read from the data set PHC6089.FGHM113.
NOTE: The data set WORK.FGHMTEMP has 500 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.00 seconds

5888
5889 /*Now let's code some variables with some more descriptive values.
5890 SEX (Gender): 1=Men
5891 2=Women
5892 Period (Examination cycle): 1=Period1
5893 2=Period2
5894 3=Period3
5895 BPMEDS (Use of anti-hypertensive meds): 0=Not currently
5896 1=Currently use
5897 CURSMOKE (Currently smoke?): 0=No
5898 1=Yes
5899 DIABETES: 0=Not diabetic
5900 1=Diabetic
5901 PREVAP (Have angina pectoric?): 0=No
5902 1=Yes
5903 PREVCHD (Coronary heart disease?): 0=No
5904 1=Yes
5905 PREVMI (Myocardial infarction?): 0=No
5906 1=Yes
5907 PREVSTRK (Had a stroke?): 0=No
5908 1=Yes
5909 PREVHYP (Hypertensive? sys bp >=140 or dyas bp >= 90): 0=no
5910 1=yes
5911 */
5912
5913 PROC FORMAT;
5914 VALUE YNfmt 0="No"
NOTE: Format YNFMT has been output.
5915 1="Yes";
5916 VALUE perfmt 1="Period 1"
5917 2="Period 2"
NOTE: Format PERFMT has been output.
5918 3="Period 3";
5919 VALUE gndrfmt 1="Men"
NOTE: Format GNDRFMT has been output.
5920 2="Women";
5921 RUN;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds

5922
5923 DATA fghmtemp;
5924 SET fghmtemp;
5925 FORMAT prevap ynfmt.
5926 diabetes ynfmt.
5927 cursmoke ynfmt.
5928 bpmeds ynfmt.
5929 prevchd ynfmt.
5930 prevmi ynfmt.
5931 prevstrk ynfmt.
5932 prevhyp ynfmt.
5933 sex gndrfmt.;
5934 RUN;
NOTE: There were 500 observations read from the data set WORK.FGHMTEMP.
NOTE: The data set WORK.FGHMTEMP has 500 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

5935
5936 ods html5 (id=saspy_internal) close;ods listing;

5937
* Step 1 - write code with only macros variables - Let's categorize BMI by quartiles;

%let qvar = bmi;

* Find table name;
ODS trace on;
proc means data = fghmtemp Q1 Median Q3;
 var &qvar;
run;
ODS trace off;

proc means data = fghmtemp Q1 Median Q3;
 var &qvar;
 ODS OUTPUT summary = quartiles;
run;

* See the contents of the ODS table;
proc print data = quartiles;
run;

data _null_;
 set quartiles;
 call symput("Q1", strip(&qvar._Q1));
 call symput("M", strip(&qvar._Median));
 call symput("Q3", strip(&qvar._Q3));
run;

%put _user_; * see the value of our macro variables in the log file;

data fghmtemp;
 set fghmtemp;
 if not missing(&qvar) and &qvar < &Q1 then &qvar._cat = 1;
 else if &qvar < &M then &qvar._cat = 2;
 else if &qvar < &Q3 then &qvar._cat = 3;
 else &qvar._cat = 4;
run;

proc freq data = fghmtemp;
 tables &qvar._cat;
run;
SAS Output

The MEANS Procedure

Analysis Variable : BMI Body Mass Index (kr/(M*M)
Lower Quartile Median Upper Quartile
23.1000000 25.2800000 27.6300000

The MEANS Procedure

Analysis Variable : BMI Body Mass Index (kr/(M*M)
Lower Quartile Median Upper Quartile
23.1000000 25.2800000 27.6300000

Obs BMI_Q1 BMI_Median BMI_Q3
1 23.1 25.28 27.63

The FREQ Procedure

bmi_cat Frequency Percent Cumulative
Frequency
Cumulative
Percent
1 120 24.00 120 24.00
2 130 26.00 250 50.00
3 125 25.00 375 75.00
4 125 25.00 500 100.00

Now that we have the basic code working, let’s turn this into a MACRO.

%macro quartilesmacro(mydata, qvar, round, out = out_dat);

ODS select none;
proc means data = &mydata Q1 Median Q3 maxdec = &round;
 var &qvar;
 ODS OUTPUT summary = quartiles;
run;
ods select all;

data _null_;
 set quartiles;
 call symput("Q1", strip(&qvar._Q1));
 call symput("M", strip(&qvar._Median));
 call symput("Q3", strip(&qvar._Q3));
run;

data &out;
 set &mydata;
 if not missing(&qvar) and &qvar < &Q1 then &qvar._cat = 1;
 else if &qvar < &M then &qvar._cat = 2;
 else if &qvar < &Q3 then &qvar._cat = 3;
 else &qvar._cat = 4;
run;

%mend;

%quartilesmacro(fghmtemp, bmi, round = 2, out = out_bmi);

proc freq data = out_bmi;
 tables bmi_cat;
run;
SAS Output

The FREQ Procedure

bmi_cat Frequency Percent Cumulative
Frequency
Cumulative
Percent
1 120 24.00 120 24.00
2 130 26.00 250 50.00
3 125 25.00 375 75.00
4 125 25.00 500 100.00

17.14. Simulations

  1. In this example, we will explore the sampling distribution of a sample proportion. a) Generate 500 sample of size 50 from a Binomial distribution with n = 1 and p = 0.3. Do this by generating a dataset with 500 rows and 50 columns all filled with random variates from this binomial distribution. Hint: Use an array and a nested DO loop. b) Calculate the mean of each of these 500 rows using mean(of ) in a DATA step. Save this mean as a new column. c) Plot a histogram of the 500 means calculated in part b and use PROC MEANS to calculate the MEAN and standard deviation of the 500 means calculated in part b. This describes the sampling distribution of p-hat when the population distiribution has probability of success p = 0.3 when the sample size is 30.

DATA sim;
   ARRAY data{*} x1-x50;
   DO numsim = 1 TO 500;
      DO n = 1 to 50;
         data(n) = ranbin(1234, 1, 0.3);
      END;
      p_hat = mean(of x1-x50);
      OUTPUT;
   END;
RUN;

PROC SGPLOT data = sim;
   HISTOGRAM p_hat;
RUN;

PROC MEANS data = sim mean std;
   VAR p_hat;
RUN;
SAS Output
The SGPlot Procedure

The MEANS Procedure

Analysis Variable : p_hat
Mean Std Dev
0.2980400 0.0635376