17. Solutions to Exercises¶

17.1. SAS Windowing Environment¶

Use the sample SAS program code provided above and the Charm City Circulator Ridership dataset to experiment with using SAS by making the following changes to the provided code.

Go through and change the colors using the attribute map dataset palette to something other than what they originally were. See https://support.sas.com/content/dam/SAS/support/en/books/pro-template-made-easy-a-guide-for-sas-users/62007_Appendix.pdf for a large list of colors in SAS.
Change the days you are keeping to show "Sunday" instead of “Saturday".
Change the plot from a line plot to a scatterplot by changing series to scatter in proc sgpanel.

%let working_directory = data; *Replace data with your path to your 
                                directory containing the dataset;

proc import datafile = "&working_directory/Charm_City_Circulator_Ridership.csv" 
    out = circ dbms = csv replace;
    getnames = yes;
    guessingrows = max;
run;

proc sort data = circ;
  by day date daily;
run;

proc transpose data = circ out = long (rename=(_NAME_ = var col1 = number));
  var orangeboardings -- banneraverage;
  by day date daily;
run;

data long2;
  set long;
  var = tranwrd(var, 'Board', '_Board');
  var = tranwrd(var, 'Alight', '_Alight');
  var = tranwrd(var, 'Average', '_Average');
  line = scan(var, 1, '_');
  type = scan(var, 2, '_');
  drop var;
run;

data avg;
  set long2;
  where (type = "Average");
run;

data palette;
  input id $ value $ linecolor $;
  markercolor = linecolor;
  datalines;
lc banner red
lc green yellow
lc orange darkblue
lc purple brown
;
run;

proc sgpanel data = avg dattrmap=palette; 
  where day in ("Monday" "Tuesday" "Friday" "Sunday");
  panelby day;
  scatter Y = number X = date / group = line attrid = lc;
run;

SAS Output

17.2. Basic SAS Programming¶

Submit the following simple SAS program. What does it do?

PROC PRINT DATA = sashelp.heart (OBS=10);
RUN;

PROC MEANS DATA = sashelp.heart;
  VAR height weight;
RUN;

PROC PRINT DATA = sashelp.heart (OBS=10);
RUN;

PROC MEANS DATA = sashelp.heart;
  VAR height weight;
RUN;

SAS Output

The SAS System


Obs	Status	DeathCause	AgeCHDdiag	Sex	AgeAtStart	Height	Weight	Diastolic	Systolic	MRW	Smoking	AgeAtDeath	Cholesterol	Chol_Status	BP_Status	Weight_Status	Smoking_Status
1	Dead	Other	.	Female	29	62.50	140	78	124	121	0	55	.		Normal	Overweight	Non-smoker
2	Dead	Cancer	.	Female	41	59.75	194	92	144	183	0	57	181	Desirable	High	Overweight	Non-smoker
3	Alive		.	Female	57	62.25	132	90	170	114	10	.	250	High	High	Overweight	Moderate (6-15)
4	Alive		.	Female	39	65.75	158	80	128	123	0	.	242	High	Normal	Overweight	Non-smoker
5	Alive		.	Male	42	66.00	156	76	110	116	20	.	281	High	Optimal	Overweight	Heavy (16-25)
6	Alive		.	Female	58	61.75	131	92	176	117	0	.	196	Desirable	High	Overweight	Non-smoker
7	Alive		.	Female	36	64.75	136	80	112	110	15	.	196	Desirable	Normal	Overweight	Moderate (6-15)
8	Dead	Other	.	Male	53	65.50	130	80	114	99	0	77	276	High	Normal	Normal	Non-smoker
9	Alive		.	Male	35	71.00	194	68	132	124	0	.	211	Borderline	Normal	Overweight	Non-smoker
10	Dead	Cerebral Vascular Disease	.	Male	52	62.50	129	78	124	106	5	82	284	High	Normal	Normal	Light (1-5)

The SAS System

The MEANS Procedure


Variable	N	Mean	Std Dev	Minimum	Maximum
Height Weight	5203 5203	64.8131847 153.0866808	3.5827074 28.9154261	51.5000000 67.0000000	76.5000000 300.0000000

This program prints out the first 10 rows of the sashelp.heart dataset and the summary statistics N (sample size), mean, standard deviation, minimum and maxium for the Height and Weight variables.

Submit the following small program. How many DATA steps are there? How man PROC steps? Which steps are executed?

DATA work.quant;
  SET sashelp.heart; /* Read in from the heart datset */
  KEEP height weight diastolic systolic; *Only keep these variable 
                                      in the new dataset;
RUN;

PROC PRINT DATA = work.quant (OBS=10);
RUN;

/*
PROC MEANS DATA = work.quant MIN Q1 Median Q3 MAX MEAN STD;
RUN;
*/

DATA work.quant;
  SET sashelp.heart; /* Read in from the heart datset */
  KEEP height weight diastolic systolic; *Only keep these variable 
                                       in the new dataset;
RUN;

PROC PRINT DATA = work.quant (OBS=10);
RUN;

/*
PROC MEANS DATA = work.quant MIN Q1 Median Q3 MAX MEAN STD;
RUN;
*/

SAS Output

The SAS System


Obs	Height	Weight	Diastolic	Systolic
1	62.50	140	78	124
2	59.75	194	92	144
3	62.25	132	90	170
4	65.75	158	80	128
5	66.00	156	76	110
6	61.75	131	92	176
7	64.75	136	80	112
8	65.50	130	80	114
9	71.00	194	68	132
10	62.50	129	78	124

There is one DATA step and there are two PROC steps; however, the last PROC MEANS is not run since it is within a paragraph comment.

Run PROC CONTENTS on the dataset sashelp.heart. How many observations are in this datset?

PROC CONTENTS data = sashelp.heart;
RUN;

SAS Output

The SAS System

The CONTENTS Procedure


Data Set Name	SASHELP.HEART	Observations	5209
Member Type	DATA	Variables	17
Engine	V9	Indexes	0
Created	10/25/2018 02:21:26	Observation Length	168
Last Modified	10/25/2018 02:21:26	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label	Framingham Heart Study
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	us-ascii ASCII (ANSI)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	14
First Data Page	1
Max Obs per Page	389
Obs in First Data Page	365
Number of Data Set Repairs	0
Filename	/opt/sasinside/SASHome/SASFoundation/9.4/sashelp/heart.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	6235
Access Permission	rw-r--r--
Owner Name	sas
File Size	960KB
File Size (bytes)	983040


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Label
12	AgeAtDeath	Num	8	Age at Death
5	AgeAtStart	Num	8	Age at Start
3	AgeCHDdiag	Num	8	Age CHD Diagnosed
15	BP_Status	Char	7	Blood Pressure Status
14	Chol_Status	Char	10	Cholesterol Status
13	Cholesterol	Num	8
2	DeathCause	Char	26	Cause of Death
8	Diastolic	Num	8
6	Height	Num	8
10	MRW	Num	8	Metropolitan Relative Weight
4	Sex	Char	6
11	Smoking	Num	8
17	Smoking_Status	Char	17	Smoking Status
1	Status	Char	5
9	Systolic	Num	8
7	Weight	Num	8
16	Weight_Status	Char	11	Weight Status

There are 5209 observations (rows) in the dataset.

Submit the following program to SAS. What is the value of CurrentDate? (This value represents the number of days since Januar 1, 1960, the reference date for the SAS system).
```
DATA work.date;
   CurrentDate = Today();
RUN;

PROC PRINT DATA = work.date;
RUN;
```

DATA work.date;
  CurrentDate = Today();
RUN;
   
PROC PRINT DATA = work.date;
RUN;

PROC PRINT DATA = work.date;
   format currentdate mmddyy10.;
RUN;

SAS Output

The SAS System


Obs	CurrentDate
1	22205

The SAS System


Obs	CurrentDate
1	10/17/2020

The value of current date will change depending on when you submit this program. I submitted this program in 10/17/2020 and received the value 22205 or 22205 days since January 1, 1960.

17.3. Data I/O¶

Use PROC IMPORT to read in the Youth Tobacco study, Youth_Tobacco_Survey_YTS_Data.csv and name it youth.

PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Youth_Tobacco_Survey_YTS_Data.csv"
            out = youth dbms = csv replace;
    getnames = yes;
    guessingrows = 500;
RUN;

PROC PRINT data = youth (OBS=5);
RUN;

SAS Output

The SAS System


Obs	YEAR	LocationAbbr	LocationDesc	TopicType	TopicDesc	MeasureDesc	DataSource	Data_Value_Unit	Data_Value_Type	Data_Value	Data_Value_Footnote_Symbol	Data_Value_Footnote	Data_Value_Std_Err	Low_Confidence_Limit	High_Confidence_Limit	Sample_Size	Gender	Race	Age	Education	GeoLocation	TopicTypeId	TopicId	MeasureId	StratificationID1	StratificationID2	StratificationID3	StratificationID4	SubMeasureID	DisplayOrder
1	2015	AZ	Arizona	Tobacco Use - Survey Data	Cessation (Youth)	Percent of Current Smokers Who Want to Quit	YTS	%	Percentage	.	*	Data in these cells have been suppressed because of a small sample size	.	.	.	.	Overall	All Races	All Ages	Middle School	(34.865970280000454, -111.76381127699972)	BEH	105BEH	170CES	1GEN	8AGE	6RAC	1EDU	YTS01	1
2	2015	AZ	Arizona	Tobacco Use - Survey Data	Cessation (Youth)	Percent of Current Smokers Who Want to Quit	YTS	%	Percentage	.	*	Data in these cells have been suppressed because of a small sample size	.	.	.	.	Male	All Races	All Ages	Middle School	(34.865970280000454, -111.76381127699972)	BEH	105BEH	170CES	2GEN	8AGE	6RAC	1EDU	YTS02	2
3	2015	AZ	Arizona	Tobacco Use - Survey Data	Cessation (Youth)	Percent of Current Smokers Who Want to Quit	YTS	%	Percentage	.	*	Data in these cells have been suppressed because of a small sample size	.	.	.	.	Female	All Races	All Ages	Middle School	(34.865970280000454, -111.76381127699972)	BEH	105BEH	170CES	3GEN	8AGE	6RAC	1EDU	YTS03	3
4	2015	AZ	Arizona	Tobacco Use - Survey Data	Cessation (Youth)	Quit Attempt in Past Year Among Current Cigarette Smokers	YTS	%	Percentage	.	*	Data in these cells have been suppressed because of a small sample size	.	.	.	.	Overall	All Races	All Ages	Middle School	(34.865970280000454, -111.76381127699972)	BEH	105BEH	169QUA	1GEN	8AGE	6RAC	1EDU	YTS04	4
5	2015	AZ	Arizona	Tobacco Use - Survey Data	Cessation (Youth)	Quit Attempt in Past Year Among Current Cigarette Smokers	YTS	%	Percentage	.	*	Data in these cells have been suppressed because of a small sample size	.	.	.	.	Male	All Races	All Ages	Middle School	(34.865970280000454, -111.76381127699972)	BEH	105BEH	169QUA	2GEN	8AGE	6RAC	1EDU	YTS05	5

Download an Excel version of the Monuments dataset, Monuments.xlsx, from CANVAS. Use PROC IMPORT to read in the dataset and call the output mon.

PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Monuments.xlsx"
            out = mon dbms = xlsx replace;
    getnames = yes;
RUN;

PROC PRINT data = mon (OBS=5);
RUN;

SAS Output

The SAS System


Obs	name	zipCode	neighborhood	councilDistrict	policeDistrict	Location 1
1	James Cardinal Gibbons	21201	Downtown	11	CENTRAL	408 CHARLES ST Baltimore, MD
2	The Battle Monument	21202	Downtown	11	CENTRAL
3	Negro Heroes of the U.S Monument	21202	Downtown	11	CENTRAL
4	Star Bangled Banner	21202	Downtown	11	CENTRAL	100 HOLLIDAY ST Baltimore, MD
5	Flame at the Holocaust Monument	21202	Downtown	11	CENTRAL	50 MARKET PL Baltimore, MD

Write a DATA step that will read in the following column formatted data using DATALINES and print the results using PROC PRINT. This dataset contains information on esophageal cancer in Ille-et-Vilaine, France. The variables are
- Age group
- Alcohol consumption
- Tobacco consumption
- Number of Cases
- Number of Controls

Write a DATA step to read in the data and print the results. Note: the first line is to help you count the columns. Each number corresponds to a multiple of 5.

----1----2----3----4----5----6----7----8
75+ 0-39g/day 0-9g/day      1        18
75+ 0-39g/day    10-19      2         6
75+ 0-39g/day      30+      1         3
75+     40-79 0-9g/day      2         5
75+     40-79    10-19      1         3
75+     40-79    20-29      0         3
75+     40-79      30+      1         1
75+    80-119 0-9g/day      1         1
75+    80-119    10-19      1         1
75+      120+ 0-9g/day      2         2
75+      120+    10-19      1         1

DATA eso;
   INPUT aggrp $ 1-3 alc $ 5-13 smoke $ 15-22 ncases 29 ncontrl 38-39;
   DATALINES;
75+ 0-39g/day 0-9g/day      1        18
75+ 0-39g/day    10-19      2         6
75+ 0-39g/day      30+      1         3
75+     40-79 0-9g/day      2         5
75+     40-79    10-19      1         3
75+     40-79    20-29      0         3
75+     40-79      30+      1         1
75+    80-119 0-9g/day      1         1
75+    80-119    10-19      1         1
75+      120+ 0-9g/day      2         2
75+      120+    10-19      1         1
;
RUN;

PROC PRINT DATA=eso;
RUN;

SAS Connection established. Subprocess id is 2198

SAS Output

The SAS System


Obs	aggrp	alc	smoke	ncases	ncontrl
1	75+	0-39g/day	0-9g/day	1	18
2	75+	0-39g/day	10-19	2	6
3	75+	0-39g/day	30+	1	3
4	75+	40-79	0-9g/day	2	5
5	75+	40-79	10-19	1	3
6	75+	40-79	20-29	0	3
7	75+	40-79	30+	1	1
8	75+	80-119	0-9g/day	1	1
9	75+	80-119	10-19	1	1
10	75+	120+	0-9g/day	2	2
11	75+	120+	10-19	1	1

17.4. SAS Variables and Assignment Statements¶

Use the following SAS data step to create the new variables Grade and Course defined below:

data school;
   input Age Quiz : $1. Midterm Final;
   /* Add you statements here */
datalines;
12 A 92 95
12 B 88 88
13 C 78 75
13 A 92 93
12 F 55 62
13 B 88 82
;

Using If-Then-Else statements, compute two new variables as follows:

Grade (numeric), with a value of 6 if Age is 12 and a value of 8 if Age is 13.
The quiz grades have numerical equivalents as follows: A = 95, B = 85, C = 75, D = 70, and F = 65. Using this information, compute a course grade (Course) as a weighted average of the Quiz (20%), Midterm (30%) and Final (50%).

data school;
   input Age Quiz : $1. Midterm Final;
   IF Age = 12 THEN Grade = 6;
   ELSE IF AGE = 13 THEN Grade = 8;
   IF quiz = "A" THEN Course = 0.2*95 + 0.3*midterm + 0.5*final;
   ELSE IF quiz = "B" THEN Course = 0.2*85 + 0.3*midterm + 0.5*final;
   ELSE IF quiz = "C" THEN Course = 0.2*75 + 0.3*midterm + 0.5*final;
   ELSE IF quiz = "D" THEN Course = 0.2*70 + 0.3*midterm + 0.5*final;
   ELSE IF quiz = "F" THEN Course = 0.2*65 + 0.3*midterm + 0.5*final;
datalines;
12 A 92 95
12 B 88 88
13 C 78 75
13 A 92 93
12 F 55 62
13 B 88 82
;
RUN;

PROC PRINT DATA = school;
RUN;

SAS Output

The SAS System


Obs	Age	Quiz	Midterm	Final	Grade	Course
1	12	A	92	95	6	94.1
2	12	B	88	88	6	87.4
3	13	C	78	75	8	75.9
4	13	A	92	93	8	93.1
5	12	F	55	62	6	60.5
6	13	B	88	82	8	84.4

17.5. Working with Your Data¶

Create a temporary SAS dataset called heart from the sashelp.heart dataset. The library sashelp if a permanent library that comes with SAS.

DATA heart;
  SET sashelp.heart;
RUN;

1348  ods listing close;ods html5 (id=saspy_internal) file=stdout options(bitmap_mode='inline') device=svg style=HTMLBlue; ods
1348! graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: STDOUT
1349  
1350  DATA heart;
1351    SET sashelp.heart;
1352  RUN;
NOTE: There were 5209 observations read from the data set SASHELP.HEART.
NOTE: The data set WORK.HEART has 5209 observations and 17 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
1353  
1354  ods html5 (id=saspy_internal) close;ods listing;

1355

View the description of the heart dataset by running PROC CONTENTS. How many observations (rows) and variables (columns) are in the dataset?

PROC CONTENTS DATA = heart;
RUN;

SAS Output

The SAS System

The CONTENTS Procedure


Data Set Name	WORK.HEART	Observations	5209
Member Type	DATA	Variables	17
Engine	V9	Indexes	0
Created	10/17/2020 21:35:04	Observation Length	168
Last Modified	10/17/2020 21:35:04	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	14
First Data Page	1
Max Obs per Page	389
Obs in First Data Page	365
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/heart.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671720
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	960KB
File Size (bytes)	983040


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Label
12	AgeAtDeath	Num	8	Age at Death
5	AgeAtStart	Num	8	Age at Start
3	AgeCHDdiag	Num	8	Age CHD Diagnosed
15	BP_Status	Char	7	Blood Pressure Status
14	Chol_Status	Char	10	Cholesterol Status
13	Cholesterol	Num	8
2	DeathCause	Char	26	Cause of Death
8	Diastolic	Num	8
6	Height	Num	8
10	MRW	Num	8	Metropolitan Relative Weight
4	Sex	Char	6
11	Smoking	Num	8
17	Smoking_Status	Char	17	Smoking Status
9	Systolic	Num	8
7	Weight	Num	8
16	Weight_Status	Char	11	Weight Status
1	condition	Char	5

There are 5209 observations (rows) and 17 variables (columns).

Use the RENAME= option to rename Status to condition. Show the variables table from PROC CONTENTS to see that the name has changed.

DATA heart2 (RENAME = (status = condition));
   SET heart;
RUN;

PROC CONTENTS data = heart2;
RUN;

SAS Output

The SAS System

The CONTENTS Procedure


Data Set Name	WORK.HEART2	Observations	5209
Member Type	DATA	Variables	17
Engine	V9	Indexes	0
Created	10/17/2020 21:39:47	Observation Length	168
Last Modified	10/17/2020 21:39:47	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	14
First Data Page	1
Max Obs per Page	389
Obs in First Data Page	365
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/heart2.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671720
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	960KB
File Size (bytes)	983040


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Label
12	AgeAtDeath	Num	8	Age at Death
5	AgeAtStart	Num	8	Age at Start
3	AgeCHDdiag	Num	8	Age CHD Diagnosed
15	BP_Status	Char	7	Blood Pressure Status
14	Chol_Status	Char	10	Cholesterol Status
13	Cholesterol	Num	8
2	DeathCause	Char	26	Cause of Death
8	Diastolic	Num	8
6	Height	Num	8
10	MRW	Num	8	Metropolitan Relative Weight
4	Sex	Char	6
11	Smoking	Num	8
17	Smoking_Status	Char	17	Smoking Status
9	Systolic	Num	8
7	Weight	Num	8
16	Weight_Status	Char	11	Weight Status
1	condition	Char	5

Create a subset of heart called heart_sub that contains only the columns height, weight, and diastolic. What are the dimensions of heart_sub? Print the first five rows of heart_sub.

DATA heart_sub;
   SET heart;
   KEEP height weight diastolic;
RUN;

PROC CONTENTS data = heart_sub;
RUN;

PROC PRINT data = heart_sub(OBS=5);
RUN;

SAS Output

The SAS System

The CONTENTS Procedure


Data Set Name	WORK.HEART_SUB	Observations	5209
Member Type	DATA	Variables	3
Engine	V9	Indexes	0
Created	10/17/2020 21:39:50	Observation Length	24
Last Modified	10/17/2020 21:39:50	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	2
First Data Page	1
Max Obs per Page	2714
Obs in First Data Page	2625
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671721
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	192KB
File Size (bytes)	196608


Alphabetic List of Variables and Attributes
#	Variable	Type	Len
3	Diastolic	Num	8
1	Height	Num	8
2	Weight	Num	8

The SAS System


Obs	Height	Weight	Diastolic
1	62.50	140	78
2	59.75	194	92
3	62.25	132	90
4	65.75	158	80
5	66.00	156	76

The new dataset heart_sub contains 5209 rows and 3 columns.

Subset the patients (rows) of heart that have a diastolic blood pressure (diastolic) more than 90 into a dataset called heart_sub2. How many are there? Print the first five rows of heart_sub2.

DATA heart_sub2;
   SET heart;
   WHERE diastolic > 90;
RUN;

PROC CONTENTS data = heart_sub2;
RUN;

PROC PRINT data = heart_sub2(OBS=5);
RUN;

SAS Output

The SAS System

The CONTENTS Procedure


Data Set Name	WORK.HEART_SUB2	Observations	1366
Member Type	DATA	Variables	17
Engine	V9	Indexes	0
Created	10/17/2020 21:39:55	Observation Length	168
Last Modified	10/17/2020 21:39:55	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	4
First Data Page	1
Max Obs per Page	389
Obs in First Data Page	365
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub2.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671723
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	320KB
File Size (bytes)	327680


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Label
12	AgeAtDeath	Num	8	Age at Death
5	AgeAtStart	Num	8	Age at Start
3	AgeCHDdiag	Num	8	Age CHD Diagnosed
15	BP_Status	Char	7	Blood Pressure Status
14	Chol_Status	Char	10	Cholesterol Status
13	Cholesterol	Num	8
2	DeathCause	Char	26	Cause of Death
8	Diastolic	Num	8
6	Height	Num	8
10	MRW	Num	8	Metropolitan Relative Weight
4	Sex	Char	6
11	Smoking	Num	8
17	Smoking_Status	Char	17	Smoking Status
1	Status	Char	5
9	Systolic	Num	8
7	Weight	Num	8
16	Weight_Status	Char	11	Weight Status

The SAS System


Obs	Status	DeathCause	AgeCHDdiag	Sex	AgeAtStart	Height	Weight	Diastolic	Systolic	MRW	Smoking	AgeAtDeath	Cholesterol	Chol_Status	BP_Status	Weight_Status	Smoking_Status
1	Dead	Cancer	.	Female	41	59.75	194	92	144	183	0	57	181	Desirable	High	Overweight	Non-smoker
2	Alive		.	Female	58	61.75	131	92	176	117	0	.	196	Desirable	High	Overweight	Non-smoker
3	Alive		.	Female	42	67.75	162	96	138	119	1	.	200	Borderline	High	Overweight	Light (1-5)
4	Dead	Coronary Heart Disease	71	Female	49	60.50	153	110	196	140	5	73	221	Borderline	High	Overweight	Light (1-5)
5	Alive		43	Male	33	66.50	172	106	146	127	0	.	247	High	High	Overweight	Non-smoker

There are 1366 patients with a diastolic blood pressure above 90.

Subset the patients (rows) of heart that have a diastolic blood pressure less than or equal to 90 and have a weight more than 180 pounds into a dataset called heart_sub3. How many are there? Print the first five rows of heart_sub3.

DATA heart_sub3;
   SET heart;
   WHERE diastolic <= 90 AND weight > 180;
RUN;

PROC CONTENTS data = heart_sub3;
RUN;

PROC PRINT data = heart_sub3(OBS=5);
RUN;

SAS Output

The SAS System

The CONTENTS Procedure


Data Set Name	WORK.HEART_SUB3	Observations	481
Member Type	DATA	Variables	17
Engine	V9	Indexes	0
Created	10/17/2020 21:43:36	Observation Length	168
Last Modified	10/17/2020 21:43:36	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	2
First Data Page	1
Max Obs per Page	389
Obs in First Data Page	365
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub3.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671722
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	192KB
File Size (bytes)	196608


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Label
12	AgeAtDeath	Num	8	Age at Death
5	AgeAtStart	Num	8	Age at Start
3	AgeCHDdiag	Num	8	Age CHD Diagnosed
15	BP_Status	Char	7	Blood Pressure Status
14	Chol_Status	Char	10	Cholesterol Status
13	Cholesterol	Num	8
2	DeathCause	Char	26	Cause of Death
8	Diastolic	Num	8
6	Height	Num	8
10	MRW	Num	8	Metropolitan Relative Weight
4	Sex	Char	6
11	Smoking	Num	8
17	Smoking_Status	Char	17	Smoking Status
1	Status	Char	5
9	Systolic	Num	8
7	Weight	Num	8
16	Weight_Status	Char	11	Weight Status

The SAS System


Obs	Status	AgeCHDdiag	Sex	AgeAtStart	Height	Weight	Diastolic	Systolic	MRW	Smoking	AgeAtDeath	Cholesterol	Chol_Status	BP_Status	Weight_Status	Smoking_Status
1	Alive	.	Male	35	71.00	194	68	132	124	0	.	211	Borderline	Normal	Overweight	Non-smoker
2	Alive	.	Female	50	67.50	185	88	150	136	15	.	228	Borderline	High	Overweight	Moderate (6-15)
3	Alive	.	Male	42	72.25	182	78	136	113	0	.	221	Borderline	Normal	Overweight	Non-smoker
4	Alive	68	Male	40	70.00	189	78	124	124	0	.	319	High	Normal	Overweight	Non-smoker
5	Alive	68	Male	40	70.00	195	76	132	128	20	.	205	Borderline	Normal	Overweight	Heavy (16-25)

There are 481 patients with a diastolic blood pressure less than or equal to 90 and with a weight above 180 pounds.

Create a subset of heart that only contains the columns weight, systolic, and cholesterol for those that are heavy smokers smoker (Smoking_Status = “Heavy (16-25)”). Save this to heart_sub. What are the dimensions of the dataset? Print the first five rows of heart_sub.

DATA heart_sub;
   SET heart;
   WHERE Smoking_Status = "Heavy (16-25)";
   KEEP weight systolic cholesterol;
RUN;

PROC CONTENTS data = heart_sub;
RUN;

PROC PRINT data = heart_sub(OBS=5);
RUN;

SAS Output

The SAS System

The CONTENTS Procedure


Data Set Name	WORK.HEART_SUB	Observations	1046
Member Type	DATA	Variables	3
Engine	V9	Indexes	0
Created	10/17/2020 21:47:19	Observation Length	24
Last Modified	10/17/2020 21:47:19	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	2714
Obs in First Data Page	1046
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/heart_sub.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671725
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	128KB
File Size (bytes)	131072


Alphabetic List of Variables and Attributes
#	Variable	Type	Len
3	Cholesterol	Num	8
2	Systolic	Num	8
1	Weight	Num	8

The SAS System


Obs	Weight	Systolic	Cholesterol
1	156	110	281
2	195	132	205
3	181	144	223
4	152	144	263
5	196	128	.

The new heart_sub dataset contains 1046 rows and three columns.

Reorder the rows of heart_sub in decreasing order by cholesterol. Print the first 5 rows of the sorted dataset.

PROC SORT data = heart_sub out = heart_sub_srt;
  BY descending cholesterol;
RUN;

PROC PRINT data = heart_sub_srt(OBS=5);
RUN;

SAS Output

The SAS System


Obs	Weight	Systolic	Cholesterol
1	169	155	568
2	205	154	534
3	194	150	429
4	104	122	392
5	137	185	368

17.6. Data Summarization¶

How many bike lanes are currently in Baltimore? You can assume that each observation/row is a different bike lane.

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
            out = bike dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC CONTENTS data = bike;
RUN;

SAS Output

The SAS System

The CONTENTS Procedure


Data Set Name	WORK.BIKE	Observations	1631
Member Type	DATA	Variables	9
Engine	V9	Indexes	0
Created	10/17/2020 22:01:39	Observation Length	136
Last Modified	10/17/2020 22:01:39	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	4
First Data Page	1
Max Obs per Page	481
Obs in First Data Page	459
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/bike.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671721
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	320KB
File Size (bytes)	327680


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
3	block	Char	29	$29.	$29.
9	dateInstalled	Num	8	BEST12.	BEST32.
8	length	Num	8	BEST12.	BEST32.
2	name	Char	20	$20.	$20.
5	numLanes	Num	8	BEST12.	BEST32.
6	project	Char	25	$25.	$25.
7	route	Char	13	$13.	$13.
1	subType	Char	6	$6.	$6.
4	type	Char	15	$15.	$15.

There are 1631 rows.

How many (a) feet and (b) miles of total bike lanes are currently in Baltimore? (The length variable provides the length in feet.)

PROC MEANS data = bike sum;
   VAR length;
RUN;

SAS Output

The SAS System

The MEANS Procedure


Analysis Variable : length
Sum
439447.59

DATA miles;
   length_mi = 439447.59 / 5280;
RUN;

PROC PRINT data = miles;
RUN;

SAS Output

The SAS System


Obs	length_mi
1	83.2287

How many types (type) bike lanes are there? Which type (a) occurs the most and (b) has the longest average bike lane length?

PROC FREQ data = bike;
  TABLE type;
RUN;

PROC MEANS data = bike noprint;
   CLASS type;
   VAR length;
   OUTPUT out = type_mean mean = length_mean;
RUN;

PROC SORT data = type_mean;
  BY descending length_mean;
RUN;

PROC PRINT DATA = type_mean;
RUN;

SAS Output

The SAS System

The FREQ Procedure


type	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Frequency Missing = 9
BIKE BOULEVARD	49	3.02	49	3.02
BIKE LANE	621	38.29	670	41.31
CONTRAFLOW	13	0.80	683	42.11
SHARED BUS BIKE	39	2.40	722	44.51
SHARROW	589	36.31	1311	80.83
SIDEPATH	7	0.43	1318	81.26
SIGNED ROUTE	304	18.74	1622	100.00

The SAS System


Obs	type	_TYPE_	_FREQ_	length_mean
1	SIDEPATH	1	7	665.87751019
2	BIKE LANE	1	621	300.24183502
3	SHARED BUS BIKE	1	39	276.66575695
4		0	1622	269.48876975
5	SIGNED ROUTE	1	304	263.57986593
6	SHARROW	1	589	243.8718592
7	BIKE BOULEVARD	1	49	197.27815495
8	CONTRAFLOW	1	13	136.46646088

There are 7 bike lane types (8 if you include missing as a category). The lane type that occurrs the most often is bike lane with a count of 621. The lane type sidepath has the longest average length at 665.88 ft.

How many different projects (project) do the bike lanes fall into? Which project category has the longest average bike lane length?

PROC FREQ data = bike;
  TABLE project;
RUN;

PROC MEANS data = bike noprint;
   CLASS project;
   VAR length;
   OUTPUT out = project_mean mean = length_mean;
RUN;

PROC SORT data = project_mean;
  BY descending length_mean;
RUN;

PROC PRINT DATA = project_mean;
RUN;

SAS Output

The SAS System

The FREQ Procedure


project	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Frequency Missing = 74
CHARM CITY CIRCULATOR	39	2.50	39	2.50
COLLEGETOWN	339	21.77	378	24.28
COLLEGETOWN NETWORK	13	0.83	391	25.11
ENGINEERING CONSTRUCTION	12	0.77	403	25.88
GUILFORD AVE BIKE BLVD	49	3.15	452	29.03
MAINTENANCE	4	0.26	456	29.29
OPERATION ORANGE CONE	458	29.42	914	58.70
PARK HEIGHTS BIKE NETWORK	172	11.05	1086	69.75
PLANNING TRAFFIC	18	1.16	1104	70.91
SOUTHEAST BIKE NETWORK	323	20.75	1427	91.65
TRAFFIC	51	3.28	1478	94.93
TRAFFIC CALMING	79	5.07	1557	100.00

The SAS System


Obs	project	_TYPE_	_FREQ_	length_mean
1	MAINTENANCE	1	4	1942.1522856
2	ENGINEERING CONSTRUCTION	1	12	512.09762483
3	TRAFFIC	1	51	419.52878809
4	COLLEGETOWN	1	339	320.68363126
5	PARK HEIGHTS BIKE NETWORK	1	172	283.22519367
6	CHARM CITY CIRCULATOR	1	39	276.66575695
7		0	1557	272.0534698
8	TRAFFIC CALMING	1	79	268.53137627
9	OPERATION ORANGE CONE	1	458	250.07840273
10	COLLEGETOWN NETWORK	1	13	213.63728443
11	SOUTHEAST BIKE NETWORK	1	323	210.82827388
12	PLANNING TRAFFIC	1	18	209.42893827
13	GUILFORD AVE BIKE BLVD	1	49	197.27815495

There are 12 project types (13 if you include missing as a project type). The maintenance project has the longest average bike lane length at 1942.15 ft.

What was the average bike lane length per year that they were installed? (Be sure to first set dateInstalled to . if it is equal to zero.)

DATA bike;
   SET bike;
   IF dateinstalled = 0 THEN dateinstalled = .;
RUN;

PROC MEANS data = bike mean;
  CLASS dateinstalled;
  VAR length;
RUN;

SAS Output

The SAS System

The MEANS Procedure


Analysis Variable : length
dateInstalled	N Obs	Mean
2006	2	1469.35
2007	368	309.8903287
2008	206	249.4810027
2009	86	407.2947463
2010	625	245.6015777
2011	101	232.8127593
2012	107	271.4321825
2013	10	290.4263702

(a) Numerically and (b) graphically describe the distribution of bike lane lengths (length).

PROC UNIVARIATE data = bike plots;
  VAR length;
RUN;

SAS Output

The SAS System

The UNIVARIATE Procedure

Variable: length


Moments
N	1631	Sum Weights	1631
Mean	269.434449	Sum Observations	439447.587
Std Deviation	277.706646	Variance	77120.9813
Skewness	5.1297861	Kurtosis	43.3888947
Uncorrected SS	244109518	Corrected SS	125707200
Coeff Variation	103.070208	Std Error Mean	6.87637089


Basic Statistical Measures
Location		Variability
Mean	269.4344	Std Deviation	277.70665
Median	200.3027	Variance	77121
Mode	.	Range	3749
		Interquartile Range	217.16479


Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	39.18265	Pr > \|t\|	<.0001
Sign	M	815	Pr >= \|M\|	<.0001
Signed Rank	S	664632.5	Pr >= \|S\|	<.0001


Quantiles (Definition 5)
Level	Quantile
100% Max	3749.3226
99%	1341.8291
95%	664.2305
90%	502.2232
75% Q3	341.1493
50% Median	200.3027
25% Q1	123.9845
10%	67.8206
5%	45.3655
1%	14.8383
0% Min	0.0000


Extreme Observations
Lowest		Highest
Value	Obs	Value	Obs
0.00000	4	2536.73	1284
5.51372	523	2715.20	1344
6.85712	724	2916.53	903
8.16243	490	3467.68	904
9.71914	595	3749.32	3

Describe the distribution of bike lane lengths numerically and graphically after stratifying them by (a) type and then by (b) number of lanes (numLanes).

PROC MEANS data=bike mean std;
   CLASS type;
   VAR length;
RUN;

PROC SGPLOT data = bike;
   VBOX length / category = type;
RUN;

PROC MEANS data=bike mean std;
   CLASS numlanes;
   VAR length;
RUN;

PROC SGPLOT data = bike;
   VBOX length / category = numlanes;
RUN;

SAS Output

The SAS System

The MEANS Procedure


Analysis Variable : length
type	N Obs	Mean	Std Dev
BIKE BOULEVARD	49	197.2781549	113.1318019
BIKE LANE	621	300.2418350	314.6605599
CONTRAFLOW	13	136.4664609	48.5165725
SHARED BUS BIKE	39	276.6657570	140.1811711
SHARROW	589	243.8718592	188.4222254
SIDEPATH	7	665.8775102	619.3894227
SIGNED ROUTE	304	263.5798659	347.8501541

The SAS System

The MEANS Procedure


Analysis Variable : length
numLanes	N Obs	Mean	Std Dev
0	21	308.3767969	201.5548096
1	900	277.9016282	317.8281529
2	710	257.5495894	218.6179448

17.7. SAS Formats and Dates¶

Use proc freq to make a table of the different bike lane types.

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
            out = bike dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC FREQ data = bike;
   TABLE type;
RUN;

SAS Output

The SAS System

The FREQ Procedure


type	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Frequency Missing = 9
BIKE BOULEVARD	49	3.02	49	3.02
BIKE LANE	621	38.29	670	41.31
CONTRAFLOW	13	0.80	683	42.11
SHARED BUS BIKE	39	2.40	722	44.51
SHARROW	589	36.31	1311	80.83
SIDEPATH	7	0.43	1318	81.26
SIGNED ROUTE	304	18.74	1622	100.00

Create a format that changes changes the bike lane types by mapping “SIDEPATH”, “BIKE BOULEVARD”, and “BIKE LANE” to itself but does not provide a value for the other types. Apply this format to type and make a table of type using PROC FREQ. What happens to the other bike lane types?

PROC FORMAT;
   VALUE $typefmt "SIDEPATH" = "SIDEPATH"
                   "BIKE BOULEVARD" = "BIKE BOULEVARD"
                   "BIKE LANE" = "BIKE LANE"
                   OTHER = " ";
RUN;

DATA bike2;
  SET bike;
  FORMAT type $typefmt.;
RUN;

PROC FREQ data = bike2;
   TABLE type;
RUN;

SAS Output

The SAS System

The FREQ Procedure


type	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Frequency Missing = 954
BIKE BOULEVARD	49	7.24	49	7.24
BIKE LANE	621	91.73	670	98.97
SIDEPATH	7	1.03	677	100.00

We have translated all but BIKE BOULEVARD, BIKE LANE, and SIDEPATE to missing, so we only get these three listed as categories in the table.

Create a format that changes the bike lane types to be “CONTRAFLOW”, “SHARED BUS BIKE”, “SHARROW”, “SIGNED ROUTE”, or “OTHER” if the lane type is anything else.

PROC FORMAT;
   VALUE $type2fmt "CONTRAFLOW" = "CONTRAFLOW"
                   "SHARED BUS BIKE" = "SHARED BUS BIKE"
                   "SHARROW" = "SHARROW"
                   "SIGNED ROUTE" = "SIGNED ROUTE"
                   "SIDEPATH", "BIKE BOULEVARD", "BIKE LANE" = "OTHER";
RUN;

DATA bike3;
  SET bike;
  FORMAT type $type2fmt.;
RUN;

PROC FREQ data = bike3;
   TABLE type;
RUN;

SAS Output

The SAS System

The FREQ Procedure


type	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Frequency Missing = 9
OTHER	677	41.74	677	41.74
CONTRAFLOW	13	0.80	690	42.54
SHARED BUS BIKE	39	2.40	729	44.94
SHARROW	589	36.31	1318	81.26
SIGNED ROUTE	304	18.74	1622	100.00

Write a datatep that can be used to read in the following small dataset with three dates and one time. Then print the dataset using readable date formats.


DATA temp;
  INPUT date1 /*informat1*/ +1 date2 /*informat2*/ +1
        date3 /*informat3*/ @32 time /*informat4*/;
  DATALINES;
2014/02/14 06Jan2018 4/5/2016   03:2:22
;
RUN;

DATA temp;
INPUT date1 yymmdd10. +1 date2 DATE9. +1
     date3 mmddyy10. @32 time time8.;
DATALINES;
2014/02/14 06Jan2018 4/5/2016   03:2:22
;
RUN;

PROC PRINT data = temp;
  FORMAT date1 date2 date3 mmddyy10. time time8.;
RUN;

SAS Output

The SAS System


Obs	date1	date2	date3	time
1	02/14/2014	01/06/2018	04/05/2016	3:02:22

17.8. Character Functions¶

Read in the Baltimore city employee salaries dataset, Baltimore_City_Employee_Salaries_FY2015.csv, using PROC IMPORT.

PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Baltimore_City_Employee_Salaries_FY2015.csv"
            out = sal dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC PRINT data = sal(OBS=5);
RUN;

SAS Output

The SAS System


Obs	name	JobTitle	AgencyID	Agency	HireDate	AnnualSalary	GrossPay
1	Aaron,Patricia G	Facilities/Office Services II	A03031	OED-Employment Dev (031)	10/24/1979	55,314	53,626
2	Aaron,Petra L	ASSISTANT STATE'S ATTORNEY	A29045	States Attorneys Office (045)	09/25/2006	74,000	73,000
3	Abaineh,Yohannes T	EPIDEMIOLOGIST	A65026	HLTH-Health Department (026)	07/23/2009	64,500	64,404
4	Abbene,Anthony M	POLICE OFFICER	A99005	Police Department (005)	07/24/2013	46,309	59,620
5	Abbey,Emmanuel	CONTRACT SERV SPEC II	A40001	M-R Info Technology (001)	05/01/2013	60,060	54,060

Make a temporary SAS dataset called health_sal using the salaries dataset, with only the agencies (JobTitle) of those with “fire” (anywhere in the job title), if any, in the name. Be sure to ignore case in the string matching for ‘fire’.

DATA health_sal;
    SET sal;
    WHERE index(lowcase(JobTitle), 'fire') > 0;
RUN;

PROC PRINT data=health_sal(OBS=5);
    VAR JobTitle;
RUN;

SAS Output

The SAS System


Obs	JobTitle
1	EMT Firefighter Suppression
2	Fire Pump Operator Supp ALS
3	Firefighter/Paramedic Suppress
4	Fire Pump Operator Supp ALS
5	Firefighter Suppression

Make a temporary data set called trans which contains only agencies that contain “TRANS” (case sensitive).

DATA trans;
    SET sal;
    WHERE index(JobTitle, 'TRANS') > 0;
RUN;

PROC PRINT data=trans(OBS=5);
    VAR JobTitle;
RUN;

SAS Output

The SAS System


Obs	JobTitle
1	TRANSPORTATION ASSOC II
2	TAX TRANSFER CLERK I
3	GENL SUPT TRANSPORTATION MAINT
4	TAX TRANSFER CLERK I
5	TAX TRANSFER CLERK I

What is/are the profession(s) of people who have “abra” in their name for Baltimore’s Salaries? Case should be ignored.

PROC PRINT data=sal;
    VAR name JobTitle;
    WHERE index(lowcase(name), 'abra') > 0;
RUN;

SAS Output

The SAS System


Obs	name	JobTitle
26	Abraham,Donta D	LABORER (Hourly)
27	Abraham,Santhosh	ACCOUNTANT I
28	Abraham,Sharon M	HOUSING INSPECTOR
29	Abrahams,Brandon A	POLICE OFFICER TRAINEE
30	Abrams,Maria	OFFICE SERVICES ASSISTANT II S
31	Abrams,Maxine	COLLECTIONS REPRESENTATIVE I
32	Abrams,Terry	RECREATION ARTS INSTRUCTOR
958	Bey,Abraham	PROCUREMENT SPECIALIST II
3761	Elgamil,Abraham D	AUDITOR SUPV
4545	Gatto,Abraham M	POLICE OFFICER
11107	Schwartz,Abraham M	GENERAL COUNSEL
12789	Velez,Abraham L	POLICE OFFICER (EID)

Create a subset of the salaries dataset that only includes the Police Department, Fire Department and Sheriff’s Office. Use the Agency variable with (case sensitive) string matching. Call this emer. How many employees are in this new dataset?

DATA emer;
   SET sal;
   WHERE index(agency, "Sheriff's Office") > 0 OR
         index(agency, "Police Department") > 0 OR
         index(agency, "Fire Department") > 0;
RUN;

PROC PRINT data=emer(OBS=5);
   VAR agency;
RUN;

SAS Output

The SAS System


Obs	Agency
1	Police Department (005)
2	Fire Department (120)
3	Police Department (127)
4	Sheriff's Office (410)
5	Fire Department (465)

Create a variable called dept in the emer data set by extracting the the part of the Agency variable up until ‘ment’ or ‘ice’. E.g. we want to extract all characters up until ment or ice (we can group in regex using parentheses) and then discard the rest. Use:


re = prxparse('/.*(ment|ice)/');
call prxsubstr(re, Agency, pos, len);
dept = substr(Agency, pos, len);

DATA emer;
   SET emer;
   re = prxparse('/.*(ment|ice)/');
   call prxsubstr(re, Agency, pos, len);
   dept = substr(Agency, pos, len);
RUN;

PROC FREQ data = emer;
   TABLE dept;
RUN;

SAS Output

The SAS System

The FREQ Procedure


dept	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Fire Department	1615	32.81	1615	32.81
Police Department	3097	62.92	4712	95.73
Sheriff's Office	210	4.27	4922	100.00

17.9. Data Cleaning¶

Read in the bike lanes dataset Bike_Lanes.csv using PROC IMPORT and call it bike.

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
            out = bike dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC PRINT DATA = bike(OBS=5);
RUN;

SAS Output

The SAS System


Obs	subType	name	block	type	numLanes	project	route	length	dateInstalled
1				BIKE BOULEVARD	1	GUILFORD AVE BIKE BLVD		435.73787702	0
2				SIDEPATH	1		NORTHERN	1024.6746249	2010
3				SIGNED ROUTE	1	SOUTHEAST BIKE NETWORK		3749.3226377	2010
4		HUNTINGDON PATH		SIDEPATH	1			0	0
5	STCLN	EDMONDSON AVE	5300 BLK EDMONDSON AVE	BIKE LANE	1	OPERATION ORANGE CONE		180.92546489	2011

How many rows are are in the bike dataset? How many are complete cases? Hint: Use cmiss(of ALL) in a dataset to create a 0/1 variable to indicate if it complete or not and then use PROC MEANS to sum this 0/1 variable.

PROC CONTENTS data = bike;
RUN;

SAS Output

The SAS System

The CONTENTS Procedure


Data Set Name	WORK.BIKE	Observations	1631
Member Type	DATA	Variables	9
Engine	V9	Indexes	0
Created	10/18/2020 11:55:08	Observation Length	136
Last Modified	10/18/2020 11:55:08	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	4
First Data Page	1
Max Obs per Page	481
Obs in First Data Page	459
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/bike.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671716
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	320KB
File Size (bytes)	327680


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
3	block	Char	29	$29.	$29.
9	dateInstalled	Num	8	BEST12.	BEST32.
8	length	Num	8	BEST12.	BEST32.
2	name	Char	20	$20.	$20.
5	numLanes	Num	8	BEST12.	BEST32.
6	project	Char	25	$25.	$25.
7	route	Char	13	$13.	$13.
1	subType	Char	6	$6.	$6.
4	type	Char	15	$15.	$15.

There are 1631 rows in the bike data set.

DATA bike;
  SET bike;
  IF cmiss(of _ALL_) > 0 THEN complete = 0;
  ELSE complete = 1;
RUN;

PROC PRINT data = bike(OBS=5);
RUN;

PROC MEANS data = bike sum;
   VAR complete;
RUN;

SAS Output

The SAS System


Obs	subType	name	block	type	numLanes	project	route	length	dateInstalled
1				BIKE BOULEVARD	1	GUILFORD AVE BIKE BLVD		435.73787702	0
2				SIDEPATH	1		NORTHERN	1024.6746249	2010
3				SIGNED ROUTE	1	SOUTHEAST BIKE NETWORK		3749.3226377	2010
4		HUNTINGDON PATH		SIDEPATH	1			0	0
5	STCLN	EDMONDSON AVE	5300 BLK EDMONDSON AVE	BIKE LANE	1	OPERATION ORANGE CONE		180.92546489	2011

The SAS System

The MEANS Procedure


Analysis Variable : complete
Sum
257.0000000

Filter rows of bike that are NOT missing the route variable, assign this to the object have_route. Create a frequency table of the subType variable using PROC FREQ, including the missing subTypes.

DATA have_route;
   SET bike;
   WHERE NOT MISSING(route);
RUN;

PROC FREQ data = have_route;
   TABLE subType / missing;
RUN;

SAS Output

The SAS System

The FREQ Procedure


subType	Frequency	Percent	Cumulative Frequency	Cumulative Percent
	1	0.28	1	0.28
STRALY	3	0.83	4	1.10
STRPRD	358	98.90	362	100.00

17.10. Data Manipulation¶

Read in the Bike_Lanes_Wide.csv dataset and call is wide using PROC IMPORT. Print the first few rows of the dataset.

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes_Wide.csv"
            out = wide dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC CONTENTS data = wide;
RUN;

PROC PRINT DATA = wide(OBS=5);
RUN;

SAS Connection established. Subprocess id is 8974

SAS Output

The SAS System

The CONTENTS Procedure


Data Set Name	WORK.WIDE	Observations	134
Member Type	DATA	Variables	9
Engine	V9	Indexes	0
Created	10/29/2020 14:57:01	Observation Length	141
Last Modified	10/29/2020 14:57:01	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	464
Obs in First Data Page	134
Number of Data Set Repairs	0
Filename	/tmp/SAS_workFB360000230E_localhost.localdomain/wide.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671855
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	128KB
File Size (bytes)	131072


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
2	BIKE BOULEVARD	Char	11	$11.	$11.
3	BIKE LANE	Char	11	$11.	$11.
4	CONTRAFLOW	Char	11	$11.	$11.
9	NA	Char	11	$11.	$11.
5	SHARED BUS BIKE	Char	11	$11.	$11.
6	SHARROW	Char	11	$11.	$11.
7	SIDEPATH	Char	11	$11.	$11.
8	SIGNED ROUTE	Char	11	$11.	$11.
1	name	Char	53	$53.	$53.

The SAS System


Obs	name	BIKE BOULEVARD	BIKE LANE	CONTRAFLOW	SHARED BUS BIKE	SHARROW	SIDEPATH	SIGNED ROUTE	NA
1	ALBEMARLE ST	NA	NA	NA	NA	109.9141727	NA	234.9190031	NA
2	ALICEANNA ST	NA	NA	NA	NA	250.7103035	NA	NA	NA
3	ARGONNE DR	NA	1312.608373	NA	NA	NA	NA	NA	NA
4	ART MUSEUM DR	NA	NA	NA	NA	NA	NA	277.6579871	NA
5	AUCHENTOROLY TERR	NA	1341.82908	NA	NA	NA	NA	NA	NA

Reshape wide using either PROC TRANSPOSE or a DATA step. You will need to gather all columns except the name column. Transform into a long dataset with two new columns lanetype (the former column names) and the_length the data values. In the variable the_length, replace ‘NA’ values with . and convert it to a numeric column.

DATA long;
   SET wide;
   array col{*} 'BIKE BOULEVARD'n -- NA; /*BIKE_BOULEVARD -- NA*/
   DO i = 1 TO dim(col);
      lanetype = vname(col(i));
      IF col(i) = 'NA' THEN the_length = .;
      ELSE the_length = INPUT(col(i), 20.7);
      OUTPUT;
   END;
   DROP 'BIKE BOULEVARD'n -- NA i; /*BIKE_BOULEVARD -- NA*/
RUN;

PROC PRINT data = long(obs=10);
   TITLE 'Using a DATA step';
RUN;

PROC SORT data=wide;
   BY name;
RUN;

PROC TRANSPOSE data = wide
               out = long2 (RENAME = (COL1 = the_length 
                                      _NAME_ = lanetype));
   BY name;
   VAR 'BIKE BOULEVARD'n -- NA;  /*BIKE_BOULEVARD -- NA*/
RUN;

DATA long2;
   SET long2(RENAME = (the_length = length));
   IF length = 'NA' THEN the_length = .;
   ELSE the_length = INPUT(length, 20.7);
   DROP length;
RUN;

PROC PRINT data = long2(OBS=10);
   TITLE 'Using PROC TRANSPOSE';
RUN;

TITLE;

SAS Output

Using a DATA step


Obs	name	lanetype	the_length
1	ALBEMARLE ST	BIKE BOULEVARD	.
2	ALBEMARLE ST	BIKE LANE	.
3	ALBEMARLE ST	CONTRAFLOW	.
4	ALBEMARLE ST	SHARED BUS BIKE	.
5	ALBEMARLE ST	SHARROW	109.914
6	ALBEMARLE ST	SIDEPATH	.
7	ALBEMARLE ST	SIGNED ROUTE	234.919
8	ALBEMARLE ST	NA	.
9	ALICEANNA ST	BIKE BOULEVARD	.
10	ALICEANNA ST	BIKE LANE	.

Using PROC TRANSPOSE


Obs	name	lanetype	the_length
1	ALBEMARLE ST	BIKE BOULEVARD	.
2	ALBEMARLE ST	BIKE LANE	.
3	ALBEMARLE ST	CONTRAFLOW	.
4	ALBEMARLE ST	SHARED BUS BIKE	.
5	ALBEMARLE ST	SHARROW	109.914
6	ALBEMARLE ST	SIDEPATH	.
7	ALBEMARLE ST	SIGNED ROUTE	234.919
8	ALBEMARLE ST	NA	.
9	ALICEANNA ST	BIKE BOULEVARD	.
10	ALICEANNA ST	BIKE LANE	.

Read in the roads and crashes .csv files and call them road and crash.

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/roads.csv"
            out = road dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/crashes.csv"
            out = crash dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC PRINT data = road;
RUN;

PROC PRINT data = crash(OBS=10);
RUN;

SAS Output


Obs	Road	District	Length
1	Interstate 65	Greenfield	262
2	Interstate 70	Vincennes	156
3	US-36	Crawfordsville	139
4	US-40	Greenfield	150
5	US-52	Crawfordsville	172


Obs	Year	Road	N_Crashes	Volume
1	1991	Interstate 65	25	40000
2	1992	Interstate 65	37	41000
3	1993	Interstate 65	45	45000
4	1994	Interstate 65	46	45600
5	1995	Interstate 65	46	49000
6	1996	Interstate 65	59	51000
7	1997	Interstate 65	76	52000
8	1998	Interstate 65	90	58000
9	1999	Interstate 65	95	65000
10	2000	Interstate 65	95	74000

Replace (using tranwrd) any hyphens (-) with a space in Road variable of crash. Call this data crash2. Table the Road variable with PROC FREQ.

DATA crash2;
   SET crash;
   road = tranwrd(road, '-', ' ');
RUN;

PROC FREQ data = crash2;
   TABLE road;
RUN;

SAS Output

The FREQ Procedure


Road	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Interstate 275	22	20.00	22	20.00
Interstate 65	22	20.00	44	40.00
Interstate 70	22	20.00	66	60.00
US 36	22	20.00	88	80.00
US 40	22	20.00	110	100.00

How many observations are in each of the crash and road datasets?

PROC CONTENTS data = crash;
RUN;

PROC CONTENTS data = road;
RUN;

SAS Output

The CONTENTS Procedure


Data Set Name	WORK.CRASH	Observations	110
Member Type	DATA	Variables	4
Engine	V9	Indexes	0
Created	10/18/2020 17:00:48	Observation Length	40
Last Modified	10/18/2020 17:00:48	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	1632
Obs in First Data Page	110
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/crash.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671741
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	128KB
File Size (bytes)	131072


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
3	N_Crashes	Num	8	BEST12.	BEST32.
2	Road	Char	14	$14.	$14.
4	Volume	Num	8	BEST12.	BEST32.
1	Year	Num	8	BEST12.	BEST32.

The CONTENTS Procedure


Data Set Name	WORK.ROAD	Observations	5
Member Type	DATA	Variables	3
Engine	V9	Indexes	0
Created	10/18/2020 17:00:48	Observation Length	40
Last Modified	10/18/2020 17:00:48	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	1632
Obs in First Data Page	5
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/road.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671709
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	128KB
File Size (bytes)	131072


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
2	District	Char	14	$14.	$14.
3	Length	Num	8	BEST12.	BEST32.
1	Road	Char	13	$13.	$13.

The crash dataset has 110 observations and the road datset has 5.

Separate the Road column (using scan) into (type and number) in crash2. Reassign this to crash2. Table type from crash2 using PROC FREQ. Then create a new variable calling it road_hyphen using one of the concatenate functions (such as CAT). Unite the type and number columns using a hyphen (-) and then table road_hyphen using PROC FREQ.

DATA crash2;
   SET crash2;
   type = scan(road, 1);
   number = scan(road, 2);
RUN;

PROC FREQ data=crash2;
   TABLE type;
RUN;

DATA crash2;
   SET crash2;
   road_hyphen = CATX('-', type, number);
RUN;

PROC FREQ data=crash2;
   TABLE road_hyphen;
RUN;

SAS Output

The FREQ Procedure


type	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Interstate	66	60.00	66	60.00
US	44	40.00	110	100.00

The FREQ Procedure


road_hyphen	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Interstate-275	22	20.00	22	20.00
Interstate-65	22	20.00	44	40.00
Interstate-70	22	20.00	66	60.00
US-36	22	20.00	88	80.00
US-40	22	20.00	110	100.00

Which and how many years were data collected in the crash dataset?

PROC FREQ data=crash;
   TABLE year;
RUN;

SAS Output

The FREQ Procedure


Year	Frequency	Percent	Cumulative Frequency	Cumulative Percent
1991	5	4.55	5	4.55
1992	5	4.55	10	9.09
1993	5	4.55	15	13.64
1994	5	4.55	20	18.18
1995	5	4.55	25	22.73
1996	5	4.55	30	27.27
1997	5	4.55	35	31.82
1998	5	4.55	40	36.36
1999	5	4.55	45	40.91
2000	5	4.55	50	45.45
2001	5	4.55	55	50.00
2002	5	4.55	60	54.55
2003	5	4.55	65	59.09
2004	5	4.55	70	63.64
2005	5	4.55	75	68.18
2006	5	4.55	80	72.73
2007	5	4.55	85	77.27
2008	5	4.55	90	81.82
2009	5	4.55	95	86.36
2010	5	4.55	100	90.91
2011	5	4.55	105	95.45
2012	5	4.55	110	100.00

The crash dataset contains data on the years from 1991 to 2012 for a total of 22 years.

Read in the dataset Bike_Lanes.csv and call it bike.

PROC IMPORT datafile="/folders/myfolders/SAS_Notes/data/Bike_Lanes.csv"
            out = bike dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

PROC PRINT data=bike(obs=5);
RUN;

SAS Output

The SAS System


Obs	subType	name	block	type	numLanes	project	route	length	dateInstalled
1				BIKE BOULEVARD	1	GUILFORD AVE BIKE BLVD		435.73787702	0
2				SIDEPATH	1		NORTHERN	1024.6746249	2010
3				SIGNED ROUTE	1	SOUTHEAST BIKE NETWORK		3749.3226377	2010
4		HUNTINGDON PATH		SIDEPATH	1			0	0
5	STCLN	EDMONDSON AVE	5300 BLK EDMONDSON AVE	BIKE LANE	1	OPERATION ORANGE CONE		180.92546489	2011

Keep rows where the record is not missing type and not missing name and re-assign the output to bike.

DATA bike;
   SET bike;
   WHERE NOT MISSING(type);
RUN;

PROC PRINT data=bike(obs=5);
RUN;

SAS Output

The SAS System


Obs	subType	name	block	type	numLanes	project	route	length	dateInstalled
1				BIKE BOULEVARD	1	GUILFORD AVE BIKE BLVD		435.73787702	0
2				SIDEPATH	1		NORTHERN	1024.6746249	2010
3				SIGNED ROUTE	1	SOUTHEAST BIKE NETWORK		3749.3226377	2010
4		HUNTINGDON PATH		SIDEPATH	1			0	0
5	STCLN	EDMONDSON AVE	5300 BLK EDMONDSON AVE	BIKE LANE	1	OPERATION ORANGE CONE		180.92546489	2011

Using PROC MEANS with a BY statement grouping name and type (i.e for each type within each name), find the sum of the length. Use an OUTPUT statement to get this summary dataset and only keep the name, type and sum of the length column (after renaming this column length). Call this data set sub.

PROC SORT data = bike;
  BY name type;
RUN;

PROC MEANS data = bike noprint;
  BY name type;
  VAR length;
  OUTPUT out=sub sum=length;
RUN;

DATA sub;
   SET sub;
   keep name type length;
RUN;

PROC PRINT data=sub(obs=8);
RUN;

SAS Output

The SAS System


Obs	name	type	length
1		BIKE BOULEVARD	435.73787702
2		BIKE LANE	695.69831669
3		SHARROW	186.96669187
4		SIDEPATH	1024.6746249
5		SIGNED ROUTE	6527.883024
6	ALBEMARLE ST	SHARROW	439.65669098
7	ALBEMARLE ST	SIGNED ROUTE	1879.3520247
8	ALICEANNA ST	SHARROW	4763.495767

PROC FREQ data=sub;
   TABLE type;
RUN;

SAS Output

The SAS System

The FREQ Procedure


type	Frequency	Percent	Cumulative Frequency	Cumulative Percent
BIKE BOULEVARD	3	1.80	3	1.80
BIKE LANE	54	32.34	57	34.13
CONTRAFLOW	4	2.40	61	36.53
SHARED BUS BIKE	4	2.40	65	38.92
SHARROW	50	29.94	115	68.86
SIDEPATH	5	2.99	120	71.86
SIGNED ROUTE	47	28.14	167	100.00

Reshape sub from long to wide by taking the type to be the new columns and length to be the value in those columns. (NOTE: the names have spaces in them. Do we need to replace the spaces with a character before changing them to column names?)

DATA new_wide;
   SET sub;
   BY name;
   RETAIN BIKE_BOULEVARD BIKE_LANE CONTRAFLOW SHARED_BUS_BIKE 
          SHARROW SIDEPATH SIGNED_ROUTE;
   array atype (*) BIKE_BOULEVARD BIKE_LANE CONTRAFLOW SHARED_BUS_BIKE 
                   SHARROW SIDEPATH SIGNED_ROUTE;
                   
   IF first.name THEN call missing(of atype(*));
   
   IF type = 'BIKE BOULEVARD' THEN atype(1) = length;
   ELSE IF type = 'BIKE LANE' THEN atype(2) = length;
   ELSE IF type = 'CONTRAFLOW' THEN atype(3) = length;
   ELSE IF type = 'SHARED BUS BIKE' THEN atype(4) = length;
   ELSE IF type = 'SHARROW' THEN atype(5) = length;
   ELSE IF type = 'SIDEPATH' THEN atype(6) = length;
   ELSE IF type = 'SIGNED ROUTE' THEN atype(7) = length;
   
   IF last.name THEN OUTPUT;
   DROP type length;
RUN;

PROC PRINT data=new_wide(OBS=5);
   TITLE 'Wide Dataset Using DATA Step';
RUN;

PROC TRANSPOSE data = sub out = new_wide2(drop=_NAME_);
   BY name;
   VAR length;
   ID type;
RUN;

PROC PRINT data = new_wide2(OBS=5);
   TITLE 'Wide Dataset Using PROC TRANSPOSE';
RUN;

TITLE ;

SAS Output

Wide Dataset Using DATA Step


Obs	name	BIKE_BOULEVARD	BIKE_LANE	CONTRAFLOW	SHARED_BUS_BIKE	SHARROW	SIDEPATH	SIGNED_ROUTE
1		435.738	695.70	.	.	186.97	1024.67	6527.88
2	ALBEMARLE ST	.	.	.	.	439.66	.	1879.35
3	ALICEANNA ST	.	.	.	.	4763.50	.	.
4	ARGONNE DR	.	1312.61	.	.	.	.	.
5	ART MUSEUM DR	.	.	.	.	.	.	832.97

Wide Dataset Using PROC TRANSPOSE


Obs	name	BIKE BOULEVARD	BIKE LANE	SHARROW	SIDEPATH	SIGNED ROUTE	CONTRAFLOW	SHARED BUS BIKE
1		435.73787702	695.69831669	186.96669187	1024.6746249	6527.883024	.	.
2	ALBEMARLE ST	.	.	439.65669098	.	1879.3520247	.	.
3	ALICEANNA ST	.	.	4763.495767	.	.	.	.
4	ARGONNE DR	.	1312.6083729	.	.	.	.	.
5	ART MUSEUM DR	.	.	.	.	832.97396136	.	.

Join data in the crash and road datasets to retain only complete data, (using an inner join on road) Merge by the variable Road. Call the output merged. How many observations are there?

PROC SORT data = road;
  BY road;
RUN;

PROC SORT data = crash;
  BY road;
RUN;

DATA merged;
  MERGE road (IN = in1) 
        crash (IN = in2);
  BY road;
  IF in1 = 1 and in2 = 1;
RUN;

PROC CONTENTS data = merged;
RUN;

SAS Output

The CONTENTS Procedure


Data Set Name	WORK.MERGED	Observations	88
Member Type	DATA	Variables	6
Engine	V9	Indexes	0
Created	10/18/2020 20:34:03	Observation Length	64
Last Modified	10/18/2020 20:34:03	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	1021
Obs in First Data Page	88
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/merged.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671747
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	128KB
File Size (bytes)	131072


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
2	District	Char	14	$14.	$14.
3	Length	Num	8	BEST12.	BEST32.
5	N_Crashes	Num	8	BEST12.	BEST32.
1	Road	Char	13	$13.	$13.
6	Volume	Num	8	BEST12.	BEST32.
4	Year	Num	8	BEST12.	BEST32.

There are 88 rows.

Join data using a full_join. Call the output full. How many observations are there?

DATA full_join;
  MERGE road crash;
  BY road;
RUN;

PROC PRINT data = full_join(OBS=10);
RUN;

SAS Output


Obs	Road	Length	Year	N_Crashes	Volume
1	Interstate 27	.	1991	27	20350
2	Interstate 27	.	1992	26	21200
3	Interstate 27	.	1993	22	23200
4	Interstate 27	.	1994	21	21200
5	Interstate 27	.	1995	28	23200
6	Interstate 27	.	1996	22	20000
7	Interstate 27	.	1997	27	18000
8	Interstate 27	.	1998	21	19500
9	Interstate 27	.	1999	22	21000
10	Interstate 27	.	2000	29	20700

Do a left join of the road and crash. ORDER matters here! How many observations are there?

DATA left_join;
   MERGE road (in = in1) crash;
   BY road;
   IF in1 = 1;
RUN;

PROC CONTENTS data = left_join;
RUN;

SAS Output

The CONTENTS Procedure


Data Set Name	WORK.LEFT_JOIN	Observations	89
Member Type	DATA	Variables	6
Engine	V9	Indexes	0
Created	10/18/2020 20:38:16	Observation Length	64
Last Modified	10/18/2020 20:38:16	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	1021
Obs in First Data Page	89
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/left_join.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671749
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	128KB
File Size (bytes)	131072


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
2	District	Char	14	$14.	$14.
3	Length	Num	8	BEST12.	BEST32.
5	N_Crashes	Num	8	BEST12.	BEST32.
1	Road	Char	13	$13.	$13.
6	Volume	Num	8	BEST12.	BEST32.
4	Year	Num	8	BEST12.	BEST32.

There are 89 rows.

Repeat above with a right_join with the same order of the arguments. How many observations are there?

DATA right_join;
   MERGE road crash (in = in1);
   BY road;
   IF in1 = 1;
RUN;

PROC CONTENTS data = right_join;
RUN;

SAS Output

The CONTENTS Procedure


Data Set Name	WORK.RIGHT_JOIN	Observations	110
Member Type	DATA	Variables	6
Engine	V9	Indexes	0
Created	10/18/2020 20:39:48	Observation Length	64
Last Modified	10/18/2020 20:39:48	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	1021
Obs in First Data Page	110
Number of Data Set Repairs	0
Filename	/tmp/SAS_workA5B9000010B1_localhost.localdomain/right_join.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	671750
Access Permission	rw-r--r--
Owner Name	sasdemo
File Size	128KB
File Size (bytes)	131072


Alphabetic List of Variables and Attributes
#	Variable	Type	Len	Format	Informat
2	District	Char	14	$14.	$14.
3	Length	Num	8	BEST12.	BEST32.
5	N_Crashes	Num	8	BEST12.	BEST32.
1	Road	Char	13	$13.	$13.
6	Volume	Num	8	BEST12.	BEST32.
4	Year	Num	8	BEST12.	BEST32.

There are 110 observations in the right joined dataset.

17.11. The Output Delivery System and Graphics¶

PROC IMPORT datafile = "/folders/myfolders/SAS_Notes/data/Charm_City_Circulator_Ridership.csv"
            out = circ dbms = csv replace;
    getnames = yes;
    guessingrows = max;
RUN;

DATA long;
   SET circ;
   ARRAY larray(*) orangeBoardings -- bannerAverage;
   DO i = 1 TO dim(larray);
      var = vname(larray(i));
      number = larray(i);
      var = tranwrd(var, 'Board', ' Board');
      var = tranwrd(var, 'Alight', ' Alight');
      var = tranwrd(var, 'Average', ' Average');
      route = scan(var, 1);
      type = scan(var, 2);
      OUTPUT;
    END;
    
    DROP i var orangeBoardings -- bannerAverage;
RUN;

DATA avg;
  SET long;
  WHERE type = 'Average' and number ne .;
RUN;

PROC PRINT data = long(obs = 10);
RUN;

PROC PRINT data = avg(obs=10);
RUN;

SAS Output


Obs	day	date	daily	number	route	type
1	Monday	01/11/2010	952	877	orange	Boardings
2	Monday	01/11/2010	952	1027	orange	Alightings
3	Monday	01/11/2010	952	952	orange	Average
4	Monday	01/11/2010	952	.	purple	Boardings
5	Monday	01/11/2010	952	.	purple	Alightings
6	Monday	01/11/2010	952	.	purple	Average
7	Monday	01/11/2010	952	.	green	Boardings
8	Monday	01/11/2010	952	.	green	Alightings
9	Monday	01/11/2010	952	.	green	Average
10	Monday	01/11/2010	952	.	banner	Boardings


Obs	day	date	daily	number	route	type
1	Monday	01/11/2010	952	952.0	orange	Average
2	Tuesday	01/12/2010	796	796.0	orange	Average
3	Wednesday	01/13/2010	1211.5	1211.5	orange	Average
4	Thursday	01/14/2010	1213.5	1213.5	orange	Average
5	Friday	01/15/2010	1644	1644.0	orange	Average
6	Saturday	01/16/2010	1490.5	1490.5	orange	Average
7	Sunday	01/17/2010	888.5	888.5	orange	Average
8	Monday	01/18/2010	999.5	999.5	orange	Average
9	Tuesday	01/19/2010	1035	1035.0	orange	Average
10	Wednesday	01/20/2010	1395.5	1395.5	orange	Average

Plot average ridership (avg data set) by date using a scatterplot. a. Color the points by route (orange, purple, green, banner) b. Add black smoothed curves for each route c. Color the points by day of the week

PROC SGPLOT data = avg;
   TITLE '1a';
   SCATTER Y = number X = date / group = route;
RUN;

PROC SGPLOT data = avg;
   TITLE '1b';
   SCATTER Y = number X = date / group = route;
   LOESS Y = number X = date / group = route lineattrs=(color = black);
RUN;

PROC SGPLOT data = avg;
   TITLE '1c';
   SCATTER Y = number X = date / group = day;
RUN;

SAS Output

Replot 1a where the colors of the points are the name of the route (with banner –> blue). Note: you will need to make a data attribute map.

DATA marker_map;
  INPUT id $2. +1 VALUE $6. +1 MARKERCOLOR $6.;
DATALINES;
mc banner blue
mc green  green
mc orange orange
mc purple purple
;
RUN;

PROC SGPLOT data = avg dattrmap = marker_map;
   TITLE '2';
   SCATTER Y = number X = date / group = route attrid = mc;
RUN;

SAS Output

Plot a scatterplot of average ridership by date with one panel per route.

PROC SGPANEL data = avg;
   TITLE '3';
   PANELBY route;
   SCATTER Y = number X = date;
RUN;

SAS Output

Plot a scatterplpot of average ridership by date with separate panels by day of the week, colored by route.

*Default colors are fine here to, but we can also use a data attribute map like we did before;
PROC SGPANEL data = avg dattrmap = marker_map;
   TITLE '4';
   PANELBY day;
   SCATTER Y = number X = date / group = route attrid = mc;
RUN;

SAS Output

Plot a scatterplot of average ridership (avg) by date, colored by route (same as 1a). (do not take an average, use the average column for each route). Make the x-label “Year”. Make the y-label “Number of People”.

PROC SGPLOT data = avg dattrmap = marker_map;
   TITLE '5';
   SCATTER Y = number X = date / group = route attrid = mc;
   XAXIS LABEL = "Year";
   YAXIS LABEL = "Number of People";
RUN;

SAS Output

Plot average ridership on the orange route versus date as a solid line, and add dashed “error” lines based on the boardings and alightings. The line colors should be orange.

DATA orange;
  SET long;
  WHERE route = "orange";
RUN;

DATA line_map;
  INPUT id $2. +1 VALUE $10. +1 linepattern $5.;
DATALINES;
lp Alightings dash
lp Average    solid
lp Boardings  dash
;
RUN;

PROC SGPLOT data = orange dattrmap = line_map;
  TITLE '6';
  SERIES Y = number X = date / group = type attrid = lp lineattrs=(color = orange);
RUN;

TITLE;

SAS Output

17.12. Statistical Analysis in SAS¶

FILENAME cardata '/folders/myfolders/SAS_Notes/data/kaggleCarAuction.csv';

PROC IMPORT datafile = cardata out = cars dbms = CSV replace;
   getnames = yes;
   guessingrows = 1000;
RUN;

FILENAME mortdat '/folders/myfolders/SAS_Notes/data/indicatordeadkids35.csv';

PROC IMPORT datafile = mortdat out = mort dbms = CSV replace;
   getnames = yes;
   guessingrows = 500;
RUN;

Compute the correlation between the 1980, 1990, 2000, and 2010 mortality data. Just display the result to the screen. Then compute using the NOMMISS option. (Note: The column names are numbers, which are invalid standard SAS names, so to refer to the variable 1980 in your code use ‘1980’n.)

/*
PROC CORR data = mort;
   VAR '1980'n '1990'n '2000'n '2010'n;
RUN;

PROC CORR data = mort nomiss;
   VAR '1980'n '1990'n '2000'n '2010'n;
RUN;
*/

PROC CORR data = mort;
   VAR _1980 _1990 _2000 _2010;
RUN;

PROC CORR data = mort nomiss;
   VAR _1980 _1990 _2000 _2010;
RUN;

SAS Output

The CORR Procedure


4 Variables:	1980 1990 2000 2010


Simple Statistics
Variable	N	Mean	Std Dev	Sum	Minimum	Maximum
1980	197	0.94731	0.88694	186.62013	0.07049	3.49035
1990	197	0.70911	0.77929	139.69428	0.04922	3.28983
2000	197	0.55004	0.68922	108.35786	0.03232	2.81186
2010	193	0.39601	0.50021	76.42946	0.03173	2.02857


Pearson Correlation Coefficients Prob > \|r\| under H0: Rho=0 Number of Observations
	1980	1990	2000	2010
1980	1.00000 197	0.96015 <.0001 197	0.88884 <.0001 197	0.84683 <.0001 193
1990	0.96015 <.0001 197	1.00000 197	0.96138 <.0001 197	0.92472 <.0001 193
2000	0.88884 <.0001 197	0.96138 <.0001 197	1.00000 197	0.98623 <.0001 193
2010	0.84683 <.0001 193	0.92472 <.0001 193	0.98623 <.0001 193	1.00000 193

The CORR Procedure


4 Variables:	1980 1990 2000 2010


Simple Statistics
Variable	N	Mean	Std Dev	Sum	Minimum	Maximum
1980	193	0.96250	0.88966	185.76175	0.07049	3.49035
1990	193	0.72132	0.78263	139.21550	0.05512	3.28983
2000	193	0.55976	0.69297	108.03459	0.03271	2.81186
2010	193	0.39601	0.50021	76.42946	0.03173	2.02857


Pearson Correlation Coefficients, N = 193 Prob > \|r\| under H0: Rho=0
	1980	1990	2000	2010
1980	1.00000	0.95968 <.0001	0.88774 <.0001	0.84683 <.0001
1990	0.95968 <.0001	1.00000	0.96103 <.0001	0.92472 <.0001
2000	0.88774 <.0001	0.96103 <.0001	1.00000	0.98623 <.0001
2010	0.84683 <.0001	0.92472 <.0001	0.98623 <.0001	1.00000

a. Compute the correlation between the Myanmar, China, and United States mortality data. Store this correlation matrix in an object called country_cor using ODS OUTPUT. b. Extract the Myanmar-US correlation from the correlation matrix.

DATA mort_sub;
   SET mort(RENAME=(Var1 = Country));
   WHERE country in ('Myanmar' 'China' 'United States');
RUN;

/*
PROC TRANSPOSE data = mort_sub out = mort_sub_transposed;
   VAR '1760'n -- '2099'n;
   ID country;
RUN;
*/

PROC TRANSPOSE data = mort_sub out = mort_sub_transposed;
   VAR _1760 -- _2099;
   ID country;
RUN;

/*
PROC CORR DATA = mort_sub_transposed;
   VAR China Myanmar 'United States'n;
   ODS OUTPUT PearsonCorr = country_cor;
RUN;
*/

*ODS TRACE ON;
PROC CORR DATA = mort_sub_transposed;
   VAR China Myanmar United_States;
   ODS OUTPUT PearsonCorr = country_cor;
RUN;
*ODS TRACE OFF;

/*
PROC PRINT data = country_cor;
   Var Myanmar;
   WHERE Variable = 'United States';
RUN;
*/

PROC PRINT data = country_cor;
   Var Myanmar;
   WHERE Variable = 'United_States';
RUN;   

SAS Output

The CORR Procedure


3 Variables:	China Myanmar United States


Simple Statistics
Variable	N	Mean	Std Dev	Sum	Minimum	Maximum
China	214	2.37608	1.15555	508.48153	0.04325	3.61453
Myanmar	214	2.82215	1.12767	603.94031	0.05150	3.76418
United States	214	1.36080	1.14407	291.21132	0.02867	3.17018


Pearson Correlation Coefficients, N = 214 Prob > \|r\| under H0: Rho=0
	China	Myanmar	United States
China	1.00000	0.97436 <.0001	0.69093 <.0001
Myanmar	0.97436 <.0001	1.00000	0.62926 <.0001
United States	0.69093 <.0001	0.62926 <.0001	1.00000


Obs	Myanmar
3	0.62926

Is there a difference between mortality information from 1990 and 2000? Run a paired t-test and a Wilcoxon signed rank test to assess this. Hint: to extract the column of information for 1990, use ‘1990’n.

/*
PROC TTEST data = mort;
   PAIRED '1990'n*'2000'n;
RUN;

DATA diff;
   SET mort;
   diff = '2000'n - '1990'n;
RUN;
*/

PROC TTEST data = mort;
   PAIRED _1990*_2000;
RUN;

DATA diff;
   SET mort;
   diff = _2000 - _1990;
RUN;

PROC UNIVARIATE data = diff;
  VAR diff;
RUN;

SAS Output

The TTEST Procedure

Difference: 1990 - 2000


N	Mean	Std Dev	Std Err	Minimum	Maximum
197	0.1591	0.2227	0.0159	-0.5456	0.8726


Mean	95% CL Mean		Std Dev	95% CL Std Dev
0.1591	0.1278	0.1904	0.2227	0.2027	0.2472


DF	t Value	Pr > \|t\|
196	10.03	<.0001

Summary Panel for Difference of 1990 and 2000

Q-Q Plot for Difference of 1990 and 2000

The UNIVARIATE Procedure

Variable: diff


Moments
N	197	Sum Weights	197
Mean	-0.1590681	Sum Observations	-31.336415
Std Deviation	0.22269839	Variance	0.04959457
Skewness	-0.9205453	Kurtosis	2.01791651
Uncorrected SS	14.7051599	Corrected SS	9.72053615
Coeff Variation	-140.00192	Std Error Mean	0.0158666


Basic Statistical Measures
Location		Variability
Mean	-0.15907	Std Deviation	0.22270
Median	-0.09007	Variance	0.04959
Mode	.	Range	1.41823
		Interquartile Range	0.21805


Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	-10.0253	Pr > \|t\|	<.0001
Sign	M	-84.5	Pr >= \|M\|	<.0001
Signed Rank	S	-8245.5	Pr >= \|S\|	<.0001


Quantiles (Definition 5)
Level	Quantile
100% Max	0.5456100
99%	0.3990751
95%	0.0830986
90%	-0.0104786
75% Q3	-0.0286619
50% Median	-0.0900725
25% Q1	-0.2467130
10%	-0.4798594
5%	-0.6188205
1%	-0.8710244
0% Min	-0.8726242


Extreme Observations
Lowest		Highest
Value	Obs	Value	Obs
-0.872624	19	0.272621	32
-0.871024	131	0.321500	196
-0.808735	96	0.329956	22
-0.804626	54	0.399075	99
-0.794268	106	0.545610	197

Using the cars dataset, fit a linear regression model with vehicle cost (VehBCost) as the outcome and vehicle age (VehicleAge) and whether it’s an online sale (IsOnlineSale) as predictors as well as their interaction.

PROC GLM data = cars;
   MODEL VehBCost = VehicleAge IsOnlineSale VehicleAge*IsOnlineSale;
RUN;

SAS Output

The GLM Procedure


Number of Observations Read	72983
Number of Observations Used	72983

The GLM Procedure

Dependent Variable: VehBCost


Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	3	22422071293	7474023764.4	2652.08	<.0001
Error	72979	205667188003	2818169.4461
Corrected Total	72982	228089259296


R-Square	Coeff Var	Root MSE	VehBCost Mean
0.098304	24.94067	1678.740	6730.934


Source	DF	Type I SS	Mean Square	F Value	Pr > F
VehicleAge	1	22249980049	22249980049	7895.19	<.0001
IsOnlineSale	1	158865621	158865621	56.37	<.0001
VehicleAg*IsOnlineSa	1	13225623	13225623	4.69	0.0303


Source	DF	Type III SS	Mean Square	F Value	Pr > F
VehicleAge	1	21533779839	21533779839	7641.05	<.0001
IsOnlineSale	1	64252535	64252535	22.80	<.0001
VehicleAg*IsOnlineSa	1	13225623	13225623	4.69	0.0303


Parameter	Estimate	Standard Error	t Value	Pr > \|t\|
Intercept	8062.701999	16.5870226	486.08	<.0001
VehicleAge	-320.662108	3.6683525	-87.41	<.0001
IsOnlineSale	514.308268	107.7114631	4.77	<.0001
VehicleAg*IsOnlineSa	-55.372926	25.5607123	-2.17	0.0303

Create a variable called expensive in the cars data that indicates if the vehicle cost is over $10,000. Use a chi-squared test to assess if there is a relationship between a car being expensive and it being labeled as a “bad buy” (IsBadBuy).

DATA cars;
  SET cars;
  IF VehBCost > 10000 THEN expensive = 1;
  ELSE expensive = 0;
RUN;

PROC FREQ data = cars;
   TABLES expensive*IsBadBuy / chisq;
RUN;

SAS Output

The FREQ Procedure

Frequency Percent Row Pct Col Pct


Table of expensive by IsBadBuy
expensive	IsBadBuy
expensive	0	1	Total
0	61505 84.27 87.67 96.09	8652 11.85 12.33 96.39	70157 96.13
1	2502 3.43 88.54 3.91	324 0.44 11.46 3.61	2826 3.87
Total	64007 87.70	8976 12.30	72983 100.00

Statistics for Table of expensive by IsBadBuy


Statistic	DF	Value	Prob
Chi-Square	1	1.8948	0.1687
Likelihood Ratio Chi-Square	1	1.9314	0.1646
Continuity Adj. Chi-Square	1	1.8152	0.1779
Mantel-Haenszel Chi-Square	1	1.8948	0.1687
Phi Coefficient		-0.0051
Contingency Coefficient		0.0051
Cramer's V		-0.0051


Fisher's Exact Test
Cell (1,1) Frequency (F)	61505
Left-sided Pr <= F	0.0880
Right-sided Pr >= F	0.9211

Table Probability (P)	0.0091
Two-sided Pr <= P	0.1790

Sample Size = 72983

Fit a logistic regression model where the outcome is “bad buy” status and predictors are the expensive status and vehicle age (VehicleAge). Request confidence intervals for the odds ratios.

PROC LOGISTIC data = cars;
   MODEL IsBadBuy(event='1') = expensive VehicleAge / CLODDS = WALD;
RUN;

SAS Output

The LOGISTIC Procedure


Model Information
Data Set	WORK.CARS
Response Variable	IsBadBuy
Number of Response Levels	2
Model	binary logit
Optimization Technique	Fisher's scoring


Number of Observations Read	72983
Number of Observations Used	72983


Response Profile
Ordered Value	IsBadBuy	Total Frequency
1	0	64007
2	1	8976

Probability modeled is IsBadBuy='1'.


Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.


Model Fit Statistics
Criterion	Intercept Only	Intercept and Covariates
AIC	54423.307	52447.788
SC	54432.505	52475.381
-2 Log L	54421.307	52441.788


Testing Global Null Hypothesis: BETA=0
Test	Chi-Square	DF	Pr > ChiSq
Likelihood Ratio	1979.5198	2	<.0001
Score	2042.5760	2	<.0001
Wald	1961.8662	2	<.0001


Analysis of Maximum Likelihood Estimates
Parameter	DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept	1	-3.2495	0.0332	9554.1782	<.0001
expensive	1	-0.0804	0.0607	1.7542	0.1853
VehicleAge	1	0.2866	0.00648	1958.7086	<.0001


Association of Predicted Probabilities and Observed Responses
Percent Concordant	56.9	Somers' D	0.279
Percent Discordant	29.1	Gamma	0.324
Percent Tied	14.0	Tau-a	0.060
Pairs	574526832	c	0.639


Odds Ratio Estimates and Wald Confidence Intervals
Effect	Unit	Estimate	95% Confidence Limits
expensive	1.0000	0.923	0.819	1.039
VehicleAge	1.0000	1.332	1.315	1.349

Plot of Odds Ratios with 95% Wald Confidence Limits

17.13. Macros¶

Write a macro that discretizes a quantitative variable into four categories based on quantiles. That is:
- if X < Q1 then group = 1
- if Q1 < X < M then group = 2
- if M < X < Q3 then group = 3
- if Q3 < X then group = 4

The macro should have the following defnition - %quartilesmacro(mydata, qvar, round, out);

mydata: dataset containing the quantitative variable
qvar: name of quantitative variable
round: integer representing number of decimal places to round to
out: name of output dataset which contains the categorized variable

We will explore this by using the Framingham dataset as an example to apply the macro to.

LIBNAME phc6089 "/folders/myfolders/SAS_Notes/data/";

*Create a temporary data set, so that we don't save changes to the original data set.;
DATA fghmTemp;
   SET phc6089.fghm113;
RUN;

/*Now let's code some variables with some more descriptive values.
  SEX (Gender): 1=Men 
				2=Women
  Period (Examination cycle): 1=Period1 
							  2=Period2
							  3=Period3
  BPMEDS (Use of anti-hypertensive meds): 0=Not currently
										  1=Currently use
  CURSMOKE (Currently smoke?): 0=No
							   1=Yes
  DIABETES: 0=Not diabetic
			1=Diabetic
  PREVAP (Have angina pectoric?): 0=No
								  1=Yes
  PREVCHD (Coronary heart disease?): 0=No
									 1=Yes
  PREVMI (Myocardial infarction?): 0=No
								   1=Yes
  PREVSTRK (Had a stroke?): 0=No
				   			1=Yes
  PREVHYP (Hypertensive? sys bp >=140 or dyas bp >= 90): 0=no
														 1=yes
*/

PROC FORMAT;
VALUE YNfmt 0="No"
			1="Yes";
VALUE perfmt 1="Period 1"
			 2="Period 2"
			 3="Period 3";
VALUE gndrfmt 1="Men"
			  2="Women";
RUN;

DATA fghmtemp;
SET fghmtemp;
FORMAT prevap ynfmt.
	   diabetes ynfmt.
	   cursmoke ynfmt.
	   bpmeds ynfmt.
	   prevchd ynfmt.
	   prevmi ynfmt.
	   prevstrk ynfmt.
	   prevhyp ynfmt.
	   sex gndrfmt.;
RUN;

5880  ods listing close;ods html5 (id=saspy_internal) file=stdout options(bitmap_mode='inline') device=svg style=HTMLBlue; ods
5880! graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: STDOUT
5881  
5882  LIBNAME phc6089 "/folders/myfolders/SAS_Notes/data/";
NOTE: Libref PHC6089 was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: /folders/myfolders/SAS_Notes/data
5883  
5884  *Create a temporary data set, so that we don't save changes to the original data set.;
5885  DATA fghmTemp;
NOTE: Data file PHC6089.FGHM113.DATA is in a format that is native to another host, or the file encoding does not match the session 
      encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce 
      performance.
5886     SET phc6089.fghm113;
5887  RUN;
NOTE: There were 500 observations read from the data set PHC6089.FGHM113.
NOTE: The data set WORK.FGHMTEMP has 500 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds
      
5888  
5889  /*Now let's code some variables with some more descriptive values.
5890    SEX (Gender): 1=Men
5891  				2=Women
5892    Period (Examination cycle): 1=Period1
5893  							  2=Period2
5894  							  3=Period3
5895    BPMEDS (Use of anti-hypertensive meds): 0=Not currently
5896  										  1=Currently use
5897    CURSMOKE (Currently smoke?): 0=No
5898  							   1=Yes
5899    DIABETES: 0=Not diabetic
5900  			1=Diabetic
5901    PREVAP (Have angina pectoric?): 0=No
5902  								  1=Yes
5903    PREVCHD (Coronary heart disease?): 0=No
5904  									 1=Yes
5905    PREVMI (Myocardial infarction?): 0=No
5906  								   1=Yes
5907    PREVSTRK (Had a stroke?): 0=No
5908  				   			1=Yes
5909    PREVHYP (Hypertensive? sys bp >=140 or dyas bp >= 90): 0=no
5910  														 1=yes
5911  */
5912  
5913  PROC FORMAT;
5914  VALUE YNfmt 0="No"
NOTE: Format YNFMT has been output.
5915  			1="Yes";
5916  VALUE perfmt 1="Period 1"
5917  			 2="Period 2"
NOTE: Format PERFMT has been output.
5918  			 3="Period 3";
5919  VALUE gndrfmt 1="Men"
NOTE: Format GNDRFMT has been output.
5920  			  2="Women";
5921  RUN;
NOTE: PROCEDURE FORMAT used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds
      
5922  
5923  DATA fghmtemp;
5924  SET fghmtemp;
5925  FORMAT prevap ynfmt.
5926  	   diabetes ynfmt.
5927  	   cursmoke ynfmt.
5928  	   bpmeds ynfmt.
5929  	   prevchd ynfmt.
5930  	   prevmi ynfmt.
5931  	   prevstrk ynfmt.
5932  	   prevhyp ynfmt.
5933  	   sex gndrfmt.;
5934  RUN;
NOTE: There were 500 observations read from the data set WORK.FGHMTEMP.
NOTE: The data set WORK.FGHMTEMP has 500 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
5935  
5936  ods html5 (id=saspy_internal) close;ods listing;

5937

* Step 1 - write code with only macros variables - Let's categorize BMI by quartiles;

%let qvar = bmi;

* Find table name;
ODS trace on;
proc means data = fghmtemp Q1 Median Q3;
 var &qvar;
run;
ODS trace off;

proc means data = fghmtemp Q1 Median Q3;
 var &qvar;
 ODS OUTPUT summary = quartiles;
run;

* See the contents of the ODS table;
proc print data = quartiles;
run;

data _null_;
 set quartiles;
 call symput("Q1", strip(&qvar._Q1));
 call symput("M", strip(&qvar._Median));
 call symput("Q3", strip(&qvar._Q3));
run;

%put _user_; * see the value of our macro variables in the log file;

data fghmtemp;
 set fghmtemp;
 if not missing(&qvar) and &qvar < &Q1 then &qvar._cat = 1;
 else if &qvar < &M then &qvar._cat = 2;
 else if &qvar < &Q3 then &qvar._cat = 3;
 else &qvar._cat = 4;
run;

proc freq data = fghmtemp;
 tables &qvar._cat;
run;

SAS Output

The MEANS Procedure


Analysis Variable : BMI Body Mass Index (kr/(M*M)
Lower Quartile	Median	Upper Quartile
23.1000000	25.2800000	27.6300000

The MEANS Procedure


Analysis Variable : BMI Body Mass Index (kr/(M*M)
Lower Quartile	Median	Upper Quartile
23.1000000	25.2800000	27.6300000


Obs	BMI_Q1	BMI_Median	BMI_Q3
1	23.1	25.28	27.63

The FREQ Procedure


bmi_cat	Frequency	Percent	Cumulative Frequency	Cumulative Percent
1	120	24.00	120	24.00
2	130	26.00	250	50.00
3	125	25.00	375	75.00
4	125	25.00	500	100.00

Now that we have the basic code working, let’s turn this into a MACRO.

%macro quartilesmacro(mydata, qvar, round, out = out_dat);

ODS select none;
proc means data = &mydata Q1 Median Q3 maxdec = &round;
 var &qvar;
 ODS OUTPUT summary = quartiles;
run;
ods select all;

data _null_;
 set quartiles;
 call symput("Q1", strip(&qvar._Q1));
 call symput("M", strip(&qvar._Median));
 call symput("Q3", strip(&qvar._Q3));
run;

data &out;
 set &mydata;
 if not missing(&qvar) and &qvar < &Q1 then &qvar._cat = 1;
 else if &qvar < &M then &qvar._cat = 2;
 else if &qvar < &Q3 then &qvar._cat = 3;
 else &qvar._cat = 4;
run;

%mend;

%quartilesmacro(fghmtemp, bmi, round = 2, out = out_bmi);

proc freq data = out_bmi;
 tables bmi_cat;
run;

SAS Output

The FREQ Procedure


bmi_cat	Frequency	Percent	Cumulative Frequency	Cumulative Percent
1	120	24.00	120	24.00
2	130	26.00	250	50.00
3	125	25.00	375	75.00
4	125	25.00	500	100.00

17.14. Simulations¶

In this example, we will explore the sampling distribution of a sample proportion. a) Generate 500 sample of size 50 from a Binomial distribution with n = 1 and p = 0.3. Do this by generating a dataset with 500 rows and 50 columns all filled with random variates from this binomial distribution. Hint: Use an array and a nested DO loop. b) Calculate the mean of each of these 500 rows using mean(of ) in a DATA step. Save this mean as a new column. c) Plot a histogram of the 500 means calculated in part b and use PROC MEANS to calculate the MEAN and standard deviation of the 500 means calculated in part b. This describes the sampling distribution of p-hat when the population distiribution has probability of success p = 0.3 when the sample size is 30.

DATA sim;
   ARRAY data{*} x1-x50;
   DO numsim = 1 TO 500;
      DO n = 1 to 50;
         data(n) = ranbin(1234, 1, 0.3);
      END;
      p_hat = mean(of x1-x50);
      OUTPUT;
   END;
RUN;

PROC SGPLOT data = sim;
   HISTOGRAM p_hat;
RUN;

PROC MEANS data = sim mean std;
   VAR p_hat;
RUN;

SAS Output

The MEANS Procedure


Analysis Variable : p_hat
Mean	Std Dev
0.2980400	0.0635376

16. Introduction to PROC SQL

Intro to SAS Notes

17. Solutions to Exercises¶

17.1. SAS Windowing Environment¶

17.2. Basic SAS Programming¶

17.3. Data I/O¶

17.4. SAS Variables and Assignment Statements¶

17.5. Working with Your Data¶

17.6. Data Summarization¶

17.7. SAS Formats and Dates¶

17.8. Character Functions¶

17.9. Data Cleaning¶

17.10. Data Manipulation¶

17.11. The Output Delivery System and Graphics¶

17.12. Statistical Analysis in SAS¶

17.13. Macros¶

17.14. Simulations¶