SAS: Data Manipulation

In this lecture, we will discuss data management and cleaning in SAS in the following topics

  • SAS Arrays
  • Wide and Long Data Formats
  • PROC TRANSPOSE
  • Combining Dataset
    • Stacking with SET
    • Merging with MERGE
  • Data Cleaing - checking for bad data

SAS Arrays

SAS arrays can be very useful when reading and anlayzing repetitive data. Using arrays allows us to replace long repetitive code with a short section of compact code. To examing using arrays, we will use a fictional diagnosis dataset I created. The dataset contains

  • Patient ID
  • Claim ID
  • Date - date visited doctor
  • DgCg1-17 - Diagnosis Codes.

In this example, the doctor could have diagnosed up to 17 conditions on a given visit.

In [12]:
DATA PatientDiag;
LENGTH DgCd1-DgCd17 $20.;
INFILE "H:\BiostatCourses\PublicHealthComputing\Lectures\Week5DataManagement\SAS\diagnosis.csv" DSD;
INPUT PatientID $ ClaimID ClaimDate :MMDDYY8. DgCd1-DgCd17;
FORMAT Date MMDDYY8.;
RUN;

DATA PatientDiag;
RETAIN PatientID ClaimID ClaimDate DgCd1-DgCd17;
SET PatientDiag;
RUN;

PROC PRINT DATA = PatientDiag;
FORMAT Date MMDDYY8.;
RUN;
Out[12]:
SAS Output

SAS Output

The SAS System

The PRINT Procedure

Data Set WORK.PATIENTDIAG

Obs PatientID ClaimID ClaimDate DgCd1 DgCd2 DgCd3 DgCd4 DgCd5 DgCd6 DgCd7 DgCd8 DgCd9 DgCd10 DgCd11 DgCd12 DgCd13 DgCd14 DgCd15 DgCd16 DgCd17 Date
1 A 111 20270 S102 R008 A304 A100 R205 R007                       .
2 A 112 20272 R008 R007 R202 A100 R005 S102 R003 R004 R001 R002 R205 R009 R006 S101 A503 A304   .
3 A 113 20423 S101 S102 R202 B173 R006 A503 R004 R205 R002 R003               .
4 A 114 20487 R202 R205 R001 R006 R002 R005 B173 R009 A304 S101 A503 S102           .
5 B 115 20006 R008 R005 R205 R202 R003 R006 S102 B173 R002 R001 A100 A503           .
6 B 116 20468 R006 A503 B173 A100 R004 R005                       .
7 C 117 20006 A304 R002 R004 R005                           .
8 C 118 20157 B173 R002 S101 R004 R003 R007 R006                     .
9 C 119 20579 R202 R005 R009 R006                           .

Let's create an indicator variable for whether or not the patient was diagnosed with condition A100 (Bronchitis), A304 (pneumonia), or A503 (asthma) during each visit. Without arrays there will be lots of repeated code.

In [6]:
DATA DiagResp;
SET PatientDiag;
RespInd = 0;
IF DgCd1 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd2 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd3 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd4 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd5 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd6 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd7 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd8 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd9 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd10 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd11 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd12 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd13 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd14 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd15 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd16 IN ('A100','A304','A503') THEN RespInd = 1;
ELSE IF DgCd17 IN ('A100','A304','A503') THEN RespInd = 1;
RUN;

PROC PRINT DATA = DiagResp;
TITLE2 "DiagResp";
RUN;

TITLE2;
Out[6]:
SAS Output

SAS Output

The SAS System

DiagResp

The PRINT Procedure

Data Set WORK.DIAGRESP

Obs PatientID ClaimID ClaimDate DgCd1 DgCd2 DgCd3 DgCd4 DgCd5 DgCd6 DgCd7 DgCd8 DgCd9 DgCd10 DgCd11 DgCd12 DgCd13 DgCd14 DgCd15 DgCd16 DgCd17 RespInd
1 A 111 20270 S102 R008 A304 A100 R205 R007                       1
2 A 112 20272 R008 R007 R202 A100 R005 S102 R003 R004 R001 R002 R205 R009 R006 S101 A503 A304   1
3 A 113 20423 S101 S102 R202 B173 R006 A503 R004 R205 R002 R003               1
4 A 114 20487 R202 R205 R001 R006 R002 R005 B173 R009 A304 S101 A503 S102           1
5 B 115 20006 R008 R005 R205 R202 R003 R006 S102 B173 R002 R001 A100 A503           1
6 B 116 20468 R006 A503 B173 A100 R004 R005                       1
7 C 117 20006 A304 R002 R004 R005                           1
8 C 118 20157 B173 R002 S101 R004 R003 R007 R006                     0
9 C 119 20579 R202 R005 R009 R006                           0

Using arrays, we can simplify this data step.

The syntax of using arrays

ARRAY <array name>{array size} <$> <array var1> <array var2> ... <array var n>

Declare with keyword array followed by the name of the array. In curly braces give the length of the array. This is followed by an optional $ if the variables are character variables. At the end, we give the names of each of the variables. We can list them individually or use SAS variable lists. e.g. Var1-Var5 will give Var1 Var2 Var3 Var4 Var5.

In [7]:
DATA DiagResp2;
SET PatientDiag;
ARRAY AllDiagCodes{17} $ DgCd1-DgCd17;
RespInd = 0;
DO i = 1 TO 17;
	IF AllDiagCodes{i} IN ('A100','A304','A503') THEN RespInd = 1;
END;
DROP i;
RUN;

PROC PRINT DATA = DiagResp2;
TITLE2 "DiagResp2";
RUN;

TITLE2;
Out[7]:
SAS Output

SAS Output

The SAS System

DiagResp2

The PRINT Procedure

Data Set WORK.DIAGRESP2

Obs PatientID ClaimID ClaimDate DgCd1 DgCd2 DgCd3 DgCd4 DgCd5 DgCd6 DgCd7 DgCd8 DgCd9 DgCd10 DgCd11 DgCd12 DgCd13 DgCd14 DgCd15 DgCd16 DgCd17 RespInd
1 A 111 20270 S102 R008 A304 A100 R205 R007                       1
2 A 112 20272 R008 R007 R202 A100 R005 S102 R003 R004 R001 R002 R205 R009 R006 S101 A503 A304   1
3 A 113 20423 S101 S102 R202 B173 R006 A503 R004 R205 R002 R003               1
4 A 114 20487 R202 R205 R001 R006 R002 R005 B173 R009 A304 S101 A503 S102           1
5 B 115 20006 R008 R005 R205 R202 R003 R006 S102 B173 R002 R001 A100 A503           1
6 B 116 20468 R006 A503 B173 A100 R004 R005                       1
7 C 117 20006 A304 R002 R004 R005                           1
8 C 118 20157 B173 R002 S101 R004 R003 R007 R006                     0
9 C 119 20579 R202 R005 R009 R006                           0

We can also have multidimensional arrays in SAS. In the next example, we will use a multidimensional array to check to see if the patient was diagnosed with Resperiatory ('A' codes), Muscular ("R" codes), Infectious ("B" codes), or Circulatory ("S" codes) problems in the years 2014, 2015, or 2016.

In [9]:
DATA DiagAll;
SET PatientDiag;

ARRAY AllDiagPrefix{4} $ RespCode MuscCode InfectCode CircCode ('A' 'R' 'B' 'S');


ARRAY AllDiagInd{4,3} 
	RespIndYr1 - RespIndYr3
	MuscIndYr1 - MuscIndYr3
	InfectIndYr1 - InfectIndYr3
	CircIndYr1 - CircIndYr3
	;

ARRAY AllDiagCodes {17} $ DgCd1-DgCd17;

DO i = 1 TO 4;
	DO j = 1 TO 3;
		AllDiagInd{i,j} = 0;
	END;
END;

ClaimYr = year(ClaimDate);
FORMAT FirstDigitDgCd $1.;

*Determine year for the current claim;
IF ClaimYr eq 2014 THEN k = 1;
ELSE IF ClaimYr eq 2015 THEN k = 2;
ELSE IF ClaimYr eq 2016 THEN k = 3;

*Scan through each of the 17 codes recorded for the claim;
DO i = 1 TO 17;
	*For each code, scan through the prefixes to determine the diagnosis category;
	DO j = 1 TO 4;
		FirstDigitDgCd = substr(AllDiagCodes{i},1,1); *Strip off the first letter of the diagnosis code;
		IF FirstDigitDgCd eq AllDiagPrefix{j}
		THEN AllDiagInd{j,k} = 1; *Populated the 2D array;
	END;
END;
DROP i j k FirstDigitDgCd ClaimYr RespCode MuscCode InfectCode CircCode;
RUN;	

PROC PRINT DATA = DiagAll;
TITLE2 "Diagnosis Indicators by Year";
FORMAT ClaimDate MMDDYY8.;
RUN;
 
TITLE2;
Out[9]:
SAS Output

SAS Output

The SAS System

Diagnosis Indicators by Year

The PRINT Procedure

Data Set WORK.DIAGALL

Obs PatientID ClaimID ClaimDate DgCd1 DgCd2 DgCd3 DgCd4 DgCd5 DgCd6 DgCd7 DgCd8 DgCd9 DgCd10 DgCd11 DgCd12 DgCd13 DgCd14 DgCd15 DgCd16 DgCd17 RespIndYr1 RespIndYr2 RespIndYr3 MuscIndYr1 MuscIndYr2 MuscIndYr3 InfectIndYr1 InfectIndYr2 InfectIndYr3 CircIndYr1 CircIndYr2 CircIndYr3
1 A 111 07/01/15 S102 R008 A304 A100 R205 R007                       0 1 0 0 1 0 0 0 0 0 1 0
2 A 112 07/03/15 R008 R007 R202 A100 R005 S102 R003 R004 R001 R002 R205 R009 R006 S101 A503 A304   0 1 0 0 1 0 0 0 0 0 1 0
3 A 113 12/01/15 S101 S102 R202 B173 R006 A503 R004 R205 R002 R003               0 1 0 0 1 0 0 1 0 0 1 0
4 A 114 02/03/16 R202 R205 R001 R006 R002 R005 B173 R009 A304 S101 A503 S102           0 0 1 0 0 1 0 0 1 0 0 1
5 B 115 10/10/14 R008 R005 R205 R202 R003 R006 S102 B173 R002 R001 A100 A503           1 0 0 1 0 0 1 0 0 1 0 0
6 B 116 01/15/16 R006 A503 B173 A100 R004 R005                       0 0 1 0 0 1 0 0 1 0 0 0
7 C 117 10/10/14 A304 R002 R004 R005                           1 0 0 1 0 0 0 0 0 0 0 0
8 C 118 03/10/15 B173 R002 S101 R004 R003 R007 R006                     0 0 0 0 1 0 0 1 0 0 1 0
9 C 119 05/05/16 R202 R005 R009 R006                           0 0 0 0 0 1 0 0 0 0 0 0

This could be used to calculate the number of claims for a given type of disease by year for each patient.

In [10]:
PROC SORT DATA = DiagAll;
 BY PatientID;
RUN;

PROC MEANS NOPRINT DATA = DiagAll;
BY PatientID;
OUTPUT OUT = PatientSummary (DROP = _TYPE_ _FREQ_)
SUM(RespIndYr1) = RespIndYr1
SUM(RespIndYr2) = RespIndYr2
SUM(RespIndYr3) = RespIndYr3
SUM(InfectIndYr1) = InfectIndYr1
SUM(InfectIndYr2) = InfectIndYr2
SUM(InfectIndYr3) = InfectIndYr3
SUM(MuscIndYr1) = MuscIndYr1
SUM(MuscIndYr2) = MuscIndYr2
SUM(MuscIndYr3) = MuscIndYr3
SUM(CircIndYr1) = CircIndYr1
SUM(CircIndYr2) = CircIndYr2
SUM(CircIndYr3) = CircIndYr3;
RUN;

PROC PRINT DATA = PatientSummary;
TITLE2 "Patient Claim Summary";
RUN;

TITLE2;
Out[10]:
SAS Output

SAS Output

The SAS System

Patient Claim Summary

The PRINT Procedure

Data Set WORK.PATIENTSUMMARY

Obs PatientID RespIndYr1 RespIndYr2 RespIndYr3 InfectIndYr1 InfectIndYr2 InfectIndYr3 MuscIndYr1 MuscIndYr2 MuscIndYr3 CircIndYr1 CircIndYr2 CircIndYr3
1 A 0 3 1 0 1 1 0 3 1 0 3 1
2 B 1 0 1 1 0 1 1 0 1 1 0 0
3 C 1 0 0 0 1 0 1 1 1 0 1 0

Wide and Long Data Formats

Depending on the analysis and procedure used, we may need the data to be in wide or long format. Wide format refers to when we have one row with multiple observation per persion. Long format refers to a dataset has one data point per row. The diagnosis dataset is can be viewed as being in wide format where each claim is a unit with multiple diagnosis codes.

First let's, turn this data set into long format where there is only one diganosis code per row.

In [13]:
DATA DiagLong;
SET PatientDiag;
ARRAY AllDiagCodes{17} $ DgCd1-DgCd17;
DO i = 1 TO 17;
	Diagnosis = i;
	DgCd = AllDiagCodes{i};
	OUTPUT;
END;
KEEP PatientID ClaimDate Diagnosis DgCd;
RUN;

PROC PRINT DATA = DiagLong (OBS = 20);
FORMAT ClaimDate MMDDYY8.;
TITLE2 "Long Format";
RUN;

TITLE2;
Out[13]:
SAS Output

SAS Output

The SAS System

Long Format

The PRINT Procedure

Data Set WORK.DIAGLONG

Obs PatientID ClaimDate Diagnosis DgCd
1 A 07/01/15 1 S102
2 A 07/01/15 2 R008
3 A 07/01/15 3 A304
4 A 07/01/15 4 A100
5 A 07/01/15 5 R205
6 A 07/01/15 6 R007
7 A 07/01/15 7  
8 A 07/01/15 8  
9 A 07/01/15 9  
10 A 07/01/15 10  
11 A 07/01/15 11  
12 A 07/01/15 12  
13 A 07/01/15 13  
14 A 07/01/15 14  
15 A 07/01/15 15  
16 A 07/01/15 16  
17 A 07/01/15 17  
18 A 07/03/15 1 R008
19 A 07/03/15 2 R007
20 A 07/03/15 3 R202

Now we will go from long back to wide. First, we need to sort the data. Sorting by both PatientId and ClaimDate creates two variables FIRST. and LAST. for each variable that we can use. These variables are 1 when you reach the first/last variable value in the sorted list. Sort up to claim since there is only one claim per date.

In [14]:
PROC SORT DATA = DiagLong;
BY PatientId ClaimDate;
RUN;

DATA DiagLongFL;
SET DiagLong;
BY PatientID ClaimDate;
PatientFirst = FIRST.PatientID;
PatientLast = LAST.PatientID;
ClaimFirst = FIRST.ClaimDate;
ClaimLast = LAST.ClaimDate;
RUN;

/* See what the values of FIRST. and LAST. are for each row */
PROC PRINT DATA = DiagLongFL;
FORMAT ClaimDate MMDDYY8.;
RUN;
Out[14]:
SAS Output

SAS Output

The SAS System

The PRINT Procedure

Data Set WORK.DIAGLONGFL

Obs PatientID ClaimDate Diagnosis DgCd PatientFirst PatientLast ClaimFirst ClaimLast
1 A 07/01/15 1 S102 1 0 1 0
2 A 07/01/15 2 R008 0 0 0 0
3 A 07/01/15 3 A304 0 0 0 0
4 A 07/01/15 4 A100 0 0 0 0
5 A 07/01/15 5 R205 0 0 0 0
6 A 07/01/15 6 R007 0 0 0 0
7 A 07/01/15 7   0 0 0 0
8 A 07/01/15 8   0 0 0 0
9 A 07/01/15 9   0 0 0 0
10 A 07/01/15 10   0 0 0 0
11 A 07/01/15 11   0 0 0 0
12 A 07/01/15 12   0 0 0 0
13 A 07/01/15 13   0 0 0 0
14 A 07/01/15 14   0 0 0 0
15 A 07/01/15 15   0 0 0 0
16 A 07/01/15 16   0 0 0 0
17 A 07/01/15 17   0 0 0 1
18 A 07/03/15 1 R008 0 0 1 0
19 A 07/03/15 2 R007 0 0 0 0
20 A 07/03/15 3 R202 0 0 0 0
21 A 07/03/15 4 A100 0 0 0 0
22 A 07/03/15 5 R005 0 0 0 0
23 A 07/03/15 6 S102 0 0 0 0
24 A 07/03/15 7 R003 0 0 0 0
25 A 07/03/15 8 R004 0 0 0 0
26 A 07/03/15 9 R001 0 0 0 0
27 A 07/03/15 10 R002 0 0 0 0
28 A 07/03/15 11 R205 0 0 0 0
29 A 07/03/15 12 R009 0 0 0 0
30 A 07/03/15 13 R006 0 0 0 0
31 A 07/03/15 14 S101 0 0 0 0
32 A 07/03/15 15 A503 0 0 0 0
33 A 07/03/15 16 A304 0 0 0 0
34 A 07/03/15 17   0 0 0 1
35 A 12/01/15 1 S101 0 0 1 0
36 A 12/01/15 2 S102 0 0 0 0
37 A 12/01/15 3 R202 0 0 0 0
38 A 12/01/15 4 B173 0 0 0 0
39 A 12/01/15 5 R006 0 0 0 0
40 A 12/01/15 6 A503 0 0 0 0
41 A 12/01/15 7 R004 0 0 0 0
42 A 12/01/15 8 R205 0 0 0 0
43 A 12/01/15 9 R002 0 0 0 0
44 A 12/01/15 10 R003 0 0 0 0
45 A 12/01/15 11   0 0 0 0
46 A 12/01/15 12   0 0 0 0
47 A 12/01/15 13   0 0 0 0
48 A 12/01/15 14   0 0 0 0
49 A 12/01/15 15   0 0 0 0
50 A 12/01/15 16   0 0 0 0
51 A 12/01/15 17   0 0 0 1
52 A 02/03/16 1 R202 0 0 1 0
53 A 02/03/16 2 R205 0 0 0 0
54 A 02/03/16 3 R001 0 0 0 0
55 A 02/03/16 4 R006 0 0 0 0
56 A 02/03/16 5 R002 0 0 0 0
57 A 02/03/16 6 R005 0 0 0 0
58 A 02/03/16 7 B173 0 0 0 0
59 A 02/03/16 8 R009 0 0 0 0
60 A 02/03/16 9 A304 0 0 0 0
61 A 02/03/16 10 S101 0 0 0 0
62 A 02/03/16 11 A503 0 0 0 0
63 A 02/03/16 12 S102 0 0 0 0
64 A 02/03/16 13   0 0 0 0
65 A 02/03/16 14   0 0 0 0
66 A 02/03/16 15   0 0 0 0
67 A 02/03/16 16   0 0 0 0
68 A 02/03/16 17   0 1 0 1
69 B 10/10/14 1 R008 1 0 1 0
70 B 10/10/14 2 R005 0 0 0 0
71 B 10/10/14 3 R205 0 0 0 0
72 B 10/10/14 4 R202 0 0 0 0
73 B 10/10/14 5 R003 0 0 0 0
74 B 10/10/14 6 R006 0 0 0 0
75 B 10/10/14 7 S102 0 0 0 0
76 B 10/10/14 8 B173 0 0 0 0
77 B 10/10/14 9 R002 0 0 0 0
78 B 10/10/14 10 R001 0 0 0 0
79 B 10/10/14 11 A100 0 0 0 0
80 B 10/10/14 12 A503 0 0 0 0
81 B 10/10/14 13   0 0 0 0
82 B 10/10/14 14   0 0 0 0
83 B 10/10/14 15   0 0 0 0
84 B 10/10/14 16   0 0 0 0
85 B 10/10/14 17   0 0 0 1
86 B 01/15/16 1 R006 0 0 1 0
87 B 01/15/16 2 A503 0 0 0 0
88 B 01/15/16 3 B173 0 0 0 0
89 B 01/15/16 4 A100 0 0 0 0
90 B 01/15/16 5 R004 0 0 0 0
91 B 01/15/16 6 R005 0 0 0 0
92 B 01/15/16 7   0 0 0 0
93 B 01/15/16 8   0 0 0 0
94 B 01/15/16 9   0 0 0 0
95 B 01/15/16 10   0 0 0 0
96 B 01/15/16 11   0 0 0 0
97 B 01/15/16 12   0 0 0 0
98 B 01/15/16 13   0 0 0 0
99 B 01/15/16 14   0 0 0 0
100 B 01/15/16 15   0 0 0 0
101 B 01/15/16 16   0 0 0 0
102 B 01/15/16 17   0 1 0 1
103 C 10/10/14 1 A304 1 0 1 0
104 C 10/10/14 2 R002 0 0 0 0
105 C 10/10/14 3 R004 0 0 0 0
106 C 10/10/14 4 R005 0 0 0 0
107 C 10/10/14 5   0 0 0 0
108 C 10/10/14 6   0 0 0 0
109 C 10/10/14 7   0 0 0 0
110 C 10/10/14 8   0 0 0 0
111 C 10/10/14 9   0 0 0 0
112 C 10/10/14 10   0 0 0 0
113 C 10/10/14 11   0 0 0 0
114 C 10/10/14 12   0 0 0 0
115 C 10/10/14 13   0 0 0 0
116 C 10/10/14 14   0 0 0 0
117 C 10/10/14 15   0 0 0 0
118 C 10/10/14 16   0 0 0 0
119 C 10/10/14 17   0 0 0 1
120 C 03/10/15 1 B173 0 0 1 0
121 C 03/10/15 2 R002 0 0 0 0
122 C 03/10/15 3 S101 0 0 0 0
123 C 03/10/15 4 R004 0 0 0 0
124 C 03/10/15 5 R003 0 0 0 0
125 C 03/10/15 6 R007 0 0 0 0
126 C 03/10/15 7 R006 0 0 0 0
127 C 03/10/15 8   0 0 0 0
128 C 03/10/15 9   0 0 0 0
129 C 03/10/15 10   0 0 0 0
130 C 03/10/15 11   0 0 0 0
131 C 03/10/15 12   0 0 0 0
132 C 03/10/15 13   0 0 0 0
133 C 03/10/15 14   0 0 0 0
134 C 03/10/15 15   0 0 0 0
135 C 03/10/15 16   0 0 0 0
136 C 03/10/15 17   0 0 0 1
137 C 05/05/16 1 R202 0 0 1 0
138 C 05/05/16 2 R005 0 0 0 0
139 C 05/05/16 3 R009 0 0 0 0
140 C 05/05/16 4 R006 0 0 0 0
141 C 05/05/16 5   0 0 0 0
142 C 05/05/16 6   0 0 0 0
143 C 05/05/16 7   0 0 0 0
144 C 05/05/16 8   0 0 0 0
145 C 05/05/16 9   0 0 0 0
146 C 05/05/16 10   0 0 0 0
147 C 05/05/16 11   0 0 0 0
148 C 05/05/16 12   0 0 0 0
149 C 05/05/16 13   0 0 0 0
150 C 05/05/16 14   0 0 0 0
151 C 05/05/16 15   0 0 0 0
152 C 05/05/16 16   0 0 0 0
153 C 05/05/16 17   0 1 0 1
In [15]:
DATA DiagWide;
SET DiagLong;
BY PatientId ClaimDate;
RETAIN DgCd1-DgCd17; *Retain the diagnosis codes for each claim date to be output at the end of each claim;
ARRAY AllDiagCodes{17} $ DgCd1-DgCd17;
IF FIRST.ClaimDate THEN DO;
	DO i = 1 TO 17;
		AllDiagCodes{i} = " ";
	END;
END;
AllDiagCodes{Diagnosis} = DgCd;
IF LAST.ClaimDate THEN OUTPUT; *Output a single row for each claim date;
DROP i Diagnosis DgCd;
RUN;

PROC PRINT DATA = DiagWide;
FORMAT ClaimDate MMDDYY8.;
TITLE2 "Wide Format";
RUN;

TITLE2;
Out[15]:
SAS Output

SAS Output

The SAS System

Wide Format

The PRINT Procedure

Data Set WORK.DIAGWIDE

Obs PatientID ClaimDate DgCd1 DgCd2 DgCd3 DgCd4 DgCd5 DgCd6 DgCd7 DgCd8 DgCd9 DgCd10 DgCd11 DgCd12 DgCd13 DgCd14 DgCd15 DgCd16 DgCd17
1 A 07/01/15 S102 R008 A304 A100 R205 R007                      
2 A 07/03/15 R008 R007 R202 A100 R005 S102 R003 R004 R001 R002 R205 R009 R006 S101 A503 A304  
3 A 12/01/15 S101 S102 R202 B173 R006 A503 R004 R205 R002 R003              
4 A 02/03/16 R202 R205 R001 R006 R002 R005 B173 R009 A304 S101 A503 S102          
5 B 10/10/14 R008 R005 R205 R202 R003 R006 S102 B173 R002 R001 A100 A503          
6 B 01/15/16 R006 A503 B173 A100 R004 R005                      
7 C 10/10/14 A304 R002 R004 R005                          
8 C 03/10/15 B173 R002 S101 R004 R003 R007 R006                    
9 C 05/05/16 R202 R005 R009 R006                          

Next, we will use PROC TRANSPOSE in order to change the dataset from long to wide and wide to long format.

In [16]:
PROC PRINT DATA=DiagLong;
RUN;
Out[16]:
SAS Output

SAS Output

The SAS System

The PRINT Procedure

Data Set WORK.DIAGLONG

Obs PatientID ClaimDate Diagnosis DgCd
1 A 20270 1 S102
2 A 20270 2 R008
3 A 20270 3 A304
4 A 20270 4 A100
5 A 20270 5 R205
6 A 20270 6 R007
7 A 20270 7  
8 A 20270 8  
9 A 20270 9  
10 A 20270 10  
11 A 20270 11  
12 A 20270 12  
13 A 20270 13  
14 A 20270 14  
15 A 20270 15  
16 A 20270 16  
17 A 20270 17  
18 A 20272 1 R008
19 A 20272 2 R007
20 A 20272 3 R202
21 A 20272 4 A100
22 A 20272 5 R005
23 A 20272 6 S102
24 A 20272 7 R003
25 A 20272 8 R004
26 A 20272 9 R001
27 A 20272 10 R002
28 A 20272 11 R205
29 A 20272 12 R009
30 A 20272 13 R006
31 A 20272 14 S101
32 A 20272 15 A503
33 A 20272 16 A304
34 A 20272 17  
35 A 20423 1 S101
36 A 20423 2 S102
37 A 20423 3 R202
38 A 20423 4 B173
39 A 20423 5 R006
40 A 20423 6 A503
41 A 20423 7 R004
42 A 20423 8 R205
43 A 20423 9 R002
44 A 20423 10 R003
45 A 20423 11  
46 A 20423 12  
47 A 20423 13  
48 A 20423 14  
49 A 20423 15  
50 A 20423 16  
51 A 20423 17  
52 A 20487 1 R202
53 A 20487 2 R205
54 A 20487 3 R001
55 A 20487 4 R006
56 A 20487 5 R002
57 A 20487 6 R005
58 A 20487 7 B173
59 A 20487 8 R009
60 A 20487 9 A304
61 A 20487 10 S101
62 A 20487 11 A503
63 A 20487 12 S102
64 A 20487 13  
65 A 20487 14  
66 A 20487 15  
67 A 20487 16  
68 A 20487 17  
69 B 20006 1 R008
70 B 20006 2 R005
71 B 20006 3 R205
72 B 20006 4 R202
73 B 20006 5 R003
74 B 20006 6 R006
75 B 20006 7 S102
76 B 20006 8 B173
77 B 20006 9 R002
78 B 20006 10 R001
79 B 20006 11 A100
80 B 20006 12 A503
81 B 20006 13  
82 B 20006 14  
83 B 20006 15  
84 B 20006 16  
85 B 20006 17  
86 B 20468 1 R006
87 B 20468 2 A503
88 B 20468 3 B173
89 B 20468 4 A100
90 B 20468 5 R004
91 B 20468 6 R005
92 B 20468 7  
93 B 20468 8  
94 B 20468 9  
95 B 20468 10  
96 B 20468 11  
97 B 20468 12  
98 B 20468 13  
99 B 20468 14  
100 B 20468 15  
101 B 20468 16  
102 B 20468 17  
103 C 20006 1 A304
104 C 20006 2 R002
105 C 20006 3 R004
106 C 20006 4 R005
107 C 20006 5  
108 C 20006 6  
109 C 20006 7  
110 C 20006 8  
111 C 20006 9  
112 C 20006 10  
113 C 20006 11  
114 C 20006 12  
115 C 20006 13  
116 C 20006 14  
117 C 20006 15  
118 C 20006 16  
119 C 20006 17  
120 C 20157 1 B173
121 C 20157 2 R002
122 C 20157 3 S101
123 C 20157 4 R004
124 C 20157 5 R003
125 C 20157 6 R007
126 C 20157 7 R006
127 C 20157 8  
128 C 20157 9  
129 C 20157 10  
130 C 20157 11  
131 C 20157 12  
132 C 20157 13  
133 C 20157 14  
134 C 20157 15  
135 C 20157 16  
136 C 20157 17  
137 C 20579 1 R202
138 C 20579 2 R005
139 C 20579 3 R009
140 C 20579 4 R006
141 C 20579 5  
142 C 20579 6  
143 C 20579 7  
144 C 20579 8  
145 C 20579 9  
146 C 20579 10  
147 C 20579 11  
148 C 20579 12  
149 C 20579 13  
150 C 20579 14  
151 C 20579 15  
152 C 20579 16  
153 C 20579 17  
In [17]:
PROC TRANSPOSE DATA=DiagLong OUT=DiagTransposeWide PREFIX=DgCd;
VAR DgCd;
BY PatientId ClaimDate;
RUN;

PROC PRINT DATA=DiagTransposeWide;
FORMAT ClaimDate MMDDYY8.;
RUN;
Out[17]:
SAS Output

SAS Output

The SAS System

The PRINT Procedure

Data Set WORK.DIAGTRANSPOSEWIDE

Obs PatientID ClaimDate _NAME_ DgCd1 DgCd2 DgCd3 DgCd4 DgCd5 DgCd6 DgCd7 DgCd8 DgCd9 DgCd10 DgCd11 DgCd12 DgCd13 DgCd14 DgCd15 DgCd16 DgCd17
1 A 07/01/15 DgCd S102 R008 A304 A100 R205 R007                      
2 A 07/03/15 DgCd R008 R007 R202 A100 R005 S102 R003 R004 R001 R002 R205 R009 R006 S101 A503 A304  
3 A 12/01/15 DgCd S101 S102 R202 B173 R006 A503 R004 R205 R002 R003              
4 A 02/03/16 DgCd R202 R205 R001 R006 R002 R005 B173 R009 A304 S101 A503 S102          
5 B 10/10/14 DgCd R008 R005 R205 R202 R003 R006 S102 B173 R002 R001 A100 A503          
6 B 01/15/16 DgCd R006 A503 B173 A100 R004 R005                      
7 C 10/10/14 DgCd A304 R002 R004 R005                          
8 C 03/10/15 DgCd B173 R002 S101 R004 R003 R007 R006                    
9 C 05/05/16 DgCd R202 R005 R009 R006                          
In [18]:
PROC TRANSPOSE DATA=DiagTransposeWide OUT=DiagTransposeLong (RENAME=(COL1=DgCd));
VAR DgCd1-DgCd17;
BY PatientId ClaimDate;
RUN;

PROC PRINT DATA=DiagTransposeLong;
FORMAT ClaimDate MMDDYY8.;
RUN;
Out[18]:
SAS Output

SAS Output

The SAS System

The PRINT Procedure

Data Set WORK.DIAGTRANSPOSELONG

Obs PatientID ClaimDate _NAME_ DgCd
1 A 07/01/15 DgCd1 S102
2 A 07/01/15 DgCd2 R008
3 A 07/01/15 DgCd3 A304
4 A 07/01/15 DgCd4 A100
5 A 07/01/15 DgCd5 R205
6 A 07/01/15 DgCd6 R007
7 A 07/01/15 DgCd7  
8 A 07/01/15 DgCd8  
9 A 07/01/15 DgCd9  
10 A 07/01/15 DgCd10  
11 A 07/01/15 DgCd11  
12 A 07/01/15 DgCd12  
13 A 07/01/15 DgCd13  
14 A 07/01/15 DgCd14  
15 A 07/01/15 DgCd15  
16 A 07/01/15 DgCd16  
17 A 07/01/15 DgCd17  
18 A 07/03/15 DgCd1 R008
19 A 07/03/15 DgCd2 R007
20 A 07/03/15 DgCd3 R202
21 A 07/03/15 DgCd4 A100
22 A 07/03/15 DgCd5 R005
23 A 07/03/15 DgCd6 S102
24 A 07/03/15 DgCd7 R003
25 A 07/03/15 DgCd8 R004
26 A 07/03/15 DgCd9 R001
27 A 07/03/15 DgCd10 R002
28 A 07/03/15 DgCd11 R205
29 A 07/03/15 DgCd12 R009
30 A 07/03/15 DgCd13 R006
31 A 07/03/15 DgCd14 S101
32 A 07/03/15 DgCd15 A503
33 A 07/03/15 DgCd16 A304
34 A 07/03/15 DgCd17  
35 A 12/01/15 DgCd1 S101
36 A 12/01/15 DgCd2 S102
37 A 12/01/15 DgCd3 R202
38 A 12/01/15 DgCd4 B173
39 A 12/01/15 DgCd5 R006
40 A 12/01/15 DgCd6 A503
41 A 12/01/15 DgCd7 R004
42 A 12/01/15 DgCd8 R205
43 A 12/01/15 DgCd9 R002
44 A 12/01/15 DgCd10 R003
45 A 12/01/15 DgCd11  
46 A 12/01/15 DgCd12  
47 A 12/01/15 DgCd13  
48 A 12/01/15 DgCd14  
49 A 12/01/15 DgCd15  
50 A 12/01/15 DgCd16  
51 A 12/01/15 DgCd17  
52 A 02/03/16 DgCd1 R202
53 A 02/03/16 DgCd2 R205
54 A 02/03/16 DgCd3 R001
55 A 02/03/16 DgCd4 R006
56 A 02/03/16 DgCd5 R002
57 A 02/03/16 DgCd6 R005
58 A 02/03/16 DgCd7 B173
59 A 02/03/16 DgCd8 R009
60 A 02/03/16 DgCd9 A304
61 A 02/03/16 DgCd10 S101
62 A 02/03/16 DgCd11 A503
63 A 02/03/16 DgCd12 S102
64 A 02/03/16 DgCd13  
65 A 02/03/16 DgCd14  
66 A 02/03/16 DgCd15  
67 A 02/03/16 DgCd16  
68 A 02/03/16 DgCd17  
69 B 10/10/14 DgCd1 R008
70 B 10/10/14 DgCd2 R005
71 B 10/10/14 DgCd3 R205
72 B 10/10/14 DgCd4 R202
73 B 10/10/14 DgCd5 R003
74 B 10/10/14 DgCd6 R006
75 B 10/10/14 DgCd7 S102
76 B 10/10/14 DgCd8 B173
77 B 10/10/14 DgCd9 R002
78 B 10/10/14 DgCd10 R001
79 B 10/10/14 DgCd11 A100
80 B 10/10/14 DgCd12 A503
81 B 10/10/14 DgCd13  
82 B 10/10/14 DgCd14  
83 B 10/10/14 DgCd15  
84 B 10/10/14 DgCd16  
85 B 10/10/14 DgCd17  
86 B 01/15/16 DgCd1 R006
87 B 01/15/16 DgCd2 A503
88 B 01/15/16 DgCd3 B173
89 B 01/15/16 DgCd4 A100
90 B 01/15/16 DgCd5 R004
91 B 01/15/16 DgCd6 R005
92 B 01/15/16 DgCd7  
93 B 01/15/16 DgCd8  
94 B 01/15/16 DgCd9  
95 B 01/15/16 DgCd10  
96 B 01/15/16 DgCd11  
97 B 01/15/16 DgCd12  
98 B 01/15/16 DgCd13  
99 B 01/15/16 DgCd14  
100 B 01/15/16 DgCd15  
101 B 01/15/16 DgCd16  
102 B 01/15/16 DgCd17  
103 C 10/10/14 DgCd1 A304
104 C 10/10/14 DgCd2 R002
105 C 10/10/14 DgCd3 R004
106 C 10/10/14 DgCd4 R005
107 C 10/10/14 DgCd5  
108 C 10/10/14 DgCd6  
109 C 10/10/14 DgCd7  
110 C 10/10/14 DgCd8  
111 C 10/10/14 DgCd9  
112 C 10/10/14 DgCd10  
113 C 10/10/14 DgCd11  
114 C 10/10/14 DgCd12  
115 C 10/10/14 DgCd13  
116 C 10/10/14 DgCd14  
117 C 10/10/14 DgCd15  
118 C 10/10/14 DgCd16  
119 C 10/10/14 DgCd17  
120 C 03/10/15 DgCd1 B173
121 C 03/10/15 DgCd2 R002
122 C 03/10/15 DgCd3 S101
123 C 03/10/15 DgCd4 R004
124 C 03/10/15 DgCd5 R003
125 C 03/10/15 DgCd6 R007
126 C 03/10/15 DgCd7 R006
127 C 03/10/15 DgCd8  
128 C 03/10/15 DgCd9  
129 C 03/10/15 DgCd10  
130 C 03/10/15 DgCd11  
131 C 03/10/15 DgCd12  
132 C 03/10/15 DgCd13  
133 C 03/10/15 DgCd14  
134 C 03/10/15 DgCd15  
135 C 03/10/15 DgCd16  
136 C 03/10/15 DgCd17  
137 C 05/05/16 DgCd1 R202
138 C 05/05/16 DgCd2 R005
139 C 05/05/16 DgCd3 R009
140 C 05/05/16 DgCd4 R006
141 C 05/05/16 DgCd5  
142 C 05/05/16 DgCd6  
143 C 05/05/16 DgCd7  
144 C 05/05/16 DgCd8  
145 C 05/05/16 DgCd9  
146 C 05/05/16 DgCd10  
147 C 05/05/16 DgCd11  
148 C 05/05/16 DgCd12  
149 C 05/05/16 DgCd13  
150 C 05/05/16 DgCd14  
151 C 05/05/16 DgCd15  
152 C 05/05/16 DgCd16  
153 C 05/05/16 DgCd17  

Combining Datasets

Consider a multi site clinical trial. Each site will generate its own dataset for the patients at that site. To perform the final analysis, we will want to combine the datasets into one master dataset. We will perform two dataset merges.

  • Stacking - Here we just want to combine the datasets by adding on one dataset as new observations into the other dataset.
  • Merge - We have observations on the same person but for different variables in two datasets and we want to combine all the variables for the same subject into one row in a single dataset.

We will use two simple example datasets. Both sites have most of the same variables, but no patients in common.

In [20]:
DATA Site1;
INPUT PatientId SBP DBP Visit MMDDYY8.;
DATALINES;
100 117 75 12/15/16
101 118 80 12/10/16
102 110 71 12/3/16
;
RUN;

DATA Site2;
INPUT PatientId SBP DBP;
DATALINES;
103 125 75
104 106 79
105 127 82
;
RUN;
Out[20]:

25                                                         The SAS System                           20:18 Sunday, September 24, 2017

553 ods listing close;ods html5 (id=saspy_internal) file=_tomods1 options(bitmap_mode='inline') device=svg; ods graphics on /
553 ! outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: _TOMODS1
554
555 DATA Site1;
556 INPUT PatientId SBP DBP Visit MMDDYY8.;
557 DATALINES;

NOTE: The data set WORK.SITE1 has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds


561 ;
562 RUN;
563
564 DATA Site2;
565 INPUT PatientId SBP DBP;
566 DATALINES;

NOTE: The data set WORK.SITE2 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds


570 ;
571 RUN;
572
573 ods html5 (id=saspy_internal) close;ods listing;
574

We can use the SET statement to stack the two datasets. This will create a new dataset containing all variables in both datasets. If there is a variables that is not in both sets, then missing values are assigned for observations from the dataset that did not have that variable.

In [21]:
DATA MasterSite;
SET Site1 Site2;
RUN;

PROC PRINT DATA = MasterSite;
RUN;
Out[21]:
SAS Output

SAS Output

The SAS System

The PRINT Procedure

Data Set WORK.MASTERSITE

Obs PatientId SBP DBP Visit
1 100 117 75 20803
2 101 118 80 20798
3 102 110 71 20791
4 103 125 75 .
5 104 106 79 .
6 105 127 82 .
In [22]:
DATA MasterSite;
SET Site2 Site1;
RUN;

PROC PRINT DATA = MasterSite;
RUN;
Out[22]:
SAS Output

SAS Output

The SAS System

The PRINT Procedure

Data Set WORK.MASTERSITE

Obs PatientId SBP DBP Visit
1 103 125 75 .
2 104 106 79 .
3 105 127 82 .
4 100 117 75 20803
5 101 118 80 20798
6 102 110 71 20791
In [23]:
/* These datasets have some patients in common */
DATA Site3;
INPUT PatientId SBP DBP Visit MMDDYY8.;
DATALINES;
100 117 75 12/15/16
101 118 80 12/10/16
102 110 71 12/3/16
;
RUN;

DATA Site4;
INPUT PatientId Weight Height;
DATALINES;
100 125 60
101 140 67
105 180 70
;
RUN;
Out[23]:

28                                                         The SAS System                           20:18 Sunday, September 24, 2017

603 ods listing close;ods html5 (id=saspy_internal) file=_tomods1 options(bitmap_mode='inline') device=svg; ods graphics on /
603 ! outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: _TOMODS1
604
605 /* These datasets have some patients in common */
606 DATA Site3;
607 INPUT PatientId SBP DBP Visit MMDDYY8.;
608 DATALINES;

NOTE: The data set WORK.SITE3 has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds


612 ;
613 RUN;
614
615 DATA Site4;
616 INPUT PatientId Weight Height;
617 DATALINES;

NOTE: The data set WORK.SITE4 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds


621 ;
622 RUN;
623
624 ods html5 (id=saspy_internal) close;ods listing;
625

We will merge the datasets by patient id, so we need to sort by patient id first.

In [24]:
PROC SORT DATA = Site3;
BY PatientId;
RUN;

PROC SORT DATA = Site4;
BY PatientId;
RUN;


DATA Master2;
MERGE Site3 Site4;
BY PatientId;
RUN;

PROC PRINT DATA = Master2;
TITLE2 "Merged Dataset";
RUN;
Out[24]:
SAS Output

SAS Output

The SAS System

Merged Dataset

The PRINT Procedure

Data Set WORK.MASTER2

Obs PatientId SBP DBP Visit Weight Height
1 100 117 75 20803 125 60
2 101 118 80 20798 140 67
3 102 110 71 20791 . .
4 105 . . . 180 70

Data Cleaning

Data cleaing refers to checking our dataset for "bad" or unusual data. This could mean checking for typos, missing values, or outliers. We may need to fix these errors, remove observations, or just examine the unusual observation further to determine if further data is needed. We will work through an example of cleaning a dataset containing patient records including

  • patient number - character (e.g. '001')
  • gender - character M or F
  • visit date - date MMDDYY10.
  • heart rate - numeric 40-100
  • systolic blood pressure - numeric 80-200
  • diastolic blood pressure - numeric 60-120
  • diagnosis code - character 1 to 3 digits (e.g. '5' or '128')
  • adverse event - character ('0' or '1')
In [1]:
DATA PATIENTS; 
INFILE "H:\BiostatCourses\PublicHealthComputing\Lectures\Week5DataManagement\SAS\patients.txt" PAD; 
INPUT @1  PATNO    $3. 
      @4  GENDER   $1. 
      @5  VISIT    MMDDYY10. 
      @15 HR       3. 
      @18 SBP      3. 
      @21 DBP      3. 
      @24 DX       $3. 
      @27 AE       $1.; 
LABEL PATNO   = "PATIENT NUMBER" 
      GENDER  = "GENDER" 
      VISIT   = "VISIT DATE" 
      HR      = "HEART RATE" 
      SBP     = "SYSTOLIC BLOOD PRESSURE" 
      DBP     = "DIASTOLIC BLOODPRESSURE" 
      DX      = "DIAGNOSIS CODE" 
      AE      = "ADVERSE EVENT?"; 
FORMAT VISIT MMDDYY10.; 
RUN; 

PROC PRINT DATA = patients;
RUN;
SAS Connection established. Subprocess id is 8736

Out[1]:
SAS Output

SAS Output

The SAS System

The PRINT Procedure

Data Set WORK.PATIENTS

Obs PATNO GENDER VISIT HR SBP DBP DX AE
1 001 M 11/11/1998 88 140 80 1 0
2 002 F 11/13/1998 84 120 78 X 0
3 003 X 10/21/1998 68 190 100 3 1
4 004 F 01/01/1999 101 200 120 5 A
5 XX5 M 05/07/1998 68 120 80 1 0
6 006   06/15/1999 72 102 68 6 1
7 007 M . 88 148 102   0
8 008 F 08/08/1998 210 . . 7 0
9 009 M 09/25/1999 86 240 180 4 1
10 010 F 10/19/1999 . 40 120 1 0
11 011 M . 68 300 20 4 1
12 012 M 10/12/1998 60 122 74   0
13 013 2 08/23/1999 74 108 64 1  
14 014 M 02/02/1999 22 130 90   1
15 002 F 11/13/1998 84 120 78 X 0
16 003 M 11/12/1999 58 112 74   0
17 015 F . 82 148 88 3 1
18 017 F 04/05/1999 208 . 84 2 0
19 019 M 06/07/1999 58 118 70   0
20 123 M . 60 . . 1 0
21 321 F . 900 400 200 5 1
22 020 F . 10 20 8   0
23 022 M 10/10/1999 48 114 82 2 1
24 023 F 12/31/1998 22 34 78   0
25 024 F 11/09/1998 76 120 80 1 0
26 025 M 01/01/1999 74 102 68 5 1
27 027 F . . 166 106 7 0
28 028 F 03/28/1998 66 150 90 3 0
29 029 M 05/15/1998 . . . 4 1
30 006 F 07/07/1999 82 148 84 1 0

First let's check the character variable gender for invalid entries. The data should contain only M or F. One way to do this is by using PROC FREQ.

In [2]:
PROC FREQ DATA = Patients;
TABLE GENDER / NOCUM NOPERCENT;
RUN;
Out[2]:
SAS Output

SAS Output

The SAS System

The FREQ Procedure

The FREQ Procedure

Table GENDER

One-Way Frequencies

GENDER
GENDER Frequency
Frequency Missing = 1
2 1
F 14
M 13
X 1

We could also use a DATA step to check for invalid values and print the unusual ones to the log file using PUT.

In [ ]:
DATA _NULL_;
SET Patients;
TITLE "Listing Invalid Input";
***CHECK GENDER;
IF GENDER NOT IN ('F','M',' ') THEN
	PUT PATNO= GENDER=;
***CHECK DX;
**Verify returns position of first character that is not ' ' or 1-9;
IF VERIFY(DX,' 0123456789') NE 0 THEN 
	PUT PATNO= DX=;
***CHECK AE;
IF AE NOT IN ('0','1',' ') THEN 
	PUT PATNO= AE=;
RUN;

The following outputs to the log file.

PATNO=002 DX=X
PATNO=003 GENDER=X
PATNO=004 AE=A
PATNO=013 GENDER=2
PATNO=002 DX=X

We could also use a print statement to check for invalid input. Let's use a print statement to check for invalid systolic blood pressure values (outside of 80-200) and missing values

Recall that in SAS the missing numeric value . is interpreted in inequalities as -infinity.

In [3]:
PROC PRINT DATA = Patients;
WHERE SBP NOT BETWEEN 80 AND 200;
VAR PATNO SBP;
RUN;
Out[3]:
SAS Output

SAS Output

The SAS System

The PRINT Procedure

Data Set WORK.PATIENTS

Obs PATNO SBP
8 008 .
9 009 240
10 010 40
11 011 300
18 017 .
20 123 .
21 321 400
22 020 20
24 023 34
29 029 .

We can also use formats to check for invalid input. The following example using a format and a data step to check for out of bounds heart rate, systolic blood pressure, and diastolic blood pressure.

In [ ]:
PROC FORMAT;
VALUE HR_CK 40-100, . = 'OK'; *Values between 40 and 100 and . are OK;
VALUE SBP_CK 80-200, . = 'OK'; *Value between 80 and 200 and . are OK;
VALUE DBP_CK 60-120, . = 'OK'; *Value between 60 and 120 and . are OK;
RUN;

/* The PUT function applies the given format to the given variable */
DATA _NULL_;
SET Patients;
IF PUT(HR,HR_CK.) NE 'OK' THEN PUT 
PATNO= HR=;
IF PUT(SBP,SBP_CK.) NE 'OK' THEN PUT
PATNO= SBP=;
IF PUT(DBP,DBP_CK.) NE 'OK' THEN PUT
PATNO= DBP=;
RUN;

Log file:

PATNO=004 HR=101
PATNO=008 HR=210
PATNO=009 SBP=240
PATNO=009 DBP=180
PATNO=010 SBP=40
PATNO=011 SBP=300
PATNO=011 DBP=20
PATNO=014 HR=22
PATNO=017 HR=208
PATNO=321 HR=900
PATNO=321 SBP=400
PATNO=321 DBP=200
PATNO=020 HR=10
PATNO=020 SBP=20
PATNO=020 DBP=8
PATNO=023 HR=22
PATNO=023 SBP=34
In [ ]: