10. Data Cleaning

In this lesson, we will learn some basic techniques to check our data for invalid inputs. One of the first and most important steps in any data processing task is to verify that your data values are correct or, at the very least, conform to some a set of rules. For example, a variable called GENDER would be expected to have only two values; a variable representing height in inches would be expected to be within reasonable limits. Some critical applications require a double entry and verification process of data entry. Whether this is done or not, it is still useful to run your data through a series of data checking operations.

For this lesson, we will work through part of Ron Cody’s paper Data Cleaning 101. For the examples, we will use a small dataset with patient data stored in the raw data file PAITENTS.TXT (see the course webpage’s data folder for the dataset). This dataset contains the following variables.

Variable Name

Description

Variable Type

Valid Values

PATNO

Patient Number

Character

Numerals

GENDER

Gender

Character

‘M’ or ‘F’

VISIT

Visit Date

MMDDYY10.

Any Valid Date

HR

Heart Rate

Numeric

40 to 100

SBP

Systolic Blood Pressure

Numeric

80 to 200

DBP

Diastolic Blood Pressure

Numeric

60 to 120

DX

Diagnosis Code

Character

1 to 3 digits

AE

Adverse Event

Character

‘0’ or ‘1’

DATA PATIENTS; 
INFILE "./data/patients.txt" PAD; 
INPUT @1  PATNO    $3. 
      @4  GENDER   $1. 
      @5  VISIT    MMDDYY10. 
      @15 HR       3. 
      @18 SBP      3. 
      @21 DBP      3. 
      @24 DX       $3. 
      @27 AE       $1.; 
LABEL PATNO   = "PATIENT NUMBER" 
      GENDER  = "GENDER" 
      VISIT   = "VISIT DATE" 
      HR      = "HEART RATE" 
      SBP     = "SYSTOLIC BLOOD PRESSURE" 
      DBP     = "DIASTOLIC BLOODPRESSURE" 
      DX      = "DIAGNOSIS CODE" 
      AE      = "ADVERSE EVENT?"; 
FORMAT VISIT MMDDYY10.; 
RUN; 

PROC PRINT DATA = patients;
RUN;
SAS Connection established. Subprocess id is 5001
SAS Output

The SAS System

Obs PATNO GENDER VISIT HR SBP DBP DX AE
1 001 M 11/11/1998 88 140 80 1 0
2 002 F 11/13/1998 84 120 78 X 0
3 003 X 10/21/1998 68 190 100 3 1
4 004 F 01/01/1999 101 200 120 5 A
5 XX5 M 05/07/1998 68 120 80 1 0
6 006   06/15/1999 72 102 68 6 1
7 007 M . 88 148 102   0
8 008 F 08/08/1998 210 . . 7 0
9 009 M 09/25/1999 86 240 180 4 1
10 010 F 10/19/1999 . 40 120 1 0
11 011 M . 68 300 20 4 1
12 012 M 10/12/1998 60 122 74   0
13 013 2 08/23/1999 74 108 64 1  
14 014 M 02/02/1999 22 130 90   1
15 002 F 11/13/1998 84 120 78 X 0
16 003 M 11/12/1999 58 112 74   0
17 015 F . 82 148 88 3 1
18 017 F 04/05/1999 208 . 84 2 0
19 019 M 06/07/1999 58 118 70   0
20 123 M . 60 . . 1 0
21 321 F . 900 400 200 5 1
22 020 F . 10 20 8   0
23 022 M 10/10/1999 48 114 82 2 1
24 023 F 12/31/1998 22 34 78   0
25 024 F 11/09/1998 76 120 80 1 0
26 025 M 01/01/1999 74 102 68 5 1
27 027 F . . 166 106 7 0
28 028 F 03/28/1998 66 150 90 3 0
29 029 M 05/15/1998 . . . 4 1
30 006 F 07/07/1999 82 148 84 1 0

10.1. Checking for Missing Data

A we have already seen, SAS has separate missing values for character and numeric data:

  • Missing numeric data is represented by a period (.)

  • Missing character data is represented by a single empty space (‘ ‘) or the null string (‘’).

We can check to see if a particular value is missing by using = and comparing it to the corresponding missing data value or we can use the missing function, which checks to see if a value is missing and returns 1 if it is missing and 0 otherwise.

Some other missing data functions in SAS include:

  • nmiss - Can be given a list of numeric variables and counts the number of variables that contain missing data.

  • cmiss - Can be given character or numeric variables and counts the number of variables that contain missing data.

Other useful checks for missing data include using PROC FREQ to count the number of missing data values in a character variable and PROC MEANS with the nmiss option to count the number of missing values in a numeric variable.

Example

The following SAS program prints the rows of PATIENTS in which the GENDER or VISIT are missing

PROC PRINT data = patients;
    WHERE gender = ' ' OR visit = .;
RUN;
SAS Output

The SAS System

Obs PATNO GENDER VISIT HR SBP DBP DX AE
6 006   06/15/1999 72 102 68 6 1
7 007 M . 88 148 102   0
11 011 M . 68 300 20 4 1
17 015 F . 82 148 88 3 1
20 123 M . 60 . . 1 0
21 321 F . 900 400 200 5 1
22 020 F . 10 20 8   0
27 027 F . . 166 106 7 0

Alternatively, we could use the missing function to check to see if GENDER or VISIT are missing valued.

PROC PRINT data = patients;
    WHERE missing(gender) OR missing(visit);
RUN;
SAS Output

The SAS System

Obs PATNO GENDER VISIT HR SBP DBP DX AE
6 006   06/15/1999 72 102 68 6 1
7 007 M . 88 148 102   0
11 011 M . 68 300 20 4 1
17 015 F . 82 148 88 3 1
20 123 M . 60 . . 1 0
21 321 F . 900 400 200 5 1
22 020 F . 10 20 8   0
27 027 F . . 166 106 7 0

Example

The following SAS function illustrates using the cmiss function to find the rows that include complete cases, i.e. none of the variables contain missing data.

DATA patients_cc;
    SET patients;
    IF cmiss(of _ALL_) ~= 0 THEN DELETE;
RUN;

PROC PRINT data = patients_cc;
RUN;
SAS Output

The SAS System

Obs PATNO GENDER VISIT HR SBP DBP DX AE
1 001 M 11/11/1998 88 140 80 1 0
2 002 F 11/13/1998 84 120 78 X 0
3 003 X 10/21/1998 68 190 100 3 1
4 004 F 01/01/1999 101 200 120 5 A
5 XX5 M 05/07/1998 68 120 80 1 0
6 009 M 09/25/1999 86 240 180 4 1
7 002 F 11/13/1998 84 120 78 X 0
8 022 M 10/10/1999 48 114 82 2 1
9 024 F 11/09/1998 76 120 80 1 0
10 025 M 01/01/1999 74 102 68 5 1
11 028 F 03/28/1998 66 150 90 3 0
12 006 F 07/07/1999 82 148 84 1 0

We have used the special SAS variable _ALL_ to list all the variables in our dataset in the cmiss function call. This allows us to check all variables for missing data in a given row without needed to list out all the variables manually or with a variable list. Since we have a mix of numeric and character variables, we use cmiss, but if we had only had numeric variables (or were only checking numeric variables by using the _NUMERIC_ SAS variable) then we could have used nmiss. A row is complete if the number of missing values is 0.

10.2. Checking for Invalid Character Values

One simple technique to check a character variable, if there are a limited number of categories, for invalid input is to use PROC FREQ. Creating a frequency table of the character variable will list out all the oberved categories along with the frequency of occurence.

Example

The following SAS program uses PROC FREQ to check the GENDER, DX, and AE variables for invalid input. Recall that these have certain rules for valid values listed in the variable description table above.

PROC FREQ data = patients;
   TITLE "Frequency Counts";
   TABLES GENDER DX AE / NOCUM NOPERCENT;
RUN;
SAS Output

Frequency Counts

The FREQ Procedure

GENDER
GENDER Frequency
Frequency Missing = 1
2 1
F 14
M 13
X 1
DIAGNOSIS CODE
DX Frequency
Frequency Missing = 7
1 7
2 2
3 3
4 3
5 3
6 1
7 2
X 2
ADVERSE EVENT?
AE Frequency
Frequency Missing = 1
0 18
1 10
A 1

Note that GENDER has a few miscoded values such as 2 and X, DX has a miscoded value of X, AE has a miscoded value of A, and all have at least one missing value. We would need to revisit the patient data to see if we can determine what these should have been coded as.

10.3. Using a DATA Step to Indetify Invalid Characters

We can also use a DATA step with IF statments to identify observations that do not meet our expected patterns.

Example

The following SAS program checks GENDER, DX, and AE for invalid character codes using IF statments in combination with PUT and VERIFY.

DATA _NULL_;
    SET Patients;
    FILE print; *Print to output window instead of LOG window;
    TITLE "Listing Invalid Input";
    ***CHECK GENDER;
    IF GENDER NOT IN ('F','M',' ') THEN
        PUT PATNO= GENDER=;
    ***CHECK DX;
    **Verify returns position of first character that is not ' ' or 1-9;
    IF VERIFY(DX,' 0123456789') NE 0 THEN 
        PUT PATNO= DX=;
    ***CHECK AE;
    IF AE NOT IN ('0','1',' ') THEN 
        PUT PATNO= AE=;
RUN;
SAS Output

Listing Invalid Input

PATNO=002 DX=X                                                                                                                      
PATNO=003 GENDER=X                                                                                                                  
PATNO=004 AE=A                                                                                                                      
PATNO=013 GENDER=2                                                                                                                  
PATNO=002 DX=X                                                                                                                      

The _NULL_ dataset keyword allows us to read data from the PATIENTS dataset without actually creating a new dataset. The FILE PRINT statement tells SAS to print the output of the PUT statements to the output window instead of the log window. Becuase we are expected only 'M', 'F', or ' ' for valid values of GENDER, we can check for these specific values with an IF statement and output and rows that have an invalid value. We do the same thing to check AE where we check for the valid values of '0', '1' or ' '. For DX, we know it should only include numeric values between 0 and 9 in an up to 3 digit character string. The VERIFY function returns the location first character in DX that is not the search string of the numerals 0 to 9 or a blank space. If DX only contains valid values from this search string, then VERIFY returns 0.

Another way to identify miscoded character values in a DATA step is by using user defined formats to group all unexpected codes into a ‘miscoded’ group.

Example

The following SAS program uses our known valid values for the variables GNDER, DX and AE to define formats that translate any unexpected codes into 'miscoded', so that we can identify the rows with miscoded values.

PROC FORMAT;   
    VALUE $GENDER 'F','M' = 'VALID'                 
                  ' '     = 'MISSING'                 
                  OTHER   = 'MISCODED';   
    VALUE $DX '001' - '999' = 'VALID'             
              ' '           = 'MISSING'                 
                  OTHER     = 'MISCODED';   
    VALUE $AE '0','1' = 'VALID'             
              ' '     = 'MISSING'              
               OTHER  = 'MISCODED'; 
RUN; 

DATA _NULL_;
    SET Patients;
    FILE print; *Print to output window instead of LOG window;
    TITLE "Listing Invalid Input";
    ***CHECK GENDER;
    IF PUT(GENDER, $GENDER.) = 'MISCODED'
       THEN PUT PATNO= GENDER=;
    ***CHECK DX;
    IF PUT(DX, $DX.) = 'MISCODED'
       THEN PUT PATNO= DX=;
    ***CHECK AE;
    IF PUT(AE, $AE.) = 'MISCODED'
       THEN PUT PATNO= AE=;
RUN;
SAS Output

Listing Invalid Input

PATNO=002 DX=X                                                                                                                      
PATNO=003 GENDER=X                                                                                                                  
PATNO=004 AE=A                                                                                                                      
PATNO=013 GENDER=2                                                                                                                  
PATNO=002 DX=X                                                                                                                      

The PUT function applies the format on the fly and outputs the translated value to be compared to 'MISCODED'. If the value falls into the OTHER category of the translation and becomed 'MISCODED', then the PATNO and value of the miscoded variable are output.

10.4. Checking for Invalid Numeric Values

The techniques for checking for invalid numeric data are quite different from the techniques we used with character data. Although there are usually many different values a numeric variable can take on, there are several techniques that we can use to help identify data errors. One simple technique is to examine some of the largest and smallest data values for each numeric variable. If we see values such as 12 or 1200 for a systolic blood pressure, we can be quite certain that an error was made, either in entering the data values or on the original data collection form.

There are also some internal consistency methods that can be used to identify possible data errors. If we see that most of the data values fall with a certain range of values, then any values that fall far enough outside that range may be data errors. We will develop programs based on these ideas in this section.

The simplest way to check to see if there are some missing values is to use PROC MEANS or PROC UNIVARIATE to check to see if you have missing values or values that are out of the valid range for a numeric variable.

Example

In the following SAS program, we use PROC MEANS to see if there are any values of HR, SBP or DBP that are too large or too small or missing by using the nmiss, min and max statistics.

PROC MEANS DATA=PATIENTS N NMISS MIN MAX MAXDEC=0;   
    TITLE "CHECKING NUMERIC VARIABLES";   
    VAR HR SBP DBP; 
RUN;
SAS Output

CHECKING NUMERIC VARIABLES

The MEANS Procedure

Variable Label N N Miss Minimum Maximum
HR
SBP
DBP
HEART RATE
SYSTOLIC BLOOD PRESSURE
DIASTOLIC BLOODPRESSURE
27
26
27
3
4
3
10
20
8
900
400
200

From the output, we can clearly see that there are SOME values of HR, SBP and DBP that are out the valid range, but we don't know which ones or how many there are that are invalid.

Similar to character variables, there are two simple ways that we can find out WHICH rows have potentially invalid input:

  • we can use IF statements to check for the values outside the expected ranges

  • we can use formats to identify potentially miscoded values

Example

The following example explores these two options of using IF statements and FORMATS to identify values of HR, SBP and DBP that are out of the expected ranges. We begin with the IF statements.

DATA _NULL_;   
    SET patients;
    FILE PRINT; *SEND OUTPUT TO THE OUTPUT WINDOW;   
    TITLE "LISTING OF INVALID DATA VALUES";   
    *NOTE: WE WILL ONLY INPUT THOSE VARIABLES OF INTEREST;   
    *CHECK HR;   
    IF (HR LT 40 AND HR NE .) OR HR GT 100       
       THEN PUT PATNO= HR=;   
    *CHECK SBP;   
    IF (SBP LT 80 AND SBP NE .) OR SBP GT 200 
       THEN PUT PATNO= SBP=;   
    *CHECK DBP;   
    IF (DBP LT 60 AND DBP NE .) OR DBP GT 120 
       THEN PUT PATNO= DBP=; 
RUN;
SAS Output

LISTING OF INVALID DATA VALUES

PATNO=004 HR=101                                                                                                                    
PATNO=008 HR=210                                                                                                                    
PATNO=009 SBP=240                                                                                                                   
PATNO=009 DBP=180                                                                                                                   
PATNO=010 SBP=40                                                                                                                    
PATNO=011 SBP=300                                                                                                                   
PATNO=011 DBP=20                                                                                                                    
PATNO=014 HR=22                                                                                                                     
PATNO=017 HR=208                                                                                                                    
PATNO=321 HR=900                                                                                                                    
PATNO=321 SBP=400                                                                                                                   
PATNO=321 DBP=200                                                                                                                   
PATNO=020 HR=10                                                                                                                     
PATNO=020 SBP=20                                                                                                                    
PATNO=020 DBP=8                                                                                                                     
PATNO=023 HR=22                                                                                                                     
PATNO=023 SBP=34                                                                                                                    

Note that the checks for values below some cutoff also check that the value is not missing since a value of missing (.) is interpreted as negative infinity in inequalities.

An alternative way to do this is with FORMATS as shown below in the following SAS program.

PROC FORMAT;
    VALUE HR_CK 40-100, . = 'OK'; *Values between 40 and 100 and . are OK;
    VALUE SBP_CK 80-200, . = 'OK'; *Value between 80 and 200 and . are OK;
    VALUE DBP_CK 60-120, . = 'OK'; *Value between 60 and 120 and . are OK;
RUN;

/* The PUT function applies the given format to the given variable */
DATA _NULL_;
    SET Patients;
    FILE print;
    IF PUT(HR,HR_CK.) NE 'OK' 
       THEN PUT PATNO= HR=;
    IF PUT(SBP,SBP_CK.) NE 'OK' 
       THEN PUT PATNO= SBP=;
    IF PUT(DBP,DBP_CK.) NE 'OK' 
       THEN PUT PATNO= DBP=;
RUN;
SAS Output

LISTING OF INVALID DATA VALUES

PATNO=004 HR=101                                                                                                                    
PATNO=008 HR=210                                                                                                                    
PATNO=009 SBP=240                                                                                                                   
PATNO=009 DBP=180                                                                                                                   
PATNO=010 SBP=40                                                                                                                    
PATNO=011 SBP=300                                                                                                                   
PATNO=011 DBP=20                                                                                                                    
PATNO=014 HR=22                                                                                                                     
PATNO=017 HR=208                                                                                                                    
PATNO=321 HR=900                                                                                                                    
PATNO=321 SBP=400                                                                                                                   
PATNO=321 DBP=200                                                                                                                   
PATNO=020 HR=10                                                                                                                     
PATNO=020 SBP=20                                                                                                                    
PATNO=020 DBP=8                                                                                                                     
PATNO=023 HR=22                                                                                                                     
PATNO=023 SBP=34                                                                                                                    

This is a fairly simple and efficient program. The user-defined formats HR_CK, SBP_CK, and DBP_CK all assign the format ‘OK’ for any data value in the acceptable range. In the DATA step, the PUT function is used to test if a value outside the valid range was found. For example, a value of 22 for heart rate would not fall within the range of 40 to 100 or missing and the format OK would not be assigned. The result of the PUT function for heart rate is not equal to ‘OK’ and the argument of the IF statement is true. The appropriate PUT statement is then executed and the invalid value is printed to the print file.

This is of course only a basic introduction to methods for checking your data. See for example, Ron Cody’s Cody’s Data Cleaning Techniques Using SAS for more details on data cleaning with SAS.

10.5. Exercises

  1. Read in the bike lanes dataset Bike_Lanes.csv using PROC IMPORT and call it bike.

  2. How many rows are are in the bike dataset? How many are complete cases? Hint: Use cmiss(of ALL) in a dataset to create a 0/1 variable to indicate if it complete or not and then use PROC MEANS to sum this 0/1 variable.

  3. Filter rows of bike that are NOT missing the route variable, assign this to the object have_route. Create a frequency table of the subType variable using PROC FREQ, including the missing subTypes.