{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Cleaning\n", "\n", "In this lesson, we will learn some basic techniques to check our data for invalid inputs. One of the first and most important steps in any data processing task is to verify that your data values are correct or, at the very least, conform to some a set of rules. For example, a variable called GENDER would be expected to have only two values; a variable representing height in inches would be expected to be within reasonable limits. Some critical applications require a double entry and verification process of data entry. Whether this is done or not, it is still useful to run your data through a series of data checking operations.\n", "\n", "For this lesson, we will work through part of Ron Cody's paper Data Cleaning 101. For the examples, we will use a small dataset with patient data stored in the raw data file PAITENTS.TXT (see the course webpage's data folder for the dataset). This dataset contains the following variables.\n", "\n", "|Variable Name | Description | Variable Type | Valid Values |\n", "| --- | --- | --- | --- |\n", "| PATNO | Patient Number | Character | Numerals |\n", "| GENDER | Gender | Character | 'M' or 'F' |\n", "| VISIT | Visit Date | MMDDYY10. | Any Valid Date |\n", "| HR | Heart Rate | Numeric | 40 to 100 |\n", "| SBP | Systolic Blood Pressure | Numeric | 80 to 200 |\n", "| DBP | Diastolic Blood Pressure | Numeric | 60 to 120 |\n", "| DX | Diagnosis Code | Character | 1 to 3 digits |\n", "| AE | Adverse Event | Character | '0' or '1' |" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SAS Connection established. Subprocess id is 5001\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ObsPATNOGENDERVISITHRSBPDBPDXAE
1001M11/11/1998881408010
2002F11/13/19988412078X0
3003X10/21/19986819010031
4004F01/01/19991012001205A
5XX5M05/07/1998681208010
6006  06/15/1999721026861
7007M.88148102  0
8008F08/08/1998210..70
9009M09/25/19998624018041
10010F10/19/1999.4012010
11011M.683002041
12012M10/12/19986012274  0
13013208/23/199974108641  
14014M02/02/19992213090  1
15002F11/13/19988412078X0
16003M11/12/19995811274  0
17015F.821488831
18017F04/05/1999208.8420
19019M06/07/19995811870  0
20123M.60..10
21321F.90040020051
22020F.10208  0
23022M10/10/1999481148221
24023F12/31/1998223478  0
25024F11/09/1998761208010
26025M01/01/1999741026851
27027F..16610670
28028F03/28/1998661509030
29029M05/15/1998...41
30006F07/07/1999821488410
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA PATIENTS; \n", "INFILE \"./data/patients.txt\" PAD; \n", "INPUT @1 PATNO $3. \n", " @4 GENDER $1. \n", " @5 VISIT MMDDYY10. \n", " @15 HR 3. \n", " @18 SBP 3. \n", " @21 DBP 3. \n", " @24 DX $3. \n", " @27 AE $1.; \n", "LABEL PATNO = \"PATIENT NUMBER\" \n", " GENDER = \"GENDER\" \n", " VISIT = \"VISIT DATE\" \n", " HR = \"HEART RATE\" \n", " SBP = \"SYSTOLIC BLOOD PRESSURE\" \n", " DBP = \"DIASTOLIC BLOODPRESSURE\" \n", " DX = \"DIAGNOSIS CODE\" \n", " AE = \"ADVERSE EVENT?\"; \n", "FORMAT VISIT MMDDYY10.; \n", "RUN; \n", "\n", "PROC PRINT DATA = patients;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking for Missing Data\n", "\n", "A we have already seen, SAS has separate missing values for character and numeric data:\n", "\n", "* Missing numeric data is represented by a period (.)\n", "* Missing character data is represented by a single empty space (' ') or the null string ('').\n", "\n", "We can check to see if a particular value is missing by using = and comparing it to the corresponding missing data value or we can use the **missing** function, which checks to see if a value is missing and returns 1 if it is missing and 0 otherwise.\n", "\n", "Some other missing data functions in SAS include:\n", "\n", "* **nmiss** - Can be given a list of numeric variables and counts the number of variables that contain missing data.\n", "* **cmiss** - Can be given character or numeric variables and counts the number of variables that contain missing data.\n", "\n", "Other useful checks for missing data include using PROC FREQ to count the number of missing data values in a character variable and PROC MEANS with the nmiss option to count the number of missing values in a numeric variable.\n", "\n", "
\n", "

Example

\n", "

The following SAS program prints the rows of PATIENTS in which the GENDER or VISIT are missing

\n", "
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ObsPATNOGENDERVISITHRSBPDBPDXAE
6006  06/15/1999721026861
7007M.88148102  0
11011M.683002041
17015F.821488831
20123M.60..10
21321F.90040020051
22020F.10208  0
27027F..16610670
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC PRINT data = patients;\n", " WHERE gender = ' ' OR visit = .;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Alternatively, we could use the missing function to check to see if GENDER or VISIT are missing valued.

\n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ObsPATNOGENDERVISITHRSBPDBPDXAE
6006  06/15/1999721026861
7007M.88148102  0
11011M.683002041
17015F.821488831
20123M.60..10
21321F.90040020051
22020F.10208  0
27027F..16610670
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC PRINT data = patients;\n", " WHERE missing(gender) OR missing(visit);\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following SAS function illustrates using the cmiss function to find the rows that include complete cases, i.e. none of the variables contain missing data.

\n", "
" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ObsPATNOGENDERVISITHRSBPDBPDXAE
1001M11/11/1998881408010
2002F11/13/19988412078X0
3003X10/21/19986819010031
4004F01/01/19991012001205A
5XX5M05/07/1998681208010
6009M09/25/19998624018041
7002F11/13/19988412078X0
8022M10/10/1999481148221
9024F11/09/1998761208010
10025M01/01/1999741026851
11028F03/28/1998661509030
12006F07/07/1999821488410
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA patients_cc;\n", " SET patients;\n", " IF cmiss(of _ALL_) ~= 0 THEN DELETE;\n", "RUN;\n", "\n", "PROC PRINT data = patients_cc;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

We have used the special SAS variable _ALL_ to list all the variables in our dataset in the cmiss function call. This allows us to check all variables for missing data in a given row without needed to list out all the variables manually or with a variable list. Since we have a mix of numeric and character variables, we use cmiss, but if we had only had numeric variables (or were only checking numeric variables by using the _NUMERIC_ SAS variable) then we could have used nmiss. A row is complete if the number of missing values is 0.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking for Invalid Character Values\n", "\n", "One simple technique to check a character variable, if there are a limited number of categories, for invalid input is to use PROC FREQ. Creating a frequency table of the character variable will list out all the oberved categories along with the frequency of occurence.\n", "\n", "
\n", "

Example

\n", "

The following SAS program uses PROC FREQ to check the GENDER, DX, and AE variables for invalid input. Recall that these have certain rules for valid values listed in the variable description table above.

\n", "
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Frequency Counts

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
GENDER
GENDERFrequency
Frequency Missing = 1
21
F14
M13
X1
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
DIAGNOSIS CODE
DXFrequency
Frequency Missing = 7
17
22
33
43
53
61
72
X2
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ADVERSE EVENT?
AEFrequency
Frequency Missing = 1
018
110
A1
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data = patients;\n", " TITLE \"Frequency Counts\";\n", " TABLES GENDER DX AE / NOCUM NOPERCENT;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note that GENDER has a few miscoded values such as 2 and X, DX has a miscoded value of X, AE has a miscoded value of A, and all have at least one missing value. We would need to revisit the patient data to see if we can determine what these should have been coded as.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using a DATA Step to Indetify Invalid Characters\n", "\n", "We can also use a DATA step with IF statments to identify observations that do not meet our expected patterns.\n", "\n", "
\n", "

Example

\n", "

The following SAS program checks GENDER, DX, and AE for invalid character codes using IF statments in combination with PUT and VERIFY.

\n", "
" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Listing Invalid Input

\n", "
\n", "
\n", "
\n",
       "PATNO=002 DX=X                                                                                                                      \n",
       "PATNO=003 GENDER=X                                                                                                                  \n",
       "PATNO=004 AE=A                                                                                                                      \n",
       "PATNO=013 GENDER=2                                                                                                                  \n",
       "PATNO=002 DX=X                                                                                                                      \n",
       "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA _NULL_;\n", " SET Patients;\n", " FILE print; *Print to output window instead of LOG window;\n", " TITLE \"Listing Invalid Input\";\n", " ***CHECK GENDER;\n", " IF GENDER NOT IN ('F','M',' ') THEN\n", " PUT PATNO= GENDER=;\n", " ***CHECK DX;\n", " **Verify returns position of first character that is not ' ' or 1-9;\n", " IF VERIFY(DX,' 0123456789') NE 0 THEN \n", " PUT PATNO= DX=;\n", " ***CHECK AE;\n", " IF AE NOT IN ('0','1',' ') THEN \n", " PUT PATNO= AE=;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The _NULL_ dataset keyword allows us to read data from the PATIENTS dataset without actually creating a new dataset. The FILE PRINT statement tells SAS to print the output of the PUT statements to the output window instead of the log window. Becuase we are expected only 'M', 'F', or ' ' for valid values of GENDER, we can check for these specific values with an IF statement and output and rows that have an invalid value. We do the same thing to check AE where we check for the valid values of '0', '1' or ' '. For DX, we know it should only include numeric values between 0 and 9 in an up to 3 digit character string. The VERIFY function returns the location first character in DX that is not the search string of the numerals 0 to 9 or a blank space. If DX only contains valid values from this search string, then VERIFY returns 0.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to identify miscoded character values in a DATA step is by using user defined formats to group all unexpected codes into a 'miscoded' group.\n", "\n", "
\n", "

Example

\n", "

The following SAS program uses our known valid values for the variables GNDER, DX and AE to define formats that translate any unexpected codes into 'miscoded', so that we can identify the rows with miscoded values.

\n", "
" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Listing Invalid Input

\n", "
\n", "
\n", "
\n",
       "PATNO=002 DX=X                                                                                                                      \n",
       "PATNO=003 GENDER=X                                                                                                                  \n",
       "PATNO=004 AE=A                                                                                                                      \n",
       "PATNO=013 GENDER=2                                                                                                                  \n",
       "PATNO=002 DX=X                                                                                                                      \n",
       "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FORMAT; \n", " VALUE $GENDER 'F','M' = 'VALID' \n", " ' ' = 'MISSING' \n", " OTHER = 'MISCODED'; \n", " VALUE $DX '001' - '999' = 'VALID' \n", " ' ' = 'MISSING' \n", " OTHER = 'MISCODED'; \n", " VALUE $AE '0','1' = 'VALID' \n", " ' ' = 'MISSING' \n", " OTHER = 'MISCODED'; \n", "RUN; \n", "\n", "DATA _NULL_;\n", " SET Patients;\n", " FILE print; *Print to output window instead of LOG window;\n", " TITLE \"Listing Invalid Input\";\n", " ***CHECK GENDER;\n", " IF PUT(GENDER, $GENDER.) = 'MISCODED'\n", " THEN PUT PATNO= GENDER=;\n", " ***CHECK DX;\n", " IF PUT(DX, $DX.) = 'MISCODED'\n", " THEN PUT PATNO= DX=;\n", " ***CHECK AE;\n", " IF PUT(AE, $AE.) = 'MISCODED'\n", " THEN PUT PATNO= AE=;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The PUT function applies the format on the fly and outputs the translated value to be compared to 'MISCODED'. If the value falls into the OTHER category of the translation and becomed 'MISCODED', then the PATNO and value of the miscoded variable are output.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Checking for Invalid Numeric Values\n", "\n", "The techniques for checking for invalid numeric data are quite different from the techniques we used with character data. Although there are usually many different values a numeric variable can take on, there are several techniques that we can use to help identify data errors. One simple technique is to examine some of the largest and smallest data values for each numeric variable. If we see values such as 12 or 1200 for a systolic blood pressure, we can be quite certain that an error was made, either in entering the data values or on the original data collection form. \n", "\n", "There are also some internal consistency methods that can be used to identify possible data errors. If we see that most of the data values fall with a certain range of values, then any values that fall far enough outside that range may be data errors. We will develop programs based on these ideas in this section.\n", "\n", "The simplest way to check to see if there are some missing values is to use PROC MEANS or PROC UNIVARIATE to check to see if you have missing values or values that are out of the valid range for a numeric variable.\n", "\n", "
\n", "

Example

\n", "

In the following SAS program, we use PROC MEANS to see if there are any values of HR, SBP or DBP that are too large or too small or missing by using the nmiss, min and max statistics.

\n", "
" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

CHECKING NUMERIC VARIABLES

\n", "
\n", "
\n", "

The MEANS Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
VariableLabelNN MissMinimumMaximum
\n", "
\n", "
HR
\n", "
SBP
\n", "
DBP
\n", "
\n", "
\n", "
\n", "
HEART RATE
\n", "
SYSTOLIC BLOOD PRESSURE
\n", "
DIASTOLIC BLOODPRESSURE
\n", "
\n", "
\n", "
\n", "
27
\n", "
26
\n", "
27
\n", "
\n", "
\n", "
\n", "
3
\n", "
4
\n", "
3
\n", "
\n", "
\n", "
\n", "
10
\n", "
20
\n", "
8
\n", "
\n", "
\n", "
\n", "
900
\n", "
400
\n", "
200
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC MEANS DATA=PATIENTS N NMISS MIN MAX MAXDEC=0; \n", " TITLE \"CHECKING NUMERIC VARIABLES\"; \n", " VAR HR SBP DBP; \n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

From the output, we can clearly see that there are SOME values of HR, SBP and DBP that are out the valid range, but we don't know which ones or how many there are that are invalid.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to character variables, there are two simple ways that we can find out WHICH rows have potentially invalid input:\n", "\n", "* we can use IF statements to check for the values outside the expected ranges\n", "* we can use formats to identify potentially miscoded values\n", "\n", "
\n", "

Example

\n", "

The following example explores these two options of using IF statements and FORMATS to identify values of HR, SBP and DBP that are out of the expected ranges. We begin with the IF statements.

\n", "
" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

LISTING OF INVALID DATA VALUES

\n", "
\n", "
\n", "
\n",
       "PATNO=004 HR=101                                                                                                                    \n",
       "PATNO=008 HR=210                                                                                                                    \n",
       "PATNO=009 SBP=240                                                                                                                   \n",
       "PATNO=009 DBP=180                                                                                                                   \n",
       "PATNO=010 SBP=40                                                                                                                    \n",
       "PATNO=011 SBP=300                                                                                                                   \n",
       "PATNO=011 DBP=20                                                                                                                    \n",
       "PATNO=014 HR=22                                                                                                                     \n",
       "PATNO=017 HR=208                                                                                                                    \n",
       "PATNO=321 HR=900                                                                                                                    \n",
       "PATNO=321 SBP=400                                                                                                                   \n",
       "PATNO=321 DBP=200                                                                                                                   \n",
       "PATNO=020 HR=10                                                                                                                     \n",
       "PATNO=020 SBP=20                                                                                                                    \n",
       "PATNO=020 DBP=8                                                                                                                     \n",
       "PATNO=023 HR=22                                                                                                                     \n",
       "PATNO=023 SBP=34                                                                                                                    \n",
       "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA _NULL_; \n", " SET patients;\n", " FILE PRINT; *SEND OUTPUT TO THE OUTPUT WINDOW; \n", " TITLE \"LISTING OF INVALID DATA VALUES\"; \n", " *NOTE: WE WILL ONLY INPUT THOSE VARIABLES OF INTEREST; \n", " *CHECK HR; \n", " IF (HR LT 40 AND HR NE .) OR HR GT 100 \n", " THEN PUT PATNO= HR=; \n", " *CHECK SBP; \n", " IF (SBP LT 80 AND SBP NE .) OR SBP GT 200 \n", " THEN PUT PATNO= SBP=; \n", " *CHECK DBP; \n", " IF (DBP LT 60 AND DBP NE .) OR DBP GT 120 \n", " THEN PUT PATNO= DBP=; \n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note that the checks for values below some cutoff also check that the value is not missing since a value of missing (.) is interpreted as negative infinity in inequalities.

\n", "

An alternative way to do this is with FORMATS as shown below in the following SAS program.

\n", "
" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

LISTING OF INVALID DATA VALUES

\n", "
\n", "
\n", "
\n",
       "PATNO=004 HR=101                                                                                                                    \n",
       "PATNO=008 HR=210                                                                                                                    \n",
       "PATNO=009 SBP=240                                                                                                                   \n",
       "PATNO=009 DBP=180                                                                                                                   \n",
       "PATNO=010 SBP=40                                                                                                                    \n",
       "PATNO=011 SBP=300                                                                                                                   \n",
       "PATNO=011 DBP=20                                                                                                                    \n",
       "PATNO=014 HR=22                                                                                                                     \n",
       "PATNO=017 HR=208                                                                                                                    \n",
       "PATNO=321 HR=900                                                                                                                    \n",
       "PATNO=321 SBP=400                                                                                                                   \n",
       "PATNO=321 DBP=200                                                                                                                   \n",
       "PATNO=020 HR=10                                                                                                                     \n",
       "PATNO=020 SBP=20                                                                                                                    \n",
       "PATNO=020 DBP=8                                                                                                                     \n",
       "PATNO=023 HR=22                                                                                                                     \n",
       "PATNO=023 SBP=34                                                                                                                    \n",
       "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FORMAT;\n", " VALUE HR_CK 40-100, . = 'OK'; *Values between 40 and 100 and . are OK;\n", " VALUE SBP_CK 80-200, . = 'OK'; *Value between 80 and 200 and . are OK;\n", " VALUE DBP_CK 60-120, . = 'OK'; *Value between 60 and 120 and . are OK;\n", "RUN;\n", "\n", "/* The PUT function applies the given format to the given variable */\n", "DATA _NULL_;\n", " SET Patients;\n", " FILE print;\n", " IF PUT(HR,HR_CK.) NE 'OK' \n", " THEN PUT PATNO= HR=;\n", " IF PUT(SBP,SBP_CK.) NE 'OK' \n", " THEN PUT PATNO= SBP=;\n", " IF PUT(DBP,DBP_CK.) NE 'OK' \n", " THEN PUT PATNO= DBP=;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

This is a fairly simple and efficient program. The user-defined formats HR_CK, SBP_CK, and DBP_CK all assign the format ‘OK’ for any data value in the acceptable range. In the DATA step, the PUT function is used to test if a value outside the valid range was found. For example, a value of 22 for heart rate would not fall within the range of 40 to 100 or missing and the format OK would not be assigned. The result of the PUT function for heart rate is not equal to ‘OK’ and the argument of the IF statement is true. The appropriate PUT statement is then executed and the invalid value is printed to the print file.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is of course only a basic introduction to methods for checking your data. See for example, Ron Cody's Cody's Data Cleaning Techniques Using SAS for more details on data cleaning with SAS.\n", "\n", "## Exercises\n", "\n", "1. Read in the bike lanes dataset Bike_Lanes.csv using PROC IMPORT and call it bike.\n", "2. How many rows are are in the bike dataset? How many are complete cases? Hint: Use cmiss(of _ALL_) in a dataset to create a 0/1 variable to indicate if it complete or not and then use PROC MEANS to sum this 0/1 variable.\n", "3. Filter rows of bike that are NOT missing the `route` variable, assign this to the object `have_route`. Create a frequency table of the `subType` variable using PROC FREQ, including the missing `subType`s." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "SAS", "language": "sas", "name": "sas" }, "language_info": { "codemirror_mode": "sas", "file_extension": ".sas", "mimetype": "text/x-sas", "name": "sas" } }, "nbformat": 4, "nbformat_minor": 2 }