Data Set WORK.PATIENTINFO
Obs | subj | name | gender | height | weight |
---|---|---|---|---|---|
1 | 1024 | Alice | 1 | 65 | 125 |
2 | 1167 | Maryann | 1 | 68 | 140 |
3 | 1168 | Thomas | 2 | 68 | 190 |
4 | 1201 | Benedict | 2 | 68 | 190 |
5 | 1302 | Felicia | 1 | 63 | 115 |
In this lecture, we will discuss how to read data into SAS from different file formats. By the end of this lecture you should be able to
We can enter data directly into SAS by using DATALINES or CARDS in a DATA step. We will use DATALINES in the next example to read in a small data set with patient info
This data is given in list format, so SAS will recognize a space as the separator between data fields.
DATA PatientInfo;
INPUT subj name $ gender height weight;
DATALINES;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benedictine 2 68 190
1302 Felicia 1 63 115
;
RUN;
PROC PRINT DATA = PatientInfo;
RUN;
Note that
DATA PatientInfo;
LENGTH name $11; *Name is now defined to be up to 11 characters long;
INPUT subj name $ gender height weight;
DATALINES;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benedictine 2 68 190
1302 Felicia 1 63 115
;
RUN;
PROC PRINT DATA = PatientInfo;
RUN;
If we look at the output of PROC CONTENTS, we can see that the name variable now has a length of 11 instead of 8.
PROC CONTENTS DATA = PatientInfo;
RUN;
We could also have used a character informat to specify the lenght. All character formats begin with \$. To speciy a length, we use \$<length>. which must end with the period. The : before the format says to read up to the first delimeter (a whitespace here) or 11 characters. This is how you specify variable length formats. Otherwise it will read 11 character everytime causing SAS to potentially read part of the next variable.
DATA PatientInfo;
INPUT subj name :$11. gender height weight;
DATALINES;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benedictine 2 68 190
1302 Felicia 1 63 115
;
RUN;
PROC PRINT DATA = PatientInfo (FIRSTOBS=4 OBS=4);
RUN;
The FIRSTOBS and OBS options tell SAS where to start and stop reading in the data set respectively.
In the previous examples, we only had one observation per line. If we have more than one observation per line or a single observation that spans multiple lines, we will need to use some other options to tell SAS how to read it correctly.
DATA PatientInfo;
INPUT subj name :$11. gender height weight;
DATALINES;
1024 Alice 1 65 125 1167 Maryann 1 68 140
1168 Thomas . 68 190 1201 Benedictine 2 68 190
1302 Felicia 1 . .
;
RUN;
PROC PRINT DATA = PatientInfo;
RUN;
Note that Maryann's and Benedictine's data was skipped. To fix this we need to use the trailing @@ in the INPUT statement. This tells SAS to hold on the the line if there is remaining data at the end of the DATA step iteration.
DATA PatientInfo;
INPUT subj name :$11. gender height weight @@;
DATALINES;
1024 Alice 1 65 125 1167 Maryann 1 68 140
1168 Thomas . 68 190 1201 Benedictine 2 68 190
1302 Felicia 1 . .
;
RUN;
PROC PRINT DATA = PatientInfo;
RUN;
A single trailing @ also tells SAS to hold onto the line but only until the end of the DATA step iteration. We can use the single trailing @ to make decisions about what we read in based on some portion of the data.
DATA TRAIL;
INPUT GENDER $1. @;
IF GENDER NE 'F' THEN DELETE;
INPUT +1 AGE 2.
+1 HEIGHT 2.;
DATALINES;
M 23 68
F 44 62
;
PROC PRINT DATA=TRAIL;
RUN;
In this example, we only wanted to read in data for female patients, so we only read in the first data value, which is the gender followed by the single trailing @ so that SAS doesn't discard the rest of the line yet. We then check the gender, and if it is female, we use another INPUT statement to finish reading in that line otherwise we delete it. The +1 option tells SAS to move one position over before starting to read the next data value.
Column formatted input specifies between which two column locations in a data file that a variable can be found. These files are highly structured.
DATA PatientInfo;
INPUT name $ 1-10 height 11-12 weight 13-15 age 16-17;
DATALINES;
Bob Smith 7019521
Sue Jones 6213218
Mary Lane 65 25
Joe Tesh 35
John Young68
;
RUN;
PROC PRINT DATA=PatientInfo;
RUN;
In the INPUT statement, we specify the columns where that variable will be found after the variable name.
----1----2----3----4 Bob Smith 7019521 Sue Jones 6213218 Mary Lane 65 25 Joe Tesh 35 John Young68
We can use column formats to skip variables as well.
DATA PatientInfo;
INPUT name $ 1-10 age 16-17;
DATALINES;
Bob Smith 7019521
Sue Jones 6213218
Mary Lane 65 25
Joe Tesh 35
John Young68
;
RUN;
PROC PRINT DATA=PatientInfo;
RUN;
The next example has observations that span multiple rows. To move between rows of an observation we have a few options we can use in the INPUT statement.
SAS will automatically go to the next line of the file if it runs out of data but has more variables to read.
DATA PatientInfo;
INPUT subj name :$11. gender height weight age;
DATALINES;
1024 Alice 1
65 125
18
1167 Maryann 1
68 140
38
1168 Thomas 2
68 190
25
1201 Benedictine 2
68 190
40
1302 Felicia 1
63 115
32
;
RUN;
PROC PRINT DATA = PatientInfo;
RUN;
But if we want to read the data in a different order we can use #n and @n to skip around.
DATA PatientInfo;
INPUT @6 name :$11. @1 subj #3 age #2 height weight #1 @17 gender;
DATALINES;
1024 Alice 1
65 125
18
1167 Maryann 1
68 140
38
1168 Thomas 2
68 190
25
1201 Benedictine 2
68 190
40
1302 Felicia 1
63 115
32
;
RUN;
PROC PRINT DATA = PatientInfo;
RUN;
Challenge¶
Write a DATA step that will read in the following datasets using DATALINES. This dataset contains information on >esophageal cancer in Ille-et-Vilaine, France. The variables are
- Age group
- Alcohol consumption
- Tobacco consumption
- Number of Cases
- Number of Controls
Write a DATA step to read in the data and print the results.
----1----2----3----4----5----6----7----8 75+ 0-39g/day 0-9g/day 1 18 75+ 0-39g/day 10-19 2 6 75+ 0-39g/day 30+ 1 3 75+ 40-79 0-9g/day 2 5 75+ 40-79 10-19 1 3 75+ 40-79 20-29 0 3 75+ 40-79 30+ 1 1 75+ 80-119 0-9g/day 1 1 75+ 80-119 10-19 1 1 75+ 120+ 0-9g/day 2 2 75+ 120+ 10-19 1 1
SAS has many built in formats for dates, money, decimals, time and other formats. See LSB for a list of some formats and the SAS documentation for other formats available. This example will explore a few formats for dates, numbers with commas and decimals, and character formats.
DATA formats_ex;
INPUT date1 mmddyy10.
@12 date2 anydtdte10.
@24 date3 date10.
/* date informats are tricky, these worked for me
but sometimes you might need a different informat */
@35 num1 comma6.
@44 num2 comma7.
@53 pct1 percent5.
@61 dec3
@61 dec 8.6
/* read in 6 decimals plus 2 more characters for
decimal point and one whole number */
@61 dec2 5.3
/* read in 3 decimals plus 2 more characters as before.
Does not round values to 3 decimal places! */
@71 char1 $char15.
@87 char2a $char15.
@87 char2b $15. /* Trims leading spaces in characters read */
;
/*
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0----+----1 */
DATALINES;
3/4/2010 4-Mar-10 4 Mar 10 76,604 $76,604 53.2% 0.991418 South Carolina South Carolina
3/11/2010 11-Mar-10 11 Mar 10 90,376 $90,376 0.8% 0.983612 North Carolina North Carolina
3/18/2010 18-Mar-10 18 Mar 10 96,263 $96,263 16.1% 0.888531 South Dakota South Dakota
3/25/2010 25-Mar-10 25 Mar 10 33,608 $33,608 3.3% 0.530136 North Dakota North Dakota
4/1/2010 1-Apr-10 1 Apr 10 26,577 $26,577 73.0% 0.672677 California California
4/8/2010 8-Apr-10 8 Apr 10 88,943 $88,943 8.5% 0.111815 Arizona Arizona
4/15/2010 15-Apr-10 15 Apr 10 49,988 $49,988 79.7% 0.958651 Georgia Georgia
4/22/2010 22-Apr-10 22 Apr 10 25,755 $25,755 48.7% 0.746895 Florida Florida
4/29/2010 29-Apr-10 29 Apr 10 53,169 $53,169 76.1% 0.802572 Mississippi Mississippi
5/6/2010 6-May-10 6 May 10 95,509 $95,509 91.5% 0.564680 Louisiana Louisiana
5/13/2010 13-May-10 13 May 10 86,442 $86,442 19.4% 0.896147 Alabama Alabama
5/20/2010 20-May-10 20 May 10 57,135 $57,135 67.8% 0.965187 Texas Texas
5/27/2010 27-May-10 27 May 10 33,458 $33,458 61.1% 0.588009 New York New York
6/3/2010 3-Jun-10 3 Jun 10 770 $770 93.4% 0.886857 New Jersey New Jersey
6/10/2010 10-Jun-10 10 Jun 10 65,739 $65,739 98.1% 0.858632 Deleware Deleware
6/17/2010 17-Jun-10 17 Jun 10 99,529 $99,529 0.0% 0.869562 Virginia Virginia
6/24/2010 24-Jun-10 24 Jun 10 79,056 $79,056 13.8% 0.673845 West Virginia West Virginia
7/1/2010 1-Jul-10 1 Jul 10 2,397 $2,397 59.4% 0.700637 Ohio Ohio
7/8/2010 8-Jul-10 8 Jul 10 66,614 $66,614 71.4% 0.027379 Idaho Idaho
;
RUN;
*unformatted results printed;
PROC PRINT DATA=formats_ex (FIRSTOBS=1 OBS=5);
RUN;
Note that using informats only tells SAS how to read in the data and store it, but does not store the format for printing. Dates are stored in SAS as the number of days since January 1, 1960.
PROC PRINT DATA=formats_ex (FIRSTOBS=1 OBS=5);
VAR date1 date2 date3 dec;
FORMAT date1 date9.
date2 mmddyy10.
date3 worddate20.
dec 4.2;
RUN;
Note that you must use a DATA set to permanently store formats for variables in SAS.
PROC PRINT DATA=formats_ex (FIRSTOBS=1 OBS=5);
RUN;
DATA formats_exb;
SET formats_ex;
FORMAT date1 date9.
date2 mmddyy10.
date3 worddate20.
dec 4.2;
RUN;
PROC PRINT DATA=formats_exb (FIRSTOBS=1 OBS=5);
VAR date1 date2 date3 dec;
RUN;
When reading data from a file in a DATA step, you use the INFILE statement to specify the file's location. For the next example, I will use the LIBNAME statement to create a (temporary) SAS library called myData. Data sets stored to this library will be permanent on my computer but the library name will not be remembered once the SAS session ends.
LIBNAME myData "H:\BiostatCourses\PublicHealthComputing\Lectures\Week2ReadinData\SAS";
Now, let's read in a data set from a csv file.
DATA myData.usairq;
INFILE "H:\BiostatCourses\PublicHealthComputing\Lectures\Week2ReadinData\SAS\usair_a.csv" DLM = "," FIRSTOBS = 2;
INPUT City :$20. SO2 Temp Factories Population Windspeed Rain RainyDays;
RUN;
PROC PRINT DATA = myData.usairq (OBS=5);
RUN;
The DLM = "," options tells SAS what the delimiter in the file is and the FIRSTOBS = 2 arguments tells SAS to skip the first line of the file, which contains the variable names. Another useful option for csv files is the DSD option in the INFILE statement. This tells SAS that the file is a csv, to read two consecutive commas as a missing value, and to ingore commas within quotes.
DATA PatientInfo;
INFILE DATALINES DSD FIRSTOBS = 2 OBS = 4;
INPUT name :$20. height weight age;
DATALINES;
"Smith, Bob", 70,195,21
"Jones, Sue", 62,132,18
"Lane, Mary", 65,,25
Joe Tesh Main,,,35
John Young,68,,
;
RUN;
PROC PRINT DATA=PatientInfo;
RUN;
Challenge¶
The following dataset contains information on esophageal cancer in Ille-et-Vilaine, France. The variables are
- Age group
- Alcohol consumption
- Tobacco consumption
- Number of Cases
- Number of Controls
Write a DATA step to read in the data using the INFILE statement and print the results.
75+,0-39g/day,0-9g/day,1,18 75+,0-39g/day,10-19,2,6 75+,0-39g/day,30+,1,3 75+,40-79,0-9g/day,2,5 75+,40-79,10-19,1,3 75+,40-79,20-29,0,3 75+,40-79,30+,1,1 75+,80-119,0-9g/day,1,1 75+,80-119,10-19,1,1 75+,120+,0-9g/day,2,2 75+,120+,10-19,1,1The esoph.csv file can be found on the CANVAS page for the course.
PROC IMPORT can be used to read in data files as well. Let's look at a few of the options with PROC IMPORT:
Other possible options:
PROC IMPORT DATAFILE="H:\BiostatCourses\PublicHealthComputing\Lectures\Week2ReadinData\SAS\usair_a.xlsx" out=usair
DBMS=xlsx REPLACE;
RUN;
PROC PRINT DATA = usair (OBS = 5);
RUN;
Another way to create a permanent SAS dataset is to directly refernce a file on your computer.
DATA "H:\BiostatCourses\PublicHealthComputing\Lectures\Week2ReadinData\SAS\usair2";
SET usair;
RUN;
PROC PRINT DATA = "H:\BiostatCourses\PublicHealthComputing\CourseNotes\SASchpts\chapt2Read\usair2" (OBS=5);
RUN;
We can write data to a file in a DATA step by using the FILE and PUT statements. FILE is used to specify the name and location of the file to write to along with other options such as making it a delimited file. The PUT statement includes the variables and optionally locations in the file of variables to write. Let's write the patient info to dataset to a file.
DATA PatientInfo;
INPUT subj name :$11. gender height weight;
DATALINES;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benedictine 2 68 190
1302 Felicia 1 63 115
;
RUN;
PROC PRINT DATA = PatientInfo;
RUN;
DATA _NULL_; /* Don't create a new dataset */
SET PatientInfo;
FILE "H:\BiostatCourses\PublicHealthComputing\Lectures\Week2ReadinData\SAS\PatientInfo.dat" DSD;
PUT subj name gender height weight;
RUN;
We could alos make it column formatted.
DATA _NULL_; /* Don't create a new dataset */
SET PatientInfo;
FILE "H:\BiostatCourses\PublicHealthComputing\Lectures\Week2ReadinData\SAS\PatientInfo2.dat";
PUT subj @6 name @18 gender @20 height @23 weight;
RUN;