SAS: Reading and Writing Data¶

In this lecture, we will discuss how to read data into SAS from different file formats. By the end of this lecture you should be able to

Use the INFILE statement and options with a DATA step
Read in data using DATALINES/CARDS
Read in data given in list or column format
Read in data from delimited files such as csv files
Be able to use PROC IMPORT or the import wizard to read in excel files
Be able to use informats

DATALINES/CARDS¶

We can enter data directly into SAS by using DATALINES or CARDS in a DATA step. We will use DATALINES in the next example to read in a small data set with patient info

Subject ID
Name
Gender - 1 = Female 2 = Male
Height
Weight

This data is given in list format, so SAS will recognize a space as the separator between data fields.

DATA PatientInfo;
  INPUT subj name $ gender height weight;
  DATALINES;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benedictine 2 68 190
1302 Felicia 1 63 115
;
RUN;

PROC PRINT DATA = PatientInfo;
RUN;

Note that

$ following a variable tells SAS that it is a character variable
When printed Benedictine was read in as Benedict. This is because by default SAS character varaibles are 8 characters long. To fix this, we can use informats or the length statement.

DATA PatientInfo;
  LENGTH name $11; *Name is now defined to be up to 11 characters long;
  INPUT subj name $ gender height weight;
  DATALINES;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benedictine 2 68 190
1302 Felicia 1 63 115
  ;
RUN;

PROC PRINT DATA = PatientInfo;
RUN;

If we look at the output of PROC CONTENTS, we can see that the name variable now has a length of 11 instead of 8.

PROC CONTENTS DATA = PatientInfo;
RUN;

We could also have used a character informat to specify the lenght. All character formats begin with \$. To speciy a length, we use \$<length>. which must end with the period. The : before the format says to read up to the first delimeter (a whitespace here) or 11 characters. This is how you specify variable length formats. Otherwise it will read 11 character everytime causing SAS to potentially read part of the next variable.

DATA PatientInfo;
  INPUT subj name :$11. gender height weight;
  DATALINES;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benedictine 2 68 190
1302 Felicia 1 63 115
  ;
RUN;

PROC PRINT DATA = PatientInfo (FIRSTOBS=4 OBS=4);
RUN;

The FIRSTOBS and OBS options tell SAS where to start and stop reading in the data set respectively.

In the previous examples, we only had one observation per line. If we have more than one observation per line or a single observation that spans multiple lines, we will need to use some other options to tell SAS how to read it correctly.

DATA PatientInfo;
  INPUT subj name :$11. gender height weight;
  DATALINES;
1024 Alice 1 65 125 1167 Maryann 1 68 140
1168 Thomas . 68 190 1201 Benedictine 2 68 190
1302 Felicia 1 . .
  ;
RUN;

PROC PRINT DATA = PatientInfo;
RUN;

Note that Maryann's and Benedictine's data was skipped. To fix this we need to use the trailing @@ in the INPUT statement. This tells SAS to hold on the the line if there is remaining data at the end of the DATA step iteration.

DATA PatientInfo;
  INPUT subj name :$11. gender height weight @@;
  DATALINES;
1024 Alice 1 65 125 1167 Maryann 1 68 140
1168 Thomas . 68 190 1201 Benedictine 2 68 190
1302 Felicia 1 . .
  ;
RUN;

PROC PRINT DATA = PatientInfo;
RUN;

A single trailing @ also tells SAS to hold onto the line but only until the end of the DATA step iteration. We can use the single trailing @ to make decisions about what we read in based on some portion of the data.

DATA TRAIL; 
   INPUT GENDER $1. @; 
   IF GENDER NE 'F' THEN DELETE; 
      INPUT +1 AGE 2. 
            +1 HEIGHT 2.; 
DATALINES; 
M 23 68 
F 44 62 
; 
PROC PRINT DATA=TRAIL; 
RUN;

In this example, we only wanted to read in data for female patients, so we only read in the first data value, which is the gender followed by the single trailing @ so that SAS doesn't discard the rest of the line yet. We then check the gender, and if it is female, we use another INPUT statement to finish reading in that line otherwise we delete it. The +1 option tells SAS to move one position over before starting to read the next data value.

Column Formatted Input¶

Column formatted input specifies between which two column locations in a data file that a variable can be found. These files are highly structured.

DATA PatientInfo;
INPUT name $ 1-10 height 11-12 weight 13-15 age 16-17;
DATALINES; 
Bob Smith 7019521
Sue Jones 6213218
Mary Lane 65   25
Joe Tesh       35
John Young68 
;
RUN;

PROC PRINT DATA=PatientInfo;
RUN;

In the INPUT statement, we specify the columns where that variable will be found after the variable name.

----1----2----3----4
Bob Smith 7019521
Sue Jones 6213218
Mary Lane 65   25
Joe Tesh       35
John Young68

We can use column formats to skip variables as well.

DATA PatientInfo;
INPUT name $ 1-10 age 16-17;
DATALINES;
Bob Smith 7019521
Sue Jones 6213218
Mary Lane 65   25
Joe Tesh       35
John Young68 
;
RUN;

PROC PRINT DATA=PatientInfo;
RUN;

The next example has observations that span multiple rows. To move between rows of an observation we have a few options we can use in the INPUT statement.

#n - go to line n
/ - Go to the next line
@n - go to column n in the current line

SAS will automatically go to the next line of the file if it runs out of data but has more variables to read.

DATA PatientInfo;
  INPUT subj name :$11. gender height weight age;
  DATALINES;
1024 Alice 1 
65 125
18
1167 Maryann 1 
68 140
38
1168 Thomas 2 
68 190
25
1201 Benedictine 2 
68 190
40
1302 Felicia 1 
63 115
32
  ;
RUN;

PROC PRINT DATA = PatientInfo;
RUN;

But if we want to read the data in a different order we can use #n and @n to skip around.

DATA PatientInfo;
  INPUT @6 name :$11. @1 subj #3 age #2 height weight #1 @17 gender;
  DATALINES;
1024 Alice       1 
65 125
18
1167 Maryann     1 
68 140
38
1168 Thomas      2 
68 190
25
1201 Benedictine 2 
68 190
40
1302 Felicia     1 
63 115
32
  ;
RUN;

PROC PRINT DATA = PatientInfo;
RUN;

Challenge¶
Write a DATA step that will read in the following datasets using DATALINES. This dataset contains information on >esophageal cancer in Ille-et-Vilaine, France. The variables are

Age group

Alcohol consumption

Tobacco consumption

Number of Cases

Number of Controls

Write a DATA step to read in the data and print the results.
----1----2----3----4----5----6----7----8
75+ 0-39g/day 0-9g/day      1        18
75+ 0-39g/day    10-19      2         6
75+ 0-39g/day      30+      1         3
75+     40-79 0-9g/day      2         5
75+     40-79    10-19      1         3
75+     40-79    20-29      0         3
75+     40-79      30+      1         1
75+    80-119 0-9g/day      1         1
75+    80-119    10-19      1         1
75+      120+ 0-9g/day      2         2
75+      120+    10-19      1         1

Informats¶

SAS has many built in formats for dates, money, decimals, time and other formats. See LSB for a list of some formats and the SAS documentation for other formats available. This example will explore a few formats for dates, numbers with commas and decimals, and character formats.

DATA formats_ex;
INPUT date1       mmddyy10.         
      @12 date2   anydtdte10.
      @24 date3   date10. 
	  /* date informats are tricky, these worked for me 
	  but sometimes you might need a different informat */
      @35 num1    comma6.
      @44 num2    comma7.
      @53 pct1    percent5.
	  @61 dec3     
      @61 dec     8.6  
      /* read in 6 decimals plus 2 more characters for 
         decimal point and one whole number */
      @61 dec2    5.3  
      /* read in 3 decimals plus 2 more characters as before. 
         Does not round values to 3 decimal places!  */
      @71 char1   $char15.
      @87 char2a  $char15.
      @87 char2b  $15.        /* Trims leading spaces in characters read */
;
/*
----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0----+----1 */
DATALINES;
3/4/2010   4-Mar-10    4 Mar 10   76,604   $76,604  53.2%   0.991418  South Carolina  South Carolina
3/11/2010  11-Mar-10   11 Mar 10  90,376   $90,376  0.8%    0.983612  North Carolina  North Carolina
3/18/2010  18-Mar-10   18 Mar 10  96,263   $96,263  16.1%   0.888531  South Dakota      South Dakota
3/25/2010  25-Mar-10   25 Mar 10  33,608   $33,608  3.3%    0.530136  North Dakota      North Dakota
4/1/2010   1-Apr-10    1 Apr 10   26,577   $26,577  73.0%   0.672677  California          California
4/8/2010   8-Apr-10    8 Apr 10   88,943   $88,943  8.5%    0.111815  Arizona                Arizona
4/15/2010  15-Apr-10   15 Apr 10  49,988   $49,988  79.7%   0.958651  Georgia                Georgia
4/22/2010  22-Apr-10   22 Apr 10  25,755   $25,755  48.7%   0.746895  Florida                Florida
4/29/2010  29-Apr-10   29 Apr 10  53,169   $53,169  76.1%   0.802572  Mississippi        Mississippi
5/6/2010   6-May-10    6 May 10   95,509   $95,509  91.5%   0.564680  Louisiana            Louisiana
5/13/2010  13-May-10   13 May 10  86,442   $86,442  19.4%   0.896147  Alabama                Alabama
5/20/2010  20-May-10   20 May 10  57,135   $57,135  67.8%   0.965187  Texas                    Texas
5/27/2010  27-May-10   27 May 10  33,458   $33,458  61.1%   0.588009  New York              New York
6/3/2010   3-Jun-10    3 Jun 10   770      $770     93.4%   0.886857  New Jersey          New Jersey
6/10/2010  10-Jun-10   10 Jun 10  65,739   $65,739  98.1%   0.858632  Deleware              Deleware
6/17/2010  17-Jun-10   17 Jun 10  99,529   $99,529  0.0%    0.869562  Virginia              Virginia
6/24/2010  24-Jun-10   24 Jun 10  79,056   $79,056  13.8%   0.673845  West Virginia    West Virginia
7/1/2010   1-Jul-10    1 Jul 10   2,397    $2,397   59.4%   0.700637  Ohio                      Ohio
7/8/2010   8-Jul-10    8 Jul 10   66,614   $66,614  71.4%   0.027379  Idaho                    Idaho
;
RUN;

*unformatted results printed;
PROC PRINT DATA=formats_ex (FIRSTOBS=1 OBS=5);
RUN;

Note that using informats only tells SAS how to read in the data and store it, but does not store the format for printing. Dates are stored in SAS as the number of days since January 1, 1960.

PROC PRINT DATA=formats_ex (FIRSTOBS=1 OBS=5);
VAR date1 date2 date3 dec;
FORMAT date1 date9.
       date2 mmddyy10.
       date3 worddate20.
       dec 4.2;
RUN;

Note that you must use a DATA set to permanently store formats for variables in SAS.

PROC PRINT DATA=formats_ex (FIRSTOBS=1 OBS=5);
RUN;

DATA formats_exb;
SET formats_ex;
FORMAT date1 date9.
       date2 mmddyy10.
       date3 worddate20.
       dec 4.2;
RUN;

PROC PRINT DATA=formats_exb (FIRSTOBS=1 OBS=5);
VAR date1 date2 date3 dec;
RUN;

Delimited Files¶

When reading data from a file in a DATA step, you use the INFILE statement to specify the file's location. For the next example, I will use the LIBNAME statement to create a (temporary) SAS library called myData. Data sets stored to this library will be permanent on my computer but the library name will not be remembered once the SAS session ends.

LIBNAME myData "H:\BiostatCourses\PublicHealthComputing\Lectures\Week2ReadinData\SAS";

Now, let's read in a data set from a csv file.

DATA myData.usairq;
INFILE "H:\BiostatCourses\PublicHealthComputing\Lectures\Week2ReadinData\SAS\usair_a.csv" DLM = "," FIRSTOBS = 2;
INPUT City :$20. SO2 Temp Factories Population Windspeed Rain RainyDays;
RUN;

PROC PRINT DATA = myData.usairq (OBS=5);
RUN;

The DLM = "," options tells SAS what the delimiter in the file is and the FIRSTOBS = 2 arguments tells SAS to skip the first line of the file, which contains the variable names. Another useful option for csv files is the DSD option in the INFILE statement. This tells SAS that the file is a csv, to read two consecutive commas as a missing value, and to ingore commas within quotes.

DATA PatientInfo;
INFILE DATALINES DSD FIRSTOBS = 2 OBS = 4;
INPUT name :$20. height weight age;
DATALINES;
"Smith, Bob", 70,195,21
"Jones, Sue", 62,132,18
"Lane, Mary", 65,,25
Joe Tesh Main,,,35
John Young,68,, 
;
RUN;

PROC PRINT DATA=PatientInfo;
RUN;

Challenge¶
The following dataset contains information on esophageal cancer in Ille-et-Vilaine, France. The variables are

Age group

Alcohol consumption

Tobacco consumption

Number of Cases

Number of Controls

Write a DATA step to read in the data using the INFILE statement and print the results.
75+,0-39g/day,0-9g/day,1,18
75+,0-39g/day,10-19,2,6
75+,0-39g/day,30+,1,3
75+,40-79,0-9g/day,2,5
75+,40-79,10-19,1,3
75+,40-79,20-29,0,3
75+,40-79,30+,1,1
75+,80-119,0-9g/day,1,1
75+,80-119,10-19,1,1
75+,120+,0-9g/day,2,2
75+,120+,10-19,1,1
The esoph.csv file can be found on the CANVAS page for the course.

PROC IMPORT¶

PROC IMPORT can be used to read in data files as well. Let's look at a few of the options with PROC IMPORT:

DBMS - specifies what type of file you are reading in (e.g. excel)
REPLACE - (YES/NO) overwrite an existing SAS dataset if one exists
OUT - name of SAS dataset to create
DATAFILE - path of file to read in

Other possible options:

DELIMITER - specifiy a delimiter for a delimited file
GETNAMES - (YES/NO) specify if the first row of data contains variable names
GUESSINGROWS - Number of rows to read before determining lengths and data types of variables

PROC IMPORT DATAFILE="H:\BiostatCourses\PublicHealthComputing\Lectures\Week2ReadinData\SAS\usair_a.xlsx" out=usair
            DBMS=xlsx REPLACE;
RUN;

PROC PRINT DATA = usair (OBS = 5);
RUN;

Another way to create a permanent SAS dataset is to directly refernce a file on your computer.

DATA "H:\BiostatCourses\PublicHealthComputing\Lectures\Week2ReadinData\SAS\usair2";
SET usair;
RUN;

PROC PRINT DATA = "H:\BiostatCourses\PublicHealthComputing\CourseNotes\SASchpts\chapt2Read\usair2" (OBS=5);
RUN;

Writing Data to a File¶

We can write data to a file in a DATA step by using the FILE and PUT statements. FILE is used to specify the name and location of the file to write to along with other options such as making it a delimited file. The PUT statement includes the variables and optionally locations in the file of variables to write. Let's write the patient info to dataset to a file.

DATA PatientInfo;
  INPUT subj name :$11. gender height weight;
  DATALINES;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benedictine 2 68 190
1302 Felicia 1 63 115
  ;
RUN;

PROC PRINT DATA = PatientInfo;
RUN;

DATA _NULL_; /* Don't create a new dataset */
SET PatientInfo;
FILE "H:\BiostatCourses\PublicHealthComputing\Lectures\Week2ReadinData\SAS\PatientInfo.dat" DSD;
PUT subj name gender height weight;
RUN;

We could alos make it column formatted.

DATA _NULL_; /* Don't create a new dataset */
SET PatientInfo;
FILE "H:\BiostatCourses\PublicHealthComputing\Lectures\Week2ReadinData\SAS\PatientInfo2.dat";
PUT subj @6 name @18 gender @20 height @23 weight;
RUN;

Data Set Name	WORK.PATIENTINFO	Observations	5
Member Type	DATA	Variables	5
Engine	V9	Indexes	0
Created	08/23/2017 11:09:20	Observation Length	48
Last Modified	08/23/2017 11:09:20	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	WINDOWS_64
Encoding	wlatin1 Western (Windows)

Engine/Host Dependent Information
Data Set Page Size	65536
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	1361
Obs in First Data Page	5
Number of Data Set Repairs	0
ExtendObsCounter	YES
Filename	C:\Users\rlp176\AppData\Local\Temp\SAS Temporary Files\_TD7332_PHHP-DDH4LH2_\Prc2\patientinfo.sas7bdat
Release Created	9.0401M3
Host Created	X64_7PRO

Obs	date1	date2	date3	num1	num2	pct1	dec3	dec	dec2	char1	char2a	char2b
1	18325	18325	18325	76604	76604	0.532	0.99142	0.99142	0.991	South Carolina	South Carolina	South Carolina
2	18332	18332	18332	90376	90376	0.008	0.98361	0.98361	0.983	North Carolina	North Carolina	North Carolina
3	18339	18339	18339	96263	96263	0.161	0.88853	0.88853	0.888	South Dakota	South Dakota	South Dakota
4	18346	18346	18346	33608	33608	0.033	0.53014	0.53014	0.530	North Dakota	North Dakota	North Dakota
5	18353	18353	18353	26577	26577	0.730	0.67268	0.67268	0.672	California	California	California

Obs	date1	date2	date3	dec
1	04MAR2010	03/04/2010	March 4, 2010	0.99
2	11MAR2010	03/11/2010	March 11, 2010	0.98
3	18MAR2010	03/18/2010	March 18, 2010	0.89
4	25MAR2010	03/25/2010	March 25, 2010	0.53
5	01APR2010	04/01/2010	April 1, 2010	0.67

Obs	date1	date2	date3	num1	num2	pct1	dec3	dec	dec2	char1	char2a	char2b
1	18325	18325	18325	76604	76604	0.532	0.99142	0.99142	0.991	South Carolina	South Carolina	South Carolina
2	18332	18332	18332	90376	90376	0.008	0.98361	0.98361	0.983	North Carolina	North Carolina	North Carolina
3	18339	18339	18339	96263	96263	0.161	0.88853	0.88853	0.888	South Dakota	South Dakota	South Dakota
4	18346	18346	18346	33608	33608	0.033	0.53014	0.53014	0.530	North Dakota	North Dakota	North Dakota
5	18353	18353	18353	26577	26577	0.730	0.67268	0.67268	0.672	California	California	California

Obs	subj	name	gender	height	weight
1	1024	Alice	1	65	125
2	1167	Maryann	1	68	140
3	1168	Thomas	2	68	190
4	1201	Benedict	2	68	190
5	1302	Felicia	1	63	115

Alphabetic List of Variables and Attributes
#	Variable	Type	Len
3	gender	Num	8
4	height	Num	8
1	name	Char	11
2	subj	Num	8
5	weight	Num	8

Obs	name	height	weight	age
1	Bob Smith	70	195	21
2	Sue Jones	62	132	18
3	Mary Lane	65	.	25
4	Joe Tesh	.	.	35
5	John Young	68	.	.

Obs	City	SO2	Temp	Factories	Population	Windspeed	Rain	RainyDays
1	Phoenix	10	70.3	213	582	6.0	7.05	36
2	Little Rock	13	61.0	91	132	8.2	48.52	100
3	San Francisco	12	56.7	453	716	8.7	20.66	67
4	Denver	17	51.9	454	515	9.0	12.95	86
5	Hartford	56	49.1	412	158	9.0	43.37	127

SAS: Reading and Writing Data¶

DATALINES/CARDS¶

SAS Output

The PRINT Procedure

Data Set WORK.PATIENTINFO

SAS Output

The PRINT Procedure

Data Set WORK.PATIENTINFO

SAS Output

The CONTENTS Procedure

WORK.PATIENTINFO

Attributes

Engine/Host Information

Variables

SAS Output

The PRINT Procedure

Data Set WORK.PATIENTINFO

SAS Output

The PRINT Procedure

Data Set WORK.PATIENTINFO

SAS Output

The PRINT Procedure

Data Set WORK.PATIENTINFO

SAS Output

The PRINT Procedure

Data Set WORK.TRAIL

Column Formatted Input¶

SAS Output

The PRINT Procedure

Data Set WORK.PATIENTINFO

SAS Output

The PRINT Procedure

Data Set WORK.PATIENTINFO

SAS Output

The PRINT Procedure

Data Set WORK.PATIENTINFO

SAS Output

The PRINT Procedure

Data Set WORK.PATIENTINFO

Challenge¶

Informats¶

SAS Output

The PRINT Procedure

Data Set WORK.FORMATS_EX

SAS Output

The PRINT Procedure

Data Set WORK.FORMATS_EX

SAS Output

The PRINT Procedure

Data Set WORK.FORMATS_EX

SAS Output

The PRINT Procedure

Data Set WORK.FORMATS_EXB

Delimited Files¶

SAS Output

The PRINT Procedure

Data Set MYDATA.USAIRQ

SAS Output

The PRINT Procedure

Data Set WORK.PATIENTINFO

Challenge¶

PROC IMPORT¶

SAS Output

The PRINT Procedure

Data Set WORK.USAIR

SAS Output

The PRINT Procedure

Data Set H:\BiostatCourses\PublicHealthComputing\CourseNotes\SASchpts\chapt2Read\usair2

Writing Data to a File¶

SAS Output

The PRINT Procedure

Data Set WORK.PATIENTINFO