4. Data I/O¶
In order to be able to analyze our data, we need to be able to read it into a data set that our SAS software understands. In this lesson, we learn methods that will work in many situations to get your data read into a SAS data set. Keep in mind though that when reading data into a SAS data set, you always need to tell SAS three things:
where your data reside — Are they instream data, that is, embedded within your SAS program? Are they contained in a raw data file external to your SAS program? Or are they contained already in another SAS data set?
the format of the data — Are the data values arranged in neatly defined columns so that they can be read in using column input? Are the data values separated by at least one blank space so that they can be read in using list input? Do the data values contain special characters so that they must be read in using formatted input?
the kind of SAS data set that you want to create — Do you want to create a permanent SAS data set? Or do you want to create a temporary SAS data set?
In this lesson, we will learn how to tackle all of these cases. We will learn how to read in both instream, external data from a raw data file (e.g. a csv file), or from another SAS dataset. We’ll learn how to use column, list and formatted input. And, we’ll learn how to create both temporary and permanent SAS datasets.
4.1. Reading Instream Data¶
Although the title of this section is reading instream data, it is hard to focus on just one method of reading data into SAS. As discussed in the introduction to this lesson, every time we read data into a SAS data set, we need to tell SAS three things — where our data reside, the form of the data, and the kind of SAS data set that we want to create. Let’s jump right in and take a look at an example.
Example
The following SAS program illustrates how to create a temporary SAS data set called temp1 to read instream data using column input and then prints the resulting dataset with a title:DATA temp1;
input subj 1-4 gender 6 height 8-9 weight 11-13;
DATALINES;
1024 1 65 125
1167 1 68 140
1168 2 68 190
1201 2 72 190
1302 1 63 115
;
RUN;
PROC PRINT data=temp1;
title 'Output dataset: TEMP1';
RUN;
SAS Connection established. Subprocess id is 10685
Obs | subj | gender | height | weight |
---|---|---|---|---|
1 | 1024 | 1 | 65 | 125 |
2 | 1167 | 1 | 68 | 140 |
3 | 1168 | 2 | 68 | 190 |
4 | 1201 | 2 | 72 | 190 |
5 | 1302 | 1 | 63 | 115 |
The DATALINES statement indicates that lines of data are to follow. All lines following DATALINES are treated as data until a single semicolon is encountered. This semicolon indicates the end of the dataset. Note that DATALINES must occur last in the DATA step. Anything after the semicolon ending the DATALINES statement will not be executed by SAS.
The INPUT statement identifies the columns that we want to read from the input data, and the names of the SAS variables to be created from the data values.
The RUN statement tells SAS to execute the DATA step. The PROC PRINT statement tells SAS to print the entire temporary dataset temp1.
The key things to not about the program are:
- The DATALINES statement is the statement that you must use to tell SAS to expect instream data. The DATALINES statement:
- Must be the last statement to appear in the DATA step (that is, except for the RUN statement)
- Must immediately precede the data lines
- Must be closed by a null statement (that is, a single semicolon). Only one DATALINES statement can appear in a DATA step
- The **INPUT** statement is the statement that you must use to tell SAS the form of the data. Here, we use what is called column input, because the data values are in specific columns. Recall that standard numeric data values can contain only numbers, decimal points, numbers in scientific notation (e.g., 3.1E5), and plus or minus signs.
In general, for each field of raw data that you want to read into your SAS data set, you must specify the following information in the INPUT statement:
If you intend for the variable to be a character variable, place one blank space and then a dollar sign ($) right after the variable’s name in the INPUT statement. None of the variables in our data set are character variables, and therefore no dollar signs appear in the INPUT statement in our program. As our INPUT statement informs SAS, the subject number (subj) begins in column 1 and ends in column 4, gender occupies just column 6, the subject’s height begins in column 8 and ends in column 9, and the subject’s weight begins in column 11 and ends in column 13. You might want to count the columns out from left to right to convince yourself that we’ve defined the fields correctly.
- standard character or numeric values, and
- arranged in neatly defined columns.
- a valid SAS variable name,
- a type (character or numeric),
- and the number of the column in which the field starts and the number of the column in which the field ends, separated by a dash (-).
- The **DATA** statement is the statement that you must use to tell SAS whether the data set that you intend to create should be temporary or permanent. We'll learn more about temporary and permanent data sets in the lesson pages that follow. Know for now .... errrrr, that is, trust me? ... that the above DATA statement tells SAS to create a temporary data set called temp1. Okay, I'll let the cat out of the bag a little bit ... the DATA statement tells SAS that temp1 should be treated as temporary by specifying what is called a one-level name, such as temp1, rather than a two-level name, such as stat480.temp1. Okay, best to stop there. The key thing for now is to know that, because temp1 is a temporary data set, it exists only until the end of your current SAS session. That is, once you close out your SAS session, the SAS data set is removed from memory, and would have to be created again if you needed to use it again.
Launch and run the SAS program in your SAS environment. Then, as always, view the log window first to see if SAS displays any errors from running the code. Then, view the output window. You should see a display of the data set that arises from the PRINT procedure in our code.
Note that the CARDS statement is an alias for the DATALINES statement. That is, we could have alternatively entered the data by replacing the “DATALINES;” statement with a “CARDS;” statement. In your program editor, replace “DATALINES;” with “CARDS;” and rerun your program to convince yourself that this is indeed true.
One more thing … if any of your data values contain semicolons, the DATALINES statement will not work. Instead you must replace the DATALINES statement with a DATALINES4 statement, and the null statement with a single semicolon (;) with a null statement containing four semicolons (;;;;). Strange, I know.
4.2. SAS Data Libraries¶
Before we can really get a handle on the distinction between temporary and permanent SAS data sets, we need to understand what SAS refers to as SAS data libraries. In short, a SAS library is simply a collection of SAS files that are stored in the same folder or directory on your computer. Other files can be stored in the same folder or directory, but only the files that have SAS file extensions are recognized as part of the SAS library.
Depending on the library name that you use when you create a SAS file, such as a SAS data set, SAS files are stored either temporarily or permanently as follows:
When creating a SAS file, if you use the library name Work or you don’t specify a library name at all, then the file is stored in a temporary SAS library called Work. When you close out the SAS session in which you created the SAS file, the temporary library and all of its files are removed from your computer’s memory.
If you use a library name other than the default library name Work when creating a SAS file, then the file is stored permanently until you delete it. That is, you can use permanent SAS libraries and their contents in subsequent SAS sessions.
4.2.1. Referencing Permanent SAS Files¶
Regardless of whether a SAS data set is temporary or permanent, SAS always refers to the data set by a two-level name:
libref.filename
In the two-level name, libref is the (nick)name that you gave the SAS data library that contains the SAS data set, and filename is the name of the file itself. For example, if we have a library called PHC6089 with a SAS dataset called yts, then we would see this if you are using SAS for Windows or UF Apps or if you are using SAS University Edition or OnDemand Academics, then you will see something like this.
In order to print the permanent dataset called YTS that is stored in the SAS library called PHC6089, we have to refer to the permanently stored SAS dataset as:
PHC6089.YTS
as in the following code:
PROC PRINT data = PHC6089.YTS;
RUN;
In this example, we know the SAS dataset called YTS is permanent because the SAS library name is PHC6089 and not Work.
4.2.2. Referencing Temporary SAS Files¶
As stated previously, regardless of whether a SAS data set is temporary or permanent, SAS always refers to the data set by a two-level name:
libref.filename
Now, we know that the libref of a temporary data set is always Work. Therefore, we can refer to any SAS data set stored temporarily in Work by:
Work.filename
where filename is the name of the file itself. For example, if we have a temporary dataset called temp1, then you will see something like this if you are using windows SAS or UF apps and something like this if you are using SAS Univeristy Edition or OnDemand Academics.
In order to print the temporary SAS data set called Temp1 that is stored in the default temporary SAS library called Work, we can refer to the temporarily stored SAS data set as:
Work.Temp1
as in the following code:
PROC PRINT data = work.temp1;
RUN;
Now, although SAS always refers to SAS data sets by their two-level names, it doesn’t mean you have to do the same! In this case, SAS lets you take a shortcut by using just the one-level name:
filename
When you specify a one-level name, the default libref Work is assumed. For example, rather than using the previous code to print the temporary SAS data set called Temp1, we could use the following code:
PROC PRINT data = Temp1;
RUN;
4.2.3. Defining Libraries¶
We have just one more thing to tend to before we complete our discussion of SAS libraries, and that is how to define a library. To do so, we use a LIBNAME statement to tell SAS to associate a a name — called a libref
— to one of the folders or directories on your computer that contains SAS files. For example, I have a subdirectory on my computer:
folders/myfolders/SAS_Notes/data
that contains the SAS dataset yts.sas7bdat.
To tell SAS to associate a libref PHC6089 with this subdirectory, we simply use a LIBNAME
statement as follows:
LIBNAME PHC6089 "/folders/myfolders/SAS_Notes/data";
In general, the libref can be a nickname of your choosing, as long as it is between 1 and 8 characters long, begins with a letter or underscore, and contains only letters, numbers, or underscores. The specification of the physical name of the library, of course, adheres to the conventions of the Windows operating system. You can use as many LIBNAME statements in a program as necessary to assign the librefs that you need.
Now, after submitting our LIBNAME statement, we can (and should!) look at the log window to verify that SAS assigned the libref successfully. If all went well, we should see a message similar to this in this log window:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 libname PHC6089 "/folders/myfolders/SAS_Notes/data";
NOTE: Libref PHC6089 was successfully assigned as follows:
Engine: V9
Physical Name: /folders/myfolders/SAS_Notes/data
74
75
76 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
88
Because the LIBNAME statement is a global statement, our PHC6089 libref remains in effect until we have modified it, canceled it, or ended the SAS session. That is, the LIBNAME statement assigns the libref for the current SAS session only. Each time you begin a SAS session, you must assign a libref to each permanent SAS data library that contains files that you want to access in that SAS session.
4.2.4. Reading Data into Permanent SAS Datasets¶
Rather than having created a temporary data set in the previous Example that disappears at the end of our SAS session, we could have just as easily created a permanent data set that instead gets stored in a directory of our choosing and is therefore available to us permanently. Let’s try that!
Example
The following SAS program illustrates how to create a permanent SAS data set called PHC6089.temp2 to read instream data using column input:
The following SAS program illustrates the simplest example of column input.
LIBNAME PHC6089 "/folders/myfolders/SAS_Notes/data";
*Specifies the SAS data library (directory);
DATA PHC6089.temp2;
input subj 1-4 gender 6 height 8-9 weight 11-13;
DATALINES;
1024 1 65 125
1167 1 68 140
1168 2 68 190
1201 2 72 190
1302 1 63 115
;
RUN;
PROC PRINT data=PHC6089.temp2;
title 'Output dataset: PHC6089.TEMP2';
RUN;
Obs | subj | gender | height | weight |
---|---|---|---|---|
1 | 1024 | 1 | 65 | 125 |
2 | 1167 | 1 | 68 | 140 |
3 | 1168 | 2 | 68 | 190 |
4 | 1201 | 2 | 72 | 190 |
5 | 1302 | 1 | 63 | 115 |
Note that the only differences between this program and the program in previous Example is that:
- there is now a LIBNAME statement. This is how we tell SAS to which of our directories we want the data permanently written. Here, the SAS data set is written to our /folders/myfolders/SAS_Notes/data directory.
- the name of the data set appearing in the DATA statement is now a two-level name. This is how SAS knows to put the permanent dataset temp2 in our /folders/myfolders/SAS_Notes/data directory.
- in subsequent procedures, such as the PRINT procedure here, we refer to the permanent data set by its two-level name.
Launch and run the SAS program. Don't forget that you must make sure that you have already created the /folders/myfolders/SAS_Notes/data subdirectory on your computer (or at least some other appropriately placed and named subdirectory). After running the program, note that:
- the contents and structure of phc6089.temp2 is identical to the contents and structure of work.temp1. The only thing that differs between the two datasets is that temp1 is temporary and temp2 is permanent.
- in the Explorer Window, the dataset temp2 appears in the phc6089 library indicating that the data set is permanent.
- using your Windows Explorer, you can see that the SAS data set is permanently stored in your /folders/myfolders/SAS_Notes/data directory.
4.3. Reading From a Raw Data File¶
Thus far, we’ve only looked at examples in which we’ve read instream data into SAS data sets. Now, let’s direct our attention to learning how to read data from a raw data file into SAS data sets.
A raw data file is an external text file whose records contain data values that are arranged in fields. Typical filename extensions of raw data files are .dat and .txt. A raw data file (also commonly called an ASCII file) is the kind of data file that you would view using your Notepad or Wordpad software. Data sets stored in spreadsheets, such as Microsoft’s Excel, are binary, not raw ASCII data files.
Example
The following SAS program illustrates how to create a temporary SAS data set called temp3 by using column input to read in data stored in a raw data file called temp3.dat:
The following SAS program illustrates the simplest example of column input.
DATA temp3;
infile '/folders/myfolders/SAS_Notes/data/temp3.dat';
input subj 1-4 gender 6 height 8-9 weight 11-13;
RUN;
PROC PRINT data=temp3;
title 'Output dataset: TEMP3';
RUN;
Obs | subj | gender | height | weight |
---|---|---|---|---|
1 | 1024 | 1 | 65 | 125 |
2 | 1167 | 1 | 68 | 140 |
3 | 1168 | 2 | 68 | 190 |
4 | 1201 | 2 | 72 | 190 |
5 | 1302 | 1 | 63 | 115 |
Notice that the INFILE statement, which must precede the INPUT statement, merely replaces the DATALINES statement and the data that appeared in the previous two examples. The INFILE statement tells SAS where the raw data file is stored on your computer. The name and location of the raw data file must appear in single quotes and immediately follow the INFILE keyword. In this case, our raw data file temp3.dat is stored in the directory "/folders/myfolders/SAS_Notes/data". As you can see, the data values are the only items stored in the file.
In order to run this program, first save the temp3.dat file (see the CANVAS website for this file) to a convenient location on your computer. Then, edit the INFILE statement as necessary to reflect the correct location. Finally, launch and run the SAS program. Review the log and output windows to convince yourself that the data were properly read into the temporary data set temp3.
Instead of identifying the raw data file by specifying the entire filename and location in the INFILE statement, we can alternatively use what is called a fileref (for file reference). Just as we use a LIBNAME statement to assign a libref, we use a FILENAME statement to assign a filref. Filerefs perform the same function as librefs. That is, they temporarily point to a storage location for data. However, librefs point to SAS data libraries, whereas filerefs point to external data files.
Example
The following SAS program illustrates the use of a fileref in the INFILE statement, in conjunction with a FILENAME statement, to read data stored in a raw data file called temp3.dat to create a temporary SAS data set called temp4:
The following SAS program illustrates the simplest example of column input.
FILENAME patients '/folders/myfolders/SAS_Notes/data/temp3.dat';
DATA temp4;
infile patients;
input subj 1-4 gender 6 height 8-9 weight 11-13;
RUN;
PROC PRINT data = temp4;
title 'Output dataset: TEMP4';
RUN;
Obs | subj | gender | height | weight |
---|---|---|---|---|
1 | 1024 | 1 | 65 | 125 |
2 | 1167 | 1 | 68 | 140 |
3 | 1168 | 2 | 68 | 190 |
4 | 1201 | 2 | 72 | 190 |
5 | 1302 | 1 | 63 | 115 |
The FILENAME statement in our program assigns the fileref patients to our temp3.dat file stored in our /folders/myfolders/SAS_Notes/data/ folder. In general, the fileref can be a nickname of our choosing, as long as it is between 1 and 8 characters long, begins with a letter or underscore, and contains only letters, numbers, or underscores. The specification of the physical name of the file, of course, adheres to the conventions of the Windows operating system.
If you haven't already done so for the previous example, save the temp3.dat file to a convenient location on your computer. Then, edit the FILENAME statement as necessary to reflect the correct location. Finally, launch and run the SAS program. Review the log and output windows to convince yourself that the data were properly read into the temporary data set temp4.
4.4. Reading Column Input¶
As mentioned in the introduction to this lesson, there are three different styles of input that are available to us in SAS. They are:
column input, which is the most commonly used style, allows you to read data values that are entered in fixed columns.
list input, which allows you to read data by simply listing the variable names in the INPUT statement. At least one space (or character) must occur between each value in the data set.
formatted input, which allows you to read numeric data containing special characters, such as dates and dollar amounts.
In this section, we will take a look at two simple examples of column input. In later sections, we will spend some time investigating list input and formatted input.
A couple of comments. For the sake of the examples that follow, we’ll use the DATALINES statement to read in data. We could have just as easily used the INFILE statement to illustrate each point. Additionally, we’ll create temporary data sets rather than permanent ones, even though we could have just as easily created permanent data sets to illustrate each point. Finally, after each SAS DATA step, we’ll use the SAS print procedure (PROC PRINT) to print the resulting SAS data set for your perusal.
Column input allows you to read variable values that occupy the same columns within each record. To use column input, list the variable names in the INPUT statement, immediately following each variable name with its corresponding column positions in each of the data lines. (Of course, you’ll need to follow each character variable with a dollar sign ($) first.) Column input can be used whenever your raw data are in fixed columns and in standard character or numeric format. Column input reads data values until it reaches the last specified column for the field.
The important points to note about column input are:
When using column input, you are not required to indicate missing values with a placeholder, such as a period. That is, missing values can be left as blank.
Column input uses the columns specified to determine the length of character variables, thereby allowing the character values to exceed the default 8 characters and to have embedded spaces.
Column input allows fields to be skipped altogether or to be read in any order.
Column input allows only part of a value to be read and allows values to be re-read.
Spaces are not required between the data values.
Example
The following SAS program illustrates the simplest example of column input.DATA temp;
input subj 1-4 name $ 6-23 gender 25 height 27-28 weight 30-32;
CARDS;
1024 Alice Smith 1 65 125
1167 Maryann White 1 68 140
1168 Thomas Jones 2 68 190
1201 Benedictine Arnold 2 68 190
1302 Felicia Ho 1 63 115
;
RUN;
PROC PRINT data=temp;
title 'Output dataset: TEMP';
RUN;
Obs | subj | name | gender | height | weight |
---|---|---|---|---|---|
1 | 1024 | Alice Smith | 1 | 65 | 125 |
2 | 1167 | Maryann White | 1 | 68 | 140 |
3 | 1168 | Thomas Jones | 2 | 68 | 190 |
4 | 1201 | Benedictine Arnold | 2 | 68 | 190 |
5 | 1302 | Felicia Ho | 1 | 63 | 115 |
First, inspect the SAS code to make sure you understand how to set up the INPUT statement for column input.
The data points are line up in columns. Although the data values are separated by a space for easy readability, they need not be when using column input. Column # 1 starts at the left-most column on the page.
The INPUT statement tells SAS that the subject number appears in columns #1-4. Name is a character variable that appears in columns #6-23, so name is character variable of length 18. Using column input, character variables can have embedded spaces. The gender variable is a character variable that appears in column #25. The height variable appears in columns #27-28. The weight variable appears in columns #30-32.
Then, launch and run the SAS program.
Finally, review the output from the print procedure to convince yourself that the data are read in properly.
Example
The following SAS program illustrates some of the key features of column input:DATA temp;
input init $ 6 f_name $ 6-16 l_name $ 18-23
weight 30-32 height 27-28;
CARDS;
1024 Alice Smith 1 65 125
1167 Maryann White 1 68 140
1168 Thomas Jones 2 190
1201 Benedictine Arnold 2 68 190
1302 Felicia Ho 1 63 115
;
RUN;
PROC PRINT data=temp;
title 'Output dataset: TEMP';
RUN;
Obs | init | f_name | l_name | weight | height |
---|---|---|---|---|---|
1 | A | Alice | Smith | 125 | 65 |
2 | M | Maryann | White | 140 | 68 |
3 | T | Thomas | Jones | 190 | . |
4 | B | Benedictine | Arnold | 190 | 68 |
5 | F | Felicia | Ho | 115 | 63 |
First, inspect the SAS code so that you can become familiar with some of the features of column input. Then, launch and run the SAS program.
Review the output from the print procedure to convince yourself that the data are read in properly. Note that the position of the variables within the temporary data set temp corresponds to the order in which the variables appear in the input statement, not the order in which the variables appear in the data set. Note that the first column from the dataset is not read in to the dataset temp.
4.5. Reading List Input¶
In the previous section, we investigated how to use column input to read data values that appear in neatly define columns looking like this, say:
DATALINES;
Smith 8145551354 3.89
Washington 8145569847 2.73
Wing 8145359376 3.56
Jackson 8145557437 3.12
;
Here, we’ll investigate how to use list input to read in free-format data looking like this, say:
Smith 8145551354 3.89
Washington 8145569847 2.73
Wing 8145359376 3.56
Jackson 8145557437 3.12
List input might be the easiest input style to use because, as shown in the examples that follow, you simply list the variable names in the same order as the corresponding raw data fields. The ease comes with a bit of a price, however. Because you do not have to tell SAS the columns in which the data values appear, you must take note of the following restrictions:
Fields must be separated by at least one blank (or other delimiters such as a comma).
Fields must be read in order from left to right.
You cannot skip or re-read fields.
Missing values must be represented by a place holder such as a period. (A blank field causes the matching of variable names and values to get out of sync.)
Character values can’t contain embedded blanks.
The default length of character values is 8 bytes. A longer value is truncated when it is written to the data set NOTE! 1 byte = 1 character
Data must be in standard character or numeric format.
Example
The following SAS program illustrates the simplest example of list input. Note that there is one blank space between each of the data values. Also note that although the data values need not be lined up in columns, we still recommend doing so because of the difficulty otherwise in "eyeing" the data quickly.
DATA temp;
input subj name $ gender height weight;
* The $ that follows name tells SAS that it is
a character variable;
* By default, name only allows up to 8 characters
to be read in;
CARDS;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benedictine 2 72 190
1302 Felicia 1 63 115
;
RUN;
PROC PRINT data=temp;
title 'Output dataset: TEMP';
RUN;
Obs | subj | name | gender | height | weight |
---|---|---|---|---|---|
1 | 1024 | Alice | 1 | 65 | 125 |
2 | 1167 | Maryann | 1 | 68 | 140 |
3 | 1168 | Thomas | 2 | 68 | 190 |
4 | 1201 | Benedict | 2 | 72 | 190 |
5 | 1302 | Felicia | 1 | 63 | 115 |
The INPUT statement is how you tell SAS to read in the data using list input. For list input, simply list the variable names — leaving at least one space between names — in the order in which the variables appear in the data file. Remember to use the dollar sign (\$) to distinguish character variables from numeric variables.
Launch and run the SAS program. Review the output from the print procedure to convince yourself that the data are read in properly with the exception of row 4. Note that Benedictine is read in as Benedict because of the default character length of 8. We will learn how to fix this later with informats.
Example
The following SAS program illustrates the necessary use of the numeric missing value (.) placeholder when a data value is missing:
DATA temp;
input subj name $ gender height weight;
CARDS;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benny 2 . 190
1302 Felicia 1 63 115
;
RUN;
PROC PRINT data=temp;
title 'Output dataset: TEMP';
RUN;
Obs | subj | name | gender | height | weight |
---|---|---|---|---|---|
1 | 1024 | Alice | 1 | 65 | 125 |
2 | 1167 | Maryann | 1 | 68 | 140 |
3 | 1168 | Thomas | 2 | 68 | 190 |
4 | 1201 | Benny | 2 | . | 190 |
5 | 1302 | Felicia | 1 | 63 | 115 |
Note that Benny's height is missing. Therefore, since we are using the list input style to read in the data, we have to put in a missing value (.) placeholder.
First, launch and run the SAS program. Review the output from the print procedure to convince yourself that the data are read in properly. Then, edit the program by deleting the missing value (.) placeholder. Rerun the SAS program to see what happens when you fail to account for the missing value. In the log file, you should see a note that says:
NOTE: SAS went to a new line when INPUT statement reached past the end of a line
And the resulting output:
DATA temp;
input subj name $ gender height weight;
CARDS;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benny 2 190
1302 Felicia 1 63 115
;
RUN;
PROC PRINT data=temp;
title 'Output dataset: TEMP';
RUN;
Obs | subj | name | gender | height | weight |
---|---|---|---|---|---|
1 | 1024 | Alice | 1 | 65 | 125 |
2 | 1167 | Maryann | 1 | 68 | 140 |
3 | 1168 | Thomas | 2 | 68 | 190 |
4 | 1201 | Benny | 2 | 190 | 1302 |
should indicate that something has clearly gone awry. What is going on here is that, by default, SAS goes to the next data line to find more data if there are more variable names in the INPUT statement then there are values in the data line. In this case, Benny's height becomes 190, the first number to appear in the data line after gender, and Benny's weight becomes 1302, the first number to appear in the next data line.
Example
The following SAS program illustrates how you can use the DELIMITER option of the INFILE statement to use values separators other than blanks. This example, in particular, illustrates it for the commonly used comma (,) as a delimiter:
DATA temp;
infile cards delimiter=',';
input subj name $ gender height weight;
CARDS;
1024,Alice,1,65,125
1167,Maryann,1,68,140
1168,Thomas,2,68,190
1201,Benny,2,.,190
1302,Felicia,1,63,115
;
RUN;
PROC PRINT data=temp;
title 'Output dataset: TEMP';
RUN;
Obs | subj | name | gender | height | weight |
---|---|---|---|---|---|
1 | 1024 | Alice | 1 | 65 | 125 |
2 | 1167 | Maryann | 1 | 68 | 140 |
3 | 1168 | Thomas | 2 | 68 | 190 |
4 | 1201 | Benny | 2 | . | 190 |
5 | 1302 | Felicia | 1 | 63 | 115 |
By default, SAS assumes data are space-delimited. The DELIMITER option of the INFILE statement here instead warns SAS that the data are comma-delimited — that is, that commas separate the data values rather than blank spaces. You might also have noted that although the INFILE statement typically directs SAS to externally-stored data, here the CARDS (or equivalently DATALINES) option included in the INFILE statement alerts SAS that the data are actually included in the code. Launch and run the SAS program. Review the output from the print procedure to convince yourself that the data are indeed read in properly.
4.6. Reading Formatted Input¶
The fundamental difference between column input, which we studied in the previous lesson, and formatted input, which we’ll explore now, is that column input is only appropriate for reading standard numeric data, while formatted input allows us to read both standard and nonstandard numeric data. That is, formatted input combines the features of column input with the ability to read nonstandard data values.
Standard numeric data values. Recall that standard numeric data contain only:
numbers
decimal points
numbers using scientific (E) notation
negative (minus) and positive (plus) signs
Examples of standard numeric values include: 26, 3.9, -13, +3.14, 314E-2, and 2.193E3 (where E stands for scientific notation i.e. E3 stands for \(\times 10^3\)).
Nonstandard numeric data values. On the other hand, nonstandard numeric data values include:
values that contain special characters, such as dollar signs ($), percent signs (%), and commas (,)
date and time values
data in fraction, integer binary, real binary, and hexadecimal forms
Examples of nonstandard numeric values include: 23.3%, $1.26, and 03/07/47.
4.6.1. The INPUT Statement¶
Here’s the general form of the INPUT statement when using formatted input:
INPUT <pointer-control> variable informat.;
where:
pointer-control tells SAS at what column to start reading the data value
variable is the name of the variable being created
informat is a special instruction that tells SAS how to read the raw data values
A couple of things here. The above INPUT statement is written using standard SAS Help notation. For example, the pointer-control appears in brackets (<>) only to indicate that it is optional, or rather, not necessary for every variable you create. For example, you need not tell SAS to go to column 1 if that’s where you want to start reading data values, because that’s where SAS starts by default. In practice, the brackets (<>) should not appear in the INPUT statements in your program … otherwise, SAS will hiccup.
There are two pointer controls that we’ll learn about here:
The @n pointer control moves the input pointer a specific column number n
The +n pointer control moves the input pointer forward n columns to a column number that is relative to the current position
Again, an informat is what is used to tell SAS what special instructions are required to read in the raw data values. Many (too many?) special informats are available in SAS. For example, the numeric informat “mmddyy8.” tells SAS that you want to read in a date that takes up to 8 spaces and looks like 10/27/05. The numeric informat “comma6.” tells SAS that you want to read in a number that contains a comma and takes up to 6 spaces (including the comma), such as the number 11,387.
Let’s take a look at an example!
Example
The following SAS program uses the @n column pointer control and standard numeric informats to read three numeric variables — subj, height, and weight — into a temporary SAS data set called temp:
DATA temp;
input @1 subj 4.
@27 height 2.
@30 weight 3.;
DATALINES;
1024 Alice Smith 1 65 125 12/1/95 2,036
1167 Maryann White 1 68 140 12/01/95 1,800
1168 Thomas Jones 2 190 12/2/95 2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia Ho 1 63 115 1/1/96 1,972
;
RUN;
PROC PRINT data = temp;
title 'Output dataset: TEMP';
RUN;
Obs | subj | height | weight |
---|---|---|---|
1 | 1024 | 65 | 125 |
2 | 1167 | 68 | 140 |
3 | 1168 | . | 190 |
4 | 1201 | 68 | 190 |
5 | 1302 | 63 | 115 |
If you look at the INPUT statement, you'll see that it uses @n absolute pointer controls to move the input pointer first to column 1 to read subj, then to column 27 to read height, and finally to column 30 to read weight. In general, the @ moves the pointer to column n, the first column of the field that is being read.
The 4. that appears after subj, the 2. that appears after height, and 3. that appears after weight are the informats that tell SAS how to read each of the raw data values. In each case here, we are trying to read in standard numeric data values, and so we use what is called the w.d informat. The w tells SAS the width of the raw data value, that is how many columns it occupies. The d, which is optional, tells SAS the number of implied decimal places for the raw data value. The w and d must be connected by a period (.) delimiter. If the d is not present, you still need to make sure that you include the period in the informat name.
Making this all a little more concrete ... here, the subj values are four columns wide with no decimal places, and hence we use the 4. informat. We alternatively could have specified a 4.0 informat, but we could not have specified 4 (without the period) as the informat. The height values are two columns wide with no decimal places, and hence we use the 2. informat. Finally, the weight values are three columns wide with no decimal places, and hence we use the 3. informat.
Incidentally, the w.d informat ignores any d that we specify if the data value already contains a decimal point. So, for example, if we had a raw data value of 23.001 (occupying 6 columns with 5 digits, 1 decimal point, and 3 decimal places) and specified a 6. informat, SAS would still store the value as 23.001, even though we told SAS not to expect any decimal places.
Okay ... launch and run the SAS program, and review the output obtained from the print procedure to convince yourself that the three variables were read incorrectly. Oh, do you remember when I said that you needn't tell SAS to move the input pointer to the first column because it does so by default? You might want to convince yourself of this by removing the @1 that appears in the INPUT statement. Then, rerun the program to convince yourself that SAS still reads the data values properly.
Example
The following SAS program uses the @n column pointer control and standard character and numeric informats to read, respectively, two character variables — l_name and f_name — and two numeric variables — weight and height — into a temporary SAS data set called temp:
DATA temp;
input @18 l_name $6.
@6 f_name $11.
@30 weight 3.
@27 height 2.;
DATALINES;
1024 Alice Smith 1 65 125 12/1/95 2,036
1167 Maryann White 1 68 140 12/01/95 1,800
1168 Thomas Jones 2 190 12/2/95 2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia Ho 1 63 115 1/1/96 1,972
;
RUN;
PROC PRINT data = temp;
title 'Output dataset: TEMP';
RUN;
Obs | l_name | f_name | weight | height |
---|---|---|---|---|
1 | Smith | Alice | 125 | 65 |
2 | White | Maryann | 140 | 68 |
3 | Jones | Thomas | 190 | . |
4 | Arnold | Benedictine | 190 | 68 |
5 | Ho | Felicia | 115 | 63 |
The INPUT statement uses @n absolute pointer controls to move the input pointer first to column 18 to read l_name, then back to column 6 to read f_name, forward to column 30 to read weight, and back again to column 27 to read height. This example illustrates, therefore, how you can use @n pointer controls to read data fields in any order ... backwards, forwards, and backwards again.
The 3. that appears after weight and the 2. that appears after height should look familiar. They are again the numeric informats that tell SAS how to read each of the two numeric variables of interest. Now, if you look at the informats for the two character variables, l_name and f_name, you should see that they each begin with a dollar sign (\$). Because we are trying to read in character data values, we use what is called the \$w. informat. The w tells SAS the width of the raw data value, that is how many columns it occupies. The dollar sign (\$) and period (.) are required delimiters. In our example, the last name (l_name) occupies as many as 6 columns and hence we use the \$6. informat, and the first name (f_name) occupies as many as 11 columns and hence we use the \$11. informat.
Okay ... launch and run the SAS program, and review the output obtained from the print procedure to convince yourself that the four variables were read in correctly.
Example
In addition to using @n absolute pointer controls with numeric and character informats, the following SAS program uses +n relative pointer controls with nonstandard informats to create a temporary SAS data set called temp containing six variables:
DATA temp;
input @1 subj 4.
@6 f_name $11.
@18 l_name $6.
+3 height 2.
+5 wt_date mmddyy8.
+1 calorie comma5.;
DATALINES;
1024 Alice Smith 1 65 125 12/1/95 2,036
1167 Maryann White 1 68 140 12/01/95 1,800
1168 Thomas Jones 2 190 12/2/95 2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia Ho 1 63 115 1/1/96 1,972
;
RUN;
PROC PRINT data = temp;
title 'Output dataset: TEMP';
RUN;
Obs | subj | f_name | l_name | height | wt_date | calorie |
---|---|---|---|---|---|---|
1 | 1024 | Alice | Smith | 65 | 13118 | 2036 |
2 | 1167 | Maryann | White | 68 | 13118 | 1800 |
3 | 1168 | Thomas | Jones | . | 13119 | 2302 |
4 | 1201 | Benedictine | Arnold | 68 | 13117 | 2432 |
5 | 1302 | Felicia | Ho | 63 | 13149 | 1972 |
You should now understand the specifications for reading subj, f_name and l_name. The INPUT statement tells SAS to go to column 1 to read the numeric field subj that is four columns wide, then to go to column 6 to read the character field f_name that is 11 columns wide, and then to go to column 18 to read the character field l_name that is 6 columns wide.
The +n relative pointer controls are used to read the remaining three variables — height, wt_date, and calorie. The +n pointer control moves the input pointer forward to a column number that is relative to the current position of the pointer. That is, the + moves the pointer forward n columns. In order to count correctly, it is important to understand where the column pointer is located after each data value is read. In general, when using formatted input, the column pointer moves to the first column that follows the field that was just read.
Again, let's make it more concrete. When SAS finishes reading the l_name field, the column pointer moves to the first column that follows that field, that is, to the column that immediately follows the d in Arnold (column 24). Now, the height field begins 3 columns to the right, so that's why we tell SAS to move +3 before reading the height data. When SAS finishes reading the height field, the column pointer moves to the first column that follows that field, that is, to the empty column that follows the heights (column 29). Now, the wt_date field begins 5 columns to the right, so that's why we tell SAS to move +5 before reading the weight dates. When SAS finished reading the wt_date field, the column pointer moves to the first column that follows the field, that is to the empty column just before the calorie data (column 42). Now, the calorie field begins 1 column to the right, so that's why we tell SAS to move +1 before reading the calorie data.
We'd be all done explaining the INPUT statement if it weren't for the wt_date and calorie fields containing nonstandard numeric data. The wt_date field contains forward slashes (/) for specifying the dates, and the calorie data contains commas (,). To read the wt_date field, we use the mmddyy8. informat to tell SAS to expect dates written in mm/dd/yy form. The 8, of course, tells SAS that the dates could occupy as many as eight columns. To read the calorie field, we use the comma6. informat to tell SAS to read numeric data containing commas and occupying as many as 6 columns.
In general, the COMMAw.d informat is used to read numeric values and to remove embedded blanks, commas, dashes, dollar signs, percent signs, right parentheses, and left parentheses (which are converted to minus signs). The w. tells SAS the width of the field (including dollar signs, decimal places, or other special characters). The d, which is optional, tells SAS the number of implied decimal places for a value.
Whew!!! Now, go ahead and launch and run the SAS program, and review the output obtained from the print procedure. In doing so, note that formatted input does not imply formatted output. For example, note that SAS stores the date 12/01/95 as the number 13118. We won't worry about this too much yet — it's just that, interestingly enough, SAS stores dates as numeric values equal to the number of days that have passed since January 1, 1960. Yup, that's right —dates after January 1, 1960 are stored as unique positive integers, and dates before January 1, 1960 are stored as unique negative integers. Also, note that SAS prints the values of the variable calorie without commas. In order to get SAS to print the values in a more understandable format, we must tell SAS how to format the output by using a SAS FORMAT statement.
Run the following code to see formatted output:
DATA temp;
input @1 subj 4.
@6 f_name $11.
@18 l_name $6.
+3 height 2.
+5 wt_date mmddyy8.
+1 calorie comma5.;
FORMAT wt_date mmddyy8. calorie comma5.;
DATALINES;
1024 Alice Smith 1 65 125 12/1/95 2,036
1167 Maryann White 1 68 140 12/01/95 1,800
1168 Thomas Jones 2 190 12/2/95 2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia Ho 1 63 115 1/1/96 1,972
;
RUN;
PROC PRINT data = temp;
title 'Output dataset: TEMP';
RUN;
Obs | subj | f_name | l_name | height | wt_date | calorie |
---|---|---|---|---|---|---|
1 | 1024 | Alice | Smith | 65 | 12/01/95 | 2,036 |
2 | 1167 | Maryann | White | 68 | 12/01/95 | 1,800 |
3 | 1168 | Thomas | Jones | . | 12/02/95 | 2,302 |
4 | 1201 | Benedictine | Arnold | 68 | 11/30/95 | 2,432 |
5 | 1302 | Felicia | Ho | 63 | 01/01/96 | 1,972 |
4.7. Reading Data in From Other SAS Datasets¶
In terms of telling SAS where your input data reside, we’ve investigated how to read instream data into a SAS data set. We’ve also investigated how to read data contained in an external raw data file into a SAS data set. Now, we’ll investigate how to read data already contained in one SAS data set into another SAS data set.
Example
The following SAS program creates a temporary SAS data set called work.temp, which is identical to the permanent SAS data set called PHC6089.temp2:
LIBNAME PHC6089 '/folders/myfolders/SAS_Notes/data/';
DATA temp;
set PHC6089.temp2;
RUN;
PROC PRINT data=temp;
title 'Output dataset: TEMP';
RUN;
Obs | subj | gender | height | weight |
---|---|---|---|---|
1 | 1024 | 1 | 65 | 125 |
2 | 1167 | 1 | 68 | 140 |
3 | 1168 | 2 | 68 | 190 |
4 | 1201 | 2 | 72 | 190 |
5 | 1302 | 1 | 63 | 115 |
In this case:
because a one-level name is used, the DATA statement tells SAS to create a temporary dataset called temp.
the SET statement tells SAS to assign the data in the existing permanent SAS data set stat480.temp2 — observation by observation — to the temporary temp data set appearing in the DATA statement. Because the variables in the existing data set have already been named, no input statement is necessary. That is, the SET statement indicates that the data being read are already in the structure of a SAS data set, and therefore merely gives the name of the data set.
Note that before this program will work, you will have already had to create the permanent SAS data set PHC6089.temp2 in our previous Example of this lesson. Launch the SAS program. If PHC6089.temp2 is not stored in a directory called /folders/myfolders/SAS_Notes/data/, you will need to edit the LIBNAME statement so it reflects the location that PHC6089.temp2 is stored on your computer. After convincing yourself that you’ve made any necessary changes, run the SAS program and review the output. Note that the structure and contents of the new temporary data set temp are identical to the permanent SAS data set PHC6089.temp2.
4.8. Reading Data Using PROC IMPORT¶
If you have a delimited file such as a CSV or Excel file, you can read this file in by using the PROC IMPORT procedure. This will be similar to using the R functions read_csv
and read_excel
.
Example
The following SAS program creates a temporary SAS data set called work.yts by reading in the CSV file Youth_Tobacco_Survey_YTS_Data.csv:
FILENAME yts_data
"/folders/myfolders/SAS_Notes/data/Youth_Tobacco_Survey_YTS_Data.csv";
PROC IMPORT datafile = yts_data out = yts DBMS = CSV replace;
getnames = yes;
guessingrows = max;
run;
579 ods listing close;ods html5 (id=saspy_internal) file=stdout options(bitmap_mode='inline') device=svg style=HTMLBlue; ods
579! graphics on / outputfmt=png;
NOTE: Writing HTML5(SASPY_INTERNAL) Body file: STDOUT
580
581 FILENAME yts_data
582 "/folders/myfolders/SAS_Notes/data/Youth_Tobacco_Survey_YTS_Data.csv";
583
584 PROC IMPORT datafile = yts_data out = yts DBMS = CSV replace;
585 getnames = yes;
586 guessingrows = max;
587 run;
588 /**********************************************************************
589 * PRODUCT: SAS
590 * VERSION: 9.4
591 * CREATOR: External File Interface
592 * DATE: 16SEP20
593 * DESC: Generated SAS Datastep Code
594 * TEMPLATE SOURCE: (None Specified.)
595 ***********************************************************************/
596 data WORK.YTS ;
597 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
598 infile YTS_DATA delimiter = ',' MISSOVER DSD firstobs=2 ;
599 informat YEAR best32. ;
600 informat LocationAbbr $2. ;
601 informat LocationDesc $24. ;
602 informat TopicType $25. ;
603 informat TopicDesc $29. ;
604 informat MeasureDesc $57. ;
605 informat DataSource $4. ;
606 informat Response $8. ;
607 informat Data_Value_Unit $1. ;
608 informat Data_Value_Type $10. ;
609 informat Data_Value best32. ;
610 informat Data_Value_Footnote_Symbol $1. ;
611 informat Data_Value_Footnote $71. ;
612 informat Data_Value_Std_Err best32. ;
613 informat Low_Confidence_Limit best32. ;
614 informat High_Confidence_Limit best32. ;
615 informat Sample_Size best32. ;
616 informat Gender $7. ;
617 informat Race $9. ;
618 informat Age $8. ;
619 informat Education $13. ;
620 informat GeoLocation $43. ;
621 informat TopicTypeId $3. ;
622 informat TopicId $6. ;
623 informat MeasureId $6. ;
624 informat StratificationID1 $4. ;
625 informat StratificationID2 $4. ;
626 informat StratificationID3 $4. ;
627 informat StratificationID4 $4. ;
628 informat SubMeasureID $5. ;
629 informat DisplayOrder best32. ;
630 format YEAR best12. ;
631 format LocationAbbr $2. ;
632 format LocationDesc $24. ;
633 format TopicType $25. ;
634 format TopicDesc $29. ;
635 format MeasureDesc $57. ;
636 format DataSource $4. ;
637 format Response $8. ;
638 format Data_Value_Unit $1. ;
639 format Data_Value_Type $10. ;
640 format Data_Value best12. ;
641 format Data_Value_Footnote_Symbol $1. ;
642 format Data_Value_Footnote $71. ;
643 format Data_Value_Std_Err best12. ;
644 format Low_Confidence_Limit best12. ;
645 format High_Confidence_Limit best12. ;
646 format Sample_Size best12. ;
647 format Gender $7. ;
648 format Race $9. ;
649 format Age $8. ;
650 format Education $13. ;
651 format GeoLocation $43. ;
652 format TopicTypeId $3. ;
653 format TopicId $6. ;
654 format MeasureId $6. ;
655 format StratificationID1 $4. ;
656 format StratificationID2 $4. ;
657 format StratificationID3 $4. ;
658 format StratificationID4 $4. ;
659 format SubMeasureID $5. ;
660 format DisplayOrder best12. ;
661 input
662 YEAR
663 LocationAbbr $
664 LocationDesc $
665 TopicType $
666 TopicDesc $
667 MeasureDesc $
668 DataSource $
669 Response $
670 Data_Value_Unit $
671 Data_Value_Type $
672 Data_Value
673 Data_Value_Footnote_Symbol $
674 Data_Value_Footnote $
675 Data_Value_Std_Err
676 Low_Confidence_Limit
677 High_Confidence_Limit
678 Sample_Size
679 Gender $
680 Race $
681 Age $
682 Education $
683 GeoLocation $
684 TopicTypeId $
685 TopicId $
686 MeasureId $
687 StratificationID1 $
688 StratificationID2 $
689 StratificationID3 $
690 StratificationID4 $
691 SubMeasureID $
692 DisplayOrder
693 ;
694 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
695 run;
NOTE: The infile YTS_DATA is:
Filename=/folders/myfolders/SAS_Notes/data/Youth_Tobacco_Survey_YTS_Data.csv,
Owner Name=root,Group Name=vboxsf,
Access Permission=-rwxrwx---,
Last Modified=30Jul2020:12:31:00,
File Size (bytes)=2615810
NOTE: 9794 records were read from the infile YTS_DATA.
The minimum record length was 228.
The maximum record length was 355.
NOTE: The data set WORK.YTS has 9794 observations and 31 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
9794 rows created in WORK.YTS from YTS_DATA.
NOTE: WORK.YTS data set was successfully created.
NOTE: The data set WORK.YTS has 9794 observations and 31 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 8.71 seconds
cpu time 7.34 seconds
696
697 ods html5 (id=saspy_internal) close;ods listing;
698
Note that when running PROC IMPORT SAS generates the DATA step code (shown in the log file output) to read in the dataset. Some of the key features of PROC IMPORT:
- datafille: path to the dataset file on your computer that you are trying to read in.
- out: name of the output SAS dataset that you are creating. Here we made a temporary SAS datset called yts that is stored in the Work library.
- DBMS: type of file you are reading in such as csv (CSV) or excel(XLS or XLSX).
- replace: use this option to overwite a SAS dataset of the same name if one already exists.
- getnames: YES if the first row of the dataset are column names. NO otherwise.
- guessingrows: this option specifies the number of rows of the datatset to scan for SAS to try and figure out the data type and formats of the variables in the datset. Can take values equal to an exact number of rows or MAX to scan the whole dataset.
4.9. Exercises¶
Use PROC IMPORT to read in the Youth Tobacco study, Youth_Tobacco_Survey_YTS_Data.csv and name it youth.
Download an Excel version of the Monuments dataset, Monuments.xlsx, from CANVAS. Use PROC IMPORT to read in the dataset and call the output mon.
Write a DATA step that will read in the following column formatted data using DATALINES and print the results using PROC PRINT. This dataset contains information on esophageal cancer in Ille-et-Vilaine, France. The variables are
Age group
Alcohol consumption
Tobacco consumption
Number of Cases
Number of Controls
Write a DATA step to read in the data and print the results. Note: the first line is to help you count the columns. Each number corresponds to a multiple of 5.
----1----2----3----4----5----6----7----8 75+ 0-39g/day 0-9g/day 1 18 75+ 0-39g/day 10-19 2 6 75+ 0-39g/day 30+ 1 3 75+ 40-79 0-9g/day 2 5 75+ 40-79 10-19 1 3 75+ 40-79 20-29 0 3 75+ 40-79 30+ 1 1 75+ 80-119 0-9g/day 1 1 75+ 80-119 10-19 1 1 75+ 120+ 0-9g/day 2 2 75+ 120+ 10-19 1 1