{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Character Functions\n", "\n", "In this lesson, we'll investigate some of the functions available in SAS that can be applied only to character variables. For example, if you want to remove blanks from a character string, you might consider using the **compress** function. Or, if you want to select a smaller substring, say a first name, from a larger string containing one's full name, you might want to take advantage of the **substr** function. Some of the functions that we will learn about are old standbys, such as: **length**, **substr**, **compbl**, **compress**, **verify**, **input**, **put**, **tranwrd**, **scan**, **trim**, **upcase**, **lowcase**, **| |** (concatenation), **index**, **indexc**, and **spedis**. And, some of the functions that we will learn about are new just to SAS Version 9. They include: **anyalpha**, **anydigit**, **catx**, **cats**, **lengthc**, **propcase**, **strip**, **count**, and **countc**.\n", "\n", "Let's also take this opportunity to introduce you to a couple of great resources for finding information about a variety of SAS topics. One resource is sasCommunity.org. You can learn more about the sasCommunity.org just by mucking around for a bit on the site. Another resource is the SAS Global Forum Conference which is held each year. However, the really helpful place to end up is a resource that allows you to search for previous papers and presentations from the SAS Global Forum (annual conferences for SAS Users). It is using this search engine where you'll find the material this lesson is based on!\n", "\n", "For this lesson, we will work through Ron Cody's paper An Introduction to SAS Character Function.\n", "\n", "## Lengths of Character Variables\n", "\n", "Before we actually discuss these functions, we need to understand how SAS software assigns storage lengths to character variables. It is important to remember two things: 1) The storage length of a character variable is set at compile time. and 2) this length is determined by the first appearance of a character variable in a DATA step. There are several ways to check the storage length of character variables in your SAS data set. One way is to run PROC CONTENTS. If you are using SAS 9 and above, the new function LENGTHC can be used to determine the storage length of a character variable. \n", "\n", "
The following SAS program illustrates properties of the length of a character varible in SAS.
\n", "\n", "storage_length=3 \n", "display=:abc: \n", "\n", "
Let's examine the program. We are creating a temporary dataset called chars1. The FILE statement determines where output from PUT statement is sent to. In this case, the use of the PRINT option sends output from PUT to the ODS output destination used for all other output such as plots and other output from PROC steps. We then create a character variable called string that contains the string 'abc'. Since we create the variable first, the next line where we set the length of string to be 7 characters is ignored. We can see that this happens by using the lengthc function to save the length of the string variable to the numeric variable storage_length which is then output by using the PUT statments. Note that both PUT statements use the format
\n", "PUT <varname>=;
\n",
" which results in both printing out the variables name along with it's stored value as we can see in the output shown above. Note that we have also created another string called display that contains the character value stored in string with semicolons appended on either end by using the string concatenation operators ||.
\n", "In order to correctly set the length of string to the desired length of 7 characters, we must run the LENGTH statement first before assigning a value to string as in the following program.
\n", "\n", "storage_length=7 \n", "display=:abc : \n", "\n", "
Now that we have switched the order of the LENGTH statement and creating the variable string, the length of string is now set to 7 instead of 3 and whitespace has been added to pad the extra character values to make it length 7 as shown in the display variable.
\n", "This example will demonstrate how to convert multiple blanks to a single blank. Suppose you have some names and addresses in a file. Some of the data entry clerks placed extra spaces between the first and last names and in the address fields. You would like to store all names and addresses with single blanks. Here is an example of how this is done:
\n", "Obs | \n", "name | \n", "address | \n", "city | \n", "state | \n", "zip | \n", "
---|---|---|---|---|---|
1 | \n", "Ron Cody | \n", "89 Lazy Brook Road | \n", "Flemington | \n", "NJ | \n", "0882 | \n", "
2 | \n", "Bill Brown | \n", "28 Cathy Street | \n", "North City | \n", "NY | \n", "1151 | \n", "
The compbl function reduces anywhere it finds multiple successive blanks into a single blank. Note that it removed the extra white space between 28 and Cathy and Street in the second address.
\n", "The program below uses the COMPRESS function twice. The first time, to remove blanks from the string, and the second to remove blanks plus the other above mentioned characters. Here is the code:
\n", "phone | \n", "phone1 | \n", "phone2 | \n", "
---|---|---|
(908)235-4490 | \n", "(908)235-4490 | \n", "9082354490 | \n", "
(201) 555-77 99 | \n", "(201)555-7799 | \n", "2015557799 | \n", "
The variable PHONE1 has just blanks removed. Notice that the COMPRESS function does not have a second argument here. When it is omitted, the COMPRESS function removes only blanks. For the variable PHONE2, the second argument of the COMPRESS function contains a list of the characters to remove: left parenthesis, blank, right parenthesis, and dash. This string is placed in single or double quotes. Remember, when you specify a list of characters to remove, blanks are no longer included unless you explicitly include a blank in the list.
\n", "In the example below, only the values 'A', 'B', 'C', 'D', and 'E' are valid data values. A very easy way to test if there are any invalid characters present is shown next:
\n", "id | \n", "answer | \n", "position | \n", "
---|---|---|
001 | \n", "acbed | \n", "0 | \n", "
002 | \n", "abxde | \n", "3 | \n", "
003 | \n", "12cce | \n", "1 | \n", "
004 | \n", "abc e | \n", "4 | \n", "
The workhorse of this example is the VERIFY function. It is a bit complicated. It inspects every character in the first argument and, if it finds any value not in the verify string (the second argument), it will return the position of the first offending value. If all the values of the string are located in the verify string, a value of 0 is returned. To help clarify this, look at the output shown above.
\n", "One thing to be careful of when using the VERIFY function (and many of the other character functions) is trailing blanks. For example, look at the following:
\n", "string | \n", "pos1 | \n", "pos2 | \n", "
---|---|---|
abc | \n", "4 | \n", "0 | \n", "
The LENGTH statement forces SAS to add 7 trailing blanks to pad string to be of length 10, so the value of POS1 is 4, the position of the first trailing blank. One way to avoid the trailing blank problem is to remove the trailing blanks before using the verify function. The TRIM function does this for us. We use TRIM first for POS2, so now with the trailing blanks removed there are no invalid characters and VERIFY returns 0. Do note that TRIM only removes leading and trailing blanks, so this would not work if the blanks were in the middle of the string. In this case, you could add a blank to the verify string or use compress to remove all the blanks.
\n", "In this example, you have ID codes which contain in the first two positions, a state abbreviation. Furthermore, positions 7-9 contain a numeric code. You want to create two new variables; one containing the two digit state codes and the other, a numeric variable constructed from the three numerals in positions 7,8, and 9. Here goes:
\n", "id | \n", "state | \n", "num | \n", "
---|---|---|
NYXXXX123 | \n", "NY | \n", "123 | \n", "
NJ1234567 | \n", "NJ | \n", "567 | \n", "
Creating the state code is easy. We use the SUBSTR function. The first argument is the variable from which we want to extract the substring, the second argument is the starting position of the substring, and the last argument is the length of the substring (not the ending position as you might guess). Also note the use of the LENGTH statement to set the length of STATE to 2 bytes. Without a LENGTH statement, the length of STATE would be the same as the length of ID. Why? Remember that character variable lengths are set at compile time. The starting position and length parameters are constants here, but they could have been computed or read in from an external file. So, without a LENGTH statement, what is SAS to do? What is the longest substring you can extract from a string of length n? The answer is n and that is what SAS uses as the default length of the result.
\n", "Extracting the three digit number code is more complicated. First we use the SUBSTR function to pull out the three numerals (numerals are character representations of numbers). However, the result of a SUBSTR function is always a character value. To convert the character value to a number, we use the INPUT function. The INPUT function takes the first argument and \"reads\" it as if it were coming from a file, according to the informat listed as the second argument. So, for the first observation, the SUBSTR function would return the string '123' and the INPUT function would convert this to the number 123. As a point of interest, you may use a longer informat as the second argument without any problems. For example, the INPUT statement could have been written as:
\n", "input(substr(id,7,3),8.);
\n",
" and everything would have worked out fine. This fact is useful in situations where you do not know the length of the string ahead of time.
\n", "Suppose you have some systolic blood pressures (SBP) and diastolic blood pressures (DBP) in a SAS data set. You want to print out these values and star high values with an asterisk. Here is a program that uses the SUBSTR function on the left of the equals sign to do that:
\n", "sbp | \n", "dbp | \n", "sbp_chk | \n", "dbp_chk | \n", "
---|---|---|---|
120 | \n", "80 | \n", "120 | \n", "80 | \n", "
180 | \n", "92 | \n", "180* | \n", "92* | \n", "
200 | \n", "110 | \n", "200* | \n", "110* | \n", "
We first need to set the lengths of SBP_CHK and DBP_CHK to 4 (three spaces for the value plus one for the possible asterisk). Next, we use a PUT function to perform a numeric to character conversion. The PUT function is, in some ways, similar to the INPUT function. It \"writes out\" the value of the first argument, according to the FORMAT specified in the second argument. By \"write out\" we actually mean assign the value to the variable on the left of the equal sign. The SUBSTR function then places an asterisk in the fourth position when a value of SBP is greater than 160 or a value of DBP is greater than 90, as you can see in the output above.
\n", "In the example to follow, five separate character values were placed together on a line with either a space, a comma, a semi-colon, a period, or an explanation mark between them. You would like to extract the five values and assign them to five character variables. Without the SCAN function this would be hard; with it, it's easy:
\n", "piece1 | \n", "piece2 | \n", "piece3 | \n", "piece4 | \n", "piece5 | \n", "
---|---|---|---|---|
this | \n", "line | \n", "contains | \n", "five | \n", "words | \n", "
abcdefghij | \n", "xxx | \n", "yyy | \n", "\n", " | \n", " |
The function:
\n", "SCAN(char_var,n,'list-of-delimiters');
\n",
" returns the nth \"word\" from the char_var, where a \"word\" is defined as anything between two delimiters. If there are fewer than n words in the character variable, the SCAN function will return a blank. If n is negative, the scan will proceed from right to left. If n is greater than the number of words in the string, a missing value is returned.
\n", "By placing the SCAN function in a DO loop, we can pick out the nth word in the string. Even though we have not discussed loops and array in SAS yet, hopefully the general syntax makes sense from using loops and arrays in other languages such as R.
\n", "Here is an interesting example that uses the SCAN function to extract the last name from a character variable that contains first and last name as well as a possible middle name or initial. In this example, you want to create a list in alphabetical order by last name. First the program, then the explanation:
\n", "Name | \n", "Phone Number | \n", "
---|---|
Raymond Albert | \n", "(732)235-44 | \n", "
Steven J. Foster | \n", "(201)567-98 | \n", "
Alred Edward Newman | \n", "(800)123-43 | \n", "
Jose Romerez | \n", "(516)593-23 | \n", "
Jeff W. Snoker | \n", "(908)782-43 | \n", "
It is easy to extract the last name by using a –1 as the second argument of the SCAN function. Remember, a negative value for this arguments results in a scan from right to left. Output from the REPORT procedure is shown above.
\n", "The REPORT procedure simply creates a table. The COLUMNS statement defines what variables will be used to make the table, and the DEFINE statements describes how they will be used. For example, the order and noprint options are used for last_name so that we use the last name to order the rows but do not print the column to the output table.
\n", "Here is a short program which demonstrate these two functions: If the search fails, both functions return a zero.
\n", "string | \n", "first | \n", "first_c | \n", "
---|---|---|
abcxyz1234 | \n", "4 | \n", "4 | \n", "
1234567890 | \n", "0 | \n", "0 | \n", "
abcx1y2z39 | \n", "0 | \n", "4 | \n", "
abczzzxyz3 | \n", "7 | \n", "4 | \n", "
The following example program uses the UPCASE function to convert all character variables to upper case.
\n", "a | \n", "b | \n", "c | \n", "d | \n", "e | \n", "x | \n", "y | \n", "
---|---|---|---|---|---|---|
M | \n", "F | \n", "P | \n", "P | \n", "D | \n", "1 | \n", "2 | \n", "
M | \n", "F | \n", "M | \n", "F | \n", "M | \n", "3 | \n", "4 | \n", "
This program uses the _CHARACTER_ keyword to select all the character variables instead of individually listing them out when creating the array. The result of running this program is to convert all values for the variables A,B,C, D, and E to upper case. The LOWCASE function could be used in place of the UPCASE function if you wanted all your character values in lower case.
\n", "Again, do not worry if you don't fully understand the array and do loops yet. We will cover these in more detail in an unpcoming lesson.
\n", "The following example program uses the PROPCASE function to text stored in a character variable to proper case, i.e. the frist letter is capitalized followed by all lower case letters within a word.
\n", "Name | \n", "propname | \n", "
---|---|
rOn coDY | \n", "Ron Cody | \n", "
the tall and the short | \n", "The Tall And The Short | \n", "
the "%$#@!" escape | \n", "The "%$#@!" Escape | \n", "
In the following example, we want to standardize addresses by converting the words 'Street', 'Avenue', and 'Road' to the abbreviations 'St.', 'Ave.', and 'Rd.' respectively. Look at the following program:
\n", "Obs | \n", "address | \n", "
---|---|
1 | \n", "89 Lazy Brook Rd. | \n", "
2 | \n", "123 River Rd. | \n", "
3 | \n", "12 Main St. | \n", "
\n", "string1=* x * \n", "string2=* x * \n", "string3=*x* \n", "\n", "
The following program demonstrates some example distances between strings.
\n", "string1 | \n", "string2 | \n", "points | \n", "
---|---|---|
same | \n", "same | \n", "0 | \n", "
same | \n", "sam | \n", "8 | \n", "
firstletter | \n", "xirstletter | \n", "18 | \n", "
lastletter | \n", "lastlettex | \n", "10 | \n", "
receipt | \n", "reciept | \n", "7 | \n", "
The following program demonstrates the ANYALPHA and ANYDIGIT functions.
\n", "string | \n", "first_alpha | \n", "first_digit | \n", "
---|---|---|
no digits here | \n", "1 | \n", "0 | \n", "
the 3 and 4 | \n", "1 | \n", "5 | \n", "
123 456 789 | \n", "0 | \n", "1 | \n", "
The following program demonstrates the NOTALPHA and NOTDIGIT functions.
\n", "string | \n", "only_alpha | \n", "only_digit | \n", "
---|---|---|
abcdefg | \n", "0 | \n", "1 | \n", "
1234567 | \n", "1 | \n", "0 | \n", "
abc123 | \n", "4 | \n", "1 | \n", "
1234abcd | \n", "1 | \n", "5 | \n", "
Notice the use of the TRIM function in the program above. Without it, both NOTALPHA and NOTDIGIT would return the position of the first blank. You can see why we called the data set DATA_CLEANING. It is an easy way to determine if a string contains only a single class of characters.
\n", "The following program demonstrates the differences between the different concatenation functions.
\n", "\n", "string1=ABC \n", "string2=XYZ \n", "string3=12345 \n", "cat_op=ABC XYZ \n", "cat=ABC XYZ \n", "cats=ABCXYZ \n", "catx=ABC***XYZ***12345 \n", "\n", "
string1 | \n", "string2 | \n", "string3 | \n", "cat_op | \n", "cat | \n", "cats | \n", "catx | \n", "
---|---|---|---|---|---|---|
ABC | \n", "XYZ | \n", "12345 | \n", "ABC XYZ | \n", "ABC XYZ | \n", "ABCXYZ | \n", "ABC***XYZ***12345 | \n", "
Without the LENGTH statement in this program, the length of the two variables CATS and CATX would be 200 (the default length for the CAT functions). Note that the default length when using the concatenation operator is the sum of the lengths of the arguments to be joined.
\n", "The following program demonstrates the LENGTH, LENGTHN, and LENGTHC functions.
\n", "one | \n", "two | \n", "three | \n", "length_one | \n", "lengthn_one | \n", "lengthc_one | \n", "length_two | \n", "lengthn_two | \n", "lengthc_two | \n", "length_three | \n", "lengthn_three | \n", "lengthc_three | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|
ABC | \n", "\n", " | ABC XYZ | \n", "3 | \n", "3 | \n", "6 | \n", "1 | \n", "0 | \n", "1 | \n", "9 | \n", "9 | \n", "9 | \n", "
Obs | \n", "string1 | \n", "string2 | \n", "compare1 | \n", "compare2 | \n", "compare3 | \n", "compare4 | \n", "
---|---|---|---|---|---|---|
1 | \n", "Robert Parker | \n", "ROBERT PARKER | \n", "-1 | \n", "-1 | \n", "2 | \n", "0 | \n", "
\n", "*ROBERT* \n", "* ROBERT* \n", "*ROBERT * \n", "\n", "
The following program illustrates the COUNT and COUNTC functions
\n", "string | \n", "count_a_or_b | \n", "countc_a_or_b | \n", "count_abc | \n", "countc_abc | \n", "case_a | \n", "
---|---|---|---|---|---|
xxabcxabcxxbbbb | \n", "2 | \n", "8 | \n", "2 | \n", "10 | \n", "2 | \n", "
cbacba | \n", "0 | \n", "4 | \n", "0 | \n", "6 | \n", "2 | \n", "
aaAA | \n", "0 | \n", "2 | \n", "0 | \n", "2 | \n", "4 | \n", "
Obs | \n", "name | \n", "
---|---|
1 | \n", "Aaron,Patricia G | \n", "
2 | \n", "Aaron,Petra L | \n", "
3 | \n", "Abaineh,Yohannes T | \n", "
4 | \n", "Abbene,Anthony M | \n", "
5 | \n", "Abbey,Emmanuel | \n", "
The following SAS program shows how to use regular expressions to find all names that include 'Rawlings'.
\n", "Obs | \n", "name | \n", "
---|---|
1 | \n", "Rawlings,Kellye A | \n", "
2 | \n", "Rawlings,Paula M | \n", "
3 | \n", "Rawlings-Blake,Stephani | \n", "
The variable re stores the regular expression defined by PRXPARSE. Note the the regular expression must occur between two forward slashes '/regex/'. The i in the regular expression definition '/Rawlings/i' tells SAS to ignore case when doing the search, so both 'Rawlings' and 'rawlings' will match. We then pass the regular expression to PRXMATCH to search the name variable. If there is a match somewhere it returns a non-zero index. Note that we have an if statement to catch mistakes in our regular expression and stop the program if we have made a mistake.
\n", "PRXSUBSTR works similarly but has a two variables that you pass in called `pos` and `len` in which to store the starting position and length of the string match. This would be needed for a followup call to SUBSTR in order to find and possible replace with substring.
\n", "The following SAS program used PRXCHANGE to replace \"a\" with \"j\" in the names of the salary dataset.
\n", "Obs | \n", "name | \n", "name2 | \n", "
---|---|---|
1 | \n", "Aaron,Patricia G | \n", "Ajron,Patricia G | \n", "
2 | \n", "Aaron,Petra L | \n", "Ajron,Petra L | \n", "
3 | \n", "Abaineh,Yohannes T | \n", "Abjineh,Yohannes T | \n", "
4 | \n", "Abbene,Anthony M | \n", "Abbene,Anthony M | \n", "
5 | \n", "Abbey,Emmanuel | \n", "Abbey,Emmjnuel | \n", "
Obs | \n", "name | \n", "name2 | \n", "
---|---|---|
1 | \n", "Aaron,Patricia G | \n", "Ajron,Pjtricij G | \n", "
2 | \n", "Aaron,Petra L | \n", "Ajron,Petrj L | \n", "
3 | \n", "Abaineh,Yohannes T | \n", "Abjineh,Yohjnnes T | \n", "
4 | \n", "Abbene,Anthony M | \n", "Abbene,Anthony M | \n", "
5 | \n", "Abbey,Emmanuel | \n", "Abbey,Emmjnuel | \n", "
Notice that the first call used 1 for the number of substitutions and so it only replaces the first found match of \"a\" with \"j\"; Where as, the second version uses -1 and so replaces all occurences of \"a\" with \"j\".
\n", "\n",
"re = prxparse('/.*(ment|ice)/');\n",
"call prxsubstr(re, Agency, pos, len);\n",
"dept = substr(Agency, pos, len);\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "SAS",
"language": "sas",
"name": "sas"
},
"language_info": {
"codemirror_mode": "sas",
"file_extension": ".sas",
"mimetype": "text/x-sas",
"name": "sas"
}
},
"nbformat": 4,
"nbformat_minor": 2
}