9. Character Functions

In this lesson, we’ll investigate some of the functions available in SAS that can be applied only to character variables. For example, if you want to remove blanks from a character string, you might consider using the compress function. Or, if you want to select a smaller substring, say a first name, from a larger string containing one’s full name, you might want to take advantage of the substr function. Some of the functions that we will learn about are old standbys, such as: length, substr, compbl, compress, verify, input, put, tranwrd, scan, trim, upcase, lowcase, | | (concatenation), index, indexc, and spedis. And, some of the functions that we will learn about are new just to SAS Version 9. They include: anyalpha, anydigit, catx, cats, lengthc, propcase, strip, count, and countc.

Let’s also take this opportunity to introduce you to a couple of great resources for finding information about a variety of SAS topics. One resource is sasCommunity.org. You can learn more about the sasCommunity.org just by mucking around for a bit on the site. Another resource is the SAS Global Forum Conference which is held each year. However, the really helpful place to end up is a resource that allows you to search for previous papers and presentations from the SAS Global Forum (annual conferences for SAS Users). It is using this search engine where you’ll find the material this lesson is based on!

For this lesson, we will work through Ron Cody’s paper An Introduction to SAS Character Function.

9.1. Lengths of Character Variables

Before we actually discuss these functions, we need to understand how SAS software assigns storage lengths to character variables. It is important to remember two things: 1) The storage length of a character variable is set at compile time. and 2) this length is determined by the first appearance of a character variable in a DATA step. There are several ways to check the storage length of character variables in your SAS data set. One way is to run PROC CONTENTS. If you are using SAS 9 and above, the new function LENGTHC can be used to determine the storage length of a character variable.

Example

The following SAS program illustrates properties of the length of a character varible in SAS.

DATA chars1;
    FILE print;
    string = 'abc';
    length string $ 7;  /* Does this do anything */
    storage_length = lengthc(string);
    display = ":" || string || ":";
    put storage_length= ;
    put display= ;
RUN;
SAS Connection established. Subprocess id is 8422
SAS Output

The SAS System

storage_length=3                                                                                                                    
display=:abc:                                                                                                                       

Let's examine the program. We are creating a temporary dataset called chars1. The FILE statement determines where output from PUT statement is sent to. In this case, the use of the PRINT option sends output from PUT to the ODS output destination used for all other output such as plots and other output from PROC steps. We then create a character variable called string that contains the string 'abc'. Since we create the variable first, the next line where we set the length of string to be 7 characters is ignored. We can see that this happens by using the lengthc function to save the length of the string variable to the numeric variable storage_length which is then output by using the PUT statments. Note that both PUT statements use the format

PUT <varname>=;

which results in both printing out the variables name along with it's stored value as we can see in the output shown above. Note that we have also created another string called display that contains the character value stored in string with semicolons appended on either end by using the string concatenation operators ||.

In order to correctly set the length of string to the desired length of 7 characters, we must run the LENGTH statement first before assigning a value to string as in the following program.

DATA chars2;
    FILE print;
    length string $ 7;  /* Does this do anything */
    string = 'abc';
    storage_length = lengthc(string);
    display = ":" || string || ":";
    put storage_length= ;
    put display= ;
RUN;
SAS Output

The SAS System

storage_length=7                                                                                                                    
display=:abc    :                                                                                                                   

Now that we have switched the order of the LENGTH statement and creating the variable string, the length of string is now set to 7 instead of 3 and whitespace has been added to pad the extra character values to make it length 7 as shown in the display variable.

9.2. Removing Characters from a String

In this section we will examine using the different COMPRESS functions to remove character values from SAS character strings.

Example

This example will demonstrate how to convert multiple blanks to a single blank. Suppose you have some names and addresses in a file. Some of the data entry clerks placed extra spaces between the first and last names and in the address fields. You would like to store all names and addresses with single blanks. Here is an example of how this is done:

DATA multiple;
    INPUT #1 @1  name    $20.
          #2 @1  address $30.
          #3 @1  city    $15.
             @20 state    $2.
             @25 zip      $5.;
    name = compbl(name);
    address = compbl(address);
    city = compbl(city);
DATALINES;
Ron Cody
89 Lazy Brook Road
Flemington         NJ    08822
Bill     Brown
28   Cathy   Street
North   City       NY    11518
;
RUN;

PROC PRINT data = multiple;
RUN;
SAS Output

The SAS System

Obs name address city state zip
1 Ron Cody 89 Lazy Brook Road Flemington NJ 0882
2 Bill Brown 28 Cathy Street North City NY 1151

The compbl function reduces anywhere it finds multiple successive blanks into a single blank. Note that it removed the extra white space between 28 and Cathy and Street in the second address.

A more general problem is to remove selected characters from a string. For example, suppose you want to remove blanks, parentheses, and dashes from a phone number that has been stored as a character value. Here comes the COMPRESS function to the rescue! The COMPRESS function can remove any number of specified characters from a character variable.

Example

The program below uses the COMPRESS function twice. The first time, to remove blanks from the string, and the second to remove blanks plus the other above mentioned characters. Here is the code:

DATA phone;
    INPUT phone $ 1-15;
    phone1 = compress(phone);
    phone2 = compress(phone,'(-) ');
DATALINES;
(908)235-4490
(201) 555-77 99
;
RUN;

PROC PRINT data = phone NOOBS;
    TITLE 'The phone data set';
RUN;
SAS Output

The phone data set

phone phone1 phone2
(908)235-4490 (908)235-4490 9082354490
(201) 555-77 99 (201)555-7799 2015557799

The variable PHONE1 has just blanks removed. Notice that the COMPRESS function does not have a second argument here. When it is omitted, the COMPRESS function removes only blanks. For the variable PHONE2, the second argument of the COMPRESS function contains a list of the characters to remove: left parenthesis, blank, right parenthesis, and dash. This string is placed in single or double quotes. Remember, when you specify a list of characters to remove, blanks are no longer included unless you explicitly include a blank in the list.

9.3. Character Data Verification

A common task in data processing is to validate data. For example, you may want to be sure that only certain values are present in a character variable.

Example

In the example below, only the values 'A', 'B', 'C', 'D', and 'E' are valid data values. A very easy way to test if there are any invalid characters present is shown next:

DATA verify;
    INPUT @1  id $3.
          @5  answer $5.;
    position = verify(answer,'abcde');
DATALINES;
001 acbed
002 abxde
003 12cce
004 abc e
;
RUN;

PROC PRINT data = verify NOOBS;
    TITLE 'The verify data set';
RUN;
SAS Output

The verify data set

id answer position
001 acbed 0
002 abxde 3
003 12cce 1
004 abc e 4

The workhorse of this example is the VERIFY function. It is a bit complicated. It inspects every character in the first argument and, if it finds any value not in the verify string (the second argument), it will return the position of the first offending value. If all the values of the string are located in the verify string, a value of 0 is returned. To help clarify this, look at the output shown above.

One thing to be careful of when using the VERIFY function (and many of the other character functions) is trailing blanks. For example, look at the following:

DATA trailing;
    length string $ 10;
    string = 'abc';
    pos1 = verify(string,'abcde');
    pos2 = verify(trim(string),'abcde');
RUN;

PROC PRINT data = trailing NOOBS;
    TITLE 'The trailing data set';
RUN;
SAS Output

The trailing data set

string pos1 pos2
abc 4 0

The LENGTH statement forces SAS to add 7 trailing blanks to pad string to be of length 10, so the value of POS1 is 4, the position of the first trailing blank. One way to avoid the trailing blank problem is to remove the trailing blanks before using the verify function. The TRIM function does this for us. We use TRIM first for POS2, so now with the trailing blanks removed there are no invalid characters and VERIFY returns 0. Do note that TRIM only removes leading and trailing blanks, so this would not work if the blanks were in the middle of the string. In this case, you could add a blank to the verify string or use compress to remove all the blanks.

9.4. Extracting Substrings

A substring is a part a longer string (although it can actually be the same length but this would not be too useful). To extract a substring in SAS we can use the SUBSTR function.

Example

In this example, you have ID codes which contain in the first two positions, a state abbreviation. Furthermore, positions 7-9 contain a numeric code. You want to create two new variables; one containing the two digit state codes and the other, a numeric variable constructed from the three numerals in positions 7,8, and 9. Here goes:

DATA pieces_parts;
    INPUT id $ 1-9;
    length state $ 2;
    state = substr(id,1,2);
    num = input(substr(id,7,3),3.);
DATALINES;
NYXXXX123
NJ1234567
;
RUN;

PROC PRINT data = pieces_parts NOOBS;
    TITLE 'The pieces_parts data set';
RUN;
SAS Output

The pieces_parts data set

id state num
NYXXXX123 NY 123
NJ1234567 NJ 567

Creating the state code is easy. We use the SUBSTR function. The first argument is the variable from which we want to extract the substring, the second argument is the starting position of the substring, and the last argument is the length of the substring (not the ending position as you might guess). Also note the use of the LENGTH statement to set the length of STATE to 2 bytes. Without a LENGTH statement, the length of STATE would be the same as the length of ID. Why? Remember that character variable lengths are set at compile time. The starting position and length parameters are constants here, but they could have been computed or read in from an external file. So, without a LENGTH statement, what is SAS to do? What is the longest substring you can extract from a string of length n? The answer is n and that is what SAS uses as the default length of the result.

Extracting the three digit number code is more complicated. First we use the SUBSTR function to pull out the three numerals (numerals are character representations of numbers). However, the result of a SUBSTR function is always a character value. To convert the character value to a number, we use the INPUT function. The INPUT function takes the first argument and "reads" it as if it were coming from a file, according to the informat listed as the second argument. So, for the first observation, the SUBSTR function would return the string '123' and the INPUT function would convert this to the number 123. As a point of interest, you may use a longer informat as the second argument without any problems. For example, the INPUT statement could have been written as:

input(substr(id,7,3),8.);

and everything would have worked out fine. This fact is useful in situations where you do not know the length of the string ahead of time.

There is a particularly useful and somewhat obscure use of the SUBSTR function that we would like to discuss next. You can use this function to place characters in specific locations within a string by placing the SUBSTR function on the left hand side of the equal sign (in the older manuals I think this was called a SUBSTR pesudo function).

Example

Suppose you have some systolic blood pressures (SBP) and diastolic blood pressures (DBP) in a SAS data set. You want to print out these values and star high values with an asterisk. Here is a program that uses the SUBSTR function on the left of the equals sign to do that:

DATA pressure;
     INPUT sbp dbp @@;
     length sbp_chk dbp_chk $ 4;
     sbp_chk = put(sbp,3.);
     dbp_chk = put(dbp,3.);
     if sbp gt 160 then substr(sbp_chk,4,1) = '*';
     if dbp gt  90 then substr(dbp_chk,4,1) = '*';
DATALINES;
120 80 180 92 200 110
;
RUN;

PROC PRINT data = pressure NOOBS;
    TITLE 'The pressure data set';
RUN;
SAS Output

The pressure data set

sbp dbp sbp_chk dbp_chk
120 80 120 80
180 92 180* 92*
200 110 200* 110*

We first need to set the lengths of SBP_CHK and DBP_CHK to 4 (three spaces for the value plus one for the possible asterisk). Next, we use a PUT function to perform a numeric to character conversion. The PUT function is, in some ways, similar to the INPUT function. It "writes out" the value of the first argument, according to the FORMAT specified in the second argument. By "write out" we actually mean assign the value to the variable on the left of the equal sign. The SUBSTR function then places an asterisk in the fourth position when a value of SBP is greater than 160 or a value of DBP is greater than 90, as you can see in the output above.

9.5. Parsing a String with SCAN

Parsing a string means to take it apart based on some rules. We can parse a string by using the SCAN function if we know what character values are used to separate different parts of the string.

Example

In the example to follow, five separate character values were placed together on a line with either a space, a comma, a semi-colon, a period, or an explanation mark between them. You would like to extract the five values and assign them to five character variables. Without the SCAN function this would be hard; with it, it's easy:

DATA parse;
     INPUT long_str $ 1-80;
     array pieces[5] $ 10 piece1-piece5;
     do i = 1 to 5;
         pieces[i] = scan(long_str,i,',.! ');
     end;
     drop long_str i;
DATALINES;
this line,contains!five.words
abcdefghijkl xxx yyy
;
RUN;

PROC PRINT data = parse NOOBS;
    TITLE 'The parse data set';
RUN;
SAS Output

The parse data set

piece1 piece2 piece3 piece4 piece5
this line contains five words
abcdefghij xxx yyy    

The function:

SCAN(char_var,n,'list-of-delimiters');

returns the nth "word" from the char_var, where a "word" is defined as anything between two delimiters. If there are fewer than n words in the character variable, the SCAN function will return a blank. If n is negative, the scan will proceed from right to left. If n is greater than the number of words in the string, a missing value is returned.

By placing the SCAN function in a DO loop, we can pick out the nth word in the string. Even though we have not discussed loops and array in SAS yet, hopefully the general syntax makes sense from using loops and arrays in other languages such as R.

Example

Here is an interesting example that uses the SCAN function to extract the last name from a character variable that contains first and last name as well as a possible middle name or initial. In this example, you want to create a list in alphabetical order by last name. First the program, then the explanation:

DATA first_last;
     INPUT @1  name  $20.
           @21 phone $13.;
     *** The next statement extracts the last name from name;
     last_name = scan(name,-1,' ');    /* scans from the right */
DATALINES;
Jeff W. Snoker        (908)782-4382
Raymond Albert        (732)235-4444
Alred Edward Newman   (800)123-4321
Steven J. Foster      (201)567-9876
Jose Romerez          (516)593-2377
;
RUN;

PROC REPORT data = first_last NOWINDOWS;
    TITLE 'Names and Phone Numbers in Alphabetical Order (by Last Name)';
    COLUMNS name phone last_name;
    DEFINE last_name / order noprint width = 20;
    DEFINE name      / display 'Name' left width = 20;
    DEFINE phone     / display 'Phone Number' width = 13 format=$13.;
RUN;
SAS Output

Names and Phone Numbers in Alphabetical Order (by Last Name)

Name Phone Number
Raymond Albert (732)235-44
Steven J. Foster (201)567-98
Alred Edward Newman (800)123-43
Jose Romerez (516)593-23
Jeff W. Snoker (908)782-43

It is easy to extract the last name by using a –1 as the second argument of the SCAN function. Remember, a negative value for this arguments results in a scan from right to left. Output from the REPORT procedure is shown above.

The REPORT procedure simply creates a table. The COLUMNS statement defines what variables will be used to make the table, and the DEFINE statements describes how they will be used. For example, the order and noprint options are used for last_name so that we use the last name to order the rows but do not print the column to the output table.

9.6. Locating the Position of One String Within Another String

Two somewhat similar functions, INDEX and INDEXC can be used to locate a string, or one of several strings within a longer string. For example, if you have a string ‘ABCDEFG’ and want the location of the letters DEF (starting position 4), the following INDEX function could be used:

INDEX('ABCDEFG','DEF');

The first argument is the argument you want to search, the second argument is the string you are searching for. This would return a value of 4, the starting position of the string ‘DEF’. If you want to know the starting position of any one of several characters, the INDEXC function can be used. As an example, if you wanted the starting position of any of the letters ‘G’, ‘C’, ‘B’, or ‘F’ in the string ‘ABCDEFG’, you would code:

INDEXC('ABCDEFG','GCBF');

or

INDEXC('ABCDEFG','G','C','B','F');

The function would return a value of 2, the position of the ‘B’, the first letter found in the first argument.

Example

Here is a short program which demonstrate these two functions: If the search fails, both functions return a zero.

DATA locate;
    INPUT string $ 1-10;
    first   = index(string,'xyz');
    first_c = indexc(string,'x','y','z');
DATALINES;
abcxyz1234
1234567890
abcx1y2z39
abczzzxyz3
;
RUN;

PROC PRINT data = locate NOOBS;
    TITLE 'The locate data set';
RUN;
SAS Output

The locate data set

string first first_c
abcxyz1234 4 4
1234567890 0 0
abcx1y2z39 0 4
abczzzxyz3 7 4

9.7. Converting Between Lower and Upper Case and Proper Case

The two companion functions UPCASE and LOWCASE do just what you would expect. These two functions are especially useful when data entry clerks are careless and a mixture of upper and lower cases values are entered for the same variable. You may want to place all of your character variables in an array and UPCASE (or LOWCASE) them all.

Example

The following example program uses the UPCASE function to convert all character variables to upper case.

DATA up_down;
     length a b c d e $ 1;
     INPUT a b c d e x y;
DATALINES;
M f P p D 1 2
m f m F M 3 4
;
RUN;

DATA upper;
    SET up_down;
    array all_c[*] _character_;
    DO i = 1 to dim(all_c);
        all_c[i] = upcase(all_c[i]);
    END;
    DROP i;
RUN;

PROC PRINT data = upper NOOBS;
    TITLE 'The upper data set';
RUN;
SAS Output

The upper data set

a b c d e x y
M F P P D 1 2
M F M F M 3 4

This program uses the _CHARACTER_ keyword to select all the character variables instead of individually listing them out when creating the array. The result of running this program is to convert all values for the variables A,B,C, D, and E to upper case. The LOWCASE function could be used in place of the UPCASE function if you wanted all your character values in lower case.

Again, do not worry if you don't fully understand the array and do loops yet. We will cover these in more detail in an unpcoming lesson.

A handy new V9 function is PROPCASE. This function capitalizes the first letter of each word.

Example

The following example program uses the PROPCASE function to text stored in a character variable to proper case, i.e. the frist letter is capitalized followed by all lower case letters within a word.

DATA proper;
     INPUT Name $40.;
     propname = propcase(Name);
DATALINES;
rOn coDY
the tall and the short
the "%$#@!" escape
;
RUN;

PROC PRINT data = proper NOOBS;
    TITLE 'The proper data set';
RUN;
SAS Output

The proper data set

Name propname
rOn coDY Ron Cody
the tall and the short The Tall And The Short
the "%$#@!" escape The "%$#@!" Escape

9.8. Substituting One Word for Another in a String

TRANWRD (translate word), can perform a search and replace operation on a string variable. The syntax of the TRANWRD function is:

TRANWRD (char_var,'find_str','replace_str');

where

  • char_var is the string we are searching within

  • find_str is the string we are searching for within char_var

  • replace_str is the string which will be substituted for find_str within char_var if found.

That is, the function will replace every occurrence of find_str with replace_str.

Example

In the following example, we want to standardize addresses by converting the words 'Street', 'Avenue', and 'Road' to the abbreviations 'St.', 'Ave.', and 'Rd.' respectively. Look at the following program:

DATA convert;
     INPUT @1 address $20.;
     *** Convert Street, Avenue, and Road 
         to their abbreviations;
     address = tranwrd(address,'Street','St.');
     address = tranwrd(address,'Avenue','Ave.');
     address = tranwrd(address,'Road','Rd.');
DATALINES;
89 Lazy Brook Road
123 River Rd.
12 Main Street
;
RUN;

PROC PRINT data = convert;
    TITLE 'The convert data set';
RUN;
SAS Output

The convert data set

Obs address
1 89 Lazy Brook Rd.
2 123 River Rd.
3 12 Main St.

The TRANWRD function must replace the find_str with a non null string, so if the replace_str is given as the empty string ‘’, TRANWRD will instead use an empty space ‘ ‘. If you want to replace with an empty string, you will have to use the TRANSTRN function with trimn(‘’) as the replacement string.

TITLE "Comparison of TRANSWRD and TRANSTRN with "
DATA _null_;
   FILE print;
   string1='*' || tranwrd('abcxabc', 'abc', '') || '*';
   put string1=;
   string2='*' || transtrn('abcxabc', 'abc', '') || '*';
   put string2=;
   string3='*' || transtrn('abcxabc', 'abc', trimn('')) || '*';
   put string3=;
RUN;
SAS Output

The convert data set

string1=* x *                                                                                                                       
string2=* x *                                                                                                                       
string3=*x*                                                                                                                         

9.9. Fuzzy Comparisons: The SPEDIS Function

The SPEDIS function measures the “spelling distance” between two strings. If the two strings are identical, the function returns a 0. For each category of spelling error, the function assigns “penalty” points. For example, if the first letter in the two strings is different, there is a relatively large penalty. If two letters are interchanged, the number of penalty points is smaller. This function is very useful in performing a fuzzy merge where there may be differences in spelling between two files. You can also use it with character data consisting of numerals, such as social security numbers. The syntax of this function is:

SPEDIS(string1,string2);

Example

The following program demonstrates some example distances between strings.

DATA compare;
     length string1 string2 $ 15;
     INPUT string1 string2;
     points = spedis(string1, string2);
DATALINES;
same same
same sam
firstletter xirstletter
lastletter lastlettex
receipt reciept
;
RUN;

PROC PRINT data = compare NOOBS;
    TITLE 'The compare data set';
RUN;
SAS Output

The compare data set

string1 string2 points
same same 0
same sam 8
firstletter xirstletter 18
lastletter lastlettex 10
receipt reciept 7

9.10. The ANY Functions

The “ANY” functions of SAS 9 allow you to determine the position of a class of characters (digits, alpha-numeric, alpha, white space, or punctuation). The complete list is:

  • ANYALNUM - find any alphanumeric values (both letter or digit)

  • ANYALPHA - find any letter

  • ANYDIGIT - find any number

  • ANYPUNCT - find any punctuation

  • ANYSPACE - find any space

These functions return the first position of a character of the appropriate class. If no appropriate characters are found, the functions return a 0.

Example

The following program demonstrates the ANYALPHA and ANYDIGIT functions.

DATA find_alpha_digit;
     input string $20.;
     first_alpha = anyalpha(string);
     first_digit = anydigit(string);
DATALINES;
no digits here
the 3 and 4
123 456 789
;
RUN;

PROC PRINT data = find_alpha_digit NOOBS;
    TITLE 'The find_alpha_digit data set';
RUN;
SAS Output

The find_alpha_digit data set

string first_alpha first_digit
no digits here 1 0
the 3 and 4 1 5
123 456 789 0 1

9.11. The NOT functions

The “NOT” functions work in a similar way to the “ANY” functions except that they return the position of the first class of character that does not match the designated class.

The complete list is:

  • NOTALNUM

  • NOTALPHA

  • NOTDIGIT

  • NOTPUNCT

  • NOTSPACE

These functions return the first position of a character that does not match the appropriate class. If no “bad” characters are found (characters that do not match the designation) the functions return a 0.

Example

The following program demonstrates the NOTALPHA and NOTDIGIT functions.

DATA data_cleaning;
     input string $20.;
     only_alpha = notalpha(trim(string));
     only_digit = notdigit(trim(string));
DATALINES;
abcdefg
1234567
abc123
1234abcd
;
RUN;

PROC PRINT data = data_cleaning NOOBS;
    TITLE 'The data_cleaning data set';
RUN;
SAS Output

The data_cleaning data set

string only_alpha only_digit
abcdefg 0 1
1234567 1 0
abc123 4 1
1234abcd 1 5

Notice the use of the TRIM function in the program above. Without it, both NOTALPHA and NOTDIGIT would return the position of the first blank. You can see why we called the data set DATA_CLEANING. It is an easy way to determine if a string contains only a single class of characters.

9.12. Concatenation Functions: CAT, CATS, and CATX

Although you can concatenate (join) two strings using the concatenation operator (either || or !!), several new functions and call routines, new with SAS 9, can be used instead. The CAT function does the same thing of concatenating strings just like || or !!, but the advantage of the other new functions is that they can automatically strip off leading and trailing blanks and can insert separation characters for you. We will demonstrate only two of the concatenation functions here. The CATS function strips leading and trailing blanks before joining two or more strings. The CATX function works the same as the CATS function, but allows you to specify one or more separation characters to insert between the strings. The syntax for these two functions is:

CATS(string1,string2,<stringn>);

CATX(separator,string1,string2,<stringn>);

Example

The following program demonstrates the differences between the different concatenation functions.

DATA join_up;
     FILE print;
     length cats $ 6 catx $ 17;
     string1 = 'ABC   ';
     string2 = '   XYZ   ';
     string3 = '12345';
     cat_op = string1 || string2;
     cat  = cat(string1, string2);
     cats = cats(string1, string2);
     catx = catx('***', string1, string2, string3);
     PUT string1=;
     PUT string2=;
     PUT string3=;
     PUT cat_op=;
     PUT cat=;
     PUT cats=;
     PUT catx=;
RUN;

PROC PRINT data = join_up NOOBS;
    TITLE 'The join_up data set';
    VAR string1 string2 string3 cat_op cat cats catx;
RUN;
SAS Output

The join_up data set

string1=ABC                                                                                                                         
string2=XYZ                                                                                                                         
string3=12345                                                                                                                       
cat_op=ABC      XYZ                                                                                                                 
cat=ABC      XYZ                                                                                                                    
cats=ABCXYZ                                                                                                                         
catx=ABC***XYZ***12345                                                                                                              

The join_up data set

string1 string2 string3 cat_op cat cats catx
ABC XYZ 12345 ABC XYZ ABC XYZ ABCXYZ ABC***XYZ***12345

Without the LENGTH statement in this program, the length of the two variables CATS and CATX would be 200 (the default length for the CAT functions). Note that the default length when using the concatenation operator is the sum of the lengths of the arguments to be joined.

9.13. The LENGTH, LENGTHN, and LENGTHN Functions

We spoke earlier about storage length of character variables. The collection of LENGTH functions in this section have different and useful purposes. The LENGTHC function (V9) returns the storage length of character variables. The other two functions, LENGTH and LENGTHN both return the length of a character variable not counting trailing blanks. The only difference between LENGTH and LENGTHN is that LENGTHN returns a 0 for a null string while LENGTH returns a 1.

Example

The following program demonstrates the LENGTH, LENGTHN, and LENGTHC functions.

DATA how_long;
   one = 'ABC   ';
   two = ' ';   /* character missing value */
   three = 'ABC   XYZ';
   length_one = length(one);
   lengthn_one = lengthn(one);
   lengthc_one = lengthc(one);
   length_two = length(two);
   lengthn_two = lengthn(two);
   lengthc_two = lengthc(two);
   length_three = length(three);
   lengthn_three = lengthn(three);
   lengthc_three = lengthc(three);
RUN;

PROC PRINT data = how_long NOOBS;
    TITLE 'The how_long data set';
RUN;
SAS Output

The how_long data set

one two three length_one lengthn_one lengthc_one length_two lengthn_two lengthc_two length_three lengthn_three lengthc_three
ABC   ABC XYZ 3 3 6 1 0 1 9 9 9

9.14. Comparing Two Strings Using the COMPARE Function

You may wonder why we need a function to compare two character strings. Why can’t you simple use an equal sign? The COMPARE function gives you more flexibility in comparing strings. The syntax is:

COMPARE(string1, string2 <,'modifiers'>)

You may use one or more modifiers from the following list, placed in single or double quotes as follows:

  • i or I - ignore case

  • l or L - removes leading blanks

  • n or N - removes quotes from any argument that is an n-literal and ignore case. An n-literal is a string in quotes, followed by an ‘n’, which is useful for non-valid SAS names.

  • : (color) - truncate the longer string to the length of the shorter string. Note that the default is to pad the shorter string with blanks before a comparison.

For example, if you want to ignore case and remove leading blanks, you could code:

yes_or_no = compare(string1,string2,'il');

The COMPARE function returns the first position in which the two strings are different. It returns a negative number if string1 is aplhabetically or numerically ordered before string2. It returns 0 if the two strings are the same.

DATA compare_str;
    string1 = "   Robert Parker";
    string2 = "ROBERT PARKER";
    compare1 = compare(string1, string2);
    compare2 = compare(string1, string2, 'i');
    compare3 = compare(string1, string2, 'l');
    compare4 = compare(string1, string2, 'il');
RUN;

PROC PRINT data = compare_str;
    TITLE "Compare Strings Dataset";
RUN;
SAS Output

Compare Strings Dataset

Obs string1 string2 compare1 compare2 compare3 compare4
1 Robert Parker ROBERT PARKER -1 -1 2 0

9.15. Removing Leading and Trailing

The STRIP function removes leading and trailing blanks. The TRIM function removes trailing blanks, and the LEFT function removes leading blanks.

TITLE "Whitespace Example";

DATA _NULL_;
   FILE print;
   string = "    ROBERT    ";
   strip = '*' || strip(string) || '*';
   trim = '*' || trim(string) || '*';
   left = '*' || left(string) || '*';
   put strip;
   put trim;
   put left;
RUN;
SAS Output

Whitespace Example

*ROBERT*                                                                                                                            
*    ROBERT*                                                                                                                        
*ROBERT        *                                                                                                                    

9.16. Counting Occurences of Characters or Substrings: COUNT and COUNTC

Two other SAS 9 functions are COUNT and COUNTC. COUNT is used to count the number of times a particular substring appears in a string. COUNTC counts the number of times one or more characters appear. You can also use modifiers to ignore case (use the ‘i’ modifier) or ignore trailing blanks (use the ‘t’ modifier) in both the string and the find_string.

The syntax for these two functions is:

count(string,find_string,<'modifiers'>)

countc(string,find_string,<'modifiers'>)

Example

The following program illustrates the COUNT and COUNTC functions

DATA Dracula;
    INPUT string $20.;
    count_a_or_b = count(string,'ab');
    countc_a_or_b = countc(string,'ab');
    count_abc = count(string,'abc');
    countc_abc = countc(string,'abc');
    case_a = countc(string,'a','i');
DATALINES;
xxabcxabcxxbbbb
cbacba
aaAA
;
RUN;

PROC PRINT data = Dracula NOOBS;
    TITLE 'The Dracula data set';
RUN;
SAS Output

The Dracula data set

string count_a_or_b countc_a_or_b count_abc countc_abc case_a
xxabcxabcxxbbbb 2 8 2 10 2
cbacba 0 4 0 6 2
aaAA 0 2 0 2 4

9.17. PRX Functions: Regular Expressions

SAS also has a family of PRX functions that allow string matching based on regular expressions instead of fixed strings. What are regular expressions?

  • Ways to search for specific strings

  • Can be very complicated or simple

  • Highly useful - think “Find” on steriods

  • http://www.regular-expressions.info/reference.html

  • They can be used to match a large number of strings in one statement

  • . matches any single character

  • * means repeate as many (even if 0) more times the last character

  • ? makes the last thing optional

  • ^ matches the start of a vector ^a - starst with “a”

  • $ matches the end of the vector b$ - ends with “b”

We will cover the following SAS functions which use regular expressions

  • PRXPARSE - defines the regular expression (and substitution string) for the SAS functions

  • PRXMATCH - search the string for a given regular expression from PRXPARSE and returns the starting position if a match is found and 0 otherwise

  • CALL PRXSUBSTR - similar to PRXMATCH, search the string for the given regular expression from PRXPARSE but can find both the starting position and length of the match or return 0 for starting postion and length if no match is found.

  • PRXCHANGE - search and replace the regular expression from PRXPARSE with the given replacement string. There is a parameter to specify how many matches to replace. A value of -1 for how many times to replace mean to replace all matches found.

Let’s use the salaries dataset to illustrate these functions.

FILENAME salaries '/folders/myfolders/SAS_Notes/data/Baltimore_City_Employee_Salaries_FY2015.csv';

PROC IMPORT datafile = salaries out = sal dbms = CSV replace;
   getnames = yes;
RUN;

PROC PRINT data = sal (obs=5);
   VAR name;
RUN;
SAS Output

The SAS System

Obs name
1 Aaron,Patricia G
2 Aaron,Petra L
3 Abaineh,Yohannes T
4 Abbene,Anthony M
5 Abbey,Emmanuel

Example

The following SAS program shows how to use regular expressions to find all names that include 'Rawlings'.

DATA sal_rawlings;
   set sal;
   re = prxparse('/Rawlings/i');
   if missing(re) then do;
      putlog 'ERROR: regex is malformed';
      stop;
      end;
   if prxmatch(re, name) = 0 THEN DELETE;
RUN;

PROC PRINT data = sal_rawlings;
   VAR name;
RUN;
SAS Output

The SAS System

Obs name
1 Rawlings,Kellye A
2 Rawlings,Paula M
3 Rawlings-Blake,Stephani

The variable re stores the regular expression defined by PRXPARSE. Note the the regular expression must occur between two forward slashes '/regex/'. The i in the regular expression definition '/Rawlings/i' tells SAS to ignore case when doing the search, so both 'Rawlings' and 'rawlings' will match. We then pass the regular expression to PRXMATCH to search the name variable. If there is a match somewhere it returns a non-zero index. Note that we have an if statement to catch mistakes in our regular expression and stop the program if we have made a mistake.

PRXSUBSTR works similarly but has a two variables that you pass in called `pos` and `len` in which to store the starting position and length of the string match. This would be needed for a followup call to SUBSTR in order to find and possible replace with substring.

Example

The following SAS program used PRXCHANGE to replace "a" with "j" in the names of the salary dataset.

DATA sal_a1;
   set sal;
   re = prxparse('s/a/j/');
   if missing(re) then do;
      putlog 'ERROR: regex is malformed';
      stop;
      end;
   name2 = prxchange(re, 1, name);
RUN;

PROC PRINT data = sal_a1(obs=5);
   VAR name name2;
RUN;

DATA sal_a2;
   set sal;
   re = prxparse('s/a/j/');
   if missing(re) then do;
      putlog 'ERROR: regex is malformed';
      stop;
      end;
   name2 = prxchange(re, -1, name);
RUN;

PROC PRINT data = sal_a2(obs=5);
   VAR name name2;
RUN;
SAS Output

The SAS System

Obs name name2
1 Aaron,Patricia G Ajron,Patricia G
2 Aaron,Petra L Ajron,Petra L
3 Abaineh,Yohannes T Abjineh,Yohannes T
4 Abbene,Anthony M Abbene,Anthony M
5 Abbey,Emmanuel Abbey,Emmjnuel

The SAS System

Obs name name2
1 Aaron,Patricia G Ajron,Pjtricij G
2 Aaron,Petra L Ajron,Petrj L
3 Abaineh,Yohannes T Abjineh,Yohjnnes T
4 Abbene,Anthony M Abbene,Anthony M
5 Abbey,Emmanuel Abbey,Emmjnuel

Notice that the first call used 1 for the number of substitutions and so it only replaces the first found match of "a" with "j"; Where as, the second version uses -1 and so replaces all occurences of "a" with "j".

9.18. Exercises

  1. Read in the Baltimore city employee salaries dataset, Baltimore_City_Employee_Salaries_FY2015.csv, using PROC IMPORT.

  2. Make a temporary SAS dataset called health_sal using the salaries dataset, with only the agencies (JobTitle) of those with “fire” (anywhere in the job title), if any, in the name. Be sure to ignore case in the string matching for ‘fire’.

  3. Make a temporary data set called trans which contains only agencies that contain “TRANS” (case sensitive).

  4. What is/are the profession(s) of people who have “abra” in their name for Baltimore’s Salaries? Case should be ignored.

  5. Create a subset of the salaries dataset that only includes the Police Department, Fire Department and Sheriff’s Office. Use the Agency variable with (case sensitive) string matching. Call this emer. How many employees are in this new dataset?

  6. Create a variable called dept in the emer data set by extracting the the part of the Agency variable up until ‘ment’ or ‘ice’. E.g. we want to extract all characters up until ment or ice (we can group in regex using parentheses) and then discard the rest. Make frequency table of the dept variable using PROC FREQ. To extract the desired substring, use:


re = prxparse('/.*(ment|ice)/');
call prxsubstr(re, Agency, pos, len);
dept = substr(Agency, pos, len);