9. Character Functions¶

In this lesson, we’ll investigate some of the functions available in SAS that can be applied only to character variables. For example, if you want to remove blanks from a character string, you might consider using the compress function. Or, if you want to select a smaller substring, say a first name, from a larger string containing one’s full name, you might want to take advantage of the substr function. Some of the functions that we will learn about are old standbys, such as: length, substr, compbl, compress, verify, input, put, tranwrd, scan, trim, upcase, lowcase, | | (concatenation), index, indexc, and spedis. And, some of the functions that we will learn about are new just to SAS Version 9. They include: anyalpha, anydigit, catx, cats, lengthc, propcase, strip, count, and countc.

Let’s also take this opportunity to introduce you to a couple of great resources for finding information about a variety of SAS topics. One resource is sasCommunity.org. You can learn more about the sasCommunity.org just by mucking around for a bit on the site. Another resource is the SAS Global Forum Conference which is held each year. However, the really helpful place to end up is a resource that allows you to search for previous papers and presentations from the SAS Global Forum (annual conferences for SAS Users). It is using this search engine where you’ll find the material this lesson is based on!

For this lesson, we will work through Ron Cody’s paper An Introduction to SAS Character Function.

9.1. Lengths of Character Variables¶

Before we actually discuss these functions, we need to understand how SAS software assigns storage lengths to character variables. It is important to remember two things: 1) The storage length of a character variable is set at compile time. and 2) this length is determined by the first appearance of a character variable in a DATA step. There are several ways to check the storage length of character variables in your SAS data set. One way is to run PROC CONTENTS. If you are using SAS 9 and above, the new function LENGTHC can be used to determine the storage length of a character variable.

Example

The following SAS program illustrates properties of the length of a character varible in SAS.

DATA chars1;
    FILE print;
    string = 'abc';
    length string $ 7;  /* Does this do anything */
    storage_length = lengthc(string);
    display = ":" || string || ":";
    put storage_length= ;
    put display= ;
RUN;

SAS Connection established. Subprocess id is 8422

SAS Output

The SAS System

storage_length=3                                                                                                                    
display=:abc:

Let's examine the program. We are creating a temporary dataset called chars1. The FILE statement determines where output from PUT statement is sent to. In this case, the use of the PRINT option sends output from PUT to the ODS output destination used for all other output such as plots and other output from PROC steps. We then create a character variable called string that contains the string 'abc'. Since we create the variable first, the next line where we set the length of string to be 7 characters is ignored. We can see that this happens by using the lengthc function to save the length of the string variable to the numeric variable storage_length which is then output by using the PUT statments. Note that both PUT statements use the format

PUT <varname>=;

which results in both printing out the variables name along with it's stored value as we can see in the output shown above. Note that we have also created another string called display that contains the character value stored in string with semicolons appended on either end by using the string concatenation operators ||.

In order to correctly set the length of string to the desired length of 7 characters, we must run the LENGTH statement first before assigning a value to string as in the following program.

DATA chars2;
    FILE print;
    length string $ 7;  /* Does this do anything */
    string = 'abc';
    storage_length = lengthc(string);
    display = ":" || string || ":";
    put storage_length= ;
    put display= ;
RUN;

SAS Output

The SAS System

storage_length=7                                                                                                                    
display=:abc    :

Now that we have switched the order of the LENGTH statement and creating the variable string, the length of string is now set to 7 instead of 3 and whitespace has been added to pad the extra character values to make it length 7 as shown in the display variable.

9.2. Removing Characters from a String¶

In this section we will examine using the different COMPRESS functions to remove character values from SAS character strings.

Example

This example will demonstrate how to convert multiple blanks to a single blank. Suppose you have some names and addresses in a file. Some of the data entry clerks placed extra spaces between the first and last names and in the address fields. You would like to store all names and addresses with single blanks. Here is an example of how this is done:

DATA multiple;
    INPUT #1 @1  name    $20.
          #2 @1  address $30.
          #3 @1  city    $15.
             @20 state    $2.
             @25 zip      $5.;
    name = compbl(name);
    address = compbl(address);
    city = compbl(city);
DATALINES;
Ron Cody
89 Lazy Brook Road
Flemington         NJ    08822
Bill     Brown
28   Cathy   Street
North   City       NY    11518
;
RUN;

PROC PRINT data = multiple;
RUN;

SAS Output

The SAS System


Obs	name	address	city	state	zip
1	Ron Cody	89 Lazy Brook Road	Flemington	NJ	0882
2	Bill Brown	28 Cathy Street	North City	NY	1151

The compbl function reduces anywhere it finds multiple successive blanks into a single blank. Note that it removed the extra white space between 28 and Cathy and Street in the second address.

A more general problem is to remove selected characters from a string. For example, suppose you want to remove blanks, parentheses, and dashes from a phone number that has been stored as a character value. Here comes the COMPRESS function to the rescue! The COMPRESS function can remove any number of specified characters from a character variable.

Example

The program below uses the COMPRESS function twice. The first time, to remove blanks from the string, and the second to remove blanks plus the other above mentioned characters. Here is the code:

DATA phone;
    INPUT phone $ 1-15;
    phone1 = compress(phone);
    phone2 = compress(phone,'(-) ');
DATALINES;
(908)235-4490
(201) 555-77 99
;
RUN;

PROC PRINT data = phone NOOBS;
    TITLE 'The phone data set';
RUN;

SAS Output

The phone data set


phone	phone1	phone2
(908)235-4490	(908)235-4490	9082354490
(201) 555-77 99	(201)555-7799	2015557799

The variable PHONE1 has just blanks removed. Notice that the COMPRESS function does not have a second argument here. When it is omitted, the COMPRESS function removes only blanks. For the variable PHONE2, the second argument of the COMPRESS function contains a list of the characters to remove: left parenthesis, blank, right parenthesis, and dash. This string is placed in single or double quotes. Remember, when you specify a list of characters to remove, blanks are no longer included unless you explicitly include a blank in the list.

9.3. Character Data Verification¶

A common task in data processing is to validate data. For example, you may want to be sure that only certain values are present in a character variable.

Example

In the example below, only the values 'A', 'B', 'C', 'D', and 'E' are valid data values. A very easy way to test if there are any invalid characters present is shown next:

DATA verify;
    INPUT @1  id $3.
          @5  answer $5.;
    position = verify(answer,'abcde');
DATALINES;
001 acbed
002 abxde
003 12cce
004 abc e
;
RUN;

PROC PRINT data = verify NOOBS;
    TITLE 'The verify data set';
RUN;

SAS Output

The verify data set


id	answer	position
001	acbed	0
002	abxde	3
003	12cce	1
004	abc e	4

The workhorse of this example is the VERIFY function. It is a bit complicated. It inspects every character in the first argument and, if it finds any value not in the verify string (the second argument), it will return the position of the first offending value. If all the values of the string are located in the verify string, a value of 0 is returned. To help clarify this, look at the output shown above.

One thing to be careful of when using the VERIFY function (and many of the other character functions) is trailing blanks. For example, look at the following:

DATA trailing;
    length string $ 10;
    string = 'abc';
    pos1 = verify(string,'abcde');
    pos2 = verify(trim(string),'abcde');
RUN;

PROC PRINT data = trailing NOOBS;
    TITLE 'The trailing data set';
RUN;

SAS Output

The trailing data set


string	pos1	pos2
abc	4	0

The LENGTH statement forces SAS to add 7 trailing blanks to pad string to be of length 10, so the value of POS1 is 4, the position of the first trailing blank. One way to avoid the trailing blank problem is to remove the trailing blanks before using the verify function. The TRIM function does this for us. We use TRIM first for POS2, so now with the trailing blanks removed there are no invalid characters and VERIFY returns 0. Do note that TRIM only removes leading and trailing blanks, so this would not work if the blanks were in the middle of the string. In this case, you could add a blank to the verify string or use compress to remove all the blanks.

9.4. Extracting Substrings¶

A substring is a part a longer string (although it can actually be the same length but this would not be too useful). To extract a substring in SAS we can use the SUBSTR function.

Example

In this example, you have ID codes which contain in the first two positions, a state abbreviation. Furthermore, positions 7-9 contain a numeric code. You want to create two new variables; one containing the two digit state codes and the other, a numeric variable constructed from the three numerals in positions 7,8, and 9. Here goes:

DATA pieces_parts;
    INPUT id $ 1-9;
    length state $ 2;
    state = substr(id,1,2);
    num = input(substr(id,7,3),3.);
DATALINES;
NYXXXX123
NJ1234567
;
RUN;

PROC PRINT data = pieces_parts NOOBS;
    TITLE 'The pieces_parts data set';
RUN;

SAS Output

The pieces_parts data set


id	state	num
NYXXXX123	NY	123
NJ1234567	NJ	567

Creating the state code is easy. We use the SUBSTR function. The first argument is the variable from which we want to extract the substring, the second argument is the starting position of the substring, and the last argument is the length of the substring (not the ending position as you might guess). Also note the use of the LENGTH statement to set the length of STATE to 2 bytes. Without a LENGTH statement, the length of STATE would be the same as the length of ID. Why? Remember that character variable lengths are set at compile time. The starting position and length parameters are constants here, but they could have been computed or read in from an external file. So, without a LENGTH statement, what is SAS to do? What is the longest substring you can extract from a string of length n? The answer is n and that is what SAS uses as the default length of the result.

Extracting the three digit number code is more complicated. First we use the SUBSTR function to pull out the three numerals (numerals are character representations of numbers). However, the result of a SUBSTR function is always a character value. To convert the character value to a number, we use the INPUT function. The INPUT function takes the first argument and "reads" it as if it were coming from a file, according to the informat listed as the second argument. So, for the first observation, the SUBSTR function would return the string '123' and the INPUT function would convert this to the number 123. As a point of interest, you may use a longer informat as the second argument without any problems. For example, the INPUT statement could have been written as:

input(substr(id,7,3),8.);

and everything would have worked out fine. This fact is useful in situations where you do not know the length of the string ahead of time.

There is a particularly useful and somewhat obscure use of the SUBSTR function that we would like to discuss next. You can use this function to place characters in specific locations within a string by placing the SUBSTR function on the left hand side of the equal sign (in the older manuals I think this was called a SUBSTR pesudo function).

Example

Suppose you have some systolic blood pressures (SBP) and diastolic blood pressures (DBP) in a SAS data set. You want to print out these values and star high values with an asterisk. Here is a program that uses the SUBSTR function on the left of the equals sign to do that:

DATA pressure;
     INPUT sbp dbp @@;
     length sbp_chk dbp_chk $ 4;
     sbp_chk = put(sbp,3.);
     dbp_chk = put(dbp,3.);
     if sbp gt 160 then substr(sbp_chk,4,1) = '*';
     if dbp gt  90 then substr(dbp_chk,4,1) = '*';
DATALINES;
120 80 180 92 200 110
;
RUN;

PROC PRINT data = pressure NOOBS;
    TITLE 'The pressure data set';
RUN;

SAS Output

The pressure data set


sbp	dbp	sbp_chk	dbp_chk
120	80	120	80
180	92	180*	92*
200	110	200*	110*

We first need to set the lengths of SBP_CHK and DBP_CHK to 4 (three spaces for the value plus one for the possible asterisk). Next, we use a PUT function to perform a numeric to character conversion. The PUT function is, in some ways, similar to the INPUT function. It "writes out" the value of the first argument, according to the FORMAT specified in the second argument. By "write out" we actually mean assign the value to the variable on the left of the equal sign. The SUBSTR function then places an asterisk in the fourth position when a value of SBP is greater than 160 or a value of DBP is greater than 90, as you can see in the output above.

9.5. Parsing a String with SCAN¶

Parsing a string means to take it apart based on some rules. We can parse a string by using the SCAN function if we know what character values are used to separate different parts of the string.

Example

In the example to follow, five separate character values were placed together on a line with either a space, a comma, a semi-colon, a period, or an explanation mark between them. You would like to extract the five values and assign them to five character variables. Without the SCAN function this would be hard; with it, it's easy:

DATA parse;
     INPUT long_str $ 1-80;
     array pieces[5] $ 10 piece1-piece5;
     do i = 1 to 5;
         pieces[i] = scan(long_str,i,',.! ');
     end;
     drop long_str i;
DATALINES;
this line,contains!five.words
abcdefghijkl xxx yyy
;
RUN;

PROC PRINT data = parse NOOBS;
    TITLE 'The parse data set';
RUN;

SAS Output

The parse data set


piece1	piece2	piece3	piece4	piece5
this	line	contains	five	words
abcdefghij	xxx	yyy

The function:

SCAN(char_var,n,'list-of-delimiters');

returns the nth "word" from the char_var, where a "word" is defined as anything between two delimiters. If there are fewer than n words in the character variable, the SCAN function will return a blank. If n is negative, the scan will proceed from right to left. If n is greater than the number of words in the string, a missing value is returned.

By placing the SCAN function in a DO loop, we can pick out the nth word in the string. Even though we have not discussed loops and array in SAS yet, hopefully the general syntax makes sense from using loops and arrays in other languages such as R.

Example

Here is an interesting example that uses the SCAN function to extract the last name from a character variable that contains first and last name as well as a possible middle name or initial. In this example, you want to create a list in alphabetical order by last name. First the program, then the explanation:

DATA first_last;
     INPUT @1  name  $20.
           @21 phone $13.;
     *** The next statement extracts the last name from name;
     last_name = scan(name,-1,' ');    /* scans from the right */
DATALINES;
Jeff W. Snoker        (908)782-4382
Raymond Albert        (732)235-4444
Alred Edward Newman   (800)123-4321
Steven J. Foster      (201)567-9876
Jose Romerez          (516)593-2377
;
RUN;

PROC REPORT data = first_last NOWINDOWS;
    TITLE 'Names and Phone Numbers in Alphabetical Order (by Last Name)';
    COLUMNS name phone last_name;
    DEFINE last_name / order noprint width = 20;
    DEFINE name      / display 'Name' left width = 20;
    DEFINE phone     / display 'Phone Number' width = 13 format=$13.;
RUN;

SAS Output

Names and Phone Numbers in Alphabetical Order (by Last Name)


Name	Phone Number
Raymond Albert	(732)235-44
Steven J. Foster	(201)567-98
Alred Edward Newman	(800)123-43
Jose Romerez	(516)593-23
Jeff W. Snoker	(908)782-43

It is easy to extract the last name by using a –1 as the second argument of the SCAN function. Remember, a negative value for this arguments results in a scan from right to left. Output from the REPORT procedure is shown above.

The REPORT procedure simply creates a table. The COLUMNS statement defines what variables will be used to make the table, and the DEFINE statements describes how they will be used. For example, the order and noprint options are used for last_name so that we use the last name to order the rows but do not print the column to the output table.

9.6. Locating the Position of One String Within Another String¶

Two somewhat similar functions, INDEX and INDEXC can be used to locate a string, or one of several strings within a longer string. For example, if you have a string ‘ABCDEFG’ and want the location of the letters DEF (starting position 4), the following INDEX function could be used:

INDEX('ABCDEFG','DEF');

The first argument is the argument you want to search, the second argument is the string you are searching for. This would return a value of 4, the starting position of the string ‘DEF’. If you want to know the starting position of any one of several characters, the INDEXC function can be used. As an example, if you wanted the starting position of any of the letters ‘G’, ‘C’, ‘B’, or ‘F’ in the string ‘ABCDEFG’, you would code:

INDEXC('ABCDEFG','GCBF');

or

INDEXC('ABCDEFG','G','C','B','F');

The function would return a value of 2, the position of the ‘B’, the first letter found in the first argument.

Example

Here is a short program which demonstrate these two functions: If the search fails, both functions return a zero.

DATA locate;
    INPUT string $ 1-10;
    first   = index(string,'xyz');
    first_c = indexc(string,'x','y','z');
DATALINES;
abcxyz1234
1234567890
abcx1y2z39
abczzzxyz3
;
RUN;

PROC PRINT data = locate NOOBS;
    TITLE 'The locate data set';
RUN;

SAS Output

The locate data set


string	first	first_c
abcxyz1234	4	4
1234567890	0	0
abcx1y2z39	0	4
abczzzxyz3	7	4

9.7. Converting Between Lower and Upper Case and Proper Case¶

The two companion functions UPCASE and LOWCASE do just what you would expect. These two functions are especially useful when data entry clerks are careless and a mixture of upper and lower cases values are entered for the same variable. You may want to place all of your character variables in an array and UPCASE (or LOWCASE) them all.

Example

The following example program uses the UPCASE function to convert all character variables to upper case.

DATA up_down;
     length a b c d e $ 1;
     INPUT a b c d e x y;
DATALINES;
M f P p D 1 2
m f m F M 3 4
;
RUN;

DATA upper;
    SET up_down;
    array all_c[*] _character_;
    DO i = 1 to dim(all_c);
        all_c[i] = upcase(all_c[i]);
    END;
    DROP i;
RUN;

PROC PRINT data = upper NOOBS;
    TITLE 'The upper data set';
RUN;

SAS Output

The upper data set


a	b	c	d	e	x	y
M	F	P	P	D	1	2
M	F	M	F	M	3	4

This program uses the _CHARACTER_ keyword to select all the character variables instead of individually listing them out when creating the array. The result of running this program is to convert all values for the variables A,B,C, D, and E to upper case. The LOWCASE function could be used in place of the UPCASE function if you wanted all your character values in lower case.

Again, do not worry if you don't fully understand the array and do loops yet. We will cover these in more detail in an unpcoming lesson.

A handy new V9 function is PROPCASE. This function capitalizes the first letter of each word.

Example

The following example program uses the PROPCASE function to text stored in a character variable to proper case, i.e. the frist letter is capitalized followed by all lower case letters within a word.

DATA proper;
     INPUT Name $40.;
     propname = propcase(Name);
DATALINES;
rOn coDY
the tall and the short
the "%$#@!" escape
;
RUN;

PROC PRINT data = proper NOOBS;
    TITLE 'The proper data set';
RUN;

SAS Output

The proper data set


Name	propname
rOn coDY	Ron Cody
the tall and the short	The Tall And The Short
the "%$#@!" escape	The "%$#@!" Escape

9.8. Substituting One Word for Another in a String¶

TRANWRD (translate word), can perform a search and replace operation on a string variable. The syntax of the TRANWRD function is:

TRANWRD (char_var,'find_str','replace_str');

where

char_var is the string we are searching within
find_str is the string we are searching for within char_var
replace_str is the string which will be substituted for find_str within char_var if found.

That is, the function will replace every occurrence of find_str with replace_str.

Example

In the following example, we want to standardize addresses by converting the words 'Street', 'Avenue', and 'Road' to the abbreviations 'St.', 'Ave.', and 'Rd.' respectively. Look at the following program:

DATA convert;
     INPUT @1 address $20.;
     *** Convert Street, Avenue, and Road 
         to their abbreviations;
     address = tranwrd(address,'Street','St.');
     address = tranwrd(address,'Avenue','Ave.');
     address = tranwrd(address,'Road','Rd.');
DATALINES;
89 Lazy Brook Road
123 River Rd.
12 Main Street
;
RUN;

PROC PRINT data = convert;
    TITLE 'The convert data set';
RUN;

SAS Output

The convert data set


Obs	address
1	89 Lazy Brook Rd.
2	123 River Rd.
3	12 Main St.

The TRANWRD function must replace the find_str with a non null string, so if the replace_str is given as the empty string ‘’, TRANWRD will instead use an empty space ‘ ‘. If you want to replace with an empty string, you will have to use the TRANSTRN function with trimn(‘’) as the replacement string.

TITLE "Comparison of TRANSWRD and TRANSTRN with "
DATA _null_;
   FILE print;
   string1='*' || tranwrd('abcxabc', 'abc', '') || '*';
   put string1=;
   string2='*' || transtrn('abcxabc', 'abc', '') || '*';
   put string2=;
   string3='*' || transtrn('abcxabc', 'abc', trimn('')) || '*';
   put string3=;
RUN;

SAS Output

The convert data set

string1=* x *                                                                                                                       
string2=* x *                                                                                                                       
string3=*x*

9.9. Fuzzy Comparisons: The SPEDIS Function¶

The SPEDIS function measures the “spelling distance” between two strings. If the two strings are identical, the function returns a 0. For each category of spelling error, the function assigns “penalty” points. For example, if the first letter in the two strings is different, there is a relatively large penalty. If two letters are interchanged, the number of penalty points is smaller. This function is very useful in performing a fuzzy merge where there may be differences in spelling between two files. You can also use it with character data consisting of numerals, such as social security numbers. The syntax of this function is:

SPEDIS(string1,string2);

Example

The following program demonstrates some example distances between strings.

DATA compare;
     length string1 string2 $ 15;
     INPUT string1 string2;
     points = spedis(string1, string2);
DATALINES;
same same
same sam
firstletter xirstletter
lastletter lastlettex
receipt reciept
;
RUN;

PROC PRINT data = compare NOOBS;
    TITLE 'The compare data set';
RUN;

SAS Output

The compare data set


string1	string2	points
same	same	0
same	sam	8
firstletter	xirstletter	18
lastletter	lastlettex	10
receipt	reciept	7

9.10. The ANY Functions¶

The “ANY” functions of SAS 9 allow you to determine the position of a class of characters (digits, alpha-numeric, alpha, white space, or punctuation). The complete list is:

ANYALNUM - find any alphanumeric values (both letter or digit)
ANYALPHA - find any letter
ANYDIGIT - find any number
ANYPUNCT - find any punctuation
ANYSPACE - find any space

These functions return the first position of a character of the appropriate class. If no appropriate characters are found, the functions return a 0.

Example

The following program demonstrates the ANYALPHA and ANYDIGIT functions.

DATA find_alpha_digit;
     input string $20.;
     first_alpha = anyalpha(string);
     first_digit = anydigit(string);
DATALINES;
no digits here
the 3 and 4
123 456 789
;
RUN;

PROC PRINT data = find_alpha_digit NOOBS;
    TITLE 'The find_alpha_digit data set';
RUN;

SAS Output

The find_alpha_digit data set


string	first_alpha	first_digit
no digits here	1	0
the 3 and 4	1	5
123 456 789	0	1

9.11. The NOT functions¶

The “NOT” functions work in a similar way to the “ANY” functions except that they return the position of the first class of character that does not match the designated class.

The complete list is:

NOTALNUM
NOTALPHA
NOTDIGIT
NOTPUNCT
NOTSPACE

These functions return the first position of a character that does not match the appropriate class. If no “bad” characters are found (characters that do not match the designation) the functions return a 0.

Example

The following program demonstrates the NOTALPHA and NOTDIGIT functions.

DATA data_cleaning;
     input string $20.;
     only_alpha = notalpha(trim(string));
     only_digit = notdigit(trim(string));
DATALINES;
abcdefg
1234567
abc123
1234abcd
;
RUN;

PROC PRINT data = data_cleaning NOOBS;
    TITLE 'The data_cleaning data set';
RUN;

SAS Output

The data_cleaning data set


string	only_alpha	only_digit
abcdefg	0	1
1234567	1	0
abc123	4	1
1234abcd	1	5

Notice the use of the TRIM function in the program above. Without it, both NOTALPHA and NOTDIGIT would return the position of the first blank. You can see why we called the data set DATA_CLEANING. It is an easy way to determine if a string contains only a single class of characters.

9.12. Concatenation Functions: CAT, CATS, and CATX¶

Although you can concatenate (join) two strings using the concatenation operator (either || or !!), several new functions and call routines, new with SAS 9, can be used instead. The CAT function does the same thing of concatenating strings just like || or !!, but the advantage of the other new functions is that they can automatically strip off leading and trailing blanks and can insert separation characters for you. We will demonstrate only two of the concatenation functions here. The CATS function strips leading and trailing blanks before joining two or more strings. The CATX function works the same as the CATS function, but allows you to specify one or more separation characters to insert between the strings. The syntax for these two functions is:

CATS(string1,string2,<stringn>);

CATX(separator,string1,string2,<stringn>);

Example

The following program demonstrates the differences between the different concatenation functions.

DATA join_up;
     FILE print;
     length cats $ 6 catx $ 17;
     string1 = 'ABC   ';
     string2 = '   XYZ   ';
     string3 = '12345';
     cat_op = string1 || string2;
     cat  = cat(string1, string2);
     cats = cats(string1, string2);
     catx = catx('***', string1, string2, string3);
     PUT string1=;
     PUT string2=;
     PUT string3=;
     PUT cat_op=;
     PUT cat=;
     PUT cats=;
     PUT catx=;
RUN;

PROC PRINT data = join_up NOOBS;
    TITLE 'The join_up data set';
    VAR string1 string2 string3 cat_op cat cats catx;
RUN;

SAS Output

The join_up data set

string1=ABC                                                                                                                         
string2=XYZ                                                                                                                         
string3=12345                                                                                                                       
cat_op=ABC      XYZ                                                                                                                 
cat=ABC      XYZ                                                                                                                    
cats=ABCXYZ                                                                                                                         
catx=ABC***XYZ***12345

The join_up data set


string1	string2	string3	cat_op	cat	cats	catx
ABC	XYZ	12345	ABC XYZ	ABC XYZ	ABCXYZ	ABC*XYZ*12345

Without the LENGTH statement in this program, the length of the two variables CATS and CATX would be 200 (the default length for the CAT functions). Note that the default length when using the concatenation operator is the sum of the lengths of the arguments to be joined.

9.13. The LENGTH, LENGTHN, and LENGTHN Functions¶

We spoke earlier about storage length of character variables. The collection of LENGTH functions in this section have different and useful purposes. The LENGTHC function (V9) returns the storage length of character variables. The other two functions, LENGTH and LENGTHN both return the length of a character variable not counting trailing blanks. The only difference between LENGTH and LENGTHN is that LENGTHN returns a 0 for a null string while LENGTH returns a 1.

Example

The following program demonstrates the LENGTH, LENGTHN, and LENGTHC functions.

DATA how_long;
   one = 'ABC   ';
   two = ' ';   /* character missing value */
   three = 'ABC   XYZ';
   length_one = length(one);
   lengthn_one = lengthn(one);
   lengthc_one = lengthc(one);
   length_two = length(two);
   lengthn_two = lengthn(two);
   lengthc_two = lengthc(two);
   length_three = length(three);
   lengthn_three = lengthn(three);
   lengthc_three = lengthc(three);
RUN;

PROC PRINT data = how_long NOOBS;
    TITLE 'The how_long data set';
RUN;

SAS Output

The how_long data set


one	two	three	length_one	lengthn_one	lengthc_one	length_two	lengthn_two	lengthc_two	length_three	lengthn_three	lengthc_three
ABC		ABC XYZ	3	3	6	1	0	1	9	9	9

9.14. Comparing Two Strings Using the COMPARE Function¶

You may wonder why we need a function to compare two character strings. Why can’t you simple use an equal sign? The COMPARE function gives you more flexibility in comparing strings. The syntax is:

COMPARE(string1, string2 <,'modifiers'>)

You may use one or more modifiers from the following list, placed in single or double quotes as follows:

i or I - ignore case
l or L - removes leading blanks
n or N - removes quotes from any argument that is an n-literal and ignore case. An n-literal is a string in quotes, followed by an ‘n’, which is useful for non-valid SAS names.
: (color) - truncate the longer string to the length of the shorter string. Note that the default is to pad the shorter string with blanks before a comparison.

For example, if you want to ignore case and remove leading blanks, you could code:

yes_or_no = compare(string1,string2,'il');

The COMPARE function returns the first position in which the two strings are different. It returns a negative number if string1 is aplhabetically or numerically ordered before string2. It returns 0 if the two strings are the same.

DATA compare_str;
    string1 = "   Robert Parker";
    string2 = "ROBERT PARKER";
    compare1 = compare(string1, string2);
    compare2 = compare(string1, string2, 'i');
    compare3 = compare(string1, string2, 'l');
    compare4 = compare(string1, string2, 'il');
RUN;

PROC PRINT data = compare_str;
    TITLE "Compare Strings Dataset";
RUN;

SAS Output

Compare Strings Dataset


Obs	string1	string2	compare1	compare2	compare3	compare4
1	Robert Parker	ROBERT PARKER	-1	-1	2	0

9.15. Removing Leading and Trailing¶

The STRIP function removes leading and trailing blanks. The TRIM function removes trailing blanks, and the LEFT function removes leading blanks.

TITLE "Whitespace Example";

DATA _NULL_;
   FILE print;
   string = "    ROBERT    ";
   strip = '*' || strip(string) || '*';
   trim = '*' || trim(string) || '*';
   left = '*' || left(string) || '*';
   put strip;
   put trim;
   put left;
RUN;

SAS Output

Whitespace Example

*ROBERT*                                                                                                                            
*    ROBERT*                                                                                                                        
*ROBERT        *

9.16. Counting Occurences of Characters or Substrings: COUNT and COUNTC¶

Two other SAS 9 functions are COUNT and COUNTC. COUNT is used to count the number of times a particular substring appears in a string. COUNTC counts the number of times one or more characters appear. You can also use modifiers to ignore case (use the ‘i’ modifier) or ignore trailing blanks (use the ‘t’ modifier) in both the string and the find_string.

The syntax for these two functions is:

count(string,find_string,<'modifiers'>)

countc(string,find_string,<'modifiers'>)

Example

The following program illustrates the COUNT and COUNTC functions

DATA Dracula;
    INPUT string $20.;
    count_a_or_b = count(string,'ab');
    countc_a_or_b = countc(string,'ab');
    count_abc = count(string,'abc');
    countc_abc = countc(string,'abc');
    case_a = countc(string,'a','i');
DATALINES;
xxabcxabcxxbbbb
cbacba
aaAA
;
RUN;

PROC PRINT data = Dracula NOOBS;
    TITLE 'The Dracula data set';
RUN;

SAS Output

The Dracula data set


string	count_a_or_b	countc_a_or_b	count_abc	countc_abc	case_a
xxabcxabcxxbbbb	2	8	2	10	2
cbacba	0	4	0	6	2
aaAA	0	2	0	2	4

9.17. PRX Functions: Regular Expressions¶

SAS also has a family of PRX functions that allow string matching based on regular expressions instead of fixed strings. What are regular expressions?

Ways to search for specific strings
Can be very complicated or simple
Highly useful - think “Find” on steriods
http://www.regular-expressions.info/reference.html
They can be used to match a large number of strings in one statement
. matches any single character
* means repeate as many (even if 0) more times the last character
? makes the last thing optional
^ matches the start of a vector ^a - starst with “a”
$ matches the end of the vector b$ - ends with “b”

We will cover the following SAS functions which use regular expressions

PRXPARSE - defines the regular expression (and substitution string) for the SAS functions
PRXMATCH - search the string for a given regular expression from PRXPARSE and returns the starting position if a match is found and 0 otherwise
CALL PRXSUBSTR - similar to PRXMATCH, search the string for the given regular expression from PRXPARSE but can find both the starting position and length of the match or return 0 for starting postion and length if no match is found.
PRXCHANGE - search and replace the regular expression from PRXPARSE with the given replacement string. There is a parameter to specify how many matches to replace. A value of -1 for how many times to replace mean to replace all matches found.

Let’s use the salaries dataset to illustrate these functions.

FILENAME salaries '/folders/myfolders/SAS_Notes/data/Baltimore_City_Employee_Salaries_FY2015.csv';

PROC IMPORT datafile = salaries out = sal dbms = CSV replace;
   getnames = yes;
RUN;

PROC PRINT data = sal (obs=5);
   VAR name;
RUN;

SAS Output

The SAS System


Obs	name
1	Aaron,Patricia G
2	Aaron,Petra L
3	Abaineh,Yohannes T
4	Abbene,Anthony M
5	Abbey,Emmanuel

Example

The following SAS program shows how to use regular expressions to find all names that include 'Rawlings'.

DATA sal_rawlings;
   set sal;
   re = prxparse('/Rawlings/i');
   if missing(re) then do;
      putlog 'ERROR: regex is malformed';
      stop;
      end;
   if prxmatch(re, name) = 0 THEN DELETE;
RUN;

PROC PRINT data = sal_rawlings;
   VAR name;
RUN;

SAS Output

The SAS System


Obs	name
1	Rawlings,Kellye A
2	Rawlings,Paula M
3	Rawlings-Blake,Stephani

The variable re stores the regular expression defined by PRXPARSE. Note the the regular expression must occur between two forward slashes '/regex/'. The i in the regular expression definition '/Rawlings/i' tells SAS to ignore case when doing the search, so both 'Rawlings' and 'rawlings' will match. We then pass the regular expression to PRXMATCH to search the name variable. If there is a match somewhere it returns a non-zero index. Note that we have an if statement to catch mistakes in our regular expression and stop the program if we have made a mistake.

PRXSUBSTR works similarly but has a two variables that you pass in called `pos` and `len` in which to store the starting position and length of the string match. This would be needed for a followup call to SUBSTR in order to find and possible replace with substring.

Example

The following SAS program used PRXCHANGE to replace "a" with "j" in the names of the salary dataset.

DATA sal_a1;
   set sal;
   re = prxparse('s/a/j/');
   if missing(re) then do;
      putlog 'ERROR: regex is malformed';
      stop;
      end;
   name2 = prxchange(re, 1, name);
RUN;

PROC PRINT data = sal_a1(obs=5);
   VAR name name2;
RUN;

DATA sal_a2;
   set sal;
   re = prxparse('s/a/j/');
   if missing(re) then do;
      putlog 'ERROR: regex is malformed';
      stop;
      end;
   name2 = prxchange(re, -1, name);
RUN;

PROC PRINT data = sal_a2(obs=5);
   VAR name name2;
RUN;

SAS Output

The SAS System


Obs	name	name2
1	Aaron,Patricia G	Ajron,Patricia G
2	Aaron,Petra L	Ajron,Petra L
3	Abaineh,Yohannes T	Abjineh,Yohannes T
4	Abbene,Anthony M	Abbene,Anthony M
5	Abbey,Emmanuel	Abbey,Emmjnuel

The SAS System


Obs	name	name2
1	Aaron,Patricia G	Ajron,Pjtricij G
2	Aaron,Petra L	Ajron,Petrj L
3	Abaineh,Yohannes T	Abjineh,Yohjnnes T
4	Abbene,Anthony M	Abbene,Anthony M
5	Abbey,Emmanuel	Abbey,Emmjnuel

Notice that the first call used 1 for the number of substitutions and so it only replaces the first found match of "a" with "j"; Where as, the second version uses -1 and so replaces all occurences of "a" with "j".

9.18. Exercises¶

Read in the Baltimore city employee salaries dataset, Baltimore_City_Employee_Salaries_FY2015.csv, using PROC IMPORT.
Make a temporary SAS dataset called health_sal using the salaries dataset, with only the agencies (JobTitle) of those with “fire” (anywhere in the job title), if any, in the name. Be sure to ignore case in the string matching for ‘fire’.
Make a temporary data set called trans which contains only agencies that contain “TRANS” (case sensitive).
What is/are the profession(s) of people who have “abra” in their name for Baltimore’s Salaries? Case should be ignored.
Create a subset of the salaries dataset that only includes the Police Department, Fire Department and Sheriff’s Office. Use the Agency variable with (case sensitive) string matching. Call this emer. How many employees are in this new dataset?
Create a variable called dept in the emer data set by extracting the the part of the Agency variable up until ‘ment’ or ‘ice’. E.g. we want to extract all characters up until ment or ice (we can group in regex using parentheses) and then discard the rest. Make frequency table of the dept variable using PROC FREQ. To extract the desired substring, use:


re = prxparse('/.*(ment|ice)/');
call prxsubstr(re, Agency, pos, len);
dept = substr(Agency, pos, len);

8. SAS Formats and Dates 10. Data Cleaning

Intro to SAS Notes

9. Character Functions¶

9.1. Lengths of Character Variables¶

Example

9.2. Removing Characters from a String¶

Example

Example

9.3. Character Data Verification¶

Example

9.4. Extracting Substrings¶

Example

Example

9.5. Parsing a String with SCAN¶

Example

Example

9.6. Locating the Position of One String Within Another String¶

Example

9.7. Converting Between Lower and Upper Case and Proper Case¶

Example

Example

9.8. Substituting One Word for Another in a String¶

Example

9.9. Fuzzy Comparisons: The SPEDIS Function¶

Example

9.10. The ANY Functions¶

Example

9.11. The NOT functions¶

Example

9.12. Concatenation Functions: CAT, CATS, and CATX¶

Example

9.13. The LENGTH, LENGTHN, and LENGTHN Functions¶

Example

9.14. Comparing Two Strings Using the COMPARE Function¶

9.15. Removing Leading and Trailing¶

9.16. Counting Occurences of Characters or Substrings: COUNT and COUNTC¶

Example

9.17. PRX Functions: Regular Expressions¶

Example

Example

9.18. Exercises¶