9. Character Functions¶
In this lesson, we’ll investigate some of the functions available in SAS that can be applied only to character variables. For example, if you want to remove blanks from a character string, you might consider using the compress function. Or, if you want to select a smaller substring, say a first name, from a larger string containing one’s full name, you might want to take advantage of the substr function. Some of the functions that we will learn about are old standbys, such as: length, substr, compbl, compress, verify, input, put, tranwrd, scan, trim, upcase, lowcase, | | (concatenation), index, indexc, and spedis. And, some of the functions that we will learn about are new just to SAS Version 9. They include: anyalpha, anydigit, catx, cats, lengthc, propcase, strip, count, and countc.
Let’s also take this opportunity to introduce you to a couple of great resources for finding information about a variety of SAS topics. One resource is sasCommunity.org. You can learn more about the sasCommunity.org just by mucking around for a bit on the site. Another resource is the SAS Global Forum Conference which is held each year. However, the really helpful place to end up is a resource that allows you to search for previous papers and presentations from the SAS Global Forum (annual conferences for SAS Users). It is using this search engine where you’ll find the material this lesson is based on!
For this lesson, we will work through Ron Cody’s paper An Introduction to SAS Character Function.
9.1. Lengths of Character Variables¶
Before we actually discuss these functions, we need to understand how SAS software assigns storage lengths to character variables. It is important to remember two things: 1) The storage length of a character variable is set at compile time. and 2) this length is determined by the first appearance of a character variable in a DATA step. There are several ways to check the storage length of character variables in your SAS data set. One way is to run PROC CONTENTS. If you are using SAS 9 and above, the new function LENGTHC can be used to determine the storage length of a character variable.
DATA chars1; FILE print; string = 'abc'; length string $ 7; /* Does this do anything */ storage_length = lengthc(string); display = ":" || string || ":"; put storage_length= ; put display= ; RUN;
SAS Connection established. Subprocess id is 8422
DATA chars2; FILE print; length string $ 7; /* Does this do anything */ string = 'abc'; storage_length = lengthc(string); display = ":" || string || ":"; put storage_length= ; put display= ; RUN;
storage_length=7 display=:abc :
9.2. Removing Characters from a String¶
In this section we will examine using the different COMPRESS functions to remove character values from SAS character strings.
DATA multiple; INPUT #1 @1 name $20. #2 @1 address $30. #3 @1 city $15. @20 state $2. @25 zip $5.; name = compbl(name); address = compbl(address); city = compbl(city); DATALINES; Ron Cody 89 Lazy Brook Road Flemington NJ 08822 Bill Brown 28 Cathy Street North City NY 11518 ; RUN; PROC PRINT data = multiple; RUN;
|1||Ron Cody||89 Lazy Brook Road||Flemington||NJ||0882|
|2||Bill Brown||28 Cathy Street||North City||NY||1151|
A more general problem is to remove selected characters from a string. For example, suppose you want to remove blanks, parentheses, and dashes from a phone number that has been stored as a character value. Here comes the COMPRESS function to the rescue! The COMPRESS function can remove any number of specified characters from a character variable.
DATA phone; INPUT phone $ 1-15; phone1 = compress(phone); phone2 = compress(phone,'(-) '); DATALINES; (908)235-4490 (201) 555-77 99 ; RUN; PROC PRINT data = phone NOOBS; TITLE 'The phone data set'; RUN;
|(201) 555-77 99||(201)555-7799||2015557799|
9.3. Character Data Verification¶
A common task in data processing is to validate data. For example, you may want to be sure that only certain values are present in a character variable.
DATA verify; INPUT @1 id $3. @5 answer $5.; position = verify(answer,'abcde'); DATALINES; 001 acbed 002 abxde 003 12cce 004 abc e ; RUN; PROC PRINT data = verify NOOBS; TITLE 'The verify data set'; RUN;
DATA trailing; length string $ 10; string = 'abc'; pos1 = verify(string,'abcde'); pos2 = verify(trim(string),'abcde'); RUN; PROC PRINT data = trailing NOOBS; TITLE 'The trailing data set'; RUN;
9.4. Extracting Substrings¶
A substring is a part a longer string (although it can actually be the same length but this would not be too useful). To extract a substring in SAS we can use the SUBSTR function.
DATA pieces_parts; INPUT id $ 1-9; length state $ 2; state = substr(id,1,2); num = input(substr(id,7,3),3.); DATALINES; NYXXXX123 NJ1234567 ; RUN; PROC PRINT data = pieces_parts NOOBS; TITLE 'The pieces_parts data set'; RUN;
There is a particularly useful and somewhat obscure use of the SUBSTR function that we would like to discuss next. You can use this function to place characters in specific locations within a string by placing the SUBSTR function on the left hand side of the equal sign (in the older manuals I think this was called a SUBSTR pesudo function).
DATA pressure; INPUT sbp dbp @@; length sbp_chk dbp_chk $ 4; sbp_chk = put(sbp,3.); dbp_chk = put(dbp,3.); if sbp gt 160 then substr(sbp_chk,4,1) = '*'; if dbp gt 90 then substr(dbp_chk,4,1) = '*'; DATALINES; 120 80 180 92 200 110 ; RUN; PROC PRINT data = pressure NOOBS; TITLE 'The pressure data set'; RUN;
9.5. Parsing a String with SCAN¶
Parsing a string means to take it apart based on some rules. We can parse a string by using the SCAN function if we know what character values are used to separate different parts of the string.
DATA parse; INPUT long_str $ 1-80; array pieces $ 10 piece1-piece5; do i = 1 to 5; pieces[i] = scan(long_str,i,',.! '); end; drop long_str i; DATALINES; this line,contains!five.words abcdefghijkl xxx yyy ; RUN; PROC PRINT data = parse NOOBS; TITLE 'The parse data set'; RUN;
DATA first_last; INPUT @1 name $20. @21 phone $13.; *** The next statement extracts the last name from name; last_name = scan(name,-1,' '); /* scans from the right */ DATALINES; Jeff W. Snoker (908)782-4382 Raymond Albert (732)235-4444 Alred Edward Newman (800)123-4321 Steven J. Foster (201)567-9876 Jose Romerez (516)593-2377 ; RUN; PROC REPORT data = first_last NOWINDOWS; TITLE 'Names and Phone Numbers in Alphabetical Order (by Last Name)'; COLUMNS name phone last_name; DEFINE last_name / order noprint width = 20; DEFINE name / display 'Name' left width = 20; DEFINE phone / display 'Phone Number' width = 13 format=$13.; RUN;
|Steven J. Foster||(201)567-98|
|Alred Edward Newman||(800)123-43|
|Jeff W. Snoker||(908)782-43|
9.6. Locating the Position of One String Within Another String¶
Two somewhat similar functions, INDEX and INDEXC can be used to locate a string, or one of several strings within a longer string. For example, if you have a string ‘ABCDEFG’ and want the location of the letters DEF (starting position 4), the following INDEX function could be used:
The first argument is the argument you want to search, the second argument is the string you are searching for. This would return a value of 4, the starting position of the string ‘DEF’. If you want to know the starting position of any one of several characters, the INDEXC function can be used. As an example, if you wanted the starting position of any of the letters ‘G’, ‘C’, ‘B’, or ‘F’ in the string ‘ABCDEFG’, you would code:
The function would return a value of 2, the position of the ‘B’, the first letter found in the first argument.
DATA locate; INPUT string $ 1-10; first = index(string,'xyz'); first_c = indexc(string,'x','y','z'); DATALINES; abcxyz1234 1234567890 abcx1y2z39 abczzzxyz3 ; RUN; PROC PRINT data = locate NOOBS; TITLE 'The locate data set'; RUN;
9.7. Converting Between Lower and Upper Case and Proper Case¶
The two companion functions UPCASE and LOWCASE do just what you would expect. These two functions are especially useful when data entry clerks are careless and a mixture of upper and lower cases values are entered for the same variable. You may want to place all of your character variables in an array and UPCASE (or LOWCASE) them all.
DATA up_down; length a b c d e $ 1; INPUT a b c d e x y; DATALINES; M f P p D 1 2 m f m F M 3 4 ; RUN; DATA upper; SET up_down; array all_c[*] _character_; DO i = 1 to dim(all_c); all_c[i] = upcase(all_c[i]); END; DROP i; RUN; PROC PRINT data = upper NOOBS; TITLE 'The upper data set'; RUN;
A handy new V9 function is PROPCASE. This function capitalizes the first letter of each word.
DATA proper; INPUT Name $40.; propname = propcase(Name); DATALINES; rOn coDY the tall and the short the "%$#@!" escape ; RUN; PROC PRINT data = proper NOOBS; TITLE 'The proper data set'; RUN;
|rOn coDY||Ron Cody|
|the tall and the short||The Tall And The Short|
|the "%$#@!" escape||The "%$#@!" Escape|
9.8. Substituting One Word for Another in a String¶
TRANWRD (translate word), can perform a search and replace operation on a string variable. The syntax of the TRANWRD function is:
char_var is the string we are searching within
find_str is the string we are searching for within char_var
replace_str is the string which will be substituted for find_str within char_var if found.
That is, the function will replace every occurrence of find_str with replace_str.
DATA convert; INPUT @1 address $20.; *** Convert Street, Avenue, and Road to their abbreviations; address = tranwrd(address,'Street','St.'); address = tranwrd(address,'Avenue','Ave.'); address = tranwrd(address,'Road','Rd.'); DATALINES; 89 Lazy Brook Road 123 River Rd. 12 Main Street ; RUN; PROC PRINT data = convert; TITLE 'The convert data set'; RUN;
|1||89 Lazy Brook Rd.|
|2||123 River Rd.|
|3||12 Main St.|
The TRANWRD function must replace the find_str with a non null string, so if the replace_str is given as the empty string ‘’, TRANWRD will instead use an empty space ‘ ‘. If you want to replace with an empty string, you will have to use the TRANSTRN function with trimn(‘’) as the replacement string.
TITLE "Comparison of TRANSWRD and TRANSTRN with " DATA _null_; FILE print; string1='*' || tranwrd('abcxabc', 'abc', '') || '*'; put string1=; string2='*' || transtrn('abcxabc', 'abc', '') || '*'; put string2=; string3='*' || transtrn('abcxabc', 'abc', trimn('')) || '*'; put string3=; RUN;
string1=* x * string2=* x * string3=*x*
9.9. Fuzzy Comparisons: The SPEDIS Function¶
The SPEDIS function measures the “spelling distance” between two strings. If the two strings are identical, the function returns a 0. For each category of spelling error, the function assigns “penalty” points. For example, if the first letter in the two strings is different, there is a relatively large penalty. If two letters are interchanged, the number of penalty points is smaller. This function is very useful in performing a fuzzy merge where there may be differences in spelling between two files. You can also use it with character data consisting of numerals, such as social security numbers. The syntax of this function is:
DATA compare; length string1 string2 $ 15; INPUT string1 string2; points = spedis(string1, string2); DATALINES; same same same sam firstletter xirstletter lastletter lastlettex receipt reciept ; RUN; PROC PRINT data = compare NOOBS; TITLE 'The compare data set'; RUN;
9.10. The ANY Functions¶
The “ANY” functions of SAS 9 allow you to determine the position of a class of characters (digits, alpha-numeric, alpha, white space, or punctuation). The complete list is:
ANYALNUM- find any alphanumeric values (both letter or digit)
ANYALPHA- find any letter
ANYDIGIT- find any number
ANYPUNCT- find any punctuation
ANYSPACE- find any space
These functions return the first position of a character of the appropriate class. If no appropriate characters are found, the functions return a 0.
DATA find_alpha_digit; input string $20.; first_alpha = anyalpha(string); first_digit = anydigit(string); DATALINES; no digits here the 3 and 4 123 456 789 ; RUN; PROC PRINT data = find_alpha_digit NOOBS; TITLE 'The find_alpha_digit data set'; RUN;
|no digits here||1||0|
|the 3 and 4||1||5|
|123 456 789||0||1|
9.11. The NOT functions¶
The “NOT” functions work in a similar way to the “ANY” functions except that they return the position of the first class of character that does not match the designated class.
The complete list is:
These functions return the first position of a character that does not match the appropriate class. If no “bad” characters are found (characters that do not match the designation) the functions return a 0.
DATA data_cleaning; input string $20.; only_alpha = notalpha(trim(string)); only_digit = notdigit(trim(string)); DATALINES; abcdefg 1234567 abc123 1234abcd ; RUN; PROC PRINT data = data_cleaning NOOBS; TITLE 'The data_cleaning data set'; RUN;
9.12. Concatenation Functions: CAT, CATS, and CATX¶
Although you can concatenate (join) two strings using the concatenation operator (either || or !!), several new functions and call routines, new with SAS 9, can be used instead. The CAT function does the same thing of concatenating strings just like || or !!, but the advantage of the other new functions is that they can automatically strip off leading and trailing blanks and can insert separation characters for you. We will demonstrate only two of the concatenation functions here. The CATS function strips leading and trailing blanks before joining two or more strings. The CATX function works the same as the CATS function, but allows you to specify one or more separation characters to insert between the strings. The syntax for these two functions is:
DATA join_up; FILE print; length cats $ 6 catx $ 17; string1 = 'ABC '; string2 = ' XYZ '; string3 = '12345'; cat_op = string1 || string2; cat = cat(string1, string2); cats = cats(string1, string2); catx = catx('***', string1, string2, string3); PUT string1=; PUT string2=; PUT string3=; PUT cat_op=; PUT cat=; PUT cats=; PUT catx=; RUN; PROC PRINT data = join_up NOOBS; TITLE 'The join_up data set'; VAR string1 string2 string3 cat_op cat cats catx; RUN;
string1=ABC string2=XYZ string3=12345 cat_op=ABC XYZ cat=ABC XYZ cats=ABCXYZ catx=ABC***XYZ***12345
|ABC||XYZ||12345||ABC XYZ||ABC XYZ||ABCXYZ||ABC***XYZ***12345|
9.13. The LENGTH, LENGTHN, and LENGTHN Functions¶
We spoke earlier about storage length of character variables. The collection of LENGTH functions in this section have different and useful purposes. The LENGTHC function (V9) returns the storage length of character variables. The other two functions, LENGTH and LENGTHN both return the length of a character variable not counting trailing blanks. The only difference between LENGTH and LENGTHN is that LENGTHN returns a 0 for a null string while LENGTH returns a 1.
DATA how_long; one = 'ABC '; two = ' '; /* character missing value */ three = 'ABC XYZ'; length_one = length(one); lengthn_one = lengthn(one); lengthc_one = lengthc(one); length_two = length(two); lengthn_two = lengthn(two); lengthc_two = lengthc(two); length_three = length(three); lengthn_three = lengthn(three); lengthc_three = lengthc(three); RUN; PROC PRINT data = how_long NOOBS; TITLE 'The how_long data set'; RUN;
9.14. Comparing Two Strings Using the COMPARE Function¶
You may wonder why we need a function to compare two character strings. Why can’t you simple use an equal sign? The COMPARE function gives you more flexibility in comparing strings. The syntax is:
COMPARE(string1, string2 <,'modifiers'>)
You may use one or more modifiers from the following list, placed in single or double quotes as follows:
i or I - ignore case
l or L - removes leading blanks
n or N - removes quotes from any argument that is an n-literal and ignore case. An n-literal is a string in quotes, followed by an ‘n’, which is useful for non-valid SAS names.
: (color) - truncate the longer string to the length of the shorter string. Note that the default is to pad the shorter string with blanks before a comparison.
For example, if you want to ignore case and remove leading blanks, you could code:
yes_or_no = compare(string1,string2,'il');
The COMPARE function returns the first position in which the two strings are different. It returns a negative number if string1 is aplhabetically or numerically ordered before string2. It returns 0 if the two strings are the same.
DATA compare_str; string1 = " Robert Parker"; string2 = "ROBERT PARKER"; compare1 = compare(string1, string2); compare2 = compare(string1, string2, 'i'); compare3 = compare(string1, string2, 'l'); compare4 = compare(string1, string2, 'il'); RUN; PROC PRINT data = compare_str; TITLE "Compare Strings Dataset"; RUN;
|1||Robert Parker||ROBERT PARKER||-1||-1||2||0|
9.15. Removing Leading and Trailing¶
The STRIP function removes leading and trailing blanks. The TRIM function removes trailing blanks, and the LEFT function removes leading blanks.
TITLE "Whitespace Example"; DATA _NULL_; FILE print; string = " ROBERT "; strip = '*' || strip(string) || '*'; trim = '*' || trim(string) || '*'; left = '*' || left(string) || '*'; put strip; put trim; put left; RUN;
*ROBERT* * ROBERT* *ROBERT *
9.16. Counting Occurences of Characters or Substrings: COUNT and COUNTC¶
Two other SAS 9 functions are COUNT and COUNTC. COUNT is used to count the number of times a particular substring appears in a string. COUNTC counts the number of times one or more characters appear. You can also use modifiers to ignore case (use the ‘i’ modifier) or ignore trailing blanks (use the ‘t’ modifier) in both the string and the find_string.
The syntax for these two functions is:
DATA Dracula; INPUT string $20.; count_a_or_b = count(string,'ab'); countc_a_or_b = countc(string,'ab'); count_abc = count(string,'abc'); countc_abc = countc(string,'abc'); case_a = countc(string,'a','i'); DATALINES; xxabcxabcxxbbbb cbacba aaAA ; RUN; PROC PRINT data = Dracula NOOBS; TITLE 'The Dracula data set'; RUN;
9.17. PRX Functions: Regular Expressions¶
SAS also has a family of PRX functions that allow string matching based on regular expressions instead of fixed strings. What are regular expressions?
Ways to search for specific strings
Can be very complicated or simple
Highly useful - think “Find” on steriods
They can be used to match a large number of strings in one statement
.matches any single character
*means repeate as many (even if 0) more times the last character
?makes the last thing optional
^matches the start of a vector
^a- starst with “a”
$matches the end of the vector
b$- ends with “b”
We will cover the following SAS functions which use regular expressions
PRXPARSE- defines the regular expression (and substitution string) for the SAS functions
PRXMATCH- search the string for a given regular expression from PRXPARSE and returns the starting position if a match is found and 0 otherwise
CALL PRXSUBSTR - similar to PRXMATCH, search the string for the given regular expression from PRXPARSE but can find both the starting position and length of the match or return 0 for starting postion and length if no match is found.
PRXCHANGE - search and replace the regular expression from PRXPARSE with the given replacement string. There is a parameter to specify how many matches to replace. A value of -1 for how many times to replace mean to replace all matches found.
Let’s use the salaries dataset to illustrate these functions.
FILENAME salaries '/folders/myfolders/SAS_Notes/data/Baltimore_City_Employee_Salaries_FY2015.csv'; PROC IMPORT datafile = salaries out = sal dbms = CSV replace; getnames = yes; RUN; PROC PRINT data = sal (obs=5); VAR name; RUN;
DATA sal_rawlings; set sal; re = prxparse('/Rawlings/i'); if missing(re) then do; putlog 'ERROR: regex is malformed'; stop; end; if prxmatch(re, name) = 0 THEN DELETE; RUN; PROC PRINT data = sal_rawlings; VAR name; RUN;
DATA sal_a1; set sal; re = prxparse('s/a/j/'); if missing(re) then do; putlog 'ERROR: regex is malformed'; stop; end; name2 = prxchange(re, 1, name); RUN; PROC PRINT data = sal_a1(obs=5); VAR name name2; RUN; DATA sal_a2; set sal; re = prxparse('s/a/j/'); if missing(re) then do; putlog 'ERROR: regex is malformed'; stop; end; name2 = prxchange(re, -1, name); RUN; PROC PRINT data = sal_a2(obs=5); VAR name name2; RUN;
|1||Aaron,Patricia G||Ajron,Patricia G|
|2||Aaron,Petra L||Ajron,Petra L|
|3||Abaineh,Yohannes T||Abjineh,Yohannes T|
|4||Abbene,Anthony M||Abbene,Anthony M|
|1||Aaron,Patricia G||Ajron,Pjtricij G|
|2||Aaron,Petra L||Ajron,Petrj L|
|3||Abaineh,Yohannes T||Abjineh,Yohjnnes T|
|4||Abbene,Anthony M||Abbene,Anthony M|
Read in the Baltimore city employee salaries dataset, Baltimore_City_Employee_Salaries_FY2015.csv, using PROC IMPORT.
Make a temporary SAS dataset called
health_salusing the salaries dataset, with only the agencies (
JobTitle) of those with “fire” (anywhere in the job title), if any, in the name. Be sure to ignore case in the string matching for ‘fire’.
Make a temporary data set called
transwhich contains only agencies that contain “TRANS” (case sensitive).
What is/are the profession(s) of people who have “abra” in their name for Baltimore’s Salaries? Case should be ignored.
Create a subset of the salaries dataset that only includes the Police Department, Fire Department and Sheriff’s Office. Use the Agency variable with (case sensitive) string matching. Call this
emer. How many employees are in this new dataset?
Create a variable called
emerdata set by extracting the the part of the
Agencyvariable up until ‘ment’ or ‘ice’. E.g. we want to extract all characters up until ment or ice (we can group in regex using parentheses) and then discard the rest. Make frequency table of the dept variable using PROC FREQ. To extract the desired substring, use:
re = prxparse('/.*(ment|ice)/'); call prxsubstr(re, Agency, pos, len); dept = substr(Agency, pos, len);