{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Character Functions\n", "\n", "In this lesson, we'll investigate some of the functions available in SAS that can be applied only to character variables. For example, if you want to remove blanks from a character string, you might consider using the **compress** function. Or, if you want to select a smaller substring, say a first name, from a larger string containing one's full name, you might want to take advantage of the **substr** function. Some of the functions that we will learn about are old standbys, such as: **length**, **substr**, **compbl**, **compress**, **verify**, **input**, **put**, **tranwrd**, **scan**, **trim**, **upcase**, **lowcase**, **| |** (concatenation), **index**, **indexc**, and **spedis**. And, some of the functions that we will learn about are new just to SAS Version 9. They include: **anyalpha**, **anydigit**, **catx**, **cats**, **lengthc**, **propcase**, **strip**, **count**, and **countc**.\n", "\n", "Let's also take this opportunity to introduce you to a couple of great resources for finding information about a variety of SAS topics. One resource is sasCommunity.org. You can learn more about the sasCommunity.org just by mucking around for a bit on the site. Another resource is the SAS Global Forum Conference which is held each year. However, the really helpful place to end up is a resource that allows you to search for previous papers and presentations from the SAS Global Forum (annual conferences for SAS Users). It is using this search engine where you'll find the material this lesson is based on!\n", "\n", "For this lesson, we will work through Ron Cody's paper An Introduction to SAS Character Function.\n", "\n", "## Lengths of Character Variables\n", "\n", "Before we actually discuss these functions, we need to understand how SAS software assigns storage lengths to character variables. It is important to remember two things: 1) The storage length of a character variable is set at compile time. and 2) this length is determined by the first appearance of a character variable in a DATA step. There are several ways to check the storage length of character variables in your SAS data set. One way is to run PROC CONTENTS. If you are using SAS 9 and above, the new function LENGTHC can be used to determine the storage length of a character variable. \n", "\n", "
\n", "

Example

\n", "

The following SAS program illustrates properties of the length of a character varible in SAS.

\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SAS Connection established. Subprocess id is 8422\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "
\n",
       "storage_length=3                                                                                                                    \n",
       "display=:abc:                                                                                                                       \n",
       "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA chars1;\n", " FILE print;\n", " string = 'abc';\n", " length string $ 7; /* Does this do anything */\n", " storage_length = lengthc(string);\n", " display = \":\" || string || \":\";\n", " put storage_length= ;\n", " put display= ;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Let's examine the program. We are creating a temporary dataset called chars1. The FILE statement determines where output from PUT statement is sent to. In this case, the use of the PRINT option sends output from PUT to the ODS output destination used for all other output such as plots and other output from PROC steps. We then create a character variable called string that contains the string 'abc'. Since we create the variable first, the next line where we set the length of string to be 7 characters is ignored. We can see that this happens by using the lengthc function to save the length of the string variable to the numeric variable storage_length which is then output by using the PUT statments. Note that both PUT statements use the format

\n", "
PUT <varname>=;
\n", "

which results in both printing out the variables name along with it's stored value as we can see in the output shown above. Note that we have also created another string called display that contains the character value stored in string with semicolons appended on either end by using the string concatenation operators ||.

\n", "

In order to correctly set the length of string to the desired length of 7 characters, we must run the LENGTH statement first before assigning a value to string as in the following program.

\n", "
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "
\n",
       "storage_length=7                                                                                                                    \n",
       "display=:abc    :                                                                                                                   \n",
       "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA chars2;\n", " FILE print;\n", " length string $ 7; /* Does this do anything */\n", " string = 'abc';\n", " storage_length = lengthc(string);\n", " display = \":\" || string || \":\";\n", " put storage_length= ;\n", " put display= ;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Now that we have switched the order of the LENGTH statement and creating the variable string, the length of string is now set to 7 instead of 3 and whitespace has been added to pad the extra character values to make it length 7 as shown in the display variable.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Removing Characters from a String\n", "\n", "In this section we will examine using the different COMPRESS functions to remove character values from SAS character strings.\n", "\n", "
\n", "

Example

\n", "

This example will demonstrate how to convert multiple blanks to a single blank. Suppose you have some names and addresses in a file. Some of the data entry clerks placed extra spaces between the first and last names and in the address fields. You would like to store all names and addresses with single blanks. Here is an example of how this is done:

\n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obsnameaddresscitystatezip
1Ron Cody89 Lazy Brook RoadFlemingtonNJ0882
2Bill Brown28 Cathy StreetNorth CityNY1151
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA multiple;\n", " INPUT #1 @1 name $20.\n", " #2 @1 address $30.\n", " #3 @1 city $15.\n", " @20 state $2.\n", " @25 zip $5.;\n", " name = compbl(name);\n", " address = compbl(address);\n", " city = compbl(city);\n", "DATALINES;\n", "Ron Cody\n", "89 Lazy Brook Road\n", "Flemington NJ 08822\n", "Bill Brown\n", "28 Cathy Street\n", "North City NY 11518\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = multiple;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The compbl function reduces anywhere it finds multiple successive blanks into a single blank. Note that it removed the extra white space between 28 and Cathy and Street in the second address.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A more general problem is to remove selected characters from a string. For example, suppose you want to remove blanks, parentheses, and dashes from a phone number that has been stored as a character value. Here comes the **COMPRESS** function to the rescue! The COMPRESS function can remove any number of specified characters from a character variable.\n", "\n", "
\n", "

Example

\n", "

The program below uses the COMPRESS function twice. The first time, to remove blanks from the string, and the second to remove blanks plus the other above mentioned characters. Here is the code:

\n", "
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The phone data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
phonephone1phone2
(908)235-4490(908)235-44909082354490
(201) 555-77 99(201)555-77992015557799
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA phone;\n", " INPUT phone $ 1-15;\n", " phone1 = compress(phone);\n", " phone2 = compress(phone,'(-) ');\n", "DATALINES;\n", "(908)235-4490\n", "(201) 555-77 99\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = phone NOOBS;\n", " TITLE 'The phone data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The variable PHONE1 has just blanks removed. Notice that the COMPRESS function does not have a second argument here. When it is omitted, the COMPRESS function removes only blanks. For the variable PHONE2, the second argument of the COMPRESS function contains a list of the characters to remove: left parenthesis, blank, right parenthesis, and dash. This string is placed in single or double quotes. Remember, when you specify a list of characters to remove, blanks are no longer included unless you explicitly include a blank in the list.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Character Data Verification\n", "\n", "A common task in data processing is to validate data. For example, you may want to be sure that only certain values are present in a character variable. \n", "\n", "
\n", "

Example

\n", "

In the example below, only the values 'A', 'B', 'C', 'D', and 'E' are valid data values. A very easy way to test if there are any invalid characters present is shown next:

\n", "
" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The verify data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
idanswerposition
001acbed0
002abxde3
00312cce1
004abc e4
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA verify;\n", " INPUT @1 id $3.\n", " @5 answer $5.;\n", " position = verify(answer,'abcde');\n", "DATALINES;\n", "001 acbed\n", "002 abxde\n", "003 12cce\n", "004 abc e\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = verify NOOBS;\n", " TITLE 'The verify data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The workhorse of this example is the VERIFY function. It is a bit complicated. It inspects every character in the first argument and, if it finds any value not in the verify string (the second argument), it will return the position of the first offending value. If all the values of the string are located in the verify string, a value of 0 is returned. To help clarify this, look at the output shown above.

\n", "

One thing to be careful of when using the VERIFY function (and many of the other character functions) is trailing blanks. For example, look at the following:

\n", "
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The trailing data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
stringpos1pos2
abc40
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA trailing;\n", " length string $ 10;\n", " string = 'abc';\n", " pos1 = verify(string,'abcde');\n", " pos2 = verify(trim(string),'abcde');\n", "RUN;\n", "\n", "PROC PRINT data = trailing NOOBS;\n", " TITLE 'The trailing data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The LENGTH statement forces SAS to add 7 trailing blanks to pad string to be of length 10, so the value of POS1 is 4, the position of the first trailing blank. One way to avoid the trailing blank problem is to remove the trailing blanks before using the verify function. The TRIM function does this for us. We use TRIM first for POS2, so now with the trailing blanks removed there are no invalid characters and VERIFY returns 0. Do note that TRIM only removes leading and trailing blanks, so this would not work if the blanks were in the middle of the string. In this case, you could add a blank to the verify string or use compress to remove all the blanks.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting Substrings\n", "\n", "A substring is a part a longer string (although it can actually be the same length but this would not be too useful). To extract a substring in SAS we can use the SUBSTR function.\n", "\n", "
\n", "

Example

\n", "

In this example, you have ID codes which contain in the first two positions, a state abbreviation. Furthermore, positions 7-9 contain a numeric code. You want to create two new variables; one containing the two digit state codes and the other, a numeric variable constructed from the three numerals in positions 7,8, and 9. Here goes:

\n", "
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The pieces_parts data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
idstatenum
NYXXXX123NY123
NJ1234567NJ567
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA pieces_parts;\n", " INPUT id $ 1-9;\n", " length state $ 2;\n", " state = substr(id,1,2);\n", " num = input(substr(id,7,3),3.);\n", "DATALINES;\n", "NYXXXX123\n", "NJ1234567\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = pieces_parts NOOBS;\n", " TITLE 'The pieces_parts data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Creating the state code is easy. We use the SUBSTR function. The first argument is the variable from which we want to extract the substring, the second argument is the starting position of the substring, and the last argument is the length of the substring (not the ending position as you might guess). Also note the use of the LENGTH statement to set the length of STATE to 2 bytes. Without a LENGTH statement, the length of STATE would be the same as the length of ID. Why? Remember that character variable lengths are set at compile time. The starting position and length parameters are constants here, but they could have been computed or read in from an external file. So, without a LENGTH statement, what is SAS to do? What is the longest substring you can extract from a string of length n? The answer is n and that is what SAS uses as the default length of the result.

\n", "

Extracting the three digit number code is more complicated. First we use the SUBSTR function to pull out the three numerals (numerals are character representations of numbers). However, the result of a SUBSTR function is always a character value. To convert the character value to a number, we use the INPUT function. The INPUT function takes the first argument and \"reads\" it as if it were coming from a file, according to the informat listed as the second argument. So, for the first observation, the SUBSTR function would return the string '123' and the INPUT function would convert this to the number 123. As a point of interest, you may use a longer informat as the second argument without any problems. For example, the INPUT statement could have been written as:

\n", "
input(substr(id,7,3),8.);
\n", "

and everything would have worked out fine. This fact is useful in situations where you do not know the length of the string ahead of time.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a particularly useful and somewhat obscure use of the SUBSTR function that we would like to discuss next. You can use this function to place characters in specific locations within a string by placing the SUBSTR function on the left hand side of the equal sign (in the older manuals I think this was called a SUBSTR pesudo function).\n", "\n", "
\n", "

Example

\n", "

Suppose you have some systolic blood pressures (SBP) and diastolic blood pressures (DBP) in a SAS data set. You want to print out these values and star high values with an asterisk. Here is a program that uses the SUBSTR function on the left of the equals sign to do that:

\n", "
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The pressure data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
sbpdbpsbp_chkdbp_chk
1208012080
18092180*92*
200110200*110*
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA pressure;\n", " INPUT sbp dbp @@;\n", " length sbp_chk dbp_chk $ 4;\n", " sbp_chk = put(sbp,3.);\n", " dbp_chk = put(dbp,3.);\n", " if sbp gt 160 then substr(sbp_chk,4,1) = '*';\n", " if dbp gt 90 then substr(dbp_chk,4,1) = '*';\n", "DATALINES;\n", "120 80 180 92 200 110\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = pressure NOOBS;\n", " TITLE 'The pressure data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

We first need to set the lengths of SBP_CHK and DBP_CHK to 4 (three spaces for the value plus one for the possible asterisk). Next, we use a PUT function to perform a numeric to character conversion. The PUT function is, in some ways, similar to the INPUT function. It \"writes out\" the value of the first argument, according to the FORMAT specified in the second argument. By \"write out\" we actually mean assign the value to the variable on the left of the equal sign. The SUBSTR function then places an asterisk in the fourth position when a value of SBP is greater than 160 or a value of DBP is greater than 90, as you can see in the output above.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parsing a String with SCAN\n", "\n", "Parsing a string means to take it apart based on some rules. We can parse a string by using the SCAN function if we know what character values are used to separate different parts of the string.\n", "\n", "
\n", "

Example

\n", "

In the example to follow, five separate character values were placed together on a line with either a space, a comma, a semi-colon, a period, or an explanation mark between them. You would like to extract the five values and assign them to five character variables. Without the SCAN function this would be hard; with it, it's easy:

\n", "
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The parse data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
piece1piece2piece3piece4piece5
thislinecontainsfivewords
abcdefghijxxxyyy    
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA parse;\n", " INPUT long_str $ 1-80;\n", " array pieces[5] $ 10 piece1-piece5;\n", " do i = 1 to 5;\n", " pieces[i] = scan(long_str,i,',.! ');\n", " end;\n", " drop long_str i;\n", "DATALINES;\n", "this line,contains!five.words\n", "abcdefghijkl xxx yyy\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = parse NOOBS;\n", " TITLE 'The parse data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The function:

\n", "
SCAN(char_var,n,'list-of-delimiters');
\n", "

returns the nth \"word\" from the char_var, where a \"word\" is defined as anything between two delimiters. If there are fewer than n words in the character variable, the SCAN function will return a blank. If n is negative, the scan will proceed from right to left. If n is greater than the number of words in the string, a missing value is returned.

\n", "

By placing the SCAN function in a DO loop, we can pick out the nth word in the string. Even though we have not discussed loops and array in SAS yet, hopefully the general syntax makes sense from using loops and arrays in other languages such as R.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

Here is an interesting example that uses the SCAN function to extract the last name from a character variable that contains first and last name as well as a possible middle name or initial. In this example, you want to create a list in alphabetical order by last name. First the program, then the explanation:

\n", "
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Names and Phone Numbers in Alphabetical Order (by Last Name)

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
NamePhone Number
Raymond Albert(732)235-44
Steven J. Foster(201)567-98
Alred Edward Newman(800)123-43
Jose Romerez(516)593-23
Jeff W. Snoker(908)782-43
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA first_last;\n", " INPUT @1 name $20.\n", " @21 phone $13.;\n", " *** The next statement extracts the last name from name;\n", " last_name = scan(name,-1,' '); /* scans from the right */\n", "DATALINES;\n", "Jeff W. Snoker (908)782-4382\n", "Raymond Albert (732)235-4444\n", "Alred Edward Newman (800)123-4321\n", "Steven J. Foster (201)567-9876\n", "Jose Romerez (516)593-2377\n", ";\n", "RUN;\n", "\n", "PROC REPORT data = first_last NOWINDOWS;\n", " TITLE 'Names and Phone Numbers in Alphabetical Order (by Last Name)';\n", " COLUMNS name phone last_name;\n", " DEFINE last_name / order noprint width = 20;\n", " DEFINE name / display 'Name' left width = 20;\n", " DEFINE phone / display 'Phone Number' width = 13 format=$13.;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

It is easy to extract the last name by using a –1 as the second argument of the SCAN function. Remember, a negative value for this arguments results in a scan from right to left. Output from the REPORT procedure is shown above.

\n", "

The REPORT procedure simply creates a table. The COLUMNS statement defines what variables will be used to make the table, and the DEFINE statements describes how they will be used. For example, the order and noprint options are used for last_name so that we use the last name to order the rows but do not print the column to the output table.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Locating the Position of One String Within Another String\n", "\n", "Two somewhat similar functions, INDEX and INDEXC can be used to locate a string, or one of several strings within a longer string. For example, if you have a string 'ABCDEFG' and want the location of the letters DEF (starting position 4), the following INDEX function could be used:\n", "\n", "`INDEX('ABCDEFG','DEF');`\n", "\n", "The first argument is the argument you want to search, the second argument is the string you are searching for. This would return a value of 4, the starting position of the string 'DEF'. If you want to know the starting position of any one of several characters, the INDEXC function can be used. As an example, if you wanted the starting position of any of the letters 'G', 'C', 'B', or 'F' in the string 'ABCDEFG', you would code: \n", "\n", "`INDEXC('ABCDEFG','GCBF');`\n", "\n", "or\n", "\n", "`INDEXC('ABCDEFG','G','C','B','F');`\n", "\n", "The function would return a value of 2, the position of the 'B', the first letter found in the first argument.\n", "\n", "
\n", "

Example

\n", "

Here is a short program which demonstrate these two functions: If the search fails, both functions return a zero.

\n", "
" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The locate data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
stringfirstfirst_c
abcxyz123444
123456789000
abcx1y2z3904
abczzzxyz374
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA locate;\n", " INPUT string $ 1-10;\n", " first = index(string,'xyz');\n", " first_c = indexc(string,'x','y','z');\n", "DATALINES;\n", "abcxyz1234\n", "1234567890\n", "abcx1y2z39\n", "abczzzxyz3\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = locate NOOBS;\n", " TITLE 'The locate data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting Between Lower and Upper Case and Proper Case\n", "\n", "The two companion functions UPCASE and LOWCASE do just what you would expect. These two functions are especially useful when data entry clerks are careless and a mixture of upper and lower cases values are entered for the same variable. You may want to place all of your character variables in an array and UPCASE (or LOWCASE) them all. \n", "\n", "
\n", "

Example

\n", "

The following example program uses the UPCASE function to convert all character variables to upper case.

\n", "
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The upper data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
abcdexy
MFPPD12
MFMFM34
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA up_down;\n", " length a b c d e $ 1;\n", " INPUT a b c d e x y;\n", "DATALINES;\n", "M f P p D 1 2\n", "m f m F M 3 4\n", ";\n", "RUN;\n", "\n", "DATA upper;\n", " SET up_down;\n", " array all_c[*] _character_;\n", " DO i = 1 to dim(all_c);\n", " all_c[i] = upcase(all_c[i]);\n", " END;\n", " DROP i;\n", "RUN;\n", "\n", "PROC PRINT data = upper NOOBS;\n", " TITLE 'The upper data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

This program uses the _CHARACTER_ keyword to select all the character variables instead of individually listing them out when creating the array. The result of running this program is to convert all values for the variables A,B,C, D, and E to upper case. The LOWCASE function could be used in place of the UPCASE function if you wanted all your character values in lower case.

\n", "

Again, do not worry if you don't fully understand the array and do loops yet. We will cover these in more detail in an unpcoming lesson.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A handy new V9 function is PROPCASE. This function capitalizes the first letter of each word. \n", "\n", "
\n", "

Example

\n", "

The following example program uses the PROPCASE function to text stored in a character variable to proper case, i.e. the frist letter is capitalized followed by all lower case letters within a word.

\n", "
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The proper data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Namepropname
rOn coDYRon Cody
the tall and the shortThe Tall And The Short
the "%$#@!" escapeThe "%$#@!" Escape
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA proper;\n", " INPUT Name $40.;\n", " propname = propcase(Name);\n", "DATALINES;\n", "rOn coDY\n", "the tall and the short\n", "the \"%$#@!\" escape\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = proper NOOBS;\n", " TITLE 'The proper data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Substituting One Word for Another in a String\n", "\n", "TRANWRD (translate word), can perform a search and replace operation on a string variable. The syntax of the TRANWRD function is: \n", "\n", "`TRANWRD (char_var,'find_str','replace_str');`\n", "\n", "where\n", "\n", "* char_var is the string we are searching within\n", "* find_str is the string we are searching for within char_var\n", "* replace_str is the string which will be substituted for find_str within char_var if found.\n", "\n", "That is, the function will replace every occurrence of find_str with replace_str. \n", "\n", "
\n", "

Example

\n", "

In the following example, we want to standardize addresses by converting the words 'Street', 'Avenue', and 'Road' to the abbreviations 'St.', 'Ave.', and 'Rd.' respectively. Look at the following program:

\n", "
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The convert data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obsaddress
189 Lazy Brook Rd.
2123 River Rd.
312 Main St.
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA convert;\n", " INPUT @1 address $20.;\n", " *** Convert Street, Avenue, and Road \n", " to their abbreviations;\n", " address = tranwrd(address,'Street','St.');\n", " address = tranwrd(address,'Avenue','Ave.');\n", " address = tranwrd(address,'Road','Rd.');\n", "DATALINES;\n", "89 Lazy Brook Road\n", "123 River Rd.\n", "12 Main Street\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = convert;\n", " TITLE 'The convert data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The TRANWRD function must replace the find_str with a non null string, so if the replace_str is given as the empty string '', TRANWRD will instead use an empty space ' '. If you want to replace with an empty string, you will have to use the TRANSTRN function with trimn('') as the replacement string. " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The convert data set

\n", "
\n", "
\n", "
\n",
       "string1=* x *                                                                                                                       \n",
       "string2=* x *                                                                                                                       \n",
       "string3=*x*                                                                                                                         \n",
       "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TITLE \"Comparison of TRANSWRD and TRANSTRN with \"\n", "DATA _null_;\n", " FILE print;\n", " string1='*' || tranwrd('abcxabc', 'abc', '') || '*';\n", " put string1=;\n", " string2='*' || transtrn('abcxabc', 'abc', '') || '*';\n", " put string2=;\n", " string3='*' || transtrn('abcxabc', 'abc', trimn('')) || '*';\n", " put string3=;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fuzzy Comparisons: The SPEDIS Function\n", "\n", "The SPEDIS function measures the \"spelling distance\" between two strings. If the two strings are identical, the function returns a 0. For each category of spelling error, the function assigns \"penalty\" points. For example, if the first letter in the two strings is different, there is a relatively large penalty. If two letters are interchanged, the number of penalty points is smaller. This function is very useful in performing a fuzzy merge where there may be differences in spelling between two files. You can also use it with character data consisting of numerals, such as social security numbers. The syntax of this function is:\n", "\n", "`SPEDIS(string1,string2);`\n", "\n", "
\n", "

Example

\n", "

The following program demonstrates some example distances between strings.

\n", "
" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The compare data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
string1string2points
samesame0
samesam8
firstletterxirstletter18
lastletterlastlettex10
receiptreciept7
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA compare;\n", " length string1 string2 $ 15;\n", " INPUT string1 string2;\n", " points = spedis(string1, string2);\n", "DATALINES;\n", "same same\n", "same sam\n", "firstletter xirstletter\n", "lastletter lastlettex\n", "receipt reciept\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = compare NOOBS;\n", " TITLE 'The compare data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The ANY Functions\n", "\n", "The \"ANY\" functions of SAS 9 allow you to determine the position of a class of characters (digits, alpha-numeric, alpha, white space, or punctuation). The complete list is:\n", "\n", "* `ANYALNUM` - find any alphanumeric values (both letter or digit)\n", "* `ANYALPHA` - find any letter\n", "* `ANYDIGIT` - find any number\n", "* `ANYPUNCT` - find any punctuation\n", "* `ANYSPACE` - find any space\n", "\n", "These functions return the first position of a character of the appropriate class. If no appropriate characters are found, the functions return a 0. \n", "\n", "
\n", "

Example

\n", "

The following program demonstrates the ANYALPHA and ANYDIGIT functions.

\n", "
" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The find_alpha_digit data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
stringfirst_alphafirst_digit
no digits here10
the 3 and 415
123 456 78901
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA find_alpha_digit;\n", " input string $20.;\n", " first_alpha = anyalpha(string);\n", " first_digit = anydigit(string);\n", "DATALINES;\n", "no digits here\n", "the 3 and 4\n", "123 456 789\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = find_alpha_digit NOOBS;\n", " TITLE 'The find_alpha_digit data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The NOT functions\n", "\n", "The \"NOT\" functions work in a similar way to the \"ANY\" functions except that they return the position of the first class of character that does not match the designated class. \n", "\n", "The complete list is:\n", "\n", "* `NOTALNUM`\n", "* `NOTALPHA`\n", "* `NOTDIGIT`\n", "* `NOTPUNCT`\n", "* `NOTSPACE`\n", "\n", "These functions return the first position of a character that does not match the appropriate class. If no \"bad\" characters are found (characters that do not match the designation) the functions return a 0. \n", "\n", "
\n", "

Example

\n", "

The following program demonstrates the NOTALPHA and NOTDIGIT functions.

\n", "
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The data_cleaning data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
stringonly_alphaonly_digit
abcdefg01
123456710
abc12341
1234abcd15
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA data_cleaning;\n", " input string $20.;\n", " only_alpha = notalpha(trim(string));\n", " only_digit = notdigit(trim(string));\n", "DATALINES;\n", "abcdefg\n", "1234567\n", "abc123\n", "1234abcd\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = data_cleaning NOOBS;\n", " TITLE 'The data_cleaning data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Notice the use of the TRIM function in the program above. Without it, both NOTALPHA and NOTDIGIT would return the position of the first blank. You can see why we called the data set DATA_CLEANING. It is an easy way to determine if a string contains only a single class of characters.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concatenation Functions: CAT, CATS, and CATX\n", "\n", "Although you can concatenate (join) two strings using the concatenation operator (either || or !!), several new functions and call routines, new with SAS 9, can be used instead. The CAT function does the same thing of concatenating strings just like || or !!, but the advantage of the other new functions is that they can automatically strip off leading and trailing blanks and can insert separation characters for you. We will demonstrate only two of the concatenation functions here. The CATS function strips leading and trailing blanks before joining two or more strings. The CATX function works the same as the CATS function, but allows you to specify one or more separation characters to insert between the strings. The syntax for these two functions is:\n", "\n", "`CATS(string1,string2,);`\n", "\n", "`CATX(separator,string1,string2,);`\n", "\n", "
\n", "

Example

\n", "

The following program demonstrates the differences between the different concatenation functions.

\n", "
" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The join_up data set

\n", "
\n", "
\n", "
\n",
       "string1=ABC                                                                                                                         \n",
       "string2=XYZ                                                                                                                         \n",
       "string3=12345                                                                                                                       \n",
       "cat_op=ABC      XYZ                                                                                                                 \n",
       "cat=ABC      XYZ                                                                                                                    \n",
       "cats=ABCXYZ                                                                                                                         \n",
       "catx=ABC***XYZ***12345                                                                                                              \n",
       "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

The join_up data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
string1string2string3cat_opcatcatscatx
ABCXYZ12345ABC XYZABC XYZABCXYZABC***XYZ***12345
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA join_up;\n", " FILE print;\n", " length cats $ 6 catx $ 17;\n", " string1 = 'ABC ';\n", " string2 = ' XYZ ';\n", " string3 = '12345';\n", " cat_op = string1 || string2;\n", " cat = cat(string1, string2);\n", " cats = cats(string1, string2);\n", " catx = catx('***', string1, string2, string3);\n", " PUT string1=;\n", " PUT string2=;\n", " PUT string3=;\n", " PUT cat_op=;\n", " PUT cat=;\n", " PUT cats=;\n", " PUT catx=;\n", "RUN;\n", "\n", "PROC PRINT data = join_up NOOBS;\n", " TITLE 'The join_up data set';\n", " VAR string1 string2 string3 cat_op cat cats catx;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Without the LENGTH statement in this program, the length of the two variables CATS and CATX would be 200 (the default length for the CAT functions). Note that the default length when using the concatenation operator is the sum of the lengths of the arguments to be joined.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The LENGTH, LENGTHN, and LENGTHN Functions\n", "\n", "We spoke earlier about storage length of character variables. The collection of LENGTH functions in this section have different and useful purposes. The LENGTHC function (V9) returns the storage length of character variables. The other two functions, LENGTH and LENGTHN both return the length of a character variable not counting trailing blanks. The only difference between LENGTH and LENGTHN is that LENGTHN returns a 0 for a null string while LENGTH returns a 1.\n", "\n", "
\n", "

Example

\n", "

The following program demonstrates the LENGTH, LENGTHN, and LENGTHC functions.

\n", "
" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The how_long data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
onetwothreelength_onelengthn_onelengthc_onelength_twolengthn_twolengthc_twolength_threelengthn_threelengthc_three
ABC  ABC XYZ336101999
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA how_long;\n", " one = 'ABC ';\n", " two = ' '; /* character missing value */\n", " three = 'ABC XYZ';\n", " length_one = length(one);\n", " lengthn_one = lengthn(one);\n", " lengthc_one = lengthc(one);\n", " length_two = length(two);\n", " lengthn_two = lengthn(two);\n", " lengthc_two = lengthc(two);\n", " length_three = length(three);\n", " lengthn_three = lengthn(three);\n", " lengthc_three = lengthc(three);\n", "RUN;\n", "\n", "PROC PRINT data = how_long NOOBS;\n", " TITLE 'The how_long data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing Two Strings Using the COMPARE Function\n", "\n", "You may wonder why we need a function to compare two character strings. Why can't you simple use an equal sign? The COMPARE function gives you more flexibility in comparing strings. The syntax is:\n", "\n", "`COMPARE(string1, string2 <,'modifiers'>)`\n", "\n", "You may use one or more modifiers from the following list, placed in single or double quotes as follows: \n", "\n", "* i or I - ignore case\n", "* l or L - removes leading blanks\n", "* n or N - removes quotes from any argument that is an n-literal and ignore case. An n-literal is a string in quotes, followed by an 'n', which is useful for non-valid SAS names.\n", "* : (color) - truncate the longer string to the length of the shorter string. Note that the default is to pad the shorter string with blanks before a comparison.\n", "\n", "For example, if you want to ignore case and remove leading blanks, you could code:\n", "\n", "`yes_or_no = compare(string1,string2,'il');`\n", "\n", "The COMPARE function returns the first position in which the two strings are different. It returns a negative number if string1 is aplhabetically or numerically ordered before string2. It returns 0 if the two strings are the same." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Compare Strings Dataset

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obsstring1string2compare1compare2compare3compare4
1Robert ParkerROBERT PARKER-1-120
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA compare_str;\n", " string1 = \" Robert Parker\";\n", " string2 = \"ROBERT PARKER\";\n", " compare1 = compare(string1, string2);\n", " compare2 = compare(string1, string2, 'i');\n", " compare3 = compare(string1, string2, 'l');\n", " compare4 = compare(string1, string2, 'il');\n", "RUN;\n", "\n", "PROC PRINT data = compare_str;\n", " TITLE \"Compare Strings Dataset\";\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Removing Leading and Trailing\n", "\n", "The STRIP function removes leading and trailing blanks. The TRIM function removes trailing blanks, and the LEFT function removes leading blanks." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Whitespace Example

\n", "
\n", "
\n", "
\n",
       "*ROBERT*                                                                                                                            \n",
       "*    ROBERT*                                                                                                                        \n",
       "*ROBERT        *                                                                                                                    \n",
       "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TITLE \"Whitespace Example\";\n", "\n", "DATA _NULL_;\n", " FILE print;\n", " string = \" ROBERT \";\n", " strip = '*' || strip(string) || '*';\n", " trim = '*' || trim(string) || '*';\n", " left = '*' || left(string) || '*';\n", " put strip;\n", " put trim;\n", " put left;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Counting Occurences of Characters or Substrings: COUNT and COUNTC\n", "\n", "Two other SAS 9 functions are COUNT and COUNTC. COUNT is used to count the number of times a particular substring appears in a string. COUNTC counts the number of times one or more characters appear. You can also use modifiers to ignore case (use the 'i' modifier) or ignore trailing blanks (use the 't' modifier) in both the string and the find_string.\n", "\n", "The syntax for these two functions is:\n", "\n", "`count(string,find_string,<'modifiers'>)`\n", "\n", "`countc(string,find_string,<'modifiers'>)`\n", "\n", "
\n", "

Example

\n", "

The following program illustrates the COUNT and COUNTC functions

\n", "
" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The Dracula data set

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
stringcount_a_or_bcountc_a_or_bcount_abccountc_abccase_a
xxabcxabcxxbbbb282102
cbacba04062
aaAA02024
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA Dracula;\n", " INPUT string $20.;\n", " count_a_or_b = count(string,'ab');\n", " countc_a_or_b = countc(string,'ab');\n", " count_abc = count(string,'abc');\n", " countc_abc = countc(string,'abc');\n", " case_a = countc(string,'a','i');\n", "DATALINES;\n", "xxabcxabcxxbbbb\n", "cbacba\n", "aaAA\n", ";\n", "RUN;\n", "\n", "PROC PRINT data = Dracula NOOBS;\n", " TITLE 'The Dracula data set';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PRX Functions: Regular Expressions\n", "\n", "SAS also has a family of PRX functions that allow string matching based on regular expressions instead of fixed strings. What are regular expressions?\n", "\n", "* Ways to search for specific strings\n", "* Can be very complicated or simple\n", "* Highly useful - think \"Find\" on steriods\n", "* http://www.regular-expressions.info/reference.html\n", "* They can be used to match a large number of strings in one statement\n", "* `.` matches any single character\n", "* `*` means repeate as many (even if 0) more times the last character\n", "* `?` makes the last thing optional\n", "* `^` matches the start of a vector `^a` - starst with \"a\"\n", "* `$` matches the end of the vector `b$` - ends with \"b\"\n", "\n", "We will cover the following SAS functions which use regular expressions\n", "\n", "* `PRXPARSE` - defines the regular expression (and substitution string) for the SAS functions\n", "* `PRXMATCH` - search the string for a given regular expression from PRXPARSE and returns the starting position if a match is found and 0 otherwise\n", "* CALL PRXSUBSTR - similar to PRXMATCH, search the string for the given regular expression from PRXPARSE but can find both the starting position and length of the match or return 0 for starting postion and length if no match is found.\n", "* PRXCHANGE - search and replace the regular expression from PRXPARSE with the given replacement string. There is a parameter to specify how many matches to replace. A value of -1 for how many times to replace mean to replace all matches found.\n", "\n", "Let's use the salaries dataset to illustrate these functions." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obsname
1Aaron,Patricia G
2Aaron,Petra L
3Abaineh,Yohannes T
4Abbene,Anthony M
5Abbey,Emmanuel
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FILENAME salaries '/folders/myfolders/SAS_Notes/data/Baltimore_City_Employee_Salaries_FY2015.csv';\n", "\n", "PROC IMPORT datafile = salaries out = sal dbms = CSV replace;\n", " getnames = yes;\n", "RUN;\n", "\n", "PROC PRINT data = sal (obs=5);\n", " VAR name;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following SAS program shows how to use regular expressions to find all names that include 'Rawlings'.

\n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obsname
1Rawlings,Kellye A
2Rawlings,Paula M
3Rawlings-Blake,Stephani
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA sal_rawlings;\n", " set sal;\n", " re = prxparse('/Rawlings/i');\n", " if missing(re) then do;\n", " putlog 'ERROR: regex is malformed';\n", " stop;\n", " end;\n", " if prxmatch(re, name) = 0 THEN DELETE;\n", "RUN;\n", "\n", "PROC PRINT data = sal_rawlings;\n", " VAR name;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The variable re stores the regular expression defined by PRXPARSE. Note the the regular expression must occur between two forward slashes '/regex/'. The i in the regular expression definition '/Rawlings/i' tells SAS to ignore case when doing the search, so both 'Rawlings' and 'rawlings' will match. We then pass the regular expression to PRXMATCH to search the name variable. If there is a match somewhere it returns a non-zero index. Note that we have an if statement to catch mistakes in our regular expression and stop the program if we have made a mistake.

\n", "

PRXSUBSTR works similarly but has a two variables that you pass in called `pos` and `len` in which to store the starting position and length of the string match. This would be needed for a followup call to SUBSTR in order to find and possible replace with substring.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following SAS program used PRXCHANGE to replace \"a\" with \"j\" in the names of the salary dataset.

\n", "
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obsnamename2
1Aaron,Patricia GAjron,Patricia G
2Aaron,Petra LAjron,Petra L
3Abaineh,Yohannes TAbjineh,Yohannes T
4Abbene,Anthony MAbbene,Anthony M
5Abbey,EmmanuelAbbey,Emmjnuel
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obsnamename2
1Aaron,Patricia GAjron,Pjtricij G
2Aaron,Petra LAjron,Petrj L
3Abaineh,Yohannes TAbjineh,Yohjnnes T
4Abbene,Anthony MAbbene,Anthony M
5Abbey,EmmanuelAbbey,Emmjnuel
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA sal_a1;\n", " set sal;\n", " re = prxparse('s/a/j/');\n", " if missing(re) then do;\n", " putlog 'ERROR: regex is malformed';\n", " stop;\n", " end;\n", " name2 = prxchange(re, 1, name);\n", "RUN;\n", "\n", "PROC PRINT data = sal_a1(obs=5);\n", " VAR name name2;\n", "RUN;\n", "\n", "DATA sal_a2;\n", " set sal;\n", " re = prxparse('s/a/j/');\n", " if missing(re) then do;\n", " putlog 'ERROR: regex is malformed';\n", " stop;\n", " end;\n", " name2 = prxchange(re, -1, name);\n", "RUN;\n", "\n", "PROC PRINT data = sal_a2(obs=5);\n", " VAR name name2;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Notice that the first call used 1 for the number of substitutions and so it only replaces the first found match of \"a\" with \"j\"; Where as, the second version uses -1 and so replaces all occurences of \"a\" with \"j\".

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises\n", "\n", "1. Read in the Baltimore city employee salaries dataset, Baltimore_City_Employee_Salaries_FY2015.csv, using PROC IMPORT.\n", "2. Make a temporary SAS dataset called `health_sal` using the salaries dataset, with only the agencies (`JobTitle`) of those with \"fire\" (anywhere in the job title), if any, in the name. Be sure to ignore case in the string matching for 'fire'.\n", "3. Make a temporary data set called `trans` which contains only agencies that contain “TRANS” (case sensitive).\n", "4. What is/are the profession(s) of people who have “abra” in their name for Baltimore’s Salaries? Case should be ignored.\n", "5. Create a subset of the salaries dataset that only includes the Police Department, Fire Department and Sheriff’s Office. Use the Agency variable with (case sensitive) string matching. Call this `emer`. How many employees are in this new dataset?\n", "6. Create a variable called `dept` in the `emer` data set by extracting the the part of the `Agency` variable up until 'ment' or 'ice'. E.g. we want to extract all characters up until ment or ice (we can group in regex using parentheses) and then discard the rest. Make frequency table of the dept variable using PROC FREQ. To extract the desired substring, use:\n", "
\n",
    "re = prxparse('/.*(ment|ice)/');\n",
    "call prxsubstr(re, Agency, pos, len);\n",
    "dept = substr(Agency, pos, len);\n",
    "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "SAS", "language": "sas", "name": "sas" }, "language_info": { "codemirror_mode": "sas", "file_extension": ".sas", "mimetype": "text/x-sas", "name": "sas" } }, "nbformat": 4, "nbformat_minor": 2 }