{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Summarization\n", "\n", "In this section, we will learn the basics of exploratory data analysis in SAS. We will learn how to summarize one categorical variable, one quantitative variable, and basic summaries of bivariate data. We will cover both numeric and graphical summaries. In a later section, we will return to plotting to improve the appearance of our plots for presentations.\n", "\n", "There are a number of procedures that are available in SAS that are designed specifically to produce a variety of different descriptive statistics and to display them in meaningful reports. The four procedures in particular, of which I am thinking, are the MEANS, SUMMARY, UNIVARIATE, and FREQ procedures.\n", "\n", "The FREQ procedure is used to summarize discrete data values, and therefore can be used to calculate summary statistics such as the percentage of people with blue eyes and the number of elm trees succumbing to Dutch elm disease.\n", "\n", "The MEANS, SUMMARY, and UNIVARIATE procedures are used to summarize continuous numeric values, and therefore can be used to calculate statistics, such as mean height, median salary, and minimum mileage. \n", "\n", "We'll work mostly with the MEANS procedure. Then, since the SUMMARY and UNIVARIATE procedures have similar options and statements as the MEANS procedure, we'll spend less time on them. The greatest difference between the three procedures is that the UNIVARIATE procedure calculates a few more additional statistics not available in the MEANS and SUMMARY procedures. If you do not need to calculate the additional statistics that are available in UNIVARIATE, however, it is much more efficient to use the MEANS and SUMMARY procedures.\n", "\n", "All three of the procedures take the following generic form:\n", "
\n",
    "\n",
    "PROC PROCNAME options;\n",
    "     statement1;\n",
    "     statement2;\n",
    "     etc;\n",
    "RUN;\n",
    "\n",
    "
\n", "where, not surprisingly, PROCNAME stands for the name of the procedure, and is therefore — either MEANS, SUMMARY, or UNIVARIATE.\n", "\n", "For plotting, we will mainly focus on using PROC SGPLOT." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarizing Quantitative Data\n", "\n", "Throughout our investigation of the MEANS, SUMMARY, and UNIVARIATE procedures, we'll use the hemotology dataset, hem2.sas7bdat (see CANVAS for the dataset), arising from the ICDB Study. The following program tells SAS to display the contents, and print the first 15 observations, of the data set:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SAS Connection established. Subprocess id is 4314\n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The CONTENTS Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Data Set NamePHC6089.HEM2Observations635
Member TypeDATAVariables9
EngineV9Indexes0
Created11/19/2008 14:48:26Observation Length72
Last Modified11/19/2008 14:48:26Deleted Observations0
Protection CompressedNO
Data Set Type SortedNO
Label   
Data RepresentationWINDOWS_32  
Encodingwlatin1 Western (Windows)  
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Engine/Host Dependent Information
Data Set Page Size8192
Number of Data Set Pages6
First Data Page1
Max Obs per Page113
Obs in First Data Page87
Number of Data Set Repairs0
Filename/folders/myfolders/SAS_Notes/data/hem2.sas7bdat
Release Created9.0101M3
Host CreatedXP_PRO
Inode Number160
Access Permissionrwxrwx---
Owner Nameroot
File Size49KB
File Size (bytes)50176
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Alphabetic List of Variables and Attributes
#VariableTypeLen
6hcritNum8
5hemogNum8
2hospNum8
8mchNum8
9mchcNum8
7mcvNum8
4rbcNum8
1subjNum8
3wbcNum8
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Variables in Creation Order
#VariableTypeLen
1subjNum8
2hospNum8
3wbcNum8
4rbcNum8
5hemogNum8
6hcritNum8
7mcvNum8
8mchNum8
9mchcNum8
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssubjhospwbcrbchemoghcritmcvmchmchc
1110027117.54.3813.840.993.331.533.7
2110029117.65.2015.245.888.029.233.1
3110039117.54.3313.139.491.030.233.2
4110040118.34.5212.438.184.227.432.5
5110045118.94.7214.642.790.430.934.1
6110049116.24.7113.841.788.529.233.0
7110051116.44.5613.037.983.128.534.3
8110052117.13.6912.535.997.233.834.8
9110053117.44.4714.443.697.532.233.0
10110055116.14.3412.838.288.129.633.6
11110057119.54.7013.440.586.028.433.0
12110058116.53.7611.634.291.030.733.8
13110059117.54.2912.336.885.728.633.4
14110060117.64.5713.842.091.830.132.8
15110062114.64.8713.942.988.228.532.3
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME phc6089 '/folders/myfolders/SAS_Notes/data/';\n", " \n", "PROC CONTENTS data = phc6089.hem2 position;\n", "RUN;\n", " \n", "PROC PRINT data = phc6089.hem2 (OBS = 15);\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, download and save the hematology data set to a convenient location on your computer. Then, launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the data set. Finally, run the program. You may recall that the CONTENTS procedure's POSITION option tells SAS to display the contents of the data set in the order in which the variables appear in the data set. Therefore, you should see output that looks something like what is shown above.\n", "\n", "The first two variables, subj and hosp, tell us the subject number and at what hospital the subject's data were collected. The remaining variables, wbc, rbc, hemog, ... are the blood data variables of most interest. For example, the variables wbc and rbc contain the subject's white blood cell and red blood cell counts, respectively. The really important thing to note when reviewing the output is that all of the blood data variables are continuous numeric variables, which lend themselves perfectly to a descriptive analysis using the MEANS procedure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The MEANS and SUMMARY Procedures\n", "\n", "In this section, we'll learn the syntax of the simplest MEANS and SUMMARY procedures, as well as familiarize ourselves with the output they generate.\n", "\n", "
\n", "

Example

\n", "

The MEANS procedure can include many statements and options for specifying the desired statistics. For the sake of simplicity, we'll start out with the most basic form of the MEANS procedure. The following program simply tells SAS to display basic summary statistics for each numeric variable in the phc6089.hem2 data set:

\n", "
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The MEANS Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
VariableNMeanStd DevMinimumMaximum
\n", "
\n", "
subj
\n", "
hosp
\n", "
wbc
\n", "
rbc
\n", "
hemog
\n", "
hcrit
\n", "
mcv
\n", "
mch
\n", "
mchc
\n", "
\n", "
\n", "
\n", "
635
\n", "
635
\n", "
635
\n", "
635
\n", "
635
\n", "
635
\n", "
635
\n", "
634
\n", "
634
\n", "
\n", "
\n", "
\n", "
327199.50
\n", "
32.7133858
\n", "
7.1276850
\n", "
4.4350079
\n", "
13.4696063
\n", "
39.4653543
\n", "
89.1184252
\n", "
30.4537855
\n", "
34.1524290
\n", "
\n", "
\n", "
\n", "
144410.20
\n", "
14.4426330
\n", "
1.9019097
\n", "
0.3941710
\n", "
1.1097954
\n", "
3.1623819
\n", "
4.5190963
\n", "
1.7232248
\n", "
0.7562054
\n", "
\n", "
\n", "
\n", "
110027.00
\n", "
11.0000000
\n", "
3.0000000
\n", "
3.1200000
\n", "
9.9000000
\n", "
29.7000000
\n", "
65.0000000
\n", "
22.0000000
\n", "
31.6000000
\n", "
\n", "
\n", "
\n", "
520098.00
\n", "
52.0000000
\n", "
14.2000000
\n", "
5.9500000
\n", "
17.7000000
\n", "
51.4000000
\n", "
106.0000000
\n", "
37.0000000
\n", "
36.7000000
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC MEANS data = phc6089.hem2;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program, and review the output to familiarize yourself with the summary statistics that the MEANS procedure calculates by default. As you can see, in its most basic form, the MEANS procedure prints N (the number of nonmissing values), the mean, the standard deviation, and the minimum and maximum values of every numeric variable in the data set.

\n", "

In most cases, you probably don't want SAS to calculate summary statistics for every numeric variable in your data set. Instead, you'll probably just want to focus on a few important variables. For our hematology data set, for example, it doesn't make much sense for SAS to calculate summary statistics for the subj and hosp variables. After all, how does it help us to know that the average subj number is 327199.5?

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following program uses the MEANS procedure's VAR statement to restrict SAS to summarizing just the seven blood data variables in the icdb.hem2 data set:

\n", "
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The MEANS Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
VariableNMeanStd DevMinimumMaximum
\n", "
\n", "
wbc
\n", "
rbc
\n", "
hemog
\n", "
hcrit
\n", "
mcv
\n", "
mch
\n", "
mchc
\n", "
\n", "
\n", "
\n", "
635
\n", "
635
\n", "
635
\n", "
635
\n", "
635
\n", "
634
\n", "
634
\n", "
\n", "
\n", "
\n", "
7.1276850
\n", "
4.4350079
\n", "
13.4696063
\n", "
39.4653543
\n", "
89.1184252
\n", "
30.4537855
\n", "
34.1524290
\n", "
\n", "
\n", "
\n", "
1.9019097
\n", "
0.3941710
\n", "
1.1097954
\n", "
3.1623819
\n", "
4.5190963
\n", "
1.7232248
\n", "
0.7562054
\n", "
\n", "
\n", "
\n", "
3.0000000
\n", "
3.1200000
\n", "
9.9000000
\n", "
29.7000000
\n", "
65.0000000
\n", "
22.0000000
\n", "
31.6000000
\n", "
\n", "
\n", "
\n", "
14.2000000
\n", "
5.9500000
\n", "
17.7000000
\n", "
51.4000000
\n", "
106.0000000
\n", "
37.0000000
\n", "
36.7000000
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC MEANS data = phc6089.hem2;\n", " var wbc rbc hemog hcrit mcv mch mchc;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program, and review the output to convince yourself that the subj and hosp variables have been excluded from the analysis.

\n", "

The other thing you might notice about the output is that there are many more decimal places displayed than are necessary. By default, SAS uses the best. format to display values in reports created by the MEANS procedure. In a technical sense, it means that SAS chooses the format that provides the most information about the summary statistics while maintaining a default field width of 12. In a practical sense, it means that often too many decimal places are displayed.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following program is identical to the program in the previous example except for two things:

\n", "
    \n", "
  1. The MEANS keyword has been replaced with the SUMMARY keyword
  2. \n", "
  3. The PRINT option has been added to the PROC statement.
  4. \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The SUMMARY Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
VariableNMeanStd DevMinimumMaximum
\n", "
\n", "
wbc
\n", "
rbc
\n", "
hemog
\n", "
hcrit
\n", "
mcv
\n", "
mch
\n", "
mchc
\n", "
\n", "
\n", "
\n", "
635
\n", "
635
\n", "
635
\n", "
635
\n", "
635
\n", "
634
\n", "
634
\n", "
\n", "
\n", "
\n", "
7.13
\n", "
4.44
\n", "
13.47
\n", "
39.47
\n", "
89.12
\n", "
30.45
\n", "
34.15
\n", "
\n", "
\n", "
\n", "
1.90
\n", "
0.39
\n", "
1.11
\n", "
3.16
\n", "
4.52
\n", "
1.72
\n", "
0.76
\n", "
\n", "
\n", "
\n", "
3.00
\n", "
3.12
\n", "
9.90
\n", "
29.70
\n", "
65.00
\n", "
22.00
\n", "
31.60
\n", "
\n", "
\n", "
\n", "
14.20
\n", "
5.95
\n", "
17.70
\n", "
51.40
\n", "
106.00
\n", "
37.00
\n", "
36.70
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SUMMARY data = phc6089.hem2 MAXDEC = 2 FW = 10 PRINT;\n", " var wbc rbc hemog hcrit mcv mch mchc;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The MEANS and SUMMARY procedures perform the same functions except for the default setting of the PRINT option. By default, the MEANS procedure produces printed output, while the SUMMARY procedure does not. With the MEANS procedure, you have to use the NOPRINT option to suppress printing, while with the SUMMARY procedure, you have to use the PRINT option to get a printed report.

\n", "

Launch and run the SAS program, and review the output to convince yourself that there is no difference between the two reports created by the MEANS and SUMMARY procedures.

\n", "

There is actually a difference. The VAR statement in the above program tells SAS which of the (numeric) variables to summarize. If you do not include a VAR statement in the SUMMARY procedure, SAS merely gives a simple count of the number of observations in the data set. To convince yourself of this, delete the VAR statement, and re-run the SAS program. You should see output that looks something like this:

" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The SUMMARY Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
N Obs
635
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SUMMARY data = phc6089.hem2 MAXDEC = 2 FW = 10 PRINT;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specifying Statistics\n", "\n", "The default statistics that the MEANS procedure produces — N, mean, standard deviation, minimum, and maximum — might not be the ones that you actually need. You might prefer to limit your output to just the mean and standard deviation of the values. Or you might want to compute a completely different statistic, such as the median or range of values.\n", "\n", "In order to tell SAS to calculate summary statistics other than those calculated by default, simply place the desired statistics keywords as options in the PROC MEANS statement.\n", "\n", "
\n", "

Example

\n", "

The following program tells SAS to calculate and display the sum, range and median of the red blood cell counts appearing in the phc6089.hem2 data set:

\n", "
" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The MEANS Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Analysis Variable : rbc
SumRangeMedian
2816.232.834.41
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC MEANS data=phc6089.hem2 fw=10 maxdec=2 sum range median;\n", " var rbc;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program, and review the output to convince yourself that the report is generated as described. You might want to note, in particular, that when you specify a statistic in the PROC MEANS statement, the default statistics are not produced. Incidentally, you can generate the exact same report using the SUMMARY procedure, providing you again add the PRINT option to the end of the PROC statement.

\n", "

The following keywords can be used with the MEANS and SUMMARY procedures to compute statistics:

\n", "
Descriptive Statistics
\n", "\n", "\n", "\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\n", "
KeywordDescription
CLMTwo-sided confidence limit for the mean
CSSCorrected sum of squares
CVCoefficient of variation
KURTKurtosis
LCLMOne-sided confidence limit below the mean
MAXMaximum value
MEANAverage value
MINMinimum value
NNo. of observations with non-missing values
NMISSNo. of observations with missing values
RANGERange
SKEWSkewness
STDStandard deviation
STDERRStandard error of the mean
SUMSum
SUMWGTSum of the Weight variable values
UCLMOne-sided confidence limit above the mean
USSUncorrected sum of squares
VARVariance
\n", "\n", "
Quantile Statistics
\n", "\n", "\n", "\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\n", "
KeywordDescription
MEDIAN or P50Median or 50th percentile
P11st percentile
P55th percentile
P1010th percentile
Q1 or P25Lower quartile or 25th percentile
Q3 or P75Upper quartile or 75th percentile
P9090th percentile
P9595th percentile
P9999th percentile
QRANGEDifference between upper and lower quartiles: Q3-Q1
\n", "\n", "\n", "
Hypothesis Testing
\n", "\n", "\n", "\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\n", "
KeywordDescription
PROBTProbability of a greater absolute value for the t value
TStudent's t for testing that the population mean is 0
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Group Processing\n", "\n", "All of the examples we've looked at so far have involved summarizing all of the observations in the data set. In many cases, we'll instead want to tell SAS to calculate summary statistics for certain subgroups. For example, it makes more sense to calculate the average height for males and females separately rather than calculating an average height of all individuals together. In this section, we'll investigate two ways of producing summary statistics for subgroups. One approach involves using a CLASS statement, and the other involves using a BY statement. As you'll soon see, the approach you choose to use will depend most on how you'd like your final report to look.\n", "\n", "
\n", "

Example

\n", "

The following program uses the VAR and CLASS statements to tell SAS to calculate the default summary statistics of the rbc, wbc, and hcrit variables separately for each of the nine hosp values:

\n", "
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The MEANS Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
hospN ObsVariableNMeanStd DevMinimumMaximum
11106\n", "
\n", "
rbc
\n", "
wbc
\n", "
hcrit
\n", "
\n", "
\n", "
\n", "
106
\n", "
106
\n", "
106
\n", "
\n", "
\n", "
\n", "
4.41
\n", "
7.11
\n", "
39.78
\n", "
\n", "
\n", "
\n", "
0.42
\n", "
1.92
\n", "
3.30
\n", "
\n", "
\n", "
\n", "
3.47
\n", "
3.30
\n", "
32.80
\n", "
\n", "
\n", "
\n", "
5.55
\n", "
13.10
\n", "
48.90
\n", "
\n", "
21108\n", "
\n", "
rbc
\n", "
wbc
\n", "
hcrit
\n", "
\n", "
\n", "
\n", "
108
\n", "
108
\n", "
108
\n", "
\n", "
\n", "
\n", "
4.43
\n", "
7.37
\n", "
39.28
\n", "
\n", "
\n", "
\n", "
0.39
\n", "
1.94
\n", "
2.90
\n", "
\n", "
\n", "
\n", "
3.33
\n", "
3.10
\n", "
33.00
\n", "
\n", "
\n", "
\n", "
5.35
\n", "
12.60
\n", "
48.00
\n", "
\n", "
2242\n", "
\n", "
rbc
\n", "
wbc
\n", "
hcrit
\n", "
\n", "
\n", "
\n", "
42
\n", "
42
\n", "
42
\n", "
\n", "
\n", "
\n", "
4.40
\n", "
7.37
\n", "
38.80
\n", "
\n", "
\n", "
\n", "
0.43
\n", "
2.15
\n", "
3.09
\n", "
\n", "
\n", "
\n", "
3.12
\n", "
3.90
\n", "
31.00
\n", "
\n", "
\n", "
\n", "
5.12
\n", "
14.20
\n", "
45.30
\n", "
\n", "
236\n", "
\n", "
rbc
\n", "
wbc
\n", "
hcrit
\n", "
\n", "
\n", "
\n", "
6
\n", "
6
\n", "
6
\n", "
\n", "
\n", "
\n", "
4.28
\n", "
5.17
\n", "
39.37
\n", "
\n", "
\n", "
\n", "
0.45
\n", "
1.37
\n", "
3.31
\n", "
\n", "
\n", "
\n", "
3.64
\n", "
3.30
\n", "
35.90
\n", "
\n", "
\n", "
\n", "
4.94
\n", "
6.60
\n", "
44.80
\n", "
\n", "
3152\n", "
\n", "
rbc
\n", "
wbc
\n", "
hcrit
\n", "
\n", "
\n", "
\n", "
52
\n", "
52
\n", "
52
\n", "
\n", "
\n", "
\n", "
4.42
\n", "
7.50
\n", "
39.28
\n", "
\n", "
\n", "
\n", "
0.41
\n", "
1.87
\n", "
3.34
\n", "
\n", "
\n", "
\n", "
3.55
\n", "
4.00
\n", "
33.90
\n", "
\n", "
\n", "
\n", "
5.62
\n", "
13.20
\n", "
47.80
\n", "
\n", "
4192\n", "
\n", "
rbc
\n", "
wbc
\n", "
hcrit
\n", "
\n", "
\n", "
\n", "
92
\n", "
92
\n", "
92
\n", "
\n", "
\n", "
\n", "
4.50
\n", "
7.00
\n", "
40.19
\n", "
\n", "
\n", "
\n", "
0.44
\n", "
1.93
\n", "
3.51
\n", "
\n", "
\n", "
\n", "
3.54
\n", "
3.00
\n", "
32.50
\n", "
\n", "
\n", "
\n", "
5.49
\n", "
12.50
\n", "
49.90
\n", "
\n", "
4295\n", "
\n", "
rbc
\n", "
wbc
\n", "
hcrit
\n", "
\n", "
\n", "
\n", "
95
\n", "
95
\n", "
95
\n", "
\n", "
\n", "
\n", "
4.40
\n", "
7.01
\n", "
39.14
\n", "
\n", "
\n", "
\n", "
0.33
\n", "
1.79
\n", "
2.66
\n", "
\n", "
\n", "
\n", "
3.63
\n", "
3.30
\n", "
33.90
\n", "
\n", "
\n", "
\n", "
5.95
\n", "
12.00
\n", "
51.40
\n", "
\n", "
5165\n", "
\n", "
rbc
\n", "
wbc
\n", "
hcrit
\n", "
\n", "
\n", "
\n", "
65
\n", "
65
\n", "
65
\n", "
\n", "
\n", "
\n", "
4.50
\n", "
7.25
\n", "
40.00
\n", "
\n", "
\n", "
\n", "
0.42
\n", "
1.95
\n", "
3.50
\n", "
\n", "
\n", "
\n", "
3.80
\n", "
3.50
\n", "
33.00
\n", "
\n", "
\n", "
\n", "
5.70
\n", "
11.50
\n", "
49.00
\n", "
\n", "
5269\n", "
\n", "
rbc
\n", "
wbc
\n", "
hcrit
\n", "
\n", "
\n", "
\n", "
69
\n", "
69
\n", "
69
\n", "
\n", "
\n", "
\n", "
4.43
\n", "
6.74
\n", "
38.81
\n", "
\n", "
\n", "
\n", "
0.32
\n", "
1.66
\n", "
2.89
\n", "
\n", "
\n", "
\n", "
3.75
\n", "
3.90
\n", "
29.70
\n", "
\n", "
\n", "
\n", "
5.40
\n", "
10.20
\n", "
45.00
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC MEANS data=phc6089.hem2 fw=10 maxdec=2;\n", " var rbc wbc hcrit;\n", " class hosp;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

First, you should note that the variables appearing in the CLASS statement need not be character variables. Here, we use the numeric variable hosp to break up the 635 observations in the phc6089.hem2 data set into nine subgroups. When CLASS variables are numeric, they should of course contain a limited number of discrete values that represent meaningful subgroups. Otherwise, you will be certain to generate an awful lot of useless output.

\n", "

Now, launch and run the SAS program, and review the output to convince yourself that the report is generated as described. As you can see, the MEANS procedure does not generate statistics for the CLASS variables. Their values are instead used only to categorize the data.

\n", "

Let's see what happens when our CLASS statement contains more than one variable.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following program reads some data on national parks into a temporary SAS data set called parks, and then uses the MEANS procedure's VAR and CLASS statements to tell SAS to sum the number of musems and camping facilities for each combination of the Type and Region variables:

\n", "
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The MEANS Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
TypeRegionN ObsVariableSum
NMEast2\n", "
\n", "
Museums
\n", "
Camping
\n", "
\n", "
\n", "
\n", "
2
\n", "
0
\n", "
\n", "
 West2\n", "
\n", "
Museums
\n", "
Camping
\n", "
\n", "
\n", "
\n", "
3
\n", "
7
\n", "
\n", "
NPEast2\n", "
\n", "
Museums
\n", "
Camping
\n", "
\n", "
\n", "
\n", "
8
\n", "
12
\n", "
\n", "
 West5\n", "
\n", "
Museums
\n", "
Camping
\n", "
\n", "
\n", "
\n", "
20
\n", "
31
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA parks;\n", " input ParkName $ 1-21 Type $ Region $ Museums Camping;\n", " DATALINES;\n", "Dinosaur NM West 2 6\n", "Ellis Island NM East 1 0\n", "Everglades NP East 5 2\n", "Grand Canyon NP West 5 3\n", "Great Smoky Mountains NP East 3 10\n", "Hawaii Volcanoes NP West 2 2\n", "Lava Beds NM West 1 1\n", "Statue of Liberty NM East 1 0\n", "Theodore Roosevelt NP West 2 2\n", "Yellowstone NP West 9 11\n", "Yosemite NP West 2 13\n", ";\n", "RUN;\n", " \n", "PROC MEANS data = parks fw = 10 maxdec = 0 sum;\n", " var museums camping;\n", " class type region;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Now, launch and run the SAS program, and review the output. You should see that, for example, SAS determined that the number of museums in National Monuments in the East is 2. The number of museums in National Monuments in the West is 3. And so on.

\n", "

It is probably actually more important here to note how SAS processed the CLASS statement. As you can see, the Type variable appears first and the Region variable appears second in the CLASS statement. For that reason, the Type variable appears first and the Region variable appears second in the output. In general, the order of the variables in the CLASS statement determines their order in the output table. To convince yourself of this, you might want to change the order of the variables as they appear in the CLASS statement, and re-run the SAS program to see what you get.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

Like the CLASS statement, the BY statement specifies variables to use for categorizing observations. The following program uses the MEANS procedure's BY statement to categorize the observations in the parks data set into four subgroups, as determined by the Type and Region variables, before calculating the sum, minimum and maximum of the museums and camping values for each of the subgroups:

\n", "
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The MEANS Procedure

\n", "
\n", "
\n", "
\n", "

Type=NM Region=East

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
VariableSumMinimumMaximum
\n", "
\n", "
Museums
\n", "
Camping
\n", "
\n", "
\n", "
\n", "
2
\n", "
0
\n", "
\n", "
\n", "
\n", "
1
\n", "
0
\n", "
\n", "
\n", "
\n", "
1
\n", "
0
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Type=NM Region=West

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
VariableSumMinimumMaximum
\n", "
\n", "
Museums
\n", "
Camping
\n", "
\n", "
\n", "
\n", "
3
\n", "
7
\n", "
\n", "
\n", "
\n", "
1
\n", "
1
\n", "
\n", "
\n", "
\n", "
2
\n", "
6
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Type=NP Region=East

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
VariableSumMinimumMaximum
\n", "
\n", "
Museums
\n", "
Camping
\n", "
\n", "
\n", "
\n", "
8
\n", "
12
\n", "
\n", "
\n", "
\n", "
3
\n", "
2
\n", "
\n", "
\n", "
\n", "
5
\n", "
10
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Type=NP Region=West

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
VariableSumMinimumMaximum
\n", "
\n", "
Museums
\n", "
Camping
\n", "
\n", "
\n", "
\n", "
20
\n", "
31
\n", "
\n", "
\n", "
\n", "
2
\n", "
2
\n", "
\n", "
\n", "
\n", "
9
\n", "
13
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SORT data = parks out = srtdparks;\n", " by type region;\n", "RUN;\n", " \n", "PROC MEANS data = srtdparks fw = 10 maxdec = 0 sum min max;\n", " var museums camping;\n", " by type region;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

You might want to just go ahead and launch and run the SAS program to see what the report looks like when you use a BY statement instead of a CLASS statement to form the subgroups. You might recall that when you use a CLASS statement, SAS generates a single large table containing all of the summary statistics. As you can see in the output here, when you instead use a BY statement, SAS generates a table for each combination of the Type and Region variables. To be more specific, SAS creates four tables here — one for Type = NM and Region = East, one for Type = NP and Region = East, one for Type = NM and Region = West, and one for Type = NP and Region = West.

\n", "

Of course, there is one thing we've not addressed so far in this code ... that's the SORT procedure. Unlike CLASS processing, BY-group processing requires that your data be sorted in the order of the variables that appear in the BY statement. If the observations in your data set are not sorted in order by the variables appearing in the BY statement, then you have to use the SORT procedure to sort your data set before using it in the MEANS procedure. Don't forget that if you don't specify an output data set using the OUT= option, then the SORT procedure overwrites your initial data set with the newly sorted observations. Here, our SORT procedure tells SAS to sort the parks data set by the Type and Region variables, and to store the sorted data set in a new data set called srtdparks.

\n", "

In closing off our discussion of group processing, we should probably discuss which approach — the CLASS statement or the BY statement — is more appropriate. My personal opinion is that it's all a matter of preference. If you prefer to see your summary statistics in one large table, then you should use the CLASS statement. If you instead prefer to see your summary statistics in a bunch of smaller tables, then you should use the BY statement. My personal opinion doesn't take into account the efficiency of your program, however. The advantage of the CLASS statement is that it is easier to use since you need not sort the data first. The advantage of the BY statement is that it can be more efficient when you are categorizing data by many variables.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating Summarized Datasets\n", "\n", "There are many situations, when performing statistical analyses on continuous data, in which you want to create a data set whose observations contain summary statistics rather than observations containing the original raw data. For example, you might want to create a graph that compares the average weight loss of subjects at, say, ten different weight loss clinics. One way of creating such a graph is to first create a data set that contains ten observations — one for each of the clinics — and an average weight loss variable. The MEANS procedure's OUTPUT statement, in conjunction with the NOPRINT option, provides the mechanism to create such a data set rather than the standard printed output.\n", "\n", "The NOPRINT option tells SAS to suppress all printed output. The OUTPUT statement, which tells SAS to create the output data set, in general, takes the form:\n", "\n", "`OUTPUT OUT=dsn keyword1=name1 keyword2=name2 ....;`\n", "\n", "where *dsn* is the name of the data set you want to create, and **keyword1** is the first statistic you want dumped to the output data set and name1 is the name you want to call the variable in the data set representing that first statistic. Similarly, **keyword2** is the second statistic you want dumped to the output data set and name2 is the name you want to call the variable in the data set representing that second statistic. And so on. When you use the OUTPUT statement without specifying any keywords, the default summary statistics N, MEAN, STD, MIN, and MAX are produced for all of the numeric variables or for all of the variables that are listed in the VAR statement." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following program uses the MEANS procedure's OUTPUT statement (and NOPRINT option) to create a temporary data set called hospsummary that has one observation for each hospital that contains summary statistics for the hospital:

\n", "
" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Hospital Statistics

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obshosp_TYPE__FREQ_MeanRBCMeanWBCMeanHCRITMedianRBCMedianWBCMedianHCRIT
1.06354.435017.1276939.46544.4107.039.30
21111064.413217.1066039.78214.4057.239.35
32111084.429917.3694439.27694.4407.639.00
4221424.395717.3707138.80244.4457.139.40
523164.276675.1666739.36674.2355.439.10
6311524.421357.5021239.28464.3757.539.10
7411924.502077.0043540.18594.4556.840.20
8421954.397267.0063239.13584.3506.939.30
9511654.500007.2461539.99694.5007.140.00
10521694.425806.7420338.81454.4106.738.50
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC MEANS data=phc6089.hem2 NOPRINT;\n", " var rbc wbc hcrit;\n", " class hosp;\n", " output out = hospsummary\n", " mean = MeanRBC MeanWBC MeanHCRIT\n", " median = MedianRBC MedianWBC MedianHCRIT;\n", "RUN;\n", " \n", "PROC PRINT data = hospsummary;\n", " title 'Hospital Statistics';\n", "RUN;\n", "\n", "title ; *Reset title;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Let's first review the code. The VAR statement tells SAS the three variables — rbc, wbc, and hcrit — that we want summarized. The CLASS statement tells SAS that we want to categorize the observations by the value of the hosp variable. The OUT= portion of the OUTPUT statement tells SAS that we want to create a temporary data set called hospsummary. The MEAN= portion of the OUTPUT statement tells SAS to calculate the average of the rbc, wbc, and hcrit values and store the results, respectively, in three new variables called MeanRBC, MeanWBC, and MeanHCRIT. The MEDIAN= portion of the OUTPUT statement tells SAS to calculate the median of the rbc, wbc, and hcrit values and store the results, respectively, in three new variables called MedianRBC, MedianWBC, and MedianHCRIT. Note that, for each keyword, the variables must be listed in the same order as they appear in the VAR statement.

\n", "

The NOPRINT option of the PROC MEANS statement tells SAS to suppress printing of the summary statistics. We must use the PRINT procedure then to tell SAS to print contents of the hospsummary data set. Because the PROC PRINT statement contains no DATA= option, SAS prints the current data set. The data set created by the MEANS procedure becomes the most recent data set. Therefore, the hospsummary data set is the one that is printed.

\n", "

Now, launch and run the SAS program and review the output to make sure you understand the summarized data set we created. As we'd expect, the data set contains the hosp variable and the six requested variables, MeanRBC, MeanWBC, ..., MedianHCRIT, that contain the summary statistics. As you can see, the data set also contains two additional variables, _TYPE_ and _FREQ_.

\n", "

Whenever you use a CLASS statement to create an output data set containing statistics on subgroups, SAS automatically creates these two additional variables. Not surprisingly, the _FREQ_ variable indicates the number of observations contributing to each of the statistics calculated. The _TYPE_ variable indicates what kind of a summary statistic each of the observations in hospsummary contains. You can see that, here, _TYPE_ takes on two possible values 0 and 1. When _TYPE_ = 1, it means that the summary statistic is at the subgroup (hosp) level. That's why you'll see that _TYPE_ = 1 for nine of the observations in hospsummary — one for each hospital. All we really wanted here were these nine observations, but SAS had to complicate matters by giving us this \"bonus\" observation in which _TYPE_ = 0. When _TYPE_ = 0, it means that the summary statistics are overall summary statistics. That's why for the one observation in which _TYPE_ = 0, you'll see that _FREQ_ = 635. That tells us that all of the observations in phc6089.hem2 went into calculating the means and medians for that observation in hospsummary. It should also make sense then that hosp = . for that observation. Ugh, this is sounding messy!

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

You can also create a summarized data set using the SUMMARY procedure. The following program is identical to the program in the previous example except for two things:

\n", "
    \n", "
  1. The MEANS keyword has been replaced with the SUMMARY keyword
  2. \n", "
  3. The NOPRINT option has been removed from the PROC statement
  4. \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Hospital Statistics

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obshosp_TYPE__FREQ_MeanRBCMeanWBCMeanHCRITMedianRBCMedianWBCMedianHCRIT
1.06354.435017.1276939.46544.4107.039.30
21111064.413217.1066039.78214.4057.239.35
32111084.429917.3694439.27694.4407.639.00
4221424.395717.3707138.80244.4457.139.40
523164.276675.1666739.36674.2355.439.10
6311524.421357.5021239.28464.3757.539.10
7411924.502077.0043540.18594.4556.840.20
8421954.397267.0063239.13584.3506.939.30
9511654.500007.2461539.99694.5007.140.00
10521694.425806.7420338.81454.4106.738.50
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SUMMARY data=phc6089.hem2;\n", " var rbc wbc hcrit;\n", " class hosp;\n", " output out = hospsummary \n", " mean = MeanRBC MeanWBC MeanHCRIT\n", " median = MedianRBC MedianWBC MedianHCRIT;\n", "RUN;\n", " \n", "PROC PRINT data = hospsummary;\n", " title 'Hospital Statistics';\n", "RUN;\n", "\n", "title ; *reset title;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

There's nothing really new here. This example should just reinforce the fundamental difference between the SUMMARY and MEANS procedure. The SUMMARY procedure by default does not print output. That's why it is not necessary to use a NOPRINT option to tell SAS to suppress printing of output. This example should also reinforce the fundamental similarity between the SUMMARY and MEANS procedure, namely that the two procedures use identical syntax and produce identical output. Launch and run the SAS program, and review the output to convince yourself that there is no difference between the two data sets created by the MEANS and SUMMARY procedures.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

You can also create a summarized data set similar to the hospsummary data set created in the previous two examples by using a BY statement instead of a CLASS statement. The following program does just that:

\n", "
" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Hospital Statistics

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obshosp_TYPE__FREQ_MeanRBCMeanWBCMeanHCRITMedianRBCMedianWBCMedianHCRIT
11101064.413217.1066039.78214.4057.239.35
22101084.429917.3694439.27694.4407.639.00
3220424.395717.3707138.80244.4457.139.40
423064.276675.1666739.36674.2355.439.10
5310524.421357.5021239.28464.3757.539.10
6410924.502077.0043540.18594.4556.840.20
7420954.397267.0063239.13584.3506.939.30
8510654.500007.2461539.99694.5007.140.00
9520694.425806.7420338.81454.4106.738.50
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SORT data = phc6089.hem2 out = srtdhem2;\n", " by hosp;\n", "RUN;\n", " \n", "PROC MEANS data=srtdhem2 NOPRINT;\n", " var rbc wbc hcrit;\n", " by hosp;\n", " output out = hospsummary \n", " mean = MeanRBC MeanWBC MeanHCRIT\n", " median = MedianRBC MedianWBC MedianHCRIT;\n", "RUN;\n", " \n", "PROC PRINT data = hospsummary;\n", " title 'Hospital Statistics';\n", "RUN;\n", "\n", "title ; *reset title;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

As you can see, the only difference is that the CLASS statement was replaced by a BY statement, and of course because of that, we had to add a SORT procedure so we could sort the data in phc6089.hem2 by hosp. Launch and run the SAS program, and review the output to convince yourself that there is not much of a difference between the resulting hospsummary data set here and that in the previous examples.

\n", "

Well, okay, here _TYPE_ = 0 means that all of the observations here contain summary statistics at the subgroup level. The meaning of _TYPE_ should now seem very confusing to you. Fortunately, we don't need to worry about it much! There is always SAS Help and Documentation available if you're dying to learn more about it. The more important thing to note here is that the MEANS procedure summarizes each BY group as an independent subset of the input data, and therefore, SAS does not produce any sort of overall summarization as it does when using the CLASS statement.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graphical Summaries with PROC SGPLOT\n", "\n", "In this section, we will look at how to use PROC SGPLOT to create graphical summaries of quantitative variables and an example of using a summarized dataset to creat an interaction plot. Our main focus in this section will be on creating\n", "\n", "* histograms\n", "* boxplots\n", "* scatterplots.\n", "\n", "
\n", "

Example

\n", "

The following program uses data from the hemotology dataset to illustrate how to create a histogram.

\n", "
" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\"The\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SGPLOT data = phc6089.hem2;\n", " HISTOGRAM wbc;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The procedure PROC SGPLOT has the general form\n", "

\n",
    "\n",
    "PROC SGPLOT data = data-set;\n",
    "    PLOT_KEYWORD variables / plot-options;\n",
    "RUN;\n",
    "\n",
    "
\n", "\n", "where PLOT_KEYWORD is the SAS statement for the type of plot you would like to create, variables are the variable(s) from data-set that are in the plot, and plot-options are optional additional parameters to modify the plot.

\n", "\n", "

To make a boxplot, we would simply use the keyword HBOX for a horizontal boxplot or VBOX to make a vertical boxplot.

\n", "
" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\"The\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SGPLOT data = phc6089.hem2;\n", " VBOX wbc;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

A scatterplot can be created by using the keyword SCATTER. This plot requires two variables, so we have the specify X= and Y= instead of a single variable as we did for HISTOGRAM and VBOX/HBOX.

\n", "
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\"The\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SGPLOT data = phc6089.hem2;\n", " SCATTER X=rbc Y=wbc;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Other standard plots for quantitative variables that are available in PROC SGPLOT include:

\n", "
    \n", "
  • DENSITY - create a density plot
  • \n", "
  • SERIES - create a line plot
  • \n", "
  • BUBBLE - create a bubble plot. This is a scatterplot where the size of the point in a scatterplot is determined by some other third variable.
  • \n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

A great example of being in a situation in which you need to create a summarized data set is when you want to create an interaction plot. The following program uses data from the ICDB Background data set to illustrate how to create a simple plot to depict whether an interaction exists between two class variables, sex and race, where we recode race to be white and non-white, when the analysis variable of interest is education level (ed_level):

\n", "
" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Mean Education Level for Sex and Race combinations

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Obssexrace2_TYPE__FREQ_mn_edlev
1100513.47059
211054.40000
32005423.70849
4210403.42500
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"The\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA back;\n", " SET phc6089.back;\n", " IF race = 4 THEN race2 = 0;\n", " ELSE race2 = 1;\n", "RUN;\n", "\n", "PROC SORT data=back out=back;\n", " by sex race2;\n", "RUN;\n", " \n", "PROC MEANS data=back noprint;\n", " by sex race2;\n", " var ed_level;\n", " output out=meaned mean=mn_edlev;\n", "RUN;\n", " \n", "PROC PRINT data = meaned;\n", " title 'Mean Education Level for Sex and Race combinations';\n", "RUN;\n", " \n", "PROC SGPLOT data=meaned;\n", " title 'Interaction Plot of SEX, RACE, and Mean Education Level';\n", " SERIES X = race2 Y = mn_edlev / group = sex;\n", " XAXIS MIN = -0.5 MAX = 1.5;\n", "RUN;\n", "\n", "title ; *reset title;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The first DATA step adds a new version of the race variable to collapse all non-white races into a single group due to the low sample size of non-white participants. The SORT procedure prepares the Background dataset for BY-group processing. The MEANS procedure calculates the mean education level (\"var ed_level\") for each sex race combination (\"by sex race\"). The OUTPUT statement tells SAS to dump the results into a new dataset called meaned. The PRINT procedure of course tells SAS to print the meaned dataset which is shown above.

\n", "

As we'd expect, the dataset contains one row for each sex and race combination. The primary variable is mn_edlev, the average education level of the subjects of that sex and race combination. Onse the meaned datset is created, all we need to do is use the means in teh dataset to create the interaction plot. The SGPLOT procedure tells SAS to plot the mean education level on the y-axis and race on the x-axis. The \"group=sex\" option of the SERIES statement tells SAS to create two different series plots, one for each gender group. This plots shows that for males (sex=1), the average education level for non-whites is higher than for whites, whereas for females (sex=2), the average education level is lower for non-whites than whites. Note, however, there is a very low sample size for non-whites among both sexes, so this gap in average education level between males and females for non-whites may very well disappear with more data.

\n", "

An alternative way to view this type of interaction is with side by side boxplots as the following code illustrates.

\n", "
" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\"The\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SGPLOT data=back;\n", " title 'Interaction Plot Using Boxplots of SEX, RACE, and Mean Education Level';\n", " VBOX ed_level / category = race2 group = sex;\n", " XAXIS MIN = -0.5 MAX = 1.5;\n", "RUN;\n", "\n", "title ; *reset title;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

With boxplots, we also have a view of the variability in the data, showing that there is lots of overlap between groups. Any differences in the means are most likely not statistically significant.

\n", "

In the VBOX statement, we used the \"category = race2\" option to create side-by-side boxplots and then used the \"group=sex\" option to get a boxplot for each gender for both the white and non-white groups.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The UNIVARIATE Procedure\n", "\n", "In this section, we take a brief look at the UNIVARIATE procedure just so we can see how its output differs from that of the MEANS and SUMMARY procedures.\n", "\n", "
\n", "

Example

\n", "

The following UNIVARIATE procedure illustrates the (almost) simplest version of the procedure, in which it tells SAS to perform a univariate analysis on the red blood cell count (rbc) variable in the phc6089.hem2 data set:

\n", "
" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Univariate Analysis of RBC

\n", "
\n", "
\n", "

The UNIVARIATE Procedure

\n", "

Variable: rbc

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Moments
N635Sum Weights635
Mean4.43500787Sum Observations2816.23
Std Deviation0.394171Variance0.15537078
Skewness0.29025297Kurtosis0.51198988
Uncorrected SS12588.5073Corrected SS98.505075
Coeff Variation8.88771825Std Error Mean0.0156422
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Basic Statistical Measures
LocationVariability
Mean4.435008Std Deviation0.39417
Median4.410000Variance0.15537
Mode4.600000Range2.83000
  Interquartile Range0.52000
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Tests for Location: Mu0=0
TestStatisticp Value
Student's tt283.5284Pr > |t|<.0001
SignM317.5Pr >= |M|<.0001
Signed RankS100965Pr >= |S|<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Quantiles (Definition 5)
LevelQuantile
100% Max5.95
99%5.41
95%5.12
90%4.92
75% Q34.69
50% Median4.41
25% Q14.17
10%3.97
5%3.82
1%3.55
0% Min3.12
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Extreme Observations
LowestHighest
ValueObsValueObs
3.122185.49369
3.331525.5533
3.352275.62286
3.47725.70517
3.543655.95465
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC UNIVARIATE data = phc6089.hem2;\n", " title 'Univariate Analysis of RBC';\n", " var rbc;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The simplest version of the UNIVARIATE procedure would be one in which no VAR statement is present. Then, SAS would perform a univariate analysis for each numeric variable in the data set. The DATA= option merely tells SAS on which data set you want to do a univariate analysis. As always, if the DATA= option is absent, SAS performs the analysis on the current data set. The VAR statement tells SAS to perform a univariate analysis on the variable rbc.

\n", "

Launch and run the program and review the output to familiarize yourself with the kinds of summary statistics the univariate procedure calculates. You should see five major sections in the output with the following headings: Moments, Basic Statistical Measures, Tests for Location Mu0 = 0, Quantiles, and Extreme Observations. These sections are presented in the output shown above.

\n", "

With an introductory statistics course in your background, the output should be mostly self-explanatory. For example, the output tells us that the average (\"Mean\") red blood cell count of the 635 subjects (\"N\") in the data set is 4.435 with a standard deviation of 0.394. The median (\"50% Median\") red blood cell count is 4.41. The smallest red blood cell count in the data set is 3.12 (observation #218), while the largest is 5.95 (observation #465).

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

When you specify the NORMAL option, SAS will compute four different test statistics for the null hypothesis that the values of the variable specified in the VAR statement are a random sample from a normal distribution. The four test statistics calculated and presented in the output are: Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling.

\n", "

When you specify the PLOT option, SAS will produce a histogram, a box plot, and a normal probability plot for each variable specified in the VAR statement.

\n", "

If you have a BY statement specified as well, SAS will produce each of these plots for each level of the BY statement.

\n", "

The following UNIVARIATE procedure illustrates the NORMAL and PLOT options on the variable rbc of the hematology data set:

\n", "
" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Univariate Analysis of RBC with NORMAL and PLOT Options

\n", "
\n", "
\n", "

The UNIVARIATE Procedure

\n", "

Variable: rbc

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Moments
N635Sum Weights635
Mean4.43500787Sum Observations2816.23
Std Deviation0.394171Variance0.15537078
Skewness0.29025297Kurtosis0.51198988
Uncorrected SS12588.5073Corrected SS98.505075
Coeff Variation8.88771825Std Error Mean0.0156422
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Basic Statistical Measures
LocationVariability
Mean4.435008Std Deviation0.39417
Median4.410000Variance0.15537
Mode4.600000Range2.83000
  Interquartile Range0.52000
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Tests for Location: Mu0=0
TestStatisticp Value
Student's tt283.5284Pr > |t|<.0001
SignM317.5Pr >= |M|<.0001
Signed RankS100965Pr >= |S|<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Tests for Normality
TestStatisticp Value
Shapiro-WilkW0.992948Pr < W0.0044
Kolmogorov-SmirnovD0.033851Pr > D0.0771
Cramer-von MisesW-Sq0.145326Pr > W-Sq0.0279
Anderson-DarlingA-Sq1.070646Pr > A-Sq0.0085
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Quantiles (Definition 5)
LevelQuantile
100% Max5.95
99%5.41
95%5.12
90%4.92
75% Q34.69
50% Median4.41
25% Q14.17
10%3.97
5%3.82
1%3.55
0% Min3.12
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Extreme Observations
LowestHighest
ValueObsValueObs
3.122185.49369
3.331525.5533
3.352275.62286
3.47725.70517
3.543655.95465
\n", "
\n", "
\n", "
\n", "\"Plots\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC UNIVARIATE data = phc6089.hem2 NORMAL PLOT;\n", " title 'Univariate Analysis of RBC with NORMAL and PLOT Options';\n", " var rbc;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Review the output to familiarize yourself with the change in the UNIVARIATE output that arises from the NORMAL and PLOT options. You should see a new section called Tests for Normality that contains the four \"test for normality\" test statistics and corresponding P-values. At the end of the output, you should see the histogram and box plot

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

When you use the UNIVARIATE procedure's ID statement, SAS uses the values of the variable specified in the ID statement to indicate the five largest and five smallest observations rather than the (usually meaningless) observation number. The following UNIVARIATE procedure uses the subject number (subj) to indicate extreme values of red blood cell count (rbc):

\n", "
" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Univariate Analysis of RBC with ID Option

\n", "
\n", "
\n", "

The UNIVARIATE Procedure

\n", "

Variable: rbc

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Moments
N635Sum Weights635
Mean4.43500787Sum Observations2816.23
Std Deviation0.394171Variance0.15537078
Skewness0.29025297Kurtosis0.51198988
Uncorrected SS12588.5073Corrected SS98.505075
Coeff Variation8.88771825Std Error Mean0.0156422
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Basic Statistical Measures
LocationVariability
Mean4.435008Std Deviation0.39417
Median4.410000Variance0.15537
Mode4.600000Range2.83000
  Interquartile Range0.52000
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Tests for Location: Mu0=0
TestStatisticp Value
Student's tt283.5284Pr > |t|<.0001
SignM317.5Pr >= |M|<.0001
Signed RankS100965Pr >= |S|<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Quantiles (Definition 5)
LevelQuantile
100% Max5.95
99%5.41
95%5.12
90%4.92
75% Q34.69
50% Median4.41
25% Q14.17
10%3.97
5%3.82
1%3.55
0% Min3.12
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Extreme Observations
LowestHighest
ValuesubjObsValuesubjObs
3.122200072185.49410063369
3.332100571525.5511008633
3.352200212275.62310092286
3.47110134725.70510026517
3.544100593655.95420074465
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC UNIVARIATE data = phc6089.hem2;\n", " title 'Univariate Analysis of RBC with ID Option';\n", " var rbc;\n", " id subj;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Review the output to familiarize yourself with the change in the UNIVARIATE output that arises from using the ID statement. In a previous example, the UNIVARIATE output indicated that observation #218 has the smallest red blood cell count in the data set (3.12), while observation #465 has the largest (5.95). Now, because of the use of the subject number as an ID variable (\"id subj\"). SAS reports the more helpful information that subject 220007 has the smallest red blood cell count, while subject 420074 has the largest.

\n", "

You shouldn't be surprised to learn that the UNIVARIATE procedure can do much more than what we can address now. Just as the BY statement can be used in the MEANS and SUMMARY procedures to categorize the observations in the input data set into subgroups, so can a BY statement be used in the UNIVARIATE procedure. And, just as an OUTPUT statement can be used in the MEANS and SUMMARY procedures to create summarized data sets, so can an OUTPUT statement be used in the UNIVARIATE procedure. For more information about the functionality and syntax of the UNIVARIATE procedure, see the SAS Help and Documentation.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarizing Categorical Data\n", "\n", "In this section, we'll investigate the FREQ procedure as a tool for summarizing and analyzing categorical data. The procedure is a descriptive procedure, as well as a statistical procedure. It allows you to produce one-way to n-way frequency and cross tabulation tables. For two-way tables, the FREQ procedure also computes chi-square tests and measures of association. And, for n-way tables, the FREQ procedure also performs stratified analyses, computing statistics within as well as across strata. The FREQ procedure can also be used to output summary statistics, such as counts and percentages, to a SAS data set.\n", "\n", "### A Basic One-Way Table\n", "\n", "By default, the FREQ procedure creates a one-way table that contains the **frequency**, **percent**, **cumulative frequency**, and **cumulative percent** of every value of every variable in the input data set. That every there is italicized with good reason ... the FREQ procedure doesn't care whether the variable is a character variable or a numeric variable. And, if a variable is numeric, the FREQ procedure doesn't care if it is a discrete numeric variable with just a few possible outcomes (number of siblings, say) or a continuous numeric variable with an infinite number of possible outcomes (weight, say). That means then if you rely on the default version of the FREQ procedure, it is possible to create lots and lots and lots of output. That's why we'll skip the default version and will jump right to the more practical version in which you restrict the number of tables SAS creates by using a TABLES statement.\n", "\n", "The FREQ procedure takes the following generic form:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PROC FREQ options;\n", " tables ... /options;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The TABLES statement tells SAS the specific frequency table(s) that you want to create. If you don't include a TABLES statement, then SAS creates a one-way frequency table for every variable in your input data set.\n", "\n", "As you can see, there are two types of options, namely procedure options and table options. Procedure options, such as the typical \"DATA=\" option, must follow the PROC FREQ statement. Table options must be specified after a forward slash (/) in the TABLES statement. In either case, you can specify as many options as you would like.\n", "\n", "Throughout this lesson, we'll use the ICDB background data set to illustrate the FREQ procedure. Be sure to save the data set to a convenient location on your computer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following FREQ procedure illustrates the simplest practical example, namely a one-way frequency table of the variable sex, with no bells or whistles added:

\n", "
" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Frequency Count of SEX

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
sexFrequencyPercentCumulative
Frequency
Cumulative
Percent
1568.78568.78
258291.22638100.00
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LIBNAME phc6089 '/folders/myfolders/SAS_Notes/data/';\n", " \n", "PROC FREQ data=phc6089.back;\n", " title 'Frequency Count of SEX';\n", " tables sex;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch the SAS program and edit the LIBNAME statement so that it reflects the location in which you saved the background data set. Then, run the program and review the output. You should see something along the lines of this basic one-way frequency table, in which, as promised, SAS reports the frequency, percent, cumulative frequency, and cumulative percent of each value of the sex variable.

\n", "

This output tells us, for example, that 56 or 8.78% of the subjects in the ICDB Study are male (coded as sex = 1).

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

Again, by default, SAS outputs frequency counts, percents, cumulative frequencies, and cumulative percents. The NOCUM table option suppresses the printing of the cumulative frequencies and cumulative percentages for one-way frequency tables. The following SAS code illustrates the NOCUM table option:

\n", "
" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Frequency Count of SEX: No Cumulative Stats

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
sexFrequencyPercent
1568.78
258291.22
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.back;\n", " title 'Frequency Count of SEX: No Cumulative Stats';\n", " tables sex / nocum;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Review the output to convince yourself that indeed the cumulative frequencies and cumulative percentages are not printed in the table. The table contains only the number and percentage of each sex.

\n", "

In any FREQ procedure, you can specify many variables in a TABLES statement. If the list is long, you may be able to use a shortcut to specify the list of variables. If you specify a TABLES statement using a numbered range of variables, such as:

\n", "
tables var1-var4;
\n", "

then SAS will create a one-way frequency table for the four variables named var1, var2, var3, and var4. If instead in your TABLES statement, you specify a range of variables by their position in the data set, such as:

\n", "
tables sex--race;
\n", "

then SAS will create a one-way frequency table for every variable that appears between the sex and race variables in the data set, namely in the case of the background data set, sex, state, country, and race. Recall that if you're not sure of the position of the variables in your data set, you can use the VARNUM option of the CONTENTS procedure to determine the position of the variables in a data set. (Incidentally, that is not a typo in the second TABLES statement ... it takes two dashes to specify a range of variables by their position in the data set.)

\n", "

Rather than specifying many variables in a TABLES statement, you can specify many TABLES statements in a FREQ procedure. However you tell SAS to make multiple tables, you can use the PAGE option to tell SAS to print only one table per page. Otherwise, the FREQ procedure prints multiple tables per page as space permits.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following SAS program illustrates the creation of two one-way frequency tables in conjunction with the PAGE option:

\n", "
" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Frequency Count of SEX and RACE

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
sexFrequencyPercentCumulative
Frequency
Cumulative
Percent
1568.78568.78
258291.22638100.00
\n", "
\n", "
\n", "
\n", "
\n", "

Frequency Count of SEX and RACE

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
raceFrequencyPercentCumulative
Frequency
Cumulative
Percent
120.3120.31
271.1091.41
3294.55385.96
459392.9563198.90
530.4763499.37
620.3163699.69
710.1663799.84
810.16638100.00
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.back page;\n", " title 'Frequency Count of SEX and RACE';\n", " tables sex race;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Review the output to convince yourself that indeed SAS creates two one-way frequency tables — one for the categorical variable sex and the other for the categorical variable race. Because the PAGE option was invoked, each table should be printed on a separate page. The first page should contain the frequency table for the sex variable, and the second page should contain the frequency table for the race variable

\n", "

Incidentally, you might also want to notice that, not surprisingly, the order in which the variables appear in the TABLES statement determines the order in which they appear in the output.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

As is the case for many SAS procedures, you can use a BY statement to tell SAS to perform an operation for each level of a BY group. The following program tells SAS to create a one-way frequency table for the variable ed_level for each level of the variable sex:

\n", "
" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Frequency Count of Education Level within Each Level of Sex

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "
\n", "

sex=1

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ed_levelFrequencyPercentCumulative
Frequency
Cumulative
Percent
147.1447.14
2712.501119.64
31221.432341.07
42035.714376.79
51323.2156100.00
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Frequency Count of Education Level within Each Level of Sex

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "
\n", "

sex=2

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ed_levelFrequencyPercentCumulative
Frequency
Cumulative
Percent
171.2071.20
2223.78294.98
322037.8024942.78
422939.3547882.13
510417.87582100.00
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC SORT data=phc6089.back out=s_back;\n", " by sex;\n", "RUN;\n", " \n", "PROC FREQ data=s_back;\n", " title 'Frequency Count of Education Level within Each Level of Sex';\n", " tables ed_level;\n", " by sex;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

As is always the case, the SORT procedure merely prepares the background data set for BY-group processing. The SORT procedure tells SAS to sort the phc6089.back data set by sex, and to store the results in a new data set called s_back. Then, as you can see, the FREQ procedure is invoked with a BY statement (\"by sex\") in addition to the TABLES statement (\"tables ed_level\"). Launch and run the SAS program. Review the output to convince yourself that SAS creates two one-way frequency tables of education level (ed_level) — one for males (sex = 1) and one for females (sex = 2).

\n", "

So far, in each of the examples we have looked at, no missing values existed. When they do exist, SAS by default excludes them from your requested frequency tables. Instead, SAS prints the \"Frequency Missing\" below each table. You can instead opt to use the MISSING tables option, in which you tell SAS to instead treat missing values as nonmissing values, and to therefore include them in the calculation of percentages and other statistics. Or you can opt to use the MISSPRINT option to tell SAS to treat missing values as nonmissing values when printing the frequencies but do not include them in the calculation of the statistics.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following SAS program illustrates the MISSING and MISSPRINT options on the variable state in the background data set.

\n", "
" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

One-way Table of State: with MISSING Option

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
stateFrequencyPercentCumulative
Frequency
Cumulative
Percent
.4243.304243.30
144.124647.42
311.034748.45
455.155253.61
54546.3997100.00
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.back;\n", " title 'One-way Table of State: with MISSING Option';\n", " tables state / missing;\n", " WHERE state <= 5; *To keep the output small;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program, and review the resulting output. The output from the FREQ procedure with the MISSING tables option should look the table shown above. As you can see, the first row tells us that 42 subjects did not report the state in which they live. Because the MISSING option was used, SAS also tells us the 42 subjects comprise 6.58% of the subjects in the data set. SAS also includes the 42 subjects in the calculation of the cumulative percentage.

\n", "

On the other hand, the first few rows of output from the FREQ procedure with the MISSPRINT tables option should look something like this.

\n", "
" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

One-way Table of State: with MISSPRINT Option

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
stateFrequencyPercentCumulative
Frequency
Cumulative
Percent
Frequency Missing = 42
.42...
147.2747.27
311.8259.09
459.091018.18
54581.8255100.00
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.back;\n", " title 'One-way Table of State: with MISSPRINT Option';\n", " tables state/missprint;\n", " WHERE state <= 5; *to keep the output small;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

As you can see, the first row again tells us that 42 subjects did not report the state in which they live. In this case, however, because the MISSPRINT option was specified, SAS stops there. That is, SAS does not include the subjects in any of its calculations of the percent, cumulative frequency or cumulative percent.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Two-way and N-way Tables\n", "\n", "So far, we have only used the FREQ procedure to create one-way frequency tables. However, it is often helpful to create crosstabulation tables in which the frequencies are determined for more than one variable at a time. For example, if we were interested in the percentage of individuals that experience side effects on our new treatment and how this differs between different age groups, then we would need to create a two-way contigency table between the two categorical variables age group (say with the values 18-29, 30-44, 45-54, 55-70, and more than 70) and side effects (yes or no). We'll investigate such tables in this section.\n", "\n", "
\n", "

Example

\n", "

The following FREQ procedure illustrates the simplest example of telling SAS to create a two-way table, for the variables sex and ed_level, with no bells and whistles added.

\n", "
" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Crosstabulation of Education Level and Sex

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "
Frequency
\n", "
Percent
\n", "
Row Pct
\n", "
Col Pct
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Table of ed_level by sex
ed_levelsex
12Total
1\n", "
\n", "
4
\n", "
0.63
\n", "
36.36
\n", "
7.14
\n", "
\n", "
\n", "
\n", "
7
\n", "
1.10
\n", "
63.64
\n", "
1.20
\n", "
\n", "
\n", "
\n", "
11
\n", "
1.72
\n", "
 
\n", "
 
\n", "
\n", "
2\n", "
\n", "
7
\n", "
1.10
\n", "
24.14
\n", "
12.50
\n", "
\n", "
\n", "
\n", "
22
\n", "
3.45
\n", "
75.86
\n", "
3.78
\n", "
\n", "
\n", "
\n", "
29
\n", "
4.55
\n", "
 
\n", "
 
\n", "
\n", "
3\n", "
\n", "
12
\n", "
1.88
\n", "
5.17
\n", "
21.43
\n", "
\n", "
\n", "
\n", "
220
\n", "
34.48
\n", "
94.83
\n", "
37.80
\n", "
\n", "
\n", "
\n", "
232
\n", "
36.36
\n", "
 
\n", "
 
\n", "
\n", "
4\n", "
\n", "
20
\n", "
3.13
\n", "
8.03
\n", "
35.71
\n", "
\n", "
\n", "
\n", "
229
\n", "
35.89
\n", "
91.97
\n", "
39.35
\n", "
\n", "
\n", "
\n", "
249
\n", "
39.03
\n", "
 
\n", "
 
\n", "
\n", "
5\n", "
\n", "
13
\n", "
2.04
\n", "
11.11
\n", "
23.21
\n", "
\n", "
\n", "
\n", "
104
\n", "
16.30
\n", "
88.89
\n", "
17.87
\n", "
\n", "
\n", "
\n", "
117
\n", "
18.34
\n", "
 
\n", "
 
\n", "
\n", "
Total\n", "
\n", "
56
\n", "
8.78
\n", "
\n", "
\n", "
\n", "
582
\n", "
91.22
\n", "
\n", "
\n", "
\n", "
638
\n", "
100.00
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.back;\n", " title 'Crosstabulation of Education Level and Sex';\n", " tables ed_level*sex;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

As you can see, to tell SAS to create a two-way table of ed_level and sex, we merely use an asterisk (*) to join the two variables in the TABLES statement.

\n", "

Launch and run the SAS program. Review the output to convince yourself that SAS created the requested two-way table. In general, the values of the variable appearing before the asterisk form the rows of the table, and the values of the variable appearing after the asterisk form the columns of the table. In this case, since ed_level appears before the asterisk in the TABLES statement, its values form the rows of the table. And, since sex appears after the asterisk, its values form the columns of the table.

\n", "

When SAS creates two-way tables, each cell of the table contains, by default, the cell frequency, the cell percentage of the total frequency, the cell percentage of the row frequency, and the cell percentage of the column frequency. You might want to review the numbers in each of the cells of the table to make sure you understand what is what. In the upper left-hand corner of the table, SAS always prints a guide to the numbers appearing in each of the cells in the table. Here, SAS tells us that the first number in cell (i,j) is the number of subjects of ed_level i and sex j; the second number in cell (i,j) is the percentage of subjects of ed_level i and sex j; the third number in cell (i,j) is the percentage of subjects who are sex j given that they are ed_level i; and the fourth number in cell (i,j) is the percentage of subjects who are ed_level i given that they are sex j. For example, for the cell in which ed_level = 4 and sex = 2, SAS tells us that: 229 of the subjects in the data set are ed_level 4 and sex 2; 35.89% of the subjects in the data set are ed_level 4 and sex 2; 91.97% of the subjects in the data set who are ed_level 4 are sex 2; and 39.35% of the subjects in the data set who are sex 2 are ed_level 4.

\n", "

A little note about shortcuts. If you have many different two-way tables to create, you can use a variety of shortcuts. For example, the TABLES statement:

\n", "
tables a*(b c);
\n", "

tells SAS to create a two-way table between variables a and b (a*b) and a two-way table between variables a and c (a*c). The TABLES statement:

\n", "
tables (a b)*(c d);
\n", "

tells SAS to create four two-way tables, namely: a*c, b*c, a*d, and b*d. The TABLES statement:

\n", "
tables (a b c)*d;
\n", "

tells SAS to create three two-way tables, namely: a*d, b*d, and c*d.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

For a frequency analysis of more than two variables, we can use the FREQ procedure to create n-way crosstabulation tables. In that case, a series of two-way tables is created, with a table for each level of the other variable(s). The following program creates a three-way table of sex, job_chng, and ed_level:

\n", "
" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

3-way Table of Sex, Job Change, and Ed. Level

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "
Frequency
\n", "
Percent
\n", "
Row Pct
\n", "
Col Pct
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Table 1 of job_chng by ed_level
Controlling for sex=1
job_chnged_level
12345Total
0\n", "
\n", "
4
\n", "
7.84
\n", "
8.33
\n", "
100.00
\n", "
\n", "
\n", "
\n", "
6
\n", "
11.76
\n", "
12.50
\n", "
100.00
\n", "
\n", "
\n", "
\n", "
11
\n", "
21.57
\n", "
22.92
\n", "
100.00
\n", "
\n", "
\n", "
\n", "
15
\n", "
29.41
\n", "
31.25
\n", "
88.24
\n", "
\n", "
\n", "
\n", "
12
\n", "
23.53
\n", "
25.00
\n", "
92.31
\n", "
\n", "
\n", "
\n", "
48
\n", "
94.12
\n", "
 
\n", "
 
\n", "
\n", "
1\n", "
\n", "
0
\n", "
0.00
\n", "
0.00
\n", "
0.00
\n", "
\n", "
\n", "
\n", "
0
\n", "
0.00
\n", "
0.00
\n", "
0.00
\n", "
\n", "
\n", "
\n", "
0
\n", "
0.00
\n", "
0.00
\n", "
0.00
\n", "
\n", "
\n", "
\n", "
2
\n", "
3.92
\n", "
66.67
\n", "
11.76
\n", "
\n", "
\n", "
\n", "
1
\n", "
1.96
\n", "
33.33
\n", "
7.69
\n", "
\n", "
\n", "
\n", "
3
\n", "
5.88
\n", "
 
\n", "
 
\n", "
\n", "
Total\n", "
\n", "
4
\n", "
7.84
\n", "
\n", "
\n", "
\n", "
6
\n", "
11.76
\n", "
\n", "
\n", "
\n", "
11
\n", "
21.57
\n", "
\n", "
\n", "
\n", "
17
\n", "
33.33
\n", "
\n", "
\n", "
\n", "
13
\n", "
25.49
\n", "
\n", "
\n", "
\n", "
51
\n", "
100.00
\n", "
\n", "
Frequency Missing = 5
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "
Frequency
\n", "
Percent
\n", "
Row Pct
\n", "
Col Pct
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Table 2 of job_chng by ed_level
Controlling for sex=2
job_chnged_level
12345Total
0\n", "
\n", "
4
\n", "
0.84
\n", "
0.99
\n", "
80.00
\n", "
\n", "
\n", "
\n", "
14
\n", "
2.95
\n", "
3.47
\n", "
82.35
\n", "
\n", "
\n", "
\n", "
148
\n", "
31.16
\n", "
36.72
\n", "
85.06
\n", "
\n", "
\n", "
\n", "
163
\n", "
34.32
\n", "
40.45
\n", "
83.16
\n", "
\n", "
\n", "
\n", "
74
\n", "
15.58
\n", "
18.36
\n", "
89.16
\n", "
\n", "
\n", "
\n", "
403
\n", "
84.84
\n", "
 
\n", "
 
\n", "
\n", "
1\n", "
\n", "
1
\n", "
0.21
\n", "
1.39
\n", "
20.00
\n", "
\n", "
\n", "
\n", "
3
\n", "
0.63
\n", "
4.17
\n", "
17.65
\n", "
\n", "
\n", "
\n", "
26
\n", "
5.47
\n", "
36.11
\n", "
14.94
\n", "
\n", "
\n", "
\n", "
33
\n", "
6.95
\n", "
45.83
\n", "
16.84
\n", "
\n", "
\n", "
\n", "
9
\n", "
1.89
\n", "
12.50
\n", "
10.84
\n", "
\n", "
\n", "
\n", "
72
\n", "
15.16
\n", "
 
\n", "
 
\n", "
\n", "
Total\n", "
\n", "
5
\n", "
1.05
\n", "
\n", "
\n", "
\n", "
17
\n", "
3.58
\n", "
\n", "
\n", "
\n", "
174
\n", "
36.63
\n", "
\n", "
\n", "
\n", "
196
\n", "
41.26
\n", "
\n", "
\n", "
\n", "
83
\n", "
17.47
\n", "
\n", "
\n", "
\n", "
475
\n", "
100.00
\n", "
\n", "
Frequency Missing = 107
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.back;\n", " title '3-way Table of Sex, Job Change, and Ed. Level';\n", " tables sex*job_chng*ed_level;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

As you can see, to tell SAS to create a three-way table of sex, job_chng, and ed_level, we use an asterisk (*) to join the three variables in the TABLES statement. The order of the variables is important. In n-way tables, the last two variables of the TABLES statement become the rows and columns of the two-way tables. Variables that precede the last two variables in the TABLES statement stratify the crosstabulation tables. So, in this case, we should expect SAS to create two two-way tables of job_chng and ed_level, one for when sex = 1 and one for when sex = 2.

\n", "

Launch and run the SAS program, and review the output to convince yourself that SAS created the three-way table as described. You should see that, indeed, SAS created one two-way table of job_chng and ed_level for when sex = 1, and one two-way table of job_chng and ed_level for when sex = 2.

\n", "

It probably goes without saying that, in general, n-way tables can generate lots of output.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cell Statistics\n", "\n", "As you now know, when SAS creates two-way tables, each cell of the table contains, by default, the cell frequency, the (joint) cell percentage, the conditional row percentage, and the conditional column percentage. As usual, you don't have to accept the default ... you can tell SAS to suppress some of the default statistics it displays ... or you can tell SAS to display alternative statistics.\n", "\n", "For n-way crosstabulations, you can suppress the default output in each of the cells by using any (or all) of the following table options:\n", "\n", "* **NOFREQ** suppresses the printing of the cell freqencies\n", "* **NOROW** suppresses the printing of the row percentages\n", "* **NOCOL** suppresses the printing of the column percentages, and\n", "* **NOPERCENT** suppresses the printing of the (joint) cell percentages.\n", "\n", "
\n", "

Example

\n", "

The following SAS program illustrates the NOROW, NOCOL, and NOPERCENT table options:

\n", "
" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Crosstabulation of SEX and RACE: No percents

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "
Frequency
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Table of race by sex
racesex
12Total
1\n", "
\n", "
0
\n", "
\n", "
\n", "
\n", "
2
\n", "
\n", "
\n", "
\n", "
2
\n", "
\n", "
2\n", "
\n", "
3
\n", "
\n", "
\n", "
\n", "
4
\n", "
\n", "
\n", "
\n", "
7
\n", "
\n", "
3\n", "
\n", "
1
\n", "
\n", "
\n", "
\n", "
28
\n", "
\n", "
\n", "
\n", "
29
\n", "
\n", "
4\n", "
\n", "
51
\n", "
\n", "
\n", "
\n", "
542
\n", "
\n", "
\n", "
\n", "
593
\n", "
\n", "
5\n", "
\n", "
0
\n", "
\n", "
\n", "
\n", "
3
\n", "
\n", "
\n", "
\n", "
3
\n", "
\n", "
6\n", "
\n", "
0
\n", "
\n", "
\n", "
\n", "
2
\n", "
\n", "
\n", "
\n", "
2
\n", "
\n", "
7\n", "
\n", "
1
\n", "
\n", "
\n", "
\n", "
0
\n", "
\n", "
\n", "
\n", "
1
\n", "
\n", "
8\n", "
\n", "
0
\n", "
\n", "
\n", "
\n", "
1
\n", "
\n", "
\n", "
\n", "
1
\n", "
\n", "
Total\n", "
\n", "
56
\n", "
\n", "
\n", "
\n", "
582
\n", "
\n", "
\n", "
\n", "
638
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.back;\n", " title 'Crosstabulation of SEX and RACE: No percents';\n", " tables race*sex / norow nocol nopercent;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Review the output to convince yourself that SAS did indeed suppress the default row percentages, column percentages, and cell percentages. Note, too, that SAS changed the guide in the upper left-hand corner of the table to reflect the new situation — the guide now tells us that the lone number reported in each cell is the cell frequency.

\n", "

Instead of suppressing output in an n-way table, we can request additional output in each of the cells by using the following options:

\n", "
    \n", "
  • EXPECTED, to print the expected cell frequencies under the null hypothesis of independence
  • \n", "
  • DEVIATION, to print the deviation of the cell frequency from the expected cell frequency (under the null hypothesis of independence)
  • \n", "
  • CELLCHI2, to print each cell's contribution to the total chi-squared statistic
  • \n", "
\n", "

That is, CELLCHI2 is defined as: CELLCHI2 = (frequency - expected)2expected.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

In creating the two-way table between race and sex, the following FREQ procedure requests that the EXPECTED and CELLCHI2 statistics be printed, while at the same time suppressing the printing of the joint, row, and column percentages:

\n", "
" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Crosstabulation of SEX and RACE: With Expecteds

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "
Frequency
\n", "
Expected
\n", "
Cell Chi-Square
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Table of race by sex
racesex
12Total
1\n", "
\n", "
0
\n", "
0.1755
\n", "
0.1755
\n", "
\n", "
\n", "
\n", "
2
\n", "
1.8245
\n", "
0.0169
\n", "
\n", "
\n", "
\n", "
2
\n", "
 
\n", "
 
\n", "
\n", "
2\n", "
\n", "
3
\n", "
0.6144
\n", "
9.2624
\n", "
\n", "
\n", "
\n", "
4
\n", "
6.3856
\n", "
0.8912
\n", "
\n", "
\n", "
\n", "
7
\n", "
 
\n", "
 
\n", "
\n", "
3\n", "
\n", "
1
\n", "
2.5455
\n", "
0.9383
\n", "
\n", "
\n", "
\n", "
28
\n", "
26.455
\n", "
0.0903
\n", "
\n", "
\n", "
\n", "
29
\n", "
 
\n", "
 
\n", "
\n", "
4\n", "
\n", "
51
\n", "
52.05
\n", "
0.0212
\n", "
\n", "
\n", "
\n", "
542
\n", "
540.95
\n", "
0.002
\n", "
\n", "
\n", "
\n", "
593
\n", "
 
\n", "
 
\n", "
\n", "
5\n", "
\n", "
0
\n", "
0.2633
\n", "
0.2633
\n", "
\n", "
\n", "
\n", "
3
\n", "
2.7367
\n", "
0.0253
\n", "
\n", "
\n", "
\n", "
3
\n", "
 
\n", "
 
\n", "
\n", "
6\n", "
\n", "
0
\n", "
0.1755
\n", "
0.1755
\n", "
\n", "
\n", "
\n", "
2
\n", "
1.8245
\n", "
0.0169
\n", "
\n", "
\n", "
\n", "
2
\n", "
 
\n", "
 
\n", "
\n", "
7\n", "
\n", "
1
\n", "
0.0878
\n", "
9.4806
\n", "
\n", "
\n", "
\n", "
0
\n", "
0.9122
\n", "
0.9122
\n", "
\n", "
\n", "
\n", "
1
\n", "
 
\n", "
 
\n", "
\n", "
8\n", "
\n", "
0
\n", "
0.0878
\n", "
0.0878
\n", "
\n", "
\n", "
\n", "
1
\n", "
0.9122
\n", "
0.0084
\n", "
\n", "
\n", "
\n", "
1
\n", "
 
\n", "
 
\n", "
\n", "
Total\n", "
\n", "
56
\n", "
\n", "
\n", "
\n", "
582
\n", "
\n", "
\n", "
\n", "
638
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.back;\n", " title 'Crosstabulation of SEX and RACE: With Expecteds';\n", " tables race*sex / expected cellchi2 norow nocol nopercent;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Review the output to convince yourself that SAS did indeed suppress the printing of the joint, row, and column percentages, while adding the expected cell frequencies, as well as the cell's contribution to the chi-square statistic, to the cell output. Note, again, that SAS changed the guide in the upper left-hand corner of the table to reflect the new situation — the guide now tells us that the second number in each cell is the expected cell frequency and the third number is the cell's contribution to the chi-square statistic.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Changing the Table Format\n", "\n", "When three or more variables are specified, the multiple levels of n-way tables can produce lots of output. Such bulky and complex crosstabulations are often easier to read as a continuous list. To tell SAS to generate list output for crosstabulations rather than the default tables, simply use the LIST tables option.\n", "\n", "
\n", "

Example

\n", "

The following FREQ procedure illustrates the LIST option:

\n", "
" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Crosstabulation of SEX and RACE: In List Format

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
sexraceFrequencyPercentCumulative
Frequency
Cumulative
Percent
1230.4730.47
1310.1640.63
14517.99558.62
1710.16568.78
2120.31589.09
2240.63629.72
23284.399014.11
2454284.9563299.06
2530.4763599.53
2620.3163799.84
2810.16638100.00
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.back;\n", " title 'Crosstabulation of SEX and RACE: In List Format';\n", " tables sex*race/list;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program. Review the output to convince yourself that rather than creating a two-way table, SAS prints the requested two-way table as a list. I tend to prefer the LIST format for such cross-tabulations, as long as I'm not in need of the (conditional) row and column percentages (which, as you can see, are not reported in LIST format).

\n", "

Incidentally, the LIST option cannot be used when the TABLES statement requests statistical tests or measures of association to be included in the output. (We'll be discussing such tests and measures later in this lesson.) Also, expected cell frequencies are not printed when the LIST option is specified, even if the EXPECTED option is invoked.

\n", "

An alternative to the LIST option is the CROSSLIST table option. When you add the CROSSLIST option to your TABLES statement, SAS displays crosstabulation tables in the standard Output Delivery System (ODS) column format. The benefit of doing so probably won't make much sense until we learn about ODS later. The option creates a table that has a table definition that you can customize by using the TEMPLATE procedure. For now, let's just be satisfied with looking at an example to see what the output looks like when you use the CROSSLIST option.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following FREQ procedure illustrates the CROSSLIST option:

\n", "
" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Crosstabulation of SEX and RACE: In Crosslist Format

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Table of sex by race
sexraceFrequencyPercentRow
Percent
Column
Percent
1100.000.000.00
 230.475.3642.86
 310.161.793.45
 4517.9991.078.60
 500.000.000.00
 600.000.000.00
 710.161.79100.00
 800.000.000.00
 Total568.78100.00 
2120.310.34100.00
 240.630.6957.14
 3284.394.8196.55
 454284.9593.1391.40
 530.470.52100.00
 620.310.34100.00
 700.000.000.00
 810.160.17100.00
 Total58291.22100.00 
Total120.31 100.00
 271.10 100.00
 3294.55 100.00
 459392.95 100.00
 530.47 100.00
 620.31 100.00
 710.16 100.00
 810.16 100.00
 Total638100.00  
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.back;\n", " title 'Crosstabulation of SEX and RACE: In Crosslist Format';\n", " tables sex*race/crosslist;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Launch and run the SAS program, and review the output to familiarize yourself with the appearance of the output when you use the CROSSLIST option.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating Output Datasets\n", "\n", "Just as the UNIVARIATE, MEANS and SUMMARY procedures can be used to create new SAS data sets containing summary statistics of numeric variables, the FREQ procedure can be used to create new SAS data sets containing summary statistics of categorical variables. This can be a very useful tool when manipulating and subsequently analyzing data.\n", "\n", "
\n", "

Example

\n", "

The following FREQ procedure tells SAS to create an output data set which contains the counts and percentages for each combination of the variables sex and race:

\n", "
" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: SEXFREQ

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ObssexraceCOUNTPERCENT
11230.4702
21310.1567
314517.9937
41710.1567
52120.3135
62240.6270
723284.3887
82454284.9530
92530.4702
102620.3135
112810.1567
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.back;\n", " tables sex*race / out = sexfreq noprint;\n", "RUN;\n", " \n", "PROC PRINT data = sexfreq;\n", " title 'Output Dataset: SEXFREQ';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The NOPRINT tables option tells SAS to suppress printing the two-way crosstabulation of sex*race. Instead, the results are dumped to the temporary data set sexfreq that is specified in the OUT= tables option. The data set in the OUT= option becomes the current data set. Therefore, since the PRINT procedure does not have a DATA= option to tell SAS the name of the data set to print, SAS prints the current data set, namely sexfreq.

\n", "

Launch and run the SAS program. Review the output to convince yourself that the sexfreq data set contains one record for each combination of sex and race that occurs in the data set. And, note that the data set contains four variables, sex, race, COUNT (whose value is the number of subjects in the data set of that sex and race), and PERCENT (whose value is the percentage of subjects in the data set of that sex and race).

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The SPARSE tables option tells SAS to print information about all possible combinations of levels of the variables in the table request, even when some combinations of levels do not occur in the data. This option affects printouts under the LIST option and output data sets only. The following SAS code illustrates use of the SPARSE option in the creation of an output data set called sexfreq:

\n", "
" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

Output Dataset: SEXFREQ with SPARSE option

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ObssexraceCOUNTPERCENT
11100.0000
21230.4702
31310.1567
414517.9937
51500.0000
61600.0000
71710.1567
81800.0000
92120.3135
102240.6270
1123284.3887
122454284.9530
132530.4702
142620.3135
152700.0000
162810.1567
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data=phc6089.back;\n", " tables sex*race / out = sexfreq noprint sparse;\n", "RUN;\n", " \n", "PROC PRINT data = sexfreq;\n", " title 'Output Dataset: SEXFREQ with SPARSE option';\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The only difference between this code and the previous code is the inclusion of the SPARSE option. Launch and run the SAS program. Review the output to convince yourself that the sexfreq data set now contains one record for each possible combination of sex and race regardless of whether or not the combination occurs in the data set. That is, since there are 2 sexes and 8 races, the sexfreq data set contains 16 records — one for each of the combinations. For example, there is now a record in the sexfreq data set corresponding to sex = 1 and race = 1 even though there are no subjects in the data set of that sex and race combination (COUNT = 0).

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises\n", "\n", "For these exercises, we will use the bike lanes dataset (source), Bike_Lanes.csv. You can download this dataset from the course webpage.\n", "\n", "1. How many bike lanes are currently in Baltimore? You can assume that each observation/row is a different bike lane.\n", "2. How many (a) feet and (b) miles of total bike lanes are currently in Baltimore? (The length variable provides the length in feet.)\n", "3. How many types (type) bike lanes are there? Which type (a) occurs the most and (b) has the longest average bike lane length?\n", "4. How many different projects (project) do the bike lanes fall into? Which project category has the longest average bike lane length?\n", "5. What was the average bike lane length per year that they were installed? (Be sure to first set dateInstalled to . if it is equal to zero.)\n", "6. (a) Numerically and (b) graphically describe the distribution of bike lane lengths (length).\n", "7. Describe the distribution of bike lane lengths numerically and graphically after stratifying them by (a) type and then by (b) number of lanes (numLanes).\n" ] } ], "metadata": { "kernelspec": { "display_name": "SAS", "language": "sas", "name": "sas" }, "language_info": { "codemirror_mode": "sas", "file_extension": ".sas", "mimetype": "text/x-sas", "name": "sas" } }, "nbformat": 4, "nbformat_minor": 2 }