{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Statistical Analysis in SAS\n", "\n", "Now we are going to cover how to perform a variety of basic statistical tests in SAS.\n", "\n", "* Proportion tests\n", "* Chi-squared\n", "* Fisher’s Exact Test\n", "* Correlation\n", "* T-tests/Rank-sum tests\n", "* One-way ANOVA/Kruskal-Wallis\n", "* Linear Regression\n", "* Logistic Regression\n", "* Poisson Regression\n", "\n", "Note: We will be glossing over the statistical theory and “formulas” for these tests. There are plenty of resources online for learning more about these tests if you have not had a course covering this material. You will only be required to write code to fit or perform these tests, but will not be expected to interpret the results for this course.\n", "\n", "## Proportion Tests\n", "\n", "To conduct a test for one proportion, we can use PROC FREQ. To get this test, we use the BINOMIAL option in the TABLES statement. As options to BINOMIAL, we can specify\n", "\n", "* p= - the null value for the hypothesis test\n", "* level= - which group to use as a \"success\"\n", "* CORRECT - uses a continuity correction for calculating the p-value (can be useful for small sample sizes)\n", "* CL= - selects the type of confidence interval, such as WALD, EXACT, or LOGIT.\n", "\n", "
\n", "

Example

\n", "

In the following example, we use a summarized dataset, where we have the counts of the \"successes\" and \"failures\". In this case, we are interested in the proportion of smokers, so we have a count of smokers and a count of non-smokers.

\n", "
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
smkstatusFrequencyPercentCumulative
Frequency
Cumulative
Percent
N1753.131753.13
Y1546.8832100.00
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Binomial Proportion
smkstatus = Y
Proportion0.4688
ASE0.0882
95% Lower Conf Limit0.2802
95% Upper Conf Limit0.6573
  
Exact Conf Limits 
95% Lower Conf Limit0.2909
95% Upper Conf Limit0.6526
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Test of H0: Proportion = 0.5
The asymptotic confidence limits and test
include a continuity correction.
ASE under H00.0884
Z-0.1768
One-sided Pr < Z0.4298
Two-sided Pr > |Z|0.8597
\n", "
\n", "
\n", "

Sample Size = 32

\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA smoke;\n", " INPUT smkstatus $ count;\n", " DATALINES;\n", "Y 15\n", "N 17\n", ";\n", "RUN;\n", "\n", "PROC FREQ data = smoke;\n", " TABLES smkstatus / binomial(p = 0.5 level = \"Y\" CORRECT) alpha = 0.05;\n", " WEIGHT count;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note the use of the WEIGHT statement to specify the counts for Y and N. Without this statement, SAS would read our data as having 1 Y and 1 N.

\n", "

The estimated proportion is 0.4688. The (asymptotic) 95% CI is (0.2802, 0.6573), and the two-sided (continuity-corrected) p-value for testing $H_0: p=0.5$ vs $H_a: p\neq 0.5$ is 0.8597.

\n", "
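The CL= option was not used in the example above. As a minimal sketch (reusing the smoke dataset from above, and assuming we want logit-based limits), it could be added like this:

```sas
/* Sketch: request logit-based confidence limits with CL= (reuses the smoke dataset) */
PROC FREQ data = smoke;
    TABLES smkstatus / binomial(p = 0.5 level = "Y" CL = LOGIT) alpha = 0.05;
    WEIGHT count;
RUN;
```

Swapping LOGIT for WALD or EXACT selects one of the other interval types listed earlier.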

Alternatively, we could have had the data listed out for each individual as follows.

\n", "
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
smkstatusFrequencyPercentCumulative
Frequency
Cumulative
Percent
N1753.131753.13
Y1546.8832100.00
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Binomial Proportion
smkstatus = Y
Proportion0.4688
ASE0.0882
95% Lower Conf Limit0.2802
95% Upper Conf Limit0.6573
  
Exact Conf Limits 
95% Lower Conf Limit0.2909
95% Upper Conf Limit0.6526
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Test of H0: Proportion = 0.5
The asymptotic confidence limits and test
include a continuity correction.
ASE under H00.0884
Z-0.1768
One-sided Pr < Z0.4298
Two-sided Pr > |Z|0.8597
\n", "
\n", "
\n", "

Sample Size = 32

\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA smoke2;\n", " DO i = 1 to 15;\n", " smkstatus = \"Y\";\n", " OUTPUT;\n", " END;\n", " DO i = 1 to 17;\n", " smkstatus = \"N\";\n", " OUTPUT;\n", " END;\n", " DROP i;\n", "RUN;\n", "\n", "PROC FREQ data = smoke2;\n", " TABLES smkstatus / binomial(p = 0.5 level = \"Y\" CORRECT) alpha = 0.05;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chi-squared Test\n", "\n", "To test for an association between two categorical variables, we could perform a chi-square test of independence. Again, we will use PROC FREQ with a TABLES statement. For 2x2 tables, a chi-square test is automatically performed, but for larger tables, we can request it by providing the CHISQ option to the TABLES statement. Another useful option to specify is the EXPECTED option, which provides the expected cell counts under the null hypothesis of independence. These expected cell counts are needed to assess whether or not the chi-square test is appropriate.\n", "\n", "
\n", "

Example

\n", "

The following example uses the Kaggle car auction dataset to test for an association between online sales and a car being a bad buy.

\n", "
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "
Frequency
\n", "
Expected
\n", "
Percent
\n", "
Row Pct
\n", "
Col Pct
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Table of IsBadBuy by IsOnlineSale
IsBadBuyIsOnlineSale
01Total
0\n", "
\n", "
62375
\n", "
62389
\n", "
85.47
\n", "
97.45
\n", "
87.68
\n", "
\n", "
\n", "
\n", "
1632
\n", "
1618.1
\n", "
2.24
\n", "
2.55
\n", "
88.46
\n", "
\n", "
\n", "
\n", "
64007
\n", "
 
\n", "
87.70
\n", "
 
\n", "
 
\n", "
\n", "
1\n", "
\n", "
8763
\n", "
8749.1
\n", "
12.01
\n", "
97.63
\n", "
12.32
\n", "
\n", "
\n", "
\n", "
213
\n", "
226.91
\n", "
0.29
\n", "
2.37
\n", "
11.54
\n", "
\n", "
\n", "
\n", "
8976
\n", "
 
\n", "
12.30
\n", "
 
\n", "
 
\n", "
\n", "
Total\n", "
\n", "
71138
\n", "
97.47
\n", "
\n", "
\n", "
\n", "
1845
\n", "
2.53
\n", "
\n", "
\n", "
\n", "
72983
\n", "
100.00
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Statistics for Table of IsBadBuy by IsOnlineSale

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
StatisticDFValueProb
Chi-Square10.99780.3178
Likelihood Ratio Chi-Square11.01540.3136
Continuity Adj. Chi-Square10.92740.3356
Mantel-Haenszel Chi-Square10.99780.3179
Phi Coefficient -0.0037 
Contingency Coefficient 0.0037 
Cramer's V -0.0037 
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Fisher's Exact Test
Cell (1,1) Frequency (F)62375
Left-sided Pr <= F0.1679
Right-sided Pr >= F0.8498
  
Table Probability (P)0.0177
Two-sided Pr <= P0.3324
\n", "
\n", "
\n", "

Sample Size = 72983

\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FILENAME cardata '/folders/myfolders/SAS_Notes/data/kaggleCarAuction.csv';\n", "\n", "PROC IMPORT datafile = cardata out = cars dbms = CSV replace;\n", " getnames = yes;\n", " guessingrows = 1000;\n", "RUN;\n", "\n", "PROC FREQ data = cars;\n", " TABLES isbadbuy*isonlinesale / chisq expected;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The chi-square test results in a p-value of 0.3178; if we instead use the chi-square test with continuity correction, we get a p-value of 0.3356.

\n", "

In the 2x2 case, as in this example, we may also want measures of effect such as the risk difference, relative risk, and odds ratio. We can obtain these using the RISKDIFF, RELRISK, and OR options, which together request all three measures with confidence intervals.

\n", "
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "
Frequency
\n", "
Percent
\n", "
Row Pct
\n", "
Col Pct
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Table of IsBadBuy by IsOnlineSale
IsBadBuyIsOnlineSale
01Total
0\n", "
\n", "
62375
\n", "
85.47
\n", "
97.45
\n", "
87.68
\n", "
\n", "
\n", "
\n", "
1632
\n", "
2.24
\n", "
2.55
\n", "
88.46
\n", "
\n", "
\n", "
\n", "
64007
\n", "
87.70
\n", "
 
\n", "
 
\n", "
\n", "
1\n", "
\n", "
8763
\n", "
12.01
\n", "
97.63
\n", "
12.32
\n", "
\n", "
\n", "
\n", "
213
\n", "
0.29
\n", "
2.37
\n", "
11.54
\n", "
\n", "
\n", "
\n", "
8976
\n", "
12.30
\n", "
 
\n", "
 
\n", "
\n", "
Total\n", "
\n", "
71138
\n", "
97.47
\n", "
\n", "
\n", "
\n", "
1845
\n", "
2.53
\n", "
\n", "
\n", "
\n", "
72983
\n", "
100.00
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Statistics for Table of IsBadBuy by IsOnlineSale

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Column 1 Risk Estimates
 RiskASE95%
Confidence Limits
Exact 95%
Confidence Limits
Difference is (Row 1 - Row 2)
Row 10.97450.00060.97330.97570.97330.9757
Row 20.97630.00160.97310.97940.97290.9793
Total0.97470.00060.97360.97590.97360.9758
Difference-0.00180.0017-0.00510.0016  
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Column 2 Risk Estimates
 RiskASE95%
Confidence Limits
Exact 95%
Confidence Limits
Difference is (Row 1 - Row 2)
Row 10.02550.00060.02430.02670.02430.0267
Row 20.02370.00160.02060.02690.02070.0271
Total0.02530.00060.02410.02640.02420.0264
Difference0.00180.0017-0.00160.0051  
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Odds Ratio and Relative Risks
StatisticValue95% Confidence Limits
Odds Ratio0.92900.80401.0735
Relative Risk (Column 1)0.99820.99471.0016
Relative Risk (Column 2)1.07450.93311.2373
\n", "
\n", "
\n", "

Sample Size = 72983

\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data = cars;\n", " TABLES isbadbuy*isonlinesale / RISKDIFF RELRISK OR;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

For the risk difference, SAS provides two tables that compare the conditional row proportions in the first column and the conditional row proportions in the second column. Similarly, for the relative risk, we get a relative risk for the first and the second column. This allows us to pick the one that matters to us depending on which column corresponds to the outcome of interest.

\n", "
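If only one column's risk estimates are of interest, recent SAS/STAT releases allow restricting RISKDIFF to that column. This is a sketch, assuming the COLUMN= riskdiff-option is available in your SAS version:

```sas
/* Sketch: report risk-difference estimates for column 2 only (IsOnlineSale = 1) */
PROC FREQ data = cars;
    TABLES isbadbuy*isonlinesale / RISKDIFF(COLUMN = 2);
RUN;
```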
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fisher's Exact Test\n", "\n", "An alternative way to test for an association between two categorical variables is Fisher's exact test. This is a nonparametric test that makes no assumption other than that we have a random sample. Note, however, that this comes at a price: the more levels our variables have and the more observations we have, the more computing time the test requires. For 2x2 tables, this test is usually very quick, but for 5x5 tables, depending on how much data and what computer you are using, this test may take hours to complete.\n", "\n", "For 2x2 tables, this test is automatically output. For larger tables, if you want this test, then you will need to specify the FISHER option in the TABLES statement.\n", "\n", "
\n", "

Example

\n", "

The following SAS program uses Fisher's exact test to test for an association between a car being a bad buy and buying the car online.

\n", "
" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The FREQ Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "
\n", "
Frequency
\n", "
Percent
\n", "
Row Pct
\n", "
Col Pct
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Table of IsBadBuy by IsOnlineSale
IsBadBuyIsOnlineSale
01Total
0\n", "
\n", "
62375
\n", "
85.47
\n", "
97.45
\n", "
87.68
\n", "
\n", "
\n", "
\n", "
1632
\n", "
2.24
\n", "
2.55
\n", "
88.46
\n", "
\n", "
\n", "
\n", "
64007
\n", "
87.70
\n", "
 
\n", "
 
\n", "
\n", "
1\n", "
\n", "
8763
\n", "
12.01
\n", "
97.63
\n", "
12.32
\n", "
\n", "
\n", "
\n", "
213
\n", "
0.29
\n", "
2.37
\n", "
11.54
\n", "
\n", "
\n", "
\n", "
8976
\n", "
12.30
\n", "
 
\n", "
 
\n", "
\n", "
Total\n", "
\n", "
71138
\n", "
97.47
\n", "
\n", "
\n", "
\n", "
1845
\n", "
2.53
\n", "
\n", "
\n", "
\n", "
72983
\n", "
100.00
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

Statistics for Table of IsBadBuy by IsOnlineSale

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
StatisticDFValueProb
Chi-Square10.99780.3178
Likelihood Ratio Chi-Square11.01540.3136
Continuity Adj. Chi-Square10.92740.3356
Mantel-Haenszel Chi-Square10.99780.3179
Phi Coefficient -0.0037 
Contingency Coefficient 0.0037 
Cramer's V -0.0037 
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Fisher's Exact Test
Cell (1,1) Frequency (F)62375
Left-sided Pr <= F0.1679
Right-sided Pr >= F0.8498
  
Table Probability (P)0.0177
Two-sided Pr <= P0.3324
\n", "
\n", "
\n", "

Sample Size = 72983

\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC FREQ data = cars;\n", " TABLES isbadbuy*isonlinesale / FISHER;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The p-value for Fisher's exact test is 0.3324.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlation\n", "\n", "SAS's CORR procedure can perform correlation analysis, providing both the parametric Pearson correlation and the nonparametric Spearman rank correlation coefficients along with their hypothesis tests. The default correlation output is Pearson's. To request Spearman's rank correlation, add the SPEARMAN option to the PROC CORR statement.\n", "\n", "
\n", "

Example

\n", "

Let's look at some examples of PROC CORR using the Charm City Circulator bus ridership dataset. The following SAS program finds the Pearson correlation, and the corresponding hypothesis test results, for the average daily ridership of the orange and purple bus lines.

\n", "
" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The CORR Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
2 Variables:orangeAverage purpleAverage
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Simple Statistics
VariableNMeanStd DevSumMinimumMaximum
orangeAverage113630331228344567106927
purpleAverage99340171407398881608090
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations
 orangeAveragepurpleAverage
orangeAverage\n", "
\n", "
1.00000
\n", "
 
\n", "
1136
\n", "
\n", "
\n", "
\n", "
0.91954
\n", "
<.0001
\n", "
993
\n", "
\n", "
purpleAverage\n", "
\n", "
0.91954
\n", "
<.0001
\n", "
993
\n", "
\n", "
\n", "
\n", "
1.00000
\n", "
 
\n", "
993
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FILENAME busdata '/folders/myfolders/SAS_Notes/data/Charm_City_Circulator_Ridership.csv';\n", "\n", "PROC IMPORT datafile = busdata out = circ dbms = CSV replace;\n", " getnames = yes;\n", " guessingrows = 1000;\n", "RUN;\n", "\n", "PROC CORR data = circ;\n", " VAR orangeAverage purpleAverage;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

We can also get a correlation matrix for multiple variables at the same time. The following example also uses the NOMISS option to use only complete observations, rather than pairwise-complete observations, when calculating the correlations. Here we get the correlation matrix of the average ridership counts for all four of the orange, purple, banner, and green bus lines.

\n", "
" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The CORR Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
4 Variables:orangeAverage purpleAverage greenAverage bannerAverage
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Simple Statistics
VariableNMeanStd DevSumMinimumMaximum
orangeAverage27038591095104189006927
purpleAverage27045521297122893508090
greenAverage2702090556.0035356421303879
bannerAverage270827.26852436.0487222336304617
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Pearson Correlation Coefficients, N = 270
Prob > |r| under H0: Rho=0
 orangeAveragepurpleAveragegreenAveragebannerAverage
orangeAverage\n", "
\n", "
1.00000
\n", "
 
\n", "
\n", "
\n", "
\n", "
0.90788
\n", "
<.0001
\n", "
\n", "
\n", "
\n", "
0.83958
\n", "
<.0001
\n", "
\n", "
\n", "
\n", "
0.54470
\n", "
<.0001
\n", "
\n", "
purpleAverage\n", "
\n", "
0.90788
\n", "
<.0001
\n", "
\n", "
\n", "
\n", "
1.00000
\n", "
 
\n", "
\n", "
\n", "
\n", "
0.86656
\n", "
<.0001
\n", "
\n", "
\n", "
\n", "
0.52135
\n", "
<.0001
\n", "
\n", "
greenAverage\n", "
\n", "
0.83958
\n", "
<.0001
\n", "
\n", "
\n", "
\n", "
0.86656
\n", "
<.0001
\n", "
\n", "
\n", "
\n", "
1.00000
\n", "
 
\n", "
\n", "
\n", "
\n", "
0.45334
\n", "
<.0001
\n", "
\n", "
bannerAverage\n", "
\n", "
0.54470
\n", "
<.0001
\n", "
\n", "
\n", "
\n", "
0.52135
\n", "
<.0001
\n", "
\n", "
\n", "
\n", "
0.45334
\n", "
<.0001
\n", "
\n", "
\n", "
\n", "
1.00000
\n", "
 
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC CORR data = circ NOMISS;\n", " VAR orangeAverage purpleAverage greenAverage bannerAverage;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

If we don't want all pairwise correlations, but instead only specific pairs, then we can use the WITH statement as in the following example.

\n", "
" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The CORR Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
2 With Variables:greenAverage bannerAverage
2 Variables:orangeAverage purpleAverage
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Simple Statistics
VariableNMeanStd DevSumMinimumMaximum
greenAverage2702090556.0035356421303879
bannerAverage270827.26852436.0487222336304617
orangeAverage27038591095104189006927
purpleAverage27045521297122893508090
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Pearson Correlation Coefficients, N = 270
Prob > |r| under H0: Rho=0
 orangeAveragepurpleAverage
greenAverage\n", "
\n", "
0.83958
\n", "
<.0001
\n", "
\n", "
\n", "
\n", "
0.86656
\n", "
<.0001
\n", "
\n", "
bannerAverage\n", "
\n", "
0.54470
\n", "
<.0001
\n", "
\n", "
\n", "
\n", "
0.52135
\n", "
<.0001
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC CORR data = circ NOMISS;\n", " VAR orangeAverage purpleAverage;\n", " WITH greenAverage bannerAverage;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get Spearman's rank correlation instead of Pearson's correlation, add the SPEARMAN option to the PROC CORR statement.\n", "\n", "
\n", "

Example

\n", "

The following SAS program produces Spearman's rank correlation coefficient, along with the p-value for the hypothesis test that the correlation is 0, for the average daily ridership counts between the orange and purple bus lines.

\n", "
" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The CORR Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
2 Variables:orangeAverage purpleAverage
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Simple Statistics
VariableNMeanStd DevMedianMinimumMaximum
orangeAverage113630331228296806927
purpleAverage99340171407422308090
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Spearman Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations
 orangeAveragepurpleAverage
orangeAverage\n", "
\n", "
1.00000
\n", "
 
\n", "
1136
\n", "
\n", "
\n", "
\n", "
0.91455
\n", "
<.0001
\n", "
993
\n", "
\n", "
purpleAverage\n", "
\n", "
0.91455
\n", "
<.0001
\n", "
993
\n", "
\n", "
\n", "
\n", "
1.00000
\n", "
 
\n", "
993
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC CORR data = circ SPEARMAN;\n", " VAR orangeAverage purpleAverage;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## T-Tests\n", "\n", "T-tests can be performed in SAS with the TTEST procedure, including\n", "\n", "* one sample t-test\n", "* paired t-test\n", "* two sample t-test\n", "\n", "
\n", "

Example

\n", "

In this example, we will test if the average daily ridership on the orange bus line is greater than 3000 using a one sample t-test.

\n", "
" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The TTEST Procedure

\n", "

 

\n", "

Variable: orangeAverage

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
NMeanStd DevStd ErrMinimumMaximum
11363033.21227.636.421706926.5
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Mean95% CL MeanStd Dev95% CL Std Dev
3033.22973.2Infty1227.61179.11280.3
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
DFt ValuePr > t
11350.910.1814
\n", "
\n", "
\n", "
\n", "\"Summary\n", "
\n", "
\n", "
\n", "
\n", "\"Q-Q\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC TTEST data = circ H0 = 3000 SIDE = U;\n", " VAR orangeAverage;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The H0= option specifies the null value for the t-test, and the SIDE= option specifies whether you want a less than (L), greater than (U), or not equal to (2) alternative. The defaults are 0 for the null hypothesis value and a two-sided (2) alternative hypothesis. The output provides some summary statistics, the p-value for the test, a confidence interval, and a histogram and QQ plot to assess the normality assumption.

\n", "

From the output, we find the p-value to be 0.1814. Since we requested a one-sided test, we get a one-sided confidence interval. To get our usual (two-sided) confidence interval, we need to request a two-sided test.

\n", "
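For instance, dropping the SIDE= option (its default is 2) repeats the test above with a two-sided alternative and produces the usual two-sided 95% confidence interval:

```sas
/* Two-sided test of H0: mean = 3000 (SIDE= defaults to 2) */
PROC TTEST data = circ H0 = 3000;
    VAR orangeAverage;
RUN;
```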
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a two sample t-test, we need to have the data formatted in two columns:\n", "\n", "* A data column that contains the quantitative data for both groups\n", "* A grouping variable column that indicates the group for the data value in that row.\n", "\n", "In PROC TTEST, we put the data variable in the VAR statement and the grouping variable in the CLASS statement to get a two sample t-test.\n", "\n", "
\n", "

Example

\n", "

In the following SAS program, we perform a two-sample t-test between the orange and purple bus lines' average ridership counts. We will first have to transform the data to meet the required data format for PROC TTEST.

\n", "
" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The TTEST Procedure

\n", "

 

\n", "

Variable: count

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
groupMethodNMeanStd DevStd ErrMinimumMaximum
orange 11363033.21227.636.421706926.5
purple 9934016.91406.744.638808089.5
Diff (1-2)Pooled -983.81314.157.0906  
Diff (1-2)Satterthwaite -983.8 57.6122  
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
groupMethodMean95% CL MeanStd Dev95% CL Std Dev
orange 3033.22961.73104.61227.61179.11280.3
purple 4016.93929.34104.51406.71347.41471.4
Diff (1-2)Pooled-983.8-1095.7-871.81314.11275.81354.9
Diff (1-2)Satterthwaite-983.8-1096.8-870.8   
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
MethodVariancesDFt ValuePr > |t|
PooledEqual2127-17.23<.0001
SatterthwaiteUnequal1984-17.08<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Equality of Variances
MethodNum DFDen DFF ValuePr > F
Folded F99211351.31<.0001
\n", "
\n", "
\n", "
\n", "\"Summary\n", "
\n", "
\n", "
\n", "
\n", "\"Q-Q\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA circ_sub;\n", " SET circ;\n", " count = orangeAverage;\n", " group = \"orange\";\n", " OUTPUT;\n", " count = purpleAverage;\n", " group = \"purple\";\n", " OUTPUT;\n", " KEEP count group;\n", "RUN;\n", "\n", "PROC TTEST data = circ_sub;\n", " VAR count;\n", " CLASS group;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The SAS output contains summary statistics for each group, confidence intervals for each group mean, confidence intervals for the difference of the two means, hypothesis tests for the difference of the two means, and the F test for equality of variances. The Pooled row corresponds to the two-sample t-test that assumes the population variances are equal between the two groups, while the Satterthwaite row assumes the population variances are unequal.

\n", "

Note that the data here are really matched pairs data, since we have average ridership counts matched by date between the two bus lines. We will explore the paired t-test next.

\n", "
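For readers who want to sanity-check the two rows outside SAS, the same pair of tests can be run in Python with scipy. This is a minimal sketch on simulated stand-in data (the circ dataset is not loaded here); `equal_var=True` corresponds to the Pooled row and `equal_var=False` to the Satterthwaite row.

```python
import numpy as np
from scipy import stats

# Simulated stand-ins roughly matching the group summaries above
rng = np.random.default_rng(0)
orange = rng.normal(loc=3033, scale=1228, size=500)
purple = rng.normal(loc=4017, scale=1407, size=500)

# Pooled two-sample t-test (assumes equal population variances)
t_pooled, p_pooled = stats.ttest_ind(orange, purple, equal_var=True)

# Welch/Satterthwaite t-test (allows unequal population variances)
t_welch, p_welch = stats.ttest_ind(orange, purple, equal_var=False)

print(t_pooled, p_pooled, t_welch, p_welch)
```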
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform a paired t-test, we need to use the PAIRED statement. In this case, SAS assumes the data from each group are in two separate columns where observations in the same row correspond to the matched pairs.\n", "\n", "
\n", "

Example

\n", "

The following SAS program performs a paired t-test comparing the average ridership counts of the orange and purple bus lines.

\n", "
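Outside SAS, the paired test is scipy's `ttest_rel`. The sketch below (simulated paired data, not the real circ counts) also confirms that the paired t-test is exactly a one-sample t-test applied to the paired differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
orange = rng.normal(3000, 500, size=200)
purple = orange + rng.normal(760, 300, size=200)  # paired by date, purple runs higher

# Paired t-test on the matched columns
t_paired, p_paired = stats.ttest_rel(orange, purple)

# Identical to a one-sample t-test on the paired differences
t_diff, p_diff = stats.ttest_1samp(orange - purple, popmean=0.0)

print(t_paired, p_paired)
```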
" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The TTEST Procedure

\n", "

 

\n", "

Difference: orangeAverage - purpleAverage

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
NMeanStd DevStd ErrMinimumMaximum
993-764.1572.318.1613-2998.02504.5
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Mean95% CL MeanStd Dev95% CL Std Dev
-764.1-799.8-728.5572.3548.2598.6
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
DFt ValuePr > |t|
992-42.08<.0001
\n", "
\n", "
\n", "
\n", "\"Summary\n", "
\n", "
\n", "
\n", "
\n", "\"Profiles\n", "
\n", "
\n", "
\n", "
\n", "\"Agreement\n", "
\n", "
\n", "
\n", "
\n", "\"Q-Q\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC TTEST data = circ;\n", " PAIRED orangeAverage*purpleAverage;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nonparametric Alternatives to the T-Tests\n", "\n", "In the case that we have a small sample size and the data cannot be assumed to be from populations that are Normally distributed, we need to use a nonparametric test. For the t-tests we have the following possible alternative tests:\n", "\n", "* The sign test or the Wilcoxon signed rank test as alternative to the one sample t-test or the paired t-test.\n", "* The Wilcoxon rank sum test as an alternative to the two sample t-test.\n", "\n", "To perform a Wilcoxon rank sum test, we use PROC NPAR1WAY.\n", "\n", "
\n", "

Example

\n", "

In the following example, we use PROC NPAR1WAY to perform a Wilcoxon rank sum test comparing median daily ridership counts between the orange and purple bus lines.

\n", "
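The equivalent call in Python is `scipy.stats.ranksums` (or `mannwhitneyu`, which reports the same test in its U-statistic form); a minimal sketch on simulated data standing in for the bus counts:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
orange = rng.normal(3033, 1228, size=120)
purple = rng.normal(4017, 1407, size=120)

# Wilcoxon rank-sum test via the normal approximation (compare the SAS Z statistic)
z_stat, p_value = stats.ranksums(orange, purple)
print(z_stat, p_value)
```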
" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The NPAR1WAY Procedure

\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Wilcoxon Scores (Rank Sums) for Variable count
Classified by Variable group
groupNSum of
Scores
Expected
Under H0
Std Dev
Under H0
Mean
Score
Average scores were used for ties.
orange1136982529.501209840.014150.2115864.90273
purple9931284855.501057545.014150.21151293.91289
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Wilcoxon Two-Sample Test
StatisticZPr > ZPr > |Z|t Approximation
Pr > ZPr > |Z|
Z includes a continuity correction of 0.5.
128485616.0641<.0001<.0001<.0001<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Kruskal-Wallis Test
Chi-SquareDFPr > ChiSq
258.05551<.0001
\n", "
\n", "
\n", "
\n", "\"Box\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC NPAR1WAY data = circ_sub WILCOXON;\n", " VAR count;\n", " CLASS group;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to perform a sign test or Wilcoxon signed rank test, we must first calculate the paired differences between the matched pairs, and then pass the differences to PROC UNIVARIATE.\n", "\n", "
\n", "

Example

\n", "

The following SAS program uses PROC UNIVARIATE to obtain the sign test and Wilcoxon signed rank test p-values for testing for a median difference in ridership counts between the orange and purple bus lines.

\n", "
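As a cross-check outside SAS: the signed rank test is `scipy.stats.wilcoxon` applied to the paired differences, and the sign test reduces to an exact binomial test on how many differences are positive. A sketch on simulated differences:

```python
import numpy as np
from scipy import stats

# Simulated paired differences (orange minus purple), centered well below zero
rng = np.random.default_rng(3)
diff = rng.normal(-760, 570, size=150)

# Wilcoxon signed-rank test on the differences
w_stat, p_signed_rank = stats.wilcoxon(diff)

# Sign test: exact binomial test on the count of positive differences
n_pos = int((diff > 0).sum())
n_nonzero = int((diff != 0).sum())
p_sign = stats.binomtest(n_pos, n_nonzero, p=0.5).pvalue

print(p_signed_rank, p_sign)
```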
" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The UNIVARIATE Procedure

\n", "

Variable: diff

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Moments
N993Sum Weights993
Mean-764.14401Sum Observations-758795
Std Deviation572.297901Variance327524.888
Skewness0.73397459Kurtosis2.88082288
Uncorrected SS904733341Corrected SS324904688
Coeff Variation-74.893985Std Error Mean18.1613249
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Basic Statistical Measures
LocationVariability
Mean-764.144Std Deviation572.29790
Median-788.000Variance327525
Mode0.000Range5503
  Interquartile Range732.50000
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Tests for Location: Mu0=0
TestStatisticp Value
Student's tt-42.0753Pr > |t|<.0001
SignM-424.5Pr >= |M|<.0001
Signed RankS-227467Pr >= |S|<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Quantiles (Definition 5)
LevelQuantile
100% Max2504.5
99%912.5
95%131.5
90%-72.0
75% Q3-426.5
50% Median-788.0
25% Q1-1159.0
10%-1433.0
5%-1596.0
1%-1980.0
0% Min-2998.0
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Extreme Observations
LowestHighest
ValueObsValueObs
-2998.05521360.5171
-2649.59991418.5672
-2365.06351933.5743
-2300.55532484.0674
-2049.51882504.5709
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Missing Values
Missing
Value
CountPercent Of
All ObsMissing Obs
.15313.35100.00
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA circ_diff;\n", " SET circ;\n", " diff = orangeAverage - purpleAverage;\n", "RUN;\n", "\n", "PROC UNIVARIATE data = circ_diff;\n", " VAR diff;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

PROC UNIVARIATE produces a large amount of default output. The p-values for the sign test and signed rank test appear in the Tests for Location table.

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## One-way ANOVA and the Kruskal-Wallis Test\n", "\n", "When we wish to compare means between more than two independent groups, we can perform a one-way ANOVA or in the small sample case a Kruskal-Wallis test. A on-way ANOVA can be performed in SAS by using PROC GLM.\n", "\n", "
\n", "

Example

\n", "

The following SAS program performs a one-way ANOVA to test for equality of mean ridership counts between the orange, purple, and green bus lines.

\n", "
" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The GLM Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Class Level Information
ClassLevelsValues
group3green orange purple
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Number of Observations Read3438
Number of Observations Used2614
\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The GLM Procedure

\n", "

 

\n", "

Dependent Variable: count

\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFSum of SquaresMean SquareF ValuePr > F
Model21442596863721298432490.02<.0001
Error261138433714881471992  
Corrected Total26135285968351   
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
R-SquareCoeff VarRoot MSEcount Mean
0.27291137.827401213.2573207.349
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFType I SSMean SquareF ValuePr > F
group21442596863721298432490.02<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFType III SSMean SquareF ValuePr > F
group21442596863721298432490.02<.0001
\n", "
\n", "
\n", "
\n", "\"Fit\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA circ_aov;\n", " SET circ;\n", " count = orangeAverage;\n", " group = \"orange\";\n", " OUTPUT;\n", " count = purpleAverage;\n", " group = \"purple\";\n", " OUTPUT;\n", " count = greenAverage;\n", " group = \"green\";\n", " OUTPUT;\n", " KEEP count group;\n", "RUN;\n", "\n", "PROC GLM data = circ_aov;\n", " CLASS group;\n", " MODEL count = group;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

If instead we wanted to perform a Kruskal-Wallis test, we would use PROC NPAR1WAY.

\n", "
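For a cross-check outside SAS, scipy exposes both tests directly as `f_oneway` (one-way ANOVA) and `kruskal` (Kruskal-Wallis); a minimal sketch with three simulated groups standing in for the three bus lines:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
orange = rng.normal(3000, 800, size=100)
purple = rng.normal(4000, 800, size=100)
green = rng.normal(2000, 800, size=100)

# One-way ANOVA F test for equality of the three group means
f_stat, p_anova = stats.f_oneway(orange, purple, green)

# Kruskal-Wallis rank-based alternative
h_stat, p_kw = stats.kruskal(orange, purple, green)

print(p_anova, p_kw)
```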
" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The NPAR1WAY Procedure

\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Wilcoxon Scores (Rank Sums) for Variable count
Classified by Variable group
groupNSum of
Scores
Expected
Under H0
Std Dev
Under H0
Mean
Score
Average scores were used for ties.
orange11361395578.001485320.0019128.08821228.50176
purple9931720911.501298347.5018728.85881733.04280
green485301315.50634137.5015000.4361621.26907
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Kruskal-Wallis Test
Chi-SquareDFPr > ChiSq
729.06682<.0001
\n", "
\n", "
\n", "
\n", "\"Box\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC NPAR1WAY data = circ_aov WILCOXON;\n", " CLASS group;\n", " VAR count;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear Regression\n", "\n", "In SAS, there are two procedures that can be used to fit a linear regression model:\n", "\n", "* PROC REG\n", "* PROc GLM\n", "\n", "PROC REG will give you most of the standard output, but the model statement requires all variables to be calculated in a prior DATA step, such as interaction terms. PROC GLM, however, allows you to calculate interaction terms on the fly in the MODEL statement. Generally, I prefer PROC GLM for this reason, but either PROC will work and can be used to get all the standard regression output.\n", "\n", "Another reason I prefer PROC GLM over PROC REG is that PROC REG does not have a CLASS statement, so you must do all the dummy coding for categorical variables manually in a DATA step when using PROC REG.\n", "\n", "Let's look at a few examples using both PROCs.\n", "\n", "
\n", "

Example

\n", "

The first example fits a simple linear regression model with a single binary predictor. Note that in this case, the t-test for the slope is equivalent to the pooled two-sample t-test.

\n", "
" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The REG Procedure

\n", "

Model: MODEL1

\n", "

Dependent Variable: count

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Number of Observations Read2292
Number of Observations Used2129
Number of Observations with Missing Values163
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Analysis of Variance
SourceDFSum of
Squares
Mean
Square
F ValuePr > F
Model1512793031512793031296.93<.0001
Error212736732325591726955  
Corrected Total21284186025589   
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Root MSE1314.13647R-Square0.1225
Dependent Mean3492.00892Adj R-Sq0.1221
Coeff Var37.63268  
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Parameter Estimates
VariableDFParameter
Estimate
Standard
Error
t ValuePr > |t|
Intercept14016.9345441.7028696.32<.0001
grp_bin1-983.7734557.09059-17.23<.0001
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The REG Procedure

\n", "

Model: MODEL1

\n", "

Dependent Variable: count

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"Panel\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"Scatter\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\"Scatterplot\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA circ_sub;\n", " SET circ_sub;\n", " IF group = \"orange\" THEN grp_bin = 1;\n", " ELSE grp_bin = 0;\n", "RUN;\n", "\n", "PROC REG data = circ_sub;\n", " MODEL count = grp_bin;\n", "RUN;" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The GLM Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Class Level Information
ClassLevelsValues
group2orange purple
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Number of Observations Read2292
Number of Observations Used2129
\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The GLM Procedure

\n", "

 

\n", "

Dependent Variable: count

\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFSum of SquaresMean SquareF ValuePr > F
Model1512793031512793031296.93<.0001
Error212736732325591726955  
Corrected Total21284186025589   
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
R-SquareCoeff VarRoot MSEcount Mean
0.12250137.632681314.1363492.009
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFType I SSMean SquareF ValuePr > F
group1512793030.6512793030.6296.93<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFType III SSMean SquareF ValuePr > F
group1512793030.6512793030.6296.93<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ParameterEstimate Standard
Error
t ValuePr > |t|
Intercept4016.934542B41.7028603296.32<.0001
group orange-983.773450B57.09058700-17.23<.0001
group purple0.000000B...
\n", "
\n", "
\n", "

Note:The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

\n", "
\n", "
\n", "
\n", "\"Panel\n", "
\n", "
\n", "
\n", "
\n", "\"Fit\n", "
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC GLM data = circ_sub PLOTS=DIAGNOSTICS;\n", " CLASS group(ref = 'purple');\n", " MODEL count = group / solution;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note that we get the same output from both procedures, but with PROC REG we had to manually code the group variable as a 0/1 dummy variable, whereas in PROC GLM we could use the CLASS statement with the ref= option to select the reference category. We also need to request the residual diagnostic plots in PROC GLM, as they are not part of the default output.

\n", "
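The dummy-coding point is easy to verify with ordinary least squares in plain numpy (hypothetical data, not the circ counts): with a single 0/1 dummy, the intercept equals the reference-group mean and the slope equals the difference in group means, which is what both procedures reported above.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
grp_bin = rng.integers(0, 2, size=n)  # 1 = "orange", 0 = "purple" (reference)
count = 4000.0 - 980.0 * grp_bin + rng.normal(0, 300, size=n)

# Intercept + dummy design matrix: the coding PROC REG needs prepared up front
X = np.column_stack([np.ones(n), grp_bin])
beta, *_ = np.linalg.lstsq(X, count, rcond=None)

ref_mean = count[grp_bin == 0].mean()
mean_diff = count[grp_bin == 1].mean() - ref_mean
print(beta, ref_mean, mean_diff)
```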
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have seen how to get the same output from both PROC GLM and PROC REG, we will use PROC GLM for all the remaining examples to avoid needing to use a DATA step to calculate dummy variables and interaction terms.\n", "\n", "
\n", "

Example

\n", "

In the following SAS program, we will fit linear regression models with more than one predictor using the Kaggle car auction dataset. First, let's fit a simple linear regression model and build on to it by adding more variables.

\n", "
" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The GLM Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Number of Observations Read72983
Number of Observations Used72983
\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The GLM Procedure

\n", "

 

\n", "

Dependent Variable: VehOdo

\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFSum of SquaresMean SquareF ValuePr > F
Model11.5863773E121.5863773E128313.88<.0001
Error729811.3925561E13190810767.21  
Corrected Total729821.5511938E13   
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
R-SquareCoeff VarRoot MSEVehOdo Mean
0.10226819.3194813813.4371500.00
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFType I SSMean SquareF ValuePr > F
VehicleAge11.5863773E121.5863773E128313.88<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFType III SSMean SquareF ValuePr > F
VehicleAge11.5863773E121.5863773E128313.88<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ParameterEstimateStandard
Error
t ValuePr > |t|
Intercept60127.24037134.8017988446.04<.0001
VehicleAge2722.9411729.863207891.18<.0001
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC GLM data = cars;\n", " MODEL VehOdo = VehicleAge;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Now let's add another variable, in this case the binary variable IsBadBuy. This variable is already a 0/1 dummy variable, so we don't need to put it in a CLASS statement (but we could if we wanted to and still get the same output by choosing the matching reference category).

\n", "
" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The GLM Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Number of Observations Read72983
Number of Observations Used72983
\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The GLM Procedure

\n", "

 

\n", "

Dependent Variable: VehOdo

\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFSum of SquaresMean SquareF ValuePr > F
Model21.5998928E127999463793474196.37<.0001
Error729801.3912045E13190628187.46  
Corrected Total729821.5511938E13   
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
R-SquareCoeff VarRoot MSEVehOdo Mean
0.10313919.3102313806.8271500.00
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFType I SSMean SquareF ValuePr > F
VehicleAge11.5863773E121.5863773E128321.84<.0001
IsBadBuy1135154811181351548111870.90<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFType III SSMean SquareF ValuePr > F
VehicleAge11.4941599E121.4941599E127838.08<.0001
IsBadBuy1135154811181351548111870.90<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ParameterEstimateStandard
Error
t ValuePr > |t|
Intercept60141.77139134.7483412446.33<.0001
VehicleAge2680.3275830.274912588.53<.0001
IsBadBuy1329.00242157.83509518.42<.0001
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC GLM data = cars;\n", " MODEL VehOdo = VehicleAge IsBadBuy;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Note that when adding multiple predictors in the MODEL statement, they are separated by a space instead of a + symbol. To add an interaction, we can write the interaction term with * while still listing the main effect terms, or we can use the bar shorthand | to generate both main effects and the interaction at the same time.

\n", "
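For two numeric variables, an interaction like VehicleAge*IsBadBuy is just a product column in the design matrix. The numpy sketch below (hypothetical coefficients, simulated data) builds that column by hand and recovers the main effects and the interaction with least squares.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20000
vehicle_age = rng.uniform(1, 10, size=n)
is_bad_buy = rng.integers(0, 2, size=n)

# Hypothetical true model with an interaction coefficient of 50
veh_odo = (60000.0 + 2700.0 * vehicle_age + 1300.0 * is_bad_buy
           + 50.0 * vehicle_age * is_bad_buy
           + rng.normal(0, 500, size=n))

# VehicleAge*IsBadBuy in the MODEL statement corresponds to this product column
X = np.column_stack([np.ones(n), vehicle_age, is_bad_buy,
                     vehicle_age * is_bad_buy])
beta, *_ = np.linalg.lstsq(X, veh_odo, rcond=None)
print(beta)
```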
" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The GLM Procedure

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Number of Observations Read72983
Number of Observations Used72983
\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The GLM Procedure

\n", "

 

\n", "

Dependent Variable: VehOdo

\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFSum of SquaresMean SquareF ValuePr > F
Model31.5998931E125332977021092797.54<.0001
Error729791.3912045E13190630794.79  
Corrected Total729821.5511938E13   
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
R-SquareCoeff VarRoot MSEVehOdo Mean
0.10313919.3103713806.9171500.00
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFType I SSMean SquareF ValuePr > F
VehicleAge11.5863773E121.5863773E128321.73<.0001
IsBadBuy1135154811181351548111870.90<.0001
VehicleAge*IsBadBuy1347632.54102347632.541020.000.9659
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
SourceDFType III SSMean SquareF ValuePr > F
VehicleAge11.2937202E121.2937202E126786.52<.0001
IsBadBuy1166239836716623983678.720.0031
VehicleAge*IsBadBuy1347632.54068347632.540680.000.9659
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ParameterEstimateStandard
Error
t ValuePr > |t|
Intercept60139.69756143.2332682419.87<.0001
VehicleAge2680.8371932.542191482.38<.0001
IsBadBuy1347.28218456.23388972.950.0031
VehicleAge*IsBadBuy-3.7895388.7403782-0.040.9659
\n", "
\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC GLM data = cars;\n", " MODEL VehOdo = VehicleAge IsBadBuy VehicleAge*IsBadBuy;\n", " *MODEL VehOdo = VehicleAge|IsBadBuy;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get the residuals and predicted values, use the OUTPUT statement. We can even get predicted values for new data by adding rows to the dataset and setting the response variable to missing.\n", "\n", "
\n", "

Example

\n", "

In the following example, we will extract the residuals and the predicted values using the OUTPUT statement. We will also append two new observations to the dataset to obtain their predicted values.

\n", "
" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "
\n", "\"The\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ObsIsBadBuyVehicleAgeVehOdoresidfitted
7298416..77549.27
7298505..73543.88
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATA new;\n", " INPUT VehOdo VehicleAge IsBadBuy;\n", " DATALINES;\n", ". 6 1\n", ". 5 0\n", ";\n", "RUN;\n", "\n", "DATA cars_new;\n", " SET cars(keep = VehOdo VehicleAge IsBadBuy) \n", " new;\n", "RUN;\n", "\n", "PROC GLM data = cars_new noprint;\n", " MODEL VehOdo = VehicleAge IsBadBuy VehicleAge*IsBadBuy;\n", " OUTPUT out=res_pred residuals = resid predicted = fitted;\n", "RUN;\n", "\n", "PROC SGPLOT data = res_pred;\n", " HISTOGRAM resid;\n", "RUN;\n", "\n", "PROC PRINT data = res_pred (FIRSTOBS=72984);\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The missing values for the response, VehOdo, in the new observations keep these rows from being used to fit the model, but since we have values for all the predictors in the model, a predicted value is still calculated in the OUTPUT dataset. Recall that predicted values are found by plugging the predictor values into the fitted regression equation. For example, for the first new data value:

\n", " $$\\widehat{y}=60139.7 + 1347.28 + 2680.84*6 -3.79*6=77549.28$$\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic Regression\n", "\n", "Generalized Linear Models (GLMs) allow for fitting regressions for non-continuous/normal outcomes. The glm has similar syntax to the lm command. Logistic regression is one example.\n", "\n", "In a (simple) logistic regression model, we have a binary response Y and a predictor x. It is assumed that given the predictor, $Y\\sim\\text{Bernoulli(p(x))}$ where $p(x)=P(Y=1|x)$ and\n", "\n", "$$\\log\\left(\\dfrac{P(Y=1|x)}{1-P(Y=1|x)}\\right)=\\beta_0+\\beta_1x$$\n", "\n", "That is the log-odds of success changes linearly with x. It then follows that $e^{\\beta_1}$ is the odds ratio of success for a one unit increase in x.\n", "\n", "In SAS, there are two procedures that can be used to fit a logistic regression model\n", "\n", "* PROC LOGISTIC\n", "* PROC GENMOD\n", "\n", "Generally, I use PROC LOGISTIC as it is made specifically for logistic regression and provides many extras that PROC GENMOD does not." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Example

\n", "

The following example uses PROC LOGISTIC to fit a logistic regression model with IsBadBuy as the binary response and VehOdo and VehicleAge as predictors.

\n", "
" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The LOGISTIC Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Model Information
Data SetWORK.CARS
Response VariableIsBadBuy
Number of Response Levels2
Modelbinary logit
Optimization TechniqueFisher's scoring
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Number of Observations Read72983
Number of Observations Used72983
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Response Profile
Ordered
Value
IsBadBuyTotal
Frequency
1064007
218976
\n", "
\n", "

Probability modeled is IsBadBuy='1'.

\n", "
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Model Fit Statistics
CriterionIntercept OnlyIntercept and Covariates
AIC54423.30752352.263
SC54432.50552379.857
-2 Log L54421.30752346.263
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Testing Global Null Hypothesis: BETA=0
TestChi-SquareDFPr > ChiSq
Likelihood Ratio2075.04432<.0001
Score2108.27912<.0001
Wald2025.37792<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Analysis of Maximum Likelihood Estimates
ParameterDFEstimateStandard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept1-3.77820.06383505.9533<.0001
VehOdo18.341E-68.526E-795.6991<.0001
VehicleAge10.26810.006771567.3289<.0001
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Association of Predicted Probabilities and Observed Responses
Percent Concordant64.4Somers' D0.288
Percent Discordant35.6Gamma0.288
Percent Tied0.0Tau-a0.062
Pairs574526832c0.644
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Parameter Estimates and Wald Confidence Intervals
ParameterEstimate95% Confidence Limits
Intercept-3.7782-3.9033-3.6531
VehOdo8.341E-66.67E-60.000010
VehicleAge0.26810.25480.2814
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Odds Ratio Estimates and Wald Confidence Intervals
EffectUnitEstimate95% Confidence Limits
VehOdo1.00001.0001.0001.000
VehicleAge1.00001.3071.2901.325
\n", "
\n", "
\n", "
\n", "\"Plot\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC LOGISTIC data = cars;\n", " MODEL isbadbuy(event='1') = vehodo vehicleage / CLPARM=WALD CLODDS=WALD;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

The CLPARM= and CLODDS= options request confidence intervals for the parameter estimates and corresponding odds ratios.
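As noted earlier, PROC GENMOD can also fit this model. A sketch of the equivalent call, requesting the binomial distribution with the logit link (GENMOD reports the same parameter estimates but fewer logistic-specific extras, such as the odds ratio table):

```sas
/* Equivalent logistic regression fit using PROC GENMOD. */
PROC GENMOD data = cars;
    MODEL isbadbuy(event='1') = vehodo vehicleage / dist = binomial link = logit;
RUN;
```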

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Poisson Regression\n", "\n", "Poisson regression is used for count responses. This model assumes that (in the case of a single predictor) that $Y|x\\sim\\text{Poisson}(\\lambda(x))$, where $\\lambda(x)=E[Y|x]$, and for the case of a single predictor\n", "\n", "$$\\log(E[Y|x])=\\beta_0+\\beta_1x.$$\n", "\n", "Then $e^{\\beta_1}$ represents the rate ratio for a one unit increase in x. To fit such a model, we will use PROC GENMOD.\n", "\n", "
\n", "

Example

\n", "

The following SAS program fits a Poisson regression model to the daily ridership count on the orange bus line, with day of the week as the predictor.

\n", "
" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "SAS Output\n", "\n", "\n", "\n", "
\n", "
\n", "

The SAS System

\n", "
\n", "
\n", "

The GENMOD Procedure

\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Model Information
Data SetWORK.CIRC
DistributionPoisson
Link FunctionLog
Dependent VariableorangeBoardings
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Number of Observations Read1146
Number of Observations Used1079
Missing Values67
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Class Level Information
ClassLevelsValues
day7Monday Saturday Sunday Thursday Tuesday Wednesday Friday
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Criteria For Assessing Goodness Of Fit
CriterionDFValueValue/DF
Deviance1072465776.5310434.4930
Scaled Deviance1072465776.5310434.4930
Pearson Chi-Square1072425148.2927396.5936
Scaled Pearson X21072425148.2927396.5936
Log Likelihood 23002386.843 
Full Log Likelihood -238139.4730 
AIC (smaller is better) 476292.9459 
AICC (smaller is better) 476293.0505 
BIC (smaller is better) 476327.8324 
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Algorithm converged.
\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Analysis Of Maximum Likelihood Parameter Estimates
Parameter DFEstimateStandard
Error
Wald 95% Confidence LimitsWald Chi-SquarePr > ChiSq
Intercept 18.22790.00138.22548.23053.954E7<.0001
dayMonday1-0.19640.0019-0.2002-0.192610163.2<.0001
daySaturday1-0.26910.0020-0.2730-0.265218119.2<.0001
daySunday1-0.68970.0023-0.6942-0.685290824.2<.0001
dayThursday1-0.15230.0019-0.1561-0.14856213.44<.0001
dayTuesday1-0.17190.0019-0.1757-0.16817860.04<.0001
dayWednesday1-0.13960.0019-0.1434-0.13595260.55<.0001
dayFriday00.00000.00000.00000.0000..
Scale 01.00000.00001.00001.0000  
\n", "
\n", "

Note:The scale parameter was held fixed.

\n", "
\n", "
\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PROC GENMOD data = circ;\n", " CLASS day(ref='Friday');\n", " MODEL orangeBoardings = day / dist = Poisson link = log;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

In the MODEL statement, the dist=Poisson option specifies that the response is assumed to follow a Poisson distribution, and the link=log option specifies that we are using the log link.
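Since exponentiated coefficients are rate ratios here, it can be convenient to have SAS exponentiate the day-to-day comparisons directly. One hedged sketch uses an LSMEANS statement (supported by PROC GENMOD in recent SAS/STAT releases) with the DIFF and EXP options:

```sas
/* Sketch: pairwise day comparisons reported as rate ratios via EXP. */
PROC GENMOD data = circ;
    CLASS day(ref='Friday');
    MODEL orangeBoardings = day / dist = Poisson link = log;
    LSMEANS day / diff exp cl;
RUN;
```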

\n", "

If an offset is desired in a Poisson regression model, we can use the OFFSET= option in the MODEL statement. Note that when using this option, we must take the log() of the offset variable ourselves; SAS will not log-transform it for us.
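As a sketch of the offset syntax, suppose each daily count came with an exposure measure, say hours of service. The variables hours and log_hours below are hypothetical and are not in the circ data:

```sas
/* Hypothetical offset example: model boardings per service hour.
   The offset variable must be supplied on the log scale. */
DATA circ_off;
    SET circ;
    log_hours = log(hours);  /* hours is a hypothetical exposure variable */
RUN;

PROC GENMOD data = circ_off;
    CLASS day(ref='Friday');
    MODEL orangeBoardings = day / dist = Poisson link = log offset = log_hours;
RUN;
```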

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises\n", "\n", "These exercises will use the child mortality dataset, indicatordeadkids35.csv, and the Kaggle car auction dataset, kaggleCarAuction.csv. Modify the following code to read in this dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "FILENAME cardata '/folders/myfolders/SAS_Notes/data/kaggleCarAuction.csv';\n", "\n", "PROC IMPORT datafile = cardata out = cars dbms = CSV replace;\n", " getnames = yes;\n", " guessingrows = 1000;\n", "RUN;\n", "\n", "FILENAME mortdat '/folders/myfolders/SAS_Notes/data/indicatordeadkids35.csv';\n", "\n", "PROC IMPORT datafile = mortdat out = mort dbms = CSV replace;\n", " getnames = yes;\n", " guessingrows = 500;\n", "RUN;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Compute the correlation between the `1980`, `1990`, `2000`, and `2010` mortality data. Just display the result to the screen. Then compute using the NOMMISS option. (Note: The column names are numbers, which are invalid standard SAS names, so to refer to the variable 1980 in your code use '1980'n.)\n", "2. \n", " a. Compute the correlation between the `Myanmar`, `China`, and `United States` mortality data. Store this correlation matrix in an object called `country_cor` using ODS OUTPUT.\n", " b. Extract the Myanmar-US correlation from the correlation matrix.\n", "3. Is there a difference between mortality information from `1990` and `2000`? Run a paired t-test and a Wilcoxon signed rank test to assess this. Hint: to extract the column of information for `1990`, use '1990'n.\n", "4. Using the cars dataset, fit a linear regression model with vehicle cost (`VehBCost`) as the outcome and vehicle age (`VehicleAge`) and whether it's an online sale (`IsOnlineSale`) as predictors as well as their interaction.\n", "5. 
Create a variable called `expensive` in the `cars` data that indicates if the \n", "vehicle cost is over `$10,000`. Use a chi-squared test to assess if there is a\n", "relationship between a car being expensive and it being labeled as a \"bad buy\" (`IsBadBuy`).\n", "6. Fit a logistic regression model where the outcome is \"bad buy\" status and predictors are the `expensive` status and vehicle age (`VehicleAge`). Request confidence intervals for the odds ratios." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "SAS", "language": "sas", "name": "sas" }, "language_info": { "codemirror_mode": "sas", "file_extension": ".sas", "mimetype": "text/x-sas", "name": "sas" } }, "nbformat": 4, "nbformat_minor": 2 }