Intro to PROC SQL

A relational database consists of data stored in tables where the columns represent varaibles and the rows represent observations. SQL (structured query language) is a language for navigating and manipulating the database and tables. Today, we will introduce some basic SQL and compare it with DATA and PROC step code to perform the same actions.

The most used SQL statement is the SELECT statement, so we will start with SELECT and its substatements. The basic syntax is as follows:

PROC SQL;
SELECT 
 FROM 
 WHERE 
 GROUP BY columns
 HAVING expressions
 ORDER BY columns;
QUIT;

Not all of these statements need to be used each time, but they must be used in this order.

In [1]:
LIBNAME mydata "H:\BiostatCourses\PublicHealthComputing\Lectures\Week9MultipleReg\SAS";

PROC CONTENTS DATA=mydata.hers;
RUN;
SAS Connection established. Subprocess id is 5696

Out[1]:
SAS Output

SAS Output

The SAS System

The CONTENTS Procedure

The CONTENTS Procedure

MYDATA.HERS

Attributes

Data Set Name MYDATA.HERS Observations 2763
Member Type DATA Variables 37
Engine V9 Indexes 0
Created 10/29/2017 18:06:34 Observation Length 280
Last Modified 10/29/2017 18:06:34 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation WINDOWS_64    
Encoding wlatin1 Western (Windows)    

Engine/Host Information

Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 12
First Data Page 1
Max Obs per Page 233
Obs in First Data Page 216
Number of Data Set Repairs 0
ExtendObsCounter YES
Filename H:\BiostatCourses\PublicHealthComputing\Lectures\Week9MultipleReg\SAS\hers.sas7bdat
Release Created 9.0401M3
Host Created X64_7PRO

Variables

Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
18 BMI Num 8 BEST12. BEST32.
23 BMI1 Num 8 BEST12. BEST32.
36 DBP Num 8 BEST12. BEST32.
29 HDL Num 8 BEST12. BEST32.
33 HDL1 Num 8 BEST12. BEST32.
1 HT Char 15 $15. $15.
28 LDL Num 8 BEST12. BEST32.
32 LDL1 Num 8 BEST12. BEST32.
35 SBP Num 8 BEST12. BEST32.
30 TG Num 8 BEST12. BEST32.
34 TG1 Num 8 BEST12. BEST32.
20 WHR Num 8 BEST12. BEST32.
25 WHR1 Num 8 BEST12. BEST32.
2 age Num 8 BEST12. BEST32.
37 age10 Num 8 BEST12. BEST32.
14 diabetes Char 3 $3. $3.
15 dmpills Char 3 $3. $3.
6 drinkany Char 3 $3. $3.
7 exercise Char 3 $3. $3.
9 globrat Char 9 $9. $9.
21 glucose Num 8 BEST12. BEST32.
26 glucose1 Num 8 BEST12. BEST32.
12 htnmeds Char 3 $3. $3.
16 insulin Char 3 $3. $3.
11 medcond Num 8 BEST12. BEST32.
4 nonwhite Char 3 $3. $3.
8 physact Char 20 $20. $20.
10 poorfair Char 3 $3. $3.
3 raceth Char 16 $16. $16.
5 smoking Char 3 $3. $3.
13 statins Char 3 $3. $3.
27 tchol Num 8 BEST12. BEST32.
31 tchol1 Num 8 BEST12. BEST32.
19 waist Num 8 BEST12. BEST32.
24 waist1 Num 8 BEST12. BEST32.
17 weight Num 8 BEST12. BEST32.
22 weight1 Num 8 BEST12. BEST32.
In [2]:
/*
PROC PRINT DATA=mydata.hers (OBS=5);
VAR BMI HT age smoking;
RUN;
*/

PROC SQL OUTOBS=5;
SELECT BMI, HT, age, smoking
FROM mydata.hers
QUIT;
Out[2]:
SAS Output

SAS Output

The SAS System

The SQL Procedure

Query Results

BMI HT age smoking
23.69000053 placebo 70 no
28.62000084 placebo 62 no
42.50999832 hormone therapy 69 no
24.38999939 placebo 64 yes
21.89999962 placebo 65 no

The SELECT statment chooses which columns (varaibles) you want from the table(s) listed in the FROM statement. The OUTOBS=5 limits the number of rows produced in the created view to 5 rows. We can also add labels and apply formats.

In [4]:
/*
DATA mydata.hers;
SET mydata.hers;
LABEL BMI="Body Mass Index"
      HT="Hormone Therapy"
      age="Age (yrs)
      smoking="Smoking Status";
FORMAT BMI 7.2;
RUN;

PROC PRINT DATA=mydata.hers (OBS=5);
VAR BMI HT age smoking;
RUN;
*/

PROC SQL OUTOBS=5;
SELECT BMI LABEL="Body Mass Index" FORMAT=7.2,
       HT LABEL="Hormone Therapy",
       age LABEL="Age (yrs)",
       smoking LABEL="Smoking Status"
 FROM mydata.hers;
QUIT;
Out[4]:
SAS Output

SAS Output

The SAS System

The SQL Procedure

Query Results

Body Mass Index Hormone Therapy Age (yrs) Smoking Status
23.69 placebo 70 no
28.62 placebo 62 no
42.51 hormone therapy 69 no
24.39 placebo 64 yes
21.90 placebo 65 no

We can also create new variables.

In [5]:
/*
DATA mydata.hers;
SET mydata.hers;
IF BMI <= 18.5 THEN BMICat='Low';
ELSE IF 18.5 < BMI <=24.9 THEN BMICat='Normal';
ELSE IF 24.9 < BMI <= 29.9 THEN BMICat='Overweight';
ELSE BMICat='Obese';
age10=age/10;
RUN;

PROC PRINT DATA=mydata.hers (OBS=5);
VAR BMI BMICat age age10;
RUN;
*/

PROC SQL OUTOBS=5;
SELECT BMI,
    CASE
     WHEN BMI <= 18.5 THEN 'Low'
     WHEN 18.5 < BMI <=24.9 THEN 'Normal'
     WHEN 24.9 < BMI <= 29.9 THEN 'Overweight'
     ELSE 'Obese'
    END AS BMICat,
    age,
    age/10 AS age10
 FROM mydata.hers;
QUIT;
Out[5]:
SAS Output

SAS Output

The SAS System

The SQL Procedure

Query Results

BMI BMICat age age10
23.69000053 Normal 70 7
28.62000084 Overweight 62 6.2
42.50999832 Obese 69 6.9
24.38999939 Normal 64 6.4
21.89999962 Normal 65 6.5

To subset data, we can use the WHERE clause.

In [6]:
/*
PROC PRINT DATA=mydata.hers (OBS=5);
VAR BMI age HT;
WHERE smoking="yes" AND drinkany="yes";
RUN;
*/

PROC SQL OUTOBS=5;
SELECT BMI, AGE, HT
FROM mydata.hers
WHERE smoking="yes" AND drinkany="yes";
QUIT;
Out[6]:
SAS Output

SAS Output

The SAS System

The SQL Procedure

Query Results

BMI age HT
24.38999939 64 placebo
45.68000031 62 hormone therapy
20.47999954 58 placebo
21.51000023 65 placebo
27.59000015 52 placebo

PROC SQL can also sort data using the ORDER by clause.

In [7]:
/*
PROC SORT DATA=mydata.hers;
VAR smoking drinkany DBP desc;
RUN;

PROC PRINT DATA=mydata.hers (OBS=10);
VAR smoking drinkany dbp;
RUN;
*/

PROC SQL OUTOBS=10;
SELECT SMOKING, DRINKANY, DBP
 FROM mydata.hers
 ORDER BY smoking, drinkany, DBP desc;
QUIT;
Out[7]:
SAS Output

SAS Output

The SAS System

The SQL Procedure

Query Results

smoking drinkany DBP
no   81
no   80
no no 100
no no 100
no no 100
no no 100
no no 100
no no 100
no no 100
no no 99

Creating Tables

In [17]:
/*
DATA Site1;
INPUT PatientId SBP DBP SEX $;
DATALINES;
100 117 75 Male
101 118 80 Female
102 110 71 Female
;
RUN;

DATA Site2;
INPUT PatientId SBP DBP;
DATALINES;
103 125 75
104 106 79
105 127 82
;
RUN;
*/

PROC SQL;
CREATE TABLE Site1
    (PatientId num,
     SBP num,
     DBP num,
     Sex CHAR(7));
INSERT INTO SITE1
    values(100,117,75,"Male")
    values(101,118, 80,"Female")
    values(102,110,71,"Female");
CREATE TABLE Site2
    (PatientId num,
     SBP num,
     DBP num);
INSERT INTO SITE2
    SET PatientId=103, SBP=125, DBP=75
    SET PatientId=104, SBP=106, DBP=79
    SET PatientId=105, SBP=127, DBP=82;   
SELECT * FROM SITE1;
SELECT * FROM SITE2;
QUIT;
Out[17]:
SAS Output

SAS Output

The SAS System

The SQL Procedure

Query Results

PatientId SBP DBP Sex
100 117 75 Male
101 118 80 Female
102 110 71 Female

The SAS System

Query Results

PatientId SBP DBP
103 125 75
104 106 79
105 127 82

The following creates a new table from Site1 and Site2 by stacking them.

In [28]:
/*
DATA MasterSite;
SET Site1 Site2;
RUN;

PROC PRINT DATA = MasterSite;
RUN;
*/

PROC SQL;
CREATE TABLE MasterSite AS
SELECT COALESCE(Site1.PatientId,Site2.PatientId) AS PatientId, 
       COALESCE(Site1.SBP,Site2.SBP) AS SBP,
       COALESCE(Site1.DBP,Site2.DBP) AS DBP,
       SEX
FROM Site1 FULL JOIN Site2
ON Site1.PatientId=Site2.PatientId;

SELECT * FROM MASTERSITE;
QUIT;
Out[28]:
SAS Output

SAS Output

The SAS System

The SQL Procedure

Query Results

PatientId SBP DBP Sex
100 117 75 Male
101 118 80 Female
102 110 71 Female
103 125 75  
104 106 79  
105 127 82  
In [33]:
DATA Site3;
INPUT PatientId SBP DBP Visit MMDDYY8.;
DATALINES;
100 117 75 12/15/16
101 118 80 12/10/16
102 110 71 12/3/16
;
RUN;

DATA Site4;
INPUT PatientId Weight Height;
DATALINES;
100 125 60
101 140 67
105 180 70
;
RUN;

/*
PROC SORT DATA = Site3;
BY PatientId;
RUN;

PROC SORT DATA = Site4;
BY PatientId;
RUN;


DATA Master2;
MERGE Site3 Site4;
BY PatientId;
RUN;

PROC PRINT DATA = Master2;
TITLE2 "Merged Dataset";
RUN;
*/

PROC SQL;
SELECT *
FROM Site3, Site4
WHERE Site3.PatientId=Site4.PatientId;

SELECT COALESCE(Site3.PatientId,Site4.PatientId) AS PatientId,
       Site3.SBP,Site3.DBP,Site3.Visit FORMAT=MMDDYY8.,
       SITE4.Weight, Site4.Height
FROM SITE3 FULL JOIN SITE4
ON Site3.PatientId=Site4.PatientId;
QUIT;
Out[33]:
SAS Output

SAS Output

The SAS System

The SQL Procedure

Query Results

PatientId SBP DBP Visit PatientId Weight Height
100 117 75 20803 100 125 60
101 118 80 20798 101 140 67

The SAS System

Query Results

PatientId SBP DBP Visit Weight Height
100 117 75 12/15/16 125 60
101 118 80 12/10/16 140 67
102 110 71 12/03/16 . .
105 . . . 180 70

Aggregating Data

In [35]:
DATA PATIENTS; 
INFILE "H:\BiostatCourses\PublicHealthComputing\Lectures\Week5DataManagement\SAS\patients.txt" PAD; 
INPUT @1  PATNO    $3. 
      @4  GENDER   $1. 
      @5  VISIT    MMDDYY10. 
      @15 HR       3. 
      @18 SBP      3. 
      @21 DBP      3. 
      @24 DX       $3. 
      @27 AE       $1.; 
LABEL PATNO   = "PATIENT NUMBER" 
      GENDER  = "GENDER" 
      VISIT   = "VISIT DATE" 
      HR      = "HEART RATE" 
      SBP     = "SYSTOLIC BLOOD PRESSURE" 
      DBP     = "DIASTOLIC BLOODPRESSURE" 
      DX      = "DIAGNOSIS CODE" 
      AE      = "ADVERSE EVENT?"; 
FORMAT VISIT MMDDYY10.; 
RUN; 

/*
PROC FREQ DATA = Patients;
TABLE GENDER / NOCUM NOPERCENT;
RUN;
*/

PROC SQL;
SELECT GENDER, COUNT(GENDER)
FROM PATIENTS
GROUP BY GENDER;
QUIT;
Out[35]:
SAS Output

SAS Output

The SAS System

The SQL Procedure

Query Results

GENDER  
  0
2 1
F 14
M 13
X 1
In [38]:
/*
PROC SORT DATA=Patiets;
VAR GENDER;
RUN;

PROC MEANS DATA=Patients MEAN;
VAR SBP HR;
BY GENDER;
WHERE DBP > 60;
RUN;
*/

PROC SQL;
SELECT GENDER, MEAN(SBP) AS AvgSBP, MEAN(HR) as AvgHR
FROM PATIENTS
WHERE DBP > 60
GROUP BY GENDER;
QUIT;
Out[38]:
SAS Output

SAS Output

The SAS System

The SQL Procedure

Query Results

GENDER AvgSBP AvgHR
  102 72
2 108 74
F 149.6364 170.5
M 134.6 65
X 190 68
In [ ]: