SQL stands for “Structured Query Language”. It’s a language created for interacting with relational databases. SAS implements a version of this inside proc sql. This allows us to leverage the power of SQL to solve our problems.
proc sql; create table out.projdata2 as select a,b,c,d, /* vars from source */ a^2 as aa, /* New vars */ b^2 as bb, mean(c) as cbar /* Summary statistics */ from out.projdata; /* Input Dataset */ quit; |
SQL is guided by an ANSI standard, but it is up to the implementers to make their respective syntaxes match ANSI standard or not. The SQL syntax in proc SQL differs somewhat from other implementations of SQL.
- Variables – Variables, or columns in the data, are imported using the select function. This can occur a couple ways. First, if we want to select all the variables in a dataset, then we can use the asterisk (*). Else, we call variables by name. Case doesn’t matter, but the first instance a variable name is used will cause SAS to remember that case for all output of the variable. So if you want the output to look pretty, use the most appropriate case when you first reference the variable.
We can declare new variables by setting up some function of other variables, and giving them a name using as. We can do more than just line-level functions. Using SQL, we can make summary statistics, by-group summary statistics, indices, and more. We’ll go into by-group processing later on. - Joins – Joins are methods of merging two datasets. Typically speaking, we use a key, being a variable or set of variables that appear in both datasets on which we will merge. Joins are an integral feature of SQL, and are further separated by the ways in which they can be done. ANSI Standards specify 5 possible join types: inner, outer left, outer right, full outer, and cross.
- Inner Joins allow us to capture only rows that appear in both datasets, based on the key.
- Outer Left Joins keep all rows that were on the left dataset in the join, and only update those rows for which the key was in the right dataset as well. If a given row was in the left but not in the right, all the added columns will hold null values. Outer right joins work the same way, just keeping the right table intact instead of the left.
- Full Outer Joins keep all rows from both datasets. For a data row in 1 set that doesn’t have a matching row in the other set, the missing data is assigned null values.
- Cross Joins are an interesting beast. They do not use a key, rather they create a Kronecker Cross-Product of the two datasets, multiplying every row in one dataset with every row in the other dataset. For instance, a 10 row set cross-joined with a 20 row set would produce an output set of 200 rows. The functionality is there, but I’ve never had cause to use this type of join.