ERDC has developed several SAS macros to assist us in the cleaning and linking of education data. Since other states and other entities might find them useful, we are making these SAS macros available to the public.
Items to note:
- The software is free.
- ERDC releases all rights to these macros. We place all ERDC macros listed on this page into the public domain. If you do use our macros, we would appreciate it if you would acknowledge the original contributor of the material.
- CreateResearchlds.sas - Creates a research identifier based on, but completely masks, some other confidential identifier such as SSN, Student_Id, or Person_Id. - Oct. 14, 2011
- ProcFreqCubed.sas - Analyzes a SAS dataset by running PROC FREQ against every column in the dataset. This is a very useful macro for characterizing all the variables in a dataset, especially “wide” datasets where it might be onerous to manually run PROC FREQ against each variable. (Version 1.4) - Updated May 20, 2015
- SQUEEZE.sas – Not developed by ERDC. A useful macro for optimizing space required to store variables in a SAS dataset.
- GetLibraryMetadata.sas - Creates a report in the form of a data set that contains user-selected metadata types for each data set in a library. Example metadata types include the record count (RecordCount), the creation date (crdate), the number of columns (nvar) and the column names (ColumnNames). Possible metadata types include all those available in SAS's Dictionary Table. (Version 1.13) - Updated June 23, 2014
- CopyDataSetsToWorkbook.sas – Copies all the data sets in a SAS library to an Excel workbook. Which data sets that are copied can be optionally filtered using a regular expression filter. (Version 0.9) - Sept. 14, 2015
Data Cleaning Macros
- DeleteSuperfluousRecords.sas - Deletes superfluous records from a data set. A superfluous record here means a record that contains information that is a strict subset of another record. A typical use is with name data, for example where one record has a first name, last name and a complete middle name, and other records have the same first and last names, but they only have the middle initial or no initial at all. In this case, the record with the complete middle name has more complete information than the other records, so this macro will delete the other records. – (Version 0.91) Updated Nov. 25, 2013
- ConvertFlags.sas - Converts all flags in all character fields in a dataset from one defined set of two values to another defined set of two values. Can be used, for example, to convert all Y/N flags to 1/0 flags, or vise versa. - Dec. 21, 2011
- DenullifyDataset.sas - Sets to empty all text fields in a dataset are equal to a given value. For example, this macro can be used to set all instances of "NULL" to the empty string. (Version 1.0) - May 8, 2015
- StandardizeNames.sas – Applies a set of business rules to standardize the first, middle and last names in a dataset. For example, the macro converts all names to upper case, strips out generational suffixes such as JR or III from the name fields and places these in a “Suffix” field, etc. (Version 1.3) – Updated Nov. 7, 2014
- StripDownAllCharVariables.sas - Applies the SAS function strip() to all variables in each character field in a dataset. This results in all leading and trailing spaces being removed from all character variables. – (Version 1.01) Updated June 23, 2014
- RemoveUnusedColumns.sas – Analyzed an inputted data set for columns that have no (null) data. It then outputs a second data set in which the columns without data have been removed. By using the optional “AlwaysKeep” parameter, certain columns can always be kept, even when they do not contain data. (Version 0.9) - Mar. 3, 2014
- StackAllDataSets.sas: Stacks (appends) all data sets in a library into a single data set. This macro determines the maximum length of each variable. These lengths are used in each variables definition in the outputted data set. As a result, the macro does not truncate data. If the data sets attempting to be stacked have variables with different type definitions (char/numeric), the macro will instead generate a data set report delineating each variable's type incompatibilities. (Version 0.96) - Dec. 31, 2014
- RenameVariablesInLibref.sas – For all data sets in a libref, renames all variables that have old name/new name mappings in a renaming Excel worksheet. You can use this macro to standardize variable names across multiple data sets. For example, some of your data sets could have a “LastName” variable, and others a “PersonLastName” field. With the concordant mappings in the pertinent renaming worksheet, you could standardize this field to be something else, such as “LAST_NAME”. Optionally, the renamings can be based on variable labels. This macro will also work with DBMS systems such as SQL Server. (Version 0.93) – June 3, 2015
- RenameVariablesInDataSet.sas - Similar to RenameVariablesInLibref.sas, but operates on a single data set. (Version 0.92) – June 3, 2015
General Linking Macros
- DivineDisparateDedupLinks.sas - Compares two methods that were used to deduplicate (consolidate) either the same data set, or two data sets that share an overlapping key. Outputs a data set of instances where the two methods did not identically deduplicate the same keys. (Version 1.0) - Dec. 24, 2013
- ExamineDedupKeyValue.sas - For a single record, displays how two different methods for de-duplication linked a single record to other records. Used in conjunction with DivineDisparateDeDupLinks.sas. - Oct. 6, 2011
- LinkViaNameChange.sas – A set of macros that uses external name change data to match two sets of person ids that contain different names. The external name change data needs to have been arranged into old name/new name pairings. Examples of name change data sources include marriage data, divorce data and court name change data. Though there is some example test data at the bottom of this file, non-test external name change data is not provided. However, any given state will typically have either a Vital Records Office or a Center for Health Statistics from which record level marriage and divorce data could be obtained. (Version 0.91) - Updated Feb. 20, 2013
- CreateCrosslinkID.sas – This program resolves many-to-many relationships between two record linkage identifiers into a single identifier, the “CrosslinkID.” These many-to-many relationships can occur under certain conditions even when using software that performs one-to-many record linkages.
Say you have two files that you have linked together through an identity matching/record linkage process. This process creates a data set containing two columns, where IDs from the first data set are linked to IDs in the second data set. If there is a one-to-one correspondence between the two sets of IDs, then you could use the file as is. Similarly, if there was a one-to-many or many-to-one correspondence, you could again use the data set as is. But if there is a many-many correspondence between the two IDs fields, then you will need to extract a "CrosslinkID" using this macro that resolves the many-to-many relationships. The result is a data set containing three columns:
- The ID from the first data set
- The ID from the second data set
Do not use CreateCrosslinkID.sas against a single data source! If you want to create a unique identifier, such as a person ID, for a single data source, use %CreateConsolidationID() instead. %CreateConsolidationID() is contained in CrosslinkID_ConsolidationID_Suite.sas below. (Version 1.0) – July 2, 2015
- CrosslinkID_ConsolidationID_Suite.sas – This source file contains four macros that can be used to deduplicate and consolidated record linkages:
- %CreateCrosslinkID() – See description for CreateCrosslinkID.sas above.
- %CombineCrosslinkIDs() – Combines the linking variables between two sets of overlapping or non-overlapping data sets, and creates a single CrosslinkID
- %CreateConsolidationID() – Whereas %CreateCrosslinkID()will resolve many-to-many relationships between two data sets, %CreateConsolidationID() resolves many-to-many linking variable relationships in a single partly deduplicated data set.
Say you have partly deduplicated a data set through an identity matching/record linkage process. The result is a series of one-to-many relationships between the deduplication ID and the ID (for example the record number) of the data set that was deduplicated. But if these one-to-many relationships overlap, this would mean more than one deduplication ID is linked to a single ID (record number). This then would result in a many-to-many relationship, in which case you will need to extract out the "ConsolidationID" using this macro. The result is a data set containing two columns:
- ID ((often the record number of the data set that was deduplicated).
An example of when you would use %CreateConsolidationID() if you had used Link Plus to deduplicate a file, and you needed to merge two or more of the resulting manual review files.
- %CombineConsolidationID() – Combines the linking variables between two overlapping or non-overlapping data sets, and creates a single ConsolidationID.
In addition to the four macros listed above, the program listing also contains a help macro, %_CreateCrosslinkID_RecurseFunc(). To gain further insight into what each macro does, please see sample data and macro calls below each macro listing in the source file. (Version 0.9) – May 28, 2014
LINK PLUS RELATED MACROS*
- LinkPlusToolkit.sas – Combines all the Link Plus program files listed below into a single program file. Running this program will compile every Link Plus macro into SAS. – Oct. 9, 2012
*Link Plus is a free probabilistic record linkage program developed at CDC's Division of Cancer Prevention and Control. It is a general purpose program that can be used for deduplicating and linking any data set. These SAS macros serve to enhance the capabilities of Link Plus.