/*------------------------------------------------------------------------------ File: StackAllDataSets.sas Author: John Sabel john.sabel@ofm.wa.gov Washington State Education Research & Data Center Creation date: 2014-07-07 Version: 0.97 Purpose Stacks (appends) all data sets in a library into a single data set. This macro determines the maximum length for each variable. These lengths are used in each variable's definition in the outputted data set. As a result, the macro does not truncate data. If the data sets attempting to be stacked have variables with different type definitions (char/numeric), the macro will instead generate a data set report delineating each variable's type incompatibilities. The macro includes a set of filters to limit which data sets and/or which variables are stacked. Parameters LibnameIn: Name of library containing data sets to be merged. DsnOut: The name of the outputted data set. RegExDsnFilter_1: (Optional) A Perl Regular Expression used to filter the data sets to merge in the "LibnameIn" library. The default is /./, which acts as no filter at all, so by default all data sets in a library will be merged. When you do use this regular expression filter consider using the "I" option in order to make the filter case insensitive. SourceColumn: (Optional) When set to Y, creates the "SourceDataset" variable in DsnOut. This variable contains the original source data set name for each record. Default is Y. OnlyTenRecords: (Optional) When set to Y (actually to not N), will only stack ten records from each data set. Allows you to get a sense of what the macro will do before you try to stack numerous large complete data sets together. Default is N. RegExColFilter_1: (Optional) A Perl regular expression used to filter which variables (columns) get stacked. The default, /./, is no filter at all (unfiltered). Debug: (Optional) When set to Y, intermediate data sets are not deleted. Also, this macro outputs to the Results Viewer the variable names, types lengths as well as data set names. Note If one or more variable name has different types (i.e. num and char) in different data sets that are trying to be stacked, this macro will instead create the data set "SchzoidVariables" which will provide information concerning the type incompatilities. If this happens, you will need to resolve the type incompatibilites before trying to run this macro again. Signature %StackAllDataSets(LibnameIn=, DsnOut=, RegExDsnFilter_1=/./I, RegExColFilter_1=/./I, SourceColumn=Y, OnlyTenRecords=N, Debug=N) Revisions: Date Version Author Comments 2014-08-20 0.91 John Added "Source" to DsnOut and added input parameter "SourceColumn". 2014-08-21 0.92 John Removed DISTINCT keyword from final DsnOut SQL statement. Also, deleting temp table error. Also, improved the reordering of DsnOut variables. 2014-08-26 0.93 John Redid sort of SchiziodVariables to improve table presentation. Also, removed UPCASE statements from SQL where clauses involving LIBNAMEs and MEMNAMEs. 2014-09-16 0.94 John Fixed lengths of DsnOut variables bug. Added parameter "OnlyTenRecords". Added parameter check for LibnameIn. 2014-11-07 0.95 John For all "MEMNAME NOT IN" clauses, added adddtional temporary tables so that macro can disregard any temp tables that were generated from previous runs where Debug parameter was set to Y. Renamed DsnOut's "Source" variable to "SourceDataset" because sometimes source data might already already have a pre-existing "Source" column. 2014-12-31 0.96 John Added code to check if given the input parameters there were any columns that could be stacked. If not, the macro terminates. Also, added code to add labels, formats and informats to DsnOut for the variables in the data sets being stacked. 2017-09-19 0.97 John Previously, though the maximum length for each set of variable names was used in DsnOut, the other metadata (formats, etc) was based on the first data set that had each variable. As a result, though data was never actually truncated, too short formats would be assigned to variables, and so the display of these variabls would be truncated. Now it has been corrected. Changed name of data set when there are type incompatibilities from SchizoidVariables to ConfusedVariables. ------------------------------------------------------------------------------*/ %macro StackAllDataSets(LibnameIn=, DsnOut=, RegExDsnFilter_1=/./I, RegExColFilter_1=/./I, SourceColumn=Y, OnlyTenRecords=N, Debug=N); %put --- Start of %upcase(&sysmacroname) macro; %local NumVars; %local NumDataSets; %local err or; %local LibnameExists; %let err=ERR; %let or=OR; %if &Debug.~=N %then %do; %put _local_; %end; /* Helping to make repeated macro calls stateless...*/ %if %sysfunc(exist(&DsnOut.)) %then %do; proc sql; DROP TABLE &DsnOut.; quit; %end; proc sql noprint; SELECT DISTINCT LIBNAME INTO :Foo FROM DICTIONARY.TABLES WHERE LIBNAME = "%upcase(&LibnameIn.)" ; %let LibnameExists=&sqlobs.; quit; %if &LibnameExists.=0 %then %do; %put &ERR.&OR.: LibnameIn &LibnameIn. does not exist. Macro will now terminate.; %goto exit; %end; proc sql; CREATE TABLE ConfusedVariables AS SELECT DISTINCT COLUMNS.Name, COLUMNS.MEMNAME, COLUMNS.TYPE, COLUMNS.LENGTH, COLUMNS.VARNUM FROM DICTIONARY.COLUMNS INNER JOIN ( SELECT Name, COUNT(Type) AS TypeCount FROM ( SELECT DISTINCT Name, Type FROM DICTIONARY.COLUMNS WHERE LIBNAME = "%upcase(&LibnameIn.)" AND MEMNAME NOT IN ( '__STACKALL_VARS_01', '__STACKALL_VARS_01A', '__STACKALL_VARS_01B', '__STACKALL_VARS_02', '__STACKALL_DSNOUT', '__STACKALL_TEMP', 'CONFUSEDVARIABLES') AND prxmatch("&RegExDsnFilter_1", MEMNAME) > 0 AND prxmatch("&RegExColFilter_1", NAME) > 0 ) TypeCheck GROUP BY Name HAVING TypeCount > 1 ) MultiTypes ON COLUMNS.Name = MultiTypes.Name WHERE COLUMNS.LIBNAME = "%upcase(&LibnameIn.)" AND MEMNAME NOT IN ( '__STACKALL_VARS_01', '__STACKALL_VARS_01A', '__STACKALL_VARS_01B', '__STACKALL_VARS_02', '__STACKALL_DSNOUT', '__STACKALL_TEMP', 'CONFUSEDVARIABLES') AND prxmatch("&RegExDsnFilter_1", COLUMNS.MEMNAME) > 0 AND prxmatch("&RegExColFilter_1", COLUMNS.NAME) > 0 ORDER BY Name,MEMNAME ; quit; %if &sqlobs.=0 %then %do; proc sql; DROP TABLE ConfusedVariables; quit; %end; %if &sqlobs.>0 %then %do; %put &err&or: (&sysmacroname.) One or more variables are defined as both numeric and character in data sets that macro was attempting to stack.; %put &err&or: (&sysmacroname.) See ConfusedVariables data set for more information.; %goto exit; %end; /* Fetch a list of all variables and other information for all data sets that are to be stacked. */ proc sql; CREATE TABLE __STACKALL_VARS_01 AS SELECT DISTINCT Memname, Name, Length, VarNum, Type, label, format, informat FROM DICTIONARY.COLUMNS WHERE LIBNAME = "%upcase(&LibnameIn.)" AND MEMNAME NOT IN ('__STACKALL_VARS_01', '__STACKALL_VARS_01A', '__STACKALL_VARS_01B', '__STACKALL_VARS_02', '__STACKALL_DSNOUT', '__STACKALL_TEMP', 'CONFUSEDVARIABLES') AND prxmatch("&RegExDsnFilter_1", MEMNAME) > 0 AND prxmatch("&RegExColFilter_1", NAME) > 0 ORDER BY Name, memname; ; quit; %if &sqlobs=0 %then %do; %put WARN%str(ING): (&sysmacroname.) There are no columns to stack. Macro will now terminate.; %goto exit; %end; /* */ data __STACKALL_VARS_01a; set __STACKALL_VARS_01; by Name memname; if first.Name then output; run; proc sort data=__STACKALL_VARS_01a; by memname varnum; run; data __STACKALL_VARS_01aa; length VarOrder 4; set __STACKALL_VARS_01a; VarOrder = _n_; run; /* Go to __STACKALL_VARS_01 to determine the maximum length of each NAME (variable). This will be used to help define the schema of DsnOut so that no data will be truncated and also to use the formats, labels, etc. that associated with the variables with the longest length. */ proc sql; CREATE TABLE __STACKALL_VARS_01b AS SELECT Name, MAX(Length) AS Length FROM __STACKALL_VARS_01 GROUP BY Name ; quit; /* Ordering by memname and then name ensures that the order of the variables in DsnOut will follow the order in the first in alphabetic order data set brought in, with additonal variables appearing in the order of subsequent data sets in alphabetic order. */ proc sql; CREATE TABLE __STACKALL_VARS_01c AS SELECT DISTINCT VarOrder ,ColumnMetadata.* FROM __STACKALL_VARS_01 ColumnMetadata INNER JOIN __STACKALL_VARS_01b MaxLength ON ColumnMetadata.Name = MaxLength.Name AND ColumnMetadata.Length = MaxLength.Length INNER JOIN __STACKALL_VARS_01aa VarOrder ON ColumnMetadata.Name = VarOrder.Name ORDER BY VarOrder ; quit; /* More than one data set could share variables with the same names and lengths. We need to winnow it and eliminate the duplicate variable (names), so we just arbitrarily choose the first variable of each possible set of variables with the same name (and length). */ data __STACKALL_VARS_02; set __STACKALL_VARS_01c; by VarOrder; if first.VarOrder then output; run; /* For when Debug=Y, and you want to examine this table. */ /* proc sort data=__STACKALL_VARS_02;*/ /* by Memname VarNum;*/ /* run;*/ proc sql %if &Debug.=N %then %do; noprint; %end; %else %do; ; title "Variable Names, Types and Maximum Lengths"; title2 "Library: %upcase(&LibnameIn.), RegEx Table Filter: &RegExDsnFilter_1., , RegEx Column Filter: &RegExColFilter_1."; %end; SELECT VarOrder, Name, Type, VarNum, Length AS MaximumVariableLength, label, format, informat INTO :VarOrder1-:VarOrder999 ,:Names1-:Names999 ,:Types1-:Types999 ,:VarNums1-:VarNums999 ,:Lengths1-:Lengths999 ,:Labels1-:Labels999 ,:Formats1-:Formats999 ,:Informats1-:Informats999 FROM __STACKALL_VARS_02 ORDER BY VarOrder ; %Let NumVars=&sqlobs.; quit; /* Create the lis of data sets to be stacked, and the number of data sets to be stacked. */ proc sql %if &Debug.=N %then %do; noprint; %end; %else %do; ; title "Data Set Names"; title2 "Library: %upcase(&LibnameIn.), RegEx Table Filter: &RegExDsnFilter_1., , RegEx Column Filter: &RegExColFilter_1."; %end; SELECT DISTINCT MEMNAME INTO :MemNames1-:MemNames999 FROM Dictionary.COLUMNS WHERE LIBNAME = "%upcase(&LibnameIn.)" AND MEMTYPE = 'DATA' AND MEMNAME NOT IN ('__STACKALL_VARS_01', '__STACKALL_VARS_01A', '__STACKALL_VARS_01B', '__STACKALL_VARS_02', '__STACKALL_DSNOUT', '__STACKALL_TEMP', 'CONFUSEDVARIABLES') AND prxmatch("&RegExDsnFilter_1", MEMNAME) > 0 AND prxmatch("&RegExColFilter_1", NAME) > 0 ; %let NumDataSets=&sqlobs.; quit; proc sql noprint; CREATE TABLE __STACKALL_DSNOUT ( SourceDataset char(32), %do I = 1 %to &NumVars.; &&Names&I %if &&Types&I=num %then NUM(&&Lengths&I); %else CHAR(&&Lengths&I); LABEL = "&&Labels&I" %if &&Formats&I ~= %then FORMAT=&&Formats&I; %if &&Informats&I ~= %then INFORMAT=&&Informats&I; %if &I.~=&NumVars. %then ,; %end; ) ; quit; %do I = 1 %to &NumDataSets.; %if &OnlyTenRecords.=N %then %do; proc append base=__STACKALL_DSNOUT data=&LibnameIn..&&MemNames&I force; run; %end; %else %do; data __STACKALL_TEMP; set &LibnameIn..&&MemNames&I (obs=10); run; proc append base=__STACKALL_DSNOUT data=__STACKALL_TEMP force; run; %end; proc sql; UPDATE __STACKALL_DSNOUT SET SourceDataset="&&MemNames&I.." WHERE SourceDataset IS NULL ; quit; %end; proc sql; CREATE TABLE &DsnOut. AS SELECT * FROM __STACKALL_DSNOUT %if &SourceColumn.~=Y %then (drop=SourceDataset);; quit; %if &Debug.=N %then %do; proc sql; DROP TABLE __STACKALL_VARS_01 ,__STACKALL_VARS_01A ,__STACKALL_VARS_01AA ,__STACKALL_VARS_01B ,__stackall_vars_01c ,__STACKALL_VARS_02 ,__STACKALL_DSNOUT; quit; %if &OnlyTenRecords.=Y %then %do; proc sql; DROP TABLE __STACKALL_TEMP; quit; %end; %end; %exit: %if &Debug.~=N %then %do; %put _user_; %end; title ""; title2 ""; %put --- End of %upcase(&sysmacroname) macro; %mend StackAllDataSets; /* data NameTestData1; input @1 FirstName $10. @11 LastName $10. @21 BirthDate mmddyy10.; format FirstName $20. BirthDate yymmddd10.; informat BirthDate mmddyy10.; cards; Jason Long 10/21/1950 Audrey Flytrap 12/25/2014 ; run; data NameTestData2; input FirstName $1-10 LastName $12-21 ShoeSize 24-25; cards; Johnjacob Longfellow 11 Dominique Luvidillawa 7 ; run; data PerhapsUnrelatedData; input RPM MPH; cards; 1298 62 4890 145 ; run; * This will merge every data set in the work directory. Since PerhapsUnrelatedData has different variables compared to NameTestData1 and NameTestData2, the ordering of the variables will be odd.; %StackAllDataSets(LibnameIn=work, DsnOut=TestData_Out1, RegExDsnFilter_1=/^(Name|Perhaps)/I, SourceColumn=Y, Debug=N) data BadName; input FirstName; cards; 1 2 3 ; run; * Data set "NameBad defines FirstName as numeric. So rerunning %StackAllDataSets() will cause this a collision of type definitions, resulting in the creation of the ConfusedVariables data set report.; %StackAllDataSets(LibnameIn=work, DsnOut=TestData_Out1, SourceColumn=Y, Debug=N) * This will merge only those data sets whose names begin with "Name". Note that the FirstName and LastName variables in NameTestData1 only have a length of 8, while the same variables in NameTestData2 have a length of 10. The resulting outputted data set uses the larger of the two lengths, 10.; %StackAllDataSets(LibnameIn=work, DsnOut=TestData_Out2, RegExDsnFilter_1=/^Name/I, Debug=N) */