/*------------------------------------------------------------------------------ File: CrosslinkID_ConsolidationID_Suite.sas Author: John Sabel john.sabel@ofm.wa.gov Washington State Education Research & Data Center Creation date: 5/28/14 Version: 0.91 Purpose Contains four macros that can be used to consolidate, deduplicate, identity matching variables in data sets. The four macros are as follows: 1) %CreateCrosslinkID() – Resolves many-to-many relationships for two data sets that have been linked together. Say you have two files that you have linked together through an identity matching/record linkage process. This process creates a data set containing two columns, where IDs from the first data set are linked to IDs in the second data set. If there is a one-to-one correspondence between the two sets of IDs, then you could use the file as is. Similarly, if there was a one-to-many or many-to-one correspondence, you could again use the data set as is. But if there is a many-many correspondence between the two IDs fields, then you will need to extract a "CrosslinkID" using this macro that resolves the many-to-many relationships. The result is a data set containing three columns: (1) CrosslinkID, (2) the ID from the first data set, (3) the ID from the second data set. 2) %CombineCrosslinkIDs() – Combines the linking variables between two sets of overlapping or non-overlapping data sets, and creates a single CrosslinkID. 3) %CreateConsolidationID() – Whereas %CreateCrosslinkID()will resolve many-to-many relationships between two linked data sets, CreateConsolidationID.sas resolves many-to-many linking variable relationships in a single partly deduplicated data set. Say you have partly deduplicated a data set through an identity matching/record linkage process. The result is a series of one-to- many relationships between the deduplication ID and the ID (the record number for example) of the data set that was deduplicated. But if these one-to-many relationships overlap, this would mean more than one deduplication ID is linked to a single ID (record number). This then would result in a many-to-many relationship, in which case you will need to extract out the "ConsolidationID using this macro. The result is a data set containing two columns: (1) ConsolidationID, (2) ID (often the record number of the data set that was deduplicated). 4) %CombineConsolidationID() – Combines the linking variables between two overlapping or non-overlapping data sets, and creates a single ConsolidationID. In addition to the four macros listed above, the program listing also contains a help macro, %_CreateCrosslinkID_RecurseFunc(). To gain further insight into what each macro does, please see sample data and macro calls below each macro listing in the source file. Revisions: Date Version Author Comments 2014-9-9 0.91 John Updated CreateConsolidationID() to version 0.91. ------------------------------------------------------------------------------*/ /*------------------------------------------------------------------------------ File: CreateCrosslinkID.sas Author: John Sabel john.sabel@ofm.wa.gov Washington State Education Research & Data Center Creation date: 2/7/14 Version: 0.91 Purpose Resolves many-to-many relationships between two identifiers into a single identifier, the "CrossLinkID". When linking data sets together, many identity matching packages utilize the use of one-to-many matches. For example an ID=1 might match to ID=A and ID=B, and ID=2 might match to ID=C and ID=D. But it could happen where along with ID=1, ID=2 also matchesID=B, converting these two one-to-many matches into a single many-to-many match. Because of the crosslinking between these IDs, they must represent the same "thing", therefore the same "CrossLinkID" should assigned to each of these IDs. That is what this program does. For further explanation, see example code at bottom of source code listing. In the outputted data set, this program also produces some summary statistics showing how much each ID has been deduplicated in the process of creating the crosslink id. This file contains two macros: 1) %CreateCrosslinkID() - The main macro. It is called by the user. 2) %_CreateCrosslinkID_RecurseFunc() - A helper macro, initially called by %CreateCrosslinkID(), and then called recursively by itself. This macro should not be called by the user. Parameters DsnIn: The name of the data set containing two ID columns. CrosslinkID: The name of the outputted crosslink ID. It is a numeric field. ID_A: The name of the first ID. It can either be a character or a numeric field. ID_B: The name of the second ID. It can either be a character or a numeric field. DsnOut: The name of the outputted data set. Along with the Crosslink_ID ID_A, and ID_B parameter fields, it also contains two fields. 1) &ID_A._Count - A count of the number ID_A's per CrosslinkID. This shows to what extent &ID_A. has been deduplicated. 2) &ID_B._Count - A count of the number ID_B's per CrosslinkID. This shows to what extent &ID_B. has been deduplicated. Debug (Optional) When Debug=Y, the intermediate data sets _TempTable, _ID_A_Counts, and _ID_B_Counts are not deleted. Signature %CreateCrosslinkID(DsnIn=, DsnOut=, CrosslinkID=, ID_A=, ID_B=, Debug=N) Revisions: Date Version Author Comments 2014-04-15 0.91 John To %CreateCrosslinkID() added a proc sort with NODUPLICATES clause. To %_CreateCrosslinkID_RecurseFunc() added to existing proc sort statement a NODUPLICATES clause. ------------------------------------------------------------------------------*/ %macro CreateCrosslinkID(DsnIn=, CrosslinkID=, ID_A=, ID_B=, DsnOut=, Debug=N); %put --- Start of %upcase(&sysmacroname) macro; proc sql; CREATE TABLE _TempTable AS SELECT DISTINCT . AS &CrosslinkID., &ID_A., &ID_B. FROM &DsnIn. ORDER BY &ID_A., &ID_B. ; quit; data _TempTable (drop=_CrosslinkID_Temp); set _TempTable; by &ID_A.; retain _CrosslinkID_Temp 0; if first.&ID_A. then _CrosslinkID_Temp = _CrosslinkID_Temp + 1; &CrosslinkID. = _CrosslinkID_Temp; run; %_CreateCrosslinkID_RecurseFunc(Dsn=_TempTable, ID_A=&ID_B., ID_B=&ID_A., CrosslinkID=&CrosslinkID., Debug=&Debug.) proc sort In=_TempTable out=_TempTable_Deduped noduplicates; by &ID_A. &CrosslinkID.; run; proc sql; CREATE TABLE _ID_A_Counts AS SELECT &CrosslinkID., COUNT(&CrosslinkID.) AS &ID_A._Count FROM ( SELECT DISTINCT &CrosslinkID., &ID_A. FROM _TempTable_Deduped ) GROUP BY &CrosslinkID. ; quit; proc sql; CREATE TABLE _ID_B_Counts AS SELECT &CrosslinkID., COUNT(&CrosslinkID.) AS &ID_B._Count FROM ( SELECT DISTINCT &CrosslinkID., &ID_B. FROM _TempTable_Deduped ) GROUP BY &CrosslinkID. ; quit; proc sql; CREATE TABLE &DsnOut. AS SELECT _TempTable_Deduped.*, &ID_A._Count, &ID_B._Count FROM _TempTable_Deduped INNER JOIN _ID_A_Counts ON _TempTable_Deduped.&CrosslinkID. = _ID_A_Counts.&CrosslinkID. INNER JOIN _ID_B_Counts ON _TempTable_Deduped.&CrosslinkID. = _ID_B_Counts.&CrosslinkID. ORDER BY &CrosslinkID., &ID_A., &ID_B. ; quit; proc sql; DROP TABLE _TempTable, _TempTable_Deduped, _ID_A_Counts, _ID_B_Counts; quit; %put --- End of %upcase(&sysmacroname) macro; %mend CreateCrosslinkID; /*----------------------------------------------------------------------------- macro: %_CreateCrosslinkID_RecurseFunc() - A helper macro called by %CreateCrosslinkID(). ------------------------------------------------------------------------------*/ %macro _CreateCrosslinkID_RecurseFunc(Dsn=, CrosslinkID=, ID_A=, ID_B=, Debug=N); %put --- Start of %upcase(&sysmacroname) macro; /* Initially assume this call will result in no changes. */ %let RecurseAgain=No; proc sort data = &Dsn. noduplicates; by &ID_A. &CrosslinkID.; run; data &Dsn. (drop= _GroupCrosslinkID %if &Debug.=N %then %do;_&CrosslinkID._Old %end;); set &Dsn.; by &ID_A. &CrosslinkID.; retain _GroupCrosslinkID; if first.&ID_A. then _GroupCrosslinkID = &CrosslinkID.; _&CrosslinkID._Old = &CrosslinkID.; if &CrosslinkID. ~= _GroupCrosslinkID then do; &CrosslinkID. = _GroupCrosslinkID; call symputx('RecurseAgain', 'Yes', 'F'); end; run; /* If a change has occurred then make recursive call to this function. Flop ID_A and ID_B */ %if &RecurseAgain.=Yes %then %_CreateCrosslinkID_RecurseFunc(Dsn=&Dsn., ID_A=&ID_B., ID_B=&ID_A., CrosslinkID=&CrosslinkID. ,Debug=&Debug.); %put --- End of %upcase(&sysmacroname) macro; %mend _CreateCrosslinkID_RecurseFunc; /* * Sample data and macro calls for %CreateCrosslinkID(). ; * Contains a single many-to-many relationship between every ID in data set; data SampleData_1; input A B $; datalines; 1 A 2 B 2 C 3 D 4 D ; run; %CreateCrosslinkID(DsnIn=SampleData_1, CrosslinkID=Crosslink_ID, ID_A=A, ID_B=B, DsnOut=SampleData_1_Out, Debug=N) * Represents one many-to-many relationship, one one-to-many relationship, and one many-to-one relationship; data SampleData_2; input A B $; datalines; 1 A 1 B 2 B 2 C 3 D 3 E 3 F 4 G 5 G ; run; %CreateCrosslinkID(DsnIn=SampleData_2, DsnOut=SampleData_2_Out, CrosslinkID=Crosslink_ID, ID_A=A, ID_B=B, Debug=N) */ /*------------------------------------------------------------------------------ File: CombineCrosslinkIDs.sas Author: John Sabel john.sabel@ofm.wa.gov Washington State Education Research & Data Center Creation date: 4/28/14 Version: 0.9 Purpose Combines two data sets that contain pairs of linking identifiers and then produces a singleCrosslinkID that spans both data sets. The data sets can either have non-overlapping or overlapping pairs of identifiers. The macro is called "CombineCrosslinkIDs.sas" because ostensibly you could have first created a CrosslinkID for each of the two data sets, and now you are seeking to combine these CrosslinkIDs. But the way the macro actual works is is simply combines the linking identifiers of both data sets, and creates afresh a new CrosslinkID. It does not make use of any previous CrosslinkIDs that may have been produced. This file contains two macros: 1) %CreateCrosslinkID() - The main macro. It is called by the user. 2) %_CreateCrosslinkID_RecurseFunc() - A helper macro, initially called by %CreateCrosslinkID(), and then called recursively by itself. This macro should not be called by the user. Parameters DsnIn_1: The name of the data set containing the first pairs of IDs. DsnIn_2: The name of the data set containing the second pairs of IDs. CrosslinkID: The name of the outputted crosslink ID. It is a numeric field. ID_A: The name of the first ID. It can either be a character or a numeric field. ID_B: The name of the second ID. It can either be a character or a numeric field. DsnOut: The name of the outputted data set. Along with the Crosslink_ID ID_A, and ID_B parameter fields, it also contains two fields. 1) &ID_A._Count - A count of the number ID_A's per CrosslinkID. This shows to what extent &ID_A. has been deduplicated. 2) &ID_B._Count - A count of the number ID_B's per CrosslinkID. This shows to what extent &ID_B. has been deduplicated. Debug (Optional) When Debug=Y, the intermediate data sets _TempTable, _ID_A_Counts, and _ID_B_Counts are not deleted. Signature %CombineCrosslinkIDs(DsnIn_1=, DsnIn_2=, DsnOut=, CrosslinkID=, ID_A=, ID_B=, Debug=N) ------------------------------------------------------------------------------*/ %macro CombineCrosslinkIDs(DsnIn_1=, DsnIn_2=, DsnOut=, CrosslinkID=, ID_A=, ID_B=, Debug=N); %put --- Start of %upcase(&sysmacroname) macro; proc sql; CREATE TABLE __TempCombinedCrosslink AS SELECT DISTINCT &ID_A., &ID_B. FROM &DsnIn_1. UNION SELECT DISTINCT &ID_A., &ID_B. FROM &DsnIn_2. ; quit; %CreateCrosslinkID(DsnIn=__TempCombinedCrosslink, CrosslinkID=&CrosslinkID., ID_A=&ID_A., ID_B=&ID_B., DsnOut=&DsnOut., Debug=Y) %if &Debug.=N %then %do; proc sql; DROP TABLE __TempCombinedCrosslink; quit; %end; %put --- End of %upcase(&sysmacroname) macro; %mend CombineCrosslinkIDs; /* * Each of these test data sets contain the variable "Crosslink_ID" so you can; * more easily see how the pairs of IDs are originally related.; data TestData_CombineCrosslinkIDs_1; input Crosslink_ID A B $; datalines; 1 1 A 2 2 B 2 2 C 3 3 D 4 3 D ; run; * The first record combines records containing B,C and D from the data set above,; * TestData_CombineCrosslinkIDs_1, into a what will be a single set. The call to; * %CombineCrosslinkIDs() below will create this single, linked set.; data TestData_CombineCrosslinkIDs_2; input Crosslink_ID A B $; datalines; 100 3 C 101 5 E ; run; %CombineCrosslinkIDs(DsnIn_1=TestData_CombineCrosslinkIDs_1, DsnIn_2=TestData_CombineCrosslinkIDs_1, DsnOut=TestData_CombineCrosslinkIDs_Ot, CrosslinkID=NewCrosslinkID, ID_A=A, ID_B=B, Debug=N) */ /*------------------------------------------------------------------------------ File: CreateConsolidationID.sas Author: John Sabel john.sabel@ofm.wa.gov Washington State Education Research & Data Center Creation date: 5/28/14 Version: 0.91 Purpose Resolves many-to-many relationships in a single data set into a single consolidated (deduplicated) identifier, the "ConsolidationID". Deduplicating a data set typically involves linking its record numbers together. Each record linkage implies that the two records involved are members of the same entity. Where there are several linkages, a network of linkages is created. This macro resolves this network of linkages, and associates each record number a ConsolidationID. So from this potentially many-to-many relationship, a one-to-many relationship is created. One "one" side is the ConsolidationID. On the "many" side is the recordnum (the ID_Out parameter). This macro uses the macro "CreateCrosslinkID.sas". Parameters DsnIn: The name of the data set containing two ID columns. ConsolidationID: The name of the outputted consolidation ID. It is a numeric field. ID_Out The name of the outputted ID associated with the ConsolidationID. It is composed of the values in ID_A and ID_B. ID_A: The name of the first ID. It can either be a character or a numeric field, but both ID_A and ID_B must be of the same type. ID_B: The name of the second ID. It can either be a character or a numeric field, but both ID_A and ID_B must be of the same type. DsnOut: The name of the outputted data set. Debug (Optional) When Debug=Y, the intermediate data sets _TempTable. Signature %CreateConsolidationID(DsnIn=, DsnOut=, ConsolidationID=, ID_Out=, ID_A=, ID_B=, Debug=N) Revisions: Date Version Author Comments 2014-9-9 0.91 John Previous version pretty much simply called %CreateCrosslinkID(), but this caused for example, the following two records, (ID_A=2, ID_B=1) and (ID_A=1, ID_B=2) to get two ConsolidationIDs. Fixed this sort of problem. ------------------------------------------------------------------------------*/ %macro CreateConsolidationID(DsnIn=, ConsolidationID=, ID_Out=, ID_A=, ID_B=, DsnOut=, Debug=N); %put --- Start of %upcase(&sysmacroname) macro; %local HasCommutativeIDs; %local NumOrChar; %CreateCrosslinkID(DsnIn=&DsnIn., CrosslinkID=&ConsolidationID., ID_A=&ID_A., ID_B=&ID_B., DsnOut=__CreateConsolidationID_Temp_1, Debug=&Debug.) /* Get type ("char" or "num" of a representative ID column. */ proc sql noprint; SELECT UPCASE(TYPE), LENGTH INTO :NumOrChar, :length FROM DICTIONARY.COLUMNS WHERE LIBNAME = "WORK" AND MEMNAME = "%upcase(__CreateConsolidationID_Temp_1)" AND UPCASE(NAME) = "%upcase(&ID_A.)" ; quit; %do %until(&HasCommutativeIDs.=0); proc sql; CREATE TABLE __CreateCon_CommutativeID_Count AS SELECT CommutativeID, COUNT(CommutativeID) AS CommutativeID_Count FROM ( SELECT DISTINCT &ConsolidationID., &ID_A. AS CommutativeID FROM __CreateConsolidationID_Temp_1 UNION SELECT DISTINCT &ConsolidationID., &ID_B. FROM __CreateConsolidationID_Temp_1 ) GROUP BY CommutativeID HAVING CommutativeID_Count > 1 ; %let HasCommutativeIDs=&sqlobs.; quit; /* I don't know if I only need to make one pass, or if under some conditions, more than one pass could be needed. If and until I figure it out, I'll be safe and code for more than one pass. */ %if &HasCommutativeIDs.> 0 %then %do; proc sql; CREATE TABLE __CreateConsolidationID_Temp_2 AS SELECT Confounded.Master, Temp_1.* FROM __CreateConsolidationID_Temp_1 Temp_1 LEFT JOIN ( SELECT __CreateCon_CommutativeID_Count.CommutativeID as Master, InnerTemp_1.&ConsolidationID. FROM __CreateConsolidationID_Temp_1 InnerTemp_1 INNER JOIN __CreateCon_CommutativeID_Count ON InnerTemp_1.&ID_A. = __CreateCon_CommutativeID_Count.CommutativeID OR InnerTemp_1.&ID_B. = __CreateCon_CommutativeID_Count.CommutativeID ) Confounded ON Temp_1.&ConsolidationID. = Confounded.&ConsolidationID. ORDER BY Master, &ConsolidationID. ; quit; data __CreateConsolidationID_Temp_1; length _ConsolidationID_Temp 8.; set __CreateConsolidationID_Temp_2; by Master; retain _ConsolidationID_Temp; %if &NumOrChar.=NUM %then %do; if Master = . then output; %end; %else %do; if Master = ' ' then output; %end; else do; if first.Master then _ConsolidationID_Temp = &ConsolidationID.; &ConsolidationID. = _ConsolidationID_Temp; output; end; run; %end; %end; proc sql; CREATE TABLE &DsnOut. AS SELECT DISTINCT &ConsolidationID., &ID_A. AS &ID_Out. label=' ' FROM __CreateConsolidationID_Temp_1 UNION SELECT DISTINCT &ConsolidationID., &ID_B. AS &ID_Out. FROM __CreateConsolidationID_Temp_1 ORDER BY &ConsolidationID., &ID_Out. ; quit; %if &Debug.=N %then %do; proc sql; DROP TABLE __CreateCon_CommutativeID_Count, __CreateConsolidationID_Temp_1, __CreateConsolidationID_Temp_2; quit; %end; %put --- End of %upcase(&sysmacroname) macro; %mend CreateConsolidationID; /* data SampleDedupedData; input RecordNumA RecordNumB; datalines; 1 1 1 2 3 3 4 3 5 5 5 6 6 6 7 5 10 10 11 10 11 12 12 10 12 5 101 100 100 101 999 999 ; run; %CreateConsolidationID(DsnIn=SampleDedupedData, DsnOut=SampleDedupedData_Out, ConsolidationID=ConsolidationID, ID_Out=RecordNum, ID_A=RecordNumA, ID_B=RecordNumB, Debug=N) * End of sample data and macro calls for %CreateConsolidationID(). ; */ /*------------------------------------------------------------------------------ File: CombineConsolidationIDs.sas Author: John Sabel john.sabel@ofm.wa.gov Washington State Education Research & Data Center Creation date: 5/28/14 Version: 0.9 Purpose Combines two data sets that have been deduplicated. The data sets can be overlapping or not overlapping, but their deduplication ID must have the same name. This name is passed into the macro as the "ConsolidationID" parameter. Parameters DsnIn_1: The name of the data set containing the first pairs of IDs. DsnIn_2: The name of the data set containing the second pairs of IDs. ConsolidationID: The name of the deduplication ID, both in the inputted data sets and in the outputted data set. It is a numeric field field. ID: The name of the record-level, single identity-level ID DsnOut: The name of the outputted data set. Debug (Optional) When Debug=Y, the intermediate data sets are not deleted, Signature %CombineCrosslinkIDs(DsnIn_1=, DsnIn_2=, DsnOut=, ConsolidationID=, ID=, Debug=N) ------------------------------------------------------------------------------*/ %macro CombineConsolidationIDs(DsnIn_1=, DsnIn_2=, DsnOut=, ConsolidationID=, ID=, Debug=N); proc sql; CREATE TABLE __Schema_DsnIn_1 LIKE &DsnIn_1.; quit; proc sql noprint; SELECT UPCASE(TYPE), LENGTH INTO :ID_Type, :ID_Length FROM DICTIONARY.COLUMNS WHERE UPCASE(LIBNAME) = "WORK" AND UPCASE(MEMNAME) = "__SCHEMA_DSNIN_1" AND NAME = "&ID." ; quit; proc sort data=&DsnIn_1; by &ConsolidationID. &ID.; run; data __Tuples_DsnIn_1 (keep=__ID_1 __ID_2); set &DsnIn_1; by &ConsolidationID. &ID.; %if &ID_Type.=CHAR %then %do; length __ID_1 $&ID_Length.; %end; %else %do; length __ID_1 &ID_Length.; %end; retain __ID_1; if first.&ConsolidationID. then __ID_1 = &ID.; __ID_2 = &ID.; /* if __ID_1 = __ID_2 then delete;*/ run; proc sort data=&DsnIn_2; by &ConsolidationID. &ID.; run; data __Tuples_DsnIn_2 (keep=__ID_1 __ID_2); set &DsnIn_2; by &ConsolidationID. &ID.; retain __ID_1; if first.&ConsolidationID. then __ID_1 = &ID.; __ID_2 = &ID.; /* if __ID_1 = __ID_2 then delete;*/ run; proc sql; CREATE TABLE __Tuples_Combined AS SELECT __ID_1, __ID_2 FROM __Tuples_DsnIn_1 UNION SELECT __ID_1, __ID_2 FROM __Tuples_DsnIn_2 ; quit; %CreateConsolidationID(DsnIn=__Tuples_Combined, ConsolidationID=&ConsolidationID., ID_Out=&ID., ID_A=__ID_1, ID_B=__ID_2, DsnOut=&DsnOut., Debug=N) %if &Debug.=N %then %do; proc sql; DROP TABLE __Schema_DsnIn_1, __Tuples_DsnIn_1, __Tuples_DsnIn_2, __Tuples_Combined; quit; %end; %put --- End of %upcase(&sysmacroname) macro; %mend CombineConsolidationIDs; /* data ConsolSampleData_1; input DedupID RecordNum; datalines; 1 100 1 199 2 111 2 125 2 130 2 135 3 141 3 142 888 888 ; run; data ConsolSampleData_2; input DedupID RecordNum; datalines; 1 199 1 222 2 2500 2 2600 3 273 3 288 4 299 5 141 5 142 999 999 ; run; %CombineConsolidationIDs(DsnIn_1=ConsolSampleData_1, DsnIn_2=ConsolSampleData_2, DsnOut=ConsolSampleData_Combined2, ConsolidationID=DedupID, ID=RecordNum, Debug=Y) */