/*-----------------------------------------------------------------------------
File:           CreateResearchIds.txt
Author:         John Sabel, based on work by Ann Lima
                State of Washington
                Education Research and Data Center
Creation date:  10/14/11
Version:        1.01

Purpose
    Illustrates how to create random research identifiers for datasets that
    substitute for an original confidential identifier. The original
    confidential identifier can be something such as a Student_Id or a
    Person_Id. In the example below, however, a research identifier is
    created based on a confidential farm identifier.

    The sample dataset contains information about three farms, some of which
    is confidential. The confidential information must be stripped or masked.
    Each farm has a confidential Farm_Id. Each farm has nonconfidential
    animals, and each animal has a confidential farmyard date attached.
    Though the FarmYardDates are confidential, the ordering of the
    FarmYardDates is not, and needs to be preserved in the final research
    dataset.

    In the example, a Research_Id is created based on the Farm_Id. The
    resulting sort order of the Research_Ids is random, and bears no
    relationship to the original Farm_Ids. Though the sort order of the
    resulting Research_Ids is random, the records represented by each
    Research_Id are ordered based on the FarmYardDate. The confidential
    Farm_Id and FarmYardDate are then stripped from the final research
    dataset.
-----------------------------------------------------------------------------*/

data SampleData;
    /* Farm_Id starts in column 1, Animal in column 4 (list input skips any
       leading blank), and FarmYardDate in column 13. */
    input @1  Farm_Id
          @4  Animal $
          @13 FarmYardDate mmddyy10.;
    format FarmYardDate mmddyy10.;
    datalines;
11 Cow      10/01/1971
27 Pig      08/04/1995
27 Sheep    09/21/1976
27 Cat      07/22/2011
39 Turkey   11/24/2011
39 Duck     01/01/2010
401 Donkey  04/30/1965
;
run;

/* Order by Farm_Id and FarmYardDate */
proc sort data=SampleData;
    by Farm_Id FarmYardDate;
run;

/*
   Attach random numbers, one random number per Farm_Id.
   Also, attach the DateOrder based on the sorted FarmYardDates.
*/
data __TempData;
    set SampleData;
    by Farm_Id;
    retain DateOrder _RandomNumber;
    if first.Farm_Id then do;
        DateOrder = 1;
        _RandomNumber = ranuni(123456); /* Change this random number seed. */
    end;
    else do;
        DateOrder = DateOrder + 1;
    end;
run;

/*
   Sort by the random number attached to each Farm_Id, and within
   each Farm_Id, by the DateOrder of the FarmYardDates.
*/
proc sort data=__TempData;
    by _RandomNumber DateOrder;
run;

/*
   Create new Research_Ids based on the now random order of groups of
   records associated with each Farm_Id, and strip out the Farm_Id
   and FarmYardDate.
*/
data SampleData_WithResearchId (drop = _RandomNumber);
    length Research_Id 8 DateOrder 8 Animal $8;
    set __TempData (drop = Farm_Id FarmYardDate);
    by _RandomNumber;
    retain Research_Id 0;
    if first._RandomNumber then Research_Id = Research_Id + 1;
run;
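
/*
   Optional follow-up, not part of the original program: a minimal sketch
   that prints the final research dataset for a visual check, then deletes
   the intermediate dataset __TempData, which still holds the confidential
   Farm_Id and FarmYardDate alongside the random numbers.
   Assumption: all datasets live in the WORK library, as they do when the
   program above is run unchanged.
*/
proc print data=SampleData_WithResearchId noobs;
run;

proc datasets library=work nolist;
    delete __TempData;
quit;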