******************************************************************************* ***************Wave 2 datasets merge - By different levels ******************* ******************************************************************************* ******Merge #1 -- family single level merges***** **Merging geographically linked data with family derived variables first** *I chose to do it this way rather than using the longitudinal family dataset first as this had a much higher family n than the other 2 (19,243 observations), indicating it also included families which were not within the "achieved sample" while the other two only seemed to have families from the "achieved sample". Wanted to make sure they matched up ok first -- then see what the effect of adding in the longitudinal family dataset would do. *Check 1: uniqueness of ID (only relevant via MCSID in datasets with one line per family, will have to use a PID or something else in other types) ****No MCSIDs are the same cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_geographically_linked_data.dta", clear sort MCSID by MCSID: assert _N==1 clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_family_derived.dta" sort MCSID by MCSID: assert _N==1 clear *Check 2: Examining any non-matches per MCSID: Are any non-matches expected?/ What do the non-matches look like? *Began by keeping just one observation for every value of MCSID (this step is probably not needed as there is one ID per family and i've already seen there are no duplicate MCSIDs where there shouldn't be). *Merged on MCSID, keeping (1) the records that match and (2) the records from the master that do not match -- does every value in dataset one appear in 2? *Do this for each of the common variables across all datasets -- only seems to be MCSID in this instance (the country variable seemed to only match 4 families - one from each country!). cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_geographically_linked_data.dta", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "H:\0000.MCS\July merge datasets\JM_mcs2_family_derived.dta", keep(master match) ****There was n=14 not matched that were in the master (geographically linked) and n=0 not matched from the merged on dataset (family DV) list MCSID if _merge==1 clear ****Judging by the observations found in each of the datasets -- this makes sense that 14 are missing. There was an "achieved sample" of 15,590 families - so looks like data was acquired for all families for the geographically linked data, and only 14 less families were able to have no data to created DVs -- their observations (15,576 observations) show this 14 missing were expected. *Looking from GLD data -- all present and data looks ok use "H:\0000.MCS\July merge datasets\JM_mcs2_geographically_linked_data.dta" list if MCSID == "M13234R" list if MCSID == "M15310U" list if MCSID == "M19649L" list if MCSID == "M20925E" list if MCSID == "M21294A" list if MCSID == "M22110F" list if MCSID == "M22583L" list if MCSID == "M22911G" list if MCSID == "M24550K" list if MCSID == "M25458U" list if MCSID == "M26360P" list if MCSID == "M27117J" list if MCSID == "M28676Q" list if MCSID == "M33835W" *Looking from DV dataset -- not even present here clear use "H:\0000.MCS\July merge datasets\JM_mcs2_family_derived.dta" list if MCSID == "M13234R" list if MCSID == "M15310U" list if MCSID == "M19649L" list if MCSID == "M20925E" list if MCSID == "M21294A" list if MCSID == "M22110F" list if MCSID == "M22583L" list if MCSID == "M22911G" list if MCSID == "M24550K" list if MCSID == "M25458U" list if MCSID == "M26360P" list if MCSID == "M27117J" list if MCSID == "M28676Q" list if MCSID == "M33835W" clear *Making the merge between geographically linked and family DV cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_geographically_linked_data.dta", clear merge 1:1 MCSID using "H:\0000.MCS\July merge datasets\JM_mcs2_family_derived.dta" *Check 3: Do the combined results look like they go together? -- Look at random sample ****hard to determine within wave, but nothing seems glaringly obvious gen u = uniform() sort u list in 1/5 drop u rename _merge merge_GLD_FDV *SFL = Single Family Line save "MCS2_SFL_Jmerge.dta" **Merging the longitudinal family dataset with the geographically linked data-family derived variables merged dataset second** *Check 1: uniqueness of ID (only relevant via MCSID in datasets with one line per family, will have to use a PID or something else in other types) ****No MCSIDs are the same cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs_longitudinal_family_file.dta", clear sort MCSID by MCSID: assert _N==1 clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\MCS2_SFL_Jmerge.dta" sort MCSID by MCSID: assert _N==1 clear *Check 2: Examining any non-matches per MCSID: Are any non-matches expected?/ What do the non-matches look like? *Began by keeping just one observation for every value of MCSID (this step is probably not needed as there is one ID per family and i've already seen there are no duplicate MCSIDs where there shouldn't be). *Merged on MCSID, keeping (1) the records that match and (2) the records from the master that do not match -- does every value in dataset one appear in 2? *Do this for each of the common variables across all datasets -- only seems to be MCSID in this instance (the country variable seemed to only match 4 families - one from each country!). cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs_longitudinal_family_file.dta", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "H:\0000.MCS\July merge datasets\MCS2_SFL_Jmerge.dta", keep(master match) ****There was n=3,653 not matched that were in the master (longitudinal family) and n=0 not matched from the merged on dataset (merged geographical-family DV) list MCSID if _merge==1 *As there was a lot of missing in the master, I did a rough check where I got the MCSIDs of 1 in 100 families and checked to see if they were in any of the prior data list MCSID in 1/100 if _merge==1 clear ****Judging by the observations found in each of the datasets -- this makes sense that 3,653 are missing. There was an "achieved sample" of 15,590 families which is reflected in the MCS2_SFL_merge data. So missing families seen in the longitudinal family file likely have all who were in wave 1 and dropped out, as well as ones contacted but did not become an achieved sample at wave 1. ****Looking from LFD data -- none of the data missing in the merge seems to be in either of the single or merged dataset. use "H:\0000.MCS\July merge datasets\JM_mcs2_geographically_linked_data.dta" list if MCSID == "M10032V" list if MCSID == "M10052Z" list if MCSID == "M10053A" list if MCSID == "M10073E" list if MCSID == "M10080D" list if MCSID == "M10081E" list if MCSID == "M10085J" list if MCSID == "M10088M" list if MCSID == "M10097N" list if MCSID == "M10108Y" list if MCSID == "M10124Y" list if MCSID == "M10135B" clear use "H:\0000.MCS\July merge datasets\JM_mcs2_family_derived.dta" list if MCSID == "M10032V" list if MCSID == "M10052Z" list if MCSID == "M10053A" list if MCSID == "M10073E" list if MCSID == "M10080D" list if MCSID == "M10081E" list if MCSID == "M10085J" list if MCSID == "M10088M" list if MCSID == "M10097N" list if MCSID == "M10108Y" list if MCSID == "M10124Y" list if MCSID == "M10135B" clear use "H:\0000.MCS\July merge datasets\MCS2_SFL_Jmerge.dta" list if MCSID == "M10032V" list if MCSID == "M10052Z" list if MCSID == "M10053A" list if MCSID == "M10073E" list if MCSID == "M10080D" list if MCSID == "M10081E" list if MCSID == "M10085J" list if MCSID == "M10088M" list if MCSID == "M10097N" list if MCSID == "M10108Y" list if MCSID == "M10124Y" list if MCSID == "M10135B" clear *Making the merge between geographically linked and family DV cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs_longitudinal_family_file.dta", clear merge 1:1 MCSID using "MCS2_SFL_Jmerge.dta" rename _merge merge_LFD_GLDFDV *Check 3: Do the combined results look like they go together? -- Look at random sample ****hard to determine within wave, but nothing seems glaringly obvious gen u = uniform() sort u list in 1/5 *Also checked by NOCM variable as both the LFD and GLD-FDV datasets had versions of this variable. *There were some missing data points for BDNOCM00 in some Ps, so checked this too alongside the merge variables ****All seemed to match up ok, and the merged variables also seemed to verify this list MCSID NOCMHH BDNOCM00 NOCM in 1/100 list MCSID NOCMHH BDNOCM00 NOCM merge_GLD_FDV merge_LFD_GLDFDV in 1/100 drop u *SFL = Single Family Line save "MCS2_SFL_Jmerge.dta", replace ******Merge #2 -- CM only level merges***** *Check 1: uniqueness of ID or CM number ****No MCSIDs or CMNUMs are the same cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_cm_cognitive_assessment", clear sort MCSID CMNUM by MCSID CMNUM: assert _N==1 clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_cm_derived.dta" sort MCSID CMNUM by MCSID CMNUM: assert _N==1 clear *Check 2: Examining any non-matches per MCSID and CM number: Are any non-matches expected?/ What do the non-matches look like? ****All instances were merged -- Expected as both had the same number of observations -- n= 15,778 use "H:\0000.MCS\July merge datasets\JM_mcs2_cm_cognitive_assessment", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "H:\0000.MCS\July merge datasets\JM_mcs2_cm_derived.dta", keep(master match) *Same check via CM number ****All instances merged use "H:\0000.MCS\July merge datasets\JM_mcs2_cm_cognitive_assessment", clear sort CMNUM by CMNUM: keep if _n==1 merge 1:m CMNUM using "H:\0000.MCS\July merge datasets\JM_mcs2_cm_derived.dta", keep(master match) *Making the merge between geographically linked and family DV cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_cm_cognitive_assessment", clear merge 1:1 MCSID CMNUM using "H:\0000.MCS\July merge datasets\JM_mcs2_cm_derived.dta" rename _merge merge_CMCA_CMDV *Check 3: Do the combined results look like they go together? -- Look at random sample *All seem ok -- no obvious issues (checked to see if all of the 2nd CM members were such) gen u = uniform() sort u list in 1/5 list in 1/2500 if CMNUM ==2 list in 1/15778 if CMNUM == 2 drop u *CMOL = CM Only Levels save "MCS2_CMOL_Jmerge.dta" ******Merge #3 -- Responder only level merges***** *Check 1: uniqueness of ID or person number-MCSIDs ****No MCSIDs or person number-MCSIDs are the same cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_parent_derived.dta", clear sort MCSID PnumID by MCSID PnumID: assert _N==1 clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_parent_cm_interview" sort MCSID PnumID by MCSID PnumID: assert _N==1 clear *Check 2: Examining any non-matches per MCSID and Person number: Are any non-matches expected?/ What do the non-matches look like? ****Not matched: 81 (25,908 matched) -- indicates that 81 families included in parent DV did not appear in parent cm interview. cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_parent_derived.dta", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "H:\0000.MCS\July merge datasets\JM_mcs2_parent_cm_interview", keep(master match) list MCSID if _merge==1 clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_parent_cm_interview", clear *Checked if any of these these were present at all in the parent CM interview dataset ****These definately don't exist in the parent CM interview list if MCSID == "M10064D" list if MCSID == "M10551P" list if MCSID == "M10946E" list if MCSID == "M11442P" list if MCSID == "M11596G" list if MCSID == "M11756E" list if MCSID == "M11845E" list if MCSID == "M13008H" list if MCSID == "M13048R" list if MCSID == "M13598S" list if MCSID == "M13823G" list if MCSID == "M14087C" list if MCSID == "M14246Z" list if MCSID == "M14472H" list if MCSID == "M14856V" list if MCSID == "M14887C" list if MCSID == "M15194J" list if MCSID == "M15443G" list if MCSID == "M16704N" list if MCSID == "M17042B" list if MCSID == "M17914A" list if MCSID == "M18213G" list if MCSID == "M18557D" list if MCSID == "M18964Q" list if MCSID == "M19074T" list if MCSID == "M19196C" list if MCSID == "M19496M" list if MCSID == "M19619E" list if MCSID == "M19860M" list if MCSID == "M20030Y" list if MCSID == "M20368X" list if MCSID == "M20435R" list if MCSID == "M20452S" list if MCSID == "M20607T" list if MCSID == "M21214J" list if MCSID == "M21305L" list if MCSID == "M21331N" list if MCSID == "M21606W" list if MCSID == "M22254W" list if MCSID == "M22276C" list if MCSID == "M22926P" list if MCSID == "M23015M" list if MCSID == "M23516D" list if MCSID == "M23881X" list if MCSID == "M23950T" list if MCSID == "M24131U" list if MCSID == "M24323A" list if MCSID == "M24823R" list if MCSID == "M25175L" list if MCSID == "M25277R" list if MCSID == "M25362M" list if MCSID == "M25940Z" list if MCSID == "M26545W" list if MCSID == "M27499M" list if MCSID == "M27986X" list if MCSID == "M28174X" list if MCSID == "M28252U" list if MCSID == "M29678W" list if MCSID == "M29797C" list if MCSID == "M30108J" list if MCSID == "M30232M" list if MCSID == "M30894V" list if MCSID == "M31328Y" list if MCSID == "M31432X" list if MCSID == "M31587R" list if MCSID == "M32263C" list if MCSID == "M32357H" list if MCSID == "M32444F" list if MCSID == "M32695Y" list if MCSID == "M32796C" list if MCSID == "M33090D" list if MCSID == "M33292M" list if MCSID == "M33295Q" list if MCSID == "M33542L" list if MCSID == "M33637S" list if MCSID == "M34266P" list if MCSID == "M34572W" list if MCSID == "M35085Q" list if MCSID == "M35409Q" list if MCSID == "M35411J" list if MCSID == "M35683G" *Same check via Person MCSID number ****25,907 unique individuals (main or partner) were able to be matched. Not matched: 2,517 *There is such a large amount of Ps here; so picked the last 20 to see what the issue is potentially. ***None of these had any data in the parent cm interview dataset. ***Seemed to be a majority of partners (and some main responders). Likely partners did not take part in interview as majority of those with codes in this have -1 (or equivalent) except for the 'read to' items for the HLE cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_parent_derived.dta", clear tab PNUM sort PnumID by PnumID: keep if _n==1 merge 1:m PnumID using "H:\0000.MCS\July merge datasets\JM_mcs2_parent_cm_interview", keep(master match) list PnumID if _merge==1 clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_parent_cm_interview", clear list if PnumID == "M35743B2" list if PnumID == "M35751B7" list if PnumID == "M35755F2" list if PnumID == "M35738E2" list if PnumID == "M35742A2" list if PnumID == "M35743B2" list if PnumID == "M35751B7" list if PnumID == "M35755F2" list if PnumID == "M35761D2" list if PnumID == "M35778N2" list if PnumID == "M35788Q2" list if PnumID == "M35793M2" list if PnumID == "M35796Q2" list if PnumID == "M35808B2" list if PnumID == "M35825C2" list if PnumID == "M35833C2" list if PnumID == "M35834D2" list if PnumID == "M35859N2" list if PnumID == "M35865L2" list if PnumID == "M35869Q2" clear *decided to investigate any which had PNUMs after 2 and likely not natural parents cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_parent_derived.dta", clear merge 1:1 MCSID PnumID using "H:\0000.MCS\July merge datasets\JM_mcs2_parent_cm_interview" *When seeing the merge results, there was n=1 only had data in the cm interview dataset (M23687X) - checked this out ****This was the only family to have 3 people in this merged datset (person numbers 1, 4 and 5; both 4 and 5 were eligible and interviewed as partner responders and natural father. Looks like a human error -- and I will need to merge these two). list MCSID if _merge ==2 *Tried to see the ns for people who were not main eligible & main actual responders by unmerged ****n=140 who WERE eligible to be main responder only had data in the parent DV dataset ****n=2,377 who WERE NOT eligible to be main responder (all partner in person/proxy, n=1 included from parent cm interview data) only had data in the parent DV dataset tab ELIG_New _merge *Also checked their actual interview outcomes ****No one who actually responded as a main was unmatched. ****n=223 proxy partners were not matched, nor were those (n=2293) who were not interviewed. There were none of these types in the matched set of data. ****n=2 partners were not matched. On closer inspection this is the M23687X PNUMs 4 and 5. This is more evidence to indicate this is actually the same person. ****This findings lead me to believe they just didn't include people who weren't interviewed or proxy partners in the cm interview dataset. Therefore, it seems like there are no issues with the dataset to be merged. tab BRESP00 _merge list MCSID PnumID if _merge ==1 & BRESP00 ==2 list MCSID PnumID if _merge ==2 & BRESP00 ==2 *I had a look at the HH grid dataset quickly to try and determine what Person number of the person split in M23687X by matching up the eligibility, response and relationship to CM data. It seems to be the person with PNUM 4. I went and manually edited the dataset in the parent DV dataset so this matched up cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_hhgrid", clear clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_parent_derived.dta", clear replace BPNUM00 = 4 in 14972 replace PnumID = "M23687X4" in 14972 replace PNUM = 4 in 14972 save "H:\0000.MCS\July merge datasets\JM_mcs2_parent_derived.dta", replace *Making the merge between responder dv and responder cm interview cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_parent_derived.dta", clear merge 1:1 MCSID PnumID using "H:\0000.MCS\July merge datasets\JM_mcs2_parent_cm_interview" rename _merge merge_PDV_PCM *Check 3: Seems like the split family member has now merged properly. Do a quick final check on this with the syntax used before ****Yep, just the proxies and no interview responders now not merged. tab BRESP00 merge_PDV_PCM list MCSID PnumID if merge_PDV_PCM ==1 & BRESP00 ==2 list MCSID PnumID if merge_PDV_PCM ==2 & BRESP00 ==2 *Check 4: Do the combined results look like they go together? -- Look at random sample *All seem ok -- no obvious issues gen u = uniform() sort u list in 1/5 drop u *ROL = Resp Only Levels save "MCS2_ROL_Jmerge.dta" ******************************************************************************* ****** Wave 2 datasets merge - All datasets (already merged by level) ******** ******************************************************************************* ******************************************************************************** *****Merge #1 HH and CM only level data***** *HH = n=69,204 total observations; n=15,778 CM observations *CMOL = n=15,778 CM observations *I first need to make a variable I can merge on which is unique to each person -- a CM-based PID (already prepared in HH grid) cd "H:\0000.MCS\July merge datasets" use "MCS2_CMOL_Jmerge.dta", clear *Making a unique identifier for CMs for merging this and the HHgrid tab BCNUM00 sort MCSID gen CMnumID = MCSID + string(BCNUM00) if (BCNUM00 > 0) save "MCS2_CMOL_Jmerge.dta", replace *Check 1 ****All that were matched were the expected numbers. (n=53,426 not matched -- all the non-CMs) cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_hhgrid", clear merge 1:1 MCSID CMnumID using "MCS2_CMOL_Jmerge.dta" sort MCSID CMnumID by MCSID CMnumID: assert _N==1 clear ****Check 2 *began by using my sample data and keeping just one observation for every value of personid. I don't care which observation I keep, I just need to keep one and only one. Then I merged on personid, keeping (1) the records that match and (2) the records from the master that do not match. I have no interest in the resulting dataset; I just wanted to see the table merge would report. TO CHECK IF EVERY VALUE OF MSCID THAT APPEARS IN FIRST DATASET ALSO APPEARS IN SECOND DATASET *Do this for each of the common variables across all datasets *For MCSID individually ****All ok -- expected cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_hhgrid", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "MCS2_CMOL_Jmerge.dta", keep(master match) *For CMnumID individually ****All ok -- again expected to have this many not merged as not in dataset because not CMs cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_hhgrid", clear sort CMnumID by CMnumID: keep if _n==1 merge 1:m CMnumID using "MCS2_CMOL_Jmerge.dta", keep(master match) clear **Completing the merge between HHgrid and CM level data** cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs2_hhgrid", clear merge 1:1 MCSID CMnumID using "MCS2_CMOL_Jmerge.dta" rename _merge merge_HH_CMOL2 *Check 3: Do the combined results look like they go together? -- Look at random sample ****seem ok -- no obvious issues gen u = uniform() sort u list in 1/5 drop u save "MCS2_Final_Jmerge.dta" *Doing this so that I know the wave where the merge variable came from rename merge_CMCA_CMDV merge_CMCA_CMDV2 *Adding a code to CM only variables (with a -10 value) so that this data doesn't show up missing, and we know data isn't present for a person because they are not a CM replace BCNUM00 = -10 if BCNUM00==. & BPNUM00 == 1 | BPNUM00 == 2 | BPNUM00 == 3 | BPNUM00 == 4 | BPNUM00 == 5 | BPNUM00 == 6 | BPNUM00 == 7 | BPNUM00 == 8 | BPNUM00 == 9 | BPNUM00 == 10 | BPNUM00 == 11 | BPNUM00 == 12 | BPNUM00 == 13 | BPNUM00 == 14 | BPNUM00 == 15 | BPNUM00 == 16 | BPNUM00 == 17 label define CNUMX -10 "Not applicable - Not CM" 1 "1st Cohort Member of the family" 2 "2nd Cohort Member of the family" 3 "3rd Cohort Member of the family", replace label values BCNUM00 CNUMX tab BCNUM00 replace BCASAG00 =-10 if BCASAG00 ==. & BPNUM00 == 1 | BPNUM00 == 2 | BPNUM00 == 3 | BPNUM00 == 4 | BPNUM00 == 5 | BPNUM00 == 6 | BPNUM00 == 7 | BPNUM00 == 8 | BPNUM00 == 9 | BPNUM00 == 10 | BPNUM00 == 11 | BPNUM00 == 12 | BPNUM00 == 13 | BPNUM00 == 14 | BPNUM00 == 15 | BPNUM00 == 16 | BPNUM00 == 17 label define BCASX -10 "Not applicable - Not CM" -7 "Not carried out" -1 "Not applicable" 1 "Ages 2:6-2:8" 2 "Ages 2:9-2:11" 3 "Ages 3:0-3:2" 4 "Ages 3:3-3:5" 5 "Ages 3:6-3:8" 6 "Ages 3:9-3:11" 7 "Ages 4:0-4:2" 8 "Ages 4:3-4:5" 9 "4:6-4:8", replace label values BCASAG00 BCASX tab BCASAG00 replace BCRKNAGE =-10 if BCRKNAGE ==. & BPNUM00 == 1 | BPNUM00 == 2 | BPNUM00 == 3 | BPNUM00 == 4 | BPNUM00 == 5 | BPNUM00 == 6 | BPNUM00 == 7 | BPNUM00 == 8 | BPNUM00 == 9 | BPNUM00 == 10 | BPNUM00 == 11 | BPNUM00 == 12 | BPNUM00 == 13 | BPNUM00 == 14 | BPNUM00 == 15 | BPNUM00 == 16 | BPNUM00 == 17 label define BCRKNAX -10 "Not applicable - Not CM" -7 "Not carried out" -1 "Not applicable" 1 "Ages 2:6:0-2:8:30" 2 "Ages 2:9:0-2:11:30" 3 "Ages 3:0:0-3:2:30" 4 "Ages 3:3:0-3:5:30" 5 "Ages 3:6:0-3:8:30" 6 "Ages 3:9:0-3:11:30" 7 "Ages 4:0:0-4:2:30" 8 "Ages 4:3:0-4:5:30" 9 "Ages 4:6:0-4:8:30", replace label values BCRKNAGE BCRKNAX tab BCRKNAGE replace BCCASS00 =-10 if BCCASS00 ==. & BPNUM00 == 1 | BPNUM00 == 2 | BPNUM00 == 3 | BPNUM00 == 4 | BPNUM00 == 5 | BPNUM00 == 6 | BPNUM00 == 7 | BPNUM00 == 8 | BPNUM00 == 9 | BPNUM00 == 10 | BPNUM00 == 11 | BPNUM00 == 12 | BPNUM00 == 13 | BPNUM00 == 14 | BPNUM00 == 15 | BPNUM00 == 16 | BPNUM00 == 17 label define BCCASX -10 "Not applicable - Not CM" -1 "Not applicable" 0 "Child Assessments not completed" 1 "Both Child Assessments completed" 2 "Child Assessments BRAKEN only completed" 3 "Child Assessments BAS only completed", replace label values BCCASS00 BCCASX tab BCCASS00 replace BDSRCS00 =-10 if BDSRCS00 ==. & BPNUM00 == 1 | BPNUM00 == 2 | BPNUM00 == 3 | BPNUM00 == 4 | BPNUM00 == 5 | BPNUM00 == 6 | BPNUM00 == 7 | BPNUM00 == 8 | BPNUM00 == 9 | BPNUM00 == 10 | BPNUM00 == 11 | BPNUM00 == 12 | BPNUM00 == 13 | BPNUM00 == 14 | BPNUM00 == 15 | BPNUM00 == 16 | BPNUM00 == 17 label define BDSRCX -10 "Not applicable - Not CM" -8 "Refused" -7 "Not carried out" -4 "Sub-test not completed" -3 "Age unknown" -2 "No interview date" -1 "Not applicable", replace label values BDSRCS00 BDSRCX tab BDSRCS00 replace BDSRCN00 =-10 if BDSRCN00 ==. & BPNUM00 == 1 | BPNUM00 == 2 | BPNUM00 == 3 | BPNUM00 == 4 | BPNUM00 == 5 | BPNUM00 == 6 | BPNUM00 == 7 | BPNUM00 == 8 | BPNUM00 == 9 | BPNUM00 == 10 | BPNUM00 == 11 | BPNUM00 == 12 | BPNUM00 == 13 | BPNUM00 == 14 | BPNUM00 == 15 | BPNUM00 == 16 | BPNUM00 == 17 label define BDSRCX -10 "Not applicable - Not CM" -8 "Refused" -7 "Not carried out" -4 "Sub-test not completed" -3 "Age unknown" -2 "No interview date" -1 "Not applicable" 1 "Very delayed" 2 "Delayed" 3 "Average" 4 "Advanced" 5 "Very advanced", replace label values BDSRCN00 BDSRCX tab BDSRCN00 replace BDBAST00 =-10 if BDBAST00 ==. & BPNUM00 == 1 | BPNUM00 == 2 | BPNUM00 == 3 | BPNUM00 == 4 | BPNUM00 == 5 | BPNUM00 == 6 | BPNUM00 == 7 | BPNUM00 == 8 | BPNUM00 == 9 | BPNUM00 == 10 | BPNUM00 == 11 | BPNUM00 == 12 | BPNUM00 == 13 | BPNUM00 == 14 | BPNUM00 == 15 | BPNUM00 == 16 | BPNUM00 == 17 label define BDBATX -10 "Not applicable - Not CM" -8 "Ended early" -7 "Not carried out" -6 "Not administered" -1 "CM N/A", replace label values BDBAST00 BDBATX tab BDBAST00 replace BDC06E00 =-10 if BDC06E00 ==. & BPNUM00 == 1 | BPNUM00 == 2 | BPNUM00 == 3 | BPNUM00 == 4 | BPNUM00 == 5 | BPNUM00 == 6 | BPNUM00 == 7 | BPNUM00 == 8 | BPNUM00 == 9 | BPNUM00 == 10 | BPNUM00 == 11 | BPNUM00 == 12 | BPNUM00 == 13 | BPNUM00 == 14 | BPNUM00 == 15 | BPNUM00 == 16 | BPNUM00 == 17 label define BDC06X -10 "Not applicable - Not CM" -9 "Refusal" -8 "Don't know'" -1 "Not applicable" 1 "White" 2 "Mixed" 3 "Indian" 4 "Pakistani and Bangledeshi" 5 "Black or Black British" 6 "Other Ethnic group (inc. Chinese, Other", replace label values BDC06E00 BDC06X tab BDC06E00 replace BEBDTOT =-10 if BEBDTOT ==. & BPNUM00 == 1 | BPNUM00 == 2 | BPNUM00 == 3 | BPNUM00 == 4 | BPNUM00 == 5 | BPNUM00 == 6 | BPNUM00 == 7 | BPNUM00 == 8 | BPNUM00 == 9 | BPNUM00 == 10 | BPNUM00 == 11 | BPNUM00 == 12 | BPNUM00 == 13 | BPNUM00 == 14 | BPNUM00 == 15 | BPNUM00 == 16 | BPNUM00 == 17 label define BEBDTX -10 "Not applicable - Not CM" -9 "Refusal" -8 "Don't know'" -1 "Not applicable", replace label values BEBDTOT BEBDTX tab BEBDTOT replace merge_CMCA_CMDV2 = -1 if merge_CMCA_CMDV2 !=3 label define merge_CMCA_CMDV2X -1 "Data not present in this merge" 3 "matched", replace label values merge_CMCA_CMDV2 merge_CMCA_CMDV2X tab merge_CMCA_CMDV2 save "MCS2_Final_Jmerge.dta", replace *****Merge #2 HH-CML and Responder only level data***** *69,204 in HH-CML *28,424 in RL *Check 1: Checks that merge numbers for the full datset by the chosen variables match up as expected, and there are no duplicates ****All that were matched were the expected numbers. (n=40,780 not matched -- all the non-responders (CM and non-CMs)) cd "H:\0000.MCS\July merge datasets" use "MCS2_Final_Jmerge.dta", clear merge 1:1 MCSID PnumID using "MCS2_ROL_Jmerge.dta" sort MCSID PnumID by MCSID PnumID: assert _N==1 clear ****Check 2: Checks if every value of MSCID and person ID appears in both datasets individually *Do this for each of the common variables across all datasets *For MCSID individually ****All but 1 MCSID did not match (from HH-CML dataset) - "M27117J". cd "H:\0000.MCS\July merge datasets" use "MCS2_Final_Jmerge.dta", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "MCS2_ROL_Jmerge.dta", keep(master match) clear ****This one participant does not exist in the ROL dataset cd "H:\0000.MCS\July merge datasets" use "MCS2_ROL_Jmerge.dta", clear list if MCSID == "M27117J" clear ****4 members of this family, 3 non-CMs, 1 CM. No interview seems to have been completed -- hence why the ROL dataset does not have this family in. cd "H:\0000.MCS\July merge datasets" use "MCS2_Final_Jmerge.dta", clear list if MCSID == "M27117J" *For PnumID individually ****All ok -- expected (the M27117J was one of the families only present in master as expected) cd "H:\0000.MCS\July merge datasets" use "MCS2_Final_Jmerge.dta", clear sort PnumID by PnumID: keep if _n==1 merge 1:m PnumID using "MCS2_ROL_Jmerge.dta", keep(master match) **Completing the merge between HH-CML and Responder level data** cd "H:\0000.MCS\July merge datasets" use "MCS2_Final_Jmerge.dta", clear merge 1:1 MCSID PnumID using "MCS2_ROL_Jmerge.dta" rename _merge merge_HHCM_ROL2 rename merge_PDV_PCM merge_PDV_PCM2 *Check 3: Do the combined results look like they go together? -- Look at random sample *All seem ok -- no obvious issues gen u = uniform() sort u list in 1/5 drop u sort MCSID BPNUM00 save "MCS2_Final_Jmerge.dta", replace *Recoding the "non-merged" responses cd "H:\0000.MCS\July merge datasets" use "MCS2_Final_Jmerge.dta", clear replace BDDRES00 =-10 if BDDRES00 ==. & merge_HHCM_ROL2 ==1 label define BDDRES00 -10 "Not a responder at wave 2", add tab BDDRES00 replace BDDREL00 =-10 if BDDREL00 ==. & merge_HHCM_ROL2 ==1 label define BDDREL00 -10 "Not a responder at wave 2", add tab BDDREL00 replace BDDNVQ00 =-10 if BDDNVQ00 ==. & merge_HHCM_ROL2 ==1 label define BDDNVQ00 -10 "Not a responder at wave 2", add tab BDDNVQ00 *HLE variables with missing just given -1 (these just get deleted). If I need to differentiate between these and the original -1, I can just tab by the merge variable (i.e. tab BPOFRE00CM1 merge_HHCM_ROL2) replace BPOFRE00CM1 =-1 if BPOFRE00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPREEL00CM1 =-1 if BPREEL00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPREOF00CM1 =-1 if BPREOF00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPTOLI00CM1 =-1 if BPTOLI00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPOFLI00CM1 =-1 if BPOFLI00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPALPH00CM1 =-1 if BPALPH00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPOFAB00CM1 =-1 if BPOFAB00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPNUMB00CM1 =-1 if BPNUMB00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPOFCO00CM1 =-1 if BPOFCO00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPSONG00CM1 =-1 if BPSONG00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPOFSO00CM1 =-1 if BPOFSO00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPDRAW00CM1 =-1 if BPDRAW00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPPAMA00CM1 =-1 if BPPAMA00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPOFRE00CM2 =-1 if BPOFRE00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPREEL00CM2 =-1 if BPREEL00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPREOF00CM2 =-1 if BPREOF00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPTOLI00CM2 =-1 if BPTOLI00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPOFLI00CM2 =-1 if BPOFLI00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPALPH00CM2 =-1 if BPALPH00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPOFAB00CM2 =-1 if BPOFAB00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPNUMB00CM2 =-1 if BPNUMB00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPOFCO00CM2 =-1 if BPOFCO00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPSONG00CM2 =-1 if BPSONG00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPOFSO00CM2 =-1 if BPOFSO00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPDRAW00CM2 =-1 if BPDRAW00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPPAMA00CM2 =-1 if BPPAMA00CM2 ==. & merge_HHCM_ROL2 ==1 *Same with the long-term health questions replace BPCLSI00CM1 =-1 if BPCLSI00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPCLSL00CM1 =-1 if BPCLSL00CM1 ==. & merge_HHCM_ROL2 ==1 replace BPCLSI00CM2 =-1 if BPCLSI00CM2 ==. & merge_HHCM_ROL2 ==1 replace BPCLSL00CM2 =-1 if BPCLSL00CM2 ==. & merge_HHCM_ROL2 ==1 *Same with the speech/language concerns replace BPSPLM0ACM1 =-1 if BPSPLM0ACM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0BCM1 =-1 if BPSPLM0BCM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0CCM1 =-1 if BPSPLM0CCM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0DCM1 =-1 if BPSPLM0DCM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0ECM1 =-1 if BPSPLM0ECM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0FCM1 =-1 if BPSPLM0FCM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0GCM1 =-1 if BPSPLM0GCM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0HCM1 =-1 if BPSPLM0HCM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0ICM1 =-1 if BPSPLM0ICM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0JCM1 =-1 if BPSPLM0JCM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0KCM1 =-1 if BPSPLM0KCM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0LCM1 =-1 if BPSPLM0LCM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0MCM1 =-1 if BPSPLM0MCM1 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0ACM2 =-1 if BPSPLM0ACM2 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0BCM2 =-1 if BPSPLM0BCM2 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0CCM2 =-1 if BPSPLM0CCM2 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0DCM2 =-1 if BPSPLM0DCM2 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0ECM2 =-1 if BPSPLM0ECM2 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0FCM2 =-1 if BPSPLM0FCM2 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0GCM2 =-1 if BPSPLM0GCM2 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0HCM2 =-1 if BPSPLM0HCM2 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0ICM2 =-1 if BPSPLM0ICM2 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0JCM2 =-1 if BPSPLM0JCM2 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0KCM2 =-1 if BPSPLM0KCM2 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0LCM2 =-1 if BPSPLM0LCM2 ==. & merge_HHCM_ROL2 ==1 replace BPSPLM0MCM2 =-1 if BPSPLM0MCM2 ==. & merge_HHCM_ROL2 ==1 save "MCS2_Final_Jmerge.dta", replace *****Merge #2 HH-CML-RL and Family level data***** *69,204 in HH-CML-RL *19,243 in FL *Check 1: Checks that merge numbers for the full datset by MCSID match up as expected. Judge the amount of 'contradictions' represents the number of families in the HH-CML-RL dataset ****15,579 families notes as contradictions -- this matches up to the # of unique families in the HH-CML-RL dataset cd "H:\0000.MCS\July merge datasets" use "MCS2_Final_Jmerge.dta", clear merge 1:1 MCSID using "MCS2_SFL_Jmerge.dta" sort MCSID by MCSID: assert _N==1 clear *Also did it the other way around -- which makes more sense as the SFL dataset has more families *Just stated "MCSID does not uniquely identify observations in the using data" - which happens because some families only have one line, while the ones added in from HH-CML-RL have more than one line per family. cd "H:\0000.MCS\July merge datasets" use "MCS2_SFL_Jmerge.dta", clear merge 1:1 MCSID using "MCS2_Final_Jmerge.dta" sort MCSID by MCSID: assert _N==1 clear *Check 2: Checks if every value of MSCID appears in both datasets (supporting the above check) *Did it the other way around because the SFL data had more families *Did not match 3,664 families from master -- makes sense as that is the deficit of families between datasets. cd "H:\0000.MCS\July merge datasets" use "MCS2_SFL_Jmerge.dta", clear sort MCSID merge 1:m MCSID using "MCS2_Final_Jmerge.dta" list MCSID if _merge ==1 clear *Took bottom 20 just to make sure -- they didn't exist. cd "H:\0000.MCS\July merge datasets" use "MCS2_Final_Jmerge.dta", clear sort MCSID list if MCSID == "M35730W" list if MCSID == "M35740Y" list if MCSID == "M35746E" list if MCSID == "M35752C" list if MCSID == "M35756G" list if MCSID == "M35771F" list if MCSID == "M35792L" list if MCSID == "M35811W" list if MCSID == "M35821Y" list if MCSID == "M35830Z" list if MCSID == "M35857L" list if MCSID == "M35860F" list if MCSID == "M35864K" list if MCSID == "M35870H" list if MCSID == "M35873L" clear **Completing the merge between HH-CML and Responder level data** *Rather than doing a 1:1 merge (as it wouldn't let me, and doesn't make sense), I did a 1:m merge -- this gives each of the family members the SFL data on their line-- So will need to keep this in mind when doing any exploration/analyses with the SFL variables. cd "H:\0000.MCS\July merge datasets" use "MCS2_SFL_Jmerge.dta", clear merge 1:m MCSID using "MCS2_Final_Jmerge.dta" sort MCSID *Renames the SFL merges so they indicate they are from wave 2 rename merge_GLD_FDV merge_GLD_FDV2 rename merge_LFD_GLDFDV merge_LFD_GLDFDV2 *Saved the merge variable for this merge rename _merge merge_HHCMR_SFL2 *Check 3: Do the combined results look like they go together? -- Look at random sample *All seem ok -- no obvious issues gen u = uniform() sort u list in 1/5 drop u sort MCSID PnumID save "MCS2_Final_Jmerge.dta", replace ******************************************************************************* ***************Wave 3 datasets merge - By different levels ******************* ******************************************************************************* ******Merge #1 -- family single level merges***** **Merging geographically linked data with family derived variables** *The longitudinal family dataset is not being merged here because its already in the wave 2 file. *Check 1: uniqueness of ID (only relevant via MCSID in datasets with one line per family, will have to use a PID or something else in other types) ****No MCSIDs are the same cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_geographically_linked_data.dta", clear sort MCSID by MCSID: assert _N==1 clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_family_derived.dta" sort MCSID by MCSID: assert _N==1 clear *Check 2: Examining any non-matches per MCSID: Are any non-matches expected?/ What do the non-matches look like? *Began by keeping just one observation for every value of MCSID (this step is probably not needed as there is one ID per family and i've already seen there are no duplicate MCSIDs where there shouldn't be). *Merged on MCSID, keeping (1) the records that match and (2) the records from the master that do not match -- does every value in dataset one appear in 2? *Do this for each of the common variables across all datasets -- only seems to be MCSID in this instance cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_geographically_linked_data.dta", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "H:\0000.MCS\July merge datasets\JM_mcs3_family_derived.dta", keep(master match) ****There was n=10 not matched that were in the master (geographically linked) and n=0 not matched from the merged on dataset (family DV) list MCSID if _merge==1 clear ****Judging by the observations found in each of the datasets -- this makes sense that 10 are missing. These 10 are the same ones at time 2, with the exception of 4 families ("M20925E" "M27117J" "M22110F" "M28676Q"). There was an "achieved sample" of 15,246 families - so looks like data was acquired for all families for the geographically linked data, and only 10 less families were able to have no data to create DVs. *Checked the 4 which were missed off from wave 2 ****"M22110F" "M20925E" have NO data at t3 -- but just seem to have been removed from the dataset at this point ****"M28676Q" "M27117J" have data in both datasets at t3 (WILL NEED TO CHECK IF THEY WERE IN T1 AND THEN DIDN'T DO T2) --So that is why they are not present as fully missing now. use "H:\0000.MCS\July merge datasets\JM_mcs3_geographically_linked_data.dta", clear list if MCSID == "M20925E" list if MCSID == "M27117J" list if MCSID == "M22110F" list if MCSID == "M28676Q" clear use "H:\0000.MCS\July merge datasets\JM_mcs3_family_derived.dta", clear list if MCSID == "M20925E" list if MCSID == "M27117J" list if MCSID == "M22110F" list if MCSID == "M28676Q" clear *Checked the missing families were consistent ****All have data in GLD as expected from the results use "H:\0000.MCS\July merge datasets\JM_mcs3_geographically_linked_data.dta", clear list if MCSID == "M13234R" list if MCSID == "M15310U" list if MCSID == "M19649L" list if MCSID == "M21294A" list if MCSID == "M22583L" list if MCSID == "M22911G" list if MCSID == "M24550K" list if MCSID == "M25458U" list if MCSID == "M26360P" list if MCSID == "M33835W" *None present in FDV as expected from the results clear use "H:\0000.MCS\July merge datasets\JM_mcs3_family_derived.dta", clear list if MCSID == "M13234R" list if MCSID == "M15310U" list if MCSID == "M19649L" list if MCSID == "M21294A" list if MCSID == "M22583L" list if MCSID == "M22911G" list if MCSID == "M24550K" list if MCSID == "M25458U" list if MCSID == "M26360P" list if MCSID == "M33835W" clear *Making the merge between geographically linked and family DV cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_geographically_linked_data.dta", clear merge 1:1 MCSID using "H:\0000.MCS\July merge datasets\JM_mcs3_family_derived.dta" *Check 3: Do the combined results look like they go together? -- Look at random sample ****hard to determine within wave, but nothing seems glaringly obvious gen u = uniform() sort u list in 1/5 drop u rename _merge merge_GLD_FDV3 *SFL = Single Family Line save "MCS3_SFL_Jmerge.dta", replace ******Merge #2 -- CM only level merges***** *Check 1: uniqueness of ID or CM number ****No MCSIDs or CMNUMs are the same cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_cm_cognitive_assessment", clear sort MCSID CMNUM by MCSID CMNUM: assert _N==1 clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_cm_derived.dta" sort MCSID CMNUM by MCSID CMNUM: assert _N==1 clear cd "H:\0000.MCS\July merge datasets" use "JM_mcs3_cm_fsp_teacher_survey.dta", clear sort MCSID CMNUM by MCSID CMNUM: assert _N==1 *Check 2: Examining any non-matches per MCSID and CM number: Are any non-matches expected?/ What do the non-matches look like? ****All instances were merged -- Expected as all three datasets had the same number of observations -- n= 15,431 use "H:\0000.MCS\July merge datasets\JM_mcs3_cm_cognitive_assessment", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "H:\0000.MCS\July merge datasets\JM_mcs3_cm_derived.dta", keep(master match) *Same check via CM number ****All instances merged use "H:\0000.MCS\July merge datasets\JM_mcs3_cm_cognitive_assessment", clear sort CMNUM by CMNUM: keep if _n==1 merge 1:m CMNUM using "H:\0000.MCS\July merge datasets\JM_mcs3_cm_derived.dta", keep(master match) clear ****Did the same for FSP data (with CM cog and then CM derived) *CM cog use "H:\0000.MCS\July merge datasets\JM_mcs3_cm_fsp_teacher_survey.dta", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "H:\0000.MCS\July merge datasets\JM_mcs3_cm_cognitive_assessment", keep(master match) use "H:\0000.MCS\July merge datasets\JM_mcs3_cm_fsp_teacher_survey.dta", clear sort CMNUM by CMNUM: keep if _n==1 merge 1:m CMNUM using "H:\0000.MCS\July merge datasets\JM_mcs3_cm_cognitive_assessment", keep(master match) clear *CM DV use "H:\0000.MCS\July merge datasets\JM_mcs3_cm_fsp_teacher_survey.dta", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "H:\0000.MCS\July merge datasets\JM_mcs3_cm_derived.dta", keep(master match) use "H:\0000.MCS\July merge datasets\JM_mcs3_cm_fsp_teacher_survey.dta", clear sort CMNUM by CMNUM: keep if _n==1 merge 1:m CMNUM using "H:\0000.MCS\July merge datasets\JM_mcs3_cm_derived.dta", keep(master match) clear *Making the merge between CM cognitive assessment and CM DV cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_cm_cognitive_assessment", clear merge 1:1 MCSID CMNUM using "H:\0000.MCS\July merge datasets\JM_mcs3_cm_derived.dta" rename _merge merge_CMCA_CMDV3 *Check 3: Do the combined results look like they go together? -- Look at random sample *All seem ok -- no obvious issues (checked to see if all of the 2nd CM members were such) gen u = uniform() sort u list in 1/5 list in 1/2500 if CMNUM ==2 list in 1/15431 if CMNUM == 2 drop u *CMOL = CM Only Levels save "MCS3_CMOL_Jmerge.dta" *Making the merged between merged cog assessment-CM DV and FSP cd "H:\0000.MCS\July merge datasets" use "MCS3_CMOL_Jmerge.dta", clear merge 1:1 MCSID CMNUM using "H:\0000.MCS\July merge datasets\JM_mcs3_cm_fsp_teacher_survey.dta" rename _merge merge_CMOL1_FSP3 *Check 3: Do the combined results look like they go together? -- Look at random sample *All seem ok -- no obvious issues (checked to see if all of the 2nd CM members were such) gen u = uniform() sort u list in 1/5 list in 1/2500 if CMNUM ==2 list in 1/15431 if CMNUM == 2 drop u *CMOL = CM Only Levels save "MCS3_CMOL_Jmerge.dta", replace *Final check with all three datasets in cd "H:\0000.MCS\July merge datasets" use "MCS3_CMOL_Jmerge.dta", clear sort MCSID CMNUM by MCSID CMNUM: assert _N==1 ******Merge #3 -- Responder only level merges***** *Check 1: uniqueness of ID or person number-MCSIDs ****No MCSIDs or person number-MCSIDs are the same cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_parent_derived.dta", clear sort MCSID PnumID by MCSID PnumID: assert _N==1 clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_parent_cm_interview" sort MCSID PnumID by MCSID PnumID: assert _N==1 clear *Check 2: Examining any non-matches per MCSID and Person number: Are any non-matches expected?/ What do the non-matches look like? ****Not matched: 36 (25,667 matched) -- indicates that 36 families included in parent DV did not appear in parent cm interview. cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_parent_derived.dta", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "H:\0000.MCS\July merge datasets\JM_mcs3_parent_cm_interview", keep(master match) list MCSID if _merge==1 clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_parent_cm_interview", clear *Checked if any of these these were present at all in the parent CM interview dataset ****These definately don't exist in the parent CM interview list if MCSID == "M10341F" list if MCSID == "M10646V" list if MCSID == "M11872H" list if MCSID == "M12610R" list if MCSID == "M13953R" list if MCSID == "M14437E" list if MCSID == "M14497S" list if MCSID == "M14857W" list if MCSID == "M15807Q" list if MCSID == "M16732S" list if MCSID == "M17396A" list if MCSID == "M17897S" list if MCSID == "M19151P" list if MCSID == "M19223N" list if MCSID == "M20646A" list if MCSID == "M20779N" list if MCSID == "M21730Z" list if MCSID == "M22068W" list if MCSID == "M23010G" list if MCSID == "M23217V" list if MCSID == "M24542K" list if MCSID == "M25252G" list if MCSID == "M26002U" list if MCSID == "M26392X" list if MCSID == "M26700P" list if MCSID == "M27313L" list if MCSID == "M27348X" list if MCSID == "M27446Y" list if MCSID == "M27829K" list if MCSID == "M28198F" list if MCSID == "M28506X" list if MCSID == "M30490D" list if MCSID == "M31587R" list if MCSID == "M32196J" list if MCSID == "M34496B" list if MCSID == "M35597K" *Same check via Person MCSID number ****25,666 unique individuals (main or partner) were able to be matched. Not matched: 1,778 *There is such a large amount of Ps here; so picked the last 20 to see what the issue is potentially. ***None of these had any data in the parent cm interview dataset. cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_parent_derived.dta", clear tab PNUM sort PnumID by PnumID: keep if _n==1 merge 1:m PnumID using "H:\0000.MCS\July merge datasets\JM_mcs3_parent_cm_interview", keep(master match) list PnumID if _merge==1 clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_parent_cm_interview", clear list if MCSID == "M35726A2" list if MCSID == "M35727B1" list if MCSID == "M35734A2" list if MCSID == "M35738E2" list if MCSID == "M35742A2" list if MCSID == "M35743B2" list if MCSID == "M35751B7" list if MCSID == "M35755F2" list if MCSID == "M35761D2" list if MCSID == "M35778N2" list if MCSID == "M35788Q2" list if MCSID == "M35793M2" list if MCSID == "M35796Q2" list if MCSID == "M35808B2" list if MCSID == "M35825C2" list if MCSID == "M35833C2" list if MCSID == "M35834D2" list if MCSID == "M35859N2" list if MCSID == "M35865L2" list if MCSID == "M35869Q2" clear *decided to investigate any which had PNUMs after 2 and likely not natural parents cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_parent_derived.dta", clear merge 1:1 MCSID PnumID using "H:\0000.MCS\July merge datasets\JM_mcs3_parent_cm_interview" *When seeing the merge results, there was n=1 only had data in the cm interview dataset (M13659N) - checked this out ****This was the only family to have 3 people in this merged datset (person numbers 1, 4 and 3; both 3 and 4 were eligible and interviewed as partner responders -- 4 was natural father but 3 had no further info for DV. Looks like a human error -- and I will need to merge these two). list MCSID if _merge ==2 *Tried to see the ns for people who were not main eligible & main actual responders by unmerged *Looked at the actual response to interview variable (BRESP00) to examine the same thing ****n=55 who WERE eligible to be main responder only had data in the parent DV dataset ****n=1,723 who WERE NOT eligible to be main responder (all partner in person/proxy, n=1 included from parent cm interview data) only had data in the parent DV dataset tab ELIG_New _merge *Also checked by their actual interview outcomes ****No one who actually responded as a main was unmatched. ****n=287 proxy partners were not matched, nor were those who were not interviewed (n=1490). There were none of these types in the matched set of data. ****n=2 partners were not matched. On closer inspection this is the M13659N PNUMs 4 and 3. This is more evidence to indicate this is actually the same person. ****These findings lead me to believe they just didn't include people who weren't interviewed or proxy partners in the cm interview dataset. Therefore, it seems like there are no issues with the dataset to be merged. tab CRESP00 _merge list MCSID PnumID if _merge ==1 & CRESP00 ==2 list MCSID PnumID if _merge ==2 & CRESP00 ==2 *I had a look at the HH grid dataset quickly to try and determine what Person number of the person split in M13659N by matching up the eligibility, response and relationship to CM data. It seems that there is no person with PNUM 4. I went and manually edited the dataset in the parent DV dataset so this matched up cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_hhgrid", clear clear cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_parent_derived.dta", clear replace CPNUM00 = 3 in 3812 replace PnumID = "M13659N3" in 3812 replace PNUM = 3 in 3812 save "H:\0000.MCS\July merge datasets\JM_mcs3_parent_derived.dta", replace *Making the merge between responder dv and responder cm interview cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_parent_derived.dta", clear merge 1:1 MCSID PnumID using "H:\0000.MCS\July merge datasets\JM_mcs3_parent_cm_interview" rename _merge merge_PDV_PCM3 *Check 3: Seems like the split family member has now merged properly. Do a quick final check on this with the syntax used before ****Yep, just the proxies and no interview responders now not merged. tab CRESP00 merge_PDV_PCM3 list MCSID PnumID if merge_PDV_PCM3 ==1 & CRESP00 ==2 list MCSID PnumID if merge_PDV_PCM3 ==2 & CRESP00 ==2 *Check 4: Do the combined results look like they go together? -- Look at random sample *All seem ok -- no obvious issues gen u = uniform() sort u list in 1/5 drop u *ROL = Resp Only Levels save "MCS3_ROL_Jmerge.dta" ******************************************************************************* ****** Wave 3 datasets merge - All datasets (already merged by level) ******** ******************************************************************************* ***************************************************************************** *****Merge #1 HH and CM only level data***** *HH = n=71,849 total observations; n=15,431 CM observations *CMOL = n=15,431 CM observations *I first need to make a variable I can merge on which is unique to each person -- a CM-based PID (already prepared in HH grid) cd "H:\0000.MCS\July merge datasets" use "MCS3_CMOL_Jmerge.dta", clear *Making a unique identifier for CMs for merging this and the HHgrid tab CCNUM00 sort MCSID gen CMnumID = MCSID + string(CCNUM00) if (CCNUM00 > 0) save "MCS3_CMOL_Jmerge.dta", replace *Check 1 ****All that were matched were the expected numbers. (n=56,418 not matched -- all the non-CMs) cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_hhgrid", clear merge 1:1 MCSID CMnumID using "MCS3_CMOL_Jmerge.dta" sort MCSID CMnumID by MCSID CMnumID: assert _N==1 tab CCNUM00 if _merge ==3 ****Check 2 *began by using my sample data and keeping just one observation for every value of personid. I don't care which observation I keep, I just need to keep one and only one. Then I merged on personid, keeping (1) the records that match and (2) the records from the master that do not match. I have no interest in the resulting dataset; I just wanted to see the table merge would report. TO CHECK IF EVERY VALUE OF MSCID THAT APPEARS IN FIRST DATASET ALSO APPEARS IN SECOND DATASET *Do this for each of the common variables across all datasets *For MCSID individually ****All ok -- expected cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_hhgrid", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "MCS3_CMOL_Jmerge.dta", keep(master match) *For CMnumID individually ****All ok -- again expected to have this many not merged as not in dataset because not CMs cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_hhgrid", clear sort CMnumID by CMnumID: keep if _n==1 merge 1:m CMnumID using "MCS3_CMOL_Jmerge.dta", keep(master match) **Completing the merge between HHgrid and CM level data** ****Check 3 cd "H:\0000.MCS\July merge datasets" use "H:\0000.MCS\July merge datasets\JM_mcs3_hhgrid", clear merge 1:1 MCSID CMnumID using "MCS3_CMOL_Jmerge.dta" rename _merge merge_HH_CMOL3 *Check 3: Do the combined results look like they go together? -- Look at random sample *All seem ok -- no obvious issues gen u = uniform() sort u list in 1/5 drop u save "MCS3_Final_Jmerge.dta" *****Merge #2 HH-CML and Responder only level data***** *71,849 in HH-CML *27,444 in RL *Check 1: Checks that merge numbers for the full datset by the chosen variables match up as expected, and there are no duplicates ****All that were matched were the expected numbers. (n=44,405 not matched -- all the non-responders (CM and non-CMs)) cd "H:\0000.MCS\July merge datasets" use "MCS3_Final_Jmerge.dta", clear merge 1:1 MCSID PnumID using "MCS3_ROL_Jmerge.dta" sort MCSID PnumID by MCSID PnumID: assert _N==1 clear ****Check 2: Checks if every value of MSCID and person ID appears in both datasets individually *Do this for each of the common variables across all datasets *For MCSID individually ****All MCSIDs matched (27,444) cd "H:\0000.MCS\July merge datasets" use "MCS3_Final_Jmerge.dta", clear sort MCSID by MCSID: keep if _n==1 merge 1:m MCSID using "MCS3_ROL_Jmerge.dta", keep(master match) clear *For PnumID individually ****All ok -- expected cd "H:\0000.MCS\July merge datasets" use "MCS3_Final_Jmerge.dta", clear sort PnumID by PnumID: keep if _n==1 merge 1:m PnumID using "MCS3_ROL_Jmerge.dta", keep(master match) **Completing the merge between HH-CML and Responder level data** ****Check 3 cd "H:\0000.MCS\July merge datasets" use "MCS3_Final_Jmerge.dta", clear merge 1:1 MCSID PnumID using "MCS3_ROL_Jmerge.dta" rename _merge merge_HHCM_ROL3 *Check 3: Do the combined results look like they go together? -- Look at random sample *All seem ok -- no obvious issues gen u = uniform() sort u list in 1/5 drop u sort MCSID CPNUM00 save "MCS3_Final_Jmerge.dta", replace *****Merge #3 HH-CML-RL and single family level data***** *71,849 in HH-CML-RL (15,236 families) *15,246 in SFL *Check 1: Checks that merge numbers for the full datset by MCSID match up as expected. Judge the amount of 'contradictions' represents the number of families in the HH-CML-RL dataset *Also did it the other way around -- which makes more sense as the SFL dataset has more families *Just stated "MCSID does not uniquely identify observations in the using data" - which happens because some families only have one line, while the ones added in from HH-CML-RL have more than one line per family. ****15,236 families notes as contradictions in 15,246 by-groups -- there are 10 families which seem to not be included from GLD dataset cd "H:\0000.MCS\July merge datasets" use "MCS3_SFL_Jmerge", clear merge 1:1 MCSID using "MCS3_Final_Jmerge.dta" sort MCSID by MCSID: assert _N==1 clear use "MCS3_Final_Jmerge.dta", clear merge 1:1 MCSID using "JM_mcs3_geographically_linked_data" sort MCSID by MCSID: assert _N==1 *Check 2: Checks if every value of MSCID appears in both datasets (supporting the above check) *Did it the other way around because the SFL data had more families ****All matched up ok, but I know there was a 10 family difference in previous check cd "H:\0000.MCS\July merge datasets" use "MCS3_SFL_Jmerge", clear sort MCSID merge 1:m MCSID using "MCS3_Final_Jmerge.dta" list MCSID if _merge ==1 clear *As I knew these were from the GLD data, I looked up the MCSIDs I had found before when merging the SFL data at wave 3. ****Found they all had no data but in the dataset as expected cd "H:\0000.MCS\July merge datasets" use "MCS3_Final_Jmerge.dta", clear sort MCSID list if MCSID == "M13234R" list if MCSID == "M15310U" list if MCSID == "M19649L" list if MCSID == "M21294A" list if MCSID == "M22583L" list if MCSID == "M22911G" list if MCSID == "M24550K" list if MCSID == "M25458U" list if MCSID == "M26360P" list if MCSID == "M33835W" **Completing the merge between HH-CML and Single family level data** *Rather than doing a 1:1 merge (as it wouldn't let me, and doesn't make sense), I did a 1:m merge -- this gives each of the family members the SFL data on their line-- So will need to keep this in mind when doing any exploration/analyses with the GLD variables. cd "H:\0000.MCS\July merge datasets" use "MCS3_SFL_Jmerge", clear merge 1:m MCSID using "MCS3_Final_Jmerge.dta" sort MCSID *Saved the merge variable for this merge rename _merge merge_HHCMR_SFL3 *Check 3: Do the combined results look like they go together? -- Look at random sample *All seem ok -- no obvious issues gen u = uniform() sort u list in 1/5 drop u sort MCSID PnumID save "MCS3_Final_Jmerge.dta", replace ******************************************************************************* *************************Wave 2 and 3 new datasets merge*********************** ******************************************************************************* *72,868 in MCS2 *71,859 in MCS3 *Check 1: Checks that merge numbers for the full datset by MCSID and Person number match up as expected. ****60,999 matched by MCSID and PnumID ****11,869 in MCS2 not matched ****10,860 in MCS3 not matched ****With N==1 assertion by MCSID, 17,033 contradictions in 19,243 (I think this is # families) by-groups ****With N==1 assertion by PnumID, 1 contradictions in 83,719 (I think this is individuals) by-groups -- ****With N==1 assertion by MCSID PnumID, 0 contradictions (I think this is family x individual) cd "H:\0000.MCS\July merge datasets" use "MCS3_Final_Jmerge.dta", clear merge 1:1 MCSID PnumID using "MCS2_Final_Jmerge.dta" sort MCSID by MCSID: assert _N==1 sort PnumID by PnumID: assert _N==1 sort MCSID PnumID by MCSID PnumID: assert _N==1 *With the above it was weird there was 1 contradiction. So created a varaible to see where duplicates were. ****This showed up with the 10 families that were present in the wave 3 geographically linked data, but found nowhere else (M22583L, M21294A, M13234R, M26360P, M25458U, M22911G, M19649L, M15310U, M24550K, M33835W). cd "H:\0000.MCS\July merge datasets" use "MCS3_Final_Jmerge.dta", clear sort PnumID quietly by PnumID: gen dup = cond(_N==1,0,_n) tab dup list MCSID if dup != 0 *Dropped these to see if the PnumID contradiction would go away ****It did drop if dup != 0 sort PnumID by PnumID: assert _N==1 clear *Check 2: Checks if every value of MSCID appears in both datasets (supporting the above check) ****Expected number not matched: ****60,999 matched ****11,869 in MCS2 not matched ****10,860 in MCS3 not matched *Adding this step in below because it won't let me merge with blank values cd "H:\0000.MCS\July merge datasets" use "MCS3_Final_Jmerge.dta", clear sort MCSID sort PnumID replace PnumID = "M26360P_GLD" if MCSID == "M26360P" replace PnumID = "M15310U_GLD" if MCSID == "M15310U" replace PnumID = "M13234R_GLD" if MCSID == "M13234R" replace PnumID = "M33835W_GLD" if MCSID == "M33835W" replace PnumID = "M22911G_GLD" if MCSID == "M22911G" replace PnumID = "M24550K_GLD" if MCSID == "M24550K" replace PnumID = "M22583L_GLD" if MCSID == "M22583L" replace PnumID = "M19649L_GLD" if MCSID == "M19649L" replace PnumID = "M21294A_GLD" if MCSID == "M21294A" replace PnumID = "M25458U_GLD" if MCSID == "M25458U" merge 1:m PnumID using "MCS2_Final_Jmerge.dta" clear **Completing the merge between Wave 2 and Wave 3 data** cd "H:\0000.MCS\July merge datasets" use "MCS3_Final_Jmerge.dta", clear sort PnumID replace PnumID = "M26360P_GLD" if MCSID == "M26360P" replace PnumID = "M15310U_GLD" if MCSID == "M15310U" replace PnumID = "M13234R_GLD" if MCSID == "M13234R" replace PnumID = "M33835W_GLD" if MCSID == "M33835W" replace PnumID = "M22911G_GLD" if MCSID == "M22911G" replace PnumID = "M24550K_GLD" if MCSID == "M24550K" replace PnumID = "M22583L_GLD" if MCSID == "M22583L" replace PnumID = "M19649L_GLD" if MCSID == "M19649L" replace PnumID = "M21294A_GLD" if MCSID == "M21294A" replace PnumID = "M25458U_GLD" if MCSID == "M25458U" sort MCSID PnumID merge 1:1 MCSID PnumID using "MCS2_Final_Jmerge.dta" sort MCSID PnumID *Saved the merge variable for this merge rename _merge merge_2_3 *Check 3: Do the combined results look like they go together? -- Look at random sample *All seem ok -- no obvious issues gen u = uniform() sort u list in 1/5 drop u *Check 4: Do the combined results look like they go together? -- Non-random (observations most likely to be mistakenly combined) *Did language and behaviour ****From looking at this, some of the children seemed to have very large differences between wave 2 and 3 Tscores. I do think this is because those tests can just differ. There did not seem to be any non-cms which had scores where they shouldn't have. gen diff = CCNVTSCORE - BDBAST00 sort diff list in 1/5 list in -5/l drop diff sort MCSID PnumID sort CMNUM gen diff = CEBDTOT - BEBDTOT sort diff list in 1/5 list in -5/l drop diff sort MCSID PnumID save "MCS_2and3_Jmerge" *Note: Checked the "productiveness" of the data matched up (i.e. not productive being in master dataset only). ****10 of the families (M13234R, M15310U, M19649L, M21294A, M22583L, M22911G, M24550K, M25458U, M26360P, M33835W) were coded as "productive" codes, even though they had no further data. ****3,653 families is 19,243 in longitudinal family file) -15,590 (productive families at wave 2); when seeing 3,564 Families that were not matched from the wave 2 dataset *This 1 family difference is explained in the clean up dataset*. -- ****the 10,860 from "count if BAOUTC00 != 1 & merge_2_3 ==3" obviously exist because these are wave 3 families and do not have any coding (i.e. are missing data) for the wave 2 variable. tab BAOUTC00 tab BAOUTC00 merge_HHCMR_SFL2 list MCSID if BAOUTC00 == 1 & merge_HHCMR_SFL2==1 count if BAOUTC00 != 1 & merge_2_3 ==1 count if BAOUTC00 != 1 & merge_2_3 ==2 count if BAOUTC00 != 1 & merge_2_3 ==3