cross-national


End-to-end cross-national comparison study using KNHANES + NHANES + CHNS (or other parallel surveys). Variable harmonization, parallel weighted analysis, and comparison tables. Supports 2-country (KR+US) and 3-country (KR+US+CN) designs.

By Aperivue. Updated 6/3/2026.

name: cross-national
description: End-to-end cross-national comparison study using KNHANES + NHANES + CHNS (or other parallel surveys). Variable harmonization, parallel weighted analysis, and comparison tables. Supports 2-country (KR+US) and 3-country (KR+US+CN) designs.
triggers: cross-national, 한미 비교, Korea US comparison, KNHANES NHANES, 양국 비교, binational, cross-country, 비교연구, 3국 비교, CHNS, 한미중
tools: Read, Write, Edit, Bash, Grep, Glob
model: opus

Cross-National Comparison Study Skill

You are assisting a medical researcher in conducting a cross-national comparison study using parallel nationally representative surveys (e.g., KNHANES for Korea, NHANES for the US, CHNS for China).

When to Use

  • Researcher has a clinical question to compare across two or three countries
  • KNHANES + NHANES data available (or other parallel survey pairs)
  • Goal: produce a complete analysis with country-stratified results + comparison table

Inputs

  1. Research question: exposure → outcome association to compare across countries
  2. Korean data path: KNHANES CSV file
  3. US data path: NHANES CSV directory (multiple tables to merge)
  4. Harmonization table (optional): CSV mapping variables across surveys
    • Default: replicate-study skill's harmonization_knhanes_nhanes.csv

Reference Files

  • Harmonization table: medsci-skills/skills/replicate-study/references/harmonization_knhanes_nhanes.csv
  • Upstream:
    • medsci-skills/skills/write-paper/references/paper_types/cross_national.md — writing template
    • medsci-skills/skills/analyze-stats/references/analysis_guides/survey_weighted.md

Workflow

Phase 1: Study Definition

  1. Confirm research question: Exposure → Outcome
  2. Define variable coding for both countries:
    • Exposure: PHQ-9, BMI category, smoking, etc.
    • Outcome: diabetes, hypertension, mortality, etc.
    • Covariates: age, sex, education, income, smoking, alcohol, obesity, CVD
  3. Check harmonization table for variable availability
  4. Output: study protocol summary for user approval

Phase 2: Data Preparation

KNHANES (single CSV):

  1. Load CSV, filter age ≥20 (or per protocol)
  2. Derive variables using KNHANES coding:
    • Smoking: BS3_1 (1,2=current, 3=former, 8=never)
    • Alcohol: BD1_11 (2-6=frequent, 1=occasional, 8=never)
    • Obesity: HE_obe (≥4=obesity for BMI≥25 Asian cutoff)
    • PHQ-9: BP_PHQ_1~9, sum score, ≥10=depression
    • Diabetes: HE_glu≥126 | HE_HbA1c≥6.5 | DE1_dg=1
    • CVD: DI4_dg=1 | DI5_dg=1 | DI6_dg=1
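  3. Set survey design: svydesign(id=~psu, strata=~kstrata, weights=~wt_itvex, nest=TRUE, data=knhanes). The id and strata arguments must be one-sided formulas; knhanes is a placeholder name for the prepared data frame.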
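
The KNHANES derivation rules in step 2 can be sketched as a small function. This is an illustrative Python version of the logic only (the skill's own scripts are written in R). It assumes PHQ-9 items are scored 0-3 with any other value treated as missing, and that a composite outcome is negative when at least one criterion is observed and none is positive. Both assumptions are mine, not the source's, and the function names are hypothetical.

```python
from typing import Optional

PHQ_ITEMS = [f"BP_PHQ_{i}" for i in range(1, 10)]  # BP_PHQ_1 ... BP_PHQ_9


def any_positive(flags: list) -> Optional[bool]:
    """Three-valued OR: True if any flag is True, None if nothing is observed."""
    known = [f for f in flags if f is not None]
    if any(known):
        return True
    return False if known else None


def at_least(value, cutoff) -> Optional[bool]:
    """value >= cutoff, propagating missing values as None."""
    return None if value is None else value >= cutoff


def equals(value, code) -> Optional[bool]:
    """value == code, propagating missing values as None."""
    return None if value is None else value == code


def derive_knhanes(row: dict) -> dict:
    """Derive analysis variables from one KNHANES record (codings from step 2)."""
    smoking_map = {1: "current", 2: "current", 3: "former", 8: "never"}
    alcohol_map = {1: "occasional", 8: "never"}
    alcohol_map.update({code: "frequent" for code in range(2, 7)})  # codes 2-6

    items = [row.get(name) for name in PHQ_ITEMS]
    # Assumption: only 0-3 are valid item scores; anything else means missing.
    phq9 = sum(items) if all(v in (0, 1, 2, 3) for v in items) else None

    return {
        "smoking": smoking_map.get(row.get("BS3_1")),
        "alcohol": alcohol_map.get(row.get("BD1_11")),
        "obesity": at_least(row.get("HE_obe"), 4),
        "phq9": phq9,
        "depression": at_least(phq9, 10),
        "diabetes": any_positive([
            at_least(row.get("HE_glu"), 126),
            at_least(row.get("HE_HbA1c"), 6.5),
            equals(row.get("DE1_dg"), 1),
        ]),
        "cvd": any_positive(
            [equals(row.get(k), 1) for k in ("DI4_dg", "DI5_dg", "DI6_dg")]
        ),
    }
```

Unlisted codes (for example 9 for "don't know") fall through to None, so they surface as missing rather than being silently assigned to a category.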

NHANES (multiple CSVs):

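  1. Load and merge tables by SEQN (DEMO_J, DPQ_J, GHB_J, BIOPRO_J, BMX_J, BPXO_J, SMQ_J, ALQ_J, DIQ_J, MCQ_J, BPQ_J)
  2. Derive variables using NHANES coding:
    • Smoking: SMQ020 + SMQ040 (smoked ≥100 cigarettes in lifetime + smokes now)
    • Alcohol: ALQ111 + ALQ121 (ever drank + past-12-month frequency → categories)
    • Obesity: BMXBMI ≥30 (WHO cutoff, NOT Asian)
    • PHQ-9: DPQ010~DPQ090, sum score, ≥10=depression
    • Diabetes: LBXSGL≥126 | LBXGH≥6.5 | DIQ010=="Yes" (CRITICAL: the variable is LBXSGL, not LBXSGLU)
    • CVD: MCQ160B=="Yes" (CHF) | MCQ160C=="Yes" (CHD) | MCQ160D=="Yes" (angina) | MCQ160E=="Yes" (MI)
    • HTN: BPXOSY3≥140 | BPXODI3≥90 | BPQ020=="Yes" (BPXOSY3 and BPXODI3 are the third oscillometric readings, from BPXO_J)
  3. Set survey design: svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=TRUE, data=nhanes) for the single 2017-2018 (_J) cycle. Use weights=~WTMECPRP only with the pooled pre-pandemic files. The id and strata arguments must be one-sided formulas; nhanes is a placeholder name for the merged data frame.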
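
Because nhanesA returns text labels, the NHANES derivations need a label-to-number step before any scoring. The following Python sketch illustrates that logic for PHQ-9, smoking, and diabetes. The function names are hypothetical, and treating "Borderline", "Refused", and "Don't know" answers as not diagnosed or missing is an assumption to confirm against the protocol.

```python
from typing import Optional

PHQ_LABELS = {
    "Not at all": 0,
    "Several days": 1,
    "More than half the days": 2,
    "Nearly every day": 3,
}
DPQ_ITEMS = [f"DPQ{n:03d}" for n in range(10, 100, 10)]  # DPQ010 ... DPQ090


def phq9_score(row: dict) -> Optional[int]:
    """Sum the nine items; any unmapped label makes the score missing."""
    scores = [PHQ_LABELS.get(row.get(item)) for item in DPQ_ITEMS]
    return None if any(s is None for s in scores) else sum(scores)


def smoking_status(row: dict) -> Optional[str]:
    """SMQ020 (ever smoked 100 cigarettes) combined with SMQ040 (smokes now)."""
    if row.get("SMQ020") == "No":
        return "never"
    if row.get("SMQ020") == "Yes":
        now = row.get("SMQ040")
        if now in ("Every day", "Some days"):
            return "current"
        if now == "Not at all":
            return "former"
    return None


def has_diabetes(row: dict) -> Optional[bool]:
    """LBXSGL >= 126 or LBXGH >= 6.5 or DIQ010 == "Yes"."""
    glucose, hba1c, dx = row.get("LBXSGL"), row.get("LBXGH"), row.get("DIQ010")
    if glucose is not None and glucose >= 126:
        return True
    if hba1c is not None and hba1c >= 6.5:
        return True
    if dx == "Yes":
        return True
    if glucose is None and hba1c is None and dx is None:
        return None  # nothing observed
    return False  # assumption: "Borderline" and "No" both count as not diagnosed
```

Comparing against numeric codes such as `SMQ020 == 1` would silently produce all-missing results on nhanesA data, which is why every comparison here uses the text labels.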

Phase 3: Parallel Analysis

For EACH country independently:

  1. Table 1: Baseline characteristics by exposure (weighted counts + percentages)
  2. Main analysis: Sequential logistic regression models
    • Model 1 (unadjusted)
    • Model 2 (age + sex)
    • Model 3 (fully adjusted: + education, income, smoking, alcohol, obesity, CVD)
  3. Subgroup analyses: By sex, age group, education, income, alcohol, smoking, CVD, obesity
  4. Dose-response (if applicable): restricted cubic splines (RCS) with 3 knots
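
One way to keep the three models identical across countries is to generate the formulas from a single shared specification and pass the same strings to each country's script. The Python sketch below builds R-style formula strings. The column names (age, sex, education, and so on) are hypothetical harmonized names, not variables defined by the source.

```python
BASE_COVARIATES = ["age", "sex"]
FULL_COVARIATES = ["education", "income", "smoking", "alcohol", "obesity", "cvd"]


def sequential_formulas(outcome: str, exposure: str) -> dict:
    """Return R-style formulas for Model 1 (unadjusted) through Model 3 (full)."""
    specs = {
        "model1": [exposure],
        "model2": [exposure] + BASE_COVARIATES,
        "model3": [exposure] + BASE_COVARIATES + FULL_COVARIATES,
    }
    return {
        name: f"{outcome} ~ {' + '.join(terms)}" for name, terms in specs.items()
    }


if __name__ == "__main__":
    for name, formula in sequential_formulas("diabetes", "depression").items():
        print(name, formula)
```

Each string could then be handed to a survey-weighted logistic model in the country-specific R script, so that any change to the covariate set applies to every country at once.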

Phase 4: Cross-National Comparison Table

Generate a side-by-side comparison:

| Analysis | Korea wOR (95% CI) | US wOR (95% CI) | Direction Agreement |
|---|---|---|---|
| Overall (fully adjusted) | ... | ... | ✓/✗ |
| Male | ... | ... | |
| Female | ... | ... | |
| ... | ... | ... | |
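
The source does not define "direction agreement". A minimal sketch, assuming it means the two point estimates fall on the same side of 1, is shown below in Python. The helper names are hypothetical, and the numbers in the usage note are made-up placeholders, not results.

```python
def direction(odds_ratio: float) -> str:
    """Classify an odds ratio by which side of 1 it falls on."""
    if odds_ratio > 1:
        return "positive"
    if odds_ratio < 1:
        return "inverse"
    return "null"


def direction_agreement(or_kr: float, or_us: float) -> str:
    """Assumed definition: both point estimates point the same way."""
    return "✓" if direction(or_kr) == direction(or_us) else "✗"


def comparison_row(label: str, kr: tuple, us: tuple) -> list:
    """Build one table row; kr and us are (OR, lower, upper) from model output."""
    def fmt(estimate: tuple) -> str:
        point, lower, upper = estimate
        return f"{point:.2f} ({lower:.2f}-{upper:.2f})"

    return [label, fmt(kr), fmt(us), direction_agreement(kr[0], us[0])]
```

For example, `comparison_row("Overall", (1.50, 1.10, 2.00), (0.90, 0.70, 1.20))` yields a row ending in ✗. A stricter definition could also require that the confidence intervals exclude 1 in both countries.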

Phase 5: Output Files

{working_dir}/
├── cross_national_report.md    — Study summary + comparison tables
├── variable_mapping.csv        — Variable mapping with match status
├── analysis_korea.R            — KNHANES analysis (self-contained)
├── analysis_us.R               — NHANES analysis (self-contained)
├── results/
│   ├── table1_korea.csv
│   ├── table1_us.csv
│   ├── main_results_comparison.csv
│   └── subgroup_comparison.csv
└── manuscript_draft/           — Optional: Methods + Results draft
    ├── methods_draft.md
    └── results_draft.md

Critical Rules

  1. NEVER pool data across countries. Each country is analyzed with its own survey design.
  2. Country-specific BMI cutoffs: Korea ≥25 (Asian), US ≥30 (WHO), China ≥28 (WGOC).
  3. Country-specific income: KNHANES quartile, NHANES PIR → harmonize to binary.
  4. Weighted analysis mandatory: Both KNHANES and NHANES are complex surveys. CHNS has no formal weights; use cluster-robust SE by COMMID.
  5. Document all harmonization decisions: what matches, what needed recoding, what differs.
  6. Same analytic approach: Identical model specifications for every country for fair comparison.

KNHANES Variable Coding Reference (validated via Joo 2026 replication)

| Variable | Raw Var | Coding |
|---|---|---|
| Smoking | BS3_1 | 1,2=Current; 3=Former; 8=Never |
| Alcohol | BD1_11 | 2-6=Frequent (current drinker); 1=Occasional (past-year abstainer); 8=Never |
| Obesity | HE_obe | 1-3=Non-obese; 4-6=Obesity (BMI≥25) |
| Depression | BP_PHQ_1~9 | Sum ≥10 = depression |
| Diabetes | HE_glu, HE_HbA1c, DE1_dg | FPG≥126 or HbA1c≥6.5 or DE1_dg=1 |
| CVD | DI4_dg, DI5_dg, DI6_dg | Any = 1 → CVD yes |
| Education | edu | 1-3=Non-college; 4=College |
| Income | incm | 1-3=Lower three quartiles; 4=Top quartile |
| Survey design | kstrata, psu, wt_itvex | strata, cluster, weight |

NHANES Variable Coding Reference (validated via Joo 2026 cross-national)

CRITICAL: NHANES data downloaded via R nhanesA package uses TEXT LABELS, not numeric codes.

| Variable | Raw Var | Text Labels → Numeric |
|---|---|---|
| PHQ-9 items | DPQ010~DPQ090 | "Not at all"→0, "Several days"→1, "More than half the days"→2, "Nearly every day"→3 |
| Sex | RIAGENDR | "Male" / "Female" (NOT 1/2) |
| Smoking (100 cigs) | SMQ020 | "Yes" / "No" |
| Smoking (now) | SMQ040 | "Every day" / "Some days" / "Not at all" |
| Alcohol freq | ALQ121 | Text labels (see below) |
| Alcohol ever | ALQ111 | "Yes" / "No" |
| Education | DMDEDUC2 | 5 text levels, from "Less than 9th grade" to "College graduate or above" |
| Diabetes dx | DIQ010 | "Yes" / "No" / "Borderline" |
| CVD (CHF) | MCQ160B | "Yes" / "No" / "Don't know" |
| CVD (CHD) | MCQ160C | "Yes" / "No" / "Don't know" |
| CVD (angina) | MCQ160D | "Yes" / "No" / "Don't know" |
| CVD (MI) | MCQ160E | "Yes" / "No" / "Don't know" |
| Serum glucose | LBXSGL (BIOPRO_J) | Numeric (mg/dL); the name is NOT LBXSGLU. This is the biochemistry-profile glucose, measured regardless of fasting status; fasting plasma glucose is LBXGLU in GLU_J, which has its own fasting-subsample weight |
| HbA1c | LBXGH (GHB_J) | Numeric (%) |
| BMI | BMXBMI (BMX_J) | Numeric (kg/m²) |
| Weight | WTMEC2YR (single-cycle) or WTMECPRP (pre-pandemic pooled) | Numeric |
| Strata | SDMVSTRA | Numeric |
| PSU | SDMVPSU | Numeric |

ALQ121 Text Label Mapping (Alcohol Frequency)

  • Frequent (current drinker): Any specific frequency except "Never in the last year"
  • Occasional (past-year abstainer): "Never in the last year"
  • Never (lifetime non-drinker): ALQ111 == "No" (ALQ121 will be NA)
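
The three-way mapping above can be sketched in a few lines of Python. The function name is hypothetical, and treating "Refused" and "Don't know" as missing is an assumption.

```python
from typing import Optional

NON_RESPONSE = {"Refused", "Don't know"}


def alcohol_category(alq111: Optional[str], alq121: Optional[str]) -> Optional[str]:
    """Map ALQ111 (ever drank) and ALQ121 (past-year frequency) to 3 levels."""
    if alq111 == "No":
        return "never"  # lifetime non-drinker; ALQ121 is NA by skip pattern
    if alq121 is None or alq121 in NON_RESPONSE:
        return None  # assumption: non-response is treated as missing
    if alq121 == "Never in the last year":
        return "occasional"  # past-year abstainer
    return "frequent"  # any specific past-year frequency
```

Checking ALQ111 first matters: if ALQ121 were tested alone, lifetime non-drinkers would be dropped as missing instead of being coded "never".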

Additional KNHANES Variables (validated via LE8-Asthma replication)

| Variable | Raw Var | Coding |
|---|---|---|
| Asthma | DJ2_dg | 0=No, 1=Yes (physician dx), 9=Don't know → exclude |
| Asthma treatment | DJ2_pt | 0=No, 1=Yes, 8=N/A, 9=Don't know |
| Sleep (2017-18) | BP16_11/12/13/14 | Clock times, NOT hours! 11=bed hour, 12=bed min, 13=wake hour, 14=wake min. Calculate: duration = wake_time - bed_time (handle midnight crossing). 99=Don't know→NA |
| Sleep (2017-18 weekend) | BP16_21/22/23/24 | Same format as weekday |
| Sleep (2019-20) | BP16_1/2 | Direct sleep hours (weekday/weekend). 99=Don't know→NA |
| PA aerobic | pa_aerobic | 0=Doesn't meet, 1=Meets guidelines. Note: values are 0/1, NOT 1/2 |
| HTN treatment | DI1_pr | 1=Yes, 0=No (currently treating hypertension) |
| Dyslipidemia tx | DI3_pr | 1=Yes, 0=No (if available) |
| Non-HDL chol | HE_chol - HE_HDL_st2 | Derived: total cholesterol minus HDL |
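
The 2017-18 sleep items need a clock-time subtraction that survives midnight. A minimal Python sketch follows, assuming the hour fields use a 24-hour clock (an assumption to verify in the codebook) and that 99 in any field means the duration is missing. The function name is hypothetical.

```python
from typing import Optional

MINUTES_PER_DAY = 24 * 60


def knhanes_sleep_hours(bed_h, bed_m, wake_h, wake_m) -> Optional[float]:
    """Sleep duration in hours from BP16_11/12/13/14 style clock fields."""
    fields = (bed_h, bed_m, wake_h, wake_m)
    if any(v is None or v == 99 for v in fields):
        return None  # 99 = don't know
    bed = bed_h * 60 + bed_m
    wake = wake_h * 60 + wake_m
    # Modulo wraps a negative difference (bedtime before midnight) into range.
    return ((wake - bed) % MINUTES_PER_DAY) / 60
```

A bedtime of 23:30 and a wake time of 06:45 gives 7.25 hours, while 01:00 to 08:00 gives 7.0 hours without any special casing.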

Additional NHANES Variables (validated via LE8-Asthma replication)

| Variable | Raw Var | Coding |
|---|---|---|
| Asthma | MCQ010 | "Yes" / "No" (ever told by doctor) |
| Sleep hours | SLD012 | Numeric (hours/night on weekdays) |
| High BP diagnosis | BPQ020 | "Yes" / "No" (told by doctor, high BP) |
| Cholesterol treatment | BPQ100D | "Yes" / "No" (taking cholesterol Rx) |
| PA vigorous work | PAQ605/PAQ610/PAD615 | Yes/No, days/week, min/day |
| PA moderate work | PAQ620/PAQ625/PAD630 | Yes/No, days/week, min/day |
| PA walk/bike | PAQ635/PAQ640/PAD645 | Yes/No, days/week, min/day |
| PA vigorous rec | PAQ665/PAQ670/PAD675 | Yes/No, days/week, min/day |
| PA moderate rec | PAQ650/PAQ655/PAD660 | Yes/No, days/week, min/day |
| Dietary fiber | DR1TFIBE (DR1TOT_J) | Numeric (grams, day 1 recall) |
| Dietary sodium | DR1TSODI (DR1TOT_J) | Numeric (mg) |
| Dietary sat fat | DR1TSFAT (DR1TOT_J) | Numeric (grams) |
| Total energy | DR1TKCAL (DR1TOT_J) | Numeric (kcal) |
| Total sugars | DR1TSUGR (DR1TOT_J) | Numeric (grams) |
| Non-HDL chol | LBXTC - LBDHDD | Derived: TCHOL_J minus HDL_J |

CHNS Variable Coding Reference (validated via 3-country batch)

  • Data source: cpc.unc.edu/projects/china (free registration)
  • Biomarker wave: 2009 only (N=9,549). Other variables available 1989-2015.
  • Survey design: No formal weights. Use svydesign(id=~COMMID, weights=~1) or cluster-robust SE.

Key Files and Merge Strategy

| File | Key Variables | Join Key |
|---|---|---|
| mast_pub_12 | IDind, GENDER (1=M/2=F), WEST_DOB_Y (birth year) | IDind |
| pexam_00 | HEIGHT, WEIGHT, U10 (waist), SYSTOL1-3, DIASTOL1-3, U22 (HBP dx), U24 (HBP meds), U24A (DM dx), U25 (ever smoked), U27 (still smokes), U40 (alcohol), U41 (freq), U48A (self-health), COMMID | IDind + filter WAVE==2009 |
| biomarker_09 | GLUCOSE_MG, HbA1c, TC_MG, TG_MG, HDL_C_MG, LDL_C_MG, HS_CRP, HGB, WBC, ALT, CRE_MG | IDind |
| educ_12 | A12 (education 0-6) | IDind + filter WAVE==2009 |
| indinc_10 | indwage (yuan, continuous → quartiles) | IDind + filter wave==2009 |
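
The merge strategy in the table can be sketched as a left join onto the 2009 physical-exam rows. The Python version below uses plain lists of dicts so it runs without extra packages. Using pexam_00 as the base table is an assumption, and the function names are hypothetical.

```python
def index_by_id(rows, wave=None, wave_key="WAVE"):
    """Index rows by IDind, optionally keeping only one survey wave."""
    indexed = {}
    for row in rows:
        if wave is not None and row.get(wave_key) != wave:
            continue
        indexed[row["IDind"]] = row
    return indexed


def merge_chns(pexam, mast, biomarker, educ, indinc, wave=2009):
    """Left-join the other CHNS files onto the chosen wave of pexam_00."""
    base = index_by_id(pexam, wave)
    lookups = [
        index_by_id(mast),                           # time-invariant, no filter
        index_by_id(biomarker),                      # 2009 only by construction
        index_by_id(educ, wave),
        index_by_id(indinc, wave, wave_key="wave"),  # lowercase key in this file
    ]
    merged = {}
    for idind, row in base.items():
        record = dict(row)
        for lookup in lookups:
            for key, value in lookup.get(idind, {}).items():
                record.setdefault(key, value)  # never overwrite pexam values
        merged[idind] = record
    return merged
```

Participants missing from a secondary file keep their exam record and simply lack those columns, which makes the amount of missingness per source visible before any exclusion is applied.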

Variable Coding

| Variable | Raw Var | Coding | Notes |
|---|---|---|---|
| Sex | GENDER | 1=Male, 2=Female | Same numeric coding as KNHANES; NHANES via nhanesA returns "Male" / "Female" text labels |
| Age | WEST_DOB_Y | age = wave_year - WEST_DOB_Y | Integer truncation |
| BMI | HEIGHT, WEIGHT | WEIGHT / (HEIGHT/100)^2 | Obesity: BMI ≥ 28 (WGOC, NOT 25 or 30) |
| Waist | U10 | cm, direct measurement | Central obesity: ≥90M / ≥80F (IDF-Asian) |
| SBP | SYSTOL1-3 | mean(SYSTOL1, SYSTOL2, SYSTOL3) | 3 readings averaged |
| DBP | DIASTOL1-3 | mean(DIASTOL1, DIASTOL2, DIASTOL3) | 3 readings averaged |
| HBP diagnosed | U22 | 0=No, 1=Yes, 9=Don't know (→NA) | |
| HBP medication | U24 | 0=No, 1=Yes | |
| DM diagnosed | U24A | 0=No, 1=Yes, 9=Don't know (→NA) | |
| Smoking | U25 + U27 | never(U25==0) / former(U25==1 & U27==0) / current(U25==1 & U27==1) | |
| Alcohol | U40 + U41 | never(U40==0) / occasional(U41≥4) / frequent(U41≤3, ≥1x/week) | U41: 1=daily, 2=3-4x/wk, 3=1-2x/wk, 4=1-2x/mo, 5=<1x/mo |
| Education | A12 | 0=none, 1=primary, 2=lower-mid, 3=upper-mid, 4=technical, 5=university, 6=master+ | Recode: 0-2→low, 3-4→mid, 5-6→high |
| Income | indwage | Continuous yuan → quartiles within wave | |
| Glucose | GLUCOSE_MG | mg/dL (also GLUCOSE in mmol/L) | 2009 only |
| HbA1c | HbA1c | % (direct) | 2009 only |
| TC | TC_MG | mg/dL | 2009 only |
| TG | TG_MG | mg/dL | 2009 only |
| HDL | HDL_C_MG | mg/dL | 2009 only |
| hsCRP | HS_CRP | mg/L | 2009 only |
| Hemoglobin | HGB | g/L (divide by 10 for g/dL) | Unit differs from KR/US |
| Self-health | U48A | Self-reported health status | 2004-2011 |
| Depression | NOT AVAILABLE in standard download | CES-D exists but needs separate dataset | Cannot directly compare with PHQ-9 |
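
The CHNS codings in the table translate into straightforward derivations. The Python sketch below covers BMI with the WGOC cutoff, smoking, alcohol, education, and mean blood pressure. Averaging whichever readings are present when fewer than three exist is an assumption, and the function names are hypothetical.

```python
from typing import Optional


def mean_available(values) -> Optional[float]:
    """Average the non-missing readings; None if none were taken."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def recode_education(a12) -> Optional[str]:
    """A12 0-2 -> low, 3-4 -> mid, 5-6 -> high."""
    if a12 in (0, 1, 2):
        return "low"
    if a12 in (3, 4):
        return "mid"
    if a12 in (5, 6):
        return "high"
    return None


def derive_chns(row: dict) -> dict:
    """Derive harmonized variables from one merged CHNS 2009 record."""
    height, weight = row.get("HEIGHT"), row.get("WEIGHT")
    bmi = weight / (height / 100) ** 2 if height and weight else None

    u25, u27 = row.get("U25"), row.get("U27")
    if u25 == 0:
        smoking = "never"
    elif u25 == 1 and u27 in (0, 1):
        smoking = "current" if u27 == 1 else "former"
    else:
        smoking = None

    u40, u41 = row.get("U40"), row.get("U41")
    if u40 == 0:
        alcohol = "never"
    elif u41 in (1, 2, 3):  # at least once a week
        alcohol = "frequent"
    elif u41 in (4, 5):     # monthly or less
        alcohol = "occasional"
    else:
        alcohol = None

    return {
        "bmi": bmi,
        "obesity": None if bmi is None else bmi >= 28,  # WGOC cutoff
        "smoking": smoking,
        "alcohol": alcohol,
        "education": recode_education(row.get("A12")),
        "sbp": mean_available([row.get(f"SYSTOL{i}") for i in (1, 2, 3)]),
        "dbp": mean_available([row.get(f"DIASTOL{i}") for i in (1, 2, 3)]),
    }
```

Codes outside the documented ranges, such as 9 for "don't know", fall through to None in every branch.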

CHNS-Specific Warnings

  1. No survey weights: CHNS is NOT a formally weighted survey. Use unweighted analysis with cluster-robust SE by COMMID. Report as limitation.
  2. Biomarker = 2009 only: Glucose, HbA1c, lipids, hsCRP available only in 2009 wave. Other waves lack lab data.
  3. CES-D not in standard download: Depression comparison requires separate dataset download from cpc.unc.edu.
  4. BMI cutoff ≠ KR ≠ US: China=28, Korea=25, US=30. Use country-specific cutoffs AND a sensitivity analysis that applies one common cutoff to every country (BMI ≥25, the WHO overweight threshold).
  5. SES-health gradient may reverse: Low education and low income are NOT always risk factors in China (null/protective). This is the "developing country health transition" — do NOT treat as a bug.
  6. Hemoglobin unit: CHNS reports g/L (KR/US report g/dL). Divide by 10 when comparing.
  7. Education scale: 7-level (0-6) vs KR 4-level vs US 5-level. Harmonize to 3-level for comparison.

Composite Score Replication Warnings (learned from LE8 replication)

  1. BMI cutoff mismatch: LE8 scores BMI <25 (the WHO normal range) as ideal, which classifies most Koreans as "ideal", so the health-factor subscore loses BMI discriminatory power in Asian populations. Report this limitation.
  2. KNHANES sleep = clock times: BP16_11-14 are bedtime/waketime (hour:min), NOT sleep duration. Must compute wake_time - bed_time with midnight crossing.
  3. pa_aerobic codes: Values are 0/1 (not 1/2). Binary → MET-hours approximation is coarse.
  4. Diet quality scoring: AHEI-2010 requires detailed food group data; nutrient-based proxy gives different distribution. Recommend downloading NHANES DR1TOT_J for dietary recall nutrients.
  5. LE8 sensitivity to implementation: Small scoring differences compound across 8 components → overall score can diverge substantially, especially in the "moderate" range where most people cluster.

Anti-Hallucination

  • Never fabricate variable names, dataset column names, or variable codings. If a variable mapping is uncertain, output [VERIFY: variable_name] and ask the user to confirm against the data dictionary.
  • Never fabricate statistical results — no invented p-values, effect sizes, confidence intervals, or sample sizes. All numbers must come from executed code output.
  • Never generate references from memory. Use /search-lit for all citations.
  • If a function, package, or API does not exist or you are unsure, say so explicitly rather than guessing.
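
The first rule can be backed by a mechanical check before any derivation runs. The Python sketch below compares the variables a protocol needs against the columns actually present and emits the [VERIFY: variable_name] marker for each gap. The function name is hypothetical.

```python
def verify_columns(required, available):
    """Return a [VERIFY: name] marker for every required column not in the data."""
    present = set(available)
    return [f"[VERIFY: {name}]" for name in required if name not in present]


if __name__ == "__main__":
    needed = ["SEQN", "LBXSGL", "LBXGH", "DIQ010"]
    columns = ["SEQN", "LBXGH", "DIQ010", "LBXSGLU"]
    for marker in verify_columns(needed, columns):
        print(marker)  # ask the user to confirm against the data dictionary
```

An empty list means every name was found in the data; it does not confirm that the coding is what the protocol assumes, so the data dictionary check still applies.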
Install via CLI
npx skills add https://github.com/Aperivue/medsci-skills --skill cross-national