survey-design - SKILL.md Agent Skill

name: survey-design
description: "Design survey instruments: questions, scales, flow, social desirability."
argument-hint: "[describe your survey question or design challenge]"

Survey Instrument Designer

Instructions

1. Question Construction

Item-Specific Wording: Frame questions with item-specific response options rather than agree/disagree, true/false, or yes/no formats. Instead of "Do you agree that immigration benefits the economy?" use "How much does immigration benefit or harm the economy?" with a substantive scale. This reduces acquiescence bias and forces respondents to process the item content (Stantcheva 2023).
Open-Ended vs. Closed-Ended: Use open-ended questions to discover respondent frames and vocabulary before designing closed-ended items. Deploy open-ended items in pilots to generate response categories, then convert to closed-ended for the main study. In the main survey, reserve open-ended items for exploratory or manipulation-check purposes.
Behavioral vs. Attitudinal: Prefer behavioral measures (what respondents would do) over attitudinal measures (what respondents feel) when the research question concerns real-world consequences. Attitudinal items are appropriate when the construct of interest is itself an attitude, but note that attitude-behavior gaps are well documented.
Avoid Double-Barreled Questions: Each item should measure exactly one construct. "Do you support increased immigration and refugee resettlement?" conflates two distinct policy domains. Split into separate items.
Avoid Leading and Loaded Language: Avoid terms that signal a "correct" answer or carry strong normative connotations. Pilot-test whether question framing shifts responses -- if it does, the wording is a treatment, not a measure (Stantcheva 2023).
Numeric vs. Qualitative Response Options: For cross-country or cross-group comparisons, prefer qualitative response options ("a lot," "somewhat," "not at all") over exact numeric quantities. Specific numbers carry different informational weight across contexts -- "$50,000 income" means different things in the US and South Korea (Stantcheva 2023).

2. Scale Design

Number of Scale Points: 5- to 7-point scales are a common convention for attitudinal items, trading off discrimination against cognitive load. Fewer points can lose meaningful variance; more points may not add measurement precision. Reliability depends more on whether each point is meaningfully labeled than on the raw count; assess test-retest and internal consistency for the target construct rather than defaulting to a fixed number. For knowledge or factual questions, binary or categorical formats are often sufficient.
Labeled vs. Endpoint-Only: Label all scale points when feasible. Fully labeled scales reduce respondent uncertainty about the meaning of intermediate values and improve cross-respondent comparability.
Unipolar vs. Bipolar: Match scale polarity to the construct. Bipolar scales (oppose--support) suit constructs with a natural midpoint. Unipolar scales (not at all--extremely) suit constructs with a natural zero point (e.g., frequency, intensity).
Feeling Thermometers: Use with caution. Feeling thermometers (0--100) introduce measurement noise because respondents interpret the scale differently. They are useful for relative comparisons across targets within respondents but unreliable for absolute-level interpretation across respondents.
Index Construction: When combining multiple items into an index, assess internal consistency (Cronbach's alpha, McDonald's omega, or composite reliability) and report the method used alongside the chosen threshold and its rationale. The historical alpha > 0.70 convention is a starting point, not a standard; omega is generally preferred for multidimensional scales. For population-based survey experiments specifically, multi-item indices of the dependent variable are strongly preferred over single-item measures because heterogeneous samples inflate within-group variance (Mutz 2011). Pre-specify index construction rules in the pre-analysis plan; do not construct indices after seeing the data (see also methods-reporting).
Balanced Scales: Include equal numbers of positively and negatively worded options or directional anchors. Unbalanced scales (three positive options, one negative) bias responses toward the overrepresented direction.
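
The internal-consistency check described under Index Construction can be sketched as below. This is a minimal illustration of Cronbach's alpha only, assuming complete responses and items already coded in the same direction; the function name `cronbach_alpha` and the pilot data are hypothetical, and the sketch does not cover omega or composite reliability.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns.

    `items` holds one list of scores per item; position i in every list
    is the same respondent. Uses sample variances (n - 1 denominator).
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    n = len(items[0])
    if any(len(col) != n for col in items):
        raise ValueError("every item needs one score per respondent")
    # Summed index score for each respondent.
    totals = [sum(col[i] for col in items) for i in range(n)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("summed scores have zero variance")
    item_var_sum = sum(variance(col) for col in items)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical pilot data: three 5-point items, five respondents.
pilot_items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha = cronbach_alpha(pilot_items)
print(f"alpha = {alpha:.3f}")
```

Whatever threshold the result is compared against should be the one written into the pre-analysis plan, together with its rationale.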

3. Survey Flow and Organization

Ordering Effects: Question order affects responses through priming, anchoring, and context effects. Place general questions before specific ones when measuring broad attitudes; reverse this when specific experiences are the construct of interest.
Warm-Up Items: Begin the survey with non-sensitive, low-stakes items to build respondent engagement before introducing experimental blocks or sensitive questions. Demographic items can serve this purpose but should not precede treatment blocks if they could prime identity salience.
Treatment Placement: Place experimental treatments after warm-up items but before primary outcome measures. Separate treatment exposure from outcome elicitation with buffer items to reduce experimenter demand effects (Stantcheva 2023). The timing between treatment and outcome measurement is itself a design choice: too close risks unintentionally signaling the treatment-outcome link, too distant risks the treatment "wearing off" before the outcome is captured (Mutz 2011).
Treatment-Outcome Separation: Insert unrelated items or a brief distractor block between treatment and outcome to reduce the salience of the treatment-outcome link. This mitigates demand effects without introducing significant respondent burden. When the construct of interest is itself a short-lived priming effect, shorten the separation; when the worry is experimenter demand, lengthen it (Mutz 2011).
Block Randomization: Randomize the order of thematic blocks (e.g., policy attitudes, demographic items, secondary measures) across respondents to prevent systematic ordering effects. Within blocks, randomize item order for nominal response sets. Caveat: if a context effect is the estimand (e.g., you are studying how question order shapes response), do not randomize it away; instead manipulate order as a factor.
Manipulation Check Placement: Place manipulation checks after outcome elicitation, not immediately after treatment. Post-treatment manipulation checks placed before outcomes can signal the study's purpose and inflate demand effects (Stantcheva 2023; Mutz 2011). Distinguish among: (a) attention checks / instructed-response items that catch satisficing, (b) comprehension checks that verify respondents understood treatment content, and (c) manipulation checks that verify the independent variable moved -- these serve different purposes and need not all be placed identically.
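
The flow rules in this section (fixed treatment-to-outcome sequence, randomized thematic blocks, manipulation check last) can be sketched as a per-respondent block order. This is a minimal illustration, not a platform integration: the block names, the `study_seed` value, and the choice to randomize only the post-outcome blocks are assumptions made for the example.

```python
import random

# Hypothetical block names; the fixed prefix keeps the buffer between
# treatment and primary outcome for every respondent.
FIXED_PREFIX = ["warm_up", "treatment", "buffer", "primary_outcome"]
RANDOMIZED_BLOCKS = ["policy_attitudes", "secondary_measures", "demographics"]
FIXED_SUFFIX = ["manipulation_check"]


def block_order(respondent_id, study_seed=2024):
    """Return a reproducible block order for one respondent.

    Seeding with the study seed plus respondent id means the order can be
    regenerated later for auditing without storing it separately.
    """
    rng = random.Random(f"{study_seed}:{respondent_id}")
    shuffled = list(RANDOMIZED_BLOCKS)
    rng.shuffle(shuffled)
    return FIXED_PREFIX + shuffled + FIXED_SUFFIX


order = block_order("r1")
print(order)
```

If question order is itself the estimand, the shuffled portion would instead be assigned as an experimental factor and recorded as a treatment variable.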

4. Pretesting and Cognitive Interviewing

Cognitive Interview Protocols: Conduct cognitive interviews using think-aloud protocols (respondents verbalize their reasoning while answering) and/or retrospective probing (follow-up questions about interpretation and processing). A common rule of thumb is 5--10 respondents per round; the substantive rule is to iterate until no new comprehension problems emerge (saturation).
Pilot Studies vs. Soft Launches: These serve distinct purposes. Pilot studies test content: comprehension, response distributions, treatment uptake, manipulation check performance. Soft launches test logistics: survey flow, skip logic, display rendering, timing, and platform-specific issues. Conduct both, in sequence (Stantcheva 2023). Treatment pretesting is especially important in population-based experiments because heterogeneous samples weaken the statistical signal of any given manipulation (Mutz 2011).
Pilot Timing: Pilot after instrument draft, before IRB submission when possible (so findings can inform the registered design), and before full deployment. Budget for at least two rounds of piloting.
What to Test: Assess comprehension (do respondents interpret items as intended?), information processing (do respondents engage with treatment materials?), timing (is the median completion time within the target range?), dropout patterns (where do respondents abandon the survey?), and floor/ceiling effects (are response distributions sufficiently spread?).
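
Two of the pilot diagnostics listed under What to Test, floor/ceiling effects and dropout patterns, can be computed with a few lines. The sketch below assumes a single closed-ended item with known scale endpoints and a record of the last page each respondent reached; the function names and data are hypothetical.

```python
from collections import Counter


def floor_ceiling_share(responses, low, high):
    """Share of responses at the lowest and highest scale points."""
    counts = Counter(responses)
    n = len(responses)
    return counts[low] / n, counts[high] / n


def dropout_by_page(last_page_seen, n_pages):
    """Count respondents who abandoned the survey on each page.

    A respondent whose last page equals `n_pages` completed the survey,
    so only pages 1 .. n_pages - 1 are reported.
    """
    counts = Counter(last_page_seen)
    return {page: counts[page] for page in range(1, n_pages)}


# Hypothetical pilot data: one 5-point item and a 5-page survey.
item_responses = [1, 1, 1, 2, 5, 3, 1, 4]
floor_share, ceiling_share = floor_ceiling_share(item_responses, low=1, high=5)

last_pages = [5, 5, 2, 5, 3, 2]
dropouts = dropout_by_page(last_pages, n_pages=5)

print(floor_share, ceiling_share, dropouts)
```

How large a floor or ceiling share counts as a problem is a judgment call for the specific construct; the code only reports the shares.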

5. Respondent Burden and Survey Length

Completion Time Targets: A 10--20 minute completion time is a common target for online panels; Mutz (2011) notes most web interviews cap around 15--20 minutes without additional incentives. Completion rates typically decline as length grows, though the exact breakpoint depends on platform, incentives, and sample. For complex experimental designs (conjoint + vignette + battery), monitor pilot timing carefully and cut secondary measures before primary ones if the survey is too long (Stantcheva 2023).
Attention Checks: Embed 1--3 attention checks (instructed-response items, e.g., "Select 'Strongly agree' for this item") distributed across the survey. Pre-specify exclusion rules for attention check failures in the pre-analysis plan. Report results with and without failed-attention respondents (see methods-reporting).
Speeding Detection: Flag respondents whose completion time falls below a pre-specified threshold (a common convention is below one-third of the median; the defensible approach is to pre-specify the cutoff and its rationale in the pre-analysis plan). Pre-specify whether speeders are excluded or retained with a robustness check. Speeding is an indicator of satisficing, not necessarily of random responding. Collecting page-level response timing is essentially free in web surveys and enables post-hoc exposure diagnostics beyond a single completion-time cutoff (Mutz 2011).
Mobile vs. Desktop: Design the survey for mobile-first if the platform supports it. Conjoint tables and matrix questions render poorly on mobile devices. Test rendering on both form factors during the soft launch (Stantcheva 2023). Report the device split in the methods section.
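
The attention-check and speeding rules above can be sketched as a flagging step that leaves exclusion decisions to the pre-specified analysis. The record fields (`id`, `seconds`, `failed_checks`) and the default of one-third of the median are assumptions for illustration; the actual cutoffs should be the ones registered in the pre-analysis plan.

```python
from statistics import median


def flag_respondents(records, speed_fraction=1 / 3, max_failed_checks=0):
    """Flag speeders and attention-check failures without dropping anyone.

    Returns the speeding cutoff in seconds and a dict of flags per id, so
    results can be reported both with and without flagged respondents.
    """
    cutoff = speed_fraction * median(r["seconds"] for r in records)
    flags = {
        r["id"]: {
            "speeder": r["seconds"] < cutoff,
            "failed_attention": r["failed_checks"] > max_failed_checks,
        }
        for r in records
    }
    return cutoff, flags


# Hypothetical soft-launch data.
records = [
    {"id": "a", "seconds": 900, "failed_checks": 0},
    {"id": "b", "seconds": 840, "failed_checks": 1},
    {"id": "c", "seconds": 200, "failed_checks": 0},
    {"id": "d", "seconds": 960, "failed_checks": 0},
    {"id": "e", "seconds": 1020, "failed_checks": 0},
]
cutoff, flags = flag_respondents(records)
print(cutoff, flags["c"], flags["b"])
```

Keeping flags separate from exclusion makes it straightforward to run the main specification and the robustness check from the same data.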

6. Sensitive Questions and Social Desirability

Assess Sensitivity Before Switching Methods: Do not default to indirect measurement for topics that merely feel sensitive. Blair, Coppock, and Moor's (2020) meta-analysis of 30 years of list experiments finds that sensitivity biases are typically smaller than 10 percentage points and in some domains (including measures of prejudice) are approximately zero. Apply their four-criterion test first: sensitivity bias is a meaningful problem only when (a) respondents have a specific social referent in mind, (b) they believe that referent can infer their answer, (c) they perceive the referent prefers a particular answer, and (d) they believe costs follow if the preferred response is not given. If any of the four fails, direct questioning with neutral framing is likely preferable.
Default to Direct with Neutral Framing; Switch to Indirect Only When Warranted: When using direct questions on sensitive topics, frame items neutrally and offer a full range of socially acceptable response options (e.g., prefer "What is your view on X?" with a balanced scale over "Do you support X?"). Self-administration, separating the respondent from an interviewer, and physical or digital privacy measures can reduce bias without the precision penalty of indirect methods (Blair, Coppock, and Moor 2020). Consider indirect techniques (list experiments, randomized response, endorsement experiments) only when the four-criterion test indicates bias is likely and when the expected bias exceeds the precision cost.
Precision Cost of Indirect Measurement: Under typical conditions, list experiments are approximately 14 times noisier than direct questions, so either the sample size or the expected bias must be large to justify the design (Blair, Coppock, and Moor 2020). Factor this into pre-registration and power planning. See the list-experiment skill for design and estimator decisions once indirect measurement is warranted.
Obfuscated Follow-Ups: For factual questions where respondents may misreport, use incentive-compatible designs (monetary incentives for accurate answers) or obfuscated follow-ups that verify self-reports without making the verification salient (Stantcheva 2023).
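
The bias-versus-precision trade-off described under Precision Cost of Indirect Measurement can be made concrete with a back-of-the-envelope comparison. The sketch below rests on simplifying assumptions that are not from the source: a simple difference-in-means list estimator with equal-sized arms, a sensitive item independent of the control-item count, and an unbiased list estimate. All parameter values are illustrative and are not meant to reproduce the 14x figure.

```python
import math


def direct_rmse(true_p, bias, n):
    """RMSE of a direct question whose answers are shifted by `bias`."""
    reported_p = true_p + bias
    if not 0 <= reported_p <= 1:
        raise ValueError("true_p + bias must lie in [0, 1]")
    sampling_var = reported_p * (1 - reported_p) / n
    return math.sqrt(bias ** 2 + sampling_var)


def list_rmse(true_p, control_var, n):
    """RMSE of a list experiment, assumed unbiased, with n / 2 per arm.

    `control_var` is the variance of the control-list item count.
    Assumes the sensitive item is independent of that count.
    """
    n_arm = n / 2
    treat_var = control_var + true_p * (1 - true_p)
    return math.sqrt(treat_var / n_arm + control_var / n_arm)


# Illustrative values: 30% true prevalence, 5-point under-reporting.
for n in (1000, 5000):
    d = direct_rmse(0.3, -0.05, n)
    l = list_rmse(0.3, 0.8, n)
    print(n, round(d, 4), round(l, 4), "list" if l < d else "direct")
```

Under these assumed values the direct question has lower error at the smaller sample and the list experiment at the larger one, which is the pattern the text describes: indirect measurement pays off only when the sample or the expected bias is large.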

7. Treatment Delivery in Surveys

Treatment Strength as a Design Priority: Population-based samples are, by design, heterogeneous, which inflates within-group variance and makes a weak manipulation statistically invisible. Build treatments stronger than you think you need and keep them short enough that respondents actually process them -- past a certain point, longer treatments do not become more effective (Mutz 2011). See also hypothesis-building for mapping constructs to treatment operationalizations.
Information Treatment Formats: Match format to treatment complexity. Short factual corrections work as text. Complex policy information benefits from infographics or structured tables. Audio-visual treatments (images, audio, video) can increase engagement but introduce confounds (narrator characteristics, production quality, bandwidth) that must be pretested and, where possible, held constant across conditions (Mutz 2011). For text-heavy treatments, breaking content across screens and interspersing light questions sustains attention better than long single-screen text (Mutz 2011).
Vignette Construction: Write vignettes at a "medium level of specificity" -- concrete enough to engage respondents but not so detailed that unintended confounds are introduced (Sniderman 2018). For factorial vignettes, verify that all attribute combinations produce coherent paragraphs. See conjoint-design for factorial-vignette estimators and power.
Forced Exposure and Comprehension Gates: When treatment uptake is critical (e.g., information experiments), use forced exposure (minimum display time before advancing) combined with comprehension gates (respondents must answer a comprehension question correctly to proceed). Collect page-level response timing to diagnose inadequate exposure post-hoc (Mutz 2011). Report the proportion passing comprehension gates and analyze both intention-to-treat (ITT; all assigned) and per-protocol (comprehension-gate passers) samples; pre-register both specifications (see pre-registration-writing, methods-reporting).
Treatment Fidelity Verification: Include a brief treatment recall or manipulation check to verify that respondents processed the treatment content and that the independent variable actually moved. Place it after the outcome measures rather than directly after the treatment block, to avoid signaling the study's purpose (see Section 3 on check placement). A study with a null outcome but a successful manipulation check is still interpretable; a study whose manipulation check also fails is not (Mutz 2011).
Cross-National Stimulus Design: When treatments are deployed across countries or languages, stimulus comparability itself becomes a design problem (translation equivalence, institutional backdrop, media-diet differences). See cross-national-design.
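
The two specifications named under Forced Exposure and Comprehension Gates can be sketched as a single difference-in-means function applied to two samples. The row fields (`arm`, `y`, `passed_gate`) and the data are hypothetical, and the sketch assumes both arms face a comprehension gate. Note that the per-protocol sample conditions on a post-assignment outcome, so it is shown here as a pre-registered companion to the ITT estimate rather than a substitute for it.

```python
def diff_in_means(rows, keep=lambda r: True):
    """Treatment-minus-control mean outcome among rows where keep(r) is true."""
    treated = [r["y"] for r in rows if r["arm"] == "treatment" and keep(r)]
    control = [r["y"] for r in rows if r["arm"] == "control" and keep(r)]
    if not treated or not control:
        raise ValueError("both arms need at least one retained respondent")
    return sum(treated) / len(treated) - sum(control) / len(control)


# Hypothetical data: outcome y on a 1-5 scale.
rows = [
    {"arm": "treatment", "y": 5, "passed_gate": True},
    {"arm": "treatment", "y": 4, "passed_gate": True},
    {"arm": "treatment", "y": 2, "passed_gate": False},
    {"arm": "control", "y": 3, "passed_gate": True},
    {"arm": "control", "y": 2, "passed_gate": True},
    {"arm": "control", "y": 1, "passed_gate": False},
]

itt = diff_in_means(rows)  # everyone as assigned
per_protocol = diff_in_means(rows, keep=lambda r: r["passed_gate"])
pass_rate = sum(r["passed_gate"] for r in rows) / len(rows)

print(round(itt, 3), round(per_protocol, 3), round(pass_rate, 3))
```

Reporting the gate pass rate alongside both estimates lets readers judge how much the two samples differ.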

Quality Checks

Item-Specific Scales: Are all attitudinal items worded with item-specific response options (no agree/disagree or yes/no)?
No Double-Barreled Items: Does each question measure exactly one construct?
Scale Polarity: Are bipolar and unipolar scales matched to the construct?
Index Pre-Specified: Are index construction rules (reliability metric, threshold rationale, aggregation method) specified in the pre-analysis plan, not post hoc?
Multi-Item DVs: Is the primary dependent variable measured with multiple items where feasible, given that heterogeneous samples inflate measurement error (Mutz 2011)?
Treatment Strength: Has the treatment been pretested for strength, and is the treatment-outcome spacing justified against demand-effect and wear-off concerns?
Treatment-Outcome Separation: Are buffer items placed between treatment and outcome blocks?
Check Type Distinguished: Are attention, comprehension, and manipulation checks distinguished and placed appropriately (manipulation checks after outcome)?
Cognitive Interviews: Were cognitive interviews conducted to saturation (no new comprehension problems)?
Both Pilot Types: Were both a content pilot and a logistical soft launch conducted?
Completion Time: Is the median completion time within a plausible online-panel range (e.g., 10--20 minutes)?
Attention Check Rules: Are attention-check exclusion rules and speeding cutoffs pre-specified, and are results reported with and without exclusions?
Sensitivity Four-Criterion Test: For sensitive topics, was the Blair, Coppock, and Moor (2020) four-criterion test applied before defaulting to indirect measurement, and was direct-with-neutral-framing considered first?
Mobile Tested: Was the survey tested on mobile devices during the soft launch?
Forced Exposure: For information treatments, are forced exposure or comprehension gates used, and are both ITT and per-protocol specifications pre-registered?
Cross-Skill Alignment: Does the instrument tie to hypothesis-building (construct-to-item), methods-reporting (what gets disclosed), list-experiment (if indirect), and cross-national-design (if multi-country)?