
Data Quality Pipeline

Last updated: May 15, 2026, 9:53 AM EDT

The TABS project applies a multi-stage data quality pipeline to every survey response. This page documents how responses flow from collection through validation to analysis, the quality checks applied at each stage, the edge cases we discovered and resolved, and the sensitivity analysis that demonstrates our findings are robust to inclusion criteria.

All statistics on this page are generated automatically by the daily analysis pipeline and refreshed every time the pipeline runs. The Prolific Accepted count shown here matches the Prolific platform’s “Approved” tab exactly.

Data Flow

Every survey response passes through this pipeline before appearing in any analysis:

  1. Qualtrics Export - Raw survey responses exported via API (3-header-row CSV format with question text and import IDs)
  2. Prolific Enrichment - Each response is cross-referenced with Prolific submission data (approval status, auth check scores) using the participant ID as join key
  3. Deduplication - When a participant retakes the survey, only one response is kept. Completed responses are preferred over incomplete retakes (see Edge Cases below)
  4. Disposition Waterfall - An 11-step quality classification assigns each response to exactly one disposition category
  5. Sample Definition - Five samples are computed, ranging from most restrictive (Conservative Clean) to least restrictive (All V2)
  6. Statistical Analysis - Every metric is computed independently across all five samples

Pipeline and Data Flow

The TABS daily pipeline runs 11 stages end-to-end every morning at 09:00 UTC, plus on-demand via workflow_dispatch. Each stage produces a specific output that surfaces somewhere on this site. The diagram below traces a single response from Qualtrics export to the committed JSON that drives every results page. Click any stage to jump to where its output is visible.

Phase numbering matches .github/workflows/daily-pipeline.yml. Stages 6-9 emit the per-sample blocks consumed by the validation, factor-analysis, reliability, and findings pages.

  1. Stage 1- Phase 1

    Fetch Prolific Data

    Pulls per-participant auth-check scores and submission statuses (AWAITING REVIEW / APPROVED / REJECTED) from the Prolific API. ~10 seconds.

    scripts/analysis/fetch_prolific_data.py

    Outputs:

    • Prolific submission JSON
    • auth check pass/fail per participant

    /results/dashboard

  2. Stage 2- Phase 2.1

    Export Qualtrics CSV

    Exports the survey response CSV from the Qualtrics API for the active TABS V2 survey. 3-header-row format with question text and import IDs.

    scripts/analysis/export_qualtrics.py

    Outputs:

    • raw Qualtrics CSV (ephemeral, runner-only)

    /results/survey-stats

  3. Stage 3- Phase 2.2

    Enrich + Join

    Joins each Qualtrics row to the Prolific submission record on PROLIFIC_PID, attaching the auth-check verdict and the demographic roster.

    scripts/analysis/enrich_qualtrics_csv.py

    Outputs:

    • enriched CSV (ephemeral, runner-only - contains PII)

    /making-of-tabs/data-analysis

  4. Stage 4- Phase 2.3

    Disposition Audit

    11-step waterfall classifier: each response is assigned exactly one disposition - CLEAN, AWAITING REVIEW, FLAG, or AUTO-EXCLUDE - based on duration, IRI failures, and content checks.

    scripts/analysis/tabs_v2_data_audit.py

    Outputs:

    • disposition CSV (1-day artifact)
    • disposition-summary.json

    /results/data-quality

  5. Stage 5- Phase 2.4

    De-identify Public Dataset

    Strips PROLIFIC_PID and other direct identifiers, producing the public dataset that ships in the repo. The original PID column never leaves the runner.

    scripts/analysis/tabs_v2_data_audit.py

    Outputs:

    • public/datasets/TABS_V2_CRP_2026_public_dataset.csv

    /results/reproducibility

  6. Stage 6- Phase 2.5

    Descriptive Analysis

    Means, SDs, skewness, kurtosis, inter-construct correlations, top-3 forced-choice tallies. Run once per sample tier.

    scripts/analysis/tabs_v2_unified_data_analysis.py

    Outputs:

    • sensitivity-analysis.json (per-sample stats)

    /results/descriptive

  7. Stage 7- Phase 2.6

    Advanced Analysis

    Effect sizes (Cohen d), t-tests, ANOVA, regression on demographic groupings, cross-tabs. Per sample tier.

    scripts/analysis/tabs_v2_unified_data_analysis.py

    Outputs:

    • effect-size and inferential blocks in sensitivity JSON

    /results/findings

  8. Stage 8- Phase 2.7

    Psychometrics / Validation

    Cronbach's alpha, McDonald's omega, KMO, Bartlett, EFA, CFA (ML + DWLS), HTMT, HTMT2, bifactor, second-order, multigroup, ESEM, IRT GRM, Mardia normality, Mahalanobis outliers. Emits the 28+ keys consumed by /results/validation, /results/factor-analysis, /results/reliability, /results/findings.

    scripts/analysis/tabs_v2_unified_data_analysis.py

    Outputs:

    • src/data/crp-validation.json
    • src/data/live-validation.json
    • 28 new top-level keys (PR #1837)

    /results/validation

  9. Stage 9- Phase 2.8

    Quality Audit

    Outlier detection (Mahalanobis, Cook distance), common-method-variance check, missing-data pattern audit. Cross-checks earlier stages.

    scripts/analysis/tabs_v2_unified_data_analysis.py

    Outputs:

    • quality block in unified analysis JSON

    /results/validation

  10. Stage 10- Phase 3

    Operations

    Auto-approves CLEAN dispositions on Prolific, sends per-disposition messages to FLAG participants, generates the live dashboard JSON.

    scripts/analysis/{approve,message,reject}_*.py

    Outputs:

    • approval / message side effects on Prolific

    /results/dashboard

  11. Stage 11- Phase 4-5

    Commit Results + Daily Report

    Commits the regenerated JSONs to the repo via the format-and-pr action and files a daily disposition report issue. Triggers the deploy chain that pushes the updated site.

    .github/workflows/daily-pipeline.yml

    Outputs:

    • committed JSONs in src/data/
    • daily report issue

    https://github.com/clarkemoyer/technologyadoptionbarriers.org/blob/main/.github/workflows/daily-pipeline.yml

Demographic Data Sources

Participant demographics are collected from two independent sources that capture fundamentally different information:

Aspect | Survey Demographics (Qualtrics) | Platform Demographics (Prolific)
Source | Self-reported in the TABS survey instrument (questions Q1-Q9) | Prolific participant profile database (archived at submission completion)
Type | Organizational/role-based characteristics | Personal/sociodemographic + professional characteristics
Base Fields | Executive Role (Q1), Decision Authority (Q2), Industry (Q3), Org Size (Q4), Profit Model (Q5), Revenue/Budget (Q6-Q7), Geography (Q8-Q9) | Age, Sex, Ethnicity, Language, Country of Residence, Nationality, Country of Birth, Student Status, Employment Status
Prescreener Fields | N/A - all fields are part of the survey instrument | Employment Sector, Industry, Company Size, Occupation, Education Level, Household Income, Fluent Languages, and hundreds more via GET /api/v1/filters/ (up to 15 per export)
Collection Method | Qualtrics CSV export, processed by tabs_v2_unified_data_analysis.py | Prolific API POST /studies/{id}/demographic-export/
Used in Analysis | ✅ All per-group statistics, effect sizes, cross-tabulations | ✅ Available via API for cross-validation and sample balancing; base fields not published on results pages (privacy protection)
Cross-Validation | Overlapping dimensions allow independent verification: Prolific industry ↔ Qualtrics Q3_Industry, Prolific company_size ↔ Qualtrics Q4_OrgSize, Prolific employment_sector ↔ Qualtrics Q5_ProfitModel, Prolific occupation ↔ Qualtrics Q1_Role
Join Key | Embedded in Qualtrics response row (same CSV) | Matched via Prolific Participant ID (PID)

Key distinction: Survey Demographics (Qualtrics) and Platform Demographics (Prolific) are separate datasets that capture different types of information. Survey demographics document what role participants hold and what kind of organization they work in. Prolific demographics document who the participants are personally, plus professional characteristics (industry, company size, occupation) when prescreener filters are configured. Where fields overlap (industry, company size, sector, role), they provide an independent cross-validation opportunity to verify self-reported data and balance samples. The two are joined via Prolific Participant ID (PID).
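The PID join described above can be sketched as follows. This is a minimal illustration, not the code in enrich_qualtrics_csv.py: the function name enrich_rows and the Prolific record keys (participant_id, status, auth_llm, auth_bots) are assumptions made for the example. Only PROLIFIC_PID, Prolific_Status, Auth_LLM, and Auth_Bots are names taken from this page.

```python
from typing import Dict, List, Optional


def enrich_rows(
    qualtrics_rows: List[Dict[str, str]],
    prolific_submissions: List[Dict[str, str]],
) -> List[Dict[str, Optional[str]]]:
    """Left-join Qualtrics rows to Prolific submissions on the participant ID."""
    # Index Prolific records by participant ID (hypothetical key name).
    by_pid = {s["participant_id"]: s for s in prolific_submissions}

    enriched = []
    for row in qualtrics_rows:
        match = by_pid.get(row.get("PROLIFIC_PID", "").strip())
        merged: Dict[str, Optional[str]] = dict(row)
        # Unmatched rows get None so later stages can tell them apart.
        merged["Prolific_Status"] = match["status"] if match else None
        merged["Auth_LLM"] = match["auth_llm"] if match else None
        merged["Auth_Bots"] = match["auth_bots"] if match else None
        enriched.append(merged)
    return enriched
```

A left join keeps every Qualtrics row, so a response with no Prolific record stays visible downstream instead of silently disappearing.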

🔒 Privacy-First Enrichment Architecture

Prolific demographic data (both base fields and prescreener responses) contains personally identifiable information (PII). The pipeline processes this data ephemerally:

  • Demographics are fetched to runner.temp during pipeline execution - never committed to the repository
  • Cross-validation checks run in-memory; only aggregate pass/fail flags are emitted
  • Published results pages display only category-level aggregates from Qualtrics survey data (Q1-Q9), never individual Prolific profile data
  • The PROLIFIC_STUDY_SCREENERS constant in tabs_api.py and prolific-api.ts documents the exact eligibility criteria and filter_ids used for enrichment export

Enrichment Filter Budget (7 of 15 max)

Category | Filter IDs | Count
Study screeners (prescreener) | employment_sector, company_size, occupation | 3
Cross-validation | industry | 1
Augmentation | education_level, household_income, fluent_languages | 3
Total | | 7 / 15

Note: “Current Country of Residence” and “Employment Status” are base fields (always included in every export) and do not count against the 15-filter limit. See Prolific Demographic Export API for export limits (15 filters, 2 configuration changes before lock).

Disposition Waterfall (Steps 0-10)

Each response is evaluated through this 11-step waterfall (steps 0-10). The first matching step determines the disposition - a response is never counted in multiple categories.

Step | Disposition | Criteria
0 | INCOMPLETE | Survey not finished (Qualtrics Finished != TRUE)
1 | FLAG-AUTH-FAIL | Prolific authenticity check: LLM or Bots score = "Low"
2 | FLAG-AUTH-MIXED | Prolific authenticity check: LLM or Bots score = "Mixed"
3 | AUTO-EXCLUDE | 2+ IRI failures, OR speed flag (<5 min) + any IRI failure
4 | FLAG-SPEED | Duration < 5 min but all 3 IRIs correct
5 | FLAG-SINGLE-IRI | 1 IRI failure at normal speed (>= 5 min)
6 | FLAG-SMEAL | Duration 5-9 min (below Smeal eDBA benchmark of 9 min)
7 | FLAG-RECAPTCHA | reCAPTCHA score < 0.5
8 | FLAG-STRAIGHTLINING | Qualtrics Q_StraightliningCount > 0 (same answer for entire block)
9 | FLAG-PARTIAL-STRAIGHTLINING | Within-person SD < 0.5 in any question block (Meade & Craig 2012)
10 | CLEAN | All checks passed: finished, all 3 IRIs, duration >= 9 min, reCAPTCHA >= 0.5, no straightlining, auth checks pass
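The first-match behaviour of the waterfall can be sketched as a chain of early returns. This is an illustration of the table above, not the code in tabs_v2_data_audit.py: the Response fields and the classify function are hypothetical names, and the passing auth score is assumed to be any value other than "Low" or "Mixed".

```python
from dataclasses import dataclass


@dataclass
class Response:
    finished: bool
    duration_s: float          # total survey duration in seconds
    iri_failures: int          # number of failed IRIs (0-3)
    auth_llm: str              # Prolific authenticity score for LLM use
    auth_bots: str             # Prolific authenticity score for bots
    recaptcha: float
    straightlining_count: int  # Qualtrics Q_StraightliningCount
    min_block_sd: float        # smallest within-person SD over all blocks


def classify(r: Response) -> str:
    """Return the first matching disposition, evaluated in step order 0-10."""
    if not r.finished:
        return "INCOMPLETE"
    auth = {r.auth_llm.strip().lower(), r.auth_bots.strip().lower()}
    if "low" in auth:
        return "FLAG-AUTH-FAIL"
    if "mixed" in auth:
        return "FLAG-AUTH-MIXED"
    too_fast = r.duration_s < 300  # speed flag: under 5 minutes
    if r.iri_failures >= 2 or (too_fast and r.iri_failures >= 1):
        return "AUTO-EXCLUDE"
    if too_fast:
        return "FLAG-SPEED"
    if r.iri_failures == 1:
        return "FLAG-SINGLE-IRI"
    if r.duration_s < 540:  # below the 9-minute Smeal eDBA benchmark
        return "FLAG-SMEAL"
    if r.recaptcha < 0.5:
        return "FLAG-RECAPTCHA"
    if r.straightlining_count > 0:
        return "FLAG-STRAIGHTLINING"
    if r.min_block_sd < 0.5:
        return "FLAG-PARTIAL-STRAIGHTLINING"
    return "CLEAN"
```

Because each branch returns immediately, a response that fails several checks is counted only under the earliest step it matches.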

Instructed Response Items (IRIs)

Three attention check items are embedded within the survey, one per construct. Each instructs the respondent to select a specific answer. Exact string match is required - any other value (including “Don’t Know”) is scored as a failure.

Construct | Column | Expected Answer
Barriers (19 items) | Q10-28_Barriers_19 | “Major Barrier”
Readiness (18 items) | Q47-64_Readiness_18 | “Low Readiness/Capability”
Maturity (9 items) | Q65-73_Maturity_9 | “Level 2: Developing/Repeatable”
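A minimal sketch of the exact-match scoring, using the column names and expected answers from the table above. The function name count_iri_failures is hypothetical, and the sketch assumes the export carries the answer labels as plain strings.

```python
from typing import Dict

# Column name -> the only answer that counts as a pass (from the table above).
IRI_EXPECTED = {
    "Q10-28_Barriers_19": "Major Barrier",
    "Q47-64_Readiness_18": "Low Readiness/Capability",
    "Q65-73_Maturity_9": "Level 2: Developing/Repeatable",
}


def count_iri_failures(row: Dict[str, str]) -> int:
    """Count IRIs whose answer is not an exact string match."""
    # No trimming or case folding: a missing value, a differently cased
    # label, or "Don't Know" all count as failures.
    return sum(
        1 for column, expected in IRI_EXPECTED.items()
        if row.get(column) != expected
    )
```

For example, a row that answers the Readiness item with "Don't Know" and the other two correctly yields one failure.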

Sample Definitions

Five sample definitions are used, from most to least restrictive. The Prolific Accepted count matches the Prolific platform’s “Approved” tab exactly. The clean samples apply additional quality filters on top of Prolific approval.

Sample | Definition | N
Conservative Clean | Prolific APPROVED + all quality checks (IRI, duration >= 540s, reCAPTCHA, straightlining, auth) | 124
Flexible Clean | Prolific APPROVED + basic quality (all 3 IRIs + duration >= 480s) | 184
Prolific Accepted | All deduplicated V2 rows with Prolific APPROVED status | 339
All V2 Finished | Finished + duration >= 120s (extreme speeders excluded) | 598
All V2 | All V2 responses including incomplete | 697

Constraints: Conservative Clean ⊆ Flexible Clean ⊆ Prolific Accepted ⊆ All V2, and All V2 Finished ⊆ All V2. Prolific Accepted and All V2 Finished overlap but neither is guaranteed to be a subset of the other (Prolific Accepted includes INCOMPLETE+APPROVED responses; All V2 Finished includes non-APPROVED responses).
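The subset constraints lend themselves to an automated check. The sketch below assumes each sample is available as a set of participant IDs; the function name nesting_violations is hypothetical and not part of the pipeline.

```python
from typing import Dict, List, Set

# (inner, outer) pairs that must satisfy inner ⊆ outer, per the constraints above.
NESTING = [
    ("Conservative Clean", "Flexible Clean"),
    ("Flexible Clean", "Prolific Accepted"),
    ("Prolific Accepted", "All V2"),
    ("All V2 Finished", "All V2"),
]


def nesting_violations(samples: Dict[str, Set[str]]) -> List[str]:
    """Return one message per broken subset constraint (empty list if none)."""
    problems = []
    for inner, outer in NESTING:
        extra = samples[inner] - samples[outer]
        if extra:
            problems.append(f"{inner} has {len(extra)} ID(s) not in {outer}")
    return problems
```

Prolific Accepted and All V2 Finished are deliberately not compared with each other, since neither is guaranteed to contain the other.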

Exact Filter Chains (Authoritative Definitions)

Each sample definition is produced by applying filters in order. These are the canonical definitions used by the analysis pipeline (tabs_v2_unified_data_analysis.py). Every metric on the Results pages is computed against these exact filters.

1. Conservative Clean (Primary Analysis Sample)

The most restrictive sample. Used for all primary reporting. Requires Prolific approval plus passing every quality gate.

  1. Prolific_Status == “APPROVED”
  2. Qualtrics Finished == TRUE (survey completed)
  3. Duration ≥ 480 seconds (8 minutes)
  4. All 3 IRI attention checks correct (exact string match)
  5. Duration ≥ 540 seconds (9 min Smeal eDBA benchmark)
  6. reCAPTCHA score ≥ 0.5
  7. Q_StraightliningCount == 0 (no full-block straightlining)
  8. Within-person SD ≥ 0.5 in all blocks (no partial straightlining)
  9. Auth_LLM and Auth_Bots not LOW or MIXED

Source: filter_samples() in tabs_v2_unified_data_analysis.py

2. Flexible Clean (Expanded Quality Sample)

Includes manually-reviewed FLAG responses that were approved on Prolific. Uses a lower duration threshold and only checks IRI attention.

  1. Prolific_Status == “APPROVED”
  2. Qualtrics Finished == TRUE
  3. Duration ≥ 480 seconds (8 minutes)
  4. All 3 IRI attention checks correct

Does NOT check: reCAPTCHA, straightlining, partial straightlining, or auth flags.

3. Prolific Accepted (Platform-Verified Sample)

All deduplicated V2 responses where the participant has been approved on Prolific. This count must match the Prolific UI “Approved” tab exactly. Any discrepancy indicates a pipeline bug.

  1. Prolific_Status == “APPROVED”
  2. Deduplicated by PROLIFIC_PID (prefer completed response)

No quality filters. Includes incomplete/short responses if Prolific approved them.

4. All V2 Finished (Completed Responses)

All finished responses above a minimum duration threshold. Not filtered by Prolific status - includes returned, timed-out, and awaiting-review participants.

  1. Qualtrics Finished == TRUE
  2. Duration ≥ 120 seconds (extreme speeders excluded)

5. All V2 (Complete Dataset)

Every V2 response including incomplete, deduplicated by PROLIFIC_PID. This is the universe from which all other samples are drawn.

  1. StartDate on or after V2 launch (2026-03-23)
  2. Deduplicated by PROLIFIC_PID (prefer completed response)
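The five filter chains can be expressed as predicates over an already deduplicated list of V2 rows. This is a sketch under stated assumptions rather than the actual filter_samples() implementation: the row keys Finished (as a bool), duration_s, iri_failures, recaptcha, and min_block_sd are hypothetical, and the auth scores are compared case-insensitively.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def is_flexible_clean(r: Row) -> bool:
    return bool(
        r["Prolific_Status"] == "APPROVED"
        and r["Finished"]
        and r["duration_s"] >= 480
        and r["iri_failures"] == 0
    )


def is_conservative_clean(r: Row) -> bool:
    # Flexible Clean plus every remaining quality gate.
    auth = {str(r["Auth_LLM"]).upper(), str(r["Auth_Bots"]).upper()}
    return bool(
        is_flexible_clean(r)
        and r["duration_s"] >= 540
        and r["recaptcha"] >= 0.5
        and r["Q_StraightliningCount"] == 0
        and r["min_block_sd"] >= 0.5
        and not auth & {"LOW", "MIXED"}
    )


def build_samples(rows: List[Row]) -> Dict[str, List[Row]]:
    """Split deduplicated V2 rows into the five sample definitions."""
    return {
        "Conservative Clean": [r for r in rows if is_conservative_clean(r)],
        "Flexible Clean": [r for r in rows if is_flexible_clean(r)],
        "Prolific Accepted": [r for r in rows if r["Prolific_Status"] == "APPROVED"],
        "All V2 Finished": [r for r in rows if r["Finished"] and r["duration_s"] >= 120],
        "All V2": list(rows),
    }
```

Writing Conservative Clean in terms of is_flexible_clean makes the subset relationship between the two samples hold by construction.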

⚠ Disposition CLEAN vs. Conservative Clean

These are related but distinct concepts that serve different purposes:

  • Disposition CLEAN (from the waterfall above): A response that passes all 10 quality checks without being flagged. Used by the operations pipeline to auto-approve participants on Prolific. Does not check Prolific_Status.
  • Conservative Clean (sample definition): Requires Prolific_Status == “APPROVED” plus all quality checks. Used for statistical analysis and reporting.

Expected relationship: After the daily auto-approve workflow runs, all Disposition CLEAN participants should have Prolific_Status == APPROVED, making the counts equal. Any persistent gap indicates a pipeline issue. The disposition dashboard cross-references these counts automatically.

Sensitivity Analysis

Every key statistic is computed across all five sample definitions. If a finding holds across Conservative Clean (N=124) and Flexible Clean (N=184), it is robust to inclusion criteria.

Metric | Conservative Clean (N=124) | Flexible Clean (N=184) | Prolific Accepted (N=339) | All V2 Finished (N=598) | All V2 (N=697)
Barrier Grand Mean | 2.8449 | 2.8380 | 2.8476 | 2.8524 | 2.8548
Barrier SD | 0.6445 | 0.7004 | 0.7155 | 0.7849 | 0.7852
Readiness Grand Mean | 3.0381 | 3.0799 | 3.1064 | 3.2022 | 3.2024
Readiness SD | 0.5914 | 0.6549 | 0.6771 | 0.7080 | 0.7068
Maturity Grand Mean | 3.0487 | 3.0620 | 3.1285 | 3.2232 | 3.2234
Maturity SD | 0.6907 | 0.7747 | 0.7850 | 0.7995 | 0.7988
B-R Correlation | -0.3676 | -0.3932 | -0.2997 | -0.2799 | -0.2797
B-M Correlation | -0.1435 | -0.2655 | -0.2293 | -0.2696 | -0.2697
R-M Correlation | 0.5726 | 0.6875 | 0.7214 | 0.7134 | 0.7134
Alpha Barriers | 0.8608 | 0.8740 | 0.8809 | 0.9084 | 0.9086
Alpha Readiness | 0.8770 | 0.9133 | 0.9195 | 0.9273 | 0.9273
Alpha Maturity | 0.8284 | 0.8784 | 0.8836 | 0.8908 | 0.8908
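One simple way to read the table is to look at how far each metric moves across the five samples and whether its sign stays the same. The sketch below uses a few rows copied from the table; the two summary functions are illustrative choices, not a robustness criterion defined by the pipeline.

```python
from typing import Dict, List

# Values per metric, in sample order: Conservative Clean, Flexible Clean,
# Prolific Accepted, All V2 Finished, All V2 (copied from the table above).
METRICS: Dict[str, List[float]] = {
    "Barrier Grand Mean": [2.8449, 2.8380, 2.8476, 2.8524, 2.8548],
    "B-R Correlation": [-0.3676, -0.3932, -0.2997, -0.2799, -0.2797],
    "R-M Correlation": [0.5726, 0.6875, 0.7214, 0.7134, 0.7134],
    "Alpha Barriers": [0.8608, 0.8740, 0.8809, 0.9084, 0.9086],
}


def spread(values: List[float]) -> float:
    """Largest minus smallest value across the samples."""
    return max(values) - min(values)


def sign_is_stable(values: List[float]) -> bool:
    """True when every sample gives a value with the same sign."""
    return all(v > 0 for v in values) or all(v < 0 for v in values)


for name, values in METRICS.items():
    print(f"{name}: spread={spread(values):.4f}, sign stable={sign_is_stable(values)}")
```

A small spread with a stable sign suggests the statistic does not depend much on which inclusion rule is applied.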

Edge Cases & Data Quality Decisions

During pipeline development, several edge cases were discovered and resolved. Each decision is documented here for transparency and reproducibility.

Retake Deduplication: Prefer Completed Response

Some participants completed the survey, received Prolific approval, then started a retake but did not finish it. The Qualtrics export contains both rows for the same Prolific PID. The Python analysis pipeline’s deduplication logic prefers the completed response (Finished=TRUE) over the incomplete retake, regardless of chronological order. This ensures the approved, completed response is used for analysis rather than being overwritten by an abandoned retake.

Note: The TypeScript disposition triage (used by the operations pipeline) still uses “latest row wins” dedup, which can keep an incomplete retake over a completed original. This is being addressed in issue #687 (TS → Python migration). The Python analysis pipeline already applies the correct logic.
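The “prefer completed” rule can be sketched as a single pass over the rows. This is an illustration, not the pipeline's code: the function name is hypothetical, rows are assumed to arrive in chronological order with Finished already parsed to a bool, and the tie-break (latest row wins when both rows have the same completion status) is an assumption the page does not specify.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def dedupe_prefer_completed(rows: List[Row]) -> List[Row]:
    """Keep one row per PROLIFIC_PID, preferring finished responses."""
    kept: Dict[str, Row] = {}
    for row in rows:
        pid = row["PROLIFIC_PID"]
        current = kept.get(pid)
        if current is None:
            kept[pid] = row
        elif current["Finished"] and not row["Finished"]:
            # An abandoned retake never replaces a completed response.
            continue
        else:
            # Either the new row is the first finished one, or both rows
            # have the same status and the later row wins (assumed tie-break).
            kept[pid] = row
    return list(kept.values())
```

By contrast, a plain “latest row wins” rule would drop the `elif` branch and always overwrite, which is how an incomplete retake can displace a completed original.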

Prolific Accepted Must Match Prolific UI

The “Prolific Accepted” sample count must match the Prolific platform’s “Approved” tab exactly. This is validated by cross-referencing the Prolific API submission statuses with the Qualtrics export. Any discrepancy indicates a pipeline bug, not a data issue.

The Prolific API is queried with limit=1000 per page to ensure all submissions are fetched. The enrichment step matches Prolific participant IDs to Qualtrics PROLIFIC_PID embedded data fields.

IRI Pass Rate Denominator

IRI (attention check) pass rates are computed using finished responses only as the denominator, not all responses. Incomplete responses cannot have valid IRI answers, so including them would artificially deflate pass rates.
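The effect of the denominator choice is easy to see in a small sketch. Here a “pass” is assumed to mean all three IRIs correct, and the row keys Finished and iri_failures are hypothetical.

```python
from typing import Any, Dict, List, Optional


def iri_pass_rate(rows: List[Dict[str, Any]]) -> Optional[float]:
    """Share of finished responses with zero IRI failures."""
    finished = [r for r in rows if r["Finished"]]
    if not finished:
        return None  # no valid denominator
    passed = sum(1 for r in finished if r["iri_failures"] == 0)
    return passed / len(finished)
```

With two finished responses (one passing) and two incomplete ones, this gives 1/2, whereas dividing by all four rows would report 1/4.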

Partial Straightlining Detection

Beyond Qualtrics’ built-in straightlining count, the pipeline computes within-person standard deviation per question block. If a respondent selected nearly identical answers for all items in a block (SD < 0.5), the response is flagged. The threshold follows Meade & Craig (2012), Psychological Methods, 17(3), 437-455.

IRI items are excluded from the SD calculation. IRI attention checks have predetermined correct answers (e.g., “Major Barrier”) that differ from typical straightline responses. Including them would artificially inflate within-person variance and mask genuine straightlining. Only substantive scale items are used: 18 Barrier items, 17 Readiness items, and 8 Maturity items.

The minimum response threshold for evaluation is ceil(block_count / 2) items answered, matching the TypeScript disposition pipeline exactly.
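A minimal sketch of the per-block check, including the ceil(block_count / 2) minimum. The function names are hypothetical, unanswered items are assumed to be None, IRI items are assumed to have been removed beforehand, and the sample standard deviation is used because the page does not say whether the pipeline uses the sample or population form.

```python
import math
import statistics
from typing import List, Optional


def block_sd(answers: List[Optional[float]], block_count: int) -> Optional[float]:
    """Within-person SD for one block, or None if too few items were answered."""
    answered = [a for a in answers if a is not None]
    if len(answered) < max(2, math.ceil(block_count / 2)):
        return None  # not enough answers to evaluate this block
    return statistics.stdev(answered)  # sample SD (assumption)


def is_partial_straightlining(answers: List[Optional[float]], block_count: int) -> bool:
    """Flag a block whose within-person SD falls below 0.5."""
    sd = block_sd(answers, block_count)
    return sd is not None and sd < 0.5
```

For the 18 substantive Barrier items, for example, at least 9 answers are needed before the block is evaluated at all.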

Qualtrics Export Format

Qualtrics CSV exports include 3 header rows: column names (row 0), question text (row 1), and import IDs (row 2). Data starts at row 3. The pipeline handles UTF-8 BOM markers (common in Qualtrics exports), embedded newlines in quoted feedback fields, and both label mode (“TRUE”/“FALSE”) and numeric mode (“1”/“0”) for the Finished column.
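The format handling described above can be sketched with the standard csv module. The function names are hypothetical; when reading from a file, opening it with encoding="utf-8-sig" would remove the BOM instead of the manual strip shown here.

```python
import csv
import io
from typing import Dict, List


def parse_qualtrics_csv(text: str) -> List[Dict[str, str]]:
    """Parse a 3-header-row Qualtrics export into one dict per response."""
    # Drop a leading UTF-8 BOM; newline="" lets the csv module keep
    # embedded newlines inside quoted fields intact.
    reader = csv.reader(io.StringIO(text.lstrip("\ufeff"), newline=""))
    rows = list(reader)
    column_names = rows[0]   # row 0: column names
    data_rows = rows[3:]     # rows 1-2 hold question text and import IDs
    return [dict(zip(column_names, r)) for r in data_rows]


def is_finished(value: str) -> bool:
    """Accept both label mode ("TRUE") and numeric mode ("1")."""
    return value.strip().upper() in ("TRUE", "1")
```

Rows 1 and 2 are skipped by position rather than by content, so the parser does not depend on what the question text or import IDs look like.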

Don’t Know Responses (Readiness & Maturity)

The Readiness and Maturity constructs allow “Don’t Know” as a response option. These are treated as missing data (excluded from person-level means), not mapped to a numeric value. This prevents artificial deflation of construct scores. The Barriers construct does not include a Don’t Know option.
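A short sketch of the missing-data treatment. It assumes answers have already been mapped to numeric scores, with “Don’t Know” left as a string; the constant DONT_KNOW and the function name person_mean are hypothetical.

```python
from typing import List, Optional, Union

DONT_KNOW = "Don't Know"  # assumed label; the exported string may differ

Answer = Union[float, str, None]


def person_mean(answers: List[Answer]) -> Optional[float]:
    """Mean of one person's numeric answers, skipping Don't Know and blanks."""
    numeric = [a for a in answers if isinstance(a, (int, float))]
    if not numeric:
        return None  # nothing to average
    return sum(numeric) / len(numeric)
```

For answers of 4, “Don’t Know”, and 2, the person-level mean is 3.0; mapping “Don’t Know” to 0 instead would pull it down to 2.0.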

Reproducibility

All analysis code is open source. The sensitivity analysis is generated automatically by the daily analysis pipeline and committed to the repository as JSON data.
