Reproducible Analysis Pipeline

Last updated: May 29, 2026, 7:40 AM EDT

The TABS project is committed to open science and computational reproducibility. Every statistic reported in the Culminating Research Project (CRP) can be independently verified by running our public analysis scripts against the released dataset. This page explains how the pipeline works and how you can use it.

The dataset is archived at Penn State ScholarSphere under a CC-BY-4.0 license, ensuring long-term institutional preservation independent of the principal investigator.

Architecture: Single Source of Truth

A common challenge in research software is keeping multiple systems in sync. The TABS project uses two parallel data processing systems: a live TypeScript pipeline for operational triage (approving and flagging survey submissions in real-time) and public Python scripts for research-grade statistical analysis. These systems share critical constants: scale mappings, IRI expected answers, column definitions, and duration thresholds.

To prevent divergence, all shared constants are defined in a single TypeScript file (tabs-survey-constants.ts) that serves as the authoritative source. A CI workflow automatically exports these constants to JSON for the Python scripts to consume. Any change to a constant in one system is immediately validated against the other.

Data Flow

tabs-survey-constants.ts defines all instrument constants (scale mappings, IRI answers, column names, thresholds)
generate-constants-json.ts exports constants to JSON on every commit
disposition.ts (TypeScript) imports constants for live triage
tabs_v2_data_audit.py (Python) reads the JSON for reproducible analysis
validate-analysis.yml (CI) verifies both systems agree on every push

Analysis Scripts

The reproducible analysis pipeline is built around a single unified script that consolidates all analysis phases - sensitivity, advanced statistics, data quality, and psychometric validation - into one run. The unified script runs validation across all five sample definitions, producing per-sample results with sample-size adequacy metadata. The individual component scripts remain available for targeted analysis.

Unified Analysis (tabs_v2_unified_data_analysis.py) - Primary

The recommended way to reproduce all TABS statistics. Consolidates sensitivity analysis, advanced statistics (PCA, regression, ANOVA), data quality audits, and full psychometric validation into a single run. Produces one unified JSON output containing all sections. Validation runs on all five named sample definitions (Conservative Clean through All V2) with per-sample CFA, EFA, HTMT, and reliability metrics.

Key outputs: Unified JSON with sensitivity analysis, advanced statistics, quality audit, and per-sample validation including adequacy metadata (N:parameter ratios). Extracts to sensitivity-analysis.json, live-validation.json, and crp-validation.json.

The following individual scripts are available for targeted analysis. They share the same constants and scale mappings but run independently.

1. Data Audit (tabs_v2_data_audit.py)

Implements the complete 10-step disposition waterfall that determines which survey responses are included in analysis. This is a faithful Python port of the live TypeScript triage logic, ensuring that the research analysis applies exactly the same quality criteria as the operational pipeline.

Waterfall steps: Incomplete check, Prolific auth verification, IRI attention checks (3 constructs), speed flags, Smeal benchmark, reCAPTCHA score, full-block straightlining, partial straightlining (within-person SD), and final CLEAN disposition.

2. Statistical Analysis (tabs_v2_analysis.py)

Computes all descriptive and inferential statistics reported in the CRP: construct means, standard deviations, correlations with 95% confidence intervals, t-tests with effect sizes, ANOVA, sensitivity analysis across sample definitions, and demographic cross-tabulations.

Key outputs: Barrier severity rankings, readiness profiles, maturity assessments, construct correlations, demographic comparisons across role, industry, org size, and profit model.

3. Psychometric Validation (tabs_v2_psychometrics.py)

Comprehensive instrument validation and data quality report covering content validity, construct validity, reliability, IRI attention check effectiveness, missing data patterns, response distributions, common method variance (CMV), and response bias diagnostics.

Key outputs:Cronbach’s alpha, KMO/Bartlett’s, PCA factor extraction, IRI pass rates by construct, missing data percentages, Shapiro-Wilk normality tests, skewness/kurtosis distributions, and Harman’s single-factor CMV test.

4. Advanced Statistics (tabs_v2_advanced.py)

Performs inferential statistics, factor analysis (PCA with Varimax rotation), interaction effects, and moderation analyses that extend the primary descriptive results.

Key outputs:Factor extraction with variance explained, KMO and Bartlett’s tests, budget moderation effects, revenue tiers, role-by-role comparisons, and geographic scope analysis.

5. Data Quality Audit (tabs_v2_quality_audit.py)

Systematically examines dataset flaws, biases, and limitations. Produces a comprehensive quality report including straightlining detection, outlier analysis (Mahalanobis distance), response pattern diagnostics, and order/fatigue effects.

Key outputs: Response entropy analysis, acquiescence bias metrics, extreme response style detection, position-based fatigue, and within-person SD distributions.

6. Defense Deck Validator (validate-deck.py)

Validates 81 statistical claims in the defense presentation (PPTX) against computed values from the source CSV. Ensures consistency between the CRP document and presentation materials.

Key outputs: Per-slide PASS/FAIL summary for item means, standard deviations, grand construct means, and Pearson correlations.

Sample Definitions

The analysis script supports five sample definitions, from most to least restrictive. Each applies different quality filters to the same underlying V2 dataset. Running all statistics against every sample definition demonstrates whether findings are robust to inclusion criteria - a key requirement for publication-grade research.

Sample	Criteria	N
Conservative Clean	Prolific APPROVED + all quality checks (IRI, duration >= 540s, reCAPTCHA, straightlining, auth)	140
Flexible Clean	Prolific APPROVED + basic quality (all 3 IRIs + duration >= 480s)	231
Prolific Accepted	All deduplicated V2 rows with Prolific APPROVED status	440
All V2 Finished	Finished + duration >= 120s (extreme speeders excluded)	675
All V2	All V2 responses including incomplete	782

N values are populated by running python tabs_v2_unified_data_analysis.py <csv> --json unified.json and extracting the sensitivity section from the unified output.

Sensitivity Analysis

Every key statistic is computed across all sample definitions to verify that conclusions do not depend on a single set of inclusion criteria. If a finding holds across Conservative Clean, Flexible Clean, and Prolific Accepted samples, it is robust. If it shifts substantially, the sensitivity analysis flags it for discussion.

Metric	Conservative Clean	Flexible Clean	Prolific Accepted	All V2 Finished	All V2
Barrier Grand Mean	2.8566	2.8804	2.8867	2.8941	2.8957
Barrier SD	0.6641	0.7199	0.7267	0.7835	0.7833
Readiness Grand Mean	3.0468	3.0546	3.093	3.2017	3.2018
Readiness SD	0.5978	0.6631	0.656	0.7121	0.7106
Maturity Grand Mean	3.0558	3.0495	3.1135	3.2204	3.2206
Maturity SD	0.7069	0.8104	0.7934	0.8087	0.8081
B-R Correlation	-0.4083	-0.433	-0.3299	-0.2585	-0.2581
B-M Correlation	-0.189	-0.2872	-0.2472	-0.2291	-0.2291
R-M Correlation	0.6001	0.7066	0.7197	0.7187	0.7187
Alpha Barriers	0.8664	0.8859	0.8897	0.9089	0.9089
Alpha Readiness	0.872	0.9123	0.9128	0.9267	0.9267
Alpha Maturity	0.8264	0.8864	0.8855	0.8924	0.8924

How to Reproduce the Analysis

Anyone can independently verify the statistics reported in the CRP by following these steps:

# Clone the repository
git clone https://github.com/clarkemoyer/technologyadoptionbarriers.org.git
cd technologyadoptionbarriers.org/scripts/analysis

# Install Python dependencies (pinned versions)
pip install -r requirements.txt

# ── Recommended: Unified analysis (reproduces all results) ──

# Run with production data (from ScholarSphere)
python tabs_v2_unified_data_analysis.py <path_to_csv> --json unified.json

# Run CRP-200 frozen dataset
python tabs_v2_unified_data_analysis.py <path_to_crp_csv> --crp200 --json crp-unified.json

# Extract individual JSON files from unified output
python -c "import json; d=json.load(open('unified.json')); json.dump(d['sensitivity'], open('sensitivity-analysis.json','w'), indent=2)"
python -c "import json; d=json.load(open('unified.json')); json.dump(d['validation'], open('live-validation.json','w'), indent=2)"

# Choose a different primary sample for advanced analysis
python tabs_v2_unified_data_analysis.py <csv> --json unified.json \
  --primary-sample flexible_clean

# ── Alternative: Individual scripts (for targeted analysis) ──

python tabs_v2_data_audit.py --input <path_to_csv>
python tabs_v2_analysis.py <path_to_csv>
python tabs_v2_psychometrics.py <path_to_csv>
python tabs_v2_advanced.py <path_to_csv>
python tabs_v2_quality_audit.py <path_to_csv>

# Validate defense presentation
python ../validate-deck.py <path_to_csv> <path_to_pptx>

Cross-Platform Validation Kits

Three self-contained download bundles let a colleague reproduce the published statistics in their preferred analysis tool without first installing Python, cloning the repo, or wiring up dependencies. Each zip ships with the frozen N=200 CRP-2026 dataset, an auto-extract launcher script, and the platform-native syntax/macros needed to run the analyses end-to-end.

Scope by platform:SPSS and Minitab cover the descriptive layer (Cronbach’s alpha, KMO, Bartlett, EFA, inter-construct correlations, group comparisons, Mahalanobis outliers). The CFA-derived statistics (omega from CFA, CR with standard errors, AVE, HTMT, HTMT2, bifactor, second-order, multigroup, measurement invariance, ESEM, IRT GRM, Mardia normality) require an SEM package and live in the Python kit, which mirrors the canonical TABS pipeline.

Kit	Platform target	Coverage	Download
tabs_v2_validation_spss.zip	IBM SPSS Statistics 31.0+	Descriptive psychometric layer	Download (~44 KB)
tabs_v2_validation_minitab.zip	Minitab 21+	Descriptive psychometric layer	Download (~28 KB)
tabs_v2_validation_python.zip	Python 3.12+ (TABS-native canonical)	Full pipeline including CFA, HTMT, IRT, multigroup	Download (~84 KB)

Auto-extract launchers

Each zip contains a one-click launcher that extracts the bundle, opens the right tool (SPSS Production Facility, Minitab macro, or a Python virtualenv), and runs the validation script with no manual setup beyond the platform license itself:

Windows: 0_DOUBLE_CLICK_ME_TO_START_WINDOWS.bat -- right-click the zip, “Extract All...”, then double-click the launcher inside the extracted folder.
macOS: 0_DOUBLE_CLICK_ME_TO_START_MAC.command -- double-click the zip in Finder, then double-click the .command file. macOS may ask you to approve unsigned scripts the first time; the project source is public so you can audit the launcher before approving.

The launcher source (and the platform-native syntax it runs) is browsable on GitHub: SPSS launcher, Minitab launcher, and the bundle definitions in build_zip.py.

Test Data & Automated Testing

The repository includes multiple test datasets and a comprehensive pytest suite so that anyone can verify the analysis logic without production data.

File	Records	Purpose
test_data.csv	5	Simplified format for quick logic checks with 5 actual response rows (clean, IRI fail, duration fail, Don’t Know); blank Qualtrics-style metadata rows are not included in the count
test_data_qualtrics.csv	15	Full Qualtrics CSV format with realistic headers and diverse demographic combinations
tests/generate_test_data.py	-	Deterministic generator for the production-format synthetic dataset. Public directory browsing is intentionally not linked here while the repository remediates and sanitizes the test CSV per the PII policy.

The test suite includes a comprehensive set of pytest modules covering every analysis script, cross-validation between scripts, edge cases, CLI argument parsing, and the operational pipeline tools. Tests run automatically in CI on every push.

Preventing Drift Between Systems

The most dangerous failure mode in a dual-system research pipeline is silent divergence: the live pipeline and the public scripts gradually drift apart until they produce different results from the same data. The TABS project prevents this through three mechanisms:

Centralized constants: All instrument definitions (scale labels, IRI answers, column names, thresholds) live in one file. Neither the TypeScript nor Python code hardcodes these values independently.
CI validation: Every commit that touches constants or analysis scripts triggers an automated workflow that regenerates the JSON export, validates all values match, and runs the Python scripts against test data.
Faithful port: The Python disposition waterfall (tabs_v2_data_audit.py) implements the exact same 10-step logic as the TypeScript live pipeline (disposition.ts), including matching within-person SD calculations for partial straightlining detection.

What the Shared Constants Cover

Constant Category	Examples	Why It Matters
Scale Mappings	“Major Barrier” = 5, “Very Low Readiness/Capability” = 1	If one system maps a label to the wrong number, all statistics diverge silently
IRI Expected Answers	Barrier: “Major Barrier”, Readiness: “Low Readiness/Capability”	Wrong IRI answer = wrong disposition = wrong sample = wrong conclusions
Column Names	Q10-28_Barriers_19 (IRI), Q47-64_Readiness_18 (IRI)	Column mismatch reads the wrong data entirely
Duration Thresholds	Speed flag: 300s, Smeal: 300-540s, Clean: 480s	Different thresholds produce different sample sizes and statistics

Open Science Commitments

Open source code: All scripts are public on GitHub
Pinned dependencies: requirements.txt locks exact package versions for reproducibility across environments
Test data included: Three test datasets (5 + 15 + 500 records) and a deterministic generator exercise all processing paths without requiring production data access
Automated test suite: A comprehensive pytest suite with CI integration verifies script correctness on every commit
Versioned datasets: Each data release (N=200, N=500, annual) receives its own DOI via ScholarSphere
CC-BY-4.0 license: Both code and data are freely reusable with attribution
CI-validated: Automated checks ensure constants match, scripts run, and outputs are correct on every code change

Get Involved

If you find an issue with the analysis scripts or want to contribute improvements, the repository welcomes pull requests. The analysis scripts are designed to be extended: add new statistical tests, improve visualizations, or adapt the pipeline for your own research context.

View Scripts on GitHub View Test Suite Access Dataset on ScholarSphere