Google Analytics Integration
Google Analytics 4 (GA4) tracks how researchers, participants, and the public interact with the TABS website. Understanding real visitor behavior is critical for an academic research project — it helps us measure dissemination impact, identify which content resonates, and report meaningful engagement numbers to stakeholders.
This page explains how we integrate with GA4, how our automated reporting works, and — most importantly — the methodology behind our "Verified Visitors" metric, which filters out test and automation traffic to surface only genuine human visitors.
How GA4 Is Integrated
Tag Management
GA4 is deployed through Google Tag Manager (GTM), which loads on every page via the root layout. GTM acts as a container for all analytics tags, allowing us to manage tracking without modifying application code. The GA4 measurement ID is configured inside the GTM container, not hardcoded in the site source.
Privacy & Consent
A cookie consent banner gates all analytics tracking. When a visitor first arrives, GTM loads but analytics tags are suppressed until the visitor grants consent. If consent is granted, a consent_update event is pushed to window.dataLayer, enabling GA4, Microsoft Clarity, and Meta Pixel. If consent is declined, no tracking cookies are set and no analytics data is collected.
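The gating above might be sketched roughly as follows. The `consent_update` event name comes from this page, but the payload fields and the handler shape are assumptions, and a plain array stands in for `window.dataLayer`:

```javascript
// Minimal sketch of consent-gated tracking (hypothetical payload shape).
// In the browser this would be window.dataLayer; a plain array stands in here.
const dataLayer = [];

function onConsentChoice(granted) {
  if (!granted) {
    // Declined: push nothing, so GTM never enables the analytics tags.
    return;
  }
  // Granted: push the consent event that GTM triggers listen for.
  dataLayer.push({
    event: "consent_update",
    analytics_consent: "granted", // hypothetical field name
  });
}

onConsentChoice(true);
```

Because GTM itself loads regardless, the consent event is the single switch: tags configured to fire on `consent_update` stay dormant until it appears in the dataLayer.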
This approach respects visitor privacy while providing aggregate, anonymized insights for measuring research impact.
Data API & Service Account
To programmatically access GA4 data, we use the Google Analytics Data API v1. A service account authenticates with a private key stored as a GitHub Actions secret in the google-prod environment. The service account has read-only access to the GA4 property — it cannot modify tracking configuration or delete data.
Key environment values
- `GA_PROPERTY_ID` — The GA4 property identifier
- `GOOGLE_SERVICE_ACCOUNT_EMAIL` — Service account email
- `GOOGLE_PRIVATE_KEY` — Service account private key (secret)
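A sketch of how a reporting script might assemble these values. The variable names match the list above and the `properties/<id>` path format is the Data API convention, but the un-escaping of the private key is a common pattern for single-line secrets, not necessarily what the TABS script does:

```javascript
// Sketch: turning the three environment values into a config object for a
// GA4 Data API client (e.g. @google-analytics/data). The \n replacement
// undoes the escaping applied when a multi-line key is stored as a
// one-line GitHub Actions secret.
function buildGaConfig(env) {
  const { GA_PROPERTY_ID, GOOGLE_SERVICE_ACCOUNT_EMAIL, GOOGLE_PRIVATE_KEY } = env;
  if (!GA_PROPERTY_ID || !GOOGLE_SERVICE_ACCOUNT_EMAIL || !GOOGLE_PRIVATE_KEY) {
    throw new Error("Missing GA4 environment configuration");
  }
  return {
    property: `properties/${GA_PROPERTY_ID}`,
    credentials: {
      client_email: GOOGLE_SERVICE_ACCOUNT_EMAIL,
      private_key: GOOGLE_PRIVATE_KEY.replace(/\\n/g, "\n"),
    },
  };
}

const cfg = buildGaConfig({
  GA_PROPERTY_ID: "123456789", // illustrative values, not the real property
  GOOGLE_SERVICE_ACCOUNT_EMAIL: "reporter@example.iam.gserviceaccount.com",
  GOOGLE_PRIVATE_KEY: "-----BEGIN PRIVATE KEY-----\\nABC\\n-----END PRIVATE KEY-----",
});
```

Failing fast when any value is missing keeps a misconfigured workflow run from silently producing empty reports.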
Automated Daily Reporting
A GitHub Actions workflow (ga-report.yml) runs daily at 00:00 UTC. It fetches a 28-day rolling window of analytics data and produces two outputs:
- Detailed JSON report — Saved to `reports/` with per-page breakdowns, user counts, sessions, and engagement rates. This archive provides a longitudinal record of site performance.
- Public impact stats — Written to `src/data/impact.json`, which is imported by the homepage and media page to display live metrics like active users, page views, and verified visitors. These numbers update automatically when the workflow runs.
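For illustration, a hypothetical shape for `src/data/impact.json`; the actual field names used by the workflow are not shown on this page, so everything below is an assumption:

```javascript
// Hypothetical shape of src/data/impact.json; the real field names used
// by the TABS workflow may differ. Values are illustrative.
const impact = {
  generatedAt: "2026-02-01T00:00:00Z",
  windowDays: 28,        // matches the 28-day rolling window
  activeUsers: 82,       // production hostname only
  pageViews: 1234,       // illustrative value
  verifiedVisitors: 82,
};

// The workflow would serialize this and commit it to src/data/impact.json,
// where the homepage and media page import it as static data.
const json = JSON.stringify(impact, null, 2);
```

Committing the file into the repository (rather than fetching GA4 client-side) means the public pages never need analytics credentials.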
A companion script emails a formatted summary to the project team, including top pages, engagement trends, and the verified visitor count.
Workflow trigger
The workflow can also be triggered manually via workflow_dispatch. Manual runs include additional diagnostic output (channel and hostname breakdowns) that is suppressed during scheduled runs to keep logs clean and avoid exposing detailed analytics data unnecessarily.
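The trigger side of such a workflow might look like the sketch below; the job layout and script path are assumptions, and only the daily schedule, the manual trigger, and the `google-prod` environment come from this page:

```yaml
# Sketch of ga-report.yml triggers; the real workflow may differ
on:
  schedule:
    - cron: "0 0 * * *"    # daily at 00:00 UTC
  workflow_dispatch:        # manual runs include extra diagnostics

jobs:
  report:
    runs-on: ubuntu-latest
    environment: google-prod   # holds the service-account secrets
    steps:
      - run: node scripts/ga-report.js   # hypothetical script path
```

Inside the script, checking whether the run came from `workflow_dispatch` (e.g. via the `GITHUB_EVENT_NAME` environment variable) is one way to decide whether to print the extra diagnostic breakdowns.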
The Verified Visitors Methodology
This section documents the most significant analytics challenge we encountered and how we solved it. It serves as a case study in why raw analytics numbers can be deeply misleading for projects that use automated testing.
The Problem: Inflated Numbers
When we first displayed analytics on the site, we showed Page Views (65,930) as a measure of engagement. This number seemed impressive but didn't feel right — the project was relatively new and hadn't received that level of organic attention.
Switching to Active Users (23,766) didn't help — the number was still far too high. The question was: where was all this traffic coming from?
Root Cause: Automated Testing Traffic
Investigation via the GA4 Data API revealed the answer. By breaking down active users by hostName, we discovered:
| Hostname | Active Users | Percentage |
|---|---|---|
| localhost | 21,318 | 99.6% |
| technologyadoptionbarriers.org | 82 | 0.4% |
| (other) | 1 | <0.01% |
99.6% of all recorded traffic came from localhost. This traffic was generated by:
- Playwright E2E tests — The CI/CD pipeline runs full end-to-end tests using real Chromium browsers against a local preview server. Because the tests execute in real browsers, GTM and GA4 tags fire normally, creating real GA4 sessions.
- AI coding agents — Multiple AI tools use Playwright to browse and validate the site during development, each generating distinct browser sessions.
- Local development — Developer sessions on `localhost:3000` also trigger GA4 tracking.
Key insight
Unlike traditional bot traffic that might have unusual user agents or patterns, Playwright E2E tests run in real Chromium with real JavaScript execution. They are indistinguishable from human traffic to GA4 — they just happen to be on localhost instead of the production domain.
The Solution: Hostname Filtering
The fix is straightforward: filter by hostname. Since all legitimate visitor traffic arrives at technologyadoptionbarriers.org, we extract only the production hostname row from the GA4 hostname breakdown.
How it works
- The daily report script queries GA4 for `activeUsers` broken down by `hostName`.
- The script finds the row where `hostName` equals `technologyadoptionbarriers.org`.
- That row's `activeUsers` value becomes the Verified Visitors metric.
- The number is saved to `src/data/impact.json` as `verifiedVisitors` and displayed on the site.
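The steps above can be sketched as a small selection function. The row shape mirrors a Data API response with `hostName` as the only dimension and `activeUsers` as the only metric; the function name and sample data are illustrative:

```javascript
// Sketch: extract the production row from a GA4 hostname breakdown.
// Row shape mirrors the Data API response for one dimension + one metric:
// { dimensionValues: [{ value: <hostName> }], metricValues: [{ value: "82" }] }
const PROD_HOST = "technologyadoptionbarriers.org";

function verifiedVisitors(rows) {
  const prodRow = rows.find(
    (r) => r.dimensionValues[0].value === PROD_HOST
  );
  // No production row means zero verified visitors, not an error.
  return prodRow ? Number(prodRow.metricValues[0].value) : 0;
}

const rows = [
  { dimensionValues: [{ value: "localhost" }], metricValues: [{ value: "21318" }] },
  { dimensionValues: [{ value: PROD_HOST }], metricValues: [{ value: "82" }] },
];
console.log(verifiedVisitors(rows)); // 82
```

Returning 0 rather than throwing when the production row is absent keeps a quiet reporting period from failing the whole workflow.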
Why Not Use GA4 Filters?
GA4 offers both property-level data filters and API-level dimensionFilter parameters. We attempted three different approaches to filter at the API level:
- `dimensionFilter` with `inListFilter` on `hostName` — filter was silently ignored
- `dimensionFilter` with `stringFilter` on `hostName` — filter was silently ignored
- Adding `hostName` as a dimension alongside the filter — still returned localhost rows
The GA4 Data API's `dimensionFilter` does not reliably apply when using `metricAggregations: TOTAL` without row-level dimensions. Rather than continue debugging an opaque API behavior, we adopted the pragmatic approach: fetch the breakdown, filter in JavaScript.
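Under that approach, the request simply asks for the full breakdown and leaves selection to the caller. A sketch of such a request body (the field names follow the public `runReport` request schema; the helper itself is hypothetical):

```javascript
// Sketch: a runReport request that fetches the hostname breakdown
// instead of relying on dimensionFilter. Filtering then happens in JS.
function buildHostnameReport(propertyId, days = 28) {
  return {
    property: `properties/${propertyId}`,
    dateRanges: [{ startDate: `${days}daysAgo`, endDate: "today" }],
    dimensions: [{ name: "hostName" }],
    metrics: [{ name: "activeUsers" }],
  };
}

const req = buildHostnameReport("123456789"); // illustrative property ID
```

The trade-off is a slightly larger response (one row per hostname), which is negligible at this traffic scale and buys fully debuggable filtering logic.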
A property-level filter in the GA4 admin console could also work, but would permanently exclude localhost data from all reports, which is undesirable for debugging and development monitoring.
What "Verified Visitors" Means
The Verified Visitors metric represents:
- Active users as defined by GA4 (users who had an engaged session or first visit within the reporting period)
- On the production hostname only (`technologyadoptionbarriers.org`)
- 28-day rolling window (matching the report period)
It explicitly excludes:
- `localhost` traffic (Playwright tests, AI agents, local dev)
- Staging or preview deployment traffic
- Any other non-production hostname
Transparency note
This metric does not apply additional channel or source filtering. Some automated visitors (e.g., bots that visit the production domain) may still be included. We chose not to over-filter because it is better to slightly overcount real visitors than to accidentally exclude legitimate human traffic from organic search, social media, or direct links.
Lessons Learned
1. Automated Tests Create Real Analytics
If your CI/CD pipeline runs E2E tests in real browsers (Playwright, Cypress, Selenium) against a site that has analytics tags, those tests will generate real analytics data. This is not a bug — it's expected behavior. Plan for it from day one by either excluding localhost at the property level or filtering in your reporting pipeline.
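A complementary option, beyond property-level exclusion or report-time filtering, is to gate tag loading on the hostname so non-production browsers never fire analytics at all. A minimal sketch, with the production hostname hardcoded for illustration:

```javascript
// Sketch: only load analytics on the production hostname, so CI browsers
// on localhost never fire GA4 tags in the first place.
function shouldLoadAnalytics(hostname) {
  return hostname === "technologyadoptionbarriers.org";
}

// In the browser, the guard would wrap the GTM snippet:
// if (shouldLoadAnalytics(location.hostname)) { /* inject GTM */ }
```

The cost of this approach is that you lose the ability to verify tag behavior locally, which is one reason report-time filtering can be the better trade-off.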
2. GA4 API Filters Have Quirks
The GA4 Data API's dimensionFilter parameter does not always behave as documented, especially when combined with metricAggregations and no row-level dimensions. If you encounter filter issues, fetch the full breakdown and filter in your application code. It's more explicit and debuggable.
3. Always Show Your Methodology
Raw analytics numbers without context are meaningless at best, misleading at worst. This page exists because we believe in transparency — if we display a number on the site, we should explain exactly how it was derived and what it does and does not represent.
Related
- Technical Integrations overview
- Qualtrics Integration — survey engine and API workflows
- Prolific Integration — participant recruitment methodology
This page documents the analytics methodology as of February 2026. The verified visitors approach was developed iteratively during issue #333 of the TABS project.
