Skip to content

Conversation

@cmraible
Copy link
Collaborator

@cmraible cmraible commented Dec 17, 2025

ref https://linear.app/ghost/issue/NY-865/analytics-sources-not-populating-for-tangle-due-to-408-timeouts

These endpoints were scanning _mv_hits twice: once through filtered_sessions to get qualifying session IDs, then again to aggregate by location/pathname. When no session-level filters (source, device, utm_*) are provided, the endpoints now query _mv_hits directly in a single pass. This should significantly reduce query time and timeouts for these high-traffic endpoints.

@cmraible cmraible marked this pull request as ready for review December 17, 2025 06:40
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 17, 2025

Walkthrough

These changes modify two Tinybird pipeline endpoints to add a DESCRIPTION and implement conditional SQL paths based on the presence of session-level filters. If any of source, device, utm_source, utm_medium, utm_campaign, utm_term, or utm_content are defined, the pipeline uses a join with filtered_sessions and applies the site_uuid filter; otherwise it queries _mv_hits directly with site_uuid and time-range filters (using date_from/date_to with timezone defaults). Existing filters (member_status, location, pathname, post_uuid), grouping, and ordering logic are preserved.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Review both pipeline files for consistent conditional logic and identical filter checks.
  • Verify timestamp handling and default ranges (last 7 days vs. end-of-day +1) and timezone application in the direct _mv_hits path.
  • Confirm join condition correctness and that site_uuid and other filters apply correctly in both branches.
  • Ensure DESCRIPTION accurately documents behavior.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly describes the main optimization: avoiding double-scan in top_locations and top_pages endpoints.
Description check ✅ Passed The description clearly explains the performance issue being addressed and the solution implemented, directly relating to the changeset modifications.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch short-circuit-filtered-sessions-in-locations-and-pages

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

These endpoints were scanning _mv_hits twice: once through filtered_sessions
to get qualifying session IDs, then again to aggregate by location/pathname.
When no session-level filters (source, device, utm_*) are provided, the
endpoints now query _mv_hits directly in a single pass. This should
significantly reduce query time and timeouts for these high-traffic endpoints.
@cmraible cmraible force-pushed the short-circuit-filtered-sessions-in-locations-and-pages branch from da037b6 to dc83b28 Compare December 18, 2025 18:02
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da037b6 and dc83b28.

📒 Files selected for processing (2)
  • ghost/core/core/server/data/tinybird/endpoints/api_top_locations.pipe (1 hunks)
  • ghost/core/core/server/data/tinybird/endpoints/api_top_pages.pipe (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-11T18:43:07.356Z
Learnt from: cmraible
Repo: TryGhost/Ghost PR: 25705
File: ghost/core/core/server/data/tinybird/pipes/mv_session_data.pipe:5-5
Timestamp: 2025-12-11T18:43:07.356Z
Learning: In Tinybird pipe files (.pipe), a standalone '%' character at the start of a SQL block enables Tinybird templating (e.g., {{ String(site_uuid, ...) }}). This is valid syntax and should not be flagged as an error. Apply this guidance to all .pipe files in the repository; reviewers should not treat these templated blocks as syntax errors and should ensure tooling/linting recognizes the templating syntax.

Applied to files:

  • ghost/core/core/server/data/tinybird/endpoints/api_top_pages.pipe
  • ghost/core/core/server/data/tinybird/endpoints/api_top_locations.pipe
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Tinybird Tests
  • GitHub Check: Lint
  • GitHub Check: Unit tests (Node 22.18.0)
  • GitHub Check: Build & Push Docker Image
🔇 Additional comments (2)
ghost/core/core/server/data/tinybird/endpoints/api_top_pages.pipe (1)

5-7: LGTM - Clear documentation of the optimization strategy.

The description accurately explains the short-circuit behavior for performance improvement.

ghost/core/core/server/data/tinybird/endpoints/api_top_locations.pipe (1)

15-37: Consistent implementation with api_top_pages.pipe.

The optimization pattern correctly mirrors the approach in api_top_pages.pipe, ensuring both endpoints benefit from the performance improvement when no session-level filters are provided.

Comment on lines +16 to +38
{% if defined(source) or defined(device) or defined(utm_source) or defined(utm_medium) or defined(utm_campaign) or defined(utm_term) or defined(utm_content) %}
{# Session-level filters require filtered_sessions join #}
from _mv_hits h
inner join filtered_sessions fs
on fs.session_id = h.session_id
where
site_uuid = {{String(site_uuid, 'mock_site_uuid', description="Tenant ID", required=True)}}
{% else %}
{# No session-level filters - query _mv_hits directly for better performance #}
from _mv_hits h
where
site_uuid = {{String(site_uuid, 'mock_site_uuid', description="Tenant ID", required=True)}}
{% if defined(date_from) %}
and timestamp >= toDateTime({{ Date(date_from) }}, {{String(timezone, 'Etc/UTC', description="Site timezone", required=True)}})
{% else %}
and timestamp >= toDateTime(dateAdd(day, -7, today()), {{String(timezone, 'Etc/UTC', description="Site timezone", required=True)}})
{% end %}
{% if defined(date_to) %}
and timestamp < toDateTime({{ Date(date_to) }}, {{String(timezone, 'Etc/UTC', description="Site timezone", required=True)}}) + interval 1 day
{% else %}
and timestamp < toDateTime(dateAdd(day, 1, today()), {{String(timezone, 'Etc/UTC', description="Site timezone", required=True)}})
{% end %}
{% end %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -type f -name "*filtered_sessions*" | head -20

Repository: TryGhost/Ghost

Length of output: 228


🏁 Script executed:

find . -name "*filtered_sessions*" -o -name "*.incl" | head -20

Repository: TryGhost/Ghost

Length of output: 196


🏁 Script executed:

cat -n ./ghost/core/core/server/data/tinybird/pipes/filtered_sessions.pipe

Repository: TryGhost/Ghost

Length of output: 5505


🏁 Script executed:

cat -n ./ghost/core/core/server/data/tinybird/endpoints/api_top_pages.pipe

Repository: TryGhost/Ghost

Length of output: 3474


Add date filtering when joining filtered_sessions to ensure consistent date range across both query paths.

When session-level filters are present, the query joins filtered_sessions but lacks explicit date_from/date_to filtering on the _mv_hits table in the WHERE clause (lines 16-22). This creates inconsistent behavior compared to the direct query path (lines 28-37), where date filtering is applied directly to timestamps. Sessions can return hits outside the requested date range because the join only filters session IDs without constraining the hit timestamps. Add timestamp filters to the WHERE clause when using the filtered_sessions join path to maintain consistency—applying the same 7-day default and date range logic as the direct path.

🤖 Prompt for AI Agents
In ghost/core/core/server/data/tinybird/endpoints/api_top_pages.pipe around
lines 16 to 38, the branch that joins filtered_sessions is missing the timestamp
filters applied in the direct _mv_hits path; update the WHERE clause in the
filtered_sessions branch to add the same date filtering logic as the else
branch: apply timestamp >= toDateTime(Date(date_from) or default
toDateTime(dateAdd(day, -7, today())) with the same timezone param, and
timestamp < toDateTime(Date(date_to) or default toDateTime(dateAdd(day, 1,
today()))) (or the explicit + interval 1 day for date_to) so hits are
constrained to the requested/default date range when using the filtered_sessions
join.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants