-
-
Notifications
You must be signed in to change notification settings - Fork 11.2k
Optimized top_locations and top_pages endpoints to avoid double-scan #25738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Optimized top_locations and top_pages endpoints to avoid double-scan #25738
Conversation
WalkthroughThese changes modify two Tinybird pipeline endpoints to add a DESCRIPTION and implement conditional SQL paths based on the presence of session-level filters. If any of source, device, utm_source, utm_medium, utm_campaign, utm_term, or utm_content are defined, the pipeline uses a join with Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
These endpoints were scanning _mv_hits twice: once through filtered_sessions to get qualifying session IDs, then again to aggregate by location/pathname. When no session-level filters (source, device, utm_*) are provided, the endpoints now query _mv_hits directly in a single pass. This should significantly reduce query time and timeouts for these high-traffic endpoints.
da037b6 to
dc83b28
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
ghost/core/core/server/data/tinybird/endpoints/api_top_locations.pipe(1 hunks)ghost/core/core/server/data/tinybird/endpoints/api_top_pages.pipe(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-11T18:43:07.356Z
Learnt from: cmraible
Repo: TryGhost/Ghost PR: 25705
File: ghost/core/core/server/data/tinybird/pipes/mv_session_data.pipe:5-5
Timestamp: 2025-12-11T18:43:07.356Z
Learning: In Tinybird pipe files (.pipe), a standalone '%' character at the start of a SQL block enables Tinybird templating (e.g., {{ String(site_uuid, ...) }}). This is valid syntax and should not be flagged as an error. Apply this guidance to all .pipe files in the repository; reviewers should not treat these templated blocks as syntax errors and should ensure tooling/linting recognizes the templating syntax.
Applied to files:
ghost/core/core/server/data/tinybird/endpoints/api_top_pages.pipeghost/core/core/server/data/tinybird/endpoints/api_top_locations.pipe
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Tinybird Tests
- GitHub Check: Lint
- GitHub Check: Unit tests (Node 22.18.0)
- GitHub Check: Build & Push Docker Image
🔇 Additional comments (2)
ghost/core/core/server/data/tinybird/endpoints/api_top_pages.pipe (1)
5-7: LGTM - Clear documentation of the optimization strategy.The description accurately explains the short-circuit behavior for performance improvement.
ghost/core/core/server/data/tinybird/endpoints/api_top_locations.pipe (1)
15-37: Consistent implementation with api_top_pages.pipe.The optimization pattern correctly mirrors the approach in
api_top_pages.pipe, ensuring both endpoints benefit from the performance improvement when no session-level filters are provided.
| {% if defined(source) or defined(device) or defined(utm_source) or defined(utm_medium) or defined(utm_campaign) or defined(utm_term) or defined(utm_content) %} | ||
| {# Session-level filters require filtered_sessions join #} | ||
| from _mv_hits h | ||
| inner join filtered_sessions fs | ||
| on fs.session_id = h.session_id | ||
| where | ||
| site_uuid = {{String(site_uuid, 'mock_site_uuid', description="Tenant ID", required=True)}} | ||
| {% else %} | ||
| {# No session-level filters - query _mv_hits directly for better performance #} | ||
| from _mv_hits h | ||
| where | ||
| site_uuid = {{String(site_uuid, 'mock_site_uuid', description="Tenant ID", required=True)}} | ||
| {% if defined(date_from) %} | ||
| and timestamp >= toDateTime({{ Date(date_from) }}, {{String(timezone, 'Etc/UTC', description="Site timezone", required=True)}}) | ||
| {% else %} | ||
| and timestamp >= toDateTime(dateAdd(day, -7, today()), {{String(timezone, 'Etc/UTC', description="Site timezone", required=True)}}) | ||
| {% end %} | ||
| {% if defined(date_to) %} | ||
| and timestamp < toDateTime({{ Date(date_to) }}, {{String(timezone, 'Etc/UTC', description="Site timezone", required=True)}}) + interval 1 day | ||
| {% else %} | ||
| and timestamp < toDateTime(dateAdd(day, 1, today()), {{String(timezone, 'Etc/UTC', description="Site timezone", required=True)}}) | ||
| {% end %} | ||
| {% end %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
fd -type f -name "*filtered_sessions*" | head -20Repository: TryGhost/Ghost
Length of output: 228
🏁 Script executed:
find . -name "*filtered_sessions*" -o -name "*.incl" | head -20Repository: TryGhost/Ghost
Length of output: 196
🏁 Script executed:
cat -n ./ghost/core/core/server/data/tinybird/pipes/filtered_sessions.pipeRepository: TryGhost/Ghost
Length of output: 5505
🏁 Script executed:
cat -n ./ghost/core/core/server/data/tinybird/endpoints/api_top_pages.pipeRepository: TryGhost/Ghost
Length of output: 3474
Add date filtering when joining filtered_sessions to ensure consistent date range across both query paths.
When session-level filters are present, the query joins filtered_sessions but lacks explicit date_from/date_to filtering on the _mv_hits table in the WHERE clause (lines 16-22). This creates inconsistent behavior compared to the direct query path (lines 28-37), where date filtering is applied directly to timestamps. Sessions can return hits outside the requested date range because the join only filters session IDs without constraining the hit timestamps. Add timestamp filters to the WHERE clause when using the filtered_sessions join path to maintain consistency—applying the same 7-day default and date range logic as the direct path.
🤖 Prompt for AI Agents
In ghost/core/core/server/data/tinybird/endpoints/api_top_pages.pipe around
lines 16 to 38, the branch that joins filtered_sessions is missing the timestamp
filters applied in the direct _mv_hits path; update the WHERE clause in the
filtered_sessions branch to add the same date filtering logic as the else
branch: apply timestamp >= toDateTime(Date(date_from) or default
toDateTime(dateAdd(day, -7, today())) with the same timezone param, and
timestamp < toDateTime(Date(date_to) or default toDateTime(dateAdd(day, 1,
today()))) (or the explicit + interval 1 day for date_to) so hits are
constrained to the requested/default date range when using the filtered_sessions
join.
ref https://linear.app/ghost/issue/NY-865/analytics-sources-not-populating-for-tangle-due-to-408-timeouts
These endpoints were scanning
_mv_hitstwice: once throughfiltered_sessionsto get qualifying session IDs, then again to aggregate by location/pathname. When no session-level filters (source, device,utm_*) are provided, the endpoints now query_mv_hitsdirectly in a single pass. This should significantly reduce query time and timeouts for these high-traffic endpoints.