SQL(Athena/Presto) and Python Reference Guide for Data Manipulation
This reference guide provides practical SQL examples for common data transformation tasks in Athena/Presto Syntax. To use this reference in any other SQL version, upload the code to any AI tool and ask to convert. Each section includes working queries with inline comments explaining datatypes and outputs.
For Postgres Edition, refer to this.
Use this code snippet to work with timestamps, dates, and time dimensions for reporting and analytics. This code converts a single timestamp into multiple time formats and dimensions. Use this to understand how to extract dates, weeks, months, quarters, and years from timestamps.
Generates a row for every day in the last 365 days with week and month boundaries. Use this as a lookup table to join with fact tables for time-based reporting and aggregations.
Demonstrates text manipulation, pattern matching, and string parsing. Use this to clean, transform, and extract data from text fields like names, emails, URLs, and product codes.
Demonstrates ranking, aggregation, and offset functions for B2B usage-based pricing. Use this to understand how to perform running totals, rankings, comparisons, and analytics across ordered data.
All examples use Presto/Trino SQL syntax and include practical B2B SaaS use cases with fictitious company names for illustration purposes.
WITH time_dimensions AS ( SELECT '2025-12-31T13:01:49.459Z' AS ts_iso_utc, 1735651309.459 AS epoch_seconds, 'Europe/London' AS timezone, '2025-12-31 13:01:49' AS ts_string)SELECT ts_iso_utc, -- varchar: '2025-12-31T13:01:49.459Z' epoch_seconds, -- double: 1735651309.459 from_unixtime(epoch_seconds) AS ts_from_epoch, -- timestamp: 2025-12-31 13:01:49.459 UTC from_iso8601_timestamp(ts_iso_utc) AS ts_utc, -- timestamp: 2025-12-31 13:01:49.459 UTC date_trunc('second', from_iso8601_timestamp(ts_iso_utc)) AS ts_iso_utc_datetime, -- timestamp: 2025-12-31 13:01:49 CAST(from_iso8601_timestamp(ts_iso_utc) AS date) AS utc_date, -- date: 2025-12-31 CAST(date_trunc('week', from_iso8601_timestamp(ts_iso_utc)) - INTERVAL '1' day AS date) AS utc_week_start, -- date: 2025-12-28 CAST(date_trunc('week', from_iso8601_timestamp(ts_iso_utc)) - INTERVAL '1' day + INTERVAL '6' day AS date) AS utc_week_end, -- date: 2026-01-03 CAST(date_trunc('month', from_iso8601_timestamp(ts_iso_utc)) AS date) AS utc_month_start, -- date: 2025-12-01 CAST(date_trunc('month', from_iso8601_timestamp(ts_iso_utc) + INTERVAL '1' month) - INTERVAL '1' day AS date) AS utc_month_end, -- date: 2025-12-31 CAST(date_trunc('quarter', from_iso8601_timestamp(ts_iso_utc)) AS date) AS utc_quarter_start, -- date: 2025-10-01 year(from_iso8601_timestamp(ts_iso_utc)) AS utc_year, -- bigint: 2025 CAST(from_iso8601_timestamp(ts_iso_utc) AS date) + INTERVAL '30' day AS date_plus_30_days, -- date: 2026-01-30 date_diff('day', CAST(from_iso8601_timestamp(ts_iso_utc) AS date), current_date) AS days_from_timestamp_to_today, -- bigint: 0 -- date_parse: Parse string to timestamp using format pattern date_parse(ts_string, '%Y-%m-%d %H:%i:%s') AS parsed_timestamp, -- timestamp: 2025-12-31 13:01:49.000 -- date_format: Format timestamp to string using format pattern date_format(from_iso8601_timestamp(ts_iso_utc), '%Y-%m-%d') AS formatted_iso_date, -- varchar: '2025-12-31' date_format(from_iso8601_timestamp(ts_iso_utc), '%Y-%m-%d %H:%i:%s') AS formatted_iso_datetime -- varchar: '2025-12-31 13:01:49'FROM time_dimensions;
WITH date_spine AS ( SELECT date_column FROM UNNEST(sequence(current_date - INTERVAL '365' day, current_date, INTERVAL '1' day)) AS t(date_column))SELECT date_column AS date, -- date: varies (last 365 days) CAST(date_trunc('week', date_column) - INTERVAL '1' day AS date) AS week_start_date, -- date: Sunday of week CAST(date_trunc('week', date_column) - INTERVAL '1' day + INTERVAL '6' day AS date) AS week_end_date, -- date: Saturday of week CAST(date_trunc('month', date_column) AS date) AS month_start_date, -- date: First day of month CAST(date_trunc('month', date_column + INTERVAL '1' month) - INTERVAL '1' day AS date) AS month_end_date -- date: Last day of monthFROM date_spine;
WITH customer_data AS ( SELECT * FROM ( VALUES (1, ' quantum.nebula@turmeric-tech.com ', 'Premium Plan - Annual'), (2, 'stellar.phoenix@cardamom.co', 'STARTER plan (monthly)'), (3, 'Cosmic.Thunder@SAFFRON.IO', 'Enterprise Plan - Quarterly'), (4, 'atomic_lightning@coriander-labs.net', 'growth PLAN - annual'), (5, 'digital.vortex@peppercorn.com', 'Premium Plan - Monthly') ) AS t(customer_id, email, plan_description))SELECT customer_id, -- bigint: 1-5 email, -- varchar: ' quantum.nebula@turmeric-tech.com ' plan_description, -- varchar: 'Premium Plan - Annual' CONCAT('Customer: ', CAST(customer_id AS varchar), ' - ', TRIM(email)) AS concat_example, -- varchar: 'Customer: 1 - quantum.nebula@turmeric-tech.com' SUBSTRING(TRIM(email), 1, POSITION('@' IN TRIM(email)) - 1) AS substring_example, -- varchar: 'quantum.nebula' REPLACE(plan_description, '-', '–') AS replace_example, -- varchar: 'Premium Plan – Annual' TRIM(email) AS trim_example, -- varchar: 'quantum.nebula@turmeric-tech.com' UPPER(email) AS upper_example, -- varchar: ' QUANTUM.NEBULA@TURMERIC-TECH.COM ' LOWER(TRIM(email)) AS lower_example, -- varchar: 'quantum.nebula@turmeric-tech.com' SPLIT(TRIM(email), '@')[1] AS split_example, -- varchar: 'quantum.nebula' REGEXP_EXTRACT(TRIM(email), '^([^@]+)') AS regexp_extract_example, -- varchar: 'quantum.nebula' REGEXP_REPLACE(LOWER(plan_description), '[^a-z0-9]+', '_') AS regexp_replace_example, -- varchar: 'premium_plan_annual' LENGTH(TRIM(email)) AS length_example, -- bigint: 34 POSITION('@' IN TRIM(email)) AS position_example, -- bigint: 15 LOWER(plan_description) LIKE '%annual%' AS like_example -- boolean: true (for annual plans)FROM customer_dataORDER BY customer_id;
WITH usage_billing_data AS ( SELECT * FROM ( VALUES (1, 'Turmeric Technologies', 'Enterprise', DATE '2025-01-31', 45000, 2250.00), (2, 'Cardamom Systems', 'Growth', DATE '2025-01-31', 18000, 900.00), (3, 'Saffron Solutions', 'Enterprise', DATE '2025-01-31', 52000, 2600.00), (4, 'Cumin Innovations', 'Starter', DATE '2025-01-31', 3500, 175.00), (5, 'Turmeric Technologies', 'Enterprise', DATE '2025-02-28', 48000, 2400.00), (6, 'Coriander Labs', 'Growth', DATE '2025-02-28', 22000, 1100.00), (7, 'Cardamom Systems', 'Growth', DATE '2025-02-28', 16500, 825.00), (8, 'Peppercorn Group', 'Enterprise', DATE '2025-02-28', 61000, 3050.00), (9, 'Fenugreek Digital', 'Growth', DATE '2025-02-28', 25000, 1250.00), (10, 'Mustard Seed Tech', 'Starter', DATE '2025-02-28', 4200, 210.00) ) AS t(invoice_id, customer_name, plan_tier, billing_period_end, api_calls, amount_billed))SELECT invoice_id, -- bigint: 1-10 customer_name, -- varchar: 'Turmeric Technologies', 'Cardamom Systems', etc. plan_tier, -- varchar: 'Enterprise', 'Growth', 'Starter' billing_period_end, -- date: End of billing period api_calls, -- bigint: Total API calls for the month (3500 to 61000) amount_billed, -- double: Amount billed based on usage ($175 to $3050) ROW_NUMBER() OVER (ORDER BY billing_period_end, customer_name) AS row_num, -- bigint: 1, 2, 3, ... (sequential invoice number) RANK() OVER (ORDER BY api_calls DESC) AS rank_by_usage, -- bigint: Rank by API call volume (1=highest usage) DENSE_RANK() OVER (ORDER BY amount_billed DESC) AS dense_rank_by_revenue, -- bigint: Dense rank by revenue (no gaps) ROW_NUMBER() OVER (PARTITION BY plan_tier ORDER BY api_calls DESC) AS usage_rank_within_tier, -- bigint: Usage ranking within each plan tier NTILE(4) OVER (ORDER BY api_calls) AS usage_quartile, -- bigint: 1=Low, 2=Med-Low, 3=Med-High, 4=High usage quartile SUM(amount_billed) OVER (ORDER BY billing_period_end, customer_name ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_revenue, -- double: Running total of revenue (2250, 3150, 5750, ...) AVG(api_calls) OVER (PARTITION BY customer_name ORDER BY billing_period_end ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_avg_usage_3mo, -- bigint: 3-month rolling average API calls per customer COUNT(*) OVER (PARTITION BY plan_tier) AS invoices_per_tier, -- bigint: Total invoices for each plan tier SUM(api_calls) OVER (PARTITION BY plan_tier) AS total_usage_by_tier, -- bigint: Total API calls for each plan tier api_calls / CAST(SUM(api_calls) OVER (PARTITION BY plan_tier) AS DOUBLE) AS pct_of_tier_usage, -- double: Percentage of tier's total usage (0.35 = 35%) MAX(api_calls) OVER (PARTITION BY customer_name) AS customer_peak_usage, -- bigint: Highest usage month for each customer LAG(api_calls, 1) OVER (PARTITION BY customer_name ORDER BY billing_period_end) AS prev_month_usage, -- bigint: Previous month's API calls (NULL for first month) LEAD(api_calls, 1) OVER (PARTITION BY customer_name ORDER BY billing_period_end) AS next_month_usage, -- bigint: Next month's API calls (NULL for last month) api_calls - LAG(api_calls, 1) OVER (PARTITION BY customer_name ORDER BY billing_period_end) AS usage_change_mom, -- bigint: Month-over-month usage change (+3000 = 3000 more calls) CAST(api_calls - LAG(api_calls, 1) OVER (PARTITION BY customer_name ORDER BY billing_period_end) AS DOUBLE) / NULLIF(LAG(api_calls, 1) OVER (PARTITION BY customer_name ORDER BY billing_period_end), 0) AS usage_growth_rate, -- double: Month-over-month growth rate (0.067 = 6.7% growth) LAG(amount_billed, 1) OVER (PARTITION BY customer_name ORDER BY billing_period_end) AS prev_month_revenue, -- double: Previous month's billed amount amount_billed - LAG(amount_billed, 1) OVER (PARTITION BY customer_name ORDER BY billing_period_end) AS revenue_change_mom, -- double: Month-over-month revenue change FIRST_VALUE(api_calls) OVER (PARTITION BY plan_tier ORDER BY billing_period_end) AS tier_baseline_usage, -- bigint: First month's usage for each plan tier (baseline) LAST_VALUE(api_calls) OVER (PARTITION BY plan_tier ORDER BY billing_period_end ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS tier_latest_usage, -- bigint: Most recent usage for each plan tier NTH_VALUE(amount_billed, 2) OVER (PARTITION BY plan_tier ORDER BY amount_billed DESC) AS second_highest_revenue_in_tier, -- double: Second highest revenue in each plan tier ARRAY_AGG(plan_tier) OVER (PARTITION BY customer_name ORDER BY billing_period_end ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS plan_history -- array(varchar): Ordered list of plan tiers over time per customer -- ------------------------------------------------------------------------- AVG(api_calls) OVER (ORDER BY billing_period_end ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS centered_moving_avg, -- double: Symmetric 3-row window (1 before + current + 1 after) — smooths without lagging like a backward-only rolling average SUM(amount_billed) OVER (ORDER BY billing_period_end, customer_name ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS remaining_revenue, -- double: Reverse running total — sum from current row to last row, the complement of cumulative_revenue SUM(api_calls) OVER (PARTITION BY customer_name ORDER BY billing_period_end ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) AS usage_4mo_window, -- bigint: Asymmetric window — 2 rows before + current + 1 row after, weights history more than lookahead SUM(amount_billed) OVER (ORDER BY billing_period_end RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_cumulative_revenue, -- double: RANGE frame groups all rows with the same ORDER BY value together, unlike ROWS which processes one row at a time AVG(api_calls) OVER (PARTITION BY plan_tier ORDER BY api_calls RANGE BETWEEN 5000 PRECEDING AND 5000 FOLLOWING) AS range_band_avg_usage, -- double: Value-based band — averages all rows whose api_calls falls within ±5000 of the current row's value (not ±5000 rows) PERCENT_RANK() OVER (ORDER BY api_calls) AS pct_rank_usage, -- double: Relative rank as 0.0–1.0 fraction: (rank - 1) / (total_rows - 1). Lowest = 0.0, highest = 1.0 PERCENT_RANK() OVER (PARTITION BY plan_tier ORDER BY api_calls) AS pct_rank_usage_within_tier, -- double: Same as pct_rank_usage but restarted within each tier — shows where a customer stands among peers CUME_DIST() OVER (ORDER BY amount_billed) AS cumulative_distribution, -- double: Fraction of rows with amount_billed <= current row (0.8 = exceeds 80% of all rows). Never returns 0.0, minimum is 1/n CUME_DIST() OVER (PARTITION BY plan_tier ORDER BY amount_billed) AS cumulative_distribution_within_tier, -- double: Same as cumulative_distribution but scoped to the tier — useful for percentile segmentation within a tier LAG(api_calls, 1, 0) OVER (PARTITION BY customer_name ORDER BY billing_period_end) AS prev_month_usage_default_zero, -- bigint: LAG with default — returns 0 instead of NULL for the first month, preventing NULL propagation in downstream calculations LEAD(amount_billed, 1, 0.0) OVER (PARTITION BY customer_name ORDER BY billing_period_end) AS next_month_revenue_default_zero, -- double: LEAD with default — returns 0.0 instead of NULL for the last month LAG(api_calls, 2, 0) OVER (PARTITION BY customer_name ORDER BY billing_period_end) AS two_months_ago_usage, -- bigint: LAG with offset 2 and default — api_calls from 2 months ago, returns 0 if fewer than 2 prior months exist AVG(CAST(api_calls AS DOUBLE)) OVER (PARTITION BY plan_tier) - api_calls AS usage_diff_from_tier_avg, -- double: Difference between this row and the tier average — Athena equivalent of EXCLUDE CURRENT ROW peer comparison SUM(amount_billed) OVER (PARTITION BY plan_tier) - amount_billed AS tier_revenue_excl_self, -- double: Total tier revenue minus this row's contribution — Athena equivalent of EXCLUDE CURRENT ROW SUM(CASE WHEN api_calls > 20000 THEN 1 ELSE 0 END) OVER (PARTITION BY plan_tier) AS high_usage_invoices_in_tier, -- bigint: Count of invoices with >20000 calls in this tier — Athena equivalent of COUNT(*) FILTER (WHERE ...) OVER SUM(CASE WHEN api_calls > 20000 THEN amount_billed ELSE 0.0 END) OVER (PARTITION BY plan_tier) AS high_usage_revenue_in_tier -- double: Total revenue from high-usage months only within this tier — Athena equivalent of SUM FILTER (WHERE ...) OVERFROM usage_billing_dataORDER BY billing_period_end, customer_name;