
SQL Mistakes to Avoid: A Guide to Optimizing Your Queries

Published 26.09.2024 | Anastasia Bogapova

Written by: Anastasia Bogapova, Growth Product Manager at devtodev, ex-Lead Product Analyst with 5 years' experience in data.

Author’s note: At a previous company, we ran into an unexpected problem: the monthly cost of running SQL queries was higher than the combined salaries of five analysts. As the team expanded, this expense became a bigger concern. It was clear we needed to make a change, so we decided to conduct an internal audit of our queries and trained the team on SQL optimization techniques. The results were incredible — within two months, we managed to cut the monthly cost in half. Not only that, but task execution and dashboard updates became noticeably faster, thanks to more efficient queries.

In this article, I’ll share the key lessons from that experience, along with 9 essential tips to improve SQL efficiency and streamline data workflows in devtodev.

Mastering SQL is crucial for avoiding such inefficiencies. If you're just starting out, we have a helpful guide on SQL for Beginners, and if you want to go deeper, we also cover topics like Metric Growth Rate.

Contents

  1. Over-Retrieving Data with SELECT *
  2. Use Event Time Filters
  3. Slow Metric Aggregation (DAU/MAU)
  4. Poor Event Analysis with Multiple Subqueries
  5. Ineffective Retention Rate Calculation with Subqueries
  6. Lack of Pre-Aggregated Data for Dashboards
  7. Inconsistent Granularity in Cohort Analyses
  8. Avoid Redundant Data Selection
  9. Filter Before Joining

Introduction

Product analytics is about transforming user data into actionable insights, and SQL is the tool that powers much of that transformation. However, SQL queries can be tricky—small mistakes often result in slow performance, inaccurate data, and bottlenecks that frustrate analysts and delay decisions. This article highlights common SQL pitfalls in product analytics and provides practical tips for optimizing your queries to handle large datasets, track key metrics, and analyze user funnels effectively. We’ll include examples tailored to scenarios like retention tracking, cohort analysis, and metric aggregation.

Common Pitfalls in SQL Queries for Product Analytics

1. Over-Retrieving Data with SELECT *

When analyzing large user datasets, a common mistake is fetching all columns using SELECT *. This approach wastes resources and slows down performance, especially when analyzing complex product metrics like Daily Active Users (DAU) or conversion funnels. 

Read more: SQL Knowledge Levels: Beginner, Middle, Advanced

Inefficient Query:

with funnel_events AS (
SELECT * FROM your_project.custom_events
WHERE eventtime::date = '2024-09-09'
)
SELECT 
name as stage, 
COUNT(DISTINCT devtodevid) as num_users
FROM funnel_events 
WHERE name in ('registration','engagement','purchase')
GROUP BY name
ORDER BY num_users desc;

This query pulls every column for the specified date, regardless of whether all columns are needed.

Optimized Query:

with funnel_events AS (
SELECT DISTINCT
name, 
devtodevid 
FROM your_project.custom_events
WHERE eventtime::date = '2024-09-09'
AND name in ('registration','engagement','purchase')
)
SELECT 
name as stage, 
COUNT(devtodevid) as num_users
FROM funnel_events 
GROUP BY name
ORDER BY num_users desc;

By explicitly specifying the necessary columns, you reduce the amount of data retrieved, improving performance, particularly for large datasets. Additionally, if you only need a subset of the data, using a LIMIT N clause (e.g., LIMIT 10) can further optimize the query by restricting the result set to the most relevant records.
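
For instance, when you only need to inspect a sample of rows rather than the full result, a minimal sketch like this keeps the result set small (the limit of 10 is arbitrary):

SELECT 
name, 
devtodevid 
FROM your_project.custom_events
WHERE eventtime::date = '2024-09-09'
LIMIT 10;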

2. Use Event Time Filters

An often overlooked optimization when analyzing large datasets is the use of event time filters, such as eventtime or created fields. Without these filters, queries process more data than needed, causing slower execution and wasting resources. This is especially critical when dealing with time-sensitive metrics like retention or event tracking, where efficiency is key.

Inefficient Query:

SELECT 
  devtodevid, 
  name, 
  eventtime 
FROM 
  your_project.custom_events
WHERE 
  name in ('login', 'purchase', 'logout');

This query fetches all rows for these events, regardless of the time period, which can result in excessive data retrieval.

Optimized Query:

SELECT 
  devtodevid, 
  name, 
  eventtime 
FROM 
  your_project.custom_events
WHERE 
  eventtime::date BETWEEN '2024-09-01' AND '2024-09-10'
  AND name in ('login', 'purchase', 'logout');

By narrowing the date range using eventtime::date, this query retrieves only the relevant records for the specified period, reducing the data volume and improving performance. Applying time filters is especially beneficial when querying events over large time periods, ensuring efficient processing and faster results.

Read more: SQL for Beginners: How to Track First In-App Events 

3. Slow Metric Aggregation (DAU/MAU)

One common issue in DAU/MAU calculations is the use of inefficient aggregation methods, particularly with large datasets. A frequent mistake is using DISTINCT to count unique users, which leads to long query times. This inefficiency occurs because SQL has to scan the entire table to identify distinct users, consuming more resources and slowing down performance, especially when dealing with large user datasets.

To avoid this, it’s best to use pre-aggregated DAU tables where available, as this significantly reduces query time and resource usage. If a pre-aggregated table isn’t available, grouping users before counting them is a more efficient alternative to DISTINCT.

Inefficient Query (DAU Calculation):

SELECT 
COUNT(DISTINCT devtodevid) AS daily_active_users 
FROM your_project.sessions 
WHERE eventtime::date = '2024-09-09';

Using DISTINCT on a large table slows performance because the engine has to sort or hash every row it scans to de-duplicate user IDs, even when a simpler approach would produce the same result.

Optimized Query:

Use the pre-calculated DAU table, found in your project directory at your_project.dau. Alternatively, you can create a temporary table using a subquery—by grouping users first, you avoid the heavy computational cost of DISTINCT while achieving the same result.

SELECT 
COUNT(devtodevid) AS daily_active_users 
FROM your_project.dau
WHERE eventdate = '2024-09-09';
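
If the pre-calculated DAU table isn't available in your project, here is a minimal sketch of the grouping alternative described above: the inner query de-duplicates users with GROUP BY, so the outer count no longer needs DISTINCT.

SELECT 
COUNT(devtodevid) AS daily_active_users 
FROM (
  SELECT devtodevid 
  FROM your_project.sessions 
  WHERE eventtime::date = '2024-09-09'
  GROUP BY devtodevid
) AS active_users;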

Read more: SQL for Beginners: How to Calculate DAU, WAU, MAU 

4. Poor Event Analysis with Multiple Subqueries

Funnel analysis, such as tracking users through various stages (registration, engagement, purchase), is fundamental in product analytics. However, using multiple subqueries to count users at each stage leads to unnecessary performance issues.

To avoid this issue, it's better to use a single query with FILTER clauses. This allows SQL to scan the table only once while filtering the data for each funnel stage, significantly improving performance.

Inefficient Query (Funnel Analysis):

SELECT *
FROM
(SELECT COUNT(*) AS total_registrations FROM your_project.custom_events WHERE name = 'registration' and eventtime >= current_date - interval '1 month') AS r,
(SELECT COUNT(*) AS total_engagements FROM your_project.custom_events WHERE name = 'engagement' and eventtime >= current_date - interval '1 month') AS e,
(SELECT COUNT(*) AS total_purchases FROM your_project.custom_events WHERE name = 'purchase' and eventtime >= current_date - interval '1 month') AS p;

In this query, each funnel stage is handled by a separate subquery, so the same table is scanned three times, significantly degrading performance.

Optimized Query:

SELECT 
COUNT(DISTINCT devtodevid) filter (where name = 'registration') as total_registrations,
COUNT(DISTINCT devtodevid) filter (where name = 'engagement') as total_engagements,
COUNT(DISTINCT devtodevid) filter (where name = 'purchase') as total_purchases
FROM your_project.custom_events
WHERE eventtime::date >= current_date - interval '1 month';

This query performs a single table scan and uses filters to count users at each stage, reducing data processing time and decreasing the load on the database.

Read more: SQL for Beginners: Nested Queries and Temporary Tables

5. Ineffective Retention Rate Calculation with Subqueries

Retention metrics, such as calculating the percentage of users who return after a certain period, are vital for product teams. However, using subqueries for each user event can be inefficient and cause slow query performance.

Inefficient Query (Multiple Subqueries for Retention):

SELECT 
devtodevid, 
(SELECT MIN(eventtime::date) FROM your_project.sessions WHERE devtodevid = sa.devtodevid) AS first_activity, 
(SELECT MAX(eventtime::date) FROM your_project.sessions WHERE devtodevid = sa.devtodevid) AS last_activity 
FROM your_project.sessions as sa;

Using subqueries and JOIN operations to determine retention dates for each user is a common mistake when calculating this metric. This approach slows down query performance because each subquery or join forces the database to repeatedly scan large tables, executing multiple operations for the same data set. As the number of users grows, the query execution time increases significantly due to the additional processing required for each individual user.

If you want to learn more about how to work with date and number generation, see our guide on how to generate dates and numbers in SQL.

Optimized Query:

SELECT DISTINCT
devtodevid, 
FIRST_VALUE(eventtime::date) OVER (PARTITION BY devtodevid ORDER BY eventtime) AS first_activity, 
LAST_VALUE(eventtime::date) OVER (PARTITION BY devtodevid ORDER BY eventtime ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_activity 
FROM your_project.sessions;

Using window functions like FIRST_VALUE and LAST_VALUE is a far more efficient way to calculate retention metrics without needing subqueries. Note that LAST_VALUE requires an explicit ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frame; with the default frame it would only return the current row's value rather than the user's latest activity date.

Best Practice:

Use pre-calculated retention tables, available in your project directory at your_project.returns_by_calendar or your_project.returns_by_24h.

The advantage of using these tables is that they include the retentionday field, which specifies the exact day a user returned, making it easier to track retention patterns without the need for complex calculations.

SELECT
devtodevid,
retentionday
FROM your_project.returns_by_24h 
WHERE eventtime::date >= current_date - interval '1 month'; 
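
For example, here is a rough sketch of how the retentionday field can be rolled up into a return curve for the last month (assuming only the devtodevid, retentionday, and eventtime fields used above):

SELECT
retentionday,
COUNT(DISTINCT devtodevid) AS returned_users
FROM your_project.returns_by_24h 
WHERE eventtime::date >= current_date - interval '1 month'
GROUP BY retentionday
ORDER BY retentionday;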

6. Lack of Pre-Aggregated Data for Dashboards

Relying on on-the-fly data aggregation for large datasets in real-time dashboards often leads to performance bottlenecks. Aggregating vast amounts of data in real time consumes substantial resources and significantly slows down dashboard loading times.

To enhance performance, it’s more efficient to use pre-aggregated data. Creating a materialized view that pre-processes the data reduces the need for real-time calculations, allowing dashboards to load quickly without putting excessive strain on the database. This approach keeps data dynamically updated while minimizing query execution time.

Inefficient Query (Real-time Aggregation):

SELECT 
COUNT(DISTINCT devtodevid) AS total_users, 
COUNT(paymentid) AS total_purchases, 
SUM(revenueusd) AS total_revenue 
FROM your_project.payments
WHERE eventtime::date = '2024-09-01';

This query performs real-time aggregation, scanning the entire table, which becomes slow and inefficient for large datasets.

Optimized Query (Using Pre-Aggregated Tables):

-- Pre-aggregated summary table created using VIEW
CREATE MATERIALIZED DAILY VIEW 
daily_purchase_summary_2024 (purchase_date, total_users, total_purchases, total_revenue) as
SELECT
eventtime::date AS purchase_date, 
COUNT(DISTINCT devtodevid) AS total_users, 
COUNT(paymentid) AS total_purchases, 
SUM(revenueusd) AS total_revenue 
FROM your_project.payments
WHERE eventtime::date >= '2024-01-01'
GROUP BY 1; 
-- Querying the summary table from VIEW
SELECT 
total_purchases, 
total_revenue 
FROM view.daily_purchase_summary_2024 
WHERE purchase_date = '2024-09-01';

By using this pre-aggregated view, you significantly reduce computation time, as the data is already summarized, allowing dashboards to load quickly without heavy processing.

💡 Tip: In addition to creating your own custom VIEW, we also recommend leveraging the ready-to-use, daily-updated tables provided by devtodev:

  • Tables: returns_by_24h and returns_by_calendar for calculating retention or churn rates. For an example of how to use these tables, refer to this guide.
  • Table dau: This table already contains unique records based on devtodevid and user activity date, simplifying the calculation of Daily Active Users (DAU). For usage examples, check DAU Calculation Example.

7. Inconsistent Granularity in Cohort Analyses

Cohort analysis, which groups users based on shared behaviors or sign-up dates, is essential for understanding user retention and growth. However, querying cohorts without grouping data into larger time periods like weeks or months can create unnecessarily large result sets and slow down queries.

Inefficient Query (Daily Cohort Analysis):

SELECT 
devtodevid, 
created::date as user_created_date 
FROM your_project.users 
WHERE created::date >= '2024-01-01';

Fetching individual dates makes it difficult to identify broader trends and often returns too many rows.

Optimized Query:

SELECT 
devtodevid, 
DATE_TRUNC('month', created) AS cohort_month 
FROM your_project.users 
WHERE created::date >= '2024-01-01';

Using DATE_TRUNC groups users by month (or week), making it easier to analyze trends and speeding up the query by reducing data granularity.
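
Building on this, a minimal sketch that rolls the truncated dates up into monthly cohort sizes:

SELECT 
DATE_TRUNC('month', created) AS cohort_month, 
COUNT(DISTINCT devtodevid) AS cohort_size 
FROM your_project.users 
WHERE created::date >= '2024-01-01'
GROUP BY 1
ORDER BY 1;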

8. Avoid Redundant Data Selection

A common mistake when analyzing different user cohorts is duplicating queries with slightly different conditions. For instance, after writing a query for one cohort, it might seem easy to copy it for another cohort with just a different condition. However, this approach leads to multiple reads of the same data, which increases processing time and resource consumption.

Inefficient Query:

WITH without_payments AS (
  SELECT
    s.eventtime::date AS event_date,
    COUNT(DISTINCT s.devtodevid) AS DAU
  FROM your_project.sessions AS s
  WHERE
    s.eventtime >= current_date - INTERVAL '30 day'
    AND s.eventtime < current_date
    AND s.paymentcount = 0
  GROUP BY 1
),
with_1_5_payments AS (
  SELECT
    s.eventtime::date AS event_date,
    COUNT(DISTINCT s.devtodevid) AS DAU
  FROM your_project.sessions AS s
  WHERE
    s.eventtime >= current_date - INTERVAL '30 day'
    AND s.eventtime < current_date
    AND s.paymentcount BETWEEN 1 AND 5
  GROUP BY 1
),
more_5_payments AS (
  SELECT
    s.eventtime::date AS event_date,
    COUNT(DISTINCT s.devtodevid) AS DAU
  FROM your_project.sessions AS s
  WHERE
    s.eventtime >= current_date - INTERVAL '30 day'
    AND s.eventtime < current_date
    AND s.paymentcount > 5
  GROUP BY 1
)
SELECT 
t0.event_date, 
t0.DAU AS without_payments_dau, 
t1_5.DAU AS with_1_5_payments_dau, 
t6p.DAU AS more_5_payments_dau
FROM without_payments AS t0
LEFT JOIN with_1_5_payments AS t1_5 
  ON t0.event_date = t1_5.event_date
LEFT JOIN more_5_payments AS t6p 
  ON t0.event_date = t6p.event_date
ORDER BY event_date;

In this query, the session table is read three times for each group of users, followed by joins, which increases processing time.

Optimized Query:

SELECT event_date,
  MAX(dau) FILTER (WHERE payment_group = '0') AS without_payments_dau,
  MAX(dau) FILTER (WHERE payment_group = '1-5') AS with_1_5_payments_dau,
  MAX(dau) FILTER (WHERE payment_group = '6+') AS more_5_payments_dau
FROM (
      SELECT
        s.eventtime::date AS event_date,
        CASE
          WHEN s.paymentcount = 0 THEN '0'
          WHEN s.paymentcount BETWEEN 1 AND 5 THEN '1-5'
          ELSE '6+'
        END AS payment_group,
        COUNT(DISTINCT s.devtodevid) AS dau
      FROM your_project.sessions AS s
      WHERE s.eventtime >= current_date - INTERVAL '30 day'
        AND s.eventtime < current_date
      GROUP BY 1, 2
) t
GROUP BY 1
ORDER BY 1;

In this optimized version, we reduce the number of times the table is read by grouping the conditions in a single query. By using the CASE statement to categorize users and filtering the results, we avoid multiple scans of the same data and eliminate the need for joins. This approach significantly improves query performance by minimizing data processing.

9. Filter Before Joining

Applying filters after performing joins is a frequent inefficiency when working with multiple tables. This approach processes more data than necessary, slowing down the query and overloading the database.

To improve performance, always apply filters before joining tables. This reduces the amount of data involved in the join operation, making the query more efficient.

Inefficient Query:

SELECT 
u.devtodevid, 
p.revenueusd 
FROM your_project.users AS u 
JOIN your_project.payments AS p 
ON u.devtodevid = p.devtodevid
WHERE u.created::date > '2023-01-01';

In this query, the join processes all the data in the users table before filtering, which increases execution time.

You can learn more about using joins by checking out our guide on SQL for Beginners: Joins and Funnels.

Optimized Query:

WITH filtered_users AS (
SELECT 
  devtodevid
FROM your_project.users  
WHERE created::date > '2023-01-01')
SELECT 
u.devtodevid, 
p.revenueusd 
FROM filtered_users AS u 
JOIN your_project.payments AS p 
ON u.devtodevid = p.devtodevid;

By filtering the users table beforehand, fewer rows are joined, resulting in a faster and more resource-efficient query.

Best Practices for Optimizing SQL Queries

  • Avoid SELECT *: Always select specific columns to reduce memory usage and improve query performance.
  • Use Event Time Filters: Apply filters based on eventtime (eventtime::date) or created::date to reduce data processing time.
  • Pre-Aggregate Data: Use CREATE VIEW to pre-aggregate data for real-time dashboards, reducing the need for on-the-fly calculations.
  • Avoid Multiple Subqueries: Instead of using separate subqueries for each event stage in funnel analysis, use CASE or FILTER statements within a single query.
  • Use Window Functions for Complex Calculations: Leverage window functions like ROW_NUMBER, FIRST_VALUE, and LAST_VALUE for efficient calculations (see the sketch after this list).
  • Optimize Granularity in Cohort Analysis: Use larger time periods like weeks or months instead of days to reduce dataset size and improve performance.
  • Filter Before Joining: Always apply filters before performing joins to minimize the number of rows processed.
  • Avoid Redundant Data Selection: When querying similar data under different conditions, use CASE or FILTER within a single query to avoid duplicating logic.
  • Limit Data Early: Always retrieve only the necessary columns and rows. Use filters like LIMIT or WHERE early in your queries to minimize the dataset.
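
To illustrate the window-function tip above, here is a minimal sketch (not one of the queries covered earlier) that uses ROW_NUMBER to find each user's first session without a correlated subquery, assuming the your_project.sessions table used throughout this article:

SELECT devtodevid, eventtime AS first_session_time
FROM (
  SELECT 
    devtodevid, 
    eventtime, 
    ROW_NUMBER() OVER (PARTITION BY devtodevid ORDER BY eventtime) AS rn
  FROM your_project.sessions
) t
WHERE rn = 1;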

Conclusion

Optimizing SQL queries is crucial for product analytics, ensuring that insights are delivered efficiently and accurately. Common mistakes like over-retrieving data, inefficient metric aggregations, and redundant subqueries can severely impact performance. By adopting optimization techniques — such as filtering before joins, leveraging window functions for complex calculations, and using pre-aggregated data — you can ensure that queries run smoothly even on large datasets.

These optimizations enable faster tracking of key metrics, better funnel analysis, and more efficient cohort analysis, leading to quicker, data-driven product decisions.

If you’re considering switching to external solutions or enhancing your internal systems, devtodev offers the right blend of tools and flexibility to meet your goals. Our team is here to support you through this transition as efficiently and seamlessly as possible. Feel free to reach out to us at info@devtodev.com for any support or questions.

Check out our demo to see how it works in action!
