v.1

SQL for Beginners: Window Functions

EN
Learn what a window function is and how to use it to calculate each item's share of total gross.
SQL for Beginners: Window Functions
Published
24.05.2022
|
devtodev

Welcome to the next article from the SQL series. Working with our demo project, we have already managed to study join and union operations, calculate the number of active usersapply nested queries and temporary tables. Now let's talk about window functions.

In our app, users can pay real money for 6 different items. In previous article, we considered how much each of the items brings in cash. And what is the share of each of them in our total gross?

First, we’ll calculate how much gross each of the items brings and write the result in the product_sum table.

with product_sum as (
​select product, sum(priceusdsumm
from p102968.payments p
where eventtime >=current_date - interval '7 day'and eventtime current_date
and tester is false
and cheater is false
group by 1
)

Then let's display the resulting table in a separate query, adding a column in which we divide the amount of gross from each item by the amount of gross from all items. Multiply the result by 100 to get the %.

select product, summ
,100*summ â€‹(select sum(summ) from product_sum sum) as "% of total"
from product_sum
order by desc

sql window function
Query result. Screenshot from devtodev demo

In this query, for each row with an item, we use a subquery (select summ...) to calculate the total amount of gross. To avoid this, as well as to simplify and speed up the query, you can use window functions.

Window Functions

The window function performs calculations on a set of rows related to the current row.

You can compare it to an aggregation function. In contrast, in a window function, several rows are not grouped into one, but continue to exist separately. At the same time, a window function, like an aggregating one, can access not only the current query row. Let's modify our query using a window function:

​select product, summ
,100*summ / sum(summ) over () as "% of total"
from product_sum
order by desc

The over clause turns the aggregating function sum() into a window function, and it simply calculates the sum of all items in this query.
So, we get an absolutely identical query result without using a subquery, and without using additional grouping!

Aggregation functions (sum, count, avg…) are always used with grouping; in the case of window functions, the grouping rules are specified inside the over clause.

Now let's get rid of the CTE of the table, and count all in one query:

​select product, 100*sum(priceusd) over (partition by product) / sum(priceusd) over() as "% of total"
from p102968.payments p
where eventtime >=current_date - interval '7 day'and eventtime current_date
and tester is false
and cheater is false
order by desc

The partition by clause specifies that the rows are to be divided into groups or sections by concatenating the same values ​​of the partition by clauses. The window function is evaluated over the rows that are in the same section as the current row. As a result, we have all the rows from the payments table that have been filtered by where; and the share of items in total gross is calculated for each such row. 

sql partition by
Query result. Screenshot from devtodev demo

To get rid of duplicate values, you can use distinct in the select function.

We will get exactly the same result as in the very first query of this article, without CTE and nested queries.

Cumulative gross per day for 5-minute intervals

Such a report is already present on the devtodev Real-time dashboard, but let's find out how our gross is formed during the day.

First, let's calculate the income in every 5 minutes of the previous day. To do this, we need to take the hour of each payment and add the minutes given to the 5-minute intervals. Then we calculate the amount of income for each 5-minute period. 

​select (date_part('hour', eventtime::timestamp)|| 'hours')::interval + (((date_part('minute', eventtime::timestamp)::integer / 5::integer) 5::integer)|| 'minutes'):: interval as intervals 
sum(priceusd) priceusd
from p102968.payments p
where eventtime >current_date - interval '1 day'and eventtime current_date
and tester is false
and cheater is false
group by 1

We can notice that there may be intervals in which no payment was made. This means that the current query will not return all the intervals within 24 hours, but only intervals in which events were committed.

We need to generate a progression of 5-minute intervals, as we did in the previous article.

with intervals as (
​select (date_part('hour', intervals)|| 'hours')::interval + (date_part('minute', intervals)::integer|| 'minutes'):: interval as intervals 
from ( 
​select generate_series(ut('2021-10-01 00:00:00'), ut('2021-10-01 23:55:00')'5 min':: interval)::timestamp intervalst
)

Now, when joining two tables, we will have a complete list of 5-minute intervals and the amount of payments in these intervals. For periods without payments, the value will be NULL.

sql 5-minute intervals
Query result. Screenshot from devtodev demo

But how to calculate the value, which will take into account the previous values? The answer is window functions. The over clause can divide data not only into groups, but also set the order in which the function is calculated within each group using order by.

We will use the same aggregation function sum, and specify the order of summation over (order by intervals). Thus, for each row in the table, the sum from the beginning to the current row will be considered.

If partition by is specified inside the over clause, then the sum will be calculated from the beginning of each group to the current line.

The final query will look like this:

with intervals as (
​select (date_part('hour', intervals)|| 'hours')::interval + (date_part('minute', intervals)::integer|| 'minutes'):: interval as intervals 
from ( 
​select generate_series(ut('2021-10-01 00:00:00'), ut('2021-10-01 23:55:00')'5 min':: interval)::timestamp intervalst
)
​select intervals as "Date time"
(priceusd) as "Gross for 5-min" 
sum(priceusd) over(order by intervals) as "Cumulative gross"
from (
​select (date_part('hour', eventtime::timestamp)|| 'hours')::interval + (((date_part('minute', eventtime::timestamp)::integer / 5::integer) 5::integer)|| 'minutes'):: interval as intervals || 'minutes'):: interval as intervals
sum(priceusd) priceusd
from p102968.payments p
where eventtime >current_date - interval '1 day'and eventtime current_date
and tester is false
and cheater is false
group by 1
t
right join intervals using(intervals)
order by 1

sql gross for 5-minute interval
Query result. Screenshot from devtodev demo
sql gross for 5-minute intervals
Query result on the chart. Screenshot from devtodev demo

This approach is well suited for calculating the Cumulative ARPU or LTV, depending on the time after installation, as well as any other cumulative metric. These examples are just a small part of what you can do with window functions. In future articles, we will take a closer look at the capabilities of other window functions.

Read more