Here is a new article from the «SQL for Beginners» series. We have already talked about how to start learning SQL, how to use it to find out the number of active users, calculate the main monetization metrics of our demo project, and determine the structure of user purchases.
This time, let's talk about joining tables.
Join is a basic database tool that is used to link tables based on a specific set of fields. This operation always uses two tables, usually called left (in from) and right (in join), and a join condition.
select column_names (1..N)
from left_table_name
join right_table_name
on join_condition
There are several basic types of joins, each of which can be used for completely different tasks.
Or more complicated:
Data enrichment
One of the most commonly used tasks performed with join is data enrichment. Let's say we want to know which campaign attracts more paying users . We have a payments table, but there is no data about the campaign that attracted the user. This data is contained in the users table.
In this case, the best solution would be to use left join, which in any case will return all rows from the payments table and add data from the users table.
The condition for our join will be a unique user ID, which is present in both tables.
select p.devtodevid::text as devtodevid, p.eventtime, p.product, u.campaign, u.publisher
from p102968.payments
left join p102968.users on p.devtodevid = u.devtodevid
where eventtime>current_date - interval '7 day'
order by devtodevid, eventtime
As a result, we get a table in which information about the campaign and publisher that brought this user to our project is added to each payment.
Remember that the join will return all columns of both tables. Let's say there are 20 columns in the payments table, and 30 in users. After joining them, we’ll get all 50 columns in the resulting table, so be careful with these operations.
Two events (X and Y)
We are interested to know how many users have sent help to their friends in the last week after joining the clan.
To do this, take the table from p102968. "_ joined the clan" as jc and join it with the inner join p102968. "_ help sent" as hs using(devtodevid).
select count(distinct jc.devtodevid) as "Joined the clan and Help sent users"
from p102968."_joined the clan" as jc
inner join p102968."_help sent" as hs
using(devtodevid)
where jc.eventtime>current_date - interval '7 day'
and hs.eventtime>current_date - interval '7 day'
Inner join allows us to return only those users who completed both the first and second events. We have concatenated tables with using(devtodevid), which is a more compact form of the on jc.devtodevid = hs.devtodevid condition. Also, in the where clauses, we indicated that both events should have happened within the last week.
Funnel from event X to event Y
If we need to find out not just the number of those who have completed both events, but the number of those who joined the clan and those who sent help after entry, we can use the left join for this. In this case, we need not only to correlate the user from the left table with the user from the right on jc.devtodevid = hs.devtodevid, but also take into account that the second event must take place after the first (if the order of actions is important to us). Then we have to add and jc.eventtime < hs.eventtime to the join condition.
select count(distinct jc.devtodevid) as "Joined the clan and Help sent users"
, count(distinct hs.devtodevid) as "Help sent users"
from p102968."_joined the clan" as jc
left join p102968."_help sent" as hs
on jc.devtodevid = hs.devtodevid and jc.eventtime < hs.eventtime
where jc.eventtime>current_date - interval '7 day'
and (hs.eventtime>current_date - interval '7 day' or hs.eventtime is null)
There is one nuance that you may not have noticed – this is the filter condition or hs.eventtime is null. It is necessary here, because as a result of the left join, for users who, after entering the clan, do not have a help event, some of the data (from the second table) will be missing, and therefore filled with null. Without it, our initial condition hs.eventtime> current_date-interval '7 day' would filter out all rows with such users (with null data for the second event), since eventtime = null is not comparable to the date.
Users who didn't perform Y after X
Now let's try to find those users who fell out of our funnel. This is very easy to do using the query above.
We have already mentioned that the result of such a join of the tables will be a list of users who joined the clan and following the sending of help. If the user didn’t send help after joining the clan, then the right hs table will be empty for him.
This means that in order to get users who fell out of the funnel, we just need to find those who have an empty second event. To do this, we simply add and hs.devtodevid is null to the where clause. This request will be called Left exclusive join.
The final query:
select count(distinct jc.devtodevid) as "Joined the clan and Help sent users"
from p102968."_joined the clan" as jc
left join p102968."_help sent" as hs
on jc.devtodevid = hs.devtodevid and jc.eventtime < hs.eventtime
where jc.eventtime>current_date - interval '7 day'
and (hs.venttime>current_date - interval '7 day' or hs.eventtime is null)
and hs.devtodevid is null
P.S
You can combine tables not only by user id, but also by any other columns – dates, countries, product names. With a single query, you can calculate how many active users were there during each day of the month (using the sessions table) and combine it with the table that calculates total revenue (using the payments table) for the same days. Then, in the final table, you can display Active users, Gross for each of the days and calculate ARPDAU (Gross / Active users). We will leave this topic for future articles.