Cohort Analysis

Cohort analysis breaks down data into groups which usually share common characteristics or actions within a defined time frame. Cohort analysis helps identify patterns in the life cycle of customers, adapt, and tailor the service to specific cohorts. A group of people who share a common characteristic or actions within a defined period form a cohort, represented by a row in the returned result set. The date range selected in the report will specify at which date the first cohort begins.

Cohort

The syntax for the cohort function is as follows:

SELECT cohort_name, cohort_id, cohort_size, bucket_id, {function} AS {alias}
FROM cooladata
WHERE {query conditions}
CLUSTER COHORT BY {n days|weeks|months|quarters|years | property_name} 
STARTS WITH {TRUE | event_name = "<event_name>"} [AND {conditions}] 
FOLLOWED BY [NOT] {TRUE | event_name = "<event_name>"} [AND {conditions}] 
BUCKET BY {n days|{[CALENDAR] weeks|months|quarters|years} 
FIRST | ALL | RECURRING 
HAVING  [cohort_id < n AND] bucket_id < n
GROUP BY cohort_name, cohort_id, cohort_size, bucket_id

SELECT

SELECT cohort_name, cohort_id, cohort_size, bucket_id, {function} AS {alias}

Mandatory fields in the SELECT clause:

  • cohort_name: a string describing the cohort:
    • For DAY & WEEK – the date (YYYY-MM-DD) of the first day in the cohort
    • For MONTH – the month (YYYY-MM) of the cohort
    • For Properties – the value of the property
  • cohort_id: the ordinal number of the cohort (starts with 0)
  • cohort_size: the number of users in the cohort
  • bucket_id: the ordinal number of the bucket (starts with 0)
  • Aggregate Function: can be the count of users in the bucket, the cohort size minus the count of users in the bucket (i.e. the rest of the users), or the sum/average/max/min/stddev of a measure (property).

All arguments but the aggregate function must appear in the GROUP BY clause.

CLUSTER COHORT BY

CLUSTER COHORT BY {n days|weeks|months|quarters|years | property_name}

The grouping logic for each cohort. Can be defined by one of the following:

  • Time period defined by a number and a time resolution (days, weeks, months, quarters or years)
  • User scope properties

STARTS WITH

STARTS WITH  {TRUE | event_name = "<event_name>"} 

The first event performed:

  • event_name: performed a specific event
  • TRUE: performed any event

Any number of additional conditions for the event can be specified, i.e. where any of the event's properties have specific values, using "AND <condition>".

FOLLOWED BY

FOLLOWED BY [NOT] {TRUE | event_name = "<event_name>"}

The concurrent event performed:

  • event_name: performed a specific event
  • NOT event_name: did not perform a specific event
  • TRUE: performed any event

Note: to find the number of users who did nothing use the TRUE argument and, in the select clause, count the number of users left in the bucket – i.e.:

cohort_size - count(distinct customer_user_id, exact) AS count

Any number of additional conditions for the event can be specified, i.e. where any of the event's properties have specific values, using "AND <condition>".

BUCKET BY

BUCKET BY {n days|{[CALENDAR] weeks|months|quarters|years} 
FIRST | ALL | RECURRING 

The duration of each bucket and the rule by which to count users in each bucket:

  • FIRST: only counts the user once on the first occurrence of the event throughout his life cycle
  • ALL: the user is counted each time he performed the event
  • RECURRING:each cohort user is counted only if the user's activity occurred consecutively in each bucket throughout the duration of the cohort.

HAVING

HAVING  cohort_id < n AND bucket_id < n

Used to limit the number of buckets and cohorts in the query. When clustering cohorts by property, limiting cohort_id isn't necessary (will return all the property values, sorted by cohort size).

Example:

Which users bought something after registering?Users that registered within 7-day periods, and then paid for something, by months, showing only the first instance.This will show us on which month each user made his first payment.

SELECT cohort_name, cohort_id, cohort_size, bucket_id, COUNT(DISTINCT customer_user_id, exact) AS count
FROM cooladata
WHERE date_range(last 90 days)
CLUSTER COHORT BY 7 DAYS
   STARTS WITH event_name = "register"
   FOLLOWED BY event_name = "pay"
   BUCKET BY 30 DAYS FIRST
HAVING cohort_id < 5 AND  bucket_id < 5
GROUP BY cohort_name, cohort_id, cohort_size, bucket_id

Reverse Cohort

It is also possible to analyze a sequence of two events in the reverse order, i.e. what happened prior to a certain event.Replace "starts with… followed by…" with "ending with… preceded by…", as follows:

replace

STARTS WITH  {TRUE | event_name = "<event_name>"} 
FOLLOWED BY [NOT] {TRUE | event_name = "<event_name>"}

with

ENDING WITH  {TRUE | event_name = "<event_name>"} 
PRECEDED BY [NOT] {TRUE | event_name = "<event_name>"}

The rest of the query would still use the same syntax.

Export users

You can also generate a list of users and their properties using the cohort query.To do so, replace the SELECT clause with the user properties you wish to export and remove the GROUP BY clause.

replace

SELECT cohort_name, cohort_id, cohort_size, bucket_id, {function} AS {alias}

with

SELECT {user properties}

Example:

List of user ID's and emails who bought something within the first week after registration.Using the the same cohort query, select the user ID's and emails instead of the cohorts and buckets. To limit the list to the first week, limit the results to buckets 0 and 1.Note that only user scope properties can be exported (see more on property scopes here).

SELECT customer_user_id, user_email
FROM cooladata 
WHERE date_range(last 30 days)\
CLUSTER COHORT BY 7 DAYS 
   STARTS WITH event_name = "register" 
   FOLLOWED BY event_name = "pay" 
   BUCKET BY 7 DAYS ALL
HAVING (bucket_id<2)