The gist for the uninitiated: cohort analysis gives you a picture of how users/customers progress through your product as a function of when they first signed up. It lets you see how product changes, marketing pushes, network effects, press, etc. impact conversions and the funnel.
Anyway, to do cohort analysis using MapReduce, the pattern I'm using is:
- Map over all inputs you care about; output customer_id → state_mapping
- Shuffle all mapper outputs to collate by customer_id
- Reduce customer_id → state_mapping to cohort_id → state_mapping
- Re-shuffle the reducer output to collate by cohort_id
- Re-reduce cohort_id → state_mapping into a single combined state_mapping for that cohort
- Output the row of cohort data to CSV, etc
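To make the steps above concrete, here is a minimal in-memory sketch of the two-level pattern in plain Python. It is not tied to any particular MapReduce framework. The event shape (`customer_id`, `day`, `state`), the ISO date strings, the state names in `STATES`, and every function name are assumptions made up for illustration.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical lifecycle states, ordered from earliest to furthest along.
STATES = ["signed_up", "purchased", "purchased_twice"]

def shuffle(pairs):
    """Steps 2 and 4: group (key, value) pairs by key, like a framework shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def map_event(event):
    """Step 1: emit customer_id -> (day, state) for one input record."""
    yield event["customer_id"], (event["day"], event["state"])

def reduce_customer(customer_id, observations):
    """Step 3: collapse one customer's history to cohort_id -> state_mapping."""
    cohort_day = min(day for day, _ in observations)  # ISO strings sort by date
    furthest = max((state for _, state in observations), key=STATES.index)
    yield cohort_day, {furthest: 1}

def reduce_cohort(cohort_id, state_mappings):
    """Step 5: fold every state_mapping in a cohort into one row of totals."""
    totals = Counter()
    for mapping in state_mappings:
        totals.update(mapping)
    yield cohort_id, totals

def run(events):
    by_customer = shuffle(kv for e in events for kv in map_event(e))
    by_cohort = shuffle(
        kv for cid, obs in by_customer.items() for kv in reduce_customer(cid, obs)
    )
    return {
        cohort: dict(totals)
        for key, maps in by_cohort.items()
        for cohort, totals in reduce_cohort(key, maps)
    }

def to_csv(rows):
    """Step 6: one CSV row per cohort, one column per state."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["cohort"] + STATES)
    for cohort in sorted(rows):
        writer.writerow([cohort] + [rows[cohort].get(s, 0) for s in STATES])
    return out.getvalue()

EVENTS = [
    {"customer_id": "u1", "day": "2012-10-22", "state": "signed_up"},
    {"customer_id": "u1", "day": "2012-10-25", "state": "purchased"},
    {"customer_id": "u2", "day": "2012-10-22", "state": "signed_up"},
    {"customer_id": "u3", "day": "2012-10-23", "state": "signed_up"},
    {"customer_id": "u3", "day": "2012-10-30", "state": "purchased_twice"},
]

print(to_csv(run(EVENTS)))
```

In a real job the two `shuffle` calls are done by the framework between stages; they are spelled out here only so the data flow is visible.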
The output table looks like this:
| Cohort | Source | User signed up | User purchased | User purchased twice | ... |
The output graph is normalized to 100% and looks like this:
The multi-level MapReduce as a diagram with intermediates:
The most important parts:
Each user only counts towards a single row. The row to select is the very first day they entered your system. This is the user's cohort. The reducer in step #3 will have access to the user's full history in your system. There you'll need to sort the user's events by time and figure out the earliest date they were active.
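A rough sketch of picking the cohort inside the per-user reducer might look like the following. The `timestamp` field and the `cohort_date` name are hypothetical; your event records will likely look different.

```python
from datetime import datetime

def cohort_date(events):
    """Return the calendar date of the user's earliest event.

    `events` is assumed to be a non-empty list of dicts, each carrying a
    `timestamp` datetime (a made-up shape for this example).
    """
    ordered = sorted(events, key=lambda e: e["timestamp"])
    return ordered[0]["timestamp"].date()

history = [
    {"timestamp": datetime(2012, 10, 25, 9, 30), "type": "purchase"},
    {"timestamp": datetime(2012, 10, 22, 18, 5), "type": "signup"},
]
print(cohort_date(history))
```

If the earliest date is all you need, `min` over the timestamps avoids the full sort; sorting is handy when the same reducer also walks the history in order to work out the user's state.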
Each user only counts in a single column. In #3 you need to decide what the user's mutually exclusive state is. This should be the furthest point along in your product lifecycle that they've reached. For my product the states go from "sign-up" to "repeat customer".
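One way to pick the mutually exclusive state is to rank the states and keep the highest one seen. In the sketch below only the endpoints ("sign-up" and "repeat customer") come from my product; the intermediate state and all identifiers are placeholders for illustration.

```python
# Hypothetical lifecycle, ordered from earliest to furthest along.
LIFECYCLE = ["signed_up", "purchased", "repeat_customer"]
RANK = {state: i for i, state in enumerate(LIFECYCLE)}

def furthest_state(observed_states):
    """Return the single furthest-along state the user has reached."""
    return max(observed_states, key=RANK.__getitem__)

print(furthest_state(["signed_up", "purchased", "signed_up"]))
```

Because exactly one state comes out per user, the columns of a cohort row always sum to the number of users in that cohort.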
Representing state mappings. For the state_mapping intermediates above, I use JSON dictionaries mapping exclusive states to the number 1. Then I fold the dicts in the reducer in step #5 to get the totals for each output row.
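As a small illustration of that fold, assuming each intermediate value arrives as a JSON string (the function name and state names are made up):

```python
import json
from collections import Counter

def fold_state_mappings(json_mappings):
    """Sum a stream of JSON-encoded {state: 1} dicts into per-state totals."""
    totals = Counter()
    for raw in json_mappings:
        totals.update(json.loads(raw))  # Counter.update adds counts per key
    return dict(totals)

row = fold_state_mappings([
    '{"signed_up": 1}',
    '{"purchased": 1}',
    '{"signed_up": 1}',
])
print(json.dumps(row, sort_keys=True))
```

Since the fold is just addition, the same function could probably also run as a combiner before the second shuffle to cut down on intermediate data.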
Segment your cohorts. Similar to tracking an advertising campaign, you can segment your cohorts and visualize them independently. For example, say one group of users signed up from a homepage flow, while another group signed up via a blog flow. These can be analyzed separately too. To do this I make the reducer in #3 output a cohort_id that contains the cohort date and the cohort group (e.g., "10/22/12:homepage").
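Building and splitting such a composite key could look roughly like this. The helper names are hypothetical; the `MM/DD/YY:group` layout follows the example above.

```python
from datetime import date

def make_cohort_id(cohort_day, group):
    """Combine the cohort date and segment into one key, e.g. '10/22/12:homepage'."""
    return f"{cohort_day.strftime('%m/%d/%y')}:{group}"

def split_cohort_id(cohort_id):
    """Recover (date_string, group) from a composite cohort_id."""
    day, group = cohort_id.split(":", 1)
    return day, group

cohort_id = make_cohort_id(date(2012, 10, 22), "homepage")
print(cohort_id)
print(split_cohort_id(cohort_id))
```

One thing to keep in mind: keys in `MM/DD/YY` form do not sort chronologically as plain strings, so if the output rows are ordered by key, an ISO date (`2012-10-22:homepage`) may be the easier choice.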
Let me know if you have any suggestions! This is my shiny new hammer and everything looks like a nail~
(follow-up on YC HN, r/programming)