Tracking Newsfeed activity

Written on August 1, 2018

Let’s imagine a user browsing Facebook’s Newsfeed. This post describes how I would track basic Newsfeed user activity, and design supporting schemas in Redshift.

feed_init triggers when the user lands on a web page or app screen with a Newsfeed.

Field Type Description
user_id uuid id of the user. Distribution key in Redshift.
ts timestamp wallclock time when the event triggered. Sort key
session_id uuid each feed load generates a unique session_id

feed_show_story triggers every time a story is displayed to the user.

Field Type Description
user_id uuid distribution key
ts timestamp sort key
session_id uuid each feed load generates a unique session_id
story_id timestamp uuid of the story shown
story_index timestamp index of the story in the feed, start at zero

Both events are sent to and stored in Redshift, one table per event.

So far we only have raw logs. Analysts prefer sessions, so let’s give them feed_sessions:

Field Type Description
user_id uuid distribution key
session_start_at timestamp sort key
session_id uuid each feed load generates a unique session_id
stories_seen int max story index, 0 if none were seen

This sessionized table comes from an ETL job like this one in Redshift SQL dialect:

select 
  i.user_id,
  i.ts as session_start_at,
  i.session_id,
  nvl(max(s.story_index),0) as stories_seen
from feed_init i
left join feed_show_story s
  on i.session_id = s.session_id
  and datediff('second', i.ts, s.ts) <= 12*60*60 
  -- cap sessions to 12h
where i.ts >= '2018-08-01' 
  and i.ts < '2018-08-02'
  and s.ts >= '2018-08-01'
  and s.ts < '2018-08-02 12:00:00' 
  -- extra 12h so as to not cut sessions starting at 23:59

In a next post, I’ll cover the inefficiency with joining this way on Redshift.