Setting up analytics and BI tools
Notes from a Holistics.io book on setting up analytics and BI tools.
Chapter 1. High-level Overview of an Analytics Setup
Basics: dirty (i.e. raw operational) data in a lake > clean data in a data warehouse (DW) > visualization/reporting/ML for stakeholders.
Holistics’ tenets:
- ELT rather than ETL
- DW in cloud rather than on premise
- center on data modeling
- analytics: SQL rather than non-SQL
- focus on analytics workflow rather than visualization
Chapter 2. Centralizing Data
DW is where you store and process data for analytics.
ETL: data is transformed in the tool's memory, then loaded into the DW. Compare ELT: raw data is loaded first, then transformed inside the DW. Now that storage is cheap, ELT beats ETL.
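A minimal ELT sketch in SQL, assuming the EL tool has already landed a hypothetical `raw.orders` table in the warehouse:

```sql
-- ELT: the "T" runs inside the warehouse as plain SQL,
-- after the raw data has already been loaded untouched.
create table analytics.daily_orders as
select
    order_date,
    count(*)        as order_count,
    sum(amount_usd) as revenue_usd
from raw.orders               -- loaded as-is by the EL tool
where status <> 'cancelled'
group by order_date;
```

Under ETL, that same aggregation would have happened in the ETL tool's memory before anything touched the DW.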
Extract and Load is much more than just a `select * from source_db` (think incremental syncs, schema drift, rate limits). So don't reinvent the wheel; use off-the-shelf tools (a bulk-load sketch follows the list):
- ETL tools: Airflow, Prefect, Meltano, Talend, Stitchdata.
- DW: Redshift, Bigquery, Snowflake, Presto.
- Data lake: S3 or GCS.
- Transform tools (to clean, aggregate, pre-compute): dbt, Looker, Holistics.
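The "L" step usually compiles down to a warehouse bulk-load command. A Redshift-flavored sketch (bucket, table, and role names are hypothetical):

```sql
-- Bulk-load Parquet files from the data lake into a raw table.
copy raw.orders
from 's3://my-data-lake/orders/2024-06-01/'
iam_role 'arn:aws:iam::123456789012:role/redshift-loader'
format as parquet;
```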
Chapter 3. Data Modeling for Analytics
The data analyst translates business logic into data logic in the modeling layer, e.g. via dbt. This enables self-serve analytics for stakeholders. The goal of modeling is self-service, so business needs and usage drive the modeling, not the other way around.
Data models are abstractions on top of tables. Models also store metadata: calculated/derived fields and relationship mappings. Models can be persisted back into the DW.
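A sketch of what a model compiles down to when persisted, assuming a hypothetical orders model with one calculated field:

```sql
-- A model is roughly a named, documented SELECT plus metadata.
create table analytics.orders_model as
select
    o.id,
    o.customer_id,                                  -- relationship: joins to the customers model
    o.amount_usd,
    o.cost_usd,
    o.amount_usd - o.cost_usd as gross_margin_usd   -- calculated/derived field
from raw.orders o;
```

Modeling tools generate and manage this kind of SQL, choosing whether the model stays virtual (a view) or is persisted (a table).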
Modeling approaches: Kimball, Inmon, Data Vault, star schema.
Modeling has 4 steps: pick business process, decide grain, choose dimensions, choose measures.
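The four steps sketched as a star schema in SQL (all names hypothetical): business process = orders, grain = one row per order line, dimensions = date/customer/product, measures = quantity and amount.

```sql
-- Grain: one row per order line.
create table fact_order_lines (
    order_date_key int,            -- dimension: when
    customer_key   int,            -- dimension: who
    product_key    int,            -- dimension: what
    quantity       int,            -- measure
    amount_usd     numeric(12,2)   -- measure
);

-- One dimension table per "by" in the business question.
create table dim_customer (
    customer_key int,
    name         varchar,
    country      varchar
);
```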
For slowly changing dimensions, snapshot them all daily: storage and compute are cheap, but engineers' time is expensive.
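A minimal daily-snapshot sketch for such a dimension, assuming the hypothetical dim_customer above and a scheduler that runs it once a day:

```sql
-- Append today's full copy; point-in-time history is then just a filter.
insert into snapshots.dim_customer_daily
select current_date as snapshot_date, c.*
from dim_customer c;
```

This trades storage for simplicity versus classic Kimball SCD Type 2 change tracking.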
Chapter 4. Using Data
1990s: mostly banks shuffled data, and 1 GB of RAM cost $30k. Analysts used Cognos to view cubes built by data engineers. The waterfall process meant updating cubes took weeks; everything bottlenecked on data engineers.
2000s: tech companies appear. Analysts use SQL Server as the DW, and to define and run cubes. Any employee can go to Tableau and self-serve their metrics. Problem: every team defines metrics their own way in Tableau, leading to discrepancies and confusion.
2016: tech startups. Analysts define metrics in a single place in Looker, which runs on top of a Redshift DW fed by ELT.
NoSQL tech from the early 2010s, like Hadoop and Spark, did not catch on for analytics the way SQL did: analysts are not engineers, and SQL is what they know.
| Dimension of BI tool | Old way | Recommended by Holistics |
|---|---|---|
| SQL vs non-SQL | Non-SQL: Tableau, PowerBI, Sisense | SQL: Holistics, Looker, Mode, Redash, Metabase |
| Embedded vs external datastore | Embedded: MicroStrategy, Tableau, PowerBI, Sisense | External: Holistics, Looker, Metabase, Redash |
| In-memory (i.e. local) vs in-database | In-memory: Tableau, MicroStrategy, Sisense, PowerBI, etc. | In-database: Holistics, Looker, Redash, Metabase, etc. |
| Modeling vs non-modeling | Non-modeling: Tableau, Mode, Redash | Modeling: Qlik, PowerBI, Looker, Holistics |
Company data maturity cycle: first ad-hoc queries, then static reports/dashboards, then self-service from prepared datasets.