Data Warehouse for Online Business -- Introduction

This post starts my blog series on Data Warehouses for Online Business. The series will describe
the specification of such a Data Warehouse and the common architectures for implementing it. The following is the high-level architecture of such a data warehouse system:



The Data Warehouse contains the following components and sub-systems:

Data Ingestion/Input

Data input includes both snapshots of system state and events. The former is usually implemented as snapshots of database tables. The latter includes both system events (such as DB updates and events from online services) and user interaction events captured as tracking data.
Events are usually ingested through a message Pub-Sub system (e.g., Kafka) or by batch-copying log files from online services.
DB tables are periodically snapshot-copied into the Data Warehouse's data store.
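As a concrete sketch of the event side of ingestion: before an event reaches the Pub-Sub system, it is typically wrapped in an envelope and serialized. The envelope fields below (`event_type`, `ts_ms`, `payload`) are a hypothetical schema, not a standard:

```python
import json
import time

def encode_event(event_type, payload):
    """Wrap a raw payload in a minimal event envelope (hypothetical schema)
    and serialize it to bytes, as a Pub-Sub producer would send it."""
    envelope = {
        "event_type": event_type,
        "ts_ms": int(time.time() * 1000),
        "payload": payload,
    }
    return json.dumps(envelope).encode("utf-8")

# A Kafka producer might then publish this to a topic such as "user-events".
msg = encode_event("page_view", {"user_id": 42, "url": "/home"})
decoded = json.loads(msg.decode("utf-8"))
print(decoded["event_type"])  # page_view
```

In practice the serialization format is often Avro or Protobuf rather than JSON, with the schema managed centrally so downstream pipelines can evolve safely.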

Data Store

I like to use the abstraction of "tables" to represent data in the Data Store, though they may include both normalized (or, less strictly, flat) tables and denormalized tables with nested fields (e.g., Google BigQuery). A distributed file system is usually the base storage layer for the data store; Hadoop's HDFS is such a DFS, optimized for appends and sequential reads. Parquet is a popular open-source choice for organizing columnar tables over HDFS files. In addition, metadata management and transaction support over the data store can be provided by open-source Hive or Delta Lake (open-sourced in 2019).
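To illustrate why columnar formats like Parquet suit analytical scans, here is a minimal pure-Python sketch (toy data, not the actual Parquet encoding) contrasting row-oriented and column-oriented layouts:

```python
# Row-oriented layout: each record is stored contiguously.
rows = [
    {"user_id": 1, "url": "/home", "ms": 120},
    {"user_id": 2, "url": "/cart", "ms": 340},
    {"user_id": 3, "url": "/home", "ms": 90},
]

# Columnar layout (the idea behind Parquet): one contiguous array per
# column, so scanning a single column reads only that column's bytes
# and compresses better, since values of one type sit together.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Aggregating one column touches no other column's data.
avg_ms = sum(columns["ms"]) / len(columns["ms"])
print(avg_ms)
```

Real Parquet files add per-column encodings, compression, and row-group statistics on top of this basic idea, which is what makes selective scans over HDFS cheap.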

Batch Processing (Transformation)

Batch processing transforms the tables in the Data Store. Each batch pipeline reads from input tables and writes to output tables. Some popular open-source batch processing programming models are:
  • MapReduce over record files.
  • Spark over RDDs.
  • Spark SQL over DataFrames (tables).
A data warehouse usually has multiple batch pipelines with dependencies between them, where one pipeline consumes the outputs of others. Such dependency relationships can be modeled as a DAG for the purpose of pipeline orchestration. Batch pipelines are usually run periodically to process recently generated or accumulated data.
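The DAG-based orchestration idea can be sketched with Python's stdlib `graphlib` as a stand-in for a full orchestrator such as Airflow. The pipeline names below are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline dependency DAG: each key maps to the set of
# pipelines whose output tables it consumes.
deps = {
    "clean_events": set(),
    "sessionize": {"clean_events"},
    "daily_revenue": {"sessionize"},
    "bot_filter": {"clean_events"},
    "dashboard_rollup": {"daily_revenue", "bot_filter"},
}

# An orchestrator runs pipelines in a topological order, so every
# pipeline starts only after the tables it reads have been produced.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

A real orchestrator adds scheduling (e.g., nightly runs), retries, and backfills on top of this ordering, but the dependency model is the same.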

Stream Processing (Transformation)

Stream computation takes streams of events as input and output. A stream pipeline aggregates input events over a time window and emits events that represent updates to system state. For example, bot detection aggregates user impressions and actions by IP address over a sliding time window and emits events labeling IP addresses as bot sources.
Multiple stream pipelines are usually connected via an event Pub-Sub system. Some stream pipelines may sink directly to tables in the Data Store or output to the OLAP engine (sometimes through event Pub-Sub).
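The sliding-window bot-detection example can be sketched in plain Python. The window length and threshold below are made-up parameters; a production pipeline would run this logic in a stream processor (e.g., Flink or Spark Streaming) with state partitioned by IP:

```python
from collections import defaultdict, deque

WINDOW_MS = 60_000   # sliding window length (assumed value)
THRESHOLD = 100      # events per window that flag an IP (assumed value)

windows = defaultdict(deque)  # ip -> timestamps of recent events

def process(ip, ts_ms):
    """Feed one impression/action event; return True if the IP now
    exceeds the per-window threshold and should be labeled a bot source."""
    q = windows[ip]
    q.append(ts_ms)
    # Evict events that have fallen out of the sliding window.
    while q and q[0] <= ts_ms - WINDOW_MS:
        q.popleft()
    return len(q) > THRESHOLD

# 101 events from one IP within a single window trip the threshold.
flagged = [process("10.0.0.1", t) for t in range(0, 101)]
print(flagged[0], flagged[-1])
```

The output of such a pipeline would itself be a stream of "IP labeled as bot" events, published back to Pub-Sub for downstream consumers.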

OLAP Engine

This subsystem is optimized for online analytical queries. It usually includes its own storage and is traditionally referred to as an analytical database. Some tables in the Data Store are periodically imported into the OLAP engine, and some OLAP engines also support streaming ingestion. For example, Apache Druid is a general-purpose OLAP engine with an internal columnar store, while Apache Kylin focuses on data cubes and uses HBase for storage.
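To make the data-cube idea concrete, here is a toy Python sketch that materializes every group-by combination of two dimensions, which is roughly what a cube-oriented engine like Kylin precomputes (the fact rows and dimensions are invented for illustration):

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical fact rows: (country, device, revenue).
facts = [
    ("US", "mobile", 10), ("US", "desktop", 20),
    ("DE", "mobile", 5),  ("DE", "desktop", 15),
]
dims = ("country", "device")

# Precompute every group-by combination (the "cuboids" of a data cube),
# from the grand total down to the full (country, device) breakdown.
cube = defaultdict(lambda: defaultdict(int))
for country, device, revenue in facts:
    values = {"country": country, "device": device}
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            key = tuple(values[d] for d in group)
            cube[group][key] += revenue

print(cube[()][()])                 # grand total
print(cube[("country",)][("US",)])  # US total
```

Precomputing cuboids trades storage and ingest-time work for very fast aggregate queries, which is the core bargain an OLAP engine offers over scanning the Data Store directly.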

Data Reporting/Visualization

These are user-facing services that query the OLAP engine and generate/display visualizations and reports from its data. For example, Pivot is an open-source reporting UI over Druid.

Ad-hoc Analysis

Ad-hoc analysis is usually implemented as one-time batch processing of data in the Data Store, and an SQL-like query language is usually preferred for ease of use. For example, Hive SQL runs as MapReduce jobs and Spark SQL runs as Spark batch jobs. Some users may prefer running ad-hoc analysis on general-purpose OLAP engines (e.g., Google Dremel, Apache Druid) for better performance, at the additional cost of periodically loading the data into the OLAP engine.
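As a small illustration of the ad-hoc query shape, the snippet below uses Python's stdlib `sqlite3` purely as a stand-in engine; the same SQL would run as Hive SQL or Spark SQL over Data Store tables (the `page_views` table and its data are invented):

```python
import sqlite3

# In-memory stand-in for an SQL-on-files engine.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE page_views (user_id INT, url TEXT)")
con.executemany("INSERT INTO page_views VALUES (?, ?)",
                [(1, "/home"), (2, "/cart"), (1, "/home"), (3, "/home")])

# A typical one-off analytical question: which pages get the most views?
top = con.execute(
    "SELECT url, COUNT(*) AS views FROM page_views "
    "GROUP BY url ORDER BY views DESC"
).fetchall()
print(top)  # [('/home', 3), ('/cart', 1)]
```

The value of the SQL layer is exactly this: an analyst writes one declarative query, and the engine (MapReduce, Spark, or an OLAP engine) decides how to execute it at scale.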

Machine Learning

The data warehouse may also provide an ML interface over its large-scale parallel batch processing. For example, PySpark enables running feature transformation and model training as Spark batch jobs over tables in the Data Store.
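As a minimal sketch of the feature-transformation step, here is min-max scaling in plain Python; with PySpark the same transformation would run as a distributed job over a Data Store table rather than over an in-memory list:

```python
def min_max_scale(values):
    """Scale a numeric feature column to [0, 1] (a common ML
    preprocessing step before model training)."""
    lo, hi = min(values), max(values)
    span = hi - lo or 1.0  # guard against a constant column
    return [(v - lo) / span for v in values]

# Hypothetical feature column: session lengths in seconds.
session_lengths = [30.0, 90.0, 150.0]
print(min_max_scale(session_lengths))  # [0.0, 0.5, 1.0]
```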


I will go into detail on each sub-system in a series of posts and will add links to them on this page.
