Skip to content

Raw Sink Service Specification

This document specifies the behavior of the service that batches raw messages into long term storage.

Data Flow

Consume messages from a Google Cloud PubSub topic and write in batches to BigQuery. Split batches based by time windows based on when they were retrieved from PubSub. Additionally, split batches when they reach a certain size, if possible.

Implementation

Execute this as a custom Java application running on GKE.

Latency

Accept a configuration for batch window size. Deliver batches to BigQuery within 5 minutes of the batch window closing.

Other Considerations

Message Acks

Only acknowledge messages in the PubSub topic subscription after delivery to BigQuery.