Skip to content

Pain points

A running list of things that are suboptimal in GCP.

App Engine

For network-bound applications it can be prohibitively expensive. A PubSub push subscription application that decodes protobuf and forwards messages to the ingestion-edge used ~300 instances at $0.06 per instance hour to handle ~5krps, which is ~$13K/mo.

Dataflow

Replaces certain components with custom behavior that is not part of the open source Beam API, making it so they can't be extended (e.g. to expose a stream of messages that have been delivered to PubSub).

BigQueryIO.Write

Requires decoding PubsubMessage.payload from JSON to a TableRow, which gets encoded as JSON to be sent to BigQuery.

Crashes the pipeline when the destination table does not exist.

FileIO.Write

Acknowledges messages in PubSub before they are written to accumulate data across multiple bundles and produce reasonably sized files. Possible workaround being investigated in #380. This also effects BigQueryIO.Write in batch mode.

PubsubIO.Write

Does not support dynamic destinations.

Does not support [NestedValueProvider] for destinations in streaming mode on Dataflow, which is needed to create classic templates that accept a mapping of document type to a predetermined number of destinations. This is because Dataflow moves the implementation into the shuffler to improve performance. Current workaround is to specify mapping at classic template creation time, or use Flex Templates.

Does not use standard client library.

Does not expose an output of delivered messages, which is needed for at least once delivery with deduplication. Current workaround is to use the deduplication available via PubsubIO.read().

Uses HTTPS JSON API, which increases message payload size vs protobuf by 25% for base64 encoding and causes some messages to exceed the 10MB request size limit that otherwise would not.

Templates

Does not support repeated parameters via ValueProvider<List<...>>, as described in Dataflow Java SDK #632.

PubSub

Can be prohibitively expensive. It costs ~$51K/mo to use PubSub with a 70MiB/s stream published or consumed 7 times (Edge to raw topic, raw topic to Cloud Storage, raw topic to Decoder, Decoder to decoded topic, decoded topic to Decoder for deduplication, decoded topic to Cloud Storage, decoded topic to BigQuery).

Push Subscriptions are limited to min(10MB, 1000 messages) in flight, making the theoretical maximum parallel latency per message ~62ms to achieve 16krps.