Pain points
A running list of things that are suboptimal in GCP.
App Engine
For network-bound applications it can be prohibitively expensive. A PubSub push subscription application that decodes protobuf and forwards messages to the ingestion-edge used ~300 instances at $0.06 per instance hour to handle ~5krps, which is ~$13K/mo.
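For reference, that figure is the straightforward instance-hour arithmetic, assuming ~730 hours in a month: 300 instances × $0.06 per instance hour × ~730 hours ≈ $13,100, i.e. ~$13K/mo.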
Dataflow
Replaces certain components with custom behavior that is not part of the open source Beam API, so they cannot be extended (e.g. to expose a stream of messages that have been delivered to PubSub).
BigQueryIO.Write
Requires decoding PubsubMessage.payload from JSON to a TableRow, which then gets re-encoded as JSON to be sent to BigQuery (see the sketch below).
Crashes the pipeline when the destination table does not exist.
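The sketch below shows the shape of that decode step, assuming the payload is a single JSON object; the class name and wiring are hypothetical illustrations, not this project's code. It exists only because BigQueryIO.Write accepts TableRow objects, which it then re-serializes to JSON for the BigQuery API.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.api.services.bigquery.model.TableRow;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

/** Hypothetical transform: parse a JSON payload into a TableRow for BigQueryIO.Write. */
public class PayloadToTableRow {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static MapElements<PubsubMessage, TableRow> of() {
    return MapElements.into(TypeDescriptor.of(TableRow.class))
        .via((PubsubMessage message) -> {
          try {
            // The payload bytes are already JSON; they are decoded here only so
            // that BigQueryIO.Write can encode them back to JSON.
            return MAPPER.readValue(message.getPayload(), TableRow.class);
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        });
  }
}
```

Usage would look something like messages.apply(PayloadToTableRow.of()).apply(BigQueryIO.writeTableRows().to(tableSpec)).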
FileIO.Write
Acknowledges messages in PubSub before they are written, because data is accumulated across multiple bundles in order to produce reasonably sized files. A possible workaround is being investigated in #380. This also affects BigQueryIO.Write in batch mode.
PubsubIO.Write
Does not support dynamic destinations.
Does not use the standard client library.
Does not expose an output of delivered messages, which is needed for at-least-once delivery with deduplication. The current workaround is to use the deduplication available via PubsubIO.read() (see the sketch after this list).
Uses the HTTPS JSON API, which base64 encodes message payloads, increasing their size by ~33% vs protobuf and causing some messages that would otherwise fit to exceed the 10MB request size limit.
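A minimal sketch of that read-side deduplication workaround follows; the project, subscription, and ID attribute names are hypothetical. Duplicates are dropped on read based on the attribute, since PubsubIO.Write offers no stream of delivered messages to deduplicate against.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class DedupeOnRead {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<PubsubMessage> messages = pipeline.apply("ReadDeduped",
        PubsubIO.readMessagesWithAttributes()
            // Hypothetical subscription path.
            .fromSubscription("projects/my-project/subscriptions/decoded")
            // Messages sharing a value for this attribute are deduplicated on read.
            .withIdAttribute("messageId"));

    // ... downstream transforms would go here ...

    pipeline.run();
  }
}
```

Note that this deduplication applies only within the runner's deduplication window on read, not end to end, which is why an output of delivered messages from PubsubIO.Write would still be preferable.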
PubSub
Can be prohibitively expensive. It costs ~$51K/mo to use PubSub with a 70MiB/s stream published or consumed 7 times (Edge to raw topic, raw topic to Cloud Storage, raw topic to Decoder, Decoder to decoded topic, decoded topic to Decoder for deduplication, decoded topic to Cloud Storage, decoded topic to BigQuery).
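For reference, assuming PubSub's message delivery price of roughly $0.04 per GiB (about $40 per TiB): a 70MiB/s stream is ~183,000 GiB over a 31-day month, and 183,000 GiB × 7 × $0.04/GiB ≈ $51K.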
Push Subscriptions are limited to min(10MB, 1000 messages) in flight, making the theoretical maximum parallel latency per message ~62ms to achieve 16krps.