Umbrella task to track changes related to this issue:
We noticed some dropped UDP packets (from /proc/net/udp) for the Benthos socket used by HAProxy (cp hosts) and nginx (ncredir hosts). This is somewhat expected when Benthos cannot keep up with a large volume of requests.
An ideal setup should be able to cope with 150k reqs/s (spikes experienced during large DDoS attacks), possibly allowing some time to spool off the backlog afterwards, but without dropping any messages.
Some notes (a summary of the last few days' conversation about this topic):
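To make the "spool off" requirement concrete, here is a rough sizing sketch. Only the 150k reqs/s spike rate comes from this task; the drain rate, spike duration and average message size used below are hypothetical placeholders, not measurements.

```python
def spool_bytes_needed(spike_rate, drain_rate, spike_seconds, avg_msg_bytes):
    """Buffer size (bytes) needed to absorb a spike without dropping messages.

    spike_rate    -- incoming messages per second during the spike
    drain_rate    -- messages per second the consumer can actually process
    spike_seconds -- how long the spike lasts
    avg_msg_bytes -- average size of one message
    """
    backlog_msgs = max(0, spike_rate - drain_rate) * spike_seconds
    return backlog_msgs * avg_msg_bytes


def spool_off_seconds(backlog_msgs, drain_rate, steady_rate):
    """Time needed to drain the backlog once traffic is back to steady_rate."""
    spare_capacity = drain_rate - steady_rate
    if spare_capacity <= 0:
        raise ValueError("consumer never catches up: drain_rate <= steady_rate")
    return backlog_msgs / spare_capacity


if __name__ == "__main__":
    # All values except the 150k spike rate are made up for illustration.
    needed = spool_bytes_needed(
        spike_rate=150_000, drain_rate=100_000, spike_seconds=60, avg_msg_bytes=1024
    )
    print(f"buffer needed: {needed / 1e9:.2f} GB")
```

The takeaway is that the buffer grows with the gap between the spike rate and what the consumer can process, so measuring the real sustained Benthos throughput is a prerequisite for sizing any spool.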
- Dropped packets/messages aren't tracked in Benthos as they are discarded "silently" at OS level, so we need a way to track those and eventually trigger alerts
- Switching to UDS for Benthos is a quick fix to avoid dropped messages, but:
  - It must be done at the systemd level (passing the unixgram socket as stdin to Benthos) or by patching the Benthos input to support this.
  - When the UDS buffer is full, it blocks the writer, which is something we definitely want to avoid.
- Interposing some kind of buffer (e.g. fifo-log-demux) between the producer and Benthos could definitely help absorb large request spikes. In addition, the new release exposes some metrics about discarded messages.
- We experience buffer overruns in VarnishKafka too (journalctl -u varnishkafka-webrequest.service --grep "Log overrun"). Does this mean that we are already losing some messages with the current setup too?
- A nice feature to have is the ability to "tap" into the current setup to get some insight into the logs before (or after) they are processed by Benthos, in real time. This is especially useful for debugging. With Benthos this is technically feasible, but we noticed a significant performance drop when doing so. With fifo-log-demux it should be as easy as connecting another client to the "read end".
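
As a starting point for tracking the OS-level drops mentioned in the notes, here is a minimal sketch that reads the per-socket `drops` counter from /proc/net/udp. It assumes the usual Linux layout, where the local port is hex-encoded in the second column and `drops` is the last column. The sample data and port numbers are invented for illustration and are not the real Benthos socket.

```python
import os

# Invented sample in the /proc/net/udp layout (header + two sockets).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
  100: 0100007F:1F90 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 12345 2 0000000000000000 42
  101: 00000000:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 12346 2 0000000000000000 0
"""


def parse_udp_drops(proc_net_udp_text):
    """Return {local_port: dropped_packets} parsed from /proc/net/udp content."""
    drops_by_port = {}
    for line in proc_net_udp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 13:
            continue  # malformed or truncated line
        # local_address looks like "0100007F:1F90"; the port is hex.
        port = int(fields[1].rsplit(":", 1)[1], 16)
        drops = int(fields[-1])  # "drops" is the last column
        # Several sockets may share a port (e.g. SO_REUSEPORT), so sum them.
        drops_by_port[port] = drops_by_port.get(port, 0) + drops
    return drops_by_port


if __name__ == "__main__":
    path = "/proc/net/udp"
    if os.path.exists(path):
        with open(path) as f:
            text = f.read()
    else:
        text = SAMPLE  # fall back to the sample on non-Linux systems
    for port, drops in sorted(parse_udp_drops(text).items()):
        if drops:
            print(f"udp port {port}: {drops} dropped packets")
```

The counter is cumulative for the lifetime of the socket, so an alert would need to be based on its rate of change between two samples rather than on the absolute value.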