Tapping data from source to the processor - expectations and caveats

In this section, we will discuss what to expect from log streaming tools in terms of performance, reliability, and scalability. A system's reliability can be characterized by its message delivery semantics, of which there are three types:

  • At most once: Messages are transferred immediately. If the transfer succeeds, the message is never sent again; however, many failure scenarios can cause messages to be lost.
  • At least once: Each message is delivered at least once. In failure cases, a message may be delivered more than once.
  • Exactly once: Each message is delivered once and only once.
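
These semantics can be sketched with a toy sender and receiver. The names below are all hypothetical; the point is that at-least-once falls out of retrying until acknowledgement (a lost ack causes a duplicate), and exactly-once is commonly built on top of at-least-once by deduplicating on a message ID:

```python
class FlakyReceiver:
    """Toy receiver whose acknowledgement can be lost in transit."""
    def __init__(self):
        self.deliveries = []      # every physical delivery that arrives
        self.lose_next_ack = False

    def receive(self, payload):
        self.deliveries.append(payload)
        if self.lose_next_ack:
            self.lose_next_ack = False
            return False          # ack lost: sender thinks delivery failed
        return True

def send_at_least_once(receiver, payload, max_retries=5):
    """Retry until an ack arrives; a lost ack produces a duplicate delivery."""
    for _ in range(max_retries):
        if receiver.receive(payload):
            return
    raise RuntimeError("delivery failed after retries")

def deliver_exactly_once(processed_ids, msg_id, payload, sink):
    """Exactly-once on top of at-least-once: drop duplicates by message ID."""
    if msg_id in processed_ids:
        return                    # duplicate caused by a retry: ignore it
    sink.append(payload)
    processed_ids.add(msg_id)
```

For example, if the first acknowledgement is lost, `send_at_least_once` delivers the same payload twice, and the deduplication step collapses the two copies back into one processed event.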

Performance covers I/O, CPU, and RAM usage and their impact on throughput. By definition, scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. So, we will identify whether or not each tool can scale to handle increased load. Scalability can be achieved horizontally or vertically: scaling horizontally means adding more machines and distributing the work across them, while scaling vertically means increasing the capacity of a single machine in terms of CPU, RAM, or IOPS.

Let's start with NiFi. It is a guaranteed-delivery (exactly once) processing engine by default, which maintains a write-ahead log and a content repository to achieve this. Performance depends on the level of reliability we choose: with guaranteed delivery, every message is written to disk and then read back from it. This is slower, but a performance cost is the price of not losing even a single message. We can create a cluster of NiFi nodes controlled by the NiFi Cluster Manager; internally, ZooKeeper keeps all of the nodes in sync. The model is master/slave, but if the master dies, all nodes continue to operate. The restrictions are that no new nodes can join the cluster and the NiFi flow cannot be changed. So, NiFi scales well in a clustered deployment.
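
NiFi's actual repositories are far more sophisticated, but the write-ahead idea behind its guarantee can be shown in a minimal sketch: persist every message to disk before acknowledging it, and on restart replay the log to recover anything that was never acknowledged. Everything below (class and file layout) is illustrative, not NiFi's implementation:

```python
import json
import os

class WriteAheadQueue:
    """Minimal write-ahead log: a message is appended to disk before a
    consumer sees it, so a crash between receipt and processing cannot
    lose it. Illustrative only; NiFi's repositories are much richer."""

    def __init__(self, path):
        self.path = path
        open(path, "a").close()   # ensure the log file exists

    def _append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # the disk cost you pay for the guarantee

    def put(self, msg_id, payload):
        self._append({"op": "put", "id": msg_id, "payload": payload})

    def ack(self, msg_id):
        self._append({"op": "ack", "id": msg_id})

    def pending(self):
        """Replay the log: anything put but never acked is still pending."""
        state = {}
        with open(self.path) as f:
            for line in f:
                rec = json.loads(line)
                if rec["op"] == "put":
                    state[rec["id"]] = rec["payload"]
                else:
                    state.pop(rec["id"], None)
        return state
```

A "restart" (constructing a new object over the same file) sees exactly the same pending set, which is what makes the guarantee survive a crash.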

Fluentd provides at-most-once and at-least-once delivery semantics. Reliability and performance are tuned using the buffer plugin. The memory buffer maintains a queue of chunks: when the top chunk exceeds the specified size or time limit, a new empty chunk is pushed onto the top of the queue, and the bottom chunk is written out as soon as a new chunk is pushed. The file buffer provides a persistent buffer implementation that stores buffer chunks in files on disk. As per its documentation, Fluentd scales well because an M×N connection problem is reduced to M+N, where M is the number of input plugins and N is the number of output plugins. By configuring multiple log forwarders and log aggregators, we can scale the deployment.
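
The buffering behavior described above is driven from configuration. A file-buffer sketch for a forwarder might look like the following; the host, path, and limits are illustrative values, not recommendations:

```
<match app.**>
  @type forward
  <server>
    host aggregator.example.com   # hypothetical aggregator host
    port 24224
  </server>
  <buffer>
    @type file                    # persistent buffer: chunks survive restarts
    path /var/log/fluentd/buffer  # illustrative path for chunk files
    chunk_limit_size 8MB          # roll to a new chunk past this size...
    flush_interval 5s             # ...or after this much time
    retry_forever true            # keep retrying, approximating at-least-once
  </buffer>
</match>
```

Switching `@type file` to `@type memory` trades the persistence of the file buffer for the speed of the in-memory chunk queue.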

Logstash achieves reliability and performance by using third-party message brokers, and it fits best with Redis. Other input and output plugins are available to integrate with brokers such as Kafka, RabbitMQ, and so on. By using Filebeat on leaf nodes, we can scale Logstash to read from multiple sources, from different log directories on the same source, or from the same directory on the same source with different file filters.
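
An indexer pipeline that pops events from Redis might be configured as follows; the host, list key, and Elasticsearch address are illustrative. Shippers push into the Redis list, so collection rate is decoupled from indexing rate:

```
input {
  redis {
    host      => "127.0.0.1"     # illustrative broker address
    data_type => "list"
    key       => "logstash"      # hypothetical list key shippers push to
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]  # illustrative destination
  }
}
```

If the indexer falls behind or restarts, events simply accumulate in the Redis list instead of being dropped, which is where the reliability gain comes from.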

Flume gets its reliability from channels and the transactions between sources, channels, and sinks. A sink removes an event from its channel only after the event is stored in the channel of the next agent or in the terminal repository; this is single-hop message delivery semantics. Flume uses a transactional approach to guarantee the reliable delivery of events. To move data over a network, Flume integrates with Avro and Thrift.
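
A single-hop agent of this shape can be sketched in Flume's properties format. Here an Avro source feeds a durable file channel, and an Avro sink forwards to the next agent; the agent name, ports, paths, and hostname are illustrative:

```
# agent1: receives events over Avro, stages them durably in a file
# channel, and forwards them to the next hop. An event leaves the
# channel only when the sink's transaction commits.
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type = avro
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 4141
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = file                       # durable staging
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDirs = /var/flume/data

agent1.sinks.sink1.type = avro
agent1.sinks.sink1.hostname = next-agent.example.com  # hypothetical next hop
agent1.sinks.sink1.port = 4141
agent1.sinks.sink1.channel = ch1
```

Using a `file` channel rather than a `memory` channel is what lets events survive an agent crash between the two transactions.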