Flume

Flume is one of the most widely used Apache projects for log collection and processing. To download it, refer to the following link: https://flume.apache.org/download.html. Download the apache-flume-1.7.0-bin.tar.gz archive and extract it, as follows:

cp apache-flume-1.7.0-bin.tar.gz ~/demo/
tar -xvf ~/demo/apache-flume-1.7.0-bin.tar.gz -C ~/demo/

The extracted folders and files will be as per the following screenshot:
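As a quick sanity check, you can also list the extracted directory from the terminal; this is only a sketch, and the exact file list may differ slightly between builds:

    ls ~/demo/apache-flume-1.7.0-bin
    # A typical Flume binary layout includes bin/, conf/, lib/, docs/, and tools/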

We will demonstrate the same example that we executed for the previous tools: reading from a file and pushing to a Kafka topic. First, let's create the Flume configuration file:

    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    
    a1.sources.r1.type = TAILDIR
    a1.sources.r1.positionFile = /home/ubuntu/demo/flume/tail_dir.json
    a1.sources.r1.filegroups = f1
    a1.sources.r1.filegroups.f1 = /home/ubuntu/demo/files/test
    
    a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
    a1.sinks.k1.kafka.topic = flume-example
    a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
    
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 6
    
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
  

Flume has three components that define a flow. The first is the source, where logs or events come from. Flume ships with multiple source types, such as Kafka, TAILDIR, and HTTP, and you can also define your own custom source. The second component is the sink, which is the destination where events are delivered. The third is the channel, which is the medium between the source and the sink. The most commonly used channels are Memory, File, and Kafka, but there are many more. Here, we use TAILDIR as the source, Kafka as the sink, and Memory as the channel. In the preceding configuration, a1 is the agent name, r1 is the source, k1 is the sink, and c1 is the channel.
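If you need durability across agent restarts, you could swap the memory channel for a file channel. Here is a minimal sketch, assuming the checkpoint and data directories below exist and are writable; the paths are illustrative only:

    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /home/ubuntu/demo/flume/checkpoint
    a1.channels.c1.dataDirs = /home/ubuntu/demo/flume/data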

Let's start with the source configuration. First of all, you have to define the type of the component using <agent-name>.<sources/sinks/channels>.<alias-name>.type. The next parameter is positionFile, which is required to keep track of the position in each tailed file. filegroups indicates a set of files to be tailed, and filegroups.<filegroup-name> is the absolute path of the file to tail. The sink configuration is simple and straightforward: the Kafka sink requires bootstrap servers and a topic name. The channel configuration supports many more properties, but here we used only the most important ones: capacity is the maximum number of events stored in the channel, and transactionCapacity is the maximum number of events the channel will take from a source or give to a sink per transaction.
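Note that TAILDIR can tail more than one file group at a time. The following is a minimal sketch, where the second group and its log directory are hypothetical (the filename part of each path may be a regular expression):

    a1.sources.r1.filegroups = f1 f2
    a1.sources.r1.filegroups.f1 = /home/ubuntu/demo/files/test
    a1.sources.r1.filegroups.f2 = /home/ubuntu/demo/logs/app.*.log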

Now, save the preceding configuration as conf/flume-conf.properties and start the Flume agent from the Flume home directory using the following command:

    bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name a1 -Dflume.root.logger=INFO,console

The agent will start, and the output will be as per the following screenshot:

Create a Kafka topic and name it flume-example:

bin/kafka-topics.sh --create --topic flume-example --zookeeper localhost:2181 --partitions 1 --replication-factor 1
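If you want to verify that the topic was created, you can describe it (assuming the same ZooKeeper address as above):

    bin/kafka-topics.sh --describe --topic flume-example --zookeeper localhost:2181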

Next, start the Kafka console consumer:

bin/kafka-console-consumer.sh --topic flume-example --bootstrap-server localhost:9092

Now, push some messages into the file /home/ubuntu/demo/files/test, as in the following screenshot:
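For example, you can append a few lines to the tailed file from another terminal; the message text here is arbitrary:

    echo "hello from flume" >> /home/ubuntu/demo/files/test
    echo "second test message" >> /home/ubuntu/demo/files/test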

The output from Kafka will be as seen in the following screenshot: