The NRT system and its building blocks

The first and foremost question that strikes us here is "when do we call an application an NRT application?" The simple and straightforward answer is: a software application that is able to consume, process, and generate results very close to real-time; that is, the lapse between the time an event occurs and the time the results arrive is very small, on the order of a few milliseconds to, at most, a couple of seconds.

It's very important to understand the key aspects in which traditional monolithic application systems fall short of serving the need of the hour:

  • Backend DB: Single-point, monolithic data access.
  • Ingestion flow: The pipelines are complex and tend to induce latency into the end-to-end flow.
  • Failure and recovery: The systems are prone to failure, and the recovery approach is difficult and complex.
  • Synchronization and state capture: It's very difficult to capture and maintain the state of facts and transactions in the system. Diversely distributed components and real-time failures further complicate the design and maintenance of such systems.

The answer to the previous issues is an architecture that supports streaming, and thus provides its end users with access to actionable insights in real-time over ever-flowing streams of real-time fact data. A couple of challenges to think through when designing a stream processing system are captured in the following points:

  • Local state and consistency for large-scale, high-velocity systems
  • Data doesn't arrive at fixed intervals; it keeps flowing in and streams in all the time
  • No single source of truth in the form of a backend database; instead, the applications subscribe to or tap into the stream of fact data (see the sketch after this list)
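To make the "tap into the stream" idea concrete, here is a minimal sketch of an application subscribing to a stream of fact data. It assumes Apache Kafka as the message transport, a local broker at localhost:9092, and a hypothetical topic named fact-events; it is an illustration only, not a prescribed implementation:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class FactStreamTap {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("group.id", "fact-stream-tap");          // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // "fact-events" is a hypothetical topic; the application taps the stream
            // directly instead of querying a shared backend database.
            consumer.subscribe(Collections.singletonList("fact-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```

Any number of such applications can subscribe to the same stream independently, which is precisely what removes the single backend database as the sole state of truth.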

Before we delve further, it's worthwhile understanding the notion of time with respect to these systems:

The preceding figure makes it very clear how the SLAs correlate with each type of implementation (batch, near real-time, and real-time) and the kinds of use cases each implementation caters for. For instance, batch implementations have SLAs ranging from a couple of hours to days, and such solutions are predominantly deployed for canned/pre-generated reports and trends. Near real-time solutions have SLAs on the order of a few seconds to hours and cater for situations requiring ad-hoc queries, mid-resolution aggregators, and so on. Real-time applications are the most mission critical in terms of SLA and resolution: every single event counts, and the results have to be returned within an order of milliseconds to seconds.

Now that we understand the time dimensions and SLAs with respect to NRT, real-time, and batch systems, let's move on to the next step: understanding the building blocks of NRT systems.

In essence, an NRT system consists of four main components/layers, as depicted in the following figure:

  • The message transport pipeline
  • The stream processing component
  • The low-latency data store
  • Visualization and analytical tools

The first step is the collection of data from the source and providing it to the Data Pipeline, which is actually a logical pipeline that collects the continuous events or streaming data from various producers and provides it to the consuming stream processing applications. These applications transform, collate, correlate, aggregate, and perform a variety of other operations on this live streaming data, and then finally store the results in the low-latency data store. Then, there are a variety of analytical, business intelligence, and visualization tools and dashboards that read this data from the data store and present it to the business user.
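As an illustration of that first step, the following is a minimal sketch of a producer pushing events from a source into the message transport pipeline. It again assumes Apache Kafka, a local broker, and the same hypothetical fact-events topic; the sensor keys and JSON payload are made up for the example:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class FactEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each event flows from the source into the transport pipeline; downstream,
            // the stream processing applications consume, aggregate, and persist the
            // results into the low-latency store for dashboards to read.
            for (int i = 0; i < 10; i++) {
                String payload = "{\"sensorId\":" + i + ",\"reading\":" + (20 + i) + "}";
                producer.send(new ProducerRecord<>("fact-events", "sensor-" + i, payload));
            }
            producer.flush();
        }
    }
}
```

From there, the consuming stream processing applications (such as the subscriber sketched earlier) take over: they transform and aggregate the events, write the results to the low-latency data store, and the visualization and analytical tools read from that store to present the insights to the business user.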