- Practical Real-time Data Processing and Analytics
- Shilpi Saxena Saurabh Gupta
- 2025-04-04 18:18:59
Data collection
Data collection is where every data processing journey begins. Whether the processing is batch or real-time, the first challenge is getting the data from its source to the systems that process it. We can treat the processing unit as a black box, and view the data sources and consumers as publishers and subscribers. This is captured in the following diagram:
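
The publisher/subscriber view can also be sketched in a few lines of code. The following is a minimal, in-process illustration only, not how any particular collection tool works: the class name `ProcessingUnit`, its `publish` and `subscribe` methods, and the synchronous delivery are all assumptions made for this example.

```python
from typing import Any, Callable, List


class ProcessingUnit:
    """Hypothetical black box: sources publish into it, consumers subscribe to it."""

    def __init__(self, transform: Callable[[Any], Any]) -> None:
        # The transform stands in for whatever processing happens inside the box.
        self._transform = transform
        self._subscribers: List[Callable[[Any], None]] = []

    def subscribe(self, callback: Callable[[Any], None]) -> None:
        """Register a consumer that receives every processed event."""
        self._subscribers.append(callback)

    def publish(self, event: Any) -> None:
        """Accept an event from a data source and fan the result out to consumers."""
        result = self._transform(event)
        for callback in self._subscribers:
            callback(result)


# Usage: one source publishing, one consumer collecting results.
unit = ProcessingUnit(transform=str.upper)
received: List[str] = []
unit.subscribe(received.append)
unit.publish("sensor reading")
```

In this sketch, neither the source nor the consumer knows anything about the other; both depend only on the processing unit, which is the point of the black-box view.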

The key criteria for evaluating data collection tools, for big data in general and real-time processing in particular, are as follows:
- Performance and low latency
- Scalability
- Ability to handle structured and unstructured data
Beyond these criteria, any data collection tool should be able to handle data from a variety of sources, such as:
- Data from traditional transactional systems: When considering software applications, we must understand that the industry has been collating and collecting data in traditional warehouses for a long time. This data can be held as sequential files on tape, or in systems such as Oracle, Teradata, Netezza, and so on. So, when starting on a real-time application and its associated data collection, system architects have three options:
  - Duplicate the ETL process of these traditional systems and tap the data from the source
  - Tap the data from these ETL systems
  - Adopt a virtual data lake architecture for data replication, which is the third and better approach
- Structured data from IoT/sensors/devices, or CDRs: This data arrives at a very high velocity and in a fixed format, and it can come from a variety of sensors and telecom devices. The main challenge in collecting and ingesting it is the variety of sources and the speed of arrival, so the collection tools should be capable of handling both. One advantage of this kind of data for the processing that follows is that the formats are largely standardized and fixed.
- Unstructured data from media files, text data, social media, and so on: This is the most complex of all incoming data, and the complexity comes from the dimensions of volume, velocity, variety, and structure. The data formats may vary widely and may be non-text formats such as audio or video. The data collection tools should be capable of collecting this data and assimilating it for processing.
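
To make the contrast between fixed-format and unstructured input concrete, here is a small sketch under stated assumptions. The CDR layout (`caller,callee,start time,duration`), the `CallRecord` and `RawEnvelope` types, and both function names are hypothetical and chosen only for illustration; real CDR formats and collection tools differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CallRecord:
    """Structured record: every field has a known position and type."""
    caller: str
    callee: str
    started_at: datetime
    duration_s: int


def parse_cdr(line: str) -> CallRecord:
    """Parse one line of the assumed layout: caller,callee,ISO-8601 start,seconds.

    Raises ValueError if the line does not have exactly four fields.
    """
    caller, callee, started, duration = line.strip().split(",")
    return CallRecord(caller, callee, datetime.fromisoformat(started), int(duration))


@dataclass(frozen=True)
class RawEnvelope:
    """Unstructured payload: kept opaque, with only metadata attached."""
    source: str
    content_type: str
    payload: bytes
    received_at: datetime


def wrap_unstructured(source: str, content_type: str, payload: bytes) -> RawEnvelope:
    """Attach collection metadata without interpreting the payload itself."""
    return RawEnvelope(source, content_type, payload, datetime.now(timezone.utc))


record = parse_cdr("15550001,15550002,2024-01-01T10:00:00,42\n")
envelope = wrap_unstructured("social-feed", "audio/mpeg", b"\x00\x01")
```

Because the structured record has a fixed layout, it can be validated and typed at collection time. The unstructured payload is only wrapped with metadata here, and its interpretation is left to later processing.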