Saturday, September 08, 2018

Batch ETL to Stream Processing

Many of our customers are moving their traditional ETL jobs to real-time stream processing.
The following article is an excellent read of why Kafka is an excellent choice for unified batch processing and stream processing.

Snippets from the article:

  • Several recent data trends are driving a dramatic change in the old-world batch Extract-Transform-Load (ETL) architecture: data platforms operate at company-wide scale; there are many more types of data sources; and stream data is increasingly ubiquitous
  • Enterprise Application Integration (EAI) was an early take on real-time ETL, but the technologies used were often not scalable. This led to a difficult choice with data integration in the old world: real-time but not scalable, or scalable but batch.
  • Apache Kafka is an open source streaming platform that was developed seven years ago within LinkedIn.
  • Kafka enables the building of streaming data pipelines from “source” to “sink” through the Kafka Connect API and the Kafka Streams API.
  • Logs unify batch and stream processing. A log can be consumed via batched “windows”, or in real time by examining each element as it arrives.