How Kafka and the Pub/Sub Model Fit into Event Driven Architectures
Modern Data Management for Event Driven Architectures (EDA)
As the need to manage streaming data grows, Event Driven Architectures must harness the power of streaming frameworks, in this case Kafka. A modern data architecture must manage both the state change of the data and the notifications of those changes; the notification results from the state change and spans everything from publishing to consumption.
Streaming Data Today
Streaming data is generated continuously by up to thousands of data sources, which typically emit data records simultaneously in kilobyte sizes. Processed sequentially and incrementally on a record-by-record basis or over time, streaming enables analytics and views of emerging changes and events. Processing and queries occur over rolling time windows, or even against the most recent record alone, requiring latencies on the order of seconds or milliseconds. The workloads are typically simple response functions and metrics rather than complex analytics.
The value is high for industry segments such as retail, utilities, financial services, gaming, and military, especially when combined with other relevant trend information. The application of streaming is only widening, with log files from mobile, web, and e-commerce services in play, from connected devices to the Internet of Things (IoT) to instrumentation in data centers.
At the heart of Event Driven Architecture is a design model that connects distributed software systems and allows for efficient communication. EDA makes it possible to exchange information in real time or near real time, and it is common for designing applications that rely on microservices (services running in their own processes). These EDA attributes are driving data enterprises that handle complex and streaming applications.
Putting EDAs together with streaming in a modern-day architecture is therefore necessary in all of these industry applications. Let’s now look at some of the key elements in setting this up effectively. Using a Publish/Subscribe model together with Kafka, the open-source leader for streaming, is an effective way to get started.
EDA and Publish/Subscribe (“Pub/Sub”)
Publish/Subscribe messaging, or Pub/Sub messaging, is a form of asynchronous service-to-service communication used in serverless and microservices architectures. In a Pub/Sub model, any message published to a topic is immediately received by all of the subscribers, which can be used to enable EDAs or to decouple applications in order to increase performance, reliability, and scalability.
Pub/Sub uses a flexible messaging pattern that allows disparate system components to interact with one another asynchronously as well. The key point is that Pub/Sub enables computers to communicate and react to data updates as they occur.
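The decoupling that makes Pub/Sub powerful can be sketched in a few lines of Python. This is a minimal, in-process illustration of the pattern itself (the broker class and topic names are hypothetical, not any real library's API): the publisher never knows which subscribers, if any, are listening.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

class PubSubBroker:
    """Minimal in-process broker: each topic maps to a list of
    subscriber callbacks."""
    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Every subscriber on the topic receives the message; the
        # publisher knows nothing about who (if anyone) is listening.
        for handler in self._subscribers[topic]:
            handler(message)

# Usage: two decoupled consumers react to the same order event.
broker = PubSubBroker()
billing, shipping = [], []
broker.subscribe("orders", billing.append)
broker.subscribe("orders", shipping.append)
broker.publish("orders", {"order_id": 42, "status": "created"})
```

Adding a third consumer later requires no change to the publisher, which is exactly the loose coupling discussed below.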
Pub/Sub and EDA
Optimizing EDA means using the power of the model architecture to ensure performance. In the Publish/Subscribe framework, the following key areas should be considered:
Scalability: EDAs allow for great horizontal scalability, as one event may trigger responses from multiple systems, each with unique needs and different results.
Loose coupling: Producers and consumers are unaware of each other. There is an intermediary that receives events, processes them, and sends them to the systems interested in specific events. This allows for loose coupling of services and facilitates their modifying, testing, and deployment.
Asynchronous eventing: Event notifications are broadcast asynchronously, meaning events are published as they happen. If a consuming service is unavailable, it can process a published event later without affecting or blocking the producing service.
Fault tolerance: Because systems are loosely coupled, the failure of one does not disrupt the operation of the others.
Additional areas: Agility, interoperability.
Getting Started With Apache Kafka – For Streaming
There are many alternatives to Kafka for streaming, such as Amazon Kinesis, RabbitMQ, ActiveMQ, Red Hat AMQ, IBM MQ, and Amazon SQS, but starting with the market leader provides a pathway to enterprise success when it comes to creating an Event Driven Architecture. First, the concept of 1) producers writing data to Kafka topics and 2) consumers reading from Kafka topics into target platforms is foundational to Kafka-managed streaming.
There is more information on the Apache Kafka website, but suffice it to say Kafka is a standard for streaming data and fits well into an EDA strategy. As an event streaming solution that enables applications to manage massive amounts of data, Kafka is more than capable of supporting scalable, fault-tolerant strategies that handle billions of events. In addition, the framework’s publish-subscribe messaging approach accepts data streams easily and allows analytics workflows to be created.
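What distinguishes Kafka from a plain message queue is that a topic is an append-only log, and each consumer group tracks its own read offset. The toy model below illustrates that semantics in plain Python (a deliberate simplification of Kafka, not its client API): a record is published once and read independently by many consumers, and a late-arriving consumer still sees everything.

```python
class TopicLog:
    """Toy model of a Kafka topic partition: an append-only log that
    many consumer groups read independently via their own offsets."""
    def __init__(self) -> None:
        self.records: list = []
        self.offsets: dict = {}  # consumer group -> next offset to read

    def produce(self, record) -> None:
        self.records.append(record)

    def consume(self, group: str, max_records: int = 10) -> list:
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)  # commit the new offset
        return batch

topic = TopicLog()
for i in range(3):
    topic.produce({"event": i})

# A consumer that comes online late still sees every record,
# and two groups read the same data independently.
analytics_batch = topic.consume("analytics")   # all three records
audit_first = topic.consume("audit", 2)        # first two records
audit_second = topic.consume("audit", 2)       # the remaining record
```

Because the producer only appends and each group only advances its own offset, neither side blocks the other, which is the asynchronous, fault-tolerant behavior described above.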
In particular, with a robust EDA approach, companies can leverage the key features of Apache Kafka, including fault tolerance, low latency, scalability, APIs, and support for third-party integrations. The last feature is important, as enterprise architects are integrating Amazon Redshift, Cassandra, and Spark, to name a few.
Bringing It All Together Using Kafka Streaming for EDA
As mentioned, Kafka is a system framework that captures data in streams from events. When an event occurs, the state change is captured in the record’s key, value, timestamp, and metadata. This creates the message that drives both system management and notification. The key point is that event streaming creates a continuous data workflow, ensuring the right information is ready at the right time and place. This leads to EDA using Apache Kafka.
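The shape of such an event, a key, a value, a timestamp, and metadata, can be sketched as a simple record. This is an illustrative model only (the class, field names, and JSON serialization are assumptions for the sketch; real Kafka producers commonly use JSON, Avro, or Protobuf serializers):

```python
import json
import time
from dataclasses import dataclass, field

@dataclass
class EventRecord:
    """Sketch of a Kafka-style event: a state change captured as a
    key, value, timestamp, and metadata (here, free-form headers)."""
    key: str
    value: dict
    timestamp: float = field(default_factory=time.time)
    headers: dict = field(default_factory=dict)

    def serialize(self) -> bytes:
        # Producers typically serialize the value before publishing.
        return json.dumps(self.value).encode("utf-8")

evt = EventRecord(key="sensor-7", value={"temp_c": 21.5},
                  headers={"source": "iot-gateway"})
payload = evt.serialize()
```

The key routes the record to a partition, the value carries the payload, and the timestamp and headers supply the metadata context discussed next.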
With an Event-Driven Architecture, applications and microservices get notified of changes, creating a more responsive feedback loop; in short, the data streams themselves create real-time value from both the payload and the metadata context they carry. This is particularly important for applications that use timestamps or geo-location data, such as IoT and military systems, respectively.
Where Is The Crux Value of EDA for Kafka?
The crux value goes back to the simple and powerful role of Kafka: connecting producers and consumers for microservices and applications. That role also sets the approach for building the system.
When you can leverage state change, or “listen” intently for notifications on streaming data, you can see the value of a scalable approach to processing and messaging and decide whether to take action. With this hybrid approach, unlike messaging-only systems, you can set time parameters around sets of data to establish boundaries of capture and value. In a sense, the queue takes on a data lifecycle of its own.
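One concrete way to "set time parameters around sets of data" is windowing: grouping a stream into fixed, non-overlapping time buckets. The sketch below is a generic tumbling-window illustration (the function and its inputs are hypothetical; stream frameworks such as Kafka Streams provide equivalent windowing operators):

```python
from collections import defaultdict

def tumbling_windows(events, window_seconds):
    """Group (timestamp, value) events into fixed, non-overlapping
    time windows -- one way to put boundaries around a stream."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = ts - (ts % window_seconds)  # bucket start time
        windows[window_start].append(value)
    return dict(windows)

events = [(0, "a"), (3, "b"), (7, "c"), (11, "d")]
result = tumbling_windows(events, 5)
# events at t=0 and t=3 share the [0, 5) window; t=7 falls in [5, 10)
```

Each window then becomes a natural unit for aggregation, retention, or downstream action, the "boundaries of capture and value" described above.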
In addition, streaming data can be made available (and governed) by publishing once and allowing it to be read by many consumers. In that sense as well, the “Pub/Sub” nature of Kafka fits nicely with an EDA approach. There are additional reasons why Apache Kafka fits with EDA, but a modern data integration platform such as Dextrus shows how pipelines and transformations can be optimized within this model.
How Dextrus Enables EDA With Kafka And Spark
Dextrus Streaming Pipelines
Streaming Pipelines: Dextrus has deep capabilities for high-performance streaming. Let’s take a look at the pipeline and screenshots of the platform. Dextrus pipelines stream data from sources to multiple targets with a combination of powerful streaming transformation capabilities, i.e., applying a series of operations to a stream of data in real time, as it is received. This allows for immediate processing and analysis of the data, rather than waiting for it to be stored and then processed in batches. Streaming transformations can be used in a variety of contexts, such as real-time data processing pipelines, real-time analytics, and machine learning applications. A typical pipeline is shown below illustrating a streaming workflow.
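The idea of applying operations record by record, as data arrives rather than in stored batches, can be illustrated with a lazy generator pipeline. This is a generic sketch of the streaming-transformation concept, not Dextrus's implementation; the record fields and the filter/map steps are invented for the example:

```python
def stream_transform(records):
    """Apply filter and map steps lazily, record by record,
    as each event arrives -- no batch staging required."""
    for rec in records:
        if rec.get("status") != "error":  # filter step: drop bad records
            # map step: derive a new field without mutating the source
            yield {**rec, "amount_usd": rec["amount_cents"] / 100}

source = iter([
    {"id": 1, "status": "ok", "amount_cents": 250},
    {"id": 2, "status": "error", "amount_cents": 999},
    {"id": 3, "status": "ok", "amount_cents": 1000},
])
processed = list(stream_transform(source))
```

Because the generator pulls one record at a time, each event is transformed the moment it is received, which is the contrast with batch processing drawn above.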
Bulk Streaming Pipelines (BSP): Dextrus BSPs help to stream several datasets at once. In addition, the BSP approach is perfect for classic Extract & Load (EL) scenarios. BSP also provides an effortless way to move and replicate all table data from various sources, such as Oracle, SQL Server, MySQL, and PostgreSQL, to a target system, such as Amazon RDS, Amazon Aurora, and Snowflake.
The service also ensures that migration is done with minimal disruption to the source database by implementing database log-based extraction. It can also fetch CDC (Change Data Capture) changes using a query-based mechanism, ensuring data is consistent and accurate throughout the pipeline.
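Query-based CDC, in its simplest form, polls for rows modified since a saved checkpoint and then advances that checkpoint. The sketch below shows the general mechanism only (the table, column name `updated_at`, and checkpoint handling are assumptions for illustration, not Dextrus internals):

```python
def fetch_changes(rows, last_checkpoint):
    """Query-based CDC sketch: pick up rows modified since the last
    checkpoint, then advance the checkpoint to the newest change seen."""
    changed = [r for r in rows if r["updated_at"] > last_checkpoint]
    new_checkpoint = max(
        (r["updated_at"] for r in changed), default=last_checkpoint
    )
    return changed, new_checkpoint

# Simulated source table with a modification timestamp per row.
table = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 310},
]
batch, ckpt = fetch_changes(table, last_checkpoint=200)
```

Running the same query against a real database (e.g. `WHERE updated_at > :checkpoint`) touches only an index, which is why this approach causes minimal disruption to the source; log-based extraction avoids even that read load.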
Finally, BSPs can be equipped with schema evolution to ensure that any changes to the schema can be done in a backward-compatible way and the system can handle the changes without causing any interruption to the streaming pipeline.
Examples of Dextrus Streaming
Four scenarios are presented with screenshots to illustrate the ease of streaming with Kafka:
Scenario 1: Streaming from SQL Server to Kafka topic
Scenario 2: Streaming from raw Kafka topic to processed Kafka topic
Scenario 3: Streaming from processed Kafka topic to Snowflake as target
Scenario 4: Consuming Tweets from Twitter (real-time) and transferring into a Snowflake table
Summary: Dextrus Speeds Time To Implement Event Driven Architectures
Putting the pieces together, even with strong open-source software, takes a comprehensive effort. Today, data architects are looking toward no-code/low-code platforms, such as Dextrus, to speed the process. The Return on Investment and Total Cost of Ownership metrics bear this out for complex streaming and event-driven needs and applications.
When EDA-focused streaming pipelines are built, the transformation pipelines also begin to coalesce policy and technical performance. In short, streaming pipelines under an EDA framework can be high performing, with even higher business impact.
The Choice For The Dextrus Platform
With the foundations of a comprehensive, integrated platform and feature-rich data integration, the Dextrus Platform provides a new approach to data management for architects and engineers.
Dextrus is built for all stakeholders in a Chief Data Officer organization. The most pervasive trend today is self-service data management, where all data stakeholders can get visibility into the data and collaborate with the IT department on the best data workflow. Dextrus enables all data practitioners to reach the best decisions by doing what they do best.
Dextrus manages Event Driven Architectures for streaming. With the power of Apache Kafka or commercial software, Dextrus manages complex data workflows for all kinds of streaming needs, including financial, retail, geo-location, and power, to name a few.
Dextrus manages complex transformations. At the heart of data workflow is the data graph, where transformations are created in a rule-based and governed way. With ingest of data sorted on the front end for any source, API, or format, the Dextrus pipeline is the master manager for follow-on asset management.
Dextrus wrangles data. As different data sources move to greater value down the workflow chain, especially for delta/data lakehouse-driven use cases, organizing both structured and semi-structured textual data is paramount. With keen metadata management and a feed into machine learning, wrangled data is the core for business users who need powerful ML data tools.
Modern Data Management Today
Dextrus adds value across the entire workflow, cloud, and machine learning endpoints. The impact is a unified data experience where visibility and management are at the forefront. The best framework for decisions in data is a cogent approach where efficiency and access are preeminent. Providing ingestion and workflow for a wide spectrum of data sources and types, for both data scientists and business users, represents the core value of modern data integration.
A modern data platform must manage across the infrastructure to move and transform data that supports a multitude of use cases such as data engineering, operational data integration, cloud (including hybrid/multi-cloud), and metadata-driven machine learning. Current capabilities must include data movement topology, data virtualization, streaming, API services, complex transformations, augmented data integration, advanced data preparation, and integration portability.
Decision-makers are also embracing this vision of areas for metadata support, data governance, DataOps (CI/CD), and FinOps costing. Because data and APIs often create a massive data workflow, automated integration is a cornerstone to modern data management.