Apache Kafka is a distributed data store optimized for ingesting and processing streaming data in real time. Streaming data is data that is continuously generated by thousands of data sources, which typically send data records simultaneously. A streaming platform needs to handle this constant influx of data and process it sequentially and incrementally.
Kafka was originally developed at LinkedIn and open-sourced in early 2011. It was co-created by Jay Kreps, Neha Narkhede, and Jun Rao, and it graduated from the Apache Incubator on 23 October 2012. Jay Kreps named the software after the author Franz Kafka because it is "a system optimized for writing", and he liked Kafka's work.
Apache Kafka is based on the commit log, and it allows users to subscribe to it and publish data to any number of systems or real-time applications. Example applications include managing passenger and driver matching at Uber, providing real-time analytics and predictive maintenance for British Gas' smart home products, and performing numerous real-time services across all of LinkedIn.
Kafka provides three main functions to its users: publishing and subscribing to streams of records, durably storing streams of records in the order in which they were generated, and processing streams of records in real time.
The sections below include notes from Designing Event-Driven Systems by Ben Stopford. I would like to thank him for sharing his valuable experience and knowledge.
Kafka provides an asynchronous protocol for connecting programs together, but it is quite different from, say, TCP (Transmission Control Protocol), HTTP, or an RPC protocol. The difference is the presence of a broker: a separate piece of infrastructure that broadcasts messages to any programs that are interested in them and stores them for as long as is needed. This makes it well suited to streaming and fire-and-forget messaging.
So Kafka is a mechanism for programs to exchange information, but its home ground is event-based communication, where events are business facts that have value to more than one service and are worth keeping around.
If we consider Kafka as a messaging system—with its Connect interface, which pulls data from and pushes data to a wide range of interfaces and datastores, and streaming APIs that can manipulate data in flight—it does look a little like an ESB (enterprise service bus). The difference is that ESBs focus on the integration of legacy and off-the-shelf systems, using an ephemeral and comparably low throughput messaging layer, which encourages request-response protocols.
Kafka is a streaming platform and as such puts emphasis on high throughput events and stream processing. A Kafka cluster is a distributed system at heart, providing high availability, storage, and linear scale-out. This is quite different from traditional messaging systems, which are limited to a single machine, or if they do scale outward, those scalability properties do not stretch from end to end. Tools like Kafka Streams and KSQL allow you to write simple programs that manipulate events as they move and evolve. These make the processing capabilities of a database available in the application layer, via an API, and outside the confines of the shared broker.
Kafka provides a far higher level of throughput, availability, and storage, and there are hundreds of companies routing their core facts through a single Kafka cluster. Beyond that, streaming encourages services to retain control, particularly of their data, rather than providing orchestration from a single, central team or platform. So while having one single Kafka cluster at the center of an organization is quite common, the pattern works because it is simple—nothing more than data transfer and storage provided at scale and high availability. This is emphasized by the core mantra of event-driven services: Centralize an immutable stream of facts. Decentralize the freedom to act, adapt, and change.
There are two APIs for stream processing: Kafka Streams and KSQL. These are database engines for data in flight, allowing users to filter streams, join them together, aggregate, store state, and run arbitrary functions over the evolving dataflow. These APIs can be stateful, which means they can hold data tables much like a regular database. The third API is Connect, which has a whole ecosystem of connectors that interface with different types of databases and other endpoints, both to pull data from them and to push data to Kafka. Finally, there is a suite of utilities: Replicator and MirrorMaker, which tie disparate clusters together; the Schema Registry, which validates and manages schemas applied to messages passing through Kafka; and a number of other tools in the Confluent Platform.
Moreover, KSQL provides a SQL interface that lets users define queries and execute them over the data held in the log. These can be piped into views that users can query directly. Kafka also supports transactions. KSQL and Kafka Streams are optimized for continual computation rather than batch processing. So Kafka is designed to move data, operating on that data as it does so; it is about real-time processing first, long-term storage second.
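To give a flavor of the Streams API, here is a minimal sketch of a Kafka Streams topology that filters one stream into another as events flow through it. The topic names ("payments", "large-payments"), the application id, and the filter predicate are illustrative, not taken from any real deployment:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class FilterExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-filter"); // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read the "payments" topic as an unbounded stream and keep only
        // the events that match the (illustrative) predicate.
        KStream<String, String> payments = builder.stream("payments");
        payments
            .filter((key, value) -> value.contains("\"amount_over_limit\":true"))
            .to("large-payments");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Shut the topology down cleanly on exit.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```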
A streaming platform brings these tools together with the purpose of turning data at rest into data that flows through an organization. The broker’s ability to scale, store data, and run without interruption makes it a unique tool for connecting many disparate applications and services across a department or organization. The Connect interface makes it easy to evolve away from legacy systems, by unlocking hidden datasets and turning them into event streams. Stream processing lets applications and services embed logic directly over these resulting streams of events.
A Kafka cluster is essentially a collection of files, filled with messages, spanning many different machines. Most of Kafka’s code involves tying these various individual logs together, routing messages from producers to consumers reliably, replicating for fault tolerance, and handling failure gracefully. So it is a messaging system, at least of sorts, but it’s quite different from the message brokers that preceded it. Like any technology, it comes with both pros and cons, and these shape the design of the systems we write.
Like many good outcomes in computer science, this scalability comes largely from simplicity. The underlying abstraction is a partitioned log, essentially a set of append-only files spread over a number of machines, which encourages sequential access patterns that naturally flow with the grain of the underlying hardware.
At the heart of the Kafka messaging system sits a partitioned, replayable log. The log-structured approach is itself a simple idea: a collection of messages, appended sequentially to a file. When a service wants to read messages from Kafka, it “seeks” to the position of the last message it read, then scans sequentially, reading messages in order while periodically recording its new position in the log.
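In the Java consumer client, this seek-then-scan pattern looks roughly like the sketch below. The topic name "orders", the partition number, and the stored offset are illustrative stand-ins for a position a previous run would have recorded:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SeekAndScan {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        long lastProcessedOffset = 42L; // position recorded by a previous run (illustrative)

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(Collections.singletonList(partition));

            // "Seek" to just after the last message we read, then scan forward in order.
            consumer.seek(partition, lastProcessedOffset + 1);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```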
Taking a log-structured approach has an interesting side effect. Both reads and writes are sequential operations. This makes them sympathetic to the underlying media, leveraging prefetch, the various layers of caching, and naturally batching operations together. This in turn makes them efficient. In fact, when you read messages from Kafka, the server doesn’t even import them into the JVM (Java virtual machine). Data is copied directly from the disk buffer to the network buffer (zero-copy)—an opportunity afforded by the simplicity of both the contract and the underlying data structure.
So batched, sequential operations help with overall performance. They also make the system well suited to storing messages longer term. Most traditional message brokers are built with index structures—hash tables or B-trees—used to manage acknowledgments, filter message headers, and remove messages when they have been read. But the downside is that these indexes must be maintained, and this comes at a cost. They must be kept in memory to get good performance, limiting retention significantly. But the log is O(1) when either reading or writing messages to a partition, so whether the data is on disk or cached in memory matters far less.
Partitions and Partitioning
Partitions are a fundamental concept for most distributed data systems. A partition is just a bucket that data is put into, much like buckets used to group data in a hash table. In Kafka’s terminology each log is a replica of a partition held on a different machine. (So one partition might be replicated three times for high availability. Each replica is a separate log with the same data inside it.) What data goes into each partition is determined by a partitioner, coded into the Kafka producer. The partitioner will either spread data across the available partitions in a round-robin fashion or, if a key is provided with the message, use a hash of the key to determine the partition number. This latter point ensures that messages with the same key are always sent to the same partition and hence are strongly ordered.
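The sketch below shows how this looks from the producer's side. The topic name "user-events" and the keys are illustrative; the point is that records sharing a key land on, and are ordered within, the same partition:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both records share the key "user-42", so the default partitioner
            // hashes the key to the same partition and they are strongly ordered there.
            producer.send(new ProducerRecord<>("user-events", "user-42", "logged-in"));
            producer.send(new ProducerRecord<>("user-events", "user-42", "logged-out"));

            // A record with no key is spread across the available partitions instead.
            producer.send(new ProducerRecord<>("user-events", null, "heartbeat"));
        }
    }
}
```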
Linear Scalability
Producers spread messages over many partitions, on many machines, where each partition is a little queue; load-balanced consumers (denoted a consumer group) share the partitions between them; rate limits are applied to producers, consumers, and groups.
The main advantage of this, from an architectural perspective, is that it takes the issue of scalability off the table. With Kafka, hitting a scalability wall is virtually impossible in the context of business systems. This can be quite empowering, especially when ecosystems grow, allowing implementers to pick patterns that are a little more footloose with bandwidth and data movement.
Scalability opens other opportunities too. Single clusters can grow to company scales, without the risk of workloads overpowering the infrastructure. For example, New Relic relies on a single cluster of around 100 nodes, spanning three data centers, and processing 30 GB/s. In other, less data-intensive domains, 5- to 10-node clusters commonly support whole-company workloads. But it should be noted that not all companies take the “one big cluster” route. Netflix, for example, advises using several smaller clusters to reduce the operational overheads of running very large installations, but their largest installation is still around the 200-node mark. To manage shared clusters, it’s useful to carve bandwidth up, using the bandwidth segregation features that ship with Kafka.
Segregating Load in Multiservice Ecosystems
Service architectures are by definition multitenant. A single cluster will be used by many different services. In fact, it's not uncommon for all services in a company to share a single production cluster. But doing so opens up the potential for inadvertent denial-of-service attacks, causing service degradation or instability. To help with this, Kafka includes a throughput control feature, called quotas, that allows a defined amount of bandwidth to be allocated to specific services, ensuring that they operate within strictly enforced service-level agreements (SLAs). Greedy services are aggressively throttled, so a single cluster can be shared by any number of services without the fear of unexpected network contention.
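Quotas are typically managed with Kafka's command-line tooling, but as a minimal sketch, here is how a produce/fetch byte-rate quota might be set programmatically with the Java AdminClient (available since Kafka 2.6). The client id "analytics-service" and the byte rates are illustrative:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class QuotaExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Target all clients that identify themselves as "analytics-service" (illustrative).
            ClientQuotaEntity entity = new ClientQuotaEntity(
                    Map.of(ClientQuotaEntity.CLIENT_ID, "analytics-service"));

            // Cap produce and fetch throughput, in bytes per second, for that client.
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity, List.of(
                    new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0),
                    new ClientQuotaAlteration.Op("consumer_byte_rate", 2_097_152.0)));

            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}
```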
Ensuring Messages Are Durable
Kafka provides durability through replication. This means messages are written to a configurable number of machines so that if one or more of those machines fail, the messages will not be lost. If you configure a replication factor of three, two machines can be lost without losing data.
Highly sensitive use cases may require that data be flushed to disk synchronously, but this approach should be used sparingly. It will have a significant impact on throughput, particularly in highly concurrent environments. If you do take this approach, increase the producer batch size to increase the effectiveness of each disk flush on the machine (batches of messages are flushed together). This approach is useful for single-machine deployments, too, where a single ZooKeeper node is run on the same machine and messages are flushed to disk synchronously for resilience.
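As a sketch, the producer-side settings below reflect this advice; the batch size and linger values are illustrative and would need tuning for a real workload:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DurableProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Wait until all in-sync replicas have acknowledged each write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        // Larger batches make each (expensive) synchronous disk flush on the
        // broker more effective, since whole batches are flushed together.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 262144); // bytes; illustrative
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);      // wait up to 20 ms to fill a batch

        // On the broker side (server.properties), synchronous flushing would
        // pair with something like: log.flush.interval.messages=1
        return new KafkaProducer<>(props);
    }
}
```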
Load-Balance Services and Make Them Highly Available
Event-driven services should always be run in a highly available (HA) configuration, unless there is genuinely no requirement for HA. The main reason for this is that achieving it is essentially free: if we have one instance of a service and then start a second, the load naturally balances across the two. The same process provides high availability should one node crash.
Should one of the services fail, Kafka will detect this failure and reroute messages from the failed service to the one that remains. If the failed service comes back online, load flips back again. This process actually works by assigning whole partitions to different consumers. A strength of this approach is that a single partition can only ever be assigned to a single service instance (consumer). This is an invariant, implying that ordering is guaranteed, even as services fail and restart.
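The sketch below shows what this looks like from the consumer's side: every instance subscribes with the same group id (the id "order-processors" and topic "orders" are illustrative), and Kafka moves whole partitions between instances as they come and go:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

public class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors"); // all instances share this id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // Kafka assigns whole partitions to each instance; the listener fires
        // whenever an instance joins, leaves, or fails and partitions move.
        consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println("Losing partitions: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("Now own partitions: " + partitions);
            }
        });

        while (true) { // poll forever; a real service would also handle shutdown
            consumer.poll(Duration.ofSeconds(1)).forEach(r ->
                    System.out.println(r.partition() + ": " + r.value()));
        }
    }
}
```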
Compacted Topics
By default, topics in Kafka are retention-based: messages are retained for some configurable amount of time. Kafka also ships with a special type of topic that manages keyed datasets—that is, data that has a primary key (identifier) as you might have in a database table. These compacted topics retain only the most recent events, with any old events, for a certain key, being removed.
Compacted topics work a bit like simple log-structured merge trees (LSM trees). The topic is scanned periodically, and old messages are removed if they have been superseded (based on their key). It's worth noting that this is an asynchronous process, so a compacted topic may contain some superseded messages, which are waiting to be compacted away.
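A compacted topic is just a regular topic with its cleanup policy set to compact. A minimal sketch of creating one with the Java AdminClient follows; the topic name and the partition and replica counts are illustrative:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Keep only the latest record per key, rather than deleting by age.
            NewTopic customers = new NewTopic("customers", 3, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singletonList(customers)).all().get();
        }
    }
}
```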
Long-Term Data Storage
One of the bigger differences between Kafka and other messaging systems is that it can be used as a storage layer. In fact, it’s not uncommon to see retention-based or compacted topics holding more than 100 TB of data. But Kafka isn’t a database; it’s a commit log offering no broad query functionality (and there are no plans for this to change). But its simple contract turns out to be quite useful for storing shared datasets in large systems or company architectures.
Security
Kafka provides a number of enterprise-grade security features for both authentication and authorization. Client authentication is provided through either Kerberos or Transport Layer Security (TLS) client certificates, ensuring that the Kafka cluster knows who is making each request. There is also a Unix-like permissions system, which can be used to control which users can access which data. Network communication can be encrypted, allowing messages to be securely sent across untrusted networks. Finally, administrators can require authentication for communication between Kafka and ZooKeeper.
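As a sketch, a client configured for TLS encryption with certificate-based authentication might look like the following; the broker address, file paths, and passwords are placeholders:

```java
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SslConfigs;

import java.util.Properties;

public class TlsClientConfig {
    public static Properties tlsProperties() {
        Properties props = new Properties();
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "broker1:9093"); // placeholder

        // Encrypt traffic and authenticate the client with a TLS certificate.
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SSL");
        props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/etc/kafka/client.truststore.jks");
        props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "changeit");
        props.put(SslConfigs.SSL_KEYSTORE_LOCATION_CONFIG, "/etc/kafka/client.keystore.jks");
        props.put(SslConfigs.SSL_KEYSTORE_PASSWORD_CONFIG, "changeit");
        props.put(SslConfigs.SSL_KEY_PASSWORD_CONFIG, "changeit");
        return props;
    }
}
```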
Life is a series of natural and spontaneous changes. Don't resist them; that only creates sorrow. Let reality be reality. Let things flow naturally forward.
Lao-Tzu, 6th–5th century BCE