
API Quotas & Data Caching in Apache Kafka

· 3 min read
Amit Joshi
Engineer

Processing millions of API transactions per minute

FDX/OFX is an API mechanism for downloading US bank account data; the EU and UK use the Open Banking/PSD2 standards. The main challenge with API access is that client aggregators tend to overuse the service. Many clients query more than a couple of times a day and download multiple years of data. This results in disproportionate use of the service by some users and puts a heavy load on the system.

One solution for equitable use of the service is client quotas. A client or API user will have multiple associated accounts, so the rate limit should be applied at the account level, not on the client ID.

API Quotas

The aggregation service collects data from all account services, typically checking, savings, card, loan and investment accounts. Quotas are applied to individual accounts. To limit round trips, quotas should be applied as far upstream as possible, so a good starting point is the API gateway. API gateway rate-limit rules are based on an identifiable entity such as apps, developers, API keys, access tokens, signed requests, certificates or IPs, but the gateway does not keep a record of how many calls per day were made for an account. The next component is the aggregation service, which coordinates, maps and merges data from the backend services and sends a consolidated response.

Exploring Kafka Streams as Caching Layer

Kafka Streams has some attributes that we liked: 1) it maintains a local per-partition state store in RocksDB, so round trips to the backend are avoided; 2) it works well with microservices, since it is just a library to include, and the solution can be deployed to K8s or a VM; 3) we already had a working Kafka cluster, so no additional infrastructure was needed. Data is isolated per Kafka topic and local cache in a share-nothing fashion, with no performance impact on other applications. We decided to use Apache Kafka Streams as the quota checker and data caching solution.

Solution

  • Partition key — the account number was the partition key, since it is unique across the bank.
  • Quota checker — a KTable keyed by account id holds a request counter; a business-exception response is sent back if the count exceeds the quota (see the sketch after this list).
  • Cache TTL — topic retention and state-store TTL are configured for one day.
  • Data cache — the consolidated JSON response is mapped and stored in the cache.
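
Below is a minimal Kafka Streams sketch of the quota checker, not the exact production topology: the topic names, store name and daily limit are illustrative. Each request keyed by account id is counted in a KTable backed by RocksDB, and requests over the limit are flagged so the service can return the business exception.

```java
// Sketch only: topic names, store name and DAILY_LIMIT are assumptions.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

import java.util.Properties;

public class QuotaChecker {
    static final long DAILY_LIMIT = 4; // assumed per-account daily quota

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Each inbound API call is keyed by account id.
        KStream<String, String> requests = builder.stream("account-requests");

        // Quota KTable: running count of calls per account, backed by RocksDB.
        KTable<String, Long> quota = requests
                .groupByKey()
                .count(Materialized.as("quota-store"));

        // Join each request against its current count and flag quota violations.
        requests.join(quota, (payload, count) ->
                        count > DAILY_LIMIT ? "QUOTA_EXCEEDED" : payload)
                .to("account-requests-checked");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "quota-checker");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```

The same builder can materialize the consolidated JSON responses into a second state store to serve as the data cache.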

Round trip:

  • The API gateway forwards the request; the service looks up the quota KTable and, if this is the first request, creates an entry for the account.
  • The appropriate backend services are invoked, and their responses are mapped and saved in the Kafka Streams state store.
  • A consolidated client response is composed after collecting data from all the services.
  • On the next call for the same account, the quota counter is incremented and the cached consolidated response is picked up and sent back to the client, saving trips to the backend (a lookup sketch follows this list).
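
For subsequent calls, the cached consolidated response can be read straight from the local state store via Kafka Streams interactive queries. The sketch below assumes the responses were materialized into a store named "response-cache"; that name is an assumption, not the production value.

```java
// Sketch: read a cached consolidated response through interactive queries.
// The store name "response-cache" is assumed.
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class ResponseCacheLookup {
    private final KafkaStreams streams;

    public ResponseCacheLookup(KafkaStreams streams) {
        this.streams = streams;
    }

    /** Returns the cached consolidated JSON for the account, or null on a cache miss. */
    public String cachedResponse(String accountId) {
        ReadOnlyKeyValueStore<String, String> cache = streams.store(
                StoreQueryParameters.fromNameAndType("response-cache",
                        QueryableStoreTypes.keyValueStore()));
        return cache.get(accountId);
    }
}
```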

This resulted in significant savings in infrastructure utilization and an improved SLA: total response time dropped from minutes to sub-second, an approximately 100x improvement.

Data Privacy, Generative AI and a Possible Solution

· 3 min read
Amit Joshi
Engineer

Generative AI and Data Privacy

Generative AI tools like ChatGPT are trained on public data sets and answer questions based on public data, similar to a Google search. The challenge, however, is how to use Gen AI with private data. Gen AI companies have provided guidance and frameworks to address some of the privacy concerns, but overall this is a grey area and a potential risk many companies are not ready to take. Here is a data privacy report from commonsense.org: https://privacy.commonsense.org/privacy-report/ChatGPT

AI, and specifically Gen AI, brings lots of goodies; the internet is flooded with examples. The standout features that separate Gen AI/LLMs from other AI are model reusability, application flexibility and the low level of ML experience required. A team with no previous AI experience, with some hard work, can deploy a Gen AI MVP app.

Solution

Here is a possible solution which any enterprise team can adopt to get started on its AI journey. It uses open-source AI models and tools. Data, models and code are stored and run on the company's hardware inside its data center, limiting data privacy and compliance issues.

  • Context data collection — data tools like Kafka seed context data into a vector database. This is the company's private knowledge base; for a customer-facing Gen AI app, for example, the vector database is seeded with customer data and the required PII policies are applied.
  • Submit/request prompt — the client submits a request and asks questions via a real-time API/Kafka. The client can be an internal chat app, web app, mobile app or API app.
  • Enriched data — the question plus context data from the knowledge base is sent to an LLM server running locally in the data center. LangChain can help combine the question with context data from the vector database (a sketch of this step follows the list).
  • Model (Gen AI) server — the model server runs the appropriate model; it accepts the request, processes it and sends the response back. There are many open-source models to choose from; Llama 2 is one such model with great potential.
  • Response — the Gen AI response is sent back to the client.
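
As an illustration of the enrichment step, here is a hedged sketch in plain Java: the vector-store lookup is a placeholder, and it assumes a locally hosted Ollama server with a llama2 model pulled and its /api/generate endpoint reachable on localhost. In practice LangChain (or a JVM equivalent) would handle the retrieval and prompt assembly.

```java
// Sketch: enrich the question with private context, then call a local model.
// retrieveContext() is a placeholder for a real vector-database query.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class PrivateRagClient {

    /** Placeholder for a vector-database similarity search (e.g. against Chroma). */
    static List<String> retrieveContext(String question) {
        return List.of("...relevant snippets from the private knowledge base...");
    }

    public static void main(String[] args) throws Exception {
        String question = "What is our refund policy for premium customers?";
        String context = String.join("\n", retrieveContext(question));

        // Enriched prompt = question + private context; it never leaves the data center.
        String prompt = "Answer using only this context:\n" + context
                + "\n\nQuestion: " + question;

        String body = "{\"model\":\"llama2\",\"prompt\":" + jsonString(prompt)
                + ",\"stream\":false}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON containing the model's answer
    }

    /** Naive JSON string escaping, sufficient for this sketch. */
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }
}
```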

Open-Source Ecosystem

The open-source ecosystem around data engineering and AI/ML is vast, and teams have many options to choose from. For the sake of completeness, let's look at some viable options and how they can be integrated into a Gen AI platform.

  • Model — there are many open-source LLMs to choose from. Hugging Face is a model repository, similar to what a Git repository is for code. The team reviews and selects an appropriate LLM, and models can be downloaded to a local server.

  • Model runtime/server — tools like Hugging Face, LangChain and Ollama provide a runtime engine, and API client libraries provide the interface to the outside world. Each model requires different data transformations, and these tools provide good data conversion APIs.

  • Vector database — vector databases such as Chroma DB (or vector stores accessed through LlamaIndex or LangChain) store the company's private data.

  • Data pipes — data pipes like Kafka collect all the relevant data and seed the vector database. Being real-time, Kafka is the better data connection option (a seeding sketch follows this list).

  • Servers — running an LLM server requires enough GPU, CPU, memory and disk capacity. Nvidia GPUs are costly and have order backlogs; for a starter solution, go with the latest entry-level Nvidia Ada L40 GPUs.

  • LangChain — LangChain is the glue that integrates the entire solution. It is a flexible open-source tool; depending on the scenario, the team can use LangChain's capabilities or a complementary tool set: https://www.langchain.com/
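
To make the data-pipe step concrete, here is a hedged sketch of a consumer that seeds the vector database from a Kafka topic. The topic name is assumed, and upsertIntoVectorStore() is a placeholder standing in for a real embedding step and vector-database client such as Chroma.

```java
// Sketch: Kafka consumer feeding a (placeholder) vector store.
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class VectorStoreSeeder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "vector-seeder");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("knowledge-base-docs")); // assumed topic name
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    upsertIntoVectorStore(record.key(), record.value());
                }
            }
        }
    }

    /** Placeholder: embed the document and upsert it into the vector database. */
    static void upsertIntoVectorStore(String docId, String text) {
        System.out.printf("Seeding %s (%d chars) into the vector store%n", docId, text.length());
    }
}
```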

A team with good platform and data engineering skills can deploy a minimum viable AI platform within the company's firewall.

Data Lake Reimagined

· 3 min read
Amit Joshi
Engineer

A data lake is a hub, a “lake”, from which analytics and operational data are sourced. Data is typically separated per domain for modularity and governance control. Data is sourced, cleaned, transformed and stored in the data lake for long periods of time.

The world is real-time and interconnected, a theme described in the AWS re:Invent 2022 keynote with Dr. Werner Vogels. Trying to refit traditional batch, ETL and data warehouse concepts into a new real-time data lake creates more challenges than solutions; a real-time world needs a real-time data lake. In this blog we will reimagine some of the key building blocks of the data lake and create a more robust solution.

A typical, traditional data lake architecture has the following layers and components:

  • Source data — typically consists of one or more data-sourcing tools such as SFTP services (files), data migration tools, streaming data tools, API services and CDC tools.
  • Storage/staging layer — typically an object store like S3 that can hold huge amounts of data.

In this blog we will focus on these first two layers.

Rather than using point solutions, we will use Apache Kafka for sourcing and storing data. It has many advantages over the traditional approach:

  • Kafka Connect and CDC offer over 150 connector types, which can start pulling data with simple configuration.
  • Kafka has rich client APIs with broad library and tool support, and each consumer's offset tracks the data it has consumed.
  • With a “kappa” architecture we only need to build one pipeline for both batch and real-time data processing.
  • Kafka has a commercially available tiered storage option (Confluent Tiered Storage) and the upcoming open-source KIP-405. Data is transparently stored in an S3-like object store, giving effectively unlimited retention (see the topic-creation sketch after this list).
  • Schema Registry provides data structure for producers, consumers and governance tools; domain-specific canonical schemas are stored in the registry.
  • Kafka supports millions of transactions per second and is highly scalable, fault tolerant and DR-ready.
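
As a small illustration of the retention point, the sketch below creates a topic configured to keep data indefinitely; the topic name, partition count and replication factor are illustrative, and whether old segments land in object storage depends on the tiered storage setup.

```java
// Sketch: create a long-retention staging topic with the Admin API.
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateLongRetentionTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // retention.ms = -1 keeps data indefinitely; with tiered storage the
            // older segments live in the object store rather than on broker disks.
            NewTopic topic = new NewTopic("payments-raw", 12, (short) 3)
                    .configs(Map.of("retention.ms", "-1"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```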

A More Kafka-Specific Architecture


Source connectors are used to source data and put it into a Kafka topic. Once data is in the topic, the consumer side can be any of many API consumers, from streaming platforms like Spark and Flink to a sink connector, ksqlDB or Kafka Streams, which clean, transform and keep domain-specific data. The pipeline can evolve over time: we can add new consumers as business requirements evolve, and a new consumer can replay old data from a month or a year back, depending on how far back data is retained on that topic. The sketch below shows one way to replay history.
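Here is a minimal sketch of such a replay, assuming a topic named "payments-raw": the consumer uses offsetsForTimes() to find the offsets closest to a timestamp thirty days back and rewinds each partition to that point.

```java
// Sketch: a new consumer replaying the last 30 days of a topic.
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "new-domain-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            long thirtyDaysAgo = Instant.now().minus(Duration.ofDays(30)).toEpochMilli();

            // Manually assign all partitions of the topic, then rewind 30 days.
            List<TopicPartition> partitions = consumer.partitionsFor("payments-raw").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, thirtyDaysAgo));
            consumer.offsetsForTimes(query).forEach((tp, offsetAndTs) -> {
                if (offsetAndTs != null) consumer.seek(tp, offsetAndTs.offset());
            });

            // From here, poll() returns data starting thirty days back, then keeps tailing.
            while (true) {
                consumer.poll(Duration.ofSeconds(1))
                        .forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
            }
        }
    }
}
```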

Overall this solution simplifies the data lake, making it more cost effective, efficient and capable.


Kafka for Data Migration

· 3 min read
Amit Joshi
Engineer

Data Connections

If you search the internet for data migration tools, Kafka will not feature as a top data migration tool. In this article we will look at how Kafka is an equally capable and more versatile tool for all types of data migration. Let's consider some key migration requirements:

  • Ability to read and write to different types of data sources
  • Transform data — source and target data sources will have different data structures
  • Process millions of transactions within SLA
  • Process historic and real-time data
  • Enterprise features like data governance, reliability and security, to name a few

Ability to read and write to different types of data sources

Kafka Connect is a low-code mechanism for connecting to 150+ data sources. Many connectors are available in the open-source ecosystem, and commercial connectors are available from vendors like Confluent and others. Cloud provider services also have options to connect to Kafka; for example, AWS EventBridge Pipes has a Kafka broker endpoint. Open-source ESBs like Apache Camel have a connect framework that lets Camel components run in a Kafka Connect cluster (Apache Kafka Connectors). I will cover Kafka Connect in detail in a future blog post; the sketch below shows how little is needed to create a connector.
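
As a hedged example, the sketch below registers a JDBC source connector through the Kafka Connect REST API. It assumes a Connect worker on localhost:8083 with the Confluent JDBC source connector plugin installed, and the connection details, table and topic names are illustrative.

```java
// Sketch: create a source connector via the Connect REST API.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateJdbcSourceConnector {
    public static void main(String[] args) throws Exception {
        String connector = """
            {
              "name": "orders-jdbc-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:postgresql://db-host:5432/sales",
                "connection.user": "etl",
                "connection.password": "secret",
                "table.whitelist": "orders",
                "mode": "timestamp+incrementing",
                "timestamp.column.name": "updated_at",
                "incrementing.column.name": "order_id",
                "topic.prefix": "sales-"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connector))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Changing "mode" to "bulk" would turn the same pipeline into a one-time historical load, which is the point made in the scalability section below.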

Data Transformations

Source and target data sources will have different data structures, and Kafka can transform data in a couple of ways. For single message transforms, Kafka Connect has a good transformation layer, and you can even write custom Java/Scala transformations (Kafka Connect transformations). When joining multiple data streams (sources) or aggregating data, Kafka Streams is the option; either Kafka Streams or ksqlDB can be used to transform data (Kafka Streams transformations). Most of the time the source and target data stores are of different types, for example an RDBMS source and an Elasticsearch or MongoDB destination, and transformations help map the data to the target's requirements. A small sketch follows.
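
Here is a minimal Kafka Streams sketch of such a mapping, assuming an input topic of JSON rows from an RDBMS and an output topic shaped for a document store; the topic and field names are illustrative.

```java
// Sketch: reshape RDBMS-style JSON rows into documents for a search store.
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class OrderReshaper {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.<String, String>stream("sales-orders")
               .mapValues(OrderReshaper::toSearchDocument)
               .to("orders-search");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-reshaper");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }

    // Rename and flatten RDBMS columns into the shape the document store expects.
    static String toSearchDocument(String row) {
        try {
            ObjectNode in = (ObjectNode) MAPPER.readTree(row);
            ObjectNode out = MAPPER.createObjectNode();
            out.put("orderId", in.get("order_id").asLong());
            out.put("customer", in.get("customer_name").asText());
            out.put("totalUsd", in.get("total_amount").asDouble());
            return MAPPER.writeValueAsString(out);
        } catch (Exception e) {
            throw new RuntimeException("Bad order record: " + row, e);
        }
    }
}
```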

Scalable

  • Process millions of transactions within SLA — Kafka has such robust capacity that even a small cluster can easily process millions of records in a minute, and a medium cluster of five brokers can process a million transactions per second. With data safely saved in a topic, consumers can process it at their own pace without interfering with the producer.
  • Process historic and real-time data — a data migration needs historic point-in-time data and, from that day onward, near real-time data. A single Kafka pipeline can handle both cases by changing one line of source connector configuration: the database connector has a “bulk” mode for one-time transfers and a “timestamp/incrementing” mode for real-time transfers (see the connector sketch above). Database CDC connectors have a similar feature.
  • Enterprise features like data governance, reliability and security — most security, reliability and data governance needs are met by the core Kafka platform, and additional features can be added with external tools. For example, calendar-based scheduling is not built into Kafka Connect, but an external scheduler can start and stop connectors in a scheduled window. Kafka's open architecture keeps things flexible.

Kafka for a Multi Data Center (DC) World

· 3 min read
Amit Joshi
Engineer

Multi-DC World

Today multi-DC is a real scenario: most companies run products and services in more than one data center, with workloads distributed across multiple self-managed DCs or clouds (SaaS, IaaS or PaaS). A universal data bus (data pipe) is needed for seamless data exchange and integration, and Apache Kafka is one of the top tools for this job. The world's leading cloud providers, applications, tools and technologies can connect to Kafka, so it makes sense to use a ubiquitous, scalable, open-source technology rather than point solutions. A typical medium or large enterprise has the following application and service deployment.

Multi DC Deployment

A company will have a combination of self-managed DCs and different cloud providers in different geographies. The self-managed DCs will have traditional RDBMS databases (Oracle, PostgreSQL, MySQL, MS SQL and DB2) as well as newer technologies like MongoDB, Elastic, Splunk, Cassandra and more; messaging such as JMS, MQ, RabbitMQ, Tibco and even Kafka topics; and Java/Node/Go/Python based microservices which produce and consume data (a minimal producer sketch follows). Add cloud services like AWS (EventBridge, Kinesis, DynamoDB, Aurora, Lambda, CloudWatch and more) or Azure (Events, Functions, databases), PaaS offerings like Splunk, Mongo Atlas or Kafka PaaS, and SaaS vendors like Salesforce, Marketo and many others.
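
As a minimal illustration, a microservice in any of these locations needs only a few lines to publish onto the shared data bus; the topic name and bootstrap address below are illustrative.

```java
// Sketch: a microservice publishing an event to the Kafka data bus.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class CustomerEventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.dc1.example.com:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for full replication

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("customer-events",
                    "customer-42", "{\"event\":\"ADDRESS_UPDATED\"}"));
            producer.flush();
        }
    }
}
```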

Cloud services, popular technologies and databases have Kafka connectors; with basic configuration, Connect can produce data to and consume data from a Kafka topic. Kafka also provides the Connect framework for writing your own Java/Scala connectors, with a predictable lifecycle and microservice-like container/K8s deployment options. Plugging in a new application and starting to exchange data is a matter of days, not months. Once connected, the Kafka cluster becomes a reliable, secure, scalable and future-proof universal data transfer layer. Let's explore some of its top attributes.

  • Scalable — transfer a few real-time events or process millions of transactions per second; store data for weeks or years.
  • Reliable — various qualities of service (QoS) are supported, such as exactly-once; even if consumers are down for an extended period, data catch-up is automatic.
  • Secure — protected APIs, TLS-encrypted channels, the ability to encrypt/decrypt specific fields or drop or tokenize certain data attributes, and encrypted disks.
  • Governance — configurable and pluggable data governance; Schema Registry for design-time governance; industry and solution governance such as PCI and PII requirements can be implemented.
  • Decoupled — isolated data exchange; changes on one side do not impact the other source or target side.
  • Data transformation — source and target data schemas are never the same; Kafka helps reformat data for the consumer, and only the data that is required is passed along.

Given this value, a Kafka-based solution should be one of an enterprise's top choices. In upcoming articles we will explore specific data integration use cases and how Kafka adds enormous value.