Data Privacy, Generative AI, and a Possible Solution
Generative AI and Data Privacy
Generative AI tools like ChatGPT are trained on public data sets and answer questions based on that public data, much like a Google search. The challenge is how to use Gen AI with private data. Gen AI companies have published guidance and frameworks to address some of the privacy concerns, but overall this remains a grey area, a potential risk many companies are not ready to take. Here is a data privacy report from commonsense.org: https://privacy.commonsense.org/privacy-report/ChatGPT
AI, and specifically Gen AI, brings a lot of goodies; the internet is flooded with examples. The standout features that separate Gen AI/LLMs from other AI are model reusability, application flexibility, and the low bar of ML experience required. A team with no previous AI experience, and some hard work, can deploy a Gen AI MVP app.
Solution
Here is a possible solution that any enterprise team can adopt to get started on its AI journey. It uses open-source AI models and tools. Data, models, and code are stored and run on the company's own hardware inside its data center, limiting data privacy and compliance issues.
- Context data collection — Data tools like Kafka seed context data into a vector database. This is the company's private knowledge base; for a customer-facing Gen AI app, for example, the vector database is seeded with customer data, with the required PII policies applied.
- Submit/Request prompt — The client submits a request or asks a question via a real-time API/Kafka. The client can be an internal chat app, a web app, a mobile app, or an API client.
- Enriched data — The question plus context data from the knowledge base is sent to the LLM server running locally in the data center. LangChain can help combine the question with context data retrieved from the vector database.
- Model (Gen AI) server — The model server runs the appropriate model; it accepts the request, processes it, and sends the response back. There are many open-source models to choose from; Llama 2 is one such model with great potential.
- Response — The Gen AI response is sent back to the client.
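To make the flow concrete, here is a minimal, self-contained Python sketch of the steps above. The bag-of-words "embedding", the in-memory store, and the stubbed `call_local_llm` function are illustrative stand-ins, not production components; a real deployment would use a learned embedding model, a vector database such as Chroma, and a locally hosted LLM such as Llama 2.

```python
# Toy sketch of the request flow: seed context, retrieve it for a question,
# build an enriched prompt, and call a (stubbed) local model server.
import math
from collections import Counter

def embed(text):
    """Bag-of-words vector; a real system would use a learned embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    """In-memory stand-in for a vector database such as Chroma."""
    def __init__(self):
        self.docs = []

    def seed(self, docs):                 # step 1: context data collection
        self.docs.extend((embed(d), d) for d in docs)

    def query(self, question, k=1):       # retrieve most similar context
        q = embed(question)
        ranked = sorted(self.docs, key=lambda p: cosine(q, p[0]), reverse=True)
        return [d for _, d in ranked[:k]]

def call_local_llm(prompt):
    """Stand-in for the model server (e.g. Llama 2 behind a local API)."""
    return f"[model answer grounded in: {prompt.splitlines()[0]}]"

def answer(store, question):
    context = store.query(question)                            # step 3
    prompt = "\n".join(context) + f"\nQuestion: {question}"    # enriched prompt
    return call_local_llm(prompt)                              # steps 4-5

store = VectorStore()
store.seed(["Customer Alice has a premium support plan.",
            "Order 1042 shipped on May 3."])
print(answer(store, "What support plan does customer Alice have?"))
```

Swapping the stubs for real components (an embedding model, Chroma, and a local LLM endpoint) keeps the same shape: seed, retrieve, enrich, generate, respond.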
Open-Source Ecosystem
The open-source ecosystem around data engineering and AI/ML is vast, and a team has many options to choose from. For the sake of completeness, let's look at some viable options and how they can be integrated to work as a Gen AI platform.
- Model — There are many open-source LLMs to choose from. Hugging Face is a model repository, similar to what Git is for code. The team reviews and selects an appropriate LLM, and models can be downloaded to a local server.
- Model runtime/server — Tools like Hugging Face, LangChain, and Ollama provide a runtime engine, and their API client libraries expose an interface to the outside world. Each model requires different data transformations, and these tools provide great data-conversion APIs.
- Vector database — Vector databases like Chroma store the company's private data as embeddings; frameworks such as LlamaIndex and LangChain integrate with them for retrieval.
- Data pipes — Data pipes like Kafka collect all relevant data and seed the vector database. Being a real-time data pipe, Kafka is a better data-connection option.
- Servers — Running an LLM server requires enough GPU, CPU, memory, and disk capacity. Nvidia GPUs are costly and have order backlogs; for a starter solution, go with the entry-level Nvidia Ada-generation L40 GPUs.
- LangChain — LangChain is the glue that integrates the entire solution. It is a flexible open-source tool; depending on the scenario, the team can use LangChain's capabilities or a complementary tool set. https://www.langchain.com/
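The data-pipes step above lends itself to a short sketch: records flow from a stream through a redaction step before they reach the vector database, so PII never enters the knowledge base. The stream is stubbed as a list (in practice a Kafka consumer), and the email regex and `seed_fn` hook are illustrative assumptions, not a complete PII policy.

```python
# Minimal sketch of a data pipe with a PII redaction step before seeding.
import re

# Simple illustrative rule: mask email addresses. A real PII policy would
# cover far more (names, phone numbers, account IDs, etc.).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(record):
    """Apply the PII policy before the text enters the knowledge base."""
    return EMAIL.sub("[email-redacted]", record)

def consume(stream, seed_fn):
    """Drain the stream, redact each record, and hand it to the seeder."""
    for record in stream:
        seed_fn(redact(record))

# Stand-in for a Kafka topic and a vector-database seeder.
seeded = []
consume(["Ticket 77 opened by alice@example.com about billing.",
         "Ticket 78: shipping delay for order 1042."],
        seeded.append)
print(seeded[0])  # "Ticket 77 opened by [email-redacted] about billing."
```

In a real pipeline, `consume` would wrap a Kafka consumer loop and `seed_fn` would write embeddings into the vector database.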
A team with good platform and data engineering skills can deploy a minimum viable AI platform within the company's firewall.