Infrastructure as Code for AI: Cluster Configuration for Heavy Workloads
Those deploying Large Language Models (LLMs) or complex deep learning pipelines in production …
TL;DR
Artificial Intelligence (AI) is the new standard, but using cloud APIs like OpenAI (ChatGPT) or Anthropic comes with a significant catch: data privacy and “data gravity.” Sending sensitive company data, source code, or customer information to US servers is often a GDPR nightmare and a strategic risk. Ollama changes the game. It is an extremely lightweight engine to run powerful open-source models (like Meta’s Llama 3, Mistral, or Gemma) directly in your own cluster. By using Ollama, you get the full power of generative AI—without a single byte leaving your network.
When developers today incorporate AI features into applications, they usually use external APIs. This means every prompt, every context, and every uploaded PDF file leaves your infrastructure, travels through the internet, and is processed on a third-party server.
Ollama brings the brain to the data, not the data to the brain.
The biggest hurdle in switching to local AI has been the application code: many apps are tightly integrated with OpenAI SDKs (e.g., in Python or Node.js).
Ollama elegantly solves this problem.
Its API is OpenAI-compatible: you simply change the BASE_URL in your code from https://api.openai.com/v1 to the internal address of your Ollama service (e.g., http://ollama.ai-namespace.svc.cluster.local:11434/v1). The application “thinks” it’s communicating with ChatGPT, but in reality it’s interacting with your private Llama 3 model.

The open-source AI world is evolving rapidly. Today Model A is the best; tomorrow it’s Model B. With SaaS providers, you are bound to their model cycles.
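To make this concrete, here is a minimal sketch using only the Python standard library. The in-cluster service URL and the model name are assumptions for illustration; the JSON shape is the standard OpenAI chat-completions format, which is why swapping the base URL is all the application-side change you need.

```python
# Sketch: talking to an in-cluster Ollama through its OpenAI-compatible
# /v1 endpoint, using only the standard library. Service URL and model
# name are illustrative assumptions -- adjust to your cluster.
import json
import urllib.request

# The only thing that changes versus the cloud API is this URL:
# OPENAI_BASE_URL = "https://api.openai.com/v1"
OLLAMA_BASE_URL = "http://ollama.ai-namespace.svc.cluster.local:11434/v1"


def build_chat_request(base_url: str, model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (the same JSON
    schema the official SDKs send under the hood)."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3") -> str:
    """Send a prompt to the in-cluster model and return the answer text.
    Requires a running Ollama instance reachable at OLLAMA_BASE_URL."""
    req = build_chat_request(
        OLLAMA_BASE_URL, model, [{"role": "user", "content": prompt}]
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Because the request format is identical, an app built against the OpenAI SDK keeps its prompt logic unchanged; only the endpoint moves inside your network.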
A single command (ollama run mistral) is enough, and the new model is ready to use. You can deploy specialized, smaller models for coding, translation, or text summarization.

This is the point where AI either becomes an unpredictable ongoing expense or a scalable infrastructure asset for you.
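The “specialized model per task” idea can be sketched as a simple routing table. The model names below are illustrative examples, not recommendations; swap in whatever you have pulled into your Ollama instance.

```python
# Sketch: route each task type to a specialized local model.
# Model names are illustrative -- use whatever `ollama run <model>`
# has made available in your cluster.
TASK_MODELS = {
    "coding": "codellama",       # code generation and review
    "translation": "llama3",     # general multilingual model
    "summarization": "mistral",  # fast, compact summarizer
}


def pick_model(task: str, default: str = "llama3") -> str:
    """Return the model name to use for a given task type,
    falling back to a general-purpose default."""
    return TASK_MODELS.get(task, default)
```

Swapping a model then means changing one entry in this table, not rewriting application code.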
Scenario A: OpenAI API (The Token Cost Trap)
Cloud APIs are convenient for prototypes, but pay-per-token billing makes costs hard to predict once usage scales.
Scenario B: Ollama with Managed Kubernetes from ayedo
In the ayedo App Catalog, Ollama is provided as a robust microservice.
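As an illustration of what such a deployment looks like, here is a minimal sketch of an Ollama Deployment and Service whose in-cluster DNS name matches the URL used earlier. This is a simplified example only; the actual ayedo App Catalog manifests will differ.

```yaml
# Illustrative sketch -- not the actual ayedo App Catalog manifest.
# Resolves in-cluster as http://ollama.ai-namespace.svc.cluster.local:11434
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434   # Ollama's default API port
          volumeMounts:
            - name: models
              mountPath: /root/.ollama   # model cache directory
      volumes:
        - name: models
          emptyDir: {}   # use a PersistentVolumeClaim in production
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai-namespace
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
```

For GPU nodes, you would additionally request `nvidia.com/gpu` resources on the container and schedule onto the appropriate node pool.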
| Aspect | Cloud AI (OpenAI / Anthropic) | ayedo (Managed Ollama) |
|---|---|---|
| Costs | Pay-per-Token (Unpredictable) | Infrastructure (Flat Rate) |
| Data Privacy / GDPR | High Risk (US Servers) | 100% Secure (In-Cluster) |
| Model Selection | Vendor-Specific (Closed Source) | Free Choice (Llama 3, Mistral, etc.) |
| App Integration | Proprietary SDKs | OpenAI API Compatible |
| Dependency | High (Vendor Lock-in) | None (Open Source) |
| Internet Requirement | Yes (Always-on) | No (Air-Gapped Possible) |
Is an open-source model as good as GPT-4?
For general, highly complex reasoning tasks, GPT-4 (or Claude 3.5) often still has a slight edge. But for 90% of business use cases (summarizing texts, classifying support tickets, extracting data from JSON, RAG queries on internal documents), models like Llama 3 (8B or 70B) or Mistral are fully on par, and much faster and cheaper.
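The “RAG queries on internal documents” use case can be sketched in a few lines: naive keyword retrieval plus prompt assembly. A real deployment would use embeddings and a vector store; the function names here are hypothetical and only illustrate the flow.

```python
# Sketch: minimal RAG flow over internal documents.
# Real systems use embeddings + a vector store; this is keyword overlap.
import re


def _tokens(text: str) -> set:
    """Lowercased word tokens with punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, documents: list, top_k: int = 2) -> list:
    """Rank documents by how many words they share with the query."""
    q = _tokens(query)
    ranked = sorted(documents, key=lambda d: len(q & _tokens(d)), reverse=True)
    return ranked[:top_k]


def build_rag_prompt(query: str, documents: list) -> str:
    """Assemble a context-grounded prompt for a local model (e.g. Llama 3)."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

The assembled prompt is then sent to the in-cluster model; no document ever leaves your network.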
Do I necessarily need expensive GPUs (graphics cards)?
Not necessarily, but it is highly recommended. Ollama can run smaller models (like Llama 3 8B) entirely on the CPU, which is sufficient for simple background jobs (roughly 5-10 tokens per second). For interactive chat applications where users expect real-time answers, nodes with NVIDIA GPUs (e.g., T4 or A10) are the standard in the ayedo cluster to guarantee fast inference.
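Some quick back-of-the-envelope math shows why CPU throughput is fine for batch jobs but too slow for chat. The GPU throughput figure below is an illustrative assumption, not a benchmark.

```python
# Back-of-the-envelope: time to generate a complete answer at a given
# token rate. The GPU rate is an illustrative assumption.
def response_time_s(answer_tokens: int, tokens_per_second: float) -> float:
    """Seconds until an answer of the given length is fully generated."""
    return answer_tokens / tokens_per_second


# A typical ~200-token answer:
cpu_slow = response_time_s(200, 5)    # 40.0 s at 5 tok/s (CPU, lower bound)
cpu_fast = response_time_s(200, 10)   # 20.0 s at 10 tok/s (CPU, upper bound)
gpu = response_time_s(200, 80)        # 2.5 s at an assumed 80 tok/s (GPU)
```

Tens of seconds per answer is acceptable for overnight classification jobs, but not for a user waiting in a chat window, which is why GPU nodes are the default for interactive workloads.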
How large are the models on the hard drive?
Surprisingly small thanks to quantization (compression). A very capable 8-billion-parameter model often requires only 4 to 5 gigabytes of storage space. An extremely powerful 70B model is about 40 gigabytes.
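Those figures follow directly from the quantization arithmetic: size ≈ parameters × bits per weight / 8. A sketch, assuming roughly 4.5 effective bits per weight for a typical 4-bit quantization (real files add some overhead for the tokenizer and higher-precision layers):

```python
# Rough on-disk size estimate for a quantized model:
# size_bytes ≈ parameter_count * bits_per_weight / 8.
# 4.5 bits/weight is an assumed average for a typical 4-bit quantization.
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in (decimal) gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


size_8b = model_size_gb(8, 4.5)    # 4.5 GB -> matches the "4-5 GB" figure
size_70b = model_size_gb(70, 4.5)  # 39.375 GB -> roughly the "40 GB" figure
```

In other words, a quantized 8B model fits comfortably on any node, while a 70B model mainly demands enough GPU memory or RAM to load it.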
Is there also a chat interface for employees?
Ollama itself is just the “engine” (the API). In the ayedo stack, we often combine Ollama with frontends like Open WebUI. This gives your employees an interface that looks and feels exactly like ChatGPT—only all data remains securely on your servers.