About KATIM
KATIM is a leader in the development of innovative secure communication products and solutions for governments and businesses. As part of the Space & Cyber Technologies cluster at EDGE, one of the world’s leading advanced technology groups, KATIM delivers trust in a world where cyber risks are a constant threat, and fulfils the increasing demand for advanced cyber capabilities by delivering robust, secure, end-to-end solutions centered on four core business units: Networks, Ultra Secure Mobile Devices, Applications, and Satellite Communications. Our talented team of cross-functional experts continually takes on new challenges. We work with the energy of a start-up yet the discipline of a large business to make solutions and products work for our customers at scale.
Job Purpose (Specific to This Role)
The Principal MLOps Engineer / AI Infrastructure Architect defines and leads the design, automation, and governance of KATIM’s AI infrastructure powering mission-critical, secure communications products. The role drives end-to-end MLOps strategy, from model governance and deployment automation to compliance enforcement, ensuring every AI capability adheres to zero-trust and sovereign-data principles. It bridges applied machine learning, software engineering, and DevSecOps, ensuring that AI models, pipelines, and applications are deployed, monitored, and scaled securely across environments.
You will operate in an AI-augmented product development environment, where infrastructure provisioning, model training, deployment, and monitoring are automated or AI-assisted. Your mission is to build systems that empower product teams to deliver production-grade AI capabilities with the speed, reliability, and compliance standards required by government and enterprise clients.
You will collaborate closely with cross-functional teams across platform engineering, product security, and product management to design MLOps pipelines that embody resilience, traceability, and zero-trust security principles. This role is ideal for engineers passionate about building intelligent automation frameworks, scalable AI pipelines, and secure data infrastructures that form the foundation of mission-critical AI systems.
AI-Augmented Product Development Model (Context for the Role)
We are a lean, AI-augmented product development team delivering production-ready, enterprise-grade secure applications with the quality and velocity that would typically require an organization 3–4x our size. Every role in the traditional SDLC is either AI-augmented or automated, allowing teams to focus on design, architecture, and high-impact decision-making.
Core Principles
- Security is integrated into every decision, from architecture to deployment.
- Repetitive tasks are automated; human effort is focused on strategy and problem-solving.
- Quality is measurable, enforced, and automated at every stage.
- All system behaviors—including AI-assisted outputs—must be traceable, reviewable, and explainable. We do not ship “black box” functionality.
- Design and implementation decisions prioritize resilience, maintainability, and interoperability in real customer environments.
Key Responsibilities
AI MLOps Architecture & Governance (30%)
- Define the MLOps architecture and governance framework across products
- Design secure, scalable AI platform blueprints covering the data, training, serving, and monitoring layers
- Standardize model registries, artifact signing, and deployment processes for air-gapped and on-premises environments
- Lead architectural designs and reviews for AI pipelines
- Design and maintain LLM inference infrastructure
- Manage model registries and versioning (MLflow, Weights & Biases); see the registry sketch after this list
- Implement model serving solutions (TensorFlow Serving, TorchServe, vLLM)
- Optimize model performance and cost (quantization, caching, batching)
- Build and maintain vector databases (Pinecone, Weaviate, Chroma)
- Maintain working awareness of hardware and inference optimization (GPUs, accelerators, serving runtimes)
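To give a flavor of the registry and promotion work above, here is a minimal sketch using MLflow’s Python API. The tracking URI and model name are hypothetical, the trained model is a tiny stand-in, and an air-gapped deployment would add artifact signing and review before any stage transition.

```python
# Minimal sketch: register a trained model and gate its promotion via the
# MLflow model registry. Names and URIs are placeholders, not KATIM systems.
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical endpoint

# Train a tiny stand-in model so the example is self-contained
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = LogisticRegression().fit(X, y)

with mlflow.start_run() as run:
    mlflow.log_metric("eval_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # serialize under this run

# Register the run artifact as a new version of a named model
version = mlflow.register_model(f"runs:/{run.info.run_id}/model", "threat-classifier")

# Registry stages are one common gating mechanism; an air-gapped flow would
# require review and artifact signing before this transition
MlflowClient().transition_model_version_stage(
    name="threat-classifier", version=version.version, stage="Staging"
)
```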
Agent & Tool Development (25%)
- Develop and maintain AI agents for automation (e.g., automated test generation, log intelligence, anomaly detection)
- Build AI-assisted DevSecOps utilities that automatically enforce compliance, logging, and audit policies
- Build tool integrations for LLM agents (function calling, APIs)
- Implement retrieval-augmented generation (RAG) pipelines; a minimal sketch follows this list
- Create prompt management and versioning systems
- Monitor and optimize agent performance
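As one possible shape for the RAG bullet above, a minimal sketch against Chroma’s in-memory client (one of the vector stores named in this posting); `call_llm` is a placeholder for whichever serving endpoint the pipeline targets.

```python
# Hedged RAG sketch: index a few documents in Chroma, retrieve the nearest
# ones for a question, and ground the LLM prompt in that context.
import chromadb

client = chromadb.Client()  # in-memory; production would use a persistent server
collection = client.create_collection("runbooks")

# Chroma applies its default embedding function when none is supplied
collection.add(
    documents=[
        "Model artifacts must be signed before promotion to production.",
        "Air-gapped deployments sync registries via approved offline media.",
    ],
    ids=["doc-1", "doc-2"],
)

def call_llm(prompt: str) -> str:
    """Placeholder: swap in vLLM, Ollama, or a hosted API client here."""
    return f"[model response to {len(prompt)} chars of prompt]"

def answer(question: str) -> str:
    hits = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(hits["documents"][0])  # top-ranked documents
    return call_llm(f"Answer using only this context:\n{context}\n\nQ: {question}")

print(answer("What must happen before a model reaches production?"))
```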
CI/CT/CD Pipelines (20%)
- Build continuous integration pipelines for models and code
- Implement continuous training (CT) workflows
- Automate model deployment with rollback capabilities (sketched below)
- Create staging and production deployment strategies
- Integrate AI-assisted code review into CI/CD
- Build a continuous evaluation loop for deployed models
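To make the rollback and continuous-evaluation bullets concrete, a toy sketch of evaluation-gated promotion; every function here is a stub for the actual registry and serving stack, and the tolerance threshold is illustrative.

```python
# Toy sketch of evaluation-gated deployment with rollback. The eval and
# deploy functions are stubs standing in for a real registry/serving stack.
import random

def evaluate(version: str) -> float:
    """Stub: a real loop replays a held-out eval set against the endpoint."""
    return random.uniform(0.85, 0.95)

def deploy(version: str, stage: str) -> None:
    print(f"deploying {version} to {stage}")

def deploy_with_rollback(candidate: str, current: str, tolerance: float = 0.01) -> str:
    deploy(candidate, "staging")
    # Promote only if the candidate scores at least as well as the incumbent
    if evaluate(candidate) >= evaluate(current) - tolerance:
        deploy(candidate, "production")
        return candidate
    print(f"rolling back: {current} keeps serving traffic")
    return current

serving = deploy_with_rollback("model:v42", "model:v41")
```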
Infrastructure & Automation (15%)
- Manage cloud infrastructure (Kubernetes, serverless)
- Implement Infrastructure as Code (Terraform, Pulumi)
- Build monitoring and observability systems (Prometheus, Grafana, DataDog); see the instrumentation sketch after this list
- Automate operational tasks with AI agents
- Ensure security and compliance (OWASP, SOC 2), including AI-specific security controls
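A minimal sketch of the observability bullet, assuming the standard `prometheus_client` library: the service exposes a request counter and a latency histogram for Prometheus to scrape; the port and metric names are illustrative.

```python
# Hedged sketch: expose request-count and latency metrics from an inference
# service using prometheus_client. Port and metric names are illustrative.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Total inference requests")
LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds")

@LATENCY.time()  # records each call's duration into the histogram
def infer(prompt: str) -> str:
    REQUESTS.inc()
    time.sleep(0.05)  # stand-in for a real model call
    return "ok"

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes http://host:9100/metrics
    while True:
        infer("healthcheck")
        time.sleep(1)
```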
Developer Enablement (10%)
- Provide tools and libraries for engineers to adopt AI-augmented workflows securely
- Document AI/ML best practices and patterns
- Conduct training on MLOps tools and workflows
- Support engineers with AI integration challenges
- Maintain development environment parity
- Champion AI privacy, governance, and compliance practices across teams
Education and Minimum Qualifications
- BS in Computer Science, Computer Engineering, Mathematics, or a related field, or equivalent industry experience; Master's degree preferred
- 8+ years in DevOps, SRE, or platform engineering
- 5+ years hands-on experience with ML/AI systems in production
- Deep understanding of LLMs and their operational requirements
- Experience building and maintaining CI/CD pipelines
- Strong Linux/Unix systems knowledge
- Cloud platform expertise (AWS, GCP, or Azure)
- Experience with container orchestration (Kubernetes)
Key Skills
MLOps & AI:
- LLM Integration: OpenAI API, Anthropic API, HuggingFace, Azure OpenAI
- Model Serving: TensorFlow Serving, TorchServe, vLLM, Ollama
- Experiment Tracking: MLflow, Weights & Biases, Neptune.ai
- Model Registries: MLflow, Kubeflow, AWS SageMaker
- Vector Databases: Pinecone, Weaviate, Chroma, Milvus
- Agent Frameworks: LangChain, LlamaIndex, AutoGPT, Semantic Kernel
- Fine-tuning: LoRA, QLoRA, prompt tuning (see the PEFT sketch after this list)
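As a flavor of the fine-tuning line above, a minimal sketch attaching LoRA adapters with Hugging Face's PEFT library; GPT-2 serves only as a small stand-in base model, and the rank and alpha values are illustrative.

```python
# Hedged sketch: wrap a small causal LM with LoRA adapters via PEFT.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in model

config = LoraConfig(
    r=8,                        # adapter rank (illustrative)
    lora_alpha=16,              # scaling factor
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically <1% of weights are trainable
```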
Data Engineering:
- Pipelines: Airflow, Prefect, Dagster
- Processing: Spark, Dask, Ray
- Streaming: Kafka, Pulsar, Kinesis
- Data Quality: Great Expectations, dbt
- Feature Stores: Feast, Tecton
DevOps & Infrastructure:
- Containers: Docker, Kubernetes, Helm
- Cloud Platforms: AWS (SageMaker, Lambda, ECS), GCP (Vertex AI, Cloud Run), or Azure (ML Studio)
- IaC: Terraform, Pulumi, CloudFormation
- CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
- Orchestration: Kubernetes operators, Kubeflow
Monitoring & Observability:
- Metrics: Prometheus, Grafana, CloudWatch
- Logging: ELK Stack, Loki, CloudWatch Logs
- Tracing: Jaeger, Zipkin, OpenTelemetry
- Alerting: PagerDuty, Opsgenie
- Model Monitoring: Arize, Fiddler, Evidently
Programming:
- Python: Primary language for ML/AI
- Libraries: NumPy, Pandas, PyTorch/TensorFlow, scikit-learn
- Serving APIs: FastAPI, Flask
- Go: For high-performance services and tooling
- Shell Scripting: Bash, Python for automation
- SQL: Advanced queries, optimization
AI-Assisted Operations:
- Autonomous agents for incident response
- AI-powered log analysis and anomaly detection (a toy sketch follows this list)
- Automated root cause analysis
- Intelligent alerting and noise reduction
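To illustrate the log-anomaly line above at its simplest, a toy z-score detector over per-minute error counts; production systems would use richer features and learned models, and the window and threshold here are arbitrary.

```python
# Toy anomaly detector: flag minutes whose error counts deviate strongly
# from the rolling mean. Threshold and window are illustrative only.
import statistics

def anomalies(counts: list[int], window: int = 10, z_thresh: float = 3.0) -> list[int]:
    flagged = []
    for i in range(window, len(counts)):
        hist = counts[i - window:i]
        mu, sigma = statistics.mean(hist), statistics.pstdev(hist)
        if sigma > 0 and (counts[i] - mu) / sigma > z_thresh:
            flagged.append(i)  # minute index with an unusual error spike
    return flagged

per_minute_errors = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3, 40, 3, 2]
print(anomalies(per_minute_errors))  # -> [10], the spike at minute 10
```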
Other Highly Desirable Skills:
- Experience with LLM fine-tuning and deployment at scale
- Background in data engineering or ML engineering
- Startup or high-growth environment experience
- Security certifications (CISSP, AWS Security)
- Contributions to open source MLOps projects
- Experience with multi-cloud or hybrid cloud
- Prior software engineering experience
Success Metrics
- Uptime: 99.9%+ availability for AI services
- Deployment Frequency: daily or on-demand deployments
- Model Performance: latency (p95 < 500 ms, checked in the sketch below), accuracy tracking
- Cost Efficiency: cost per inference, infrastructure utilization
- Developer Velocity: time to deploy new models, AI feature adoption rate
- Incident Response: MTTD (Mean Time to Detect), MTTR (Mean Time to Resolve)
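As a small illustration of the p95 latency target, an offline check over recorded latency samples; in production this figure would typically come from a Prometheus `histogram_quantile` query rather than raw samples.

```python
# Quick offline check of the p95 < 500 ms target from raw latency samples.
import random
import statistics

latencies_ms = [random.gauss(220, 60) for _ in range(10_000)]  # synthetic data

p95 = statistics.quantiles(latencies_ms, n=100)[94]  # 95th percentile cut point
print(f"p95 = {p95:.0f} ms -> {'OK' if p95 < 500 else 'violation'}")
```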
#KATIM