Job Description

About KATIM

KATIM is a leader in the development of innovative secure communication products and solutions for governments and businesses. As part of the Space & Cyber Technologies cluster at EDGE, one of the world’s leading advanced technology groups, KATIM delivers trust in a world where cyber risks are a constant threat, and fulfils the increasing demand for advanced cyber capabilities by delivering robust, secure, end-to-end solutions centered on four core business units: Networks, Ultra Secure Mobile Devices, Applications, and Satellite Communications. Our talented team of cross-functional experts continually takes on new challenges. We work with the energy of a start-up and the discipline of a large business to make solutions and products work for our customers at scale.

Job Purpose (specific To This Role)

The Principal MLOps Engineer / AI Infrastructure Architect defines and leads the design, automation, and governance of the AI infrastructure powering KATIM’s mission-critical secure communications products. The role drives end-to-end MLOps strategy, from model governance and deployment automation to compliance enforcement, ensuring every AI capability adheres to zero-trust and sovereign-data principles. It bridges applied machine learning, software engineering, and DevSecOps, ensuring that AI models, pipelines, and applications are deployed, monitored, and scaled securely across environments.

You will operate in an AI-augmented product development environment, where infrastructure provisioning, model training, deployment, and monitoring are automated or AI-assisted. Your mission is to build systems that empower product teams to deliver production-grade AI capabilities with the speed, reliability, and compliance standards required by government and enterprise clients.

You will collaborate closely with cross-functional teams across platform engineering, product security, and product management to design MLOps pipelines that embody resilience, traceability, and zero-trust security principles. This role is ideal for engineers passionate about building intelligent automation frameworks, scalable AI pipelines, and secure data infrastructures that form the foundation of mission-critical AI systems.

AI-Augmented Product Development Model (Context for the Role)

We are a lean, AI-augmented product development team delivering production-ready, enterprise-grade secure applications with the quality and velocity that would typically require an organization 3–4x our size. Every role in the traditional SDLC is either AI-augmented or automated, allowing teams to focus on design, architecture, and high-impact decision-making.

Core Principles

  • Security is integrated into every decision, from architecture to deployment.
  • Repetitive tasks are automated; human effort is focused on strategy and problem-solving.
  • Quality is measurable, enforced, and automated at every stage.
  • All system behaviors—including AI-assisted outputs—must be traceable, reviewable, and explainable. We do not ship “black box” functionality.
  • Design and implementation decisions prioritize resilience, maintainability, and interoperability in real customer environments.

Key Responsibilities

AI MLOps Architecture & Governance (30%)

  • Define the MLOps architecture and governance framework across products.
  • Design secure, scalable AI platform blueprints covering data, training, serving and monitoring layers.
  • Standardize model registries, artifact signing, and deployment processes for air-gapped and on-prem environments.
  • Lead architectural designs and reviews for AI pipelines.
  • Design and maintain LLM inference infrastructure.
  • Manage model registries and versioning (MLflow, Weights & Biases).
  • Implement model serving solutions (TensorFlow Serving, TorchServe, vLLM).
  • Optimize model performance and cost (quantization, caching, batching).
  • Build and maintain vector databases (Pinecone, Weaviate, Chroma).
  • Maintain awareness of hardware and inference optimization techniques.
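To give a flavor of the artifact-signing work above, here is a minimal sketch of a digest-based manifest flow for air-gapped deployment. All function names are illustrative, not an existing KATIM API, and a production system would sign the manifest with an offline key rather than rely on hashes alone:

```python
import hashlib
import json
from pathlib import Path

def sign_artifact(path: Path) -> dict:
    # Digest one model artifact; a real system would also attach a
    # cryptographic signature over this digest.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"artifact": path.name, "sha256": digest}

def build_manifest(artifacts: list[Path]) -> str:
    # The manifest ships alongside the model bundle into the
    # air-gapped environment.
    entries = [sign_artifact(p) for p in artifacts]
    return json.dumps({"artifacts": entries}, indent=2, sort_keys=True)

def verify_artifact(path: Path, manifest: str) -> bool:
    # On the receiving side: re-hash and compare before loading the model.
    entries = {e["artifact"]: e["sha256"]
               for e in json.loads(manifest)["artifacts"]}
    expected = entries.get(path.name)
    return expected == hashlib.sha256(path.read_bytes()).hexdigest()
```

In practice the digest would be signed (e.g., with GPG or Sigstore) so the air-gapped side can verify provenance as well as integrity.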

Agent & Tool Development (25%)

  • Develop and maintain AI agents for automation (e.g., automated test generation, log intelligence, and anomaly detection).
  • Build AI-assisted DevSecOps utilities to automatically enforce compliance, logging, and audit policies.
  • Build tool integrations for LLM agents (function calling, APIs).
  • Implement retrieval-augmented generation (RAG) pipelines.
  • Create prompt management and versioning systems.
  • Monitor and optimize agent performance.
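As a toy illustration of the RAG pipelines named above, the sketch below retrieves the best-matching documents and assembles a grounded prompt. The bag-of-words cosine scoring is a stand-in for a real embedding model and vector database; every name here is hypothetical:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. A real pipeline would call an
    # embedding model and store vectors in a vector database.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query; keep the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    # Ground the LLM answer in the retrieved context only.
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The same retrieve-then-assemble shape holds when the toy pieces are swapped for an embedding API and a store such as Weaviate or Chroma.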

CI/CT/CD Pipelines (20%)

  • Build continuous integration pipelines for models and code
  • Implement continuous training (CT) workflows
  • Automate model deployment with rollback capabilities
  • Create staging and production deployment strategies
  • Integrate AI-assisted code review into CI/CD
  • Build a continuous evaluation loop
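The rollback-capable deployment step above can be sketched as a simple metric gate. This is an illustrative policy, not a specific CI plugin; the metric names and thresholds are assumptions:

```python
def should_promote(candidate: dict, baseline: dict,
                   max_regression: float = 0.01) -> bool:
    # Promote only if no tracked metric regresses beyond the
    # allowed threshold relative to the production baseline.
    for metric, base_value in baseline.items():
        if candidate.get(metric, 0.0) < base_value - max_regression:
            return False
    return True

def deploy_with_rollback(candidate: dict, baseline: dict,
                         deploy, rollback) -> str:
    # deploy/rollback are callables supplied by the pipeline,
    # e.g. a Kubernetes rollout and its undo.
    if should_promote(candidate, baseline):
        deploy()
        return "deployed"
    rollback()
    return "rolled_back"
```

A continuous evaluation loop then feeds fresh offline and canary metrics back into the `baseline` dict on each release.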

Infrastructure & Automation (15%)

  • Manage cloud infrastructure (Kubernetes, serverless)
  • Implement Infrastructure as Code (Terraform, Pulumi)
  • Build monitoring and observability systems (Prometheus, Grafana, DataDog)
  • Automate operational tasks with AI agents
  • Ensure security and compliance (OWASP, SOC 2), including AI-specific security controls

Developer Enablement (10%)

  • Provide tools and libraries for engineers to adopt AI-augmented workflows securely.
  • Document AI/ML best practices and patterns.
  • Conduct training on MLOps tools and workflows.
  • Support engineers with AI integration challenges.
  • Maintain development environment parity.
  • Champion AI privacy, governance, and compliance practices.

Education and Minimum Qualification

  • BS in Computer Science, Computer Engineering, Mathematics, or a related field, or equivalent industry experience; Master's degree preferred.
  • 8+ years in DevOps, SRE, or platform engineering
  • 5+ years hands-on experience with ML/AI systems in production
  • Deep understanding of LLMs and their operational requirements
  • Experience building and maintaining CI/CD pipelines
  • Strong Linux/Unix systems knowledge
  • Cloud platform expertise (AWS, GCP, or Azure)
  • Experience with container orchestration (Kubernetes)

Key Skills

MLOps & AI:

  • LLM Integration: OpenAI API, Anthropic API, HuggingFace, Azure OpenAI
  • Model Serving: TensorFlow Serving, TorchServe, vLLM, Ollama
  • Experiment Tracking: MLflow, Weights & Biases, Neptune.ai
  • Model Registries: MLflow, Kubeflow, AWS SageMaker
  • Vector Databases: Pinecone, Weaviate, Chroma, Milvus
  • Agent Frameworks: LangChain, LlamaIndex, AutoGPT, Semantic Kernel
  • Fine-tuning: LoRA, QLoRA, prompt tuning

Data Engineering:

  • Pipelines: Airflow, Prefect, Dagster
  • Processing: Spark, Dask, Ray
  • Streaming: Kafka, Pulsar, Kinesis
  • Data Quality: Great Expectations, dbt
  • Feature Stores: Feast, Tecton

DevOps & Infrastructure:

  • Containers: Docker, Kubernetes, Helm
  • Cloud Platforms: AWS (SageMaker, Lambda, ECS) OR GCP (Vertex AI, Cloud Run) OR Azure (ML Studio)
  • IaC: Terraform, Pulumi, CloudFormation
  • CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
  • Orchestration: Kubernetes operators, Kubeflow

Monitoring & Observability:

  • Metrics: Prometheus, Grafana, CloudWatch
  • Logging: ELK Stack, Loki, CloudWatch Logs
  • Tracing: Jaeger, Zipkin, OpenTelemetry
  • Alerting: PagerDuty, Opsgenie
  • Model Monitoring: Arize, Fiddler, Evidently

Programming:

  • Python: Primary language for ML/AI
  • Libraries: NumPy, Pandas, PyTorch/TensorFlow, scikit-learn
  • FastAPI, Flask for serving
  • Go: For high-performance services and tooling
  • Shell Scripting: Bash, Python for automation
  • SQL: Advanced queries, optimization

AI-Assisted Operations:

  • Autonomous agents for incident response
  • AI-powered log analysis and anomaly detection
  • Automated root cause analysis
  • Intelligent alerting and noise reduction

Other Highly Desirable Skills:

  • Experience with LLM fine-tuning and deployment at scale
  • Background in data engineering or ML engineering
  • Startup or high-growth environment experience
  • Security certifications (CISSP, AWS Security)
  • Contributions to open source MLOps projects
  • Experience with multi-cloud or hybrid cloud
  • Prior software engineering experience

Success Metrics

  • Uptime: 99.9%+ availability for AI services
  • Deployment Frequency: Daily or on-demand deployments
  • Model Performance: Latency (p95 < 500ms), accuracy tracking
  • Cost Efficiency: Cost per inference, infrastructure utilization
  • Developer Velocity: Time to deploy new models, AI feature adoption rate
  • Incident Response: MTTD (Mean Time to Detect), MTTR (Mean Time to Resolve)
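For reference, the p95 latency target above can be computed with the nearest-rank percentile method, a common convention; monitoring stacks may interpolate slightly differently:

```python
import math

def p95(latencies_ms: list[float]) -> float:
    # Nearest-rank 95th percentile: the value at position
    # ceil(0.95 * n) in the sorted sample.
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def meets_latency_slo(latencies_ms: list[float],
                      threshold_ms: float = 500.0) -> bool:
    # True when the observed p95 is under the SLO threshold.
    return p95(latencies_ms) < threshold_ms
```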

  • #KATIM


    Job Details

    Role Level: Entry-Level
    Work Type: Full-Time
    Country: United Arab Emirates
    City: Abu Dhabi
    Company Website: http://www.katim.com
    Job Function: Information Technology (IT)
    Company Industry/Sector: Computer and Network Security


    About the Company

    Searching, interviewing, and hiring are all part of professional life. The TALENTMATE portal brings these requisites together under one roof to help professionals with each of them. Whether you're hunting for your next job opportunity or looking for potential employers, we're here to lend a helping hand.

    Disclaimer: talentmate.com is only a platform to bring jobseekers & employers together. Applicants are advised to research the bona fides of the prospective employer independently. We do NOT endorse any requests for money payments and strictly advise against sharing personal or bank-related information. We also recommend you visit Security Advice for more information. If you suspect any fraud or malpractice, email us at abuse@talentmate.com.

