Job Description

Responsibilities

  • Build evaluation datasets and harnesses - golden sets, regression suites, and LLM-as-judge harnesses (verified against human labels).
  • Design observability so anyone can answer "Did the agent get worse this week?" in under 10 minutes, with a chart, not vibes.
  • Gate every prompt PR on automated eval scores, with quality embedded in design rather than added as a downstream gate.
  • Partner with AI engineering on red-teaming adversarial datasets for PII, jailbreaks, and prompt injection.
  • Run load and chaos tests on async LLM pipelines.

Requirements

  • 2-5 years in QA / SDET / SET / quality engineering, with at least 1-1.5 years in backend / API / systems testing.
  • Strong fluency in any modern language - TypeScript, Java, Go, or Python. Language is not a barrier.
  • Experience with a modern test framework, including non-trivial fixtures and plugins: Vitest / Jest / Pytest / JUnit + RestAssured / equivalent.
  • One of the following: contract testing (Pact / Postman / Schemathesis), load testing (k6 / Locust / JMeter), distributed tracing (OpenTelemetry / Datadog / Honeycomb), or CI test infrastructure.
  • Comfort reading backend code in PRs and using test management tooling (JIRA / Zephyr / TestRail).

Bonus (genuinely a bonus, not silent rejects)

  • Hands-on with LLM / RAG / agent / voice systems.
  • Eval tooling: Langfuse, LangSmith, Phoenix, Braintrust, Ragas, DeepEval, OpenAI Evals.
  • Voice/telephony testing: call quality, latency, ASR/TTS evaluation.
  • Regulated-domain QA: PII, audit trails, compliance gates.
  • Hindi or other Indic language testing.
  • Open-source contributions in test or eval tooling.

Stack We Use Today

  • AI integration: Anthropic SDK in TypeScript, embedded in our existing application.
  • Eval/observability: Langfuse, LangSmith, OpenTelemetry, plus internal harnesses.
  • Languages: TypeScript preferred for AI app code; Python, Go, or Java elsewhere as the problem demands.
  • Coding assistants: Codex and Claude Code are part of normal development.

We hire on primitives: evaluation rigour, observability, contract literacy, and failure-mode imagination. Tools turn over; primitives don't.

What We're Looking For

  • Systems thinking over screen thinking. You reason about contracts, retries, latency, streaming, and async, not just what's on the page.
  • Eval-first instinct. Asked to test a chatbot, you reach for a golden dataset, not Selenium.
  • You write code. Not glue scripts, but code that survives a senior engineer's review.
  • You debug from telemetry. You've found the root cause from logs and traces. You've killed a flaky test and have an opinion on why most flaky tests are actually bad tests.
  • You work alongside coding agents (Codex, Claude Code) and review their output as critically as a human would.

This job was posted by S M Nandakishore from CAW Studios.


Job Details

Role Level: Not Applicable
Work Type: Full-Time
Country: India
City: Hyderabad, Telangana
Company Website: https://www.cawstudios.com
Job Function: Engineering
Company Industry/Sector: Software Development
