Job Description
We are seeking a Senior AI Platform Engineer to design, build, and operate scalable AI/ML platform infrastructure on AWS, with a strong emphasis on platform reliability, visibility, and observability.
In this role, you will enable data scientists and application teams to safely deploy and operate AI workloads by providing resilient infrastructure, standardized tooling, and deep operational insight across environments.
This is a hands-on senior engineering role that blends cloud infrastructure, DevSecOps principles, and AI platform enablement.
Responsibilities
AI Platform & AWS Infrastructure
Design, build, and operate a cloud-native AI/ML platform on AWS supporting training, inference, and experimentation workloads, spanning orchestration layers, agents, MCP tools, internal APIs, and databases.
Build and maintain core multi-tenant services that enable the development, testing, deployment, monitoring, and lifecycle management of LLM-based agents.
Take AI prototypes to production-grade deployments with a strong ownership and self-starter mindset.
Implement verification layers, citations, and security guardrails to ensure predictable and safe agent behavior, treating prompts as untrusted inputs and model outputs as unverified code requiring validation.
Architect scalable, secure environments using AWS services such as EKS, ECS, EC2, S3, SageMaker, IAM, VPC, CloudWatch, Lambda, Bedrock, and related native tooling.
Implement Infrastructure as Code (IaC) using Terraform, CloudFormation, or equivalent frameworks.
Optimize platform performance, cost efficiency, reliability, and scalability.
Champion best practices for MLOps, agent evaluation, and system observability.
Observability & Platform Visibility
Lead the design and implementation of end-to-end observability across AI platforms, including metrics (infrastructure, application, and ML-specific), centralized structured logging, and distributed tracing across service layers and model-serving paths.
Define and maintain golden signals, SLIs/SLOs, and alerting strategies for AI workloads and platform components.
Integrate observability tooling such as CloudWatch, OpenTelemetry, Prometheus, Grafana, Datadog, New Relic, or similar platforms, including support for agentic workflows and external-facing MCPs.
Enable visibility into model performance, data drift, inference latency, error rates, and resource utilization, and help drive corrective actions.
Reliability, Security & Governance
Apply DevOps best practices to improve platform reliability, resilience, and operational excellence.
Partner closely with security and IT audit teams to deliver secure-by-default infrastructure, including IAM least-privilege access, network isolation, encryption, and compliance controls.
Support enterprise governance, auditability, and regulatory compliance requirements for AI systems.
Build guardrails that enable safe experimentation while ensuring secure and compliant production deployments.
Routinely enforce and validate adherence to security and governance standards.
Developer Enablement & Collaboration
Create reusable platform components, templates, and self-service tooling to accelerate adoption and reduce time to market.
Collaborate with data science, ML engineering, and application teams to understand requirements and unblock platform adoption.
Participate in on-call rotations and incident response, contributing to root cause analysis and continuous improvement.
Provide technical leadership, architectural guidance, and mentorship to other engineers, raising the bar for AI engineering practices across the organization.
Contribute to documentation, runbooks, and operational best practices.
Stay current with emerging AI research and translate novel techniques into production-ready solutions.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Skills and Requirements
B.S. or M.S. in Computer Science, Engineering, or a related field, or equivalent practical experience.
7+ years of experience in cloud infrastructure, platform engineering, DevOps, or SRE roles.
Deep hands-on experience with AWS in production environments, especially for agentic AI workloads.
Proven experience designing and implementing observability and monitoring systems at scale.
Strong working knowledge of supporting agentic workflows, including setup and support of external-facing MCPs.
Expertise in Infrastructure as Code (Terraform or CDK).
Solid understanding of networking and cloud security fundamentals.
Proficiency in at least one programming or scripting language (e.g., Python, JavaScript).
Proficiency with relational databases, data warehouses, and vector databases.
Experience supporting AI/ML platforms (e.g., Bedrock AI services, AgentCore on AWS, GCP) and the ability to work with different cloud providers to remain cloud agnostic.
Familiarity with MLOps concepts such as model deployment, monitoring, and lifecycle management.
Experience with modern observability architectures.
Knowledge of cost optimization strategies for large-scale cloud and AI workloads.
Experience working in large, regulated, or asset-heavy enterprise environments.
Experience with SLMs and external-facing MCPs would be a huge plus.