Senior ML Infrastructure Engineer
Job Description
Anthropic is seeking a Senior ML Infrastructure Engineer to build and maintain the robust, scalable infrastructure required to train and deploy our state-of-the-art AI systems, including our large language models. You will be instrumental in ensuring our ML workflows are efficient, reliable, and able to handle massive computational demands.
**Responsibilities:**
* Design, develop, and maintain scalable ML infrastructure, including distributed training systems, data pipelines, and model serving frameworks.
* Optimize infrastructure for performance, cost, and reliability.
* Collaborate with ML researchers and engineers to understand their infrastructure needs and provide effective solutions.
* Implement best practices for MLOps, including CI/CD, monitoring, and version control for ML models and data.
* Troubleshoot and resolve complex infrastructure issues.
**Minimum Qualifications:**
* Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
* 5+ years of experience in software engineering, with a significant focus on infrastructure or ML infrastructure.
* Strong proficiency in Python and experience with cloud platforms (AWS, GCP, or Azure).
* Experience with containerization technologies (Docker, Kubernetes).
* Familiarity with distributed computing frameworks (e.g., Ray, Spark) and ML frameworks (e.g., PyTorch, TensorFlow).
**Preferred Qualifications:**
* Master's degree or PhD in a relevant field.
* Experience with large-scale ML training infrastructure.
* Deep understanding of GPU computing and optimization.
* Experience building and managing CI/CD pipelines for ML systems.
Anthropic offers competitive compensation, generous benefits, and the opportunity to work on safety-critical AI research and development.
Skills & Tags
mlops, infrastructure, distributed systems, python, aws