Machine Learning Engineer, Infrastructure
Job Description
OpenAI is looking for a talented Machine Learning Engineer to join our infrastructure team. You will play a critical role in building and scaling the systems that power our state-of-the-art AI models. This involves designing, developing, and maintaining robust, efficient, and reliable ML infrastructure, including training platforms, distributed systems, and data pipelines. Your work will directly impact our ability to train larger, more capable models and serve them to millions of users.
Responsibilities:
- Develop and optimize distributed training systems for large-scale deep learning models.
- Build and maintain high-performance ML infrastructure, including compute, storage, and networking.
- Design and implement data processing pipelines for massive datasets.
- Collaborate with research scientists and other engineers to improve model training speed and efficiency.
- Troubleshoot and resolve complex issues in production ML systems.
Qualifications:
- BS/MS degree in Computer Science, Electrical Engineering, or a related field, or equivalent practical experience.
- Strong software engineering skills with experience in Python and C++.
- Experience with distributed systems, cloud computing (AWS, Azure, GCP), and containerization (Docker, Kubernetes).
- Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow) and ML infrastructure tools.
Benefits:
- Competitive salary and equity.
- Excellent health, dental, and vision insurance.
- Generous paid time off.
- Opportunity to work on impactful AI technologies with a world-class team.
Skills & Tags
machine learning, infrastructure, mlops, distributed systems, python