Software Engineer - Infrastructure
Emerald
Location
Bay Area, Boston, Washington D.C.
Employment Type
Full time
Location Type
On-site
Department
Engineering
About Emerald
We’re at a pivotal moment in AI and energy: compute demand is surging, but power constraints threaten to stall innovation. Emerald AI operates at the nexus of AI and energy, pioneering solutions that let AI factories scale without overwhelming the grid. By making data centers flexible through the Emerald Conductor software platform, we can unlock immense AI growth with limited capital costs, all while stabilizing the grid and enabling more renewables.
Our team of AI, cloud, software, and energy experts is on a mission to unlock AI’s potential sustainably, backed by premier investors and industry leaders like Radical Ventures and NVIDIA. Read more about our team, story, and backers at https://www.emeraldai.co/.
About the Role
As an Infrastructure Software Engineer at Emerald AI, you’ll build and operate the core systems that let customers run large-scale ML workloads reliably and flexibly without overwhelming the grid. Your work will directly support our mission to help AI “factories” scale while stabilizing the power system and enabling more renewables.
Day to day, you’ll configure and maintain cutting-edge ML compute clusters; design orchestration components that integrate with customer and partner environments; deploy secure, reliable cloud services; build scalable data pipelines for real-time control of large distributed systems; and strengthen our testing and continuous deployment practices.
You’ll leverage a major cloud provider (e.g., AWS or GCP) and modern IaC/Kubernetes tooling to deliver resilient, scalable platforms.
Key Responsibilities
Configure and maintain state-of-the-art ML compute clusters for research, product development and testing.
Develop orchestration components for integration with customer/partner clusters running large-scale ML workloads.
Deploy secure and reliable cloud infrastructure and services.
Build scalable data pipelines that support real-time control of large distributed systems.
Enhance testing and continuous deployment capabilities to promote software quality and reliability.
Minimum Requirements
BS, MS, or PhD in CS, EE, or another relevant field, with 3+ years of industry experience in backend software development and cloud/ML infrastructure.
Proficiency in one or more programming languages: Python, Go, Rust, or C/C++.
Expert-level knowledge of at least one major cloud platform (e.g., AWS or GCP).
Understanding of distributed systems and AI/ML infrastructure.
Significant experience with Terraform and other Infrastructure as Code tools.
Experience with Kubernetes in a production environment.
Preferred Requirements
Experience building CI/CD pipelines with GitHub Actions, Jenkins, or similar.
Experience with large-scale distributed AI/ML platform tooling (e.g., DeepSpeed, Horovod, Ray).
Exposure to telemetry systems such as Prometheus and Grafana.
What We Offer
A chance to join a team of industry leaders and experts working at the nexus of two pivotal industries in a collaborative and collegial environment.
Building from zero to one: help the team build from the ground up. In addition to this role, you have the opportunity to contribute to strategy development, GTM planning, org design, and customer and investor interfacing.
Comprehensive benefits package, including medical, dental, vision, and life insurance, in addition to a 401(k).
Location flexibility between our three hubs in the San Francisco Bay Area, D.C., and Boston with 1 WFH day/week.