Reddit is a community of communities. It’s built on shared interests, passion, and trust and is home to the most open and authentic conversations on the internet. Every day, Reddit users submit, vote, and comment on the topics they care most about. With 100,000+ active communities and approximately 101M+ daily active unique visitors, Reddit is one of the internet’s largest sources of information. For more information, visit redditinc.com. Location: This role is completely remote-friendly. If you happen to live close to one of our physical office locations, our doors are open for you to come into the office as often as you’d like. Who We Are: The Machine Learning Platform team at Reddit is a high-impact team that owns the infrastructure that powers recommendations, content discovery, user and content quantification, while directly impacting other teams such as Growth, Ads, Feeds, and Core Machine Learning teams. What You’ll Do: As a Staff Software Engineer, ML Training Platform, this person will be instrumental in architecting, implementing, and maintaining foundational Machine Learning Training infrastructure that powers Feeds Ranking, Content Understanding, Recommendations and much more to fulfill Reddit’s mission of bringing community and belonging to everyone in the world. You will oversee GPU optimization in AI/ML batch workloads and debug performance bottlenecks in GPU workloads. You will build and own systems and tools that enable MLEs and data scientists, and continuously improve the ML software development lifecycle.
- Optimize model training on GPUs
- Lead the building, testing, and maintenance of ML infrastructure at Reddit
- Propose, design, and implement high-performance ML platform solutions that significantly advance the deployment of models that serve millions of redditors a seamless experience for MLEs
- Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows
- Analyze bottlenecks in distributed systems and optimize for performance and cost-efficiency
- Work with management on team goal setting, planning, and de-risk project execution
- Mentor other team members in adopting a rigorous DevOps approach to maintain and/or improve ML platform components and services health and quality
- 8+ years of work experience in a production software development environment or building data systems
- Experience with XLA for Tensorflow or torch.inductor for pytorch for kernel fusion during training
- Experience with optimization of data workloads using collosal.AI or Deepspeed
- Experience with distributed Training optimization using deepspeed, horovod or collosalAI
- Experience with design and architecture of large scale ML Systems
- Experience with training workflows, hyperparameter tuning, and resource optimization on CPU and GPU
- Experience with MLOps practices and tools such as Ray and MLFlow
- Hands-on experience with Kubernetes, Docker, or other container orchestration systems
- Comprehensive Healthcare Benefits and Income Replacement Programs
- 401k Match
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits
- Flexible Vacation & Reddit Global Days off
- Generous paid Parental Leave
- Paid Volunteer time off