As an Infrastructure Engineer at Oculus, you will help us design and build a cutting-edge server environment that powers the Oculus platform. You will create an environment that can scale to millions of users, complete with integrated monitoring and management tools that keep the team informed of service status 24/7. You will be responsible for making sure that Oculus services perform and scale with zero downtime. The ideal candidate is a stellar engineer with expertise in web platform technical operations and a passion for building new services.
- Help design, implement, and manage the platform and DevOps architecture from end to end.
- Collaborate with the live team, ensuring that the customer experience is constantly monitored, measured, and improved.
- Champion overall uptime, resolving application, performance, and systems incidents and errors as quickly as possible.
- Own our operational strategies around disaster recovery, migration, roll-back, expansion, routine deployments, and system upgrades.
- Evaluate the latest monitoring tools and automation systems.
- Develop and maintain the DevOps infrastructure used to manage the platform.
- Gather current service availability metrics for review and identify opportunities for improvement.
- BS or MS in Computer Science or relevant field
- 5+ years of software engineering experience on live web services
- Experience building and maintaining a cloud computing architecture in Software as a Service (SaaS) and/or Platform as a Service (PaaS) categories
- Experience developing distributed networked applications that make use of protocols such as TCP, UDP, HTTP, etc.
- Experience with event-driven programming in any language or runtime (e.g., Node.js) is a plus
- Experience scaling services to handle a large number of users and the resulting traffic
- Experience developing and maintaining continuous integration, deployment, and testing services
- Experience implementing configuration management with Chef or Puppet in a large-scale distributed deployment
- Strong analytical and troubleshooting skills
- Ability to take ownership and exercise good judgment