Site Reliability Engineer II

Requisition ID
US-CA-Agoura Hills
Position Type


This is a 24/7 team responsible for production systems health monitoring, deployment of code changes, escalation handling, and standardized communication of all change management within the technical operations organization. Team members multitask and prioritize system events according to severity and escalation procedures, and communicate accurately with both internal and external groups during production emergencies. The individual in this role must also be comfortable navigating both Unix and Windows environments and will be actively involved in troubleshooting and resolving production issues.

Job Description

  • Establish round-the-clock health monitoring of the Unix and Windows environments hosting various platforms (web, mobile, and telephony) using server, network, and application monitoring systems
  • Own the escalation process, bringing production issues to resolution via troubleshooting, communication, and subsequent updates. Issues are owned from start to finish and tracked in the enterprise change management ticketing system. The operations center is responsible for gathering troubleshooting information, either for direct resolution or for handoff to the escalation party.
  • Build deep, full-stack knowledge of the complexity of our platforms and applications. Work to simplify and automate deployment processes and run-time operations, and deliver non-disruptive releases.
  • Handle stressful situations, such as initiating emergency conference bridge calls and sending quick and accurate outage notifications
  • Use standardized communications for code releases, scheduled maintenance, and service interruptions
  • Monitor adherence to infrastructure change management policies and procedures
  • Communicate with departments, vendors, and partners, serving as the central point of contact for information on production site, customer support, help desk, and core systems issues across the entire organization
  • Deploy and release engineering code across multiple environments: communicate all builds and releases and apply them to staging and production environments according to standard operating procedures
  • Maintain application reliability and uptime SLAs throughout the application lifecycle using programmatic self-healing and software automation
  • Improve the durability, simplicity, performance, and maintainability of our integrated landscape
  • Apply your expertise to our most complex digital projects. Meet with department and business unit clients to understand their application needs.
  • Drive feedback and new design ideas back into the platform team. Help identify and remediate architectural, scalability, availability, and performance issues.
  • Provide application support for Unix and Windows applications, including performing various system administration tasks and performing standard operating procedures as needed to maintain system health
  • Perform other related duties as required and assigned
  • Demonstrate behaviors which are aligned with the organization’s desired culture and values

The ideal candidate will have the following:

  • 2+ years of previous operations center or equivalent experience 
  • 2+ years of direct Unix experience (running scripts, grepping logs, troubleshooting errors)
  • 2+ years of direct Windows experience (running scripts, processing event log messages, troubleshooting errors) 
  • 2+ years of direct VMware Horizon 7 experience
  • Hands-on experience working with Amazon Web Services (AWS)
  • Hands-on experience working with backup solutions (Commvault, Veeam)
  • Must be comfortable working in command-line as well as GUI environments
  • Excellent written and oral communication skills
  • Strong business acumen and ability to interface with executive management

