Team Lead - Site Reliability Engineering
Overall Job Summary
As a Team Lead of Site Reliability Engineering, you will manage and oversee the engineering teams supporting a large-scale distributed application portfolio across on-prem and Cloud environments. With a focus on increasing efficiency, eliminating downtime, optimizing cost, and maintaining performance at scale, you will lead our performance management, application security, and reliability processes while managing the health of core E-Commerce systems, site performance, and reliability solutions.
Essential Duties and Responsibilities
Manages end-to-end availability and reliability of E-Commerce services, systems, platforms, and infrastructure, and ensures they are designed and operated optimally
Maintains security and performance of mission-critical applications and services that are part of the E-Commerce ecosystem
Partners with Information Security to manage application security, vulnerability remediation, and site compliance
Partners with Cloud and Infrastructure teams to build and maintain environments and to optimize usage and cost through an appropriate scaling strategy
Manages the performance strategy, test executions and remediation of critical site findings
Establishes application and synthetic monitoring and alerting, and implements failover capabilities and automated self-healing and recovery.
Provides day-to-day support for multiple environments, ensuring readiness for project development and test activities
Employs strong site reliability principles and practices, and continuously improves processes through automation.
Partners with internal and external teams to ensure all change and release activities are reviewed for trouble-free rollout and reduced risk.
Owns day-to-day health, uptime, monitoring and reliability of the website and related services
Leads continuous improvement efforts that create an operating environment with dynamic monitoring, alerting, failover capabilities, and automated self-healing and recovery.
Participates in and maintains 24x7 on-call rotations for Site Reliability.
May perform other duties as assigned
Required Qualifications
Experience: 9+ years of experience in performance engineering, application monitoring, and security for an organization with large and complex information systems is preferred. 6+ years of experience in B2B or B2C customer-facing software design and development. 3+ years of experience in cloud PaaS/IaaS environments (Azure, GCP), release management, vulnerability management, and automation.
Education: Bachelor's degree in Computer Science or related field is required. Any suitable combination of education and experience will be considered.
Preferred knowledge, skills or abilities
Strong experience with IBM/HCL WebSphere Commerce, IBM Sterling Commerce, Solr, and related build and deployment processes. HCL Commerce Version 9 experience is a plus.
Strong experience with IBM HTTP Server, IBM WebSphere Application Server, IBM MQ, and Deployment Manager ND/Liberty software.
Strong experience developing and implementing comprehensive monitoring solutions that provide full visibility into the different platform and application components, using tools and services such as Kubernetes, Prometheus, Grafana, ECK/ELK, Dynatrace, Rigor, and Quantum Metrics.
Ability to evaluate performance trends and expected changes in demand and capacity, and to establish appropriate scalability plans.
Ability to evaluate production traffic patterns and tune the performance test workload mix and strategy to keep systems and applications in continuous readiness.
Experience designing, implementing, and maintaining Kubernetes, AKS, and Azure Cloud platforms through cost-efficient models.
Experience with containerization, certificate management, Kafka, ZooKeeper, Vaults, pipeline automation, Fisheye, Crucible, and performance and QA test tool integrations.
Strong experience with cloud PaaS/IaaS environments (Azure).
Strong ability to work independently, work in a fast-paced environment, and manage workload prioritization to deliver high quality work products on time with minimal direction.
Strong communication skills, both written and verbal.
Strong critical thinking skills, with the ability to apply proven problem-solving approaches to most problems.
Experience with mobile app (iOS and Android) security and performance management is a plus.
Working Conditions
Hybrid / Flexible working conditions
Must be able to work some nights and weekends
Occasional travel required
Physical Requirements
Sitting
Standing (not walking)
Walking
Kneeling/Stooping/Bending
Reaching overhead
Lifting up to 20 pounds
Disclaimer
This job description represents an overview of the responsibilities for the above-referenced position. It is not intended to represent a comprehensive list of responsibilities. A team member should perform all duties as assigned by his/her supervisor.
Company Info
ALREADY A TEAM MEMBER?
You must apply or refer a friend through our internal portal
Click here (https://performancemanager4.successfactors.com/sf/home?company=tractorsup)
CONNECTION
Our Mission and Values are more than just words on the wall - they're the one constant in an ever-changing environment and the bedrock on which we build our culture. They're the core of who we are and the foundation of every decision we make. It's not just what we do that sets us apart, but how we do it.
EMPOWERMENT
We believe in managing your time for business and personal success, which is why we empower our Team Members to lead balanced lives through our total rewards benefits offerings for full-time and eligible part-time TSC and Petsense Team Members. We care about what you care about!
OPPORTUNITY
A lot of care goes into providing legendary service at Tractor Supply Company, which is why our Team Members are our top priority. Want a career with a clear path for growth? Your Opportunity is Out Here at Tractor Supply and Petsense.
Nearest Major Market: Nashville