[Shanghai] PwC SRE Hiring: PwC AC Shanghai - SRE/DevOps
Key responsibilities:
Provide SRE support for multiple distributed software applications (client-facing, both internal and external)
Manage and continually improve platform infrastructure and applications for high reliability, resiliency, performance and quality, and faster time-to-market, taking a holistic view of system health into account
Gather and analyze metrics from both systems and applications for performance tuning and fault finding
Partner with development teams to improve services through rigorous testing and release procedures that meet security, compliance and performance requirements
Participate in systems design, platform management, and capacity planning. Ensure that platforms are designed with "operability" in mind
Preemptively pursue the discovery of system faults throughout the application lifecycle, both before and after release
Define, implement and be accountable for velocity and reliability (SLIs, SLOs, error budgets)
Create and support sustainable systems and services through automation (to drive the problems away, not merely automate them) and uplifts for infrastructure, testing, failover solutions, failure mitigation, etc.
Write, update, and use documentation, including runbooks/playbooks
Use chaos engineering to test the robustness of systems and applications
Background and skill requirements:
3+ years of professional experience with various flavors of Linux and/or Windows
3+ years of experience supporting and troubleshooting full-stack applications (monolithic and microservices), infrastructure and legacy applications (root cause analysis through identifying, analyzing and remediating service performance and availability issues to ensure maximum service uptime and availability)
3+ years of experience balancing service reliability, metrics, sustainability, technical debt, and operational toil for live services running at scale
3+ years of experience with cloud computing technology and its concepts (Azure, AWS, GCP)
3+ years of experience with container technologies and orchestration (Docker, Kubernetes: AKS, EKS, GKE)
3+ years of implementing DevOps practices at scale
Experience in one or more of the following: Go, Python, Ruby, Java, Perl, Shell, or PowerShell
Experience with the CI/CD toolchain: Git, Jenkins, Azure DevOps, Veracode, SonarQube, JFrog Artifactory
Experience with IaC using Terraform, ARM templates, and/or AWS CloudFormation templates
Experience with configuration management tools like Ansible, Puppet and/or Chef
Experience with DBaaS/managed cloud database technologies such as CosmosDB, DynamoDB, managed SQL (RDS, SQL Database), and in-memory stores (Cache for Redis, ElastiCache)
Experience with application performance monitoring tools (AppDynamics, Azure Application Insights, Dynatrace, or Datadog) and log management tools (Azure Monitor Log Analytics, Elastic Stack, and/or Splunk), including defining, creating and configuring metrics for dashboards and alerts
Experience with distributed storage technologies like Azure (Blob, Files, Tables), S3, NFS, HDFS
Experience with web server technologies: HTTP, Nginx, Apache, Tomcat
Experience with Kafka, Azure Event Hubs or similar message queue technologies
Experience with service mesh platforms such as Istio, HashiCorp Consul
Experience with secrets lifecycle management (Azure Key Vault, HashiCorp Vault)
Experience with minimal or near-zero downtime deployments such as blue-green, canary, rolling upgrades, etc.
Define and implement HA, DR and rollback strategies along with the product and build teams
Proficiency in networking concepts (HTTP/S, TCP/IP, DNS, virtual networks (VNet, VPC), subnets, routing, firewalls, network security, triaging packet loss, etc.) and knowledge of RESTful APIs
Experience with 24x7x365 monitoring, incident response and on-call support
Experience in troubleshooting that spans systems, network, and code
Experience determining and negotiating error budgets, SLIs, SLOs, and SLAs with product owners
Systematic problem-solving approach, coupled with strong communication skills
Ability to work independently and as a member of a greater team, including cross-team activities
Experience working with Agile Scrum and Kanban methodologies in the SDLC
Undergraduate degree or equivalent experience/certification
Experience in development of the complete application stack, inclusive of software engineering and systems engineering responsibilities (e.g. full-stack development)
Requirement gathering, validation, fulfillment and change management; infrastructure operations experience, including self-healing autonomy
Experience working within regulatory frameworks such as SOX, SOC2, etc.
Experience in chaos engineering
Experience with integration technologies like SnapLogic
Experience with a variety of databases and basic DBA skills (MySQL, SQL Server, Oracle, Postgres, Redis, Couchbase and/or Cassandra)
Source: v2k8s