[Shanghai] PwC SRE Recruitment: PwC AC Shanghai - SRE/DevOps

Key responsibilities:

  • Provide SRE support for multiple distributed software applications (client-facing, both internal and external)

  • Manage and continually improve platform infrastructure and applications for high reliability, resiliency, performance and quality, and faster time-to-market, taking a holistic view of system health into account

  • Gather and analyze metrics from both systems and applications for performance tuning and fault finding

  • Partner with development teams to improve services through rigorous testing and release procedures that meet security, compliance and performance requirements

  • Participate in systems design, platform management, and capacity planning. Ensure that platforms are designed with “operability” in mind

  • Preemptively pursue the discovery of system faults throughout the application lifecycle, both before and after release

  • Define, implement and be accountable for velocity and reliability (SLIs, SLOs, error budgets)

  • Create and support sustainable systems and services through automation (aimed at eliminating problems, not automation for its own sake) and uplifts for infrastructure, testing, failover solutions, failure mitigation, etc.

  • Write, update, and use documentation, including runbooks/playbooks

  • Use Chaos Engineering to test the robustness of systems and applications

Background and skill requirements:

  • 3+ years of professional experience with various flavors of Linux and/or Windows

  • 3+ years of experience supporting and troubleshooting full-stack applications (monolithic and microservices), infrastructure and legacy applications (root cause analysis through identifying, analyzing and remediating service performance and availability issues to ensure maximum service uptime and availability)

  • 3+ years of experience balancing service reliability, metrics, sustainability, technical debt, and operational toil for live services running at scale

  • 3+ years of experience with cloud computing technology and its concepts (Azure, AWS, GCP)

  • 3+ years of experience with container technologies and orchestration (Docker, Kubernetes: AKS, EKS, GKE)

  • 3+ years of experience implementing DevOps practices at scale

  • Experience in one or more of the following: Go, Python, Ruby, Java, Perl, Shell, or PowerShell

  • Experience with the CI/CD toolchain: Git, Jenkins, Azure DevOps, Veracode, SonarQube, JFrog Artifactory

  • Experience with IaC using Terraform, ARM templates, and/or AWS CloudFormation templates

  • Experience with configuration management tools like Ansible, Puppet and/or Chef

  • Experience with DBaaS/managed cloud database technologies such as CosmosDB, DynamoDB, managed SQL (RDS, SQL Database), and in-memory stores (Cache for Redis, ElastiCache)

  • Experience with application performance monitoring tools (AppDynamics, Azure Application Insights, Dynatrace, or Datadog) and log management tools (Azure Monitor Log Analytics, Elastic Stack, and/or Splunk), including defining, creating and configuring metrics for dashboards and alerts

  • Experience with distributed storage technologies like Azure (Blob, Files, Tables), S3, NFS, HDFS

  • Experience with web server technologies: HTTP, Nginx, Apache, Tomcat

  • Experience in Kafka, Azure Event Hubs or similar message queue technologies

  • Experience with service mesh platforms such as Istio, HashiCorp Consul

  • Experience with secrets lifecycle management (Azure Key Vault, HashiCorp Vault)

  • Experience with minimal or near-zero downtime deployments such as blue-green, canary, rolling upgrades, etc.

  • Define and implement HA, DR and rollback strategies along with the product and build teams

  • Proficiency in networking concepts (HTTP/S, TCP/IP, DNS, virtual networks (VNet, VPC), subnets, routing, firewalls, network security, triaging packet loss, etc.) and knowledge of RESTful APIs

  • Experience with 24x7x365 monitoring, incident response and on-call support

  • Experience in troubleshooting that spans systems, network, and code

  • Experience determining and negotiating error budgets, SLIs, SLOs, and SLAs with product owners

  • Systematic problem-solving approach, coupled with strong communication skills

  • Ability to work independently and as a member of a greater team, including cross-team activities

  • Experience working with Agile Scrum and Kanban methodologies in the SDLC

  • Undergraduate degree or equivalent experience/certification

  • Experience in development of the complete application stack, inclusive of software engineering and systems engineering responsibilities (e.g. full-stack development)

  • Requirement gathering, validation, fulfillment and change management; infrastructure operations experience including self-healing autonomy

  • Experience working within regulatory frameworks such as SOX, SOC2, etc.

  • Experience in chaos engineering

  • Experience with integration technologies like SnapLogic

  • Experience with a variety of databases and basic DBA skills (MySQL, SQL Server, Oracle, Postgres, Redis, Couchbase and/or Cassandra)

Source: v2k8s
