Strategically move infrastructure assets to Google Cloud and evolve the infrastructure architecture into a hybrid on-prem/cloud model, with the goal of optimizing infrastructure and data asset ROI

  • Ability to spin up large ad-hoc data processing jobs
  • Flexible capacity for production workloads
  • Direct chargeback
  • Decoupled storage and compute
  • Managed Hadoop, Spark, and Beam (Dataflow)
  • Get around Hadoop file count issues
  • Disaster Recovery
  • Usability of cold data
  • Simplified data access
  • Potentially replace Vertica with BigQuery
  • Compliance - SOX, HIPAA, PCI

Initial Phases

Migrate COLD & ADHOC clusters’ HDFS data to GCS

Automate deployment of Hadoop clusters on GCE

Build a management framework for users and groups that sits on top of GCS ACLs

Redesign job and workflow tooling to support object store and new GCE cluster(s)

Support hourly ingest of on-premises Hadoop data into GCS

Develop a change management framework to prepare and support teams using the GCE cluster(s) backed by object store
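
As an illustration of the HDFS-to-GCS migration and hourly ingest phases above, the following is a minimal sketch of how one hourly partition might be mapped from HDFS to a GCS path and copied with `hadoop distcp`. It assumes the GCS connector is installed on the source cluster. The NameNode address, bucket name, dataset name, and the year/month/day/hour directory layout are all hypothetical placeholders, not details from this plan.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values for illustration only: the NameNode address, bucket,
# and directory layout are assumptions, not part of the plan.
HDFS_ROOT = "hdfs://namenode.example.internal:8020"
GCS_BUCKET = "gs://example-cold-data"


def hourly_partition(dataset: str, hour: datetime) -> str:
    """Return the relative path of one hourly partition (dataset/YYYY/MM/DD/HH)."""
    return f"{dataset}/{hour:%Y/%m/%d/%H}"


def previous_hour(now: datetime) -> datetime:
    """Return the start of the last fully completed hour before `now`."""
    return now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)


def distcp_command(dataset: str, hour: datetime) -> list[str]:
    """Build a `hadoop distcp` invocation that copies one hourly partition to GCS."""
    rel = hourly_partition(dataset, hour)
    return [
        "hadoop", "distcp",
        "-update",  # skip files that already match at the destination
        f"{HDFS_ROOT}/{rel}",
        f"{GCS_BUCKET}/{rel}",
    ]


if __name__ == "__main__":
    # Print the command an hourly scheduler could run for the previous hour.
    cmd = distcp_command("events", previous_hour(datetime.now(timezone.utc)))
    print(" ".join(cmd))
```

Keeping the same relative path on both sides means a job can switch between the on-prem cluster and a GCE cluster by changing only the filesystem prefix, which is one possible way to approach the job and workflow tooling redesign.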


Technology Stack:

  • Hadoop Infrastructure on GCP
  • TensorFlow
  • Dataflow
  • BigQuery
  • Dataproc
  • Presto