I'm a platform and site reliability engineer with hands-on experience operating production Kubernetes, multi-cloud infrastructure, and CI/CD at scale. I care most about service reliability, observability, and automating away toil.
At Cognitive Network Solutions, I designed and deployed multi-cloud infrastructure across GCP and Azure with Terraform. On top of it, I built and maintained Kubernetes platforms with Helm deployments, Ingress controllers, and namespace isolation for reproducible, reliable service delivery. I also engineered CI/CD pipelines with GitLab runners, embedding secrets management and least-privilege IAM.
I established the monitoring and observability layer (Prometheus, structured logging, and liveness/readiness probes with SLI/SLO instrumentation), which reduced time-to-detection and kept services reliable in production.
On the ML infrastructure side, I provisioned GPU node pools and inference auto-scaling with CUDA and MLflow. This was reliability and platform work for ML systems, not model development. Previously, at Dfinitiv, I built cloud-native data pipelines on AWS and GCP that cut processing time by over 60%.
I'm looking for an Associate SRE role on cloud-native, container-based platforms, where I can apply this experience to keeping production systems reliable and reducing operational toil through automation.