Matthew Fitzgerald

Platform / Site Reliability Engineer | Kubernetes, Multi-Cloud Infrastructure, CI/CD | Reliability & Observability

Hands-on experience operating production Kubernetes, multi-cloud infrastructure, and CI/CD at scale, with a focus on observability and automating away toil.

Education

Florida Tech - B.S. in Computer Science, 2021-2024

Work Experience

Platform Engineer at Cognitive Network Solutions - February 2025 - November 2025

  • Built and maintained Kubernetes platforms with Helm deployments, Ingress controllers, and namespace isolation for reproducible, reliable service delivery
  • Established monitoring and observability stacks (Prometheus, structured logging, liveness/readiness probes, SLI/SLO instrumentation) that reduced time-to-detection and improved service reliability
  • Designed and deployed multi-cloud infrastructure (GCP + Azure) with Terraform, provisioning secure storage, service accounts, VPC networking, and private clusters
  • Engineered CI/CD pipelines with GitLab runners, automating builds, scans, and deployments while embedding secrets management and least-privilege IAM
  • Developed databases secured with role-based access controls and integrated them into microservices
  • Provisioned GPU node pools and inference auto-scaling (CUDA, MLflow) for ML workloads, tuning for both cost and reliability

Junior Fullstack Engineer at EarthCam - February 2026 - May 2026

  • Re-architected the data-fetching layer of a core customer-facing component around asynchronous API calls, cutting fetch overhead and improving load time 3–5x
  • Integrated data-querying APIs across the application, improving data resolution and cutting redundant round-trips to lower latency
  • Delivered new pages and feature work across the stack in TypeScript and Node.js

Software Engineer, Intern at Dfinitiv.io - Summers 2023 and 2024

  • Engineered secure, cloud-native pipelines on AWS and GCP to automate ingestion and curation of digital media assets, reducing processing time by over 60%
  • Built and maintained asset metadata databases in PostgreSQL and MongoDB, enabling fast, reliable querying across thousands of records
  • Built and deployed Python applications and microservices using boto3, google-cloud-storage, psycopg2, and pymongo, ensuring scalability and portability
  • Built automated web scraping pipelines using Selenium and Playwright to gather and structure data from publicly available sources

About Me

I'm a platform and site reliability engineer with hands-on experience operating production Kubernetes, multi-cloud infrastructure, and CI/CD at scale. I care most about service reliability, observability, and automating away toil.

At Cognitive Network Solutions, I built and maintained Kubernetes platforms with Helm deployments, Ingress controllers, and namespace isolation for reproducible, reliable service delivery, on multi-cloud infrastructure I designed and deployed across GCP and Azure with Terraform. I engineered CI/CD pipelines with GitLab runners while embedding secrets management and least-privilege IAM.

I established the monitoring and observability layer (Prometheus, structured logging, and liveness/readiness probes with SLI/SLO instrumentation), which reduced time-to-detection and kept services reliable in production.

On the ML infrastructure side, I provisioned GPU node pools and inference auto-scaling with CUDA and MLflow. This was reliability and platform work for ML systems, not model development. Previously at Dfinitiv, I built cloud-native data pipelines on AWS and GCP that cut processing time by over 60%.

I'm looking for an Associate SRE role on cloud-native, container-based platforms, where I can keep production systems reliable and automate away toil.

Skills

Reliability & Observability

Prometheus, Structured Logging, Liveness / Readiness Probes, SLI/SLO Instrumentation, Incident Detection, Time-to-Detection Reduction, Auto-Scaling, On-Call-Ready Monitoring

Containers & Orchestration

Kubernetes, Helm, Docker, Ingress Controllers, Namespace Isolation

Infrastructure as Code & CI/CD

Terraform, GitLab CI/CD, Cloud Build

Cloud

AWS (Lambda, S3, IAM, Secrets Manager); GCP (GKE Autopilot, Cloud Run, IAM, Secret Manager); Azure (AKS, ACR, Blob Storage)

Linux & Networking

Linux Administration, VPC Networking, Private Clusters, VPN / IAP, TLS / HTTPS, DNS, Kong API Gateway, IAM / RBAC, Workload Identity, Secrets Management

GPU / ML Infrastructure

CUDA, MLflow, GPU Node-Pool Provisioning, Inference Auto-Scaling

Programming & Scripting

Python, Go, Bash, Java, C++, SQL (PostgreSQL), NoSQL (MongoDB)

Projects

LLM Fine-Tuning Pipeline

Fine-tuned Mistral-7B on financial news with LoRA adapters and served it as a streaming inference API with SSE token delivery, gated by an automated ROUGE-L evaluation.

Libraries Used: MLX, mlx-lm, FastAPI, LoRA, Hugging Face

ML Platform

Feature store and model registry backed by PostgreSQL and Redis, with scheduled feature pipelines, data quality validation, and MLflow experiment tracking.

Libraries Used: FastAPI, PostgreSQL, Redis, MLflow, APScheduler, SQLAlchemy

ML Drift Monitor

Real-time drift detection service using ADWIN on live inference streams, with scheduled retraining, webhook alerting, and automatic model promotion via the registry.

Libraries Used: FastAPI, River, MLflow, APScheduler, PostgreSQL

RAG Pipeline

Hybrid retrieval pipeline combining BM25 keyword search with dense vector retrieval, cross-encoder reranking, SSE token streaming, and per-sentence citation tracking.

Libraries Used: FastAPI, ChromaDB, BM25, sentence-transformers, mlx-lm

LLM Agent

ReAct-style agent with tools spanning a RAG pipeline, feature store, and drift monitor, with multi-turn conversation memory and live SSE streaming of each reasoning step.

Libraries Used: FastAPI, mlx-lm, SQLAlchemy, PostgreSQL, httpx

LLM Guardrails

Security proxy layer in front of the LLM agent enforcing semantic injection detection, PII scrubbing, per-client rate limit tiers, replay protection, and full audit logging.

Libraries Used: FastAPI, Redis, PostgreSQL, sentence-transformers, SQLAlchemy
