Cloud Architecture & Design Blog – by Abhishek Kothari

Microservices vs Monoliths: Making the Right Choice

Distributed Systems | By Abhishek Kothari | December 1, 2025

The choice between a monolithic and a microservices architecture is one of the most critical decisions a software engineering team faces. It impacts everything from development velocity and team structure to operational complexity and scalability. For Site Reliability Engineers, Software Engineers, and Software Architects, understanding the nuances of these paradigms is paramount to designing resilient,…

On-Call Best Practices: Preventing Burnout

Distributed Systems | By Abhishek Kothari | December 1, 2025

The relentless hum of production systems, the constant vigilance required to maintain their health, and the inevitable late-night pages are an inherent part of the modern software engineering landscape. For Site Reliability Engineers (SREs), Software Engineers, and Software Architects, on-call duty is not just a responsibility; it’s a foundational pillar of operational excellence. Yet, this…

Capacity Planning in the Cloud Era

cost-management, Site Reliability Engineering | By Abhishek Kothari | October 28, 2025

In the dynamic landscape of cloud computing, managing infrastructure effectively is less about static provisioning and more about intelligent, adaptive resource orchestration. Capacity planning, once a periodic, often tedious exercise of spreadsheet projections and hardware procurement, has transformed into a continuous, data-driven discipline. For Site Reliability Engineers (SREs), Software Engineers, and Architects, mastering capacity planning…

Incident Management: Post-Mortem Culture That Works

Site Reliability Engineering | By Abhishek Kothari | October 27, 2025

The world of complex distributed systems is inherently unpredictable. Despite our best efforts in design, testing, and deployment, incidents are not a question of “if,” but “when.” For Site Reliability Engineers, Software Engineers, and Architects, the true measure of an organization’s maturity isn’t the absence of incidents, but rather its response to them. This response,…

Observability vs Monitoring: Understanding the Difference

Site Reliability Engineering | By Abhishek Kothari | October 27, 2025

In the rapidly evolving landscape of distributed systems, microservices, and cloud-native architectures, the terms “observability” and “monitoring” are often used interchangeably, leading to confusion and, more critically, to systems that are difficult to understand and troubleshoot. For Site Reliability Engineers, Software Engineers, and Architects, understanding the nuanced yet fundamental differences between these concepts is not…

Chaos Engineering in Production: A Practical Guide

Site Reliability Engineering | By Abhishek Kothari | October 27, 2025

We’ve all been there. It’s 3 AM, and the pagers are screaming. A critical service is down, customers are impacted, and the on-call team is scrambling through logs and dashboards, trying to piece together a puzzle in the dark. The postmortem later reveals the cause: a rare, cascading failure triggered by a minor network blip—a…

Navigating Cloud FinOps: Tools and Strategies for Financial Efficiency

cost-management, ops | By Abhishek Kothari | January 13, 2024

In the ever-evolving landscape of technology, organizations are constantly searching for ways to optimize their operations and enhance financial efficiency. The advent of cloud computing has brought unparalleled opportunities for scalability and innovation, but it has also introduced new challenges in managing costs effectively. Enter FinOps, short for Financial Operations, a discipline that plays…

System Design – A feature-rich Rate Limiter system

system design | By Abhishek Kothari | December 28, 2023

A rate limiter service controls the rate of requests reaching the application it protects. A rate limiter can be as simple as DDoS attack protection or as complicated as evaluating user-specific criteria before allowing requests to pass through. In this…

A Comprehensive Guide to Trace Sampling Strategies in Distributed Tracing

observability | By Abhishek Kothari | December 4, 2023

In the dynamic world of distributed systems, gaining insights into application performance often relies on effective trace sampling strategies. Distributed tracing provides a holistic view of transactions across microservices, helping to identify bottlenecks, troubleshoot issues, and optimize overall system health. In this article, we’ll explore various trace sampling strategies, their benefits, and considerations for…

A Comprehensive Guide to Chaos Testing Tools for Kubernetes

kubernetes | By Abhishek Kothari | December 3, 2023

In the intricate world of Kubernetes, ensuring the resilience of applications is paramount. The need for Chaos Testing tools arises from the inherent complexities and uncertainties in distributed systems. In this article, we’ll explore various Chaos Testing tools designed specifically for Kubernetes environments, providing insights into their features, benefits, and how they contribute to…
