Our Target SLA is > 99.9%
When we started Nimstrata, we set out to build a best-in-class architecture in collaboration with the Google Cloud team, with performance equal to or surpassing what leading retailers could achieve with a custom-built solution for Vertex AI Search for Commerce.
Google Cloud's published Retail API SLA is 99.9%, and we aim to exceed that.
Third-Party Dependencies
Because we focus on providing tools and services to connect Shopify to Google Cloud, we cannot influence services that are upstream or downstream of us.
For example, if Google has a significant outage that breaks their search service, it doesn’t matter if our app is working or not, because Google won’t return any results. Similarly, if part of the Shopify platform goes down, we are at the mercy of Shopify to come back online - and this does happen occasionally.
Because of these dependencies, we define unplanned downtime as something within our control that breaks. If we push a bad update or fail to scale appropriately, that’s on us. If Google’s global load balancer or search engine goes down, we don’t count that towards our internal error budgets.
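To put that error budget in concrete terms, here is a quick back-of-the-envelope calculation (a sketch assuming a 30-day month; the constants are ours, not from any Google SLA document):

```python
# Error budget implied by a 99.9% availability target, assuming a 30-day month.
TARGET_AVAILABILITY = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

error_budget_minutes = (1 - TARGET_AVAILABILITY) * MINUTES_PER_MONTH
print(f"Allowed downtime: {error_budget_minutes:.1f} minutes/month")  # 43.2
```

Roughly 43 minutes of downtime per month, and only incidents within our control count against it.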
Holiday Planning
We work with Google Cloud to ensure that we have the necessary capacity to scale our services to meet customer demand - especially through Black Friday and Cyber Monday. Similarly, for heavy users of our service, we proactively contact our customers to ensure that they have individually planned for the QPS, or “queries per second,” that they may send to the Retail API through our storefront service. This allows our biggest customers to create their own relationships with the Google Cloud capacity teams and trust the end-to-end data flow on mission-critical days.
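As an illustration of the kind of estimate we walk through with customers, here is a rough peak-QPS calculation. The function name and all numbers are hypothetical, chosen for the example rather than taken from the Retail API:

```python
def estimated_peak_qps(sessions_per_minute: float,
                       searches_per_session: float,
                       surge_factor: float = 3.0) -> float:
    """Rough peak QPS estimate for capacity planning.

    surge_factor adds headroom for short bursts (e.g. a flash
    sale going live) above the steady-state rate.
    """
    steady_state_qps = sessions_per_minute * searches_per_session / 60
    return steady_state_qps * surge_factor

# Example: 6,000 sessions/min, 2 searches per session, 3x surge headroom
print(estimated_peak_qps(6_000, 2))  # 600.0
```

An estimate like this is the starting point for a conversation with the Google Cloud capacity teams, not a quota request in itself.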
The Technical Details
We started with a focus on separating our microservices based on the number of potential failure points. As in any cloud architecture, every additional service dependency is another opportunity for something to break. Our architecture is only as strong as our weakest link, so we’ve aimed to eliminate the weaknesses wherever possible.
These are the Google Cloud components that comprise the Retail Cloud Connect storefront service, which is responsible for delivering results quickly and reliably on your storefront:
| Service | SLA | Notes |
| --- | --- | --- |
| Retail API | 99.9% globally | This is the core of our service |
| Cloud Run | 99.95% per region | We deploy our services in multiple regions globally |
| Secret Manager | 99.95% globally | We replicate secrets globally |
| Load Balancer | 99.99% globally | This is the same service that sends traffic to Google.com |
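Chaining those SLAs gives a useful lower bound: if every dependency were a single serial point of failure, the composite availability would be the product of the individual SLAs. A quick sketch (this naive serial bound is exactly what our multi-region Cloud Run deployments and globally replicated secrets are designed to beat):

```python
# Naive serial composition of the published SLAs from the table above.
# This is a lower bound; redundancy (multi-region, replication) raises it.
slas = {
    "Retail API": 0.999,
    "Cloud Run": 0.9995,
    "Secret Manager": 0.9995,
    "Load Balancer": 0.9999,
}

composite = 1.0
for sla in slas.values():
    composite *= sla

print(f"Serial lower bound: {composite:.4%}")  # ~99.79%
```

This is why we don't simply run one instance of each service and call it a day: a single-region, single-path deployment could not credibly exceed a 99.9% target.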
Elastic Autoscaling
Our end users don’t notice when we elastically scale our storefront service up and down. This is because we create new instances as soon as we hit 60% capacity on an existing instance and leverage Cloud Run CPU boosting during startup to reduce our cold start time to 2000ms.
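The scaling rule above can be sketched as follows. The 60% target comes from our configuration; the function and the per-instance capacity figure are illustrative assumptions, not Cloud Run API calls:

```python
import math

def target_instance_count(current_qps: float,
                          per_instance_qps: float,
                          utilization_target: float = 0.60) -> int:
    """Instances needed so no instance exceeds the utilization target.

    New instances are requested once existing ones pass 60% capacity,
    so the fleet is sized for per_instance_qps * 0.60 per instance.
    """
    effective_capacity = per_instance_qps * utilization_target
    return max(1, math.ceil(current_qps / effective_capacity))

# Example: 500 QPS against instances rated (illustratively) at 100 QPS each
print(target_instance_count(500, 100))  # 9
```

Scaling at 60% rather than 100% means a new instance is already warming up (with CPU boost shortening its cold start) before any existing instance is saturated.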
Riding Google’s Network
Because we use Google’s global load balancer, retailer traffic enters Google’s network at the Google Front End, or GFE, closest to it. The GFEs manage our global anycast storefront IP address and ensure that all TLS connections are terminated with correct certificates and perfect forward secrecy.
Google doesn’t talk about this much, but published research papers have disclosed the B2 and B4 network infrastructure that Google owns and operates. At a high-level, the B2 network handles public-facing internet traffic, while the B4 network is a private WAN that connects all of Google’s data centers across the planet.
If a retailer using our app already hosts their infrastructure on Google Cloud (as Shopify merchants effectively do, since Shopify runs on Google Cloud), then the traffic sent from the retailer’s storefront (via Nimstrata) to Google’s Retail API never actually leaves Google’s internal (B4) network. While our IPs are publicly accessible, the traffic path is essentially private.
Our storefront service is internet-facing and would traditionally land on the B2 network, but Google’s software-defined WAN detects the destination of our traffic and keeps it on the internal Google backbone. This means that even when the broader internet is struggling with AS-level routing issues or heavy load, our service still provides a premium path to Google Cloud’s Retail API.
Cloud Armor
Because we maintain a set of public IP addresses, we enforce WAF policies at the entry point of our storefront service to protect against DDoS attacks. To balance security and performance, we continuously monitor our logs and alerts to ensure we’re not rejecting legitimate traffic.
Minimizing Latency
We deploy our services in several Google Cloud regions around the world to ensure that storefronts never have to make a high-latency API request to retrieve results, regardless of where your customers are shopping from.
We are constantly assessing our traffic patterns and deployment locations as our customer base expands. We want retailers with a presence outside of the United States, and their consumers, to receive the same experience regardless of where they’re browsing from.
So, what happens if there's an issue?
First, we prioritize communication to our customers. As soon as we're alerted to a severe issue, you'll be the first to know. In fact, we often beat Google and Shopify to their own public outage communications, even if we don't yet know the root cause of the issue.
Then, we determine what's in our control and what's out of our control. If there's something we can fix, we get to work. If we're stuck waiting on Google or Shopify to resolve the issue, we open the emergency communication channels we have established with each company, aim to prioritize Nimstrata's customers, and maintain an open line of dialogue until the issue is fully resolved.
Finally, when the issue is resolved, we run a postmortem to learn how to prevent similar problems in the future and communicate any major changes with our customers.
Summary
I hope this gives you confidence in our platform and in how we run our engineering practices and our business. We take our customers' businesses as seriously as we take our own, and we know our infrastructure is a critical component in your ecommerce technology stack.
If you have any additional questions about our architecture, reliability, or how we scale for peak events - please schedule a demo.