Box Tech Blog - Medium

From Service Metrics to User Reality: Building Meaningful User Availability at Box

Anuraag Shah — Tue, 24 Feb 2026 21:49:32 GMT

Introduction

At Box, one of our core values is to blow our customers’ minds, and high availability is table stakes for doing that in enterprise SaaS. But “high availability” measured how? We measured availability using metrics that answered the question our systems could easily answer: “What percentage of requests succeeded?” The harder question, “What fraction of users were able to complete their intended actions?”, went largely unmeasured.

We measured availability using Critical Service Availability (CSA), a framework that calculated “critical minutes lost” when traffic throughput dropped below baseline. CSA was designed to answer a critical business question: when something goes wrong, how much impact did it cause? By measuring throughput degradation and translating it into minutes lost against a quarterly budget, CSA gave leadership a consistent, quantifiable signal they could act on.

But as Box’s feature set expanded and matured, and interactive workflows became central to how customers used the product, we needed to shift our reliability lens — and we started observing patterns that monitoring wasn’t designed to catch:

Latency degradations without throughput drop: HTTP 200s with multi-second response times kept throughput charts stable, but users were abandoning sessions.
User-clustered failures: Issues affecting a small percentage of requests but a large percentage of users — edge cases in specific workflows — fell below throughput thresholds while causing real pain.
Interactive traffic masked by automation: High-volume integrations, bots, and sync jobs dominated the request-volume signal, making it harder to see problems affecting typical interactive users.

These weren’t flaws in our reliability program — they were new patterns emerging from how the product had grown. We needed a complementary approach: one that measured availability from the user’s perspective.

A New Perspective: What Does “Available” Really Mean?

Consider: a service that’s technically accessible 99.99% of the time but is slow to load, has frequent errors, or fails to complete user workflows isn’t truly “available” in any meaningful sense. Our users don’t care about server health — they care about whether they can successfully upload their files, collaborate with their team, or access their content when they need it. This realization led us to explore meaningful User Availability (UA), inspired by Google’s Meaningful Availability white paper. Instead of measuring what our systems were doing, we decided to measure what our users were actually experiencing.

Defining User Availability

UA answers one operational question: “In this minute, what fraction of active users had a good experience?”

The formula is deceptively simple:

UA = 100 × (1 − Impacted Users / Active Users)

Where, for each one-minute window:

Active User: A distinct user who made at least one qualifying request during the minute
Impacted User: An active user who experienced at least one “impacting” event — a server error, timeout, or response slow enough to be user-breaking

The key design principle is one-user-one-vote: each user counts exactly once per minute, regardless of request volume. A power user generating 1,000 API calls has the same weight as someone making a single upload. UA is intentionally stricter than request-based metrics. It doesn’t average away user pain — if one request out of twenty fails for a user, that user’s minute was bad. It surfaces latency as impact — a slow-but-successful response still degrades experience. And it exposes problems that request metrics hide — an issue affecting 2% of requests might impact 20% of users if those failures cluster on specific user patterns.

Defining Impact: The Hard Part

Deciding what counts as “impact” is where theory meets reality. We needed rules that were:

Defensible: based on observable signals, not intuition
Consistent: applied uniformly across services
Tunable: adjustable as we learned more

Error/exception Impact


Latency Impact

Errors alone miss “slow but successful” responses, which are often the bulk of user pain. We use an Apdex-inspired model to classify latency against a threshold T: Satisfied (response time at most T), Tolerating (between T and 4T), and Frustrated (above 4T).

How we set T (the latency threshold):

We use two methods to determine what constitutes an acceptable latency threshold:

Conservative global baseline: Start with T = 2s (so Frustrated > 8s) as a defensible default
Percentile-derived per-workflow thresholds: Analyze historical p99 and p99.9 latency distributions during stable periods to establish what “normal slow” looks like for each critical workflow

This hybrid approach grounds impact in actual observed behavior (percentiles) while maintaining a consistent, interpretable threshold definition (Apdex 4×T multiplier). Over time, we evolved from a single global T to workflow-specific thresholds — uploads tolerate more latency than API calls.
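As a rough sketch of the two ideas above, the snippet below classifies a response time into Apdex-style bands and derives a per-workflow T from a historical percentile. The function names, the nearest-rank percentile method, and the choice of p99 as the default percentile are illustrative assumptions, not Box's actual configuration.

```python
import math

GLOBAL_T_SECONDS = 2.0  # conservative global baseline mentioned in the text


def classify_latency(seconds: float, t: float = GLOBAL_T_SECONDS) -> str:
    """Apdex-style bands: satisfied <= T, tolerating <= 4T, frustrated > 4T."""
    if seconds <= t:
        return "satisfied"
    if seconds <= 4 * t:
        return "tolerating"
    return "frustrated"


def percentile(samples, pct: float):
    """Nearest-rank percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def derive_threshold(stable_period_latencies, pct: float = 99.0):
    """Hypothetical helper: take T for one workflow from its own history."""
    return percentile(stable_period_latencies, pct)


if __name__ == "__main__":
    print(classify_latency(1.2))   # within the 2s baseline
    print(classify_latency(9.0))   # beyond 4 x 2s
    # Synthetic latencies in milliseconds, for illustration only.
    print(derive_threshold(list(range(1, 101))))
```

A workflow such as uploads would pass its own `t` to `classify_latency`, which is how a single global T can evolve into workflow-specific thresholds without changing the band definitions.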

From Prototype to Production: Building the Aggregation Pipeline

[Diagram: the UA workflow, with the development loop on the left and the production system on the right]

Building UA wasn’t a waterfall process — it was an iterative loop. The diagram above captures our workflow, split into two phases: a development loop on the left and the production system on the right.

In the development loop, Query Updates are changes to how UA is computed — adjusting thresholds, adding signals, fixing edge cases. Those changes go through Validate Query Changes (unit tests, schema checks), then Visualize to inspect the output. If the signal doesn’t match reality, iterate. Once validated, the logic gets promoted into the 3-level aggregation UA Query — our production architecture, optimized for scale — which powers Realtime alerting and ETL for batch reporting.

Validation: Does the Signal Match Reality?

We ran this development loop hundreds of times before UA was production-ready. We prototyped with manual BigQuery queries against tier-1 services — no infrastructure to provision, fast iteration, immediate access to real production data.

The validation questions: Does the data exist? Is the math computable at scale? Does the signal correlate with user pain?

That last question was the real test. We backtested UA against historical incidents — outages we knew had caused user pain, support escalations, customer complaints. Would UA have detected them? How early? With what specificity?

Backtesting revealed both strengths and blind spots. UA caught “slow but successful” incidents that throughput-based metrics missed entirely — latency degradations that generated support tickets but never triggered alerts. It also exposed where our initial thresholds were either aggressive (false positives during normal load) or lenient (missing real degradations). Each failure sent us back through the loop until UA reliably detected the incidents we knew mattered.

The Three-Level Aggregation Query

The core computation transforms raw request events into per-minute UA through three nested aggregation levels.


The key design choices: any-impact semantics (one bad request ruins the user’s minute), one-user-one-vote (a power user counts the same as a casual user), and separated error types (responders need to know if they’re chasing errors vs latency).
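The production query itself is not reproduced here, so the following is only a minimal in-memory sketch of how three nested levels could implement those design choices. It assumes the levels are per (minute, user, feature), per (minute, user), and per minute; that split, the `Request` type, and the `activeUsers` and `impactedUsers` field names are assumptions made for illustration. The `userCodeError`, `userLatencyError`, and `commonUserError` names follow the output fields shown later in this article.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    minute: str
    user: str
    feature: str
    code_error: bool = False     # e.g. a 5xx response
    latency_error: bool = False  # slower than the workflow's threshold


def aggregate_ua(requests):
    # Level 1: collapse requests to one row per (minute, user, feature).
    # Any-impact semantics: a single bad request sets the flag.
    level1 = defaultdict(lambda: [False, False])
    for r in requests:
        flags = level1[(r.minute, r.user, r.feature)]
        flags[0] |= r.code_error
        flags[1] |= r.latency_error

    # Level 2: collapse features to one row per (minute, user),
    # so a user active in several features still votes once.
    level2 = defaultdict(lambda: [False, False])
    for (minute, user, _feature), (code, latency) in level1.items():
        flags = level2[(minute, user)]
        flags[0] |= code
        flags[1] |= latency

    # Level 3: count users per minute, keeping error types separate.
    result = {}
    for (minute, _user), (code, latency) in level2.items():
        row = result.setdefault(minute, {
            "activeUsers": 0, "impactedUsers": 0, "userCodeError": 0,
            "userLatencyError": 0, "commonUserError": 0,
        })
        row["activeUsers"] += 1
        row["impactedUsers"] += int(code or latency)
        row["userCodeError"] += int(code)
        row["userLatencyError"] += int(latency)
        row["commonUserError"] += int(code and latency)
    for row in result.values():
        healthy = row["activeUsers"] - row["impactedUsers"]
        row["ua"] = 100 * healthy / row["activeUsers"]
    return result
```

Stopping after level 1 and grouping by feature instead of by user would give a per-feature view from the same intermediate rows, which is one reason to keep the levels separate.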

Demonstrative Example: How the Three Levels Work

Consider a single minute (minuteX) with the following activity:

Setup:

5 users interact exclusively with Feature1 (10 requests each → 50 requests)
4 users interact exclusively with Feature2 (10 requests each → 40 requests)
1 common user interacts with both features (5 requests to Feature1, 5 requests to Feature2)

Totals:

Feature1: 6 users (5 exclusive + 1 common), 55 requests (50 + 5)
Feature2: 5 users (4 exclusive + 1 common), 45 requests (40 + 5)
Combined: 10 distinct users, 100 requests


The punchline: Request availability is 99%. UA is 90%. Same data, different lens. If this were a real incident affecting 10% of users, request-based alerting might not fire — but UA would.
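The numbers in this example can be reproduced with a short script. The setup (users, features, request counts) comes from the text; which request failed is not stated, so the sketch assumes a single failed request belonging to the common user's Feature1 traffic. That assumption is consistent with the stated 99% request availability and 90% UA.

```python
def availability(bad: int, total: int) -> float:
    """Percentage of units (requests or users) that were not bad."""
    return 100 * (total - bad) / total


# One tuple per request: (user, feature, failed)
requests = []
for i in range(5):
    requests += [(f"f1_user{i}", "Feature1", False)] * 10
for i in range(4):
    requests += [(f"f2_user{i}", "Feature2", False)] * 10
requests += [("common", "Feature1", False)] * 4
requests += [("common", "Feature1", True)]  # assumed: the one failing request
requests += [("common", "Feature2", False)] * 5


def summarize(rows):
    users = {user for user, _, _ in rows}
    impacted = {user for user, _, failed in rows if failed}
    failed_requests = sum(1 for _, _, failed in rows if failed)
    return {
        "requests": len(rows),
        "users": len(users),
        "request_availability": availability(failed_requests, len(rows)),
        "ua": availability(len(impacted), len(users)),
    }


combined = summarize(requests)
feature1 = summarize([r for r in requests if r[1] == "Feature1"])
feature2 = summarize([r for r in requests if r[1] == "Feature2"])

if __name__ == "__main__":
    for name, s in [("Feature1", feature1), ("Feature2", feature2),
                    ("Combined", combined)]:
        print(name, s["users"], "users", s["requests"], "requests",
              round(s["request_availability"], 2), round(s["ua"], 2))
```

Under this assumption the common user counts as impacted in Feature1 and in the combined view, but only once in the combined total, which is the one-user-one-vote rule in action.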

Real Production Data Output Example:

Here’s actual output from our production UA pipeline:


Reading a single row (2026-01-14 23:03:00 UTC):

Users:     418,434 active  →  574 impacted  →  UA = 99.86%
Requests:  4,532,751 total →  958 errors   →  Request Availability = 99.98%

The gap tells the story: request availability says 99.98% (nearly perfect), but UA says 99.86% (574 users had a bad minute). Both numbers are correct — they’re answering different questions.

Breaking down the 574 impacted users:

userCodeError: 16 users hit 5xx errors
userLatencyError: 563 users hit latency thresholds
commonUserError: 5 users hit both

Latency is the dominant source of user pain — 35x more users were impacted by slowness than by errors. Traditional 5xx monitoring would show green dashboards while 563 users waited through frustrating delays. Those requests eventually succeeded, so they don’t register as failures in request metrics. But the users were impacted, and UA surfaces that.
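The arithmetic behind this row is easy to check. The figures below are the ones quoted above; the variable names are illustrative. Users who hit both error types are counted once, so the impacted total is the sum of the two categories minus the overlap.

```python
active_users = 418_434
total_requests = 4_532_751
request_errors = 958

user_code_error = 16      # users who hit 5xx errors
user_latency_error = 563  # users who hit latency thresholds
common_user_error = 5     # users who hit both

# Inclusion-exclusion: do not double count users present in both categories.
impacted_users = user_code_error + user_latency_error - common_user_error

ua = 100 * (active_users - impacted_users) / active_users
request_availability = 100 * (total_requests - request_errors) / total_requests
latency_to_error_ratio = user_latency_error / user_code_error

if __name__ == "__main__":
    print(impacted_users, round(ua, 2), round(request_availability, 2),
          round(latency_to_error_ratio, 1))
```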

Transforming UA into a True Interactive User Signal

Traffic is power-law distributed: a few users, integrations, and automations can generate huge volume. Even with per-user aggregation, we don’t want UA dominated by “users” that are really machines (service accounts, bots, sync jobs, retry storms). We address this through interactive-eligibility filtering:

Interactive-only eligibility (keep automation from masquerading as users)

User-weighting helps, but doesn’t prevent machine actors from skewing the signal. We filter non-human traffic through a small set of explicit, versioned eligibility rules:

Rate gate: Exclude identities exceeding X requests/minute. Legitimate human interaction rarely sustains this rate; most traffic above this threshold is automation.
Identity gate: Exclude known service accounts, test tokens, and requests without reliable user attribution.
Client gate: Exclude machine-to-machine patterns identified via user-agent or registered application ID.
Endpoint gate: Optionally exclude endpoints that are primarily background activity (health checks, sync operations) rather than user-visible workflows.

The goal isn’t perfect classification — it’s a stable, human-representative signal. We accept some false positives (humans occasionally filtered) and false negatives (bots occasionally counted) in exchange for operational simplicity.
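One way to picture the four gates is as a small, versioned rule object applied to each identity's activity in a minute. Everything named below (`EligibilityRules`, `MinuteActivity`, the gate labels, the user-agent prefix matching) is a hypothetical sketch, and the rate limit is deliberately left as a required parameter because the actual value of X is not given in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EligibilityRules:
    version: str                   # rules are versioned like code
    max_requests_per_minute: int   # the "X" of the rate gate; no default
    service_accounts: frozenset = frozenset()
    machine_user_agent_prefixes: tuple = ()
    background_endpoints: frozenset = frozenset()


@dataclass(frozen=True)
class MinuteActivity:
    user_id: Optional[str]   # None when attribution is unreliable
    request_count: int
    user_agent: str
    endpoints: frozenset


def exclusion_reason(activity: MinuteActivity,
                     rules: EligibilityRules) -> Optional[str]:
    """Return the gate that excludes this identity, or None if eligible."""
    if activity.user_id is None or activity.user_id in rules.service_accounts:
        return "identity_gate"
    if activity.request_count > rules.max_requests_per_minute:
        return "rate_gate"
    if any(activity.user_agent.startswith(prefix)
           for prefix in rules.machine_user_agent_prefixes):
        return "client_gate"
    # Exclude only identities whose whole minute was background activity.
    if activity.endpoints and activity.endpoints <= rules.background_endpoints:
        return "endpoint_gate"
    return None


def coverage(activities, rules: EligibilityRules) -> float:
    """Fraction of identities that pass every gate."""
    activities = list(activities)
    if not activities:
        return 0.0
    eligible = sum(1 for a in activities if exclusion_reason(a, rules) is None)
    return eligible / len(activities)
```

Returning the name of the gate, rather than a bare boolean, makes it possible to track how many identities each rule removes, which helps when investigating over-filtering.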

We monitor two key health indicators:

Coverage: What fraction of traffic/users pass eligibility filters? Unexpected drops suggest over-filtering of human users.
Incident correlation: Do known outages that affected humans show up in UA? If not, our filters may be hiding real problems.

We review coverage quarterly, investigate any significant drift, and adjust thresholds when we find blind spots, so that UA remains representative of actual user experience.

Key architecture decision records

Before implementing UA at scale, we made several foundational architectural choices. Each involved tradeoffs — we document them here as decision records to explain the rationale and constraints that shaped the system.

Decision 1: Server-Side Events Over Client Telemetry

Trade-off: Client-side telemetry is the gold standard for user experience — it captures the “last mile” of user experience (network latency to the client, rendering time) that servers never see. But client instrumentation required coordinating changes across multiple teams, platforms, and release cycles — a multi-quarter effort before we’d have usable data.

Our choice: Use server-side request logs as the primary signal source. This gave us immediate coverage across all services without cross-team coordination, at the cost of missing “last mile” experience.

What we’d do differently: Start client instrumentation earlier, in parallel with server-side UA. We now have user experience visibility gaps that will take quarters to close.

Decision 2: Per-Minute Granularity

Trade-off: Finer granularity (sub-minute) enables faster detection but increases computational cost and alert noise. Coarser granularity (5+ minutes) reduces cost but delays incident detection unacceptably.

Our choice: One-minute windows. Fast enough for operational alerting, coarse enough to be stable, aligned with how responders think (“problems started around 10:42”).

What we learned: One minute is near-optimal for incident detection but too noisy for executive reporting. We aggregate to hourly/daily for leadership dashboards.

Decision 3: Dual Pipeline Architecture

We run two parallel pipelines because “fast” and “accurate” have different requirements:

Real-time pipeline: A custom service queries the data warehouse every minute, computing near-real-time UA for alerting. Accepts some inaccuracy from late-arriving data in exchange for speed.

Batch pipeline: Scheduled jobs reprocess with full data completeness for historical reporting. Produces the authoritative record.

Why both? Real-time UA told us an incident was happening; batch UA told us the true impact for postmortems and SLA reporting. Attempting to serve both needs from one pipeline would require compromises that degraded both use cases.


What Didn’t Work

Our initial latency thresholds needed iteration. We started with a uniform “frustrated” threshold across all workflows based on industry guidance. It took three months of tuning — reducing false positives from legitimately slow bulk operations and catching false negatives from latency regressions in fast-path APIs — before we arrived at workflow-specific thresholds. Lesson: prototype with real incident data before committing to threshold values.
We underestimated pipeline reliability requirements. UA depends on data pipelines that must themselves be highly available. Early on, we had incidents where data lag caused UA to report phantom degradations, sending responders on wild goose chases. We eventually invested ~30% of engineering effort on pipeline monitoring, alerting, and graceful degradation — far more than initially planned.
Cross-team alignment required as much investment as the technical work. Getting service teams to adopt UA-based alerting meant building shared understanding and trust. Teams naturally had questions about a new metric computed by pipelines outside their ownership. We invested time in explaining the methodology, walking through incident backtests together, and incorporating their feedback. Successful adoption depends on partnership, not just code.
Our eligibility filters had blind spots. We encountered edge cases where legitimate enterprise users were filtered out by rate gates (power users doing bulk operations) and cases where sophisticated bots passed all filters. Rather than pursuing perfect classification, we focused on achieving representative accuracy with monitoring in place to catch drift over time.

Operational Lessons

Build confidence signals into every dashboard. Every UA chart should answer: “Can I trust this data right now?” We display data freshness, pipeline health indicators, and known issues inline. Stale or unreliable UA is worse than no UA — it erodes trust in the metric.
Plan for graceful degradation. A reliability metric that fails during incidents defeats its purpose. We built resilience into UA: rolling query windows capture late-arriving data, backpressure delays processing rather than dropping events, and batch pipelines provide fallback when real-time fails. Stale but accurate UA is better than no UA during an incident.
Version your rules like code. Impact definitions, eligibility filters, and latency thresholds are configuration that changes over time. We treat them as code: version-controlled, reviewed, and auditable.
Monitor the monitor. We track query execution latency, metric emission success rates, and data completeness. UA that fails silently is a reliability liability.

Results and Impact

Since deploying UA:

Faster incident detection: With UA, we improved our median detection time for incidents (experience degradations) from 10 minutes to under 5 minutes.
Improved customer trust: UA provided rich, multi-dimensional customer segmentation that strengthened incident response. By analyzing UA across customer behaviors, automation patterns, and severity levels, teams quickly isolated impact sources, enabled proactive outreach, and engaged affected customers during incidents with clear, data-driven context and remediation timelines — reducing escalations and building trust.
Early warning system: Anomaly detection caught latency-induced degradations before they reached incident severity, enabling intervention before customer impact spread.
Data-driven performance investment: UA analysis identified that 40–50% of tail latency could be addressed through cache-backed APIs and batched calls. Acting on this insight, targeted optimizations reduced p99 latency from 4s to under 800ms for key workflows, with UA data continuing to inform which optimizations deliver the greatest user impact.
Organizational alignment around user experience: UA became the primary health indicator across Box features, embedded into change control policy. Significant UA degradation now gates deployments and triggers incident review. This shifted the conversation from “did something break?” to “how many users are affected?” — giving leadership a single metric that reflects actual customer experience.

Call to Action: How to Start Your Own User Availability Journey

If you’re thinking about adopting a user-centric reliability metric at your organization, the good news is you don’t need a perfect platform to start. You need a clear definition, consistent instrumentation, and a willingness to iterate.

Start with 1–2 critical workflows, not “all traffic.” Build trust in the methodology before expanding scope. All-traffic UA is appealing but masks signal quality issues.
Write down your contracts explicitly. For each workflow: what makes a user “active”? What makes them “impacted”? What’s excluded and why? Ambiguity in definitions creates confusion during incidents.
Backtest against known incidents. Before going operational, run UA against historical data for incidents you know caused user pain. If UA doesn’t detect them, your definitions are wrong.
Run in shadow mode before alerting. Parallel-run UA alongside existing monitoring for several weeks. Investigate every discrepancy. Tune thresholds based on false positive/negative analysis.
Budget heavily for operational tooling. The aggregation logic is maybe 20% of the work. Pipeline reliability, monitoring, dashboards, and runbooks are the other 80%.

What We’d Do Next

Client-side integration: Server-side UA misses rendering failures, network issues, and UI bugs. Integrating client telemetry will give us true end-to-end visibility.
Broader coverage: We’re currently tracking tier-1 workflows. Expanding to long-tail features requires investment in automated onboarding.
Latency-driven optimization: UA has surfaced which tail-latency hotspots impact the most users. Next step is using this data to prioritize performance investments — converting reactive “fix the slow thing” into proactive “improve the experiences that matter most.”

Conclusion

The shift from “what percentage of requests succeeded?” to “what fraction of users had a meaningfully bad minute?” sounds like a small reframing. In practice, it required rethinking our data pipelines, impact definitions, and organizational processes.

The technical implementation — three-level aggregation, per-user-per-minute windowing, eligibility filtering — is relatively straightforward. The hard parts were: defining impact thresholds that balanced sensitivity with stability, building pipelines reliable enough to trust for paging, and socializing a new metric across teams accustomed to request-based thinking.

UA isn’t a silver bullet. It has blind spots (users who give up before making requests, client-side failures we don’t observe), operational overhead (pipelines to maintain, filters to tune), and complexity (more configuration than simple request availability). But it answers the question that actually matters: “Are our users having a good experience?”

That question turns out to be worth the complexity.

References and acknowledgments

Our approach builds on foundational research in user-centric availability measurement:

Google’s “Meaningful Availability” (2020): Hauer et al.’s work on counting users rather than requests was the primary inspiration for UA. Their insight that “availability is not about request success rates; it’s about user success rates” fundamentally shaped our design philosophy. [Read the paper](https://research.google/pubs/meaningful-availability/)
Site Reliability Engineering literature: Concepts from Google’s SRE books — particularly error budgets, SLO-based alerting, and the distinction between SLI/SLO/SLA — informed how we operationalized UA for incident response and executive reporting.
Apdex (Application Performance Index): The Apdex standard’s approach to categorizing response times (Satisfied/Tolerating/Frustrated) provided a proven framework for translating latency into user experience impact. https://en.wikipedia.org/wiki/Apdex
Mentors: I would like to credit the mentorship of Tapas Kumar Mohapatra and Sergio Aguilar for this work. They consistently asked the hard questions, challenged my assumptions, and pushed me to go further, enabling me to take this work beyond what I could have done alone.

From Service Metrics to User Reality: Building Meaningful User Availability at Box was originally published in Box Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

SLO out of the Box

ssuresh — Fri, 23 Jan 2026 22:41:58 GMT

Written by Suhas Suresh and Mikey Phun

Illustrated by Navied Mahdavian / Art directed by Erin Ruvalcaba Grogan

Box started as a scrappy startup with a simple goal: make it easier for businesses to store and share their content. But as our customers’ needs evolved, so did we — transforming from an online storage and collaboration tool into a comprehensive content management platform, and ultimately into the Intelligent Content Management solution we are today. Fast forward to now, and we support more than 100,000 companies worldwide, managing exabytes of critical data — equivalent to billions of hours of HD video — for some of the largest organizations in the world.

Each major shift brought new challenges. The move to public cloud forced us to rethink scaling and security. Microservices improved how we built software but fragmented our visibility. Mobile users demanded consumer-grade experiences with enterprise-level security.

These shifts exposed a critical gap: we lacked a uniform way to accurately measure and assess service reliability. Engineering and Operations tracked different metrics across separate dashboards, making it challenging to efficiently answer questions about availability, performance, and customer impact. Without shared standards, incident response and root cause analysis were delayed. We needed a unified framework to define, measure, and improve customer experience, which ultimately led us to implement Service Level Objectives (SLOs) across Engineering.

What are SLOs and why they matter at Box

Service Level Objectives (SLOs) are measurable goals for reliability — think of them as internal “user happiness” scores. Each SLO combines:

Service Level Indicators (SLIs): Metrics like latency, error rate, or uptime
Target: A specific goal (e.g., 99.9% success rate)
Time window: The measurement period (e.g., 28 days)
Error budget: The difference between 100% and the SLO over a period. It represents the permitted proportion of time the service may be unavailable or outside the SLI threshold without breaching the SLO.

Example SLO: “99.9% of API requests complete within 300ms over a 28-day window.” Error budget = 100% − 99.9% = 0.1% of the window. Expressed as time, 0.1% of 28 days ≈ 40.3 minutes of allowable downtime (0.1% of a 30-day month ≈ 43.2 minutes).
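To make the error budget arithmetic concrete, here is a small sketch that converts a target and a window into a time-based budget. The function name is illustrative. Note that the result depends on the window length: 0.1% of a 28-day window is about 40.3 minutes, while 0.1% of a 30-day month is about 43.2 minutes.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of the window that may fall outside the SLI threshold."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    print(round(error_budget_minutes(99.9, 28), 2))  # 28-day window
    print(round(error_budget_minutes(99.9, 30), 2))  # 30-day month
```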

These internal targets build on each other: Service Level Indicators (SLIs) form the foundation for Service Level Objectives (SLOs), which collectively help us meet our Service Level Agreements (SLAs) with customers.

As Box evolved from on-premises to cloud, each team built custom monitoring for their specific needs. Without a unified approach to measuring reliability, teams fell into a reactive pattern: focus on features until a major incident forced everyone to shift attention to firefighting. Once the crisis passed, teams returned to feature work — only to repeat the cycle when the next issue emerged. This pendulum swing between features and reliability meant that reliability never improved systematically.

SLOs break this cycle by giving service owners a proactive indicator of when to shift focus. Instead of waiting for customer complaints or incidents, teams can see degrading reliability trends early and address issues before they escalate — making the shift from reactive firefighting to intentional, data-driven prioritization.

To better protect the customer experience, our newly formed SRE organization embarked on a journey to create an SLO program with a goal to provide a unified standard for measuring and ensuring the reliability of Box’s most critical applications. This framework not only gave us consistent reliability metrics across all services but also enabled teams to make data-driven decisions about where to invest engineering effort.

Initial Roadblocks

Despite the clear need for a unified method to objectively quantify scores across all services, implementing SLOs across Box wasn’t straightforward. We encountered three significant challenges that put the initiative at risk.

Technical Complexity

Creating a unified SLO solution revealed unexpected complexity. Each team had built custom monitoring over the past decade, creating a maze of incompatible tools and metrics. Previous SLO attempts had failed by being either too rigid (one-size-fits-all approaches that didn’t fit specific team needs) or too flexible (no standards at all, leading to chaos).

We needed to find the sweet spot between standardization and customization — a challenge that proved more difficult than anticipated.

Organizational Resistance

The biggest challenge was cultural, not technical. Teams pushed back with variations of “Our monitoring works fine — why change?” They saw SLOs as extra work competing with feature development priorities.

But the real concern ran deeper: SLOs would surface when their service availability suffered due to issues with dependencies they didn’t control — and they’d be held accountable for failures that weren’t their fault. Many teams were hesitant to prioritize implementing SLOs for services due to bias towards existing monitoring, unconvinced that the effort would justify the investment.

Knowledge Gaps

The SLO project exposed fundamental gaps in how teams understood and measured reliability. The problem was two-fold: teams lacked an understanding of what SLOs were and how to implement them properly, and they had never taken a proactive approach to measuring service reliability, which made it difficult to configure the right measurements. We had to teach teams that a low SLO score was not about blame; it was an objective, accuracy-focused measurement that helped identify problems in both their services and their dependencies. We spent considerable effort simplifying how teams get started with SLOs and creating better training materials, but adoption remained slow. Progress required not only making the setup process easier but also shifting mindsets company-wide.

Breaking Through and Scaling: Our Journey from Resistance to Adoption

The Catalyst

The turning point came when several high-visibility teams needed to improve reliability after customer escalations. Their timing aligned perfectly with our newly simplified SLO adoption process. Here’s how momentum built:

Building Early Momentum

We recognized early on that successful adoption required a dual-pronged approach: engaging directly with early adopter service teams from the bottom up to deliver immediate, value-driven examples they could champion — while simultaneously demonstrating business-level benefits to leadership from the top down, empowering them to drive adoption across their organizations.

Early wins with critical teams: We partnered with teams managing customer-critical infrastructure to create their first SLO dashboards. Within a short time, they could pinpoint performance issues that had been invisible before. They now had an objective signal they could act on and track over time to see progress.

Success Story: In one of the SRE engagements, we partnered with a dedicated team focused on improving our critical database access infrastructure. By leveraging SLOs, they achieved remarkable improvement from 95% to 99.98% availability over time, which improved overall availability of almost all user-facing applications as a result. The team rigorously monitored their SLO to identify critical gaps in reliability and invested in long-term projects to address them, ultimately achieving drastic improvements over time.

Executive visibility: Leadership embraced SLOs once they saw unified reliability metrics across all services. For the first time, they had objective data to prioritize areas of investment to improve performance.

The snowball effect: Once the most business-critical services adopted SLOs, it became easier to convince others. Teams saw their dependencies using SLOs and realized they needed the same visibility. Today, new services adopt SLOs without requiring SRE consultation.

Technical Foundation

Real-time SLI Bottleneck

We chose to reuse existing alert mechanisms as SLIs so teams wouldn’t have to rebuild their observability from scratch for this effort. Conceptually, using alerts as SLIs would work well — engineers could reuse existing alerts as SLI definitions. However, using raw alerts proved impractical due to the cost and slowness of real-time scans at scale, especially over long time windows.

The Scale Challenge: Assume the API component in the expression below receives 30,000 requests per minute and the system records one sample per minute (60 samples per hour):

1 hour: 60 samples × 30,000 requests = 1,800,000 samples

24 hours: 1,440 samples × 30,000 requests = 43,200,000 samples

sum(rate(http_requests_total{job="api"}[5m]))

Processing that many samples in real time becomes computationally expensive. Most of our initial attempts to calculate an SLI availability score failed due to the excessive scale of these queries for larger time windows.
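Extending the same back-of-the-envelope arithmetic to an SLO-sized window shows why on-demand scans break down. The sketch below follows the article's simplified model (one sample per minute, 30,000 requests per sample); the function name is illustrative.

```python
REQUESTS_PER_MINUTE = 30_000  # figure used in the example above


def samples_to_scan(window_minutes: int,
                    requests_per_minute: int = REQUESTS_PER_MINUTE) -> int:
    """Samples a real-time query touches under the simplified model."""
    return window_minutes * requests_per_minute


if __name__ == "__main__":
    for label, minutes in [("1 hour", 60),
                           ("24 hours", 24 * 60),
                           ("28 days", 28 * 24 * 60)]:
        print(f"{label}: {samples_to_scan(minutes):,} samples")
```

Under this model a 28-day window, a typical SLO measurement period, implies more than a billion samples per evaluation.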

Solution

Instead of computing SLIs on demand, we continuously update compact, indexed aggregates (pre‑computed counters and time‑windowed metrics) from incoming alerts. Dashboards and SLI calculations read these stores, while raw alerts remain available for debugging and audit.

System workflow (at-a-glance)

Metrics/alerts are ingested as before.
A policy evaluates incoming alerts and updates pre-aggregated counters/time windows in near real time.
SLI reads use the aggregated stores; links to raw alerts support deep dives.
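The workflow above can be sketched as a tiny in-memory store. This is an assumption-laden illustration, not the internal system: it models each alert evaluation as a boolean (firing or not), keeps one pair of counters per minute, and answers SLI reads by summing those counters instead of rescanning raw events. The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Optional


class PreAggregatedSli:
    """Per-minute counters updated at ingest time, read at query time."""

    def __init__(self) -> None:
        # minute index -> [total evaluations, evaluations where alert fired]
        self._buckets = defaultdict(lambda: [0, 0])

    def record(self, epoch_seconds: int, firing: bool) -> None:
        """Called by the ingest policy for every alert evaluation."""
        bucket = self._buckets[epoch_seconds // 60]
        bucket[0] += 1
        bucket[1] += int(firing)

    def availability(self, start_epoch: int,
                     end_epoch: int) -> Optional[float]:
        """SLI over [start, end): cost scales with minutes, not raw events."""
        total = bad = 0
        for minute in range(start_epoch // 60, -(-end_epoch // 60)):
            if minute in self._buckets:
                t, b = self._buckets[minute]
                total += t
                bad += b
        if total == 0:
            return None  # no data recorded in the window
        return 100 * (total - bad) / total
```

Because reads only see the counters, a score recorded for a past minute cannot be recomputed under a new definition unless the raw alerts are replayed through `record`.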

Key Benefits

Faster reads and predictable scalability
Preserved alert semantics and traceability back to raw events

Trade-offs

We lose the ability to change past events (recorded scores become immutable)

Phased Rollout Strategy

Introducing a new process requires clear thinking and planning in an enterprise setup. We realized quickly that mandating SLOs would not lead to adoption. Instead, we took advantage of a fortunate alignment: several high-visibility teams had already prioritized improving their reliability when the SLO initiative launched. We partnered with these teams to demonstrate value, using their feedback to champion SLOs within Box.

We saw incremental adoption as we progressed from one phase to another. Interestingly, newer teams were more eager to adopt SLOs as they had no existing monitoring and were easily sold on the value proposition.

Phase 1 — Proof of Concept (1–2 teams, 1–2 weeks)

Validate SLI Quality and Signal-to-Noise Ratio
During this phase, the focus was on ensuring that our Service Level Indicators (SLIs) accurately reflected user experience and system health. We assessed whether the selected metrics provided meaningful signals without generating excessive noise from irrelevant or low-impact events. This validation was critical: poorly chosen SLIs would lead to alert fatigue or missed incidents.

Iterate Weekly Based on Feedback

The pilot was inherently iterative. We planned to hold weekly retrospectives with participating teams to review what was working and what wasn’t. We used this feedback loop to refine SLI definitions, adjust measurement windows, tweak error budget policies, and improve alerting thresholds. Rapid iteration during the pilot ensured that by the time we scaled to more teams, our SLO framework was well-tested and aligned with real-world operational needs.

Phase 2 — Hands-On Rollout

We expanded SLO adoption to critical customer-impacting feature teams by providing dedicated support that included standardized dashboards, detailed runbooks, and hands-on training sessions. We established a continuous improvement loop by systematically collecting usage data and incident feedback from these teams. During post-incident reviews, we verified whether at-fault services actually breached their defined SLOs, and when no breach was detected despite customer impact, we refined the SLO indicators to better reflect real-world reliability expectations.

Phase 3 — Standard Template

We published reusable SLO templates, with dashboards and alerts defined as declarative manifests, through our internal observability-as-code framework. This approach enables version-controlled, reproducible monitoring configurations that can be deployed consistently across services and teams. We complemented the templates with best practices that helped teams adopt service level objectives more consistently across the organization. By piloting these resources with minimal coaching, we tested whether our approach could scale without extensive hands-on support for every implementation.

Phase 4 — Self-Service Platform

As services grew, we automated the creation, validation, and provisioning of SLOs for all applications at Box via self-service platforms. The system provided on-demand documentation and lightweight training to help teams get up to speed quickly. We continuously monitored long-term coverage and reliability metrics to ensure our services met their objectives.

Making It Usable

SLOs must be readable and actionable to drive meaningful behavior change. Our dashboard design focuses on 4 key principles:

Clear Visibility: We present an explicit SLO score that teams can understand at a glance, showing current health and trend over time.
Composite View: A composite-SLO view that logically aggregates SLI breaches while avoiding duplicate error-budget counting across related services.
Actionable Drill-downs: Clear navigation from high-level SLO scores to pre-aggregated metrics and raw alerts for audit or deep-dive analysis.
User Journey Mapping: We provide clear descriptions of which customer experience is being measured by each SLI — these are called “User Journeys” and help teams understand the business impact of their metrics.

Fig — All services SLO showing Box wide SLO breach view

A service-wide SLO view gives instant, global reliability status across all services. Benefit: rapid situational awareness — see what’s healthy, at risk, or breaching, and who’s burning error budget fastest. In an incident, use it to gauge blast radius, prioritize the highest-impact services, and drill down to affected dashboards and alerts for root cause.

Fig- Service SLO dashboard

A service SLO dashboard shows a single service’s SLO score and trend, its error‑budget status, and which SLIs are breaching or at risk, with quick drill‑downs to pre‑aggregated metrics and raw alerts for root‑cause analysis.

Fig — Composite SLO calculation

A composite SLO calculation combines multiple SLIs into one logical SLO, aggregating breaches while avoiding double‑counting shared failures. It yields a single, clear reliability score for an end‑to‑end user journey, with traceability to the underlying indicators.
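One way to avoid double-counting is to take the union of breached minutes across the SLIs of a journey, so that a minute in which two related services both breach costs the error budget once. The sketch below assumes that approach; the function names and the example SLI names are hypothetical, and the actual aggregation logic behind our dashboards is not shown here.

```python
from typing import Dict, Set


def naive_bad_minutes(breaches: Dict[str, Set[int]]) -> int:
    """Sum per-SLI breach minutes; a shared failure is counted once per SLI."""
    return sum(len(minutes) for minutes in breaches.values())


def composite_bad_minutes(breaches: Dict[str, Set[int]]) -> int:
    """Union of breach minutes; a shared failure is counted once overall."""
    return len(set().union(*breaches.values()))


def composite_slo(breaches: Dict[str, Set[int]], window_minutes: int) -> float:
    """Fraction of the window in which no SLI of the journey was breaching."""
    return 1 - composite_bad_minutes(breaches) / window_minutes


# Minute offsets within a one-day window; minutes 11 and 12 are a shared failure.
breaches = {
    "upload-api": {10, 11, 12},
    "storage-backend": {11, 12, 13},
}
print(naive_bad_minutes(breaches), composite_bad_minutes(breaches))
print(composite_slo(breaches, window_minutes=24 * 60))
```

Keeping the per-SLI sets alongside the composite score preserves traceability back to the underlying indicators.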

Cultural Transformation

As mentioned earlier, the main challenge we had to tackle was creating a culture of SLO-driven decision making for application-owning teams. This transformation happened over the course of several years, with the SRE organization consistently promoting adoption by demonstrating Engineering value.

Making Reliability Accessible: Our adoption strategy focused on three key principles:

Approachable: Simplified onboarding and clear documentation
Observable: Transparent metrics that teams could understand and act upon
Blameless: Focus on system improvement rather than individual accountability

Tactical Approaches:

Office Hours: SREs held regular sessions to answer questions and provide hands-on support
Reusable Templates: Standardized SLO configurations that teams could easily adapt
Executive Reporting: Trend reports that tied SLO health directly to business priorities
Success Stories: Early wins from critical teams were showcased as case studies across the organization

Breaking Down Barriers: Leadership visibility was crucial in removing political barriers and normalizing SLO discussions in planning sessions and incident postmortems. This shifted the conversation from blame assignment to measurable improvement, fundamentally changing how teams approached reliability.

Conclusion

Google’s SRE book makes SLOs sound like an obvious choice — something every organization should adopt to operate efficiently. But the reality is more nuanced. Every company faces its own unique challenges when implementing SLOs successfully.

Our SRE organization spent several years building a culture where SLOs became part of everyday engineering conversations. It wasn’t a quick win. The initiative required significant time, resources, and persistence to gain traction. We went through multiple attempts before we found the right value proposition that resonated with the broader engineering team.

While SLO implementation at scale presents technical challenges, the greater hurdle was cultural: establishing SLOs as a valuable engineering metric that teams genuinely prioritize.

Lessons Learned

Timing matters: Introducing SLOs is far easier in a company’s early stages. Once organizational culture is established, driving that kind of change becomes significantly harder.
Demonstrate value to gain adoption: Show concrete benefits and quick wins to build momentum and buy-in across teams.
Centralized solutions need consistent foundations: Deploying a uniform, template-based solution across teams is challenging when the underlying data (like metrics) lacks consistency. There are tradeoffs between simplicity of adoption and accuracy — sometimes standardization means accepting less precision in exchange for broader usability.
Cultural change is a marathon: Bringing meaningful change takes time and persistence. Patience and consistent effort eventually pay off, but expect the journey to take years, not months.

What’s next?

Even though SLOs have seen broader adoption within Engineering, we are still working to improve coverage, as not all applications have adopted them. We are also working towards automation that regularly validates the data quality and correctness of an application’s SLOs, creating a feedback mechanism for taking action.

SLO is a continuous process and needs a culture of reliability and blamelessness to succeed.

Acknowledgments

We would also like to acknowledge Noah Gorka and Wojciech Krysmann for their valuable feedback on the content of the blog post.

SLO out of the Box was originally published in Box Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building your first AI product: A practical guide

Shubhro Roy — Tue, 01 Jul 2025 14:52:37 GMT

Illustrated by Navied Mahdavian / Art directed by Erin Ruvalcaba Grogan

AI is no longer a moonshot. Every company, small or large, is trying to figure out how to create value for their customers using AI. They must do this to adapt to the tectonic shift that is occurring in how consumers and enterprises operate. At the same time, new startups are emerging that build tools and products that leverage AI to solve existing problems or simplify workflows. But too many companies make the same mistakes when building their first AI product: oversized teams, an unclear roadmap / product vision, complex architectures, and no clear way to evaluate progress.

In this article, I will share some hard-earned lessons from my experience as a senior engineering leader within Box’s AI organization, where I helped build Box AI — our flagship generative AI product — from the ground up. If you’re embarking on building your first AI product or assembling your initial AI team, these insights are for you.

Anchor to a Hero Usecase

Pick a painful real-world problem that can be meaningfully solved using AI. It is important to spend time finding the first hero use case because it sets the AI engine in motion. The success of this first use case determines a lot of things, including further business investment in this area, expansion of the team and roadmap, and most importantly team morale. I have come across many instances where everything was done right to solve the wrong use case, leading to low adoption / lack of impact on business metrics. Here are some strategies that worked for us when identifying the first use case:

Conduct customer interviews to understand how they are already using AI and where their current pain points are that AI can help with
Run internal surveys within your company (works better for mid-size / large companies) to understand internal workflows that can be simplified with AI
Perform a market survey to understand how similar products or companies are innovating using AI. This can either help to find the right product-market fit or avoid building a redundant product.

Once you have identified a use case, find an early adopter or design partner. This can be a bullish customer or an internal team that is willing to dogfood. Treat them as collaborators, not just testers. This speeds up real customer feedback cycles, allowing you to quickly discard unfavorable approaches rather than waiting until the alpha / beta stages. For Box AI we had both, and it was immensely helpful. Now we are doing the same with AI Agents.

Lastly, it is critical to connect any new product (especially AI products) to relevant business metrics. For startups / entirely new product lines this could simply be WAU / MAU. But in some other cases AI products could improve operational efficiency. For example, if you are building a customer success AI chatbot for your product, you should measure the % reduction in customer tickets filed or the mean time to resolution for filed tickets. Every AI product is different, and the business metric should correctly reflect the value proposition of applying AI to the problem space. Incorrect business metrics, or a lack thereof, can lead to late-stage product failures or issues with justifying further investment or scope expansion.

Start with a small team

When we started building out Box AI a few years back, we kicked off with 4 engineers borrowed from other teams and a product manager (part time, from another team). That’s all we needed to build out the first MVP. This requires a few things:

Identify a single hero usecase
Clearly define the MVP requirements both on the product and engineering side
Reuse existing infra where possible
Keep it simple

In our case we picked single-document question answering as our first hero use case and scoped it down to only small documents that fit within the context window of GPT-3.5 (the SOTA model at that time). It was limited but sufficient to prove out the value proposition to our design partners.

Having a small, laser-focused team makes it easy to navigate changing priorities initially and to optimize for speed and alignment. Don’t worry about headcount until you have had at least your first customer Beta. The smallest AI tiger team needs only the following roles:

1 product minded tech lead / eng manager
1–2 10x engineers with ML exposure (no, you don’t need to hire an ML engineer yet)
1 full-time / part time product manager with background in AI / ML / Search / Recommendations or related fields
1 UX Designer

At a small startup some of these roles can also be merged or played by the co-founders.

The most important thing to ensure with such tiger teams is maintaining execution velocity. We achieved this with a culture of weekly demos: build, demo, gather feedback and iterate and measure constantly. This brings me to my next advice.

Define Metrics and build evaluation sets early

Evaluation is the key to building any AI / ML product. When you use an LLM to generate a response / action, you need to measure whether it correctly satisfied the user’s informational need. There are two types of evaluation:

Offline Evaluation: where you validate your approach against ground truth data typically generated / validated by humans offline.
Online Evaluation: where you validate the performance of your approach on unseen data where ground truth is not available. Typically this is done using LLM-assisted grading approaches with some sampled human-in-the-loop validations.

As you start building out your product, you will first need an offline evaluation dataset. Leverage your early adopters to build this. Initial evaluation sets do not need to be extensive and definitely don’t need LLM-assisted approaches. In fact, I would suggest staying away from AI-assisted dataset generation for offline evaluation initially, as this can bake in hallucinations early on. A few hundred examples with human judgements go a lot further initially than thousands of AI-generated examples that are hard for humans to validate. For Box AI, we started gathering our initial evaluation dataset using a Google form and a few teams within the company that had strong examples for our initial hero use case. Depending on the size of your company, crowdsourcing the initial dataset can be very effective.

Fig 1: AI Evaluation Flow

Word of Caution: Public datasets are available for common AI tasks such as summarization and QA. But the data may not be reflective of what your customers have. For example Box customers use AI on enterprise documents: contracts, company manuals, medical images etc. Hence if we tried to evaluate using Wikipedia based QA datasets it would be ineffective. Hence in most cases having your own representative dataset can be instrumental.

Once you have started gathering your dataset, define your evaluation strategy and metrics. Evaluation metrics can be of three types:

Standard ML metrics: Precision, Recall, F1-score, AUC/ROC etc
Generative AI metrics: Perplexity, ROUGE, BLEU etc
Human Evaluation metrics: Correctness, coherence, conciseness etc

Depending on the task you are applying AI to, one or more of these metrics can be used to measure how your approach is working. Using standard industry metrics ensures that you can benchmark your scores against others and share this data publicly or with your customers to build confidence and trust in your product.

Lastly, establish a continuous evaluation strategy. Evaluate against your benchmark dataset every time you:

Modify your system prompt for the LLM
Add or modify processing steps in your pipeline, such as RAG or tool use
Change the LLM, whether adopting a newly released SOTA model or fine-tuning an existing one
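A continuous evaluation loop of this kind can start very small. The sketch below is an illustration, not the Box AI evaluation pipeline: it scores answers with a token-overlap F1 (a common QA metric), averages over a human-labelled dataset, and only accepts a change if the score does not regress. The function names, the stub answer function, and the two example records are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(answer_fn: Callable[[str], str], dataset: List[Tuple[str, str]]) -> float:
    """Mean token F1 of answer_fn over (question, human reference) pairs."""
    scores = [token_f1(answer_fn(question), reference) for question, reference in dataset]
    return sum(scores) / len(scores)


def passes_gate(candidate: float, baseline: float, min_gain: float = 0.0) -> bool:
    """Accept a prompt, pipeline, or model change only if it does not regress."""
    return candidate >= baseline + min_gain


# Hypothetical ground truth collected from early adopters.
dataset = [
    ("When does the contract end?", "contract ends 2025"),
    ("Who signed the agreement?", "Jane Doe"),
]


def stub_answer(question: str) -> str:
    """Stand-in for a call into the real pipeline."""
    return "the contract ends in 2025" if "contract" in question else "Jane Doe"


print(evaluate(stub_answer, dataset))
```

Running the same function before and after each change gives a baseline and a candidate score to feed into the gate.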

Start Simple and add complexity only when the metrics justify it

The AI architecture for your first product does not need to be complex, at least not initially. Based on your hero use case and the product requirements for your MVP, figure out the simplest engineering architecture that meets them.

For Box AI we initially started with 2 services:

Intelligence: This service exposed external APIs for the Front End to integrate with and performed permissions checks to ensure the user had access to the content they were using for their AI needs.
llm-gateway: This service served as the interface with LLM providers like OpenAI, Anthropic etc. This service also managed the prompts for our initial use-case and performed online grading of answers returned by the models using an LLM based grader.

This initial setup allowed us to quickly iterate and gather feedback from customers. As we added features like RAG, conversation history, and citations, we revisited this architecture, splitting up the core services where it made sense and adding components like vector stores and an indexing pipeline.

Fig 2: Simple AI Architecture

The table below provides a quick guide to common aspects of AI architecture and when to consider them in your journey to build the first AI product:

Table 1: AI Architecture Decision Table

Once you have a baseline architecture, add complexity incrementally. After each addition:

Run offline evaluation (against ground truth)
Monitor online metrics (user feedback, satisfaction, usage drop-offs)
Estimate engineering + ops cost

If a new approach doesn’t clearly outperform the last — go back or stay where you are.

Stay Model/Provider Agnostic

One of the most strategic decisions you can make early on is to avoid locking your AI stack into a single model provider. The landscape is evolving rapidly — new foundation models are being released every few weeks, each with different strengths in reasoning, speed, latency, cost, or compliance. If your architecture is tightly coupled to a specific provider’s API, switching later can be time-consuming, expensive, and risky.

Instead, design your platform to be modular and provider-agnostic. Abstract model interactions behind a service layer or gateway so that your application code isn’t directly dependent on any one API schema or response format. This allows you to experiment with multiple models and switch when performance, cost, or capabilities justify it — without rewriting your core logic. It also enables A/B testing across providers, fallback strategies when one provider fails, and custom routing based on task type.

To stay truly provider-agnostic, your architecture should include:

LLM Integration Layer: Supports major providers like OpenAI, Anthropic, Google Gemini, Mistral, or AWS Bedrock via APIs or SDKs.
Prompt Translation Layer: Lets you manage prompts per provider or model family — either through prompt templates or translation logic.
Model Router: Dynamically selects the model + prompt based on use case, user tier, performance, or cost constraints.
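The three layers above can be sketched together in a few dozen lines. This is a minimal illustration, not the llm-gateway implementation: the provider names, prompt templates, and classes are hypothetical, and the fake provider stands in for a real vendor SDK call.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


class LLMProvider(Protocol):
    """Integration layer: the only interface application code depends on."""
    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class FakeProvider:
    """Stand-in for a vendor adapter; a real one would wrap that vendor's SDK."""
    name: str
    healthy: bool = True

    def complete(self, prompt: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} unavailable")
        return f"[{self.name}] {prompt}"


# Prompt translation layer: one template per provider or model family.
PROMPT_TEMPLATES: Dict[str, str] = {
    "provider_a": "Answer concisely: {question}",
    "provider_b": "Question: {question}\nAnswer:",
}


class ModelRouter:
    """Picks providers per task type and falls back in order on failure."""

    def __init__(self, routes: Dict[str, List[LLMProvider]]) -> None:
        self._routes = routes

    def ask(self, task: str, question: str) -> str:
        errors = []
        for provider in self._routes[task]:
            prompt = PROMPT_TEMPLATES[provider.name].format(question=question)
            try:
                return provider.complete(prompt)
            except RuntimeError as exc:
                errors.append(str(exc))
        raise RuntimeError("all providers failed: " + "; ".join(errors))


primary = FakeProvider("provider_a", healthy=False)  # simulate an outage
fallback = FakeProvider("provider_b")
router = ModelRouter({"doc_qa": [primary, fallback]})
print(router.ask("doc_qa", "What is the renewal date?"))
```

Because callers only see `ModelRouter.ask`, swapping or reordering providers is a configuration change rather than an application rewrite.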

This flexibility provides major benefits:

Easier Upgrades: When newer, more powerful models become available, you can integrate them seamlessly.
Cost Efficiency: You can take advantage of pricing differences across providers by routing requests dynamically based on cost or performance.
High Availability: Avoid long downtimes when a single provider has an outage ( we have seen our fair share of these from major SOTA model providers )
Future-Proofing: Your product remains adaptable and resilient against rapid shifts in the AI ecosystem.

Final Thoughts

Once the first use case is delivered to customers in Beta or GA state, it’s time to focus on operational aspects:

Scalability: Identify bottlenecks in the architecture that impact availability. This could involve adding autoscaling capabilities to deployed LLMs, caching repeated computations such as embedding / RAG responses to reduce overall load on downstream systems, or simply adding rate limiters on externally exposed APIs. Focusing on observability in the initial days will make this easier.
OnCall: Establish team rituals around operational aspects, such as an on-call rotation, to keep track of customer impact. Write runbooks for commonly encountered issues to bring down recovery time.
Customer Feedback: Monitor quality metrics and customer feedback to understand cases where the MVP approach doesn’t work. Now is the time to start the quality hill climb from the MVP.

This is also when the next use case often emerges — sometimes organically. At Box, our first AI use case focused on document Q&A. But customer usage revealed a new need: metadata extraction. Our existing solution didn’t fit. So we kicked off the AI Extract Agent — a purpose-built system leveraging named-entity recognition and tool use. It built upon the platform primitives we had already invested in, like our provider-agnostic llm-gateway, but pushed us further in architectural complexity and product scope.

That single MVP, built by a tiger team of four, eventually became a multi-team AI Platform organization. What began as two services grew into indexing and query pipelines, with dedicated services, vector stores, and specialized teams focused on agents, RAG, foundation models, and prompt engineering.

And the foundational decisions we made early — to stay provider-agnostic, to treat prompts as code, to build a robust evaluation loop — unlocked speed and agility as the landscape evolved. They made it easy to adopt the latest state-of-the-art models without major investment in time or resources.

So if you’re at the beginning of your journey: start simple, but build smart. Focus on solving one real problem well. Lay the groundwork — with evaluation, modularity, and observability — for what comes next.

Building your first AI product: A practical guide was originally published in Box Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Box IT Blog Series: How We Deploy AWS Network Technologies Inside Box IT

Zhen Chen — Tue, 28 Jan 2025 20:46:56 GMT

Illustrated by Navied Mahdavian / Art directed by Erin Ruvalcaba Grogan

Welcome to the Box IT Blog Series, where we dive into our journey of transforming from traditional on-premises IT data centers to AWS Cloud technologies. Historically, Box IT relied on datacenter-grade routing, switching, and network security solutions to build and manage robust on-premises networks. Transitioning to AWS introduced a paradigm shift — many of the familiar concepts from traditional network infrastructure no longer applied in the cloud’s virtualized environment. For network engineers accustomed to physical hardware, cable management, and conventional high availability (HA) and resiliency methods, the shift posed unique challenges. In this series, we’ll share how Box IT embraced AWS-native technologies, followed best practices, and redefined our network architecture to meet the demands of the cloud era.

Traditional Data Center Design Overview

To better understand how Box IT has utilized AWS network technologies, it’s crucial to first revisit our starting point: the traditional data center model. By exploring the key components of this setup and their role within the cloud ecosystem, we can establish a solid foundation for our migration journey and highlight the unique challenges and opportunities it presented.

In traditional network design, the layout of an IT data center can vary significantly depending on organizational needs. However, several foundational components are universally critical for ensuring secure, high-performance, and efficient operations. These components form the backbone of an on-premises network infrastructure:

Border Network: The gateway to the Internet, this component interfaces directly with the ISP, managing inbound and outbound traffic. As the entry and exit point for external communication, it is crucial for handling and directing network traffic securely and efficiently.
Security Protection: In today’s threat landscape, robust security measures are non-negotiable. Application firewalls are a standard defense, providing a shield against external threats and controlling the flow of traffic into sensitive parts of the network.
Intra Data Center Connection: The core of internal data traffic, this component demands high-performance, low-latency, and highly available networks. Modern switching and routing technologies ensure seamless internal communication to meet the demands of contemporary applications.
Interconnection with Other Data Centers or Office Networks: For organizations with multiple data centers or remote office networks, reliable interconnectivity is essential. These connections, often built with high-capacity links and secure protocols, enable smooth resource sharing and data integrity across all sites.

Each of these elements is designed with principles of high performance, high availability, resiliency, and cost efficiency. As businesses increasingly adopt cloud solutions, it’s worth exploring how cloud-based technologies address these traditional principles, delivering cutting-edge network solutions that align with today’s operational needs. Let’s dive deeper into this evolution!

Border Network: From Traditional Border Routers to AWS Internet Gateways (IGW) and NAT Gateway (NGW)

In traditional network setups, the border network layer relies on border routers to connect to multiple ISPs using EBGP for redundancy and traffic load sharing. Additionally, due to the global shortage of IPv4 addresses, network teams typically deploy NAT at this layer to translate RFC1918 private IP addresses into public, routable IPs for internet communication.

AWS simplifies the management of internet connectivity with Internet Gateways (IGWs) and NAT Gateways (NGWs). According to AWS, “An internet gateway is a horizontally scaled, redundant, and highly available VPC component that allows communication between your VPC and the internet.” Once an IGW is deployed and attached to a VPC, AWS handles the high availability and redundancy of internet connectivity, with no additional resiliency configuration needed. It’s a hassle-free solution that eliminates much of the manual effort required in traditional setups. More detail on IGWs can be found in the AWS documentation.

The IGW in AWS serves a role similar to border routers in on-premises networks. It facilitates internet access for resources within the VPC. Complementing this, the NAT Gateway is critical for enabling internal resources, such as instances in private subnets, to access external services. As AWS explains, “You can use a NAT gateway so that instances in a private subnet can connect to services outside your VPC, but external services cannot initiate a connection with those instances.” While not a direct equivalent to an on-premises internet firewall, an NGW performs all outbound NAT functions, ensuring seamless internet access for private subnets.

Unlike IGWs, NGWs are tied to specific Availability Zones (AZs). If an AZ experiences an outage, the NGW in that zone becomes unavailable. To mitigate this, AWS strongly recommends deploying an NGW in each AZ to maintain high availability. Setting up an NGW involves specifying a subnet within the desired AZ and assigning an Elastic IP (EIP), a public IPv4 address reserved in your AWS account. EIPs are essential for creating public-facing services and enabling NAT functionality for NGWs. For more details, refer to the AWS documentation on Elastic IP addresses. Note that EIPs are a paid resource, so their usage should be planned carefully.
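The one-NGW-per-AZ recommendation is easy to check mechanically. The sketch below is a planning aid under stated assumptions, not a deployment script: it makes no AWS API calls, and the data class, function names, subnet IDs, and AZ names are hypothetical. It assumes each NGW is placed in a public subnet of its AZ and needs one Elastic IP.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Subnet:
    subnet_id: str
    az: str
    public: bool


def azs_missing_nat_gateway(subnets: List[Subnet], nat_gateway_azs: Set[str]) -> List[str]:
    """AZs that host private subnets but have no NGW of their own."""
    private_azs = {s.az for s in subnets if not s.public}
    return sorted(private_azs - nat_gateway_azs)


def plan_nat_gateways(subnets: List[Subnet], nat_gateway_azs: Set[str]) -> List[Dict[str, object]]:
    """One planned NGW per uncovered AZ: the subnet to use, plus an EIP to allocate."""
    plan = []
    for az in azs_missing_nat_gateway(subnets, nat_gateway_azs):
        public = [s for s in subnets if s.az == az and s.public]
        if not public:
            raise ValueError(f"{az} has no public subnet to host a NAT gateway")
        plan.append({"az": az, "subnet_id": public[0].subnet_id, "needs_elastic_ip": True})
    return plan


subnets = [
    Subnet("subnet-pub-a", "us-west-2a", public=True),
    Subnet("subnet-priv-a", "us-west-2a", public=False),
    Subnet("subnet-pub-b", "us-west-2b", public=True),
    Subnet("subnet-priv-b", "us-west-2b", public=False),
]
print(plan_nat_gateways(subnets, nat_gateway_azs={"us-west-2a"}))
```

Each plan entry carries the two inputs the setup needs, the subnet and the Elastic IP, and its length is also a quick estimate of how many paid EIPs the design adds.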

The NGW creation dashboard, illustrated below, emphasizes these two essential components for the setup: the subnet and the Elastic IP. Both are critical for configuring the NGW.

The diagram below illustrates a typical setup for planning your VPC’s internet-facing layer, incorporating both IGWs and NGWs.

In summary, AWS offers a robust, scalable, and highly available approach to managing the internet-facing layer of your VPC. IGWs simplify the process of establishing external connectivity, while NGWs provide secure, controlled internet access for private subnets. These tools, when used strategically, modernize and streamline border network management in the cloud. Up next, we’ll explore additional AWS networking solutions!

Security Protection: AWS Network Firewall for Layer 7 Protection

In contrast to the traditional on-premise model, where security at the network border relies heavily on hardware firewalls and Access Control Lists (ACLs) on routers and switches, AWS reimagines security with a broad suite of native tools designed for the cloud. On-premises setups typically achieve high availability (HA) through firewalls operating in HA mode with Virtual IPs (VIPs) and rely on traffic logging for troubleshooting and maintaining oversight. AWS provides similar functionalities but shifts the focus to cloud-native solutions, offering more flexibility and scalability. Let’s delve into these tools and examine their advantages and trade-offs.

One of the first security tools encountered in AWS is the Security Group, defined by AWS as “a set of firewall rules that controls the traffic to and from your instance. Inbound rules control the incoming traffic to your instance, and outbound rules control the outgoing traffic from your instance.” However, from a network security engineer’s perspective, this description oversimplifies their role. Unlike traditional firewalls, Security Groups do not enforce segmentation or protect systems within specific network zones. They also lack the granularity to implement complex permit/deny logic, which is often essential in enterprise environments. If you’re familiar with IPtables, Security Groups function in a similar manner.

Despite their constraints, Security Groups are indispensable in any AWS security architecture. They excel at controlling east-west traffic (internal traffic between resources), a capability not fully addressed by other AWS features. With their default “deny-all” policy, they allow only explicitly defined traffic, ensuring a baseline level of security. Below is a typical Security Group configuration interface, which supports only allow rules but enforces a deny-all policy by default.

Network ACLs (NACLs) function as stateless filters at the boundary of each network segment, controlling both inbound and outbound traffic. Unlike Security Groups or AWS Network Firewall (NFW), NACLs require explicit rules for traffic in both directions to ensure proper operation ( stateless vs stateful ).
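The stateless versus stateful difference is easiest to see with a toy model. The sketch below is a simplification for illustration, not how AWS implements either feature: the classes are hypothetical and rules are reduced to sets of allowed destination ports. It shows why a NACL-style filter needs an explicit outbound rule for reply traffic, while a connection-tracking filter does not.

```python
from dataclasses import dataclass
from typing import Container, Set, Tuple


@dataclass(frozen=True)
class Packet:
    src: str
    src_port: int
    dst: str
    dst_port: int


class StatelessFilter:
    """NACL-like: each direction is checked against its own rules, with no memory."""

    def __init__(self, inbound_ports: Container[int], outbound_ports: Container[int]) -> None:
        self.inbound_ports = inbound_ports
        self.outbound_ports = outbound_ports

    def allow_inbound(self, p: Packet) -> bool:
        return p.dst_port in self.inbound_ports

    def allow_outbound(self, p: Packet) -> bool:
        return p.dst_port in self.outbound_ports


class StatefulFilter:
    """Security-group-like: replies to an allowed inbound flow pass automatically."""

    def __init__(self, inbound_ports: Container[int]) -> None:
        self.inbound_ports = inbound_ports
        self._flows: Set[Tuple[str, int, str, int]] = set()

    def allow_inbound(self, p: Packet) -> bool:
        if p.dst_port in self.inbound_ports:
            self._flows.add((p.src, p.src_port, p.dst, p.dst_port))
            return True
        return False

    def allow_outbound(self, p: Packet) -> bool:
        # A reply is the tracked flow with source and destination swapped.
        return (p.dst, p.dst_port, p.src, p.src_port) in self._flows


request = Packet("203.0.113.10", 50000, "10.0.1.5", 443)
reply = Packet("10.0.1.5", 443, "203.0.113.10", 50000)

inbound_only = StatelessFilter(inbound_ports={443}, outbound_ports=set())
both_directions = StatelessFilter(inbound_ports={443}, outbound_ports=range(1024, 65536))
stateful = StatefulFilter(inbound_ports={443})

print(inbound_only.allow_inbound(request), inbound_only.allow_outbound(reply))
print(both_directions.allow_inbound(request), both_directions.allow_outbound(reply))
print(stateful.allow_inbound(request), stateful.allow_outbound(reply))
```

With only an inbound rule, the stateless filter admits the request but drops the reply. Adding an outbound rule for the ephemeral port range fixes it, and that is the extra rule the stateful filter never needs.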

For network engineers, NACLs resemble traditional Access Control Lists (ACLs) on devices like Cisco. However, they come with limitations in granularity, particularly when it comes to protocol-specific rules. For instance, NACLs cannot filter traffic based on specific TCP flags, such as allowing only TCP SYN packets. This makes them less suitable for complex security requirements often encountered in enterprise environments, where advanced firewall technologies are preferred.

Despite these constraints, NACLs offer a cost-effective solution for basic traffic control, as they are provided at no additional charge. While they may not replace full-featured firewalls, they can serve as a useful tool for straightforward access management. For further details on configuring NACLs, refer to the official AWS documentation. The following is an example NACL setup:

While Security Groups and NACLs provide basic security controls in AWS, they lack the advanced Layer 7 capabilities required for deep packet inspection and application-level filtering. To address this gap, Box IT adopted AWS Network Firewall (NFW). According to AWS, “AWS Network Firewall is a stateful, managed, network firewall and intrusion detection and prevention service for your virtual private cloud (VPC) that you create in Amazon Virtual Private Cloud (Amazon VPC).”

Working with AWS firewall technology can be quite challenging. Why? In most enterprises, there is typically a preferred firewall vendor in place, and the first question to address is whether to stick with the same brand for the AWS setup. Keeping the same brand offers benefits such as simplified management, reuse of existing firewall policies, and easier support from the team already familiar with the solution. However, when you explore deploying third-party firewalls in AWS, replicating the on-premises setup often proves complex and difficult. For instance, on-premises setups typically use a pair of firewalls in HA mode to handle single points of failure. Implementing similar setups in AWS can be challenging, requiring creative solutions to ensure resilience and redundancy.

At Box IT, we prioritized simplicity, efficiency, and cost-effectiveness, which made AWS’s native firewall the ideal choice. AWS NFW is straightforward to set up, requires minimal management, and is highly scalable. However, its operation differs significantly from traditional on-premises setups, introducing a learning curve for network engineers.

One of the primary challenges with AWS NFW lies in maintaining symmetric routing. In traditional environments, HA firewalls typically serve as the default gateway, ensuring that traffic flows symmetrically — following the same path in both directions. This setup minimizes routing complexity and avoids issues with stateful inspection.

In AWS, symmetric routing is essential but requires meticulous planning. Consider the following example, which aligns with AWS’s recommended design principles. To protect customer-built systems that face the Internet directly, AWS strongly recommends using an Application Load Balancer (ALB) in the Public Zone, which serves as the direct interface with the Internet. The customer-built systems can then reside in a DMZ zone behind a well-architected AWS Network Firewall.

In this setup, both intra-zone traffic and cross-zone traffic come into play. For instance, an ALB deployed in AZ1 can route traffic to backend EC2 instances in both AZ1 DMZ and AZ2 DMZ to achieve optimal load balancing and high availability.

Intra-zone traffic is straightforward, as it always passes through the NFW within the same Availability Zone.
Cross-zone traffic, however, introduces more complexity. There are two independent firewalls — AZ1 NFW and AZ2 NFW — which do not share state information.

When traffic flows from the ALB in AZ1 to an EC2 instance in AZ2, it may take one of two possible paths:

Option 1: AZ1 ALB → AZ1 NFW → AZ2 EC2
Option 2: AZ1 ALB → AZ2 NFW → AZ2 EC2

While both paths may initially function, it is crucial to ensure that return traffic follows the same firewall it originally passed through. If return traffic traverses a different firewall, it will be dropped due to stateful inspection. Managing this subtle but critical routing requirement is essential for maintaining a reliable and secure AWS architecture.

To clarify, there’s no issue with choosing either Option 1 or Option 2; however, it’s vital to enforce a rule to maintain symmetric routing at all times. For instance, you can always prioritize the AZ1 firewall for cross-zone traffic. This approach helps ensure consistent routing behavior and avoids unintended traffic drops.
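The rule "always use the AZ1 firewall for cross-zone traffic" can be illustrated with a small Python sketch. It is a toy model only: in AWS the rule would be expressed through route tables, and the `StatefulFirewall` class, `pick_firewall` function, and host names here are hypothetical.

```python
class StatefulFirewall:
    """Toy stateful firewall: accepts replies only for flows it has seen."""

    def __init__(self, name):
        self.name = name
        self.flows = set()

    def forward(self, src, dst):
        # Record the flow so the reply can be matched later.
        self.flows.add((src, dst))

    def reply(self, src, dst):
        # A reply is allowed only if the forward direction was recorded here.
        return (dst, src) in self.flows


def pick_firewall(src_az, dst_az, firewalls, preferred_az="az1"):
    """Same AZ: use the local firewall. Cross-zone: always the preferred AZ."""
    if src_az == dst_az:
        return firewalls[src_az]
    return firewalls[preferred_az]


firewalls = {"az1": StatefulFirewall("az1-nfw"), "az2": StatefulFirewall("az2-nfw")}

# Forward path: ALB in AZ1 to EC2 in AZ2.
fw_out = pick_firewall("az1", "az2", firewalls)
fw_out.forward("alb-az1", "ec2-az2")

# Symmetric return: the same selection rule, evaluated from the EC2 side.
fw_back = pick_firewall("az2", "az1", firewalls)
symmetric_ok = fw_back.reply("ec2-az2", "alb-az1")

# Asymmetric return: the reply is sent through the AZ2 firewall instead.
asymmetric_ok = firewalls["az2"].reply("ec2-az2", "alb-az1")
print(fw_out.name, fw_back.name, symmetric_ok, asymmetric_ok)
```

Because the selection depends only on whether the two endpoints share an AZ, both directions resolve to the same firewall. The asymmetric case shows the drop that stateful inspection would cause.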

A quick note on cross-zone traffic, as demonstrated here: AWS typically charges for cross-zone traffic, whereas intra-zone traffic is usually free. However, when a Load Balancer is involved, the cost model changes and differs from standard cross-zone charges. You can find more details here. That said, it’s always a good practice to be cautious and mindful of any cross-zone traffic that might be triggered.

A significant part of the learning curve for network security engineers transitioning to AWS Network Firewall (NFW) is mastering Suricata-based firewall policies. Suricata may be unfamiliar to many network security engineers, but unlike the “black box” solutions offered by other major firewall vendors, it is an open-source intrusion detection (IDS) and intrusion prevention (IPS) system. Since late 2020, Suricata IPS rules have been part of the AWS Network Firewall service. More information can be found here.

When setting up NFW policies, you are presented with the three options below:

Compared to standard stateful rules or domain lists, creating Suricata rules is notably more complex. Here are a few examples to illustrate the format and functionality; more details can be found here.

Blocking malicious site traffic: drop tls any any -> any any (tls.sni; content:".badsite url"; nocase; endswith; sid:100;)
Allowing DNS traffic: pass dns $PRIVATE any -> $DNS 53 (msg:"allow DNS"; sid:200;)
Allowing HTTPS traffic to the Internet for internal hosts: pass tcp 0.0.0.0/0 any -> !$RFC1918 443 (flow:to_server,established; sid:300;)
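The three rules share the same shape: action, protocol, source and port, direction, destination and port, then a parenthesized option list that carries a `sid`. The Python sketch below is a rough local lint for that shape, not a Suricata parser. The `parse_rule` and `duplicate_sids` helpers are hypothetical, and the regular expressions cover only simple single-line rules like the ones above.

```python
import re

# Header: action proto src sport direction dst dport (options)
RULE_RE = re.compile(
    r"^(?P<action>pass|drop|reject|alert)\s+"
    r"(?P<proto>\S+)\s+"
    r"(?P<src>\S+)\s+(?P<sport>\S+)\s+"
    r"(?P<direction>->|<>)\s+"
    r"(?P<dst>\S+)\s+(?P<dport>\S+)\s+"
    r"\((?P<options>.*)\)\s*$"
)
SID_RE = re.compile(r"\bsid\s*:\s*(\d+)\s*;")


def parse_rule(line):
    """Split one rule into its header fields and extract its sid."""
    match = RULE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognisable rule: {line!r}")
    fields = match.groupdict()
    sid = SID_RE.search(fields["options"])
    if sid is None:
        raise ValueError(f"rule has no sid: {line!r}")
    fields["sid"] = int(sid.group(1))
    return fields


def duplicate_sids(parsed_rules):
    """Return sids that appear more than once (each sid should be unique)."""
    seen, dupes = set(), set()
    for rule in parsed_rules:
        if rule["sid"] in seen:
            dupes.add(rule["sid"])
        seen.add(rule["sid"])
    return sorted(dupes)


RULES = [
    'drop tls any any -> any any (tls.sni; content:".badsite url"; nocase; endswith; sid:100;)',
    'pass dns $PRIVATE any -> $DNS 53 (msg:"allow DNS"; sid:200;)',
    'pass tcp 0.0.0.0/0 any -> !$RFC1918 443 (flow:to_server,established; sid:300;)',
]

parsed = [parse_rule(rule) for rule in RULES]
for rule in parsed:
    print(rule["sid"], rule["action"], rule["proto"], rule["dst"], rule["dport"])
print("duplicate sids:", duplicate_sids(parsed))
```

A check like this can catch a malformed header or a reused `sid` before a rule group is uploaded. It does not validate keywords or variables, which only the firewall itself can do.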

Initially, working with these rules may feel daunting, particularly for those accustomed to simpler rule-based interfaces. However, once familiarized with the syntax for pass, reject, and drop actions, the process becomes more intuitive. At Box IT, our network team quickly adapted to Suricata’s structure after overcoming the initial learning curve, leveraging its powerful filtering capabilities effectively.

Another critical component of any effective security stack is the ability to log and categorize events. There is a wide range of both commercial and open-source tools available, and AWS is also making strides in this area. While AWS Network Firewall offers logging capabilities, these are still behind the more advanced tools provided by leading third-party firewall vendors. For effective troubleshooting, it’s crucial to have administrator-friendly logging with real-time analysis and comprehensive debugging tools, which NFW currently lacks.

NFW does offer basic logging, which can be enabled during setup, but its functionality is limited for in-depth troubleshooting or incident response. Enhancements in this area would significantly boost NFW’s usability for enterprise-level network management. Below is an example of NFW logging; more details about logging can be found here.

Despite these limitations, with the right training and experience, teams can effectively manage and optimize AWS NFW policies, ensuring both security and performance in cloud environments.

By addressing these challenges and leveraging the capabilities of AWS NFW, Box IT has achieved a secure and scalable solution tailored for cloud environments.

Intra-Data Center Connections in AWS: What Happened to VRRP/HSRP?

Now that we’ve reviewed the overall design and components of an AWS border network and security, let’s turn our attention to intra-data center connectivity. In AWS, this type of connectivity is managed differently compared to traditional on-premises setups, though there are still some familiar elements. In traditional networks, VRRP (Virtual Router Redundancy Protocol) and HSRP (Hot Standby Router Protocol) are essential for ensuring Layer 2 redundancy and gateway failover. These protocols provide seamless failover by assigning a virtual IP to active network devices, minimizing disruptions. How do these configurations translate into AWS?

Before addressing this question, let’s examine the typical network setup for an EC2 instance in AWS. When launching a new EC2 instance, there are three fundamental network settings to configure:

Network: Specifies the virtual network (VPC) where the EC2 instance will reside, effectively determining its logical location.
Subnet: Selects the subnet for the EC2 instance’s primary network interface, which also determines the Availability Zone the instance runs in.
Auto-Assign Public IP: An optional setting used if the instance requires internet-facing access.

Also, as a network engineer, you’ve likely come across the term “multi-homed host” — a server equipped with multiple network interfaces and IP addresses, allowing it to connect to multiple networks simultaneously. Does AWS support this functionality? Absolutely. AWS enables multi-homing through its EC2 network settings.

To configure multiple network interfaces for an EC2 instance, simply access the advanced network configuration options during setup. For a detailed guide on setting this up, refer to this. This capability is particularly useful for scenarios like separating management traffic from application traffic or connecting to different subnets within a VPC.

By now, you may have realized that AWS does not utilize network-level HSRP or VRRP configurations. So, how does AWS handle failure?

Instead of relying on traditional IP or MAC-level redundancy, AWS advocates implementing failover and redundancy mechanisms at a higher level. The most straightforward approach is to use an Elastic Load Balancer (ELB) and distribute services across multiple Availability Zones (AZs). In the event of an AZ failure, the servers in the remaining AZs automatically take over, ensuring seamless service delivery. This load-balancing strategy is integral to Box IT’s high-availability architecture, enabling services to remain operational even during component failures.
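As a rough illustration of failover handled above the network layer, the Python sketch below spreads requests across healthy targets in two AZs and keeps serving from the surviving AZ when one fails. It is a simplified model, not how Elastic Load Balancing is implemented. The target names and the `route` function are hypothetical.

```python
from itertools import cycle


def healthy_targets(targets, health):
    """Keep only targets whose most recent health check passed."""
    return [t for t in targets if health.get(t["id"], False)]


def route(request_count, targets, health):
    """Round-robin the requests over whatever targets are currently healthy."""
    pool = healthy_targets(targets, health)
    if not pool:
        raise RuntimeError("no healthy targets in any AZ")
    chooser = cycle(pool)
    return [next(chooser)["id"] for _ in range(request_count)]


targets = [
    {"id": "web-az1", "az": "az1"},
    {"id": "web-az2", "az": "az2"},
]

# Normal operation: both AZs pass their health checks.
all_healthy = {t["id"]: True for t in targets}
normal = route(4, targets, all_healthy)

# AZ1 fails: its targets fail health checks and drop out of rotation.
az1_down = {t["id"]: t["az"] != "az1" for t in targets}
degraded = route(3, targets, az1_down)

print(normal)
print(degraded)
```

No virtual IP moves and no gateway election takes place. The failed AZ simply stops receiving traffic, which is the behavior VRRP or HSRP would otherwise provide at the gateway level.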

AWS also highlights that the Elastic Load Balancer is inherently designed for high availability, eliminating the need for additional HA mechanisms. For more details, see here.

If you prefer a system design based on Virtual IPs (VIPs), AWS supports this approach, but it requires more complexity compared to using an ELB. To implement HA failover with multiple instances across AZs using VIPs, refer to the relevant AWS documentation here.

But what happens if an entire AWS region faces challenges? While rare, such incidents have occurred. The solution is failover to another region. AWS’s global infrastructure spans multiple regions, enabling you to configure failover from one region to another. This ensures uninterrupted service even during a regional failure. Selecting a backup region depends on your performance and cost requirements, and for additional redundancy, you could even consider another cloud provider.

Ref: AWS Regions and Availability Zones

The key takeaway is to think ahead, plan for resilience, and choose a failover strategy that aligns with your business needs. The flexibility of AWS, coupled with thoughtful design, ensures robust and highly available systems.

Interconnection with Other Data Centers or Office Networks: Achieving Redundancy with Transit Gateway

Managing enterprise networks can be a complex endeavor, especially for organizations with multiple locations and diverse network infrastructures. Box is no exception, operating global offices, on-premises data centers, and now incorporating cloud-based data centers via AWS. While traditional office and on-premises data center networks are fully managed and maintained by the network team, AWS offers a different approach. It simplifies much of the heavy lifting by managing the underlying infrastructure. However, administrators still play a critical role in designing and configuring the interconnections between AWS regions and integrating AWS networks with non-AWS environments. This hybrid setup demands strategic planning and expertise to ensure seamless communication across all components.

AWS introduces several key technologies for enterprises to consider when architecting their network interconnections: Virtual Private Network (VPN), Transit Gateway, and Direct Connect.

VPN is a familiar technology for most network engineers, widely used to establish secure connections between site A and site B. In AWS, VPN connections can be configured between AWS VPCs and other non-AWS networks using a Virtual Private Gateway or a Transit Gateway.

Virtual Private Gateway: This was the go-to solution for VPN connectivity in AWS before Transit Gateway emerged. While functional, it lacks the advanced features of its successor, Transit Gateway.
Transit Gateway: Now AWS’s recommended solution, Transit Gateway acts as a “regional central hub” for connecting AWS VPCs with other non-AWS networks. One key advantage of using Transit Gateway is its built-in high availability (HA), eliminating the need for additional HA considerations for this setup.

For enterprises seeking higher network performance and reliability, AWS Direct Connect provides a dedicated physical circuit between on-premises networks and AWS. It supports load sharing and redundancy by enabling multiple circuits, offering benefits such as reduced latency and consistent performance. Configuring Direct Connect requires collaboration with AWS or its certified providers. While setup involves additional effort, the long-term performance gains often justify the investment.

Interconnecting AWS VPCs is another critical element of network architecture. AWS recommends Transit Gateway Peering as a scalable and efficient solution for enabling seamless data exchange between VPCs. This approach simplifies the architecture while maintaining performance and security standards. In addition, it leaves room for future growth in the number of VPCs.

Each of these technologies addresses specific challenges in enterprise network management, and they are often used in combination to build robust and scalable architectures. For example, a hybrid network might use Direct Connect for on-premises to AWS connectivity, Transit Gateway Peering for regional VPC interconnectivity, and VPN for secure connections with external non-AWS networks.

Below is an example diagram illustrating how these components work together to create an interconnected enterprise network. As always, careful planning and consideration of specific organizational needs are key to success. By leveraging these AWS technologies, enterprises can streamline their network operations and focus on driving business innovation.

Box IT leverages the full spectrum of AWS networking technologies, tailoring them to meet the unique interconnection requirements of various sites. For our major offices, which house a significant number of engineers and users, we prioritize setting up AWS Direct Connect. This approach ensures high-performance, reliable connectivity to AWS, supporting the demanding workloads and collaboration needs of these locations. On the other hand, for smaller sales offices with lighter network demands, VPN over the Internet has proven to be a cost-effective and reliable solution. This hybrid strategy enables Box IT to balance performance, cost, and scalability, ensuring optimal connectivity for all office types without unnecessary overinvestment.

Final Thoughts: Bridging Traditional and Cloud Network Concepts
Transitioning from traditional on-premises network designs to cloud-based architectures like AWS can be both exciting and challenging for network engineers. The concepts of redundancy, failover, security, and performance remain critical, but the implementation methods often differ significantly. Understanding these differences is the key to designing efficient, resilient, and secure cloud-based systems.

AWS eliminates many of the traditional network configurations such as HSRP/VRRP, relying instead on higher-level redundancy mechanisms like Elastic Load Balancers and multi-AZ deployments. While this approach simplifies certain aspects of high availability, it introduces new complexities, such as managing symmetric routing and learning cloud-native tools like AWS Network Firewall.

Security paradigms also shift with cloud environments. Engineers must adapt from traditional firewalls and ACLs to AWS-specific constructs like Security Groups and NACLs, and embrace the advanced capabilities of tools like the AWS Network Firewall. While these tools offer powerful features, they often come with a learning curve, requiring teams to invest time in understanding and implementing them effectively.

The absence of region-wide redundancy in traditional terms pushes engineers to think globally, leveraging AWS’s regional and multi-cloud capabilities for failover and disaster recovery. This forward-thinking approach ensures system resilience even during rare but impactful events like regional outages.

Ultimately, bridging the gap between traditional and cloud networking concepts requires a mindset shift. It’s not just about translating existing designs into the cloud; it’s about embracing the opportunities and flexibility cloud platforms offer to create robust, scalable, and future-ready architectures. By blending the best practices from both worlds and continuously adapting to evolving cloud technologies, network engineers can drive innovation and reliability in modern infrastructure.

Cloud networking is not a replacement for traditional networking — it’s an evolution. And with the right tools, strategies, and mindset, you can confidently navigate this transformation.

Box IT Blog Series: How We Deploy AWS Network Technologies Inside Box IT was originally published in Box Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Creating Cloud Managed Platform Services

ssuresh — Mon, 05 Aug 2024 16:15:54 GMT

Illustrated by Navied Mahdavian / Art directed by Erin Ruvalcaba Grogan

Infrastructure Paradigm Shift

Every tech company goes through a phase of deciding whether to adopt the Public Cloud or manage its own infrastructure. After more than a decade of self-hosting its cloud infrastructure, Box decided to embrace the Public Cloud and chose Google Cloud Platform (GCP) as its preferred Cloud provider. It was a shift in the operating model for almost all Engineering teams as we migrated from our on-premises infrastructure to managed service offerings from our Cloud provider to handle all our compute, storage and networking needs. In this article we will go through one of the frameworks Box adopted to navigate and integrate with Google managed services like Bigtable, Pub/Sub, BigQuery, Cloud SQL, etc. to operate natively in the cloud.

One of the cloud migration challenges was to meet our increasing demand for a solution that could handle relational database needs for internal and other non-customer data. Even though Box has a dedicated database team, their primary focus was to provide custom database solutions to our core applications by optimizing and scaling self-managed database fleets that needed a higher level of performance and custom integration with our data access layers. These customizations were not applicable to most of the non-core applications, and service owners often had to rely on our database teams for any database-related customizations and support.

In order to meet the ever-increasing database needs of the non-core applications in a consistent and reliable manner in GCP, Box Engineering decided to take a platform engineering approach to provide a secure, mature and enterprise-grade data store solution. This approach had several positive outcomes:

Reduced database creation time by more than 50%.
Eliminated custom database configurations via pre-configured settings leveraging Infrastructure as Code(IaC). Read about our IaC design in Box CMF: Infrastructure as Code, Then What?!?
Standardized access patterns to follow security best practices.

Service typically refers to an application running on containers or Virtual Machines(VMs)

Rise of Platforms

The shift to Public Cloud gave us the opportunity to ensure that future infrastructure needs follow a platform engineering approach, in which teams buy into a platform solution built on Google managed services, with Box-specific customizations added to meet our strict Security, Legal, Privacy and Compliance (SLPC) requirements. For example, DataPlatform handles all applications requiring data transformation and management capabilities, leveraging solutions like BigQuery, Cloud SQL, Pub/Sub, Bigtable, etc.

The Relational Challenge

MySQL is used extensively at Box as a Relational Database (RDB) solution. Mojito (Box’s open source localization platform), our in-house real-time user availability measuring tool, and our automated infrastructure diagnosis tool are some of the internal services running on MySQL. The on-premises version of managed MySQL was operationally heavy for both the database team and the service owner, as the implementation was highly customized per team. This approach created significant challenges for datastore integration and management for service owners and often required support from the database team.

Service Owner Challenges

Limited expertise and operational knowledge to own and maintain a relational database in GCP.
Lack of standardized processes to configure database resources and manage their usage within GCP.
Obtaining approval against Box’s extremely diligent SLPC requirements for data storage in GCP, per database.

Solution

It was a collaboration that brought together teams from different functions, working in harmony to reach a consensus on the way forward. After engaging in discussions and deliberations, these were the decisions that emerged.

Source of image cgdream.ai

In order to operate efficiently in GCP, we decided to leverage platform engineering practices to reduce friction by streamlining processes, improving efficiency, and providing seamless integration across different infrastructure components. Google’s Cloud SQL fit this use case perfectly as a platform-based solution for all future non-customer RDB needs, with appropriate standardization, governance and security. The Site Reliability Engineering (SRE) team and the Database Infra teams collaborated closely to design and develop an intuitive solution leveraging our DataPlatform (DP).

Shared Ownership Model

DP to design and guide in enabling Cloud SQL as a capability via IaC.
SRE team to collaborate with database teams to deliver a mature, enterprise grade RDB solution emphasizing security, performance and operational ease.
Security team to ensure the solution adheres to the company’s standardized SLPC requirements for data storage.

Platform Bring Up

Due Diligence: A Comprehensive Review Process

Source of image cgdream.ai

During the initial proposal for our Cloud SQL solution, we encountered several unanswered questions that needed addressing. To ensure a robust implementation, we held a meeting with the GCP team. This session shed light on upcoming features in GCP’s roadmap for Cloud SQL, including features that were already supported by Box’s self-managed MySQL fleet. One of our key requirements was to periodically rotate the encryption key version used in any data store; at the time, this was not a generally available feature from Google.

At Box Engineering, we take a meticulous approach to analyzing any solution, with SLPC (Security, Legal, Privacy & Compliance) at the forefront. As guardians of data security within our organization, we treat high standards as non-negotiable when approving a new application. One of the major factors in even considering Cloud SQL as a viable managed service was the FedRAMP High compliance provided by Google. After careful examination of all the regulatory requirements for a data store, our security architects approved the proposal to use Cloud SQL at Box with a few caveats:

No customer-related data (as encryption key rotation was not generally available at the time of implementation)
Identity and Access Management (IAM) based authentication
Standard encryption leveraging Customer-managed encryption keys (CMEK) to ensure secure access to data.

It was still a win for Box Engineering as it paved a way for future micro-services to easily integrate with a relational data store with limited operational overhead for service owners.

Implementation

In order to ensure ease of use, we needed to expose only the required inputs for database provisioning in our IaC. The inputs needed to be agnostic of the underlying database engine, as we had a few use cases apart from MySQL.

The final result was an intuitive IaC interface that could be configured with just a few lines of configuration code. By abstracting most of the provisioning logic from the user, we were able to significantly reduce instance setup time. This platform capability empowered every member of our Box team to create relational databases for their applications with minimal operational overhead.

Creating a new relational database is as simple as defining a block of configuration with necessary attributes as defined below.

variable "cloudsql_instances" {
  type = map(object({
    database_version      = string,         // API enum string (e.g. MYSQL_8_0, POSTGRES_14)
    tier                  = string,         // API tier string (e.g. db-custom-1-3840)
    region                = string,         // us-west1 or us-central1
    disk_size             = number,         // In GB (e.g. 10, 40) Cannot be less than 10 GB
    ha_enabled            = bool,
    read_replicas         = map(number),    // Setting the number of read replicas for zone and/or cross-regions
    readers               = list(string),   // List of Service Accounts with read permission
    writers               = list(string),   // List of Service Accounts with write permission
    admins                = list(string),   // List of Service Accounts with admin permission
    maintenance_config    = map(number),    // Specify the day of the week (1-7) and hour of the day (0-23) for maintenance
    retain_backups        = bool,          
    backup_retention_days = number,         // Please specify 3 at a minimum
    additional_databases  = list(string)    // list of databases to create
  }))
  default = {}
}
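The comments in the variable definition imply a handful of constraints: allowed regions, a minimum disk size, a minimum backup retention, and maintenance day and hour ranges. The Python sketch below shows how such inputs could be checked before provisioning. It is an illustration only and not part of Box's IaC. The instance name, service account addresses, and the `day`/`hour`/`zone` map keys are all hypothetical.

```python
ALLOWED_REGIONS = {"us-west1", "us-central1"}


def validate_instance(name, cfg):
    """Return a list of human-readable problems for one instance definition."""
    problems = []
    if cfg["region"] not in ALLOWED_REGIONS:
        problems.append(f"{name}: region must be one of {sorted(ALLOWED_REGIONS)}")
    if cfg["disk_size"] < 10:
        problems.append(f"{name}: disk_size cannot be less than 10 GB")
    if cfg["backup_retention_days"] < 3:
        problems.append(f"{name}: backup_retention_days must be at least 3")
    # The key names "day" and "hour" are assumed; the source shows only map(number).
    day = cfg["maintenance_config"].get("day")
    hour = cfg["maintenance_config"].get("hour")
    if day is not None and not 1 <= day <= 7:
        problems.append(f"{name}: maintenance day must be between 1 and 7")
    if hour is not None and not 0 <= hour <= 23:
        problems.append(f"{name}: maintenance hour must be between 0 and 23")
    return problems


# Hypothetical instance definition mirroring the attributes of the variable.
good = {
    "database_version": "MYSQL_8_0",
    "tier": "db-custom-1-3840",
    "region": "us-west1",
    "disk_size": 40,
    "ha_enabled": True,
    "read_replicas": {"zone": 1},
    "readers": ["reader-sa@example-project.iam.gserviceaccount.com"],
    "writers": ["writer-sa@example-project.iam.gserviceaccount.com"],
    "admins": [],
    "maintenance_config": {"day": 7, "hour": 3},
    "retain_backups": True,
    "backup_retention_days": 7,
    "additional_databases": ["reports"],
}
bad = dict(good, disk_size=5, backup_retention_days=1,
           maintenance_config={"day": 8, "hour": 24})

print(validate_instance("example-db", good))
for problem in validate_instance("example-db", bad):
    print(problem)
```

Keeping checks like these next to the interface means a service owner gets a clear message up front, rather than a failed provisioning run.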

Secure and Simple access

Imagine a solution that makes it impossible for anyone to breach database passwords! Sounds amazing, right? That’s exactly the access model for Cloud SQL. All access to cloud-native applications follows our IAM-based authentication and authorization best practices framework.

Adoption and Support

At Box, we believe in embracing a “Be an Owner” mindset, which is reflected in our approach to the solution process. As a platform user, every Box engineer has the power to leverage solutions and onboard to the required platform through a self-service document. It is the responsibility of the platform owner to create a comprehensive user document, while as a service owner, you have the opportunity to provide feedback and make contributions for future users.

Initially, the requirement was limited to a handful of identified services. However, the adoption continues to increase as more teams at Box recognize the value of leveraging the platform based relational database solution for building robust backends for their applications.

To ensure continuous improvement, we actively collect feedback through a dedicated Slack channel created for Cloud SQL adoption. This valuable input stream allows us to refine our solution on an ongoing basis.

Conclusion

Having a platform-based solution has really helped service owners at Box accelerate their development process and focus on the application logic to ship features at a rapid rate. Our internal tooling has matured considerably as new solutions were able to easily integrate with Cloud SQL for all the relational database needs.

Developing platform capabilities requires both problem-solving skills and long-term vision towards scalability and adoption. It is crucial to always maintain open lines of communication with users; no product can ever be considered truly complete as there are always opportunities for enhancement.

There is a growing ask to support storing PII and customer data in Cloud SQL. A roadmap is in place to leverage Cloud SQL Enterprise Plus for these use cases.

The widespread adoption of any new solution hinges upon its genuine necessity. There was an inherent need for a streamlined, robust, and user-friendly relational database within GCP which led to the creation of a simple yet intuitive solution for future SQL based backends at Box.

PS: This article was refined using Box AI, learn more about it here https://www.box.com/ai

Interested in learning more about Box? Check out our careers page!

Creating Cloud Managed Platform Services was originally published in Box Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Software Rot: Why Exercise is Important for Your Software

Priyanka Reddy — Fri, 26 May 2023 21:34:07 GMT

Illustrated by Navied Mahdavian / Art directed by Sarah Kislak

Once upon a time (well really in 2016), we built a life-changing service named Spatula that would automatically detect and flip failing databases out of the customer path. We deployed it and waited. It did nothing most of the time, except when a database failed. And that’s when it shined, deftly flipping traffic away from the ailing database to one that was fit as a fiddle. All within a matter of minutes, without human involvement. It was a sight to behold and we were all so proud of it and ourselves.

But then arrives the fateful night when we come face-to-face with our own mortality. It starts out just like many nights in the past. A database host fails. Of course it holds the data that powers the most critical parts of the site. Of course it’s Saturday. Of course it’s 3 am (seriously, why do all database failures happen between the hours of 3 and 4 am?). But no matter. Spatula springs into action and detects the failure as it should. But then, when it comes time to execute the flip, Spatula fails!

Luckily, Spatula pages the oncall DBA that the auto-remediation didn’t succeed. The oncall DBA proceeds to drag herself out of bed and because it’s been so long since she’s had to manually remediate a failed database (yay, Spatula!), she has to relearn the process of doing so while being half asleep. A short while later, we’re back up.

In this situation, not only has Spatula’s existence not made things better, it’s actually made things much worse (boo Spatula). By automating this critical process, we’ve erased it from the collective minds of the DBAs.

To make matters worse, Spatula is still broken. It’s still 3 am on a Saturday. And, we’re at risk of reoccurrence until Spatula is functional again. After some debugging, it’s discovered that Spatula has been dysfunctional since the previous Monday at 2 pm when a maintenance unrelated to Spatula took place. Not only has a key piece of our database availability infrastructure been silently broken for over 4 days waiting to bite us at the worst possible moment, but now we have to fix it in the middle of the night.

What’s brought us to this less-than-ideal situation in the first place? The most direct cause is the Monday maintenance that changed something in the environment, rendering Spatula broken. It’s easy to get overly focused on the fact that an unrelated maintenance broke Spatula. But that’s a fact of life when it comes to software, especially if critical paths in the software sit idle for long stretches of time. It’s a phenomenon known as “dormant software rot”. Dormant software rot is the idea that “software that is not currently being used gradually becomes unusable as the remainder of the application changes” (Wikipedia). Software does not work in a vacuum. It interacts with and relies on upstream and downstream systems owned by different teams. It also lives in an ever-evolving shared environment. Any of those could change and put the software into an unusable state. And the longer the software goes without being executed, the more likely it is that dormant software rot sets in. The only way to combat it, then, is to ensure that the software is exercised regularly.

How do we apply this to our scenario? Spatula reacts to failed databases, something that thankfully doesn’t happen too often. It’s not something we want to force to happen more often than necessary as each database flip can cause a brief service degradation. Instead, we decided to introduce a synthetic check on a test database.

The diagram below shows the full architecture of the Spatula pipeline and the modifications that we made to ensure a seamless health-checking pipeline. It wasn’t trivial but it wasn’t rocket science either. And, the return far outweighed the investment.

Architecture of the Spatula pipeline and modifications

Although we considered testing only portions of the pipeline for the sake of simplicity, we ultimately decided that in order to fully detect dormant software rot, the entire pipeline had to be tested. Today, we test the pipeline 3 times during the day (8 am, 3 pm and 6 pm), with the end of the workday being most important to ensure that no changes during the day negatively impacted the pipeline.
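A minimal sketch of what such an end-to-end synthetic check might look like is below. It is an illustration in Python, not Spatula’s actual code: the stage functions, `CheckResult`, `run_synthetic_check`, and `is_check_due` are hypothetical names. The only details taken from the text are the three daily run times and the decision to exercise every stage against a test database.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List

# Hours (24-hour clock) at which the check runs: 8 am, 3 pm and 6 pm.
CHECK_HOURS = (8, 15, 18)


@dataclass
class CheckResult:
    ran_at: datetime
    ok: bool
    detail: str


def run_synthetic_check(stages: List[Callable[[], None]], now: datetime) -> CheckResult:
    """Run every pipeline stage in order against a test database.

    Each stage is a hypothetical callable that raises on failure. Running all
    of them, rather than a subset, is what exercises the whole pipeline.
    """
    for stage in stages:
        try:
            stage()
        except Exception as exc:
            # Stop at the first broken stage and report which one it was.
            return CheckResult(now, False, f"{stage.__name__} failed: {exc}")
    return CheckResult(now, True, "all stages passed")


def is_check_due(now: datetime) -> bool:
    """True at the top of each scheduled hour."""
    return now.hour in CHECK_HOURS and now.minute == 0
```

In a real deployment the failing result would feed an alerting system. The sketch only returns it so that the caller can decide how to page the team.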

Once the synthetic check was deployed, here’s how the events unfolded during the next problematic maintenance.

Thursday

6:00 am: A maintenance is performed on the configuration pipeline that Spatula relies on. The maintenance breaks the configuration pipeline, putting Spatula in an unusable state.
8:00 am: Spatula’s synthetic check runs, detects that Spatula is broken and alerts the team. The team sees that the previous night’s 6 pm synthetic check succeeded so something must have changed since then to break Spatula. They comb through all production changes that were made between 6 pm and 8 am, identifying the 6 am maintenance as a possible culprit. Once the team verifies that the maintenance is indeed the cause of the breakage, they make the owning team aware of the issue so they can start investigating.
2:00 pm: After several hours of investigation and remediation, the configuration pipeline is functional once again and so is Spatula.
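The narrowing step in the 8:00 am entry, looking only at changes made between the last passing check and the first failing one, can be sketched as follows. This is an illustrative Python snippet with a hypothetical `Change` record and made-up change log entries, not the team’s tooling.

```python
from datetime import datetime
from typing import List, NamedTuple


class Change(NamedTuple):
    """A hypothetical production change record."""
    applied_at: datetime
    description: str


def suspect_changes(
    changes: List[Change],
    last_success: datetime,
    first_failure: datetime,
) -> List[Change]:
    """Return changes applied after the last passing check and no later
    than the first failing one, oldest first."""
    window = [c for c in changes if last_success < c.applied_at <= first_failure]
    return sorted(window, key=lambda c: c.applied_at)


# Arbitrary dates standing in for Wednesday evening and Thursday morning.
CHANGE_LOG = [
    Change(datetime(2023, 1, 4, 14, 0), "deploy unrelated service"),
    Change(datetime(2023, 1, 5, 6, 0), "configuration pipeline maintenance"),
    Change(datetime(2023, 1, 5, 10, 0), "rotate certificates"),
]

suspects = suspect_changes(
    CHANGE_LOG,
    last_success=datetime(2023, 1, 4, 18, 0),  # 6 pm check passed
    first_failure=datetime(2023, 1, 5, 8, 0),  # 8 am check failed
)
```

The more often the check runs, the shorter this window is and the fewer changes there are to comb through.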

One doesn’t have to look hard to see how this incident was a vast improvement over the earlier one.

The team was alerted to a dysfunctional Spatula by a proactive alert rather than a customer-impacting database failure.
All of the investigation and remediation was done during the workday rather than at 3 am.
Spatula was dysfunctional for far fewer hours (8 hours, from the 6 am maintenance to the 2 pm fix, vs. 100+ hours).

Although we’ve had several repeats of issues that look much like the Thursday morning incident, the Spatula synthetic check has saved us from repeating that fateful night.

This idea of synthetically executing a rarely exercised code path is not unique to the Spatula pipeline. The same principles can be applied to any system that is not constantly in use and is therefore subject to dormant software rot. A few examples include service discovery changes, active-passive systems, and configuration delivery pipelines.

There’s a pretty good chance that at least one of your own systems has this plague hiding within, waiting to bite you at the least opportune time. Don’t wait until you have your own fateful night. Go forth and seek out your software rot!

If you’re interested in joining us, check out our open opportunities.

Software Rot: Why Exercise is Important for Your Software was originally published in Box Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Telling Our Stories: Leading Product Managers

Daphne Zhao — Wed, 29 Mar 2023 19:15:04 GMT

Considerations for PM Managers

Illustrated by Navied Mahdavian / Art directed by Erin Ruvalcaba Grogan

This is the final post in our four-part series focused on the Women in Tech (WIT) Employee Resource Community (ERC) at Box. Earlier posts in this series highlighted Women in Tech (WIT) at Box, our Public API Service as led by Marie Rogers, and the career and work of Boxer Anna Wojszcz. In this final post, we hear from Daphne Zhao, Director of Product Management.

Navigating ambiguity is an essential part of the role of a product manager (PM). However, for a PM manager, empowering their team to navigate ambiguity may prove even more ambiguous, because there is no standardized approach to product management and everyone does it differently. Over the past years of leading product managers towards our shared goals, I’ve been exploring ways to navigate this ambiguity. As a partner to the Women in Tech (WIT) Employee Resource Community at Box, which provides support to women working in technology roles at Box, I was invited to share my insights on how PM managers can effectively lead PMs in navigating ambiguity.

Understanding PMs

Before diving into leading PMs, let’s first get on the same page about the role of a product manager.

Why PMs exist

Product management emerged as a profession, predominantly in software companies, as a response to the ever-increasing complexity of product development, particularly when it comes to dealing with speed and scale. As a company grows (that is, as more products and features are added, customers become more diverse, and the team expands), complexity grows exponentially. As a result, the number of problems the company needs to solve grows exponentially, and choosing where to focus quickly becomes the most important decision a company makes. The need to effectively navigate and balance these factors led to the creation of a dedicated function focused on overseeing the development of products.

What PMs do

PMs are responsible for guiding the entire product development cycle, from ideation and development to launch and scaling, with the aim of delivering value to customers while achieving business goals. A PM’s core competency lies in identifying what problem to solve and working with a team to solve it.

What traits successful PMs share in common

Successful PMs share several key traits that enable them to excel in their roles and drive successful product development.

Good common sense

Successful PMs are right far more often than they are wrong, and when they are wrong, they learn from their mistakes, develop specific insights into the reasons for those mistakes, and share those insights with the rest of the team. PMs with good common sense, or a strong intuition for problem solving, tend to be right a lot. Common sense can be referred to by different names in different contexts, such as business acumen when identifying the business value of solving certain problems, or technical sense when optimizing architectural design. Despite the term “common” in its name, common sense is an uncommon trait that sets successful PMs apart.

Strong ownership

Successful PMs demonstrate a strong desire for ownership, possessing a natural inclination towards achievement. They are highly motivated and seek to take on challenging tasks and projects with a sense of purpose, always striving to deliver results that align with their vision and goals.

Effective communication

Communication is essential for most roles, and especially so for PMs, as they deliver results through collaboration with others. Successful PMs communicate effectively, tailoring the amount of information, level of detail, and way of communication to different stakeholders, ranging from developers to executives, based on each party’s unique needs and priorities. By conveying information in a way that is clear, concise, and relevant, PMs can build trust, establish alignment, and foster a shared understanding of the product’s goals, enabling the team to work together towards a shared vision.

So how can PMs with these traits be identified during interviews?

Hiring PMs

For hiring managers, while working with a PM candidate on a real project would be the ideal way to gauge their common sense, ownership and communication, it’s not feasible during an interview process. Even though assessing PM candidates can be ambiguous, it’s still possible to gain insight into these traits through other means, such as:

Starting with resume

A PM candidate’s resume can serve as an initial indicator of their communication effectiveness.

A well-crafted resume should prioritize its content and be concise. Do they showcase the impact of their achievements, instead of just listing their job responsibilities? Do they highlight their most impactful achievements, instead of listing almost everything they did? Is the layout designed with usability in mind?

The intuitive navigation and information design principles discussed in the book Don’t Make Me Think are about web and mobile design, but they apply equally well to any form of communication, including a resume. For example, is it easy to navigate, e.g., using headlines and lists instead of long paragraphs, and using consistent visual cues such as bullet points and indents? Is it easy to understand, e.g., using plain language instead of jargon, and keeping it short and concise? This is especially important for a PM because it shows their ability to simplify communication for their intended audience, which is a good indicator of how effectively they would communicate with users in their products.

Diving into a past project

During a product manager interview, a deep dive into a past project can provide valuable insights into a candidate’s abilities in various areas — their ability to identify a problem and articulate the business value of solving it, and their skills in prioritization, decision-making, and identifying key metrics to keep the team focused. Overall, a deep dive into a past project provides a comprehensive view of the candidate’s abilities across common sense, ownership, and communication.

Collaborating on a hypothetical project

PM candidates don’t necessarily come from exactly the same product domain, making it difficult to assess their in-depth understanding of a specific project. Additionally, the fast-paced nature of the tech industry means that being able to adapt and learn quickly is essential.

To better assess a PM candidate on these aspects, a domain-agnostic interview approach is used, where candidates are asked to think through a hypothetical project in a domain in which neither the interviewer nor the candidate has worked. During the interview, the candidate is expected to collaborate with designers and other product managers to discuss the project, allowing the interviewer to assess their ability to work effectively in a cross-functional team. This approach provides insight not only into the candidate’s problem-solving skills but also into their communication and collaboration abilities, which are critical for success as a product manager in any domain.

Developing PMs

The ultimate goal of building and developing a team of PMs is to empower them to take ownership and drive their own projects, meaning they should be able to work independently without needing constant direction or guidance. When the team is capable of working autonomously, the PM manager can shift their focus to higher-level strategic initiatives, making themselves redundant in the day-to-day operations of the team.

While engineers tend to follow standardized workflows and use similar tools across companies, product management is still relatively new and diverse, leaving no unified approach to it. Even so, there are two important aspects to consider when developing product managers:

Developing generalists

PMs are generalists rather than specialists, meaning that their skill set has breadth rather than depth. They are involved in a variety of tasks such as UI design, UX research, financial modeling, marketing, business development, contract negotiation, technical writing, and support, among others, but are not necessarily experts in any one area. They know just enough of each discipline to drive the entire product development cycle forward.

Therefore, each PM should solve the end-to-end customer experience, rather than just a portion of it. This enables them to take a product, no matter how small, all the way to market. As they progress in their career, their scope should expand, allowing them to solve the end-to-end customer experience with increased depth or for broader product areas.

Giving autonomy while providing air cover

It may sound like a cliché, but I want to spend some time discussing autonomy because, as previously discussed, each PM should own the end-to-end customer experience, and this requires a certain level of autonomy to fully assume responsibility for it.

The PM manager should understand the importance of giving the team autonomy to bring out their A game, so that together they can maximize their impact on the company. It is the manager’s responsibility to ensure the team has the space it needs to think, drive, and make decisions.

It’s worth noting that giving autonomy doesn’t entail abandoning them with all the problems. Instead, it means:

Aligning on the goals while taking several steps back from dictating what to build
Asking the right questions without steering the conversation towards any particular conclusion, so the team can digest and incorporate those questions into their thinking and decision-making
Living with any failures from the team and taking responsibility for them

In rare cases where the team is struggling to hit a timeline, the PM manager should step in and help GSD, which could be brainstorming user flows, discussing technical tradeoffs, addressing legal and compliance requirements, finding beta customers, doing manual QA, or writing product documentation. Not only does this help the team achieve their goals, but it also helps the PM manager earn trust from the team in two ways. First, by demonstrating that they can also do the work, they assure the team of being able to understand the challenges they face and therefore assess their work effectively. Second, their willingness to help out shows that they’re committed to supporting the team’s success.

Once the team achieves impactful accomplishments, it’s important to celebrate and amplify their success by advocating for them in the broader organization. This means recognizing their achievements, sharing their success stories, and highlighting their contributions to the company. By doing so, the manager can create a culture of recognition and inspire the team to continue pushing boundaries and delivering impactful results.

Evaluating PMs

Evaluating the performance of PMs contributes to the ambiguity surrounding the PM role, because PMs deliver results through others and the business impact of their work often takes time to materialize. To properly evaluate PMs, it is important to consider multiple dimensions. Some key dimensions to consider include:

Product metrics

Product metrics are often considered as an indicator of PM performance as they reflect a PM’s ability to set ambitious yet realistic goals and meet or exceed them. Metrics such as customer adoption, user engagement, revenue growth, and retention rate are some of the KPIs that can be used to evaluate PMs. However, it is important to understand this approach has its limitations in practice due to several factors.

The business impact takes time to materialize. An enterprise feature may take several months to be adopted even by a customer that wants it badly, because of all the internal processes they have to go through, let alone by customers who haven’t heard about it. Adoption and engagement already take some time to measure, while metrics with lasting impact like retention and revenue require an even longer time horizon and a larger sample size to measure.
Different product areas and features are interconnected, making it difficult to isolate a specific feature’s impact. For example, an increase in adoption of one feature may have an unintended consequence of decreasing usage of another feature. Customer dissatisfaction with one product area may negatively impact their willingness to try out new features in other product areas.
The same amount of quality work may have varying impact based on the product’s current state, making it difficult to compare the performance of PMs working on different products. For example, a PM who inherits a product area that has been underperforming for a while will have to fix the product before they can do anything to move the needle even a bit. In this case, how does that PM compare with another PM who took over a fast-growing product area without accumulated tech debt?

Despite these limitations, product metrics are still an indicator of how a PM performs but should be considered along with other dimensions.

Collaboration effectiveness

Collaboration effectiveness is a strong indicator of a PM’s performance, as they rely on others to deliver results and it’s crucial that others respect them, trust them, and enjoy working with them.

Although it’s difficult to measure precisely, collaboration effectiveness is arguably the most tangible dimension to evaluate a PM’s performance. Feedback from key stakeholders such as their direct team (engineers and designers), PM peers, GTM organizations (marketing, sales, customer success, etc.), and executives can provide valuable insights into a PM’s collaboration skills. By regularly soliciting feedback and acting on it, a PM can improve their collaboration effectiveness over time.

Strategic thinking

A PM’s ability to think strategically and align product goals with the broader business strategy is also an indicator of their performance. It requires considering not only the company’s overall goals but also external factors such as market trends, user needs, and the competitive landscape. Additionally, PMs should be aware of other critical initiatives within the company and collaborate with other product teams to ensure alignment with their roadmap.

Measuring a PM’s strategic thinking is not easy either, but it can be evaluated through their ability to anticipate market trends, make tradeoff decisions that drive business value, and adjust their product roadmap when necessary to align with the company’s evolving priorities.

A huge thank you to all our contributors for this series highlighting the inspiring work of the talented women leading this ERC and the impact they are making in shaping the future of work. Let’s continue to honor and support women well beyond Women’s History Month!

Telling Our Stories: Leading Product Managers was originally published in Box Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Telling Our Stories: Spotlight on Anna Wojszcz

Awojszcz — Mon, 27 Mar 2023 19:31:33 GMT

Illustrated by Navied Mahdavian / Art directed by Erin Ruvalcaba Grogan

March is coming to an end. Being Women’s History Month, it’s a time to celebrate and honor the contributions that women have made throughout history. It’s a time to recognize the women who have broken barriers, shattered stereotypes, and paved the way for future generations. In recent years, there has been a growing push to increase the representation of women in the technology industry. This effort has led to the creation of numerous initiatives, such as the Women in Technology (WIT) initiative, which aims to empower and support women in pursuing careers in tech. This is the third installment of the four-part series focused on the Women in Tech (WIT) Employee Resource Community (ERC) at Box. You can find the first installment here, and the second here.

Spotlight on Anna Wojszcz, Staff Technical Program Manager at Box

Anna Wojszcz

My journey in the tech industry, which has traditionally been dominated by men, began when I had to choose my high school profile. While I excelled in both literature and languages as well as mathematics, my mother encouraged me to pursue the maths, physics, and computer science profile. Despite not being technically inclined herself, she believed in my abilities and never allowed me to doubt myself. Although I initially chose marketing as my major, I quickly realized it wasn’t my calling and switched to telecommunications and computer science. Throughout my journey, my mother stood by my side every step of the way. Looking back, I now understand how crucial it was for me to have another woman supporting me in making choices that may not have been the most obvious at the time.

Throughout my nearly 18-year professional career, I have consistently worked in IT, finding it to be a natural and comfortable environment. Over the years, I have held various roles, including Software Engineer, Analyst, Project Manager, and Program Manager, working with both customer-facing and internal teams. I have been serving as a Staff Technical Program Manager (TPM) at Box for approximately 1.5 years. I’m fully remote, working from Poland and assigned to the BoxSign team (located mainly in Amsterdam).

As a TPM, I belong to two teams: my fellow TPMs and the engineering team with whom I collaborate daily. When people ask about the TPM’s role in product development, I often find it challenging to provide a straightforward answer. Essentially, TPMs ensure that all aspects of product development come together seamlessly and that any cross-domain dependencies and risks are properly managed. For Infrastructure TPMs, the boundaries between what a TPM does and does not do may be even less apparent. Nevertheless, TPMs are committed to doing whatever it takes to ensure program success, despite the difficulties and challenges that arise. While the work can be demanding and complex, it is also rewarding and engaging. I could not imagine doing anything else.

My most recent assignment has been BoxSign, introduced in 2021. With BoxSign, you can obtain and apply electronic signatures to your files seamlessly, without requiring any additional app. Moreover, the product facilitates automated messaging and record-keeping throughout the entire signature process. I watched the Sign Engineering Team grow from a single Scrum team to three independent teams with well-established planning and delivery processes and a great sense of ownership. My main goal as the TPM of the BoxSign program is to oversee the management of cross-domain dependencies, mitigate potential risks, ensure the reliability of plans, and effectively manage any changes. In my opinion, a key measure of success for TPMs is when their teams can function autonomously and transition smoothly into operational mode. It’s gratifying to see that this team is close to achieving this milestone, and I may soon be able to embark on new and exciting assignments.

Despite my experience in the industry, Box is the first organization I have encountered that supports Employee Resource Communities on such a large scale, including Women in Tech (WIT). When I was asked to co-chair the WIT EMEA initiative, I eagerly accepted the invitation. At present, more than 25% of EMEA Boxers are women, with almost 17% holding technical positions. The WIT initiative is dedicated to supporting women in tech in various areas, as outlined in a previous blog post. The WIT EMEA program aims to expand these activities throughout the EMEA region via surveys to determine specific needs, dedicated events and meetings, sharing our personal journeys, inviting external guests and coaches, and much more.

Participating in WIT is a new experience for me, and I am eagerly looking forward to the path ahead while also being thrilled to contribute to the community. Although I personally had a relatively smooth journey to my current position, I am aware that it may not be the same for many other women and girls. Therefore, I am committed to making the journey easier for those facing obstacles, motivating those who are still undecided about pursuing a career in tech, and learning from the experiences shared within the WIT community.

Telling Our Stories: Spotlight on Anna Wojszcz was originally published in Box Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Telling Our Stories: Public API Service

Marie Rogers — Fri, 24 Mar 2023 22:04:27 GMT

Facilitating the development of public APIs at Box

Illustrated by Navied Mahdavian / Art directed by Erin Ruvalcaba Grogan

Welcome to the second part of our series that focuses on the Women in Tech (WIT) Employee Resource Community (ERC) at Box. Check out the first in this series here.

Get to know our Author, Marie Rogers:

In my years as a software engineer on the Platform Team at Box, I have had the opportunity to create features and products for one of my favorite personas, developers like me. As I have considered the needs of developers building on Box, I have become interested in the API story we are telling. I have advised as a member of our API Working Group, which reviews all new API proposals, and API Architecture Group, which considers long-term, high-level direction for APIs at Box. During my time at Box, I have also been able to invest in communities that I care deeply about, namely our Women in Tech Employee Resource Group, where I have acted as a Global Chair and a member of our Leadership Committee. Building out the Public API Service has been one of my key accomplishments on the Platform Team, and I am excited to share more about the positive impact of building public APIs and how the Public API Service facilitates that development.

Background

API-first architecture has become a buzzword in the software industry as the importance of mobile integration, developer experience, and microservice architecture grows. Here at Box, as we move from a monolithic architecture to a microservice architecture, we find ourselves becoming an internal API-first company, with many internal APIs being created to facilitate feature development, but only a few being translated into publicly documented and available APIs. Unfortunately, this narrows the benefit of our APIs and cuts off the growth of important Box products including Platform, Mobile, Integrations, and UI Elements. Though some of our deficit in public API development comes from a historical under-investment in public APIs, we found clear areas for improvement in our internal developer experience as we looked to cut down the friction and overall effort required to build public APIs on top of existing internal APIs. We built the Public API Service to help alleviate some of the pain points around public API development here at Box, and as we move forward, we are using the Public API Service to solidify and grow our public API offering.

So what were the pain points in public API development at Box?

For one, we provided no general support for API development: no team owned API development, no tooling existed in our microservices to build public APIs, and documentation was poor or non-existent. We were putting the burden of engineering design and investigation of public APIs onto the shoulders of our developers, and with multiple internal frameworks that support API creation, there was not a clear, consistent process. We also did not maintain proper public documentation which led to a worse developer experience and frequently broke our autogenerated SDKs. So, we decided to build a service that provides a paved path for development while managing three key aspects of our public API layer: scope checking, payload and schema validation, and request orchestration.

Payload and Schema Validation

The Public API Service is built on our standard BFF (backend-for-frontend) JavaScript framework, which provides built-in payload validation, including checking for missing fields, incorrect types, and invalid enum values. On top of this, it is an appropriate place to include additional, endpoint-specific validation. We hold two separate standards for our internal and public APIs, and the Public API Service provides support for translating these internal APIs into standard public APIs. Often for our newer endpoints the translation is straightforward, with limited changes to naming and casing, but having this support is especially useful for some of our older or less standard APIs. For example, when Box acquired the company SignRequest to integrate its e-signature offering, we needed to map a complicated set of API schema fields that did not match our Box public API standards, and the Public API Service was an ideal centralized location for this schema mapping.
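To make the three built-in checks and the schema mapping concrete, here is a small sketch. It is written in Python for brevity rather than in the service’s JavaScript framework, and everything in it is an assumption for illustration: the schema format, the field names, and the functions `validate_payload` and `to_public` are hypothetical and do not reflect the framework’s real API.

```python
from typing import Any, Dict, List

# Hypothetical schema: field name -> expected type, optional enum, optional flag.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "name": {"type": str},
    "access": {"type": str, "enum": ["open", "company", "collaborators"]},
    "max_downloads": {"type": int, "required": False},
}

# Hypothetical internal -> public field renames for one endpoint.
FIELD_MAP = {"signerEmail": "signer_email", "docId": "document_id"}


def validate_payload(payload: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Collect missing-field, incorrect-type and enum errors for a payload."""
    errors: List[str] = []
    for field, rule in schema.items():
        if field not in payload:
            if rule.get("required", True):
                errors.append(f"missing field: {field}")
            continue
        value = payload[field]
        if not isinstance(value, rule["type"]):
            errors.append(f"incorrect type for {field}: expected {rule['type'].__name__}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"invalid value for {field}: {value!r}")
    return errors


def to_public(internal: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Rename internal fields to their public names; unmapped fields pass through."""
    return {field_map.get(key, key): value for key, value in internal.items()}
```

Keeping the rename table next to the validation rules is one way to get the centralization the paragraph describes: a reviewer can see the public contract for an endpoint in a single place.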

We have always faced the issue of internal developers making API contract changes without updating the schema documentation. As we have grown our SDK offering and developed SDK auto-generation capabilities, outdated schema documentation has become increasingly impactful, as it can directly break our SDKs and cause friction in the auto-generation process. To improve this process, we now keep a single source of truth for our documented public API schemas in our Public API Service repository, to be maintained and updated by our developers. These schemas are reviewed and copied into our Box OpenAPI Specification, the publicly available schema repository that powers our Box Developer Documentation and SDK auto-generation. Overall, the service greatly improves the documentation maintenance workflow, eases the burden on our Technical Writing Team, and stabilizes our SDKs by binding the internal development process closer to our external documentation.

Scopes

Scopes are a key part of any API, as they limit which endpoints a client is authorized to use. In our legacy scope framework, we bind each scope to a set of permissions, resulting in a tight coupling of responsibilities that leaves us with an inflexible and limited framework that can cause difficulties for both API providers and consumers. Developers who consume our APIs struggle to find the appropriate scope for a given API action, and they sometimes have to settle for scopes that are too permissive. Meanwhile, our API providers are often unable to create a sufficiently fine-grained scope for an API endpoint, leading them to use scopes that may give access to multiple API endpoints. This lack of visibility into what scopes are needed to perform an API action also translates into less clarity for our admins about what actions an authorized app is allowed to take. For example, the root_readwrite scope not only grants access to Box Item endpoints such as /files and /folders, but also to non-item endpoints such as /comments, because the scope has been mapped to various permissions in the monolith. When an application mints a token with the root_readwrite scope to upload a file, it will inadvertently have permission to post a comment on the file, which is not ideal.

In the Public API Service, we are working to expand the functionality and granularity of our scopes by binding a scope to an API operation. For example, it is now much easier to authorize an app with scopes specific to the sign request or the retention policy endpoints. This granularity also clarifies ownership over the scope. Right now, multiple teams — from Platform to Permissions to feature teams — work on pieces of the scope framework, and very few people have substantial context on how it all fits together, but this shifts ownership squarely to the team that owns the API endpoint.

Signature request and retention policy granular scopes provided in the Box Developer Console are powered by Public API Service’s scope framework.
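The difference between the two scope models can be sketched in a few lines of Python. Only the root_readwrite scope and the /files and /comments endpoints come from the text. The permission names, the operation strings, and the granular scope names below are invented for illustration and are not Box’s actual identifiers.

```python
from typing import Dict, Set

# Legacy model (illustrative): a scope maps to permissions, and each
# operation is guarded by a permission.
LEGACY_SCOPE_PERMISSIONS: Dict[str, Set[str]] = {
    "root_readwrite": {"item_write", "comment_create"},
}
PERMISSION_FOR_OPERATION: Dict[str, str] = {
    "POST /files": "item_write",
    "POST /comments": "comment_create",
}


def legacy_allows(token_scopes: Set[str], operation: str) -> bool:
    """Allowed if any scope on the token carries the operation's permission."""
    needed = PERMISSION_FOR_OPERATION[operation]
    return any(needed in LEGACY_SCOPE_PERMISSIONS.get(scope, set()) for scope in token_scopes)


# Operation-bound model (illustrative): each operation names the one scope it needs.
SCOPE_FOR_OPERATION: Dict[str, str] = {
    "POST /sign_requests": "hypothetical.sign_requests.write",
    "POST /retention_policies": "hypothetical.retention_policies.write",
}


def granular_allows(token_scopes: Set[str], operation: str) -> bool:
    """Allowed only if the token holds the scope bound to this operation."""
    return SCOPE_FOR_OPERATION[operation] in token_scopes
```

With the legacy tables, a token minted to upload files also passes the check for posting comments. With the operation-bound table, the answer to "which scope does this call need?" is a single lookup owned by the endpoint’s team.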

Orchestration

As Box has grown, our backend architecture has expanded into hundreds of microservices, making it increasingly important to properly route and hydrate API requests and responses. Previously, without an orchestration layer, we saw a proliferation of custom BFFs in order to interface with the necessary internal APIs. These internal APIs are spread widely across microservices and the monolith, and we have multiple mechanisms for connecting them. Now, in the Public API Service layer, we have a single point to enforce standards and clearly outline the components necessary in forming the public API response. We also properly format and return errors from downstream services, and we hope to eventually connect with GraphQL to provide an even cleaner hydration mechanism.
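As a rough illustration of what a single orchestration point does, the Python sketch below hydrates one public response from two internal calls and turns a downstream failure into a uniformly formatted error. The service names, field names, `DownstreamError`, and `build_file_response` are all hypothetical, and the real service is not written in Python.

```python
from typing import Any, Callable, Dict


class DownstreamError(Exception):
    """Hypothetical error raised by an internal service client."""

    def __init__(self, service: str, status: int, message: str) -> None:
        super().__init__(message)
        self.service = service
        self.status = status
        self.message = message


def build_file_response(
    file_id: str,
    fetch_file: Callable[[str], Dict[str, Any]],
    fetch_owner: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Call the internal APIs, hydrate the public response, format errors."""
    try:
        file = fetch_file(file_id)
        owner = fetch_owner(file["owner_id"])
    except DownstreamError as err:
        # One place decides how every downstream failure looks to the client.
        return {
            "status": err.status,
            "body": {"type": "error", "source": err.service, "message": err.message},
        }
    return {
        "status": 200,
        "body": {
            "id": file["id"],
            "name": file["name"],
            "owned_by": {"id": owner["id"], "name": owner["name"]},
        },
    }
```

Passing the fetchers in as arguments keeps the sketch self-contained and also makes the orchestration logic easy to test with fakes.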

Future

Beyond these initial aspects, we see the opportunity for incredible growth in the value proposition of the Public API Service. On the product side, we have increased buy-in on an executive level to push forward API development as a key piece of feature development. We are seeing increased interest in the developer experience rather than focusing solely on the user experience, and public APIs are a central tenet of the developer experience. We have formed the new API Management Team to continue spearheading the growth of public APIs at Box, and we have many ideas to build out a more robust service:

API Versioning: Without a versioning strategy, ending support for API fields is difficult, which leads to stagnancy in our APIs. We are investigating whether Public API Service is the appropriate place to implement a new versioning strategy.
Contract Testing: There is often a disconnect between internal feature development and our public APIs, and in order to prevent backend changes from breaking our public APIs and SDKs, we need to run contract tests that call out these misalignments.
Improved Error Messages: Our third-party developers often have to decode unclear error messages that link to generic documentation. Public API Service may act as a centralized location to add detail to our error messages and couple them more directly with useful documentation.
Paved Path migration from the monolith to microservices: As we move away from our monolithic architecture, we will need to migrate our APIs, and the Public API Service may simplify this process by providing a clean, paved migration with appropriate tooling and generated code.
Load Testing: To maintain quality as we move over some of our core APIs from the monolith into microservices, we may use load testing to provide key insight into the performance of our APIs.
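To illustrate the contract testing idea from the list above, here is a minimal Python sketch that checks a backend response against the fields a public API promises. It is only an assumption about how such a check could look: the `contract_violations` helper and the sample contract are hypothetical, and a real setup would more likely use a dedicated contract-testing or schema-validation tool.

```python
from typing import Any, Dict, List


def contract_violations(contract: Dict[str, type], response: Dict[str, Any]) -> List[str]:
    """List every way `response` breaks the public contract (empty list = OK)."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, got {actual}"
            )
    return problems


# Hypothetical public contract for a file resource.
FILE_CONTRACT = {"id": str, "name": str, "size": int}

# A backend change that renamed `size` would surface here before reaching SDKs.
print(contract_violations(FILE_CONTRACT, {"id": "1", "name": "a.txt", "bytes": 10}))
```

Run in CI against each backend build, a check like this flags a misalignment between internal changes and the public API before it ships.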

Conclusion: Why do we care about public APIs?

Public APIs drive forward both Box’s business and technical goals. Customers who build on Box Platform are stickier and buy into the Box ecosystem, and we improve and expand the features we can deliver when we create APIs. As we bring public API development into the planning and design of a new feature, we reduce the technical complexity of trying to build out a public API later on a system not engineered for it. We also ensure that our APIs meet the high standards of our clients and are properly documented and maintained. Box has grown from a small content storage company to a cloud content management platform, and as we grow in feature breadth, we need public APIs to help us deliver our vision.

Telling Our Stories: Public API Service was originally published in Box Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Telling our Stories: A look into the power of Women in Tech at Box

Shruti Mangipudi — Fri, 10 Mar 2023 18:02:27 GMT

Illustrated by Navied Mahdavian / Art directed by Erin Ruvalcaba Grogan

March marks the start of Women’s History Month, a time to recognize and celebrate the achievements and contributions of women throughout history. To celebrate this month, Box is dedicated to showcasing the crucial role of women in technology through various initiatives. Among these initiatives is a four-part series that focuses on the Women in Tech (WIT) Employee Resource Community (ERC) at Box.

This series highlights the inspiring work of the talented women leading this ERC and the impact they are making in shaping the future of work.

Let’s start from the beginning! What are the mission of Box’s Women in Tech (WIT) and the key pillars that guide the work, programs, and initiatives that members participate in?

The focus of Box’s Women in Tech (WIT) ERC is to provide support to women working in technology roles at Box, particularly in Engineering and Technical Operations. However, the group is open to anyone who is interested in contributing to the group’s mission of promoting gender equality and diversity in the tech industry.

We aspire to make Box the best company for women working in technology!

Four fundamental pillars that serve as the foundation of Box’s Women in Tech initiative.

The key pillars that guide the work, programs, and initiatives of Box’s Women in Tech are:

Community building: The ERC aims to foster a sense of community and belonging among women in technology, both within Box and in the broader tech industry. This involves organizing networking events, mentorship programs, and other opportunities for women to connect and support each other.
Career development: WIT is committed to helping women advance their careers in the tech industry by providing resources and support for skills development, leadership training, and career planning.
Advocacy and awareness: The group works to raise awareness of the challenges faced by women in technology and advocates for policies and practices that promote gender equality and diversity in the workplace.
Outreach and education: WIT emphasizes inspiring and educating young women about technology careers, as well as implementing initiatives to recruit and retain women in technical positions at Box.

What events or activities is Box WIT doing throughout the year?

WIT ERC plans and hosts a diverse range of events and activities throughout the year to encourage and empower women in the tech industry.

WIT collaborates with other ERCs to promote inclusivity and unity. Each year in March, Box’s Women Network (BWN) and Women in Tech (WIT) come together to celebrate Women’s History Month. In 2022, one such noteworthy event was the Mindfulness Session conducted by Theresa Nguyen, a holistic practitioner who imparted her knowledge on “How to Cure Burnout among Women”.

Here are some of her suggested techniques (or “prescriptions” as she calls them) that anyone can integrate into their daily routine.

WIT is also dedicated to supporting women-in-tech conferences, including the Grace Hopper Celebrations Conference, Women Impact Tech, and many more. In October 2022, WIT members represented Box at the Extraordinary Women In Tech Conference in San Francisco, CA, showcasing Box’s commitment to being an active part of the wider tech community.

Take a look at this snapshot of WIT members in action at the two-day eWIT conference.

Left to Right: (1) Shakun Mehrotra, Director — Cloud Management Platform; Sudha Lakshmisha, Senior Manager — Conversion & Security Services; Prasanthi Yarlagadda, Global Technical Operations Engineer; Shruti Mangipudi, Software Engineer — Storage Infrastructure & Tools; (2) Shakun with Heike Hiss, Senior Director — Recruiting.

In the past two years, Box has shifted to a hybrid working environment. How have WIT’s efforts pivoted to support a hybrid landscape?

In response to the shift to a hybrid work environment, WIT has adapted its programs and initiatives to support the unique needs of women in tech who may be working remotely, in a hybrid model or in-office.

Since the onset of the pandemic, WIT has prioritized virtual events and programming to ensure that all members can participate in networking, learning, and development opportunities, regardless of their location. This includes hosting virtual coffee chats, career panels, and book clubs that are accessible to everyone.

Here is a glimpse into a recent fun-filled virtual pottery master class that WIT organized for its members.

WIT strives to be more inclusive of EMEA members by scheduling events at various times to accommodate different time zones. One such successful event was the Mother’s Day Tea Party held in collaboration with our Family at Box (FAB) group.

Furthermore, WIT helps new hires (aka boxers) get familiar with Box’s culture through the “WIT Spotlight Series”. Members get to know each other better by watching a short video about a fellow member’s career journey, with photographs highlighting their experiences both inside and outside of Box.

Get a sneak peek into one of our spotlights featuring Tamar Bercovici, the VP of Engineering and backbone of the Women in Tech ERC at Box.

What are your hopes for Women in Tech in the future?

With the support of Box’s Diversity, Equity, and Inclusion (DEI) team, I am optimistic that the Women in Tech ERC will contribute to an increase in the number of women and underrepresented groups who enter and excel in leadership roles within the tech industry.

In addition, I envision WIT expanding its reach across the world to serve as a secure and empowering community for its members, enabling them to feel valued and supported as they strive to fulfill their potential. In February 2023, Box’s WIT made progress toward this objective by establishing Regional Leads in Poland for the Women in Tech EMEA Chapter.

In my view, a tech industry that is more diverse and inclusive will ultimately foster greater innovation, creativity, and success for all, not just within Box, but also beyond.

I would like to express my gratitude to the amazing individuals who have been a tremendous source of support throughout my WIT journey. Tamar Bercovici, for always being a strong advocate for WIT ERC and the pillar of its foundation at Box. Erin Ruvalcaba Grogan and Lucy Tran for providing constant guidance throughout the blog writing process. Riley Pirinelli, for being an exceptional partner-in-crime in leading the global co-chair responsibilities. And last but not least, thank you to the entire WIT Community at Box for showing so much love.

Telling our Stories: A look into the power of Women in Tech at Box was originally published in Box Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.