When Production Breaks: Debugging the Real World with Laravel Nightwatch
Throughout this series, I've covered debugging in development — dd(), Ray, Xdebug. These tools are fantastic when you're building features and squashing bugs on your local machine. But here's the uncomfortable truth:
Your local environment lies to you.
It lies about performance because you're the only user. It lies about data because your test data is clean and predictable. It lies about edge cases because you can't possibly imagine all the weird things real users will do.
Staging lies too. It's closer to reality, but it's still not reality.
Only production tells the truth. And production is where things get interesting.
The Problems That Only Show Up in Production
Let me describe a few scenarios I've encountered that no amount of local debugging could have caught:
The N+1 query that only appears with large accounts. In development, every user has 5 orders. In production, one customer has 47,000 orders. Suddenly that dashboard page takes 30 seconds to load.
The cache miss cascade. Your caching strategy works perfectly with moderate traffic. But at peak hours, cache invalidation causes a thundering herd that slams your database.
The third-party API timeout. The payment gateway responds in 200ms during testing. But once a week, at seemingly random times, it takes 8 seconds. Users think the checkout is broken.
The job that silently fails. A queued job throws an exception for a specific edge case in user data. The user never gets their email. No one notices until the support tickets pile up.
The slow query hiding in a polymorphic relationship. It runs fine for most models, but for one specific type that happens to have more records, it's a disaster.
These issues share something in common: you can't reproduce them locally. You need to see what's actually happening in production, with real traffic, real data, and real user behavior.
Enter Laravel Nightwatch
Laravel Nightwatch is Laravel's official production monitoring platform. It's not Telescope (which is for local debugging), and it's not Pulse (which shows aggregated metrics). Nightwatch gives you the complete picture of what's happening in production — every request, every query, every job, every exception — with the context you need to actually fix problems.
The Laravel team built it after realizing that existing APM tools weren't designed for Laravel: they speak in generic concepts and terminology. Nightwatch speaks Laravel. It understands requests, middleware, Eloquent queries, queued jobs, and scheduled commands, and it shows you information the way you think about your application.
When I first installed Nightwatch on a production app, I found issues within the first hour that I didn't know existed. Not because I'm a bad developer — because production has a way of surfacing things that testing can't.
Getting Started
Installation is almost trivial:
```bash
composer require laravel/nightwatch
```
Then add your token to .env:
```ini
NIGHTWATCH_TOKEN=your-token-here
```
And start the agent when you deploy:
```bash
php artisan nightwatch:agent
```
That's it. Nightwatch immediately starts collecting data. No complex configuration, no custom instrumentation, no learning a new query language. It just works.
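One practical note: the agent is a long-running process, so in a real deployment you'll want a process manager to keep it alive. A minimal Supervisor entry, assuming your app lives at /var/www/app (paths and user are placeholders to adjust for your setup):

```ini
[program:nightwatch-agent]
command=php /var/www/app/artisan nightwatch:agent
autostart=true
autorestart=true
user=www-data
```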
What Nightwatch Actually Shows You
Let me walk through how I use Nightwatch when investigating production issues.
The Dashboard: Your Application at a Glance
The first thing I see when I open Nightwatch is a health overview. At a glance, I can tell:
- How many requests my app is handling
- The error rate over time
- P95 response times (the 95th percentile — more useful than averages)
- Failed jobs and their frequency
- Slow routes that might need attention
These aren't just vanity metrics. When something goes wrong, this overview is where I notice it first. A spike in error rate, a sudden increase in response time: these are signals that something needs investigation.
Request Tracing: The Full Story
When a user reports "the page is slow" or "I got an error," the first thing I do is find that specific request in Nightwatch.
I can search by route, by user, by time window, by status code. Once I find the request, I see a timeline view that shows everything that happened:
- Middleware execution time
- Controller method execution
- Every database query with its duration
- Cache hits and misses
- Outgoing HTTP requests to external APIs
- Events dispatched
- Jobs queued
This is where issues become obvious. If a request took 3 seconds, I can see exactly why. Maybe there were 150 database queries (hello, N+1). Maybe the payment API took 2.8 seconds to respond. Maybe a cache miss triggered an expensive computation.
I'm not guessing anymore. I'm seeing exactly what happened.
The N+1 Problem in Production (Even With Protections)
Since Laravel 8.43, we've had built-in N+1 detection. In my AppServiceProvider, I have this:
```php
// app/Providers/AppServiceProvider.php

use Illuminate\Database\Eloquent\Model;

public function boot(): void
{
    // Strict mode everywhere except production
    Model::preventLazyLoading(! app()->isProduction());
    Model::preventSilentlyDiscardingAttributes(! app()->isProduction());
    Model::preventAccessingMissingAttributes(! app()->isProduction());
}
```
This is fantastic for catching N+1 issues during development. If I forget to eager load a relationship, Laravel throws an exception immediately. I fix it before it ever gets committed.
But notice that condition: !app()->isProduction().
We disable these protections in production because throwing exceptions for lazy loading would break the app for users. So the safety net disappears exactly where it matters most.
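There is a middle ground worth knowing about: Laravel lets you keep detection enabled in production while reporting violations instead of throwing, via Model::handleLazyLoadingViolationUsing(). A minimal sketch:

```php
use Illuminate\Database\Eloquent\Model;
use Illuminate\Support\Facades\Log;

public function boot(): void
{
    // Detect lazy loading everywhere...
    Model::preventLazyLoading();

    // ...but in production, log the violation instead of breaking the request.
    if (app()->isProduction()) {
        Model::handleLazyLoadingViolationUsing(function ($model, string $relation) {
            Log::warning('Lazy loading violation', [
                'model' => get_class($model),
                'relation' => $relation,
            ]);
        });
    }
}
```

Even then, a logged violation only helps if someone reads the logs, which is exactly the gap production monitoring fills.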
And here's the thing — some N+1 issues slip through anyway:
Issues from packages. That admin panel package you installed? It might have its own lazy loading issues that don't trigger during your testing but explode with real data.
Issues hidden by small datasets. Your factory creates 3 related records. The N+1 runs so fast you don't notice the exception in your test output. In production, a user has 500 related records, and now it matters.
Issues in edge case code paths. You tested the happy path. The error handling path that only runs when a third-party API fails? That has an N+1 you never saw.
Conditional relationship loading. Your code does $user->orders inside a conditional that's rarely true locally but frequently true in production.
Let me give you a real example. I had a dashboard route that was reported as slow by customers. Locally, it loaded in 200ms. I had preventLazyLoading enabled. No exceptions. Everything looked fine.
In production, for some users, it was taking 8-10 seconds.
In Nightwatch, I found one of those slow requests and looked at the query log. There were 847 queries. For a single page load.
The timeline showed me exactly what was happening:
```text
QUERY  1.2ms  SELECT * FROM projects WHERE user_id = 42
QUERY  0.8ms  SELECT * FROM tasks WHERE project_id = 1
QUERY  0.9ms  SELECT * FROM tasks WHERE project_id = 2
QUERY  0.7ms  SELECT * FROM tasks WHERE project_id = 3
...
(840 more queries)
```
How did this happen with preventLazyLoading enabled? The tasks relationship was eager loaded, and the call to $project->tasks->where('status', 'pending') inside a Blade component filters that already-loaded collection in memory, costing no queries. The real culprit was a computed property rendered for every project in the loop, which touched a different relationship I had never eager loaded.
The local dataset had 5 projects with 2-3 tasks each. The total overhead was milliseconds — below the threshold where I'd notice or investigate. In production, a power user had 200 projects with dozens of tasks each. The problem scaled.
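A reconstruction of the pattern, with hypothetical names (milestones stands in for the relationship I had missed):

```php
// Hypothetical accessor on the Project model.
// tasks was eager loaded, so filtering it costs nothing, but
// $this->milestones was not: rendering each project card fired
// one extra query per project.
public function getProgressSummaryAttribute(): string
{
    $done = $this->milestones->where('completed', true)->count();

    return "{$done} of {$this->milestones->count()} milestones complete";
}
```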
Nightwatch didn't just show me there were too many queries — it showed me exactly which code was responsible, which route triggered it, how often it happened, and which users were affected.
The fix was adding a couple of with() calls and restructuring that computed property. Deployed. Problem solved. Total debugging time: about 10 minutes.
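In sketch form, again with hypothetical names, the heart of the fix was widening the eager load to cover everything the view actually touches:

```php
// Before: only tasks was eager loaded
$projects = $user->projects()->with('tasks')->get();

// After: the missed relationship rides along in the same query pass
$projects = $user->projects()->with(['tasks', 'milestones'])->get();
```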
The lesson here isn't that preventLazyLoading is useless — it's incredibly valuable and catches most issues before they ship. But it's a development-time safety net, not a production monitoring solution. You need both.
Finding Cache Issues
Caching problems are notoriously hard to debug because they depend on timing, traffic patterns, and data state. Nightwatch tracks every cache hit and miss.
On one project, I noticed response times were spiking at predictable intervals. Looking at the cache metrics, I saw a pattern: cache hit rate would be 95%, then suddenly drop to 10%, then slowly climb back up.
Digging into the requests during those spikes, I found the issue. A scheduled command was running every hour and invalidating a large portion of the cache. Then the next wave of requests all hit the database simultaneously, causing a thundering herd.
The fix was to warm the cache in the scheduled command instead of just invalidating it. But I never would have connected those dots without seeing the cache metrics correlated with request timing.
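The shape of that fix, as a hedged sketch with hypothetical names: recompute inside the command and overwrite the key, so no request ever sees a cold cache.

```php
use Illuminate\Support\Facades\Cache;

// Inside the hourly scheduled command's handle() method.
public function handle(): void
{
    // Before: Cache::forget('dashboard.stats') left every request
    // racing to rebuild the value at the same moment.

    // After: do the expensive work here, then atomically overwrite.
    $stats = $this->computeExpensiveStats(); // hypothetical helper

    Cache::put('dashboard.stats', $stats, now()->addHours(2));
}
```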
Debugging Failed Jobs
Jobs fail silently. The user doesn't see an error page — they just don't get their email, or their export doesn't appear, or their payment doesn't process.
Nightwatch tracks every job execution. When a job fails, I see:
- The exception and stack trace
- The job payload (what data was passed to it)
- How long the job ran before failing
- How many times it was attempted
- Which queue worker processed it
For one app, I discovered a job that was failing about 2% of the time. The exception was a database deadlock. Looking at the timing, these failures always happened during peak hours when multiple workers were processing similar jobs.
Without Nightwatch, I might have noticed eventually from support tickets. With Nightwatch, I caught it in the first week and fixed the underlying concurrency issue.
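Deadlocks are usually transient, so alongside the real concurrency fix, it's cheap insurance to let the job retry with increasing backoff. A sketch with a hypothetical job class:

```php
use Illuminate\Contracts\Queue\ShouldQueue;

class SyncInventory implements ShouldQueue
{
    // Attempt the job up to three times before marking it failed
    public int $tries = 3;

    // Wait longer between retries so colliding workers spread out
    public function backoff(): array
    {
        return [5, 30, 120]; // seconds
    }
}
```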
Tracking Third-Party API Issues
Modern apps depend on external services. Payment gateways, email providers, geocoding APIs, social logins. When these services have issues, your app has issues.
Nightwatch tracks outgoing HTTP requests. I can see:
- Which external APIs my app calls
- Response times (average, P95, P99)
- Error rates by endpoint
- Timeout frequency
On one project, I noticed the checkout flow had inconsistent response times. Some requests completed in 500ms, others in 6 seconds. Looking at the outgoing request logs, I found the culprit: the payment gateway had periodic latency spikes.
Armed with this data, I:
- Added a shorter timeout to fail fast instead of making users wait
- Implemented retry logic for transient failures
- Added a circuit breaker for extended outages
- Had a conversation with the payment provider about their SLA
The data from Nightwatch gave me both the diagnosis and the ammunition for that conversation.
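The first two of those changes are only a few lines with Laravel's HTTP client. A sketch, with a hypothetical gateway URL:

```php
use Illuminate\Support\Facades\Http;

$response = Http::timeout(3)  // fail fast instead of hanging for 8 seconds
    ->retry(2, 200)           // two retries, 200ms apart, for transient failures
    ->post('https://gateway.example.com/v1/charges', $payload);
```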
Exceptions Grouped into Issues
Raw exception logs are noisy. The same error might happen thousands of times across thousands of requests. Nightwatch intelligently groups related exceptions into "issues."
Instead of seeing 5,000 individual ModelNotFoundException entries, I see one issue with:
- How many times it occurred
- First and last occurrence
- Which users were affected
- Which routes triggered it
- Trending (is it getting worse or better?)
I can work on an issue, mark it as resolved, and get notified if it resurfaces. This transforms error tracking from "scan the logs and hope you notice patterns" to "here are the five issues you need to fix, prioritized by impact."
User Journey Tracking
Sometimes the most valuable view isn't a single request — it's a user's journey through your app.
Nightwatch lets me see all requests from a specific user. If a customer writes in saying "I tried to checkout but something went wrong," I can:
- Find their user ID
- See every request they made in the last hour
- Find the exact moment something failed
- See what happened before and after
This turns "I got an error" support tickets into "here's exactly what happened" investigations.
What Production Reveals That Development Hides
After using Nightwatch across several projects, I've developed a list of issues that reliably appear only in production:
Data volume issues: Your 50-record test database doesn't expose queries that become slow with 5 million records.
Concurrency issues: Locally, you're one user. In production, 100 users might hit the same endpoint simultaneously, revealing race conditions and deadlocks.
Cache timing issues: Cache expiration and regeneration patterns only manifest under real traffic patterns.
External service issues: Third-party APIs behave differently under load and have their own outages and slowdowns.
Edge cases in user data: Real users have names with special characters, email addresses from obscure providers, browsers from 2015, and use your app in ways you never imagined.
Memory and resource constraints: Your laptop has 32GB of RAM. Your production container has 512MB. That difference matters.
Nightwatch doesn't prevent these issues. But it makes them visible, and visibility is the first step to fixing them.
Performance Thresholds and Alerts
One feature I particularly appreciate is custom performance thresholds. I can define rules like:
- Alert me if any request takes longer than 5 seconds
- Alert me if the /api/checkout endpoint exceeds 2 seconds
- Alert me if job failure rate exceeds 1%
- Alert me if error rate spikes 50% above normal
These alerts can go to Slack, so I can find out about problems before users start complaining.
This is the difference between proactive and reactive debugging. Instead of waiting for the support ticket, I'm investigating before the user even notices.
Sampling for High-Traffic Apps
If your app handles millions of requests per day, you don't necessarily need to capture every single one. Nightwatch supports sampling:
```ini
NIGHTWATCH_REQUEST_SAMPLE_RATE=0.1
```
This captures 10% of requests — enough to see patterns and catch issues, without overwhelming your event quota.
You can also apply different sampling rates to different routes. Maybe you sample 100% of your checkout flow (critical) but only 5% of your health check endpoint (noise).
```php
use Laravel\Nightwatch\Http\Middleware\Sample;

Route::get('/checkout', [CheckoutController::class, 'show'])
    ->middleware(Sample::rate(1.0));   // always capture the critical flow

Route::get('/health', fn () => 'OK')
    ->middleware(Sample::rate(0.05));  // sample 5% of the noise
```
The Real Benefit: Confidence
The best thing about having proper production monitoring isn't catching bugs — it's the confidence it gives you.
Before Nightwatch, deploying to production felt like sending code into the void and hoping for the best. Now I deploy and watch. I see traffic flowing, response times staying stable, no new exceptions popping up. I know within minutes if something is wrong.
That confidence changes how I work. I deploy more frequently because I trust I'll catch issues quickly. I make bolder changes because I have visibility into their impact. I sleep better because I know I'll be alerted if something breaks at 3 AM instead of finding out from angry users at 9 AM.
The Debugging Workflow Evolution
This series started with dd() — the simplest possible debugging tool. We progressed through Log statements, Ray, and Xdebug, each giving us more visibility and control during development.
Nightwatch completes the picture. It's what happens after your code leaves your machine and enters the real world.
Here's how I think about the full debugging toolkit now:
| Phase | Tool | What It Gives You |
|---|---|---|
| Quick check | `dd()` | Instant value inspection |
| Development flow | Ray | Non-breaking visibility |
| Deep investigation | Xdebug | Step-by-step control |
| Production truth | Nightwatch | Real-world visibility |
Each tool has its place. Together, they cover the entire lifecycle of your code, from the first line you write to the millionth request in production.
Where to Start
If you're not monitoring your production Laravel apps with purpose-built tooling, you're flying blind. You might be lucky and nothing breaks. But when something does break, you want to find out from your monitoring dashboard, not from your users.
Nightwatch is free to start. Install it on one app, watch the data flow in, and see what you discover. I suspect, like me, you'll find issues in the first hour that you didn't know existed.
Don't be afraid of the dark. Shine a light on it.