We are currently investigating a DDoS attack against our mail servers.
Resolved
Sep 24 at 11:27am EDT
Post Mortem
(also posted to improvmx.com/blog/2025-09-23-post-mortem)
Summary
A spike in connections overloaded our SQL server and triggered a latent performance degradation in SQL lookups, delaying mail delivery until the performance issue was mitigated.
Timeline
11:30AM EDT: Incoming mail connections spike to 30x the steady-state rate, increasing load on our servers.
11:49AM [Downtime Begins]: Emails begin to back up and deliveries are delayed.
11:58AM [First Alert Fired]: Automated alerting notifies Matthew Tse, who is on call, of a deliverability issue with Microsoft.
11:59AM [First Responder Signs On]: Matthew Tse signs on and begins investigating
12:08PM [Customers Alerted]: Matthew Tse posts an incident to status page
12:42PM [Mitigation Attempted]: Matthew Tse pushes code that increases logging to track offending users, and also increases the connection limit on front door mail requests
01:34PM [Mitigation Attempted]: Matthew Tse unlocks the max SMTP autoscaled server limit
02:00PM [Recovery Begins]: The SQL servers begin to handle the load, and emails begin being delivered again, though with delays.
02:10PM [Mitigation Attempted]: Matthew Tse adds additional logging to root-cause the dropped front door messages. We discover that a SQL connection pool overload error is being emitted constantly.
04:27PM [Mitigation Attempted]: Matthew Tse pushes async connection pool optimizations to decrease load on SQL servers
07:10PM [Mitigation Attempted][Recovery Complete]: Matthew Tse pushes further async connection pool optimizations that fully eliminate the SQL error.
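For readers curious what bounding load on a SQL backend can look like, here is a minimal sketch. It is not the code we deployed. It assumes an asyncio-based service, and `BoundedLookups`, `fake_lookup`, and `simulate_spike` are hypothetical names invented for illustration. The idea is that when connections spike, lookups beyond a fixed cap wait their turn instead of each opening a new database connection.

```python
import asyncio


class BoundedLookups:
    """Caps concurrent SQL lookups. Hypothetical helper for illustration only."""

    def __init__(self, max_concurrent: int) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self.in_flight = 0  # lookups currently running
        self.peak = 0       # highest concurrency observed

    async def run(self, lookup):
        # Callers beyond the cap wait here instead of hitting the database.
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await lookup()
            finally:
                self.in_flight -= 1


async def fake_lookup() -> str:
    # Stand-in for a real SQL query (assumption: roughly 10 ms per lookup).
    await asyncio.sleep(0.01)
    return "ok"


async def simulate_spike(requests: int, cap: int) -> int:
    """Fire `requests` lookups at once and return the peak concurrency seen."""
    pool = BoundedLookups(cap)
    await asyncio.gather(*(pool.run(fake_lookup) for _ in range(requests)))
    return pool.peak
```

Calling `asyncio.run(simulate_spike(300, 10))` fires 300 simulated lookups at once while never running more than 10 concurrently. The trade-off is added queueing latency during a spike in exchange for a predictable load on the database.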
Action Items
IMX-1337: Audit all Python SQL connection pool logic across all clients to ensure the thundering herd issue doesn't happen again.
IMX-1338: Audit all SQL limit changes made during the incident and make sure they persist across server reboots.
IMX-1339: Add metrics tracking the number of connections made to our SQL database, so that this issue surfaces immediately in the future.
IMX-1340: Add metrics tracking the number of mail rejections caused by unhandled SQL connection issues. This should further improve the speed and reliability of delivery.
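As a rough illustration of the metrics described in IMX-1339 and IMX-1340, the sketch below keeps in-process counters for open SQL connections and for mail rejections by reason. It is an assumption-laden example, not our implementation: `MailMetrics`, the metric names, and the alert threshold are all hypothetical, and a production setup would export these values to a monitoring system rather than hold them in memory.

```python
from collections import Counter


class MailMetrics:
    """In-memory counters. Hypothetical names, for illustration only."""

    def __init__(self, connection_alert_threshold: int) -> None:
        self.counts = Counter()
        self.connection_alert_threshold = connection_alert_threshold

    def sql_connection_opened(self) -> None:
        self.counts["sql_connections_open"] += 1

    def sql_connection_closed(self) -> None:
        self.counts["sql_connections_open"] -= 1

    def mail_rejected(self, reason: str) -> None:
        # One counter per rejection reason, e.g. "sql_pool_exhausted".
        self.counts[f"mail_rejected.{reason}"] += 1

    def should_alert(self) -> bool:
        # Alert once open connections reach the configured threshold.
        return self.counts["sql_connections_open"] >= self.connection_alert_threshold


# Usage: three open connections against a threshold of three triggers an alert.
metrics = MailMetrics(connection_alert_threshold=3)
for _ in range(3):
    metrics.sql_connection_opened()
metrics.mail_rejected("sql_pool_exhausted")
```

Tracking rejections by reason would let a dashboard separate SQL-related failures from ordinary rejections such as invalid recipients.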
We apologize for the downtime in our services. This incident has given us several key learnings that will improve our reliability going forward. If you have any questions, feel free to reach out to me at matthew@improvmx.com.
Affected services
Updated
Sep 23 at 07:13pm EDT
We have fully root caused and mitigated the issue. Services should be returning to 100%.
We will post a full postmortem and RCA tomorrow.
Updated
Sep 23 at 02:08pm EDT
We have found an initial mitigation for the issue and are bringing services back up now.
Email forwarding and SMTP sending should begin recovering.
We will continue to post updates.
Created
Sep 23 at 11:42am EDT
Mail delivery may be delayed while we mitigate the issue.