Email Delivery Degradation
Started 17 Aug at 06:09am CEST, last updated 18 Aug at 02:37pm CEST.
ImprovMX recently experienced a degradation of its email forwarding service. An unexpected update to a server OS package on AWS broke our system's ability to automatically scale instances as demand increased.
Resolution and Recovery:
Our dedicated team responded quickly with a two-step plan:
Quick Fix: We redeployed our servers with a targeted fix to the issue caused by the unexpected update, enabling us to resume our service quickly.
Final Resolution: To ensure that such an incident does not recur, we revamped our deployment strategy. Instead of launching new server instances with a fresh OS and installing everything at startup, we now deploy them with everything already installed. This approach eliminates nearly all deployment issues and also speeds up deployment, so auto-scaling responds much faster whenever demand for our service increases.
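For readers curious why a pre-installed deployment avoids this class of failure, here is a minimal sketch. It is only an illustration and not our actual deployment code: the package name, version numbers, and all function names are hypothetical. It models a package repository whose "latest" version changes between two points in time, and compares an instance that installs packages at launch with one launched from a snapshot taken earlier.

```python
from dataclasses import dataclass

# Hypothetical package index: the "latest" version of each package at a
# given time step. The package name and versions are made up.
REPO_HISTORY = {
    0: {"os-agent": "1.4.0"},
    1: {"os-agent": "1.5.0"},  # an unexpected upstream update lands here
}


@dataclass(frozen=True)
class Instance:
    launched_at: int
    packages: dict


def launch_fresh(now: int) -> Instance:
    """Fresh-OS strategy: packages are resolved from the repo at launch time."""
    return Instance(launched_at=now, packages=dict(REPO_HISTORY[now]))


def build_image(now: int) -> dict:
    """Snapshot the package versions once, at image build time."""
    return dict(REPO_HISTORY[now])


def launch_from_image(image: dict, now: int) -> Instance:
    """Pre-installed strategy: every instance gets the versions in the image."""
    return Instance(launched_at=now, packages=dict(image))


image = build_image(0)                      # built and tested before the update
fresh_before = launch_fresh(0)              # launched before the update
fresh_after = launch_fresh(1)               # launched after the update
baked_after = launch_from_image(image, 1)   # launched after, from the image

print(fresh_after.packages)  # picks up the new, untested version
print(baked_after.packages)  # still runs the version captured in the image
```

In this sketch, an instance launched fresh after the upstream change silently picks up the new version, while an instance launched from the image keeps the versions that were captured when the image was built, regardless of when it is launched. Skipping the install step at launch is also why scaling up gets faster.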
We have learned from this incident and made significant improvements to our systems. We are more robust and resilient than ever before, and we remain committed to offering seamless service.
Transparency and Trust:
At ImprovMX, we believe in honest communication and accountability. We understand the importance of trust, and we assure you that we have taken all necessary steps to prevent such an occurrence in the future.
Thank you for your continued support and trust in ImprovMX.
Email forwarding is now resuming. Because a large number of servers are retrying delivery at the same time, we are forwarding more slowly than usual, but we are actively monitoring our system to ensure things return to normal levels.
We'll share the post mortem here when it is ready.
A fix has been implemented and we are monitoring the results.
The issue has been identified and we are working on fixing it.
We are experiencing a temporary degradation of email delivery.
Our team is currently conducting a detailed investigation to identify and resolve this matter. Please be assured that this situation is being treated with the highest priority, and we're committed to restoring optimal performance as quickly as possible.
UPDATE: The issue has been resolved, and the post mortem is now available.