SnapAV's CIO, Joe Topinka, outlines the OvrC outage that took place on January 17th, the lessons learned, and the next steps.
There's never a good time to have an outage, and this week's event with OvrC was no exception.
You may have noticed intermittent problems with devices connecting to and disconnecting from the OvrC Cloud. Our issue was with the way devices were connecting to our servers on Amazon's AWS platform.
There is nothing more stressful for me and for our team than when we don't deliver a positive experience to you and your customers. So, I want to apologize to you personally for any impact we may have caused, and share more details about what happened. That way, you have insight into how we're preventing similar situations from recurring.
As dealers, you have many choices for Remote Management Systems, and it is the OvrC team's job to earn and keep your trust.
Before we get to Monday's outage, let me talk a little about how OvrC is built, so you understand how the platform has been constructed.
As a solution, OvrC was designed to be scalable and fully redundant. In the past year or so, the growth of devices, locations, and dealers on our platform has exploded: we doubled the number of devices and locations we manage, and have added thousands of new devices and locations per month. Because OvrC is horizontally scalable, it adds servers and services to match that growth in devices, locations, and end-users.
OvrC also operates on approximately 150 servers in three redundant AWS environments, each with the ability to operate the full platform. Our servers are architected to handle dedicated functions. For example, specific servers manage device connections to the OvrC cloud, while other servers manage communication between servers, user notifications, event logs, and more. The system is designed to routinely handle hundreds of thousands of events and messages every hour.
Finally, we use Amazon's Elastic Load Balancers, which help ensure that we can scale dynamically as we grow. We also use what we refer to as "bank-level security" to keep your data and your customers' data safe and secure. We continuously invest in the platform to make improvements and rectify potential issues.
Around 6:15 p.m. on Monday, January 17, our OvrC operations team was alerted to anomalies on our OvrC device servers. Keep in mind that our monitoring services never sleep: they operate 24/7, 365 days a year.
After our team triaged the problem, we placed a banner on the OvrC Status page and informed Technical Support about the outage. An assembled "SWAT team" worked through the night to pinpoint the elusive problem. Together, we made several attempts to address the major issues and resolve the outage. We also called on our partners at Amazon, as well as our digital security partner Fortalice, to make sure the platform remained secure and safe.
Collaborating with operations, our server partners, and other groups, our team worked non-stop through the night and into the evening on Tuesday to resolve this problem.
The main issue was finally identified between the Elastic Load Balancers (ELBs) and our OvrC device servers. Once we isolated that problem, we implemented configuration changes to correct the way the ELBs and the OvrC device servers communicated with each other, then deployed the fix and restarted OvrC services.
Gradually, as we monitored the system, we watched devices, locations, and customer services return to normal. As we would after any event like this, the team remained especially vigilant over the next 48 hours, monitoring for and responding to any residual effects of the incident.
Problems happen. When they do, we try to learn as much as we can, with the sole purpose of improving our performance and solidifying our platform and our processes. Following an event, we set aside formal time to perform what we call a retrospective, and examine every aspect of our performance. In these sessions, there are no sacred cows. We seek to understand what didn't go well and what did go well, while discussing customer communications, issue escalation, partner performance, and our own team's performance. We have our initial list, but will perform a deeper retrospective on the outage this week. I promise to share any relevant information from that with you in the very near future.
These days, we have become accustomed to system outages from many brand-name companies in almost every segment of the market. I can't promise that OvrC will never have another one, and if I did, I know you wouldn't believe me.
What I can promise is that we will communicate any issues to you quickly to make sure you are not left helpless or feeling like you're "in the dark" when your customers ask questions. We work very hard to make sure that events like this week's outage are rare. However, when they do happen, we pledge to share as much information as we can on the OvrC Status page (status.ovrc.com), and to brief our Technical Support team with the latest information.
We are truly sorry for the impact this outage caused you and your teams this week, and we are committed to communicating any updates or issues to you as often as we can.
We are confident in our OvrC platform, our team, and our partners. We will continue to make significant investments in OvrC to keep the platform scalable and secure. You have my commitment that we will take the learnings from this event to make OvrC even better.
Thank you for understanding, and thanks for your business. Please feel free to reach out to me with your thoughts, comments, or suggestions.
All the very best,
Joe Topinka
Chief Information Officer, SnapAV