Recently, we announced the opening of our Duck Creek OnDemand Operations Center, a new regional office in Chicago-Rosemont, Illinois and Duck Creek’s primary hub for the services and support we provide for our software-as-a-service (SaaS) offering. Earlier this month, I was able to visit and tour our SaaS operations center and see what my colleagues are up to. I’d like to highlight a couple of the things that the Duck Creek OnDemand Services team is doing on a day-to-day basis to proactively identify issues prior to their impacting our customers’ deployments, resolve active issues quickly, and provide transparent communication.
When you walk into the Rosemont office, the OnDemand Operations Center stands out: an isolated room with blinds drawn, a secure door, and three walls covered in large TV monitors that display operational and security metrics. Desks are arranged in long rows, reminiscent of a NASA mission control center. So what exactly is happening in this room, and what does a day in the life look like for the Duck Creek staff who sit there?
Securing the Future
Whether our customers are in the midst of launching new insurance products, making rate changes, or transforming their claims processes, our “eyes on the glass” team is watching the TV monitors, and additional OnDemand operations teams in other offices are there to make sure our customers are supported 24/7 and able to focus on innovating faster. These teams are often invisible to our customers’ end users – unless an issue pops up and the team emerges to rectify it. Let’s start by looking at what our Security teams are doing.
At Duck Creek, we take security very seriously, and with OnDemand, keeping our customers’ data safe and secure is of paramount importance. A security engineer who sits in the OnDemand Operations Center begins the day by opening a daily checklist and reviewing the security screen displayed on one of three TV monitors. There they see the threat monitoring dashboard, which contains a time series chart displaying both alerts that might require immediate action and summaries of other signals that the team will parse through. On a monthly basis, we run vulnerability scans of all of our customers’ virtual machines and environments, identifying and patching systems accordingly.
Security engineers also constantly monitor dark web chatter and are alerted to mentions of potential threats to Duck Creek and our customers. The real-time security intelligence we receive is one part of how OnDemand is able to stay ahead of potential bad actors and take preventative measures (whether that be calling our customers or notifying corporate security) to stop attacks before they start.
Our security engineers are held to the highest of standards; in fact, our in-house activity is also monitored – if anything is done even slightly outside of established protocol, the matter is investigated internally. The bottom line is that we recognize our critical responsibility of safeguarding our customers’ data, and we’ve taken a number of steps to ensure its protection.
Monitoring and Alerting
Our Monitoring & Alerting and Triage teams are core to our customer responsiveness efforts, as they are the first line of defense that helps our customers run their businesses as close to uninterrupted as possible. Using tools that collect system telemetry data from our customers’ instances of Duck Creek applications, infrastructure, and databases, the team is able to respond quickly. Think of telemetry data like a car’s check engine light or tire pressure monitoring system – we’ve deployed sensors across our SaaS offering so that we can both obtain early warning signs of any potential issue and take corrective action before a customer experiences any service degradation. Examples of telemetry data continuously being collected and reviewed every day include:
- Failure rate – essentially, how often a certain action fails over a given time period. For example, if 100 quote generations were attempted in the past hour and one of them failed, the failure rate would be 1%. Our teams can slice and view the data over a number of different time intervals. Sometimes, there are legitimate reasons for a failure – the key is to identify when the failure rate exceeds the ordinary.
- Bounce rate – in the context of SaaS operations, this measures how often a page in one of our applications is running but a user is kicked out or the page crashes.
- End user satisfaction – this is an automated score reporting page performance in a given application. For instance, the score reflects cases where we know that a page in our Claims application should load in x seconds, but it is taking y seconds to load.
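To make the failure rate example above concrete, here is a minimal Python sketch of the arithmetic. It is an illustration only, not a description of Duck Creek's tooling: the `(timestamp, succeeded)` event shape, the function names, and the `tolerance` rule for deciding when a rate "exceeds the ordinary" are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def failure_rate(events, window_start, window_end):
    """Return the fraction of attempts that failed within [window_start, window_end).

    `events` is a list of (timestamp, succeeded) tuples, a hypothetical
    stand-in for real telemetry records.
    """
    in_window = [ok for ts, ok in events if window_start <= ts < window_end]
    if not in_window:
        return 0.0
    failures = sum(1 for ok in in_window if not ok)
    return failures / len(in_window)


def exceeds_ordinary(rate, baseline, tolerance=2.0):
    """Flag a rate above `tolerance` times the usual baseline (an assumed rule)."""
    return rate > baseline * tolerance


# Example from the text: 100 quote attempts in the past hour, one failure.
now = datetime(2020, 1, 1, 12, 0)
events = [(now - timedelta(seconds=30 * i), i != 0) for i in range(100)]
rate = failure_rate(events, now - timedelta(hours=1), now + timedelta(seconds=1))
print(f"{rate:.1%}")  # 1.0%
```

Changing `window_start` and `window_end` is the code equivalent of slicing the same data over different time intervals.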
These teams are also responding both to automated alerts generated when our tools flag and notify us about issues, and to customer support tickets. If our teams start observing warning signs of a potential issue, or if there is a customer issue occurring, we employ several different tactics to identify the root cause and rectify it. For example, one technique is synthetic monitoring – this entails writing a script that is essentially setting up a robot user which can be used to trigger events within our clients’ live production environments (e.g., policy issued, claim filed) and observe what’s occurring.
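As a rough illustration of the synthetic monitoring idea, the sketch below runs a scripted sequence of "robot user" steps and records whether each one succeeded and how long it took. Everything here is hypothetical: the step functions are placeholders rather than calls into any real Duck Creek application, and a production script would drive the actual user interface or API instead.

```python
import time


def run_synthetic_check(steps):
    """Run each (name, action) step in order and record outcome and duration.

    Each action is a zero-argument callable standing in for a scripted user
    action (for example "issue policy"); a raised exception counts as a failure.
    """
    results = []
    for name, action in steps:
        started = time.perf_counter()
        try:
            action()
            ok, error = True, None
        except Exception as exc:  # report the failed step instead of crashing the run
            ok, error = False, str(exc)
        elapsed = time.perf_counter() - started
        results.append({"step": name, "ok": ok, "seconds": elapsed, "error": error})
        if not ok:
            break  # later steps usually depend on earlier ones
    return results


# Hypothetical stand-ins for real scripted actions against an application.
def issue_policy():
    pass  # pretend the policy was issued successfully


def file_claim():
    raise RuntimeError("claim page did not load")


report = run_synthetic_check([("issue policy", issue_policy), ("file claim", file_claim)])
for row in report:
    print(row["step"], "ok" if row["ok"] else f"FAILED: {row['error']}")
```

Stopping at the first failed step keeps the report focused on where the scripted journey broke, which is usually the most useful starting point for root cause analysis.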
While less frequently utilized, another technique is session replay, wherein after an issue has been confirmed, operations staff can replay historic user sessions so that we can observe firsthand exactly what the user saw on their screen along with what they entered and clicked. Since this technique involves recording video of a customer’s screens, which contain personally identifiable information, a legal agreement must be in place.
Preparing and Responding to the Unexpected
Our Monitoring & Alerting and Triage teams also work to determine the severity of an incident – from a minor, limited issue to the rare Severity 1 (Sev 1) affecting a significant portion of a customer’s users. These teams review issues and route them to the appropriate team, whether that is our formal triage team (applications) or our infrastructure, databases, or third-party integrations teams, to be further assessed and resolved. These teams have close access to other relevant teams across the organization, so if a systemic issue in our code is identified, they can connect with our engineering and product teams to fix it quickly.
If a Sev 1 incident occurs, it’s all hands on deck – customers are notified within 15 minutes, and if we haven’t resolved an issue within 30 minutes, we’ll start a “Bridge call.” In these instances, our teams start a conference call and pull in all relevant Duck Creek staff, and in certain cases, our customers. These calls, which are led by an “Incident Commander,” mobilize resources across all departments and remain open until we’ve resolved the incident and restored our customers’ operations.
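The two Sev 1 timers described above (notify within 15 minutes, open a bridge call if the incident is still unresolved at 30 minutes) can be expressed as a small check. This is a simplified sketch under assumptions: the function name, its parameters, and the action labels are invented for illustration, and a real incident process involves far more than two timers.

```python
from datetime import datetime, timedelta

NOTIFY_WITHIN = timedelta(minutes=15)  # customer notification window from the text
BRIDGE_AFTER = timedelta(minutes=30)   # unresolved time before a bridge call starts


def sev1_actions(detected_at, now, customer_notified, resolved):
    """Return the actions due for a Sev 1 incident at time `now`."""
    actions = []
    age = now - detected_at
    if not customer_notified:
        # Past the window, the notification is late rather than merely pending.
        actions.append("notification overdue" if age >= NOTIFY_WITHIN else "notify customer")
    if not resolved and age >= BRIDGE_AFTER:
        actions.append("start bridge call")
    return actions


detected = datetime(2020, 1, 1, 9, 0)
print(sev1_actions(detected, detected + timedelta(minutes=5), False, False))
# ['notify customer']
print(sev1_actions(detected, detected + timedelta(minutes=35), True, False))
# ['start bridge call']
```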
With our highest service level agreement (SLA) guaranteeing 99.9% availability, our OnDemand Services team is continuously striving to ensure business continuity and provide the highest level of application and infrastructure support to our customers, so that they can have the speed and agility they need to react to market opportunities. Whether it’s capturing application telemetry data that provides health metrics of our customers’ deployments or monitoring dark web conversations, the team combines tools and people to proactively identify issues prior to customer impact, rapidly remediate known issues, and provide timely and transparent customer communications.
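For a sense of scale, a 99.9% availability figure can be turned into a downtime budget with simple arithmetic. The sketch below is generic: the source does not say how the SLA is measured, so the 30-day and 365-day periods, and the absence of any exclusions such as planned maintenance, are assumptions.

```python
def downtime_budget_minutes(availability, period_days):
    """Minutes of downtime allowed in a period at the given availability fraction."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability)


# 99.9% over an assumed 30-day month and an assumed 365-day year.
print(round(downtime_budget_minutes(0.999, 30), 1))        # 43.2 (minutes)
print(round(downtime_budget_minutes(0.999, 365) / 60, 2))  # 8.76 (hours)
```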
As the P&C insurance industry continues to adopt SaaS core systems, our OnDemand Operations Center team and dedicated experts elsewhere are poised to enable our customers to focus on innovation.