There are many components to a domain registry system. For example, we operate an EPP service which allows our many registrars to register and manage domains on behalf of their customers. That EPP server connects to our primary registry database which stores all the information about the millions of domain names that we manage. Behind the EPP system there’s a large and complex Shared Registry System (SRS) which manages the domain name lifecycle, processes transfers, renewals, and other out-of-band functions. We also run a number of systems which give our registry partners real-time access to the performance of their TLDs, as well as Whois and RDAP services which allow third parties to access information about registered domain names.
But the most important part of our operations is our authoritative DNS service: without this system, none of the domain names registered in our database would actually work. Our DNS system has to be up and running 24 hours a day, 365 days a year - without even a second of downtime!
The DNS protocol has a number of attributes which make it possible to sustain 100% uptime, such as the stateless nature of its UDP-based transport and automatic failover between nameservers. We’re proud to have achieved this for well over two decades, but it is by no means easy: developing and maintaining a DNS platform which can support hundreds of top-level domains requires significant investment and effort. We have invested a lot of time and effort into building and maintaining our anycast DNS infrastructure, so that it can grow along with the TLDs that run on our registry platform.
As with any mission-critical, large-scale technology platform, effective monitoring and analysis is essential to maintaining our DNS system. We make use of a huge range of monitoring tools and systems that give us insights into almost every aspect of the system: from low-level things like how much storage, memory and compute resources are being used, through to high-level metrics such as overall service availability and performance, zone update speed, and DNSSEC chain-of-trust validity.
Many of the tools we use are standard monitoring tools used by many organisations, such as Checkmk, LibreNMS, Pingdom, and DNSperf. However, some have been developed internally by our engineering teams: either because our requirements are sufficiently specialised that an off-the-shelf solution doesn’t exist (there aren’t a lot of organisations that need to monitor a Whois server, for example), or because existing solutions can’t scale to the size of our infrastructure.
Like many DNS operators, for years we used the venerable DSC to collect statistical information about our DNS query traffic. DSC is maintained by DNS-OARC, the DNS Operations Analysis and Research Centre, a non-profit organisation of which CentralNic is a proud member. DSC is used by DNS operators of all kinds, including many of the twelve root server operators, regional internet registries, TLD operators, and operators of large authoritative and recursive DNS services.
DSC works by using libpcap to perform out-of-band capture of DNS packets on the wire. Because the capture happens outside the DNS server, the server itself does not need to be instrumented, and so its performance is not impeded. A small daemon running on the server captures the DNS packets, parses them, and then writes the details of the queries to XML files on disk.
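To make the parse-and-write step more concrete, here is a minimal Python sketch of what a collector of this kind does once it has the raw DNS payload of a captured packet: it reads the question section, tallies queries by type, and serialises the tallies as XML. This is not DSC's code, and the XML layout is not DSC's actual schema; the function names (`parse_question`, `count_qtypes`, `counts_to_xml`, `build_query`) and the element names are assumptions made for illustration. The capture itself (libpcap) is left out, so the sketch works on payload bytes only.

```python
import struct
import xml.etree.ElementTree as ET
from collections import Counter

# A few well-known query type codes; anything else is reported by number.
QTYPE_NAMES = {1: "A", 2: "NS", 5: "CNAME", 6: "SOA", 15: "MX", 16: "TXT", 28: "AAAA"}


def parse_question(payload: bytes):
    """Return (qname, qtype) for the first question of a DNS query.

    Returns None for responses and for truncated or malformed messages.
    """
    if len(payload) < 12:  # the DNS header is always 12 bytes
        return None
    _msg_id, flags, qdcount = struct.unpack("!HHH", payload[:6])
    if flags & 0x8000 or qdcount < 1:  # QR bit set means this is a response
        return None
    labels = []
    pos = 12
    while True:
        if pos >= len(payload):
            return None
        length = payload[pos]
        pos += 1
        if length == 0:  # zero-length label terminates the name
            break
        if length > 63 or pos + length > len(payload):
            return None  # compression pointer or truncated label
        labels.append(payload[pos:pos + length].decode("ascii", "replace"))
        pos += length
    if pos + 4 > len(payload):
        return None
    qtype, _qclass = struct.unpack("!HH", payload[pos:pos + 4])
    return ".".join(labels) + ".", qtype


def count_qtypes(payloads):
    """Tally queries by type name, skipping anything that does not parse."""
    counts = Counter()
    for payload in payloads:
        question = parse_question(payload)
        if question is not None:
            counts[QTYPE_NAMES.get(question[1], str(question[1]))] += 1
    return counts


def counts_to_xml(counts):
    """Serialise the tallies using a hypothetical XML layout."""
    root = ET.Element("array", name="qtype")
    for key, value in sorted(counts.items()):
        ET.SubElement(root, "entry", key=key, count=str(value))
    return ET.tostring(root, encoding="unicode")


def build_query(name, qtype):
    """Build a minimal DNS query message, used here only to exercise the parser."""
    header = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


if __name__ == "__main__":
    sample = [build_query("example.com", 28), build_query("example.org", 1)]
    print(counts_to_xml(count_qtypes(sample)))
```

Keeping only per-type counters, rather than one record per query, is what keeps the files small on a busy node; the cost is that anything not counted at capture time cannot be recovered later.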
These files are then transferred to a centralised aggregator, which analyses them and writes statistical information into a database. Many organisations use Hedgehog as a GUI to present this data to operators - and again, for many years, CentralNic did too.
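As a rough illustration of the aggregation step, the Python sketch below ingests per-node XML files into a database table and sums them across nodes. It is not how DSC or Hedgehog are implemented: the XML layout, the `qtype_counts` table, and the function names (`open_db`, `ingest`, `totals`) are all assumptions made for this example, and SQLite stands in for whatever database a real aggregator would use.

```python
import sqlite3
import xml.etree.ElementTree as ET


def open_db(path=":memory:"):
    """Open the statistics database and create the (hypothetical) table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS qtype_counts ("
        "node TEXT, window_start INTEGER, qtype TEXT, count INTEGER)"
    )
    return conn


def ingest(conn, node, window_start, xml_text):
    """Parse one file from one node and store a row per query type.

    Returns the number of rows written.
    """
    root = ET.fromstring(xml_text)
    rows = [
        (node, window_start, entry.get("key"), int(entry.get("count")))
        for entry in root.iter("entry")
    ]
    with conn:  # commit all rows for a file together, or none of them
        conn.executemany("INSERT INTO qtype_counts VALUES (?, ?, ?, ?)", rows)
    return len(rows)


def totals(conn):
    """Sum the counts for each query type across all nodes and windows."""
    cursor = conn.execute(
        "SELECT qtype, SUM(count) FROM qtype_counts GROUP BY qtype"
    )
    return dict(cursor)


if __name__ == "__main__":
    conn = open_db()
    ingest(conn, "node-a", 0,
           '<array name="qtype"><entry key="A" count="10"/></array>')
    ingest(conn, "node-b", 0,
           '<array name="qtype"><entry key="A" count="5"/>'
           '<entry key="AAAA" count="2"/></array>')
    print(totals(conn))
```

Note that every file from every node has to pass through this single ingest path, so the aggregator's throughput sets the ceiling for the whole pipeline.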
However, as the number of domains on our platform grew, and the volume of query traffic exploded, DSC stopped being a feasible solution for doing DNS analytics. The volume of data being captured exceeded what the aggregator could process. We ended up with enormous backlogs of XML files on both the aggregator and the anycast nodes, which filled disks, produced analytics reports that were out of date by the time they could be created, and caused other issues that created a lot of pain for our operations teams.
As a result, we decided to retire our DSC-based analytics system and build our own. What follows is a description of what we built, and why we built it.
As it happens, at the time we made this decision, many other DNS operators were having similar issues, and in the last year a number of developments have occurred in the DNS monitoring space, many of which use similar technologies to those we used when building our system. We’ll talk a bit about those later on in this article.