#dns #dns-resolver #load-balancing #dns-records #monitoring #high-availability #ip-address

app rrdnsd

Distributed monitoring for Round Robin DNS load balancing and high availability

1 unstable release

0.1.0 Dec 15, 2024

#377 in Web programming

165 downloads per month

AGPL-3.0-only

10MB
2K SLoC

Rust 1.5K SLoC // 0.1% comments AsciiDoc 160 SLoC Bitbake 22 SLoC // 0.1% comments Nim 22 SLoC Shell 17 SLoC // 0.1% comments

Contains (ELF exe/lib, 6MB) rrdnsd_bookworm, (ELF exe/lib, 6MB) rrdnsd_bullseye, (ELF exe/lib, 6MB) tmp/rrdnsd/srv/rrdnsd, (debian package, 2MB) tmp/rrdnsd.deb

rrdnsd

Distributed monitoring for Round Robin DNS load balancing and high availability.

rrdnsd monitors the reachability of HTTP[S] services and updates DNS records accordingly. It is lightweight and easy to configure: it can run on a small single-board computer (SBC) and also scale up to hundreds of services. For increased reliability, multiple instances can run together and coordinate using a quorum protocol.

This project is proudly supported by NLnet, with sincere gratitude: https://nlnet.nl/project/rrdnsd/

The main project website is https://rrdnsd.eu/

Downloads, codebase and bug tracker: https://codeberg.org/FedericoCeratto/rrdnsd

Description

When running Internet services it is common to configure multiple 'A' or 'AAAA' DNS records for the same FQDN in order to provide simple forms of load balancing and failover. In this way clients can connect to one or another IP address for the same FQDN, randomly or in round-robin fashion. If any of the IP addresses providing the service becomes unreachable, the related DNS record can be removed. Once the DNS update propagates all the way to the clients, they stop using the unreachable IP address.

This method complements, rather than replaces, the "traditional" TCP/HTTP[S] load balancers that sit between clients and servers.

It provides resiliency against small and large network failures, especially when used for geographically distributed services.
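The client-side behaviour described above can be illustrated with a minimal sketch. It is not part of rrdnsd: the function names `resolve_all` and `connect_any` are hypothetical, and real clients (browsers, HTTP libraries) implement their own address selection. The sketch only shows why removing an unreachable address from DNS helps: a client that tries the published addresses in random order wastes time on every dead one.

```python
import random
import socket
from typing import Callable, Iterable, List


def resolve_all(fqdn: str, port: int) -> List[str]:
    """Return every distinct IP address published for the FQDN."""
    infos = socket.getaddrinfo(fqdn, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def connect_any(
    addresses: Iterable[str],
    port: int,
    timeout: float = 2.0,
    connect: Callable = socket.create_connection,
):
    """Try the addresses in random order and return the first connection."""
    candidates = list(addresses)
    random.shuffle(candidates)
    last_error = None
    for ip in candidates:
        try:
            return connect((ip, port), timeout)
        except OSError as exc:
            # This address is unreachable: fall through to the next one.
            last_error = exc
    raise ConnectionError(f"no reachable address among {candidates}") from last_error
```

The `connect` parameter exists only so that the failover logic can be exercised without a network.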

rrdnsd is not a DNS resolver. It creates and deletes records on authoritative DNS resolvers and cloud services.

StatsD metrics

rrdnsd emits StatsD metrics. External monitoring tools can use them to raise alarms on service outages and to monitor rrdnsd itself.

.Generated metrics:

  • fetch.cnt - internal
  • probe.duration - internal
  • probe.failure.cnt - Failed service probes count
  • probe.load_factor - debugging
  • probe.success.cnt - Successful service probes count
  • probe_time_msec - Service probing elapsed time in milliseconds
  • received_update.cnt - internal
  • status_change.cnt - Endpoint status change count

Metrics described as debugging or internal are subject to change without notice.
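For readers integrating these metrics, here is a minimal sketch of the plain StatsD line format (`<name>:<value>|<type>`) as it might look for the metrics listed above. The metric types used here (`c` for the `.cnt` counters, `ms` for `probe_time_msec`) and the destination `127.0.0.1:8125` are assumptions, not taken from the rrdnsd source. The helper names are hypothetical.

```python
import socket
from typing import Tuple


def statsd_line(name: str, value: float, kind: str) -> str:
    """Format one metric as a plain StatsD line: <name>:<value>|<type>."""
    return f"{name}:{value}|{kind}"


def parse_statsd_line(line: str) -> Tuple[str, float, str]:
    """Split a StatsD line into (name, value, type), ignoring any sample rate."""
    name, rest = line.strip().split(":", 1)
    fields = rest.split("|")
    return name, float(fields[0]), fields[1]


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric line to a StatsD daemon over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

A monitoring rule could, for example, alarm when `probe.failure.cnt` increases while `status_change.cnt` also changes for the same service.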

Glossary:

  • Service: a public facing service identified by a unique FQDN in the configuration file
  • Endpoint: an IP address providing a service. A service should have many endpoints. An endpoint might provide multiple services.
  • Node: an instance of rrdnsd identified by a unique IP address/port pair.

Use case of a simple setup

A client (e.g. a browser) accesses a service running at two IP addresses (endpoints).

rrdnsd monitors the availability of the endpoints. It detects that one endpoint becomes unreachable and updates the DNS resolver in order to delete the related A record. When the downtime ends rrdnsd publishes the A record again.
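The reaction described above, deleting a record when an endpoint goes down and publishing it again when it recovers, can be sketched as a comparison between two probe rounds. This is an illustration under stated assumptions, not rrdnsd's actual logic: the function name `plan_dns_changes` and the convention that a previously unseen endpoint counts as up are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_dns_changes(
    previous: Dict[str, bool], current: Dict[str, bool]
) -> List[Tuple[str, str]]:
    """Compare two probe rounds (IP -> reachable) and list the DNS actions."""
    changes = []
    for ip in sorted(current):
        was_up = previous.get(ip, True)  # assumption: unknown endpoints start as up
        is_up = current[ip]
        if was_up and not is_up:
            changes.append(("delete", ip))  # endpoint just became unreachable
        elif not was_up and is_up:
            changes.append(("add", ip))  # downtime ended: publish the record again
    return changes
```

Endpoints whose status did not change produce no action, so a stable service generates no DNS updates.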

Use case of redundant nodes

Again the service is available at two endpoints. Three rrdnsd nodes are monitoring the endpoints from different locations on the Internet.

[Diagram]

Scenario 1

[Diagram]

The endpoint on the left becomes unreachable from two nodes due to a large network outage.

Also, the endpoint on the right is unreachable from one node due to a localized issue near the rightmost node.

The nodes vote by majority and decide to remove the A record of the left endpoint from DNS. They also decide that the endpoint on the right is likely to be in good shape and keep the related A record in DNS.
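The majority vote in this scenario can be sketched as follows. The document does not describe the quorum protocol in detail, so this is only an assumed strict-majority rule over per-node reachability reports. The function name `majority_says_down` and the choice to keep the record on a tie are hypothetical.

```python
from typing import Sequence


def majority_says_down(reports: Sequence[bool]) -> bool:
    """reports holds one entry per node: True if that node reached the endpoint.

    The endpoint is considered down only when strictly more than half of the
    nodes failed to reach it. On a tie the record is kept (an assumption).
    """
    failures = sum(1 for reachable in reports if not reachable)
    return failures * 2 > len(reports)


# Scenario 1 with three nodes:
left_endpoint = [False, False, True]   # unreachable from two nodes
right_endpoint = [True, True, False]   # unreachable from one node only
```

With these inputs `majority_says_down(left_endpoint)` is true, so the left A record is removed, while `majority_says_down(right_endpoint)` is false and the right A record stays in DNS.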

Scenario 2

A DNS resolver or DNS API receiving updates is shown on the left.

[Diagram]

The leftmost node is not running (e.g. due to a power failure). The two live nodes are still able to probe the endpoints, communicate with each other and react to changes.

Additionally, the node in the center is unable to reach the resolver/API (e.g. due to network congestion). Yet, rrdnsd is still able to work as the rightmost node can reach the resolver/API.

Usage

Building natively on Debian Bookworm or Sid

Generates a native .deb package:

[source,console]

  sudo apt-get install rustc cargo
  dpkg-buildpackage -us -uc -b

Building on Debian Bookworm using Podman

This generates a minimal container image, localhost/rrdnsd, to run rrdnsd:

[source,console]

make podman-build-bookworm

# You can optionally extract the binary:
podman run --rm --entrypoint cat localhost/rrdnsd:bookworm /app/rrdnsd > ./rrdnsd_bookworm

Deployment

Install the Debian package locally, configure /etc/rrdnsd.json, then start the service and follow its logs:

[source,console]

sudo apt install ./rrdnsd.deb
sudo systemctl start rrdnsd.service
sudo journalctl -f

Security

Do not expose the rrdnsd API port directly on the Internet. Place it behind a reverse proxy such as Nginx and enable TLS.

You can further restrict the API entry points:

  • POST /quorum/v1/* : Required, used only between rrdnsd nodes
  • GET /dash : Web dashboard for administrators, optional
  • GET /health : Basic internal healthcheck, optional
  • POST /api/v1/* : Administration API, optional
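The entry points listed above can be turned into an allowlist. The sketch below shows the matching logic a reverse proxy rule set would need to express. It is an illustration only: the names `EntryPoint`, `ENTRY_POINTS` and `is_allowed` are hypothetical, and the example sub-paths such as `/quorum/v1/vote` are invented for the demonstration.

```python
from typing import FrozenSet, NamedTuple


class EntryPoint(NamedTuple):
    method: str
    path: str       # exact path, or a prefix when it ends with "/*"
    required: bool  # required entry points are always allowed


ENTRY_POINTS = [
    EntryPoint("POST", "/quorum/v1/*", True),
    EntryPoint("GET", "/dash", False),
    EntryPoint("GET", "/health", False),
    EntryPoint("POST", "/api/v1/*", False),
]


def is_allowed(method: str, path: str, enabled_optional: FrozenSet[str] = frozenset()) -> bool:
    """Allow required entry points, plus optional ones explicitly enabled."""
    for ep in ENTRY_POINTS:
        if ep.method != method:
            continue
        if ep.path.endswith("/*"):
            matches = path.startswith(ep.path[:-1])  # keep the trailing slash
        else:
            matches = path == ep.path
        if matches:
            return ep.required or ep.path in enabled_optional
    return False  # anything not listed is rejected
```

For example, `is_allowed("GET", "/dash")` is false until `"/dash"` is passed in `enabled_optional`, while `POST` requests under `/quorum/v1/` are always accepted. In a real deployment the quorum entry point would additionally be restricted to the IP addresses of the peer nodes.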

Configuration

A simple setup running on one node:

update_method can be:

  • nsupdate: Uses /usr/bin/nsupdate to send an update to one or more DNS resolvers.
  • knsupdate: Like nsupdate, but uses /usr/bin/knsupdate for Knot DNS.
  • dynu: Uses the Dynu API. Configure update_credentials with the Dynu token.
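To give an idea of what an nsupdate-based update involves, the sketch below builds the command script that `nsupdate` reads from standard input, using its standard `server`, `zone`, `update add`, `update delete` and `send` commands. How rrdnsd itself invokes nsupdate is not documented here, so the function names and the absence of authentication are assumptions made for illustration.

```python
import subprocess


def nsupdate_script(
    action: str, fqdn: str, ipaddr: str, ttl: int, zone: str, server: str, port: int
) -> str:
    """Build an nsupdate script that adds or deletes a single A record."""
    if action == "add":
        update = f"update add {fqdn}. {ttl} A {ipaddr}"
    elif action == "delete":
        update = f"update delete {fqdn}. A {ipaddr}"
    else:
        raise ValueError(f"unknown action: {action}")
    lines = [f"server {server} {port}", f"zone {zone}.", update, "send", ""]
    return "\n".join(lines)


def run_nsupdate(script: str, binary: str = "/usr/bin/nsupdate") -> None:
    """Feed the script to nsupdate on stdin. Raises if nsupdate fails."""
    subprocess.run([binary], input=script, text=True, check=True)
```

Deleting only the record for one IP address, rather than the whole record set, leaves the A records of the healthy endpoints untouched.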

Example configuration file:

[source,json]

{
  "_comment": "Keys starting with underscore are ignored.",
  "conf_version": 1,
  "_local_node": "IP address and port of this rrdnsd node.",
  "local_node": "127.0.0.1:3333",
  "_nodes": "IP address and port of this node and its peers, if any.",
  "nodes": [
    "127.0.0.1:3333"
  ],
  "nodes_protocol": "http",
  "services": [
    {
      "fqdn": "rrdnsd.test",
      "_healthcheck": "Protocol, port and HTTP path to test. {} is replaced with the endpoint IP address.",
      "healthcheck": "http://{}:8778/",
      "ipaddrs": [
        "127.0.0.2",
        "127.0.0.3"
      ],
      "ttl": 5,
      "zone": "rrdnsd.test"
    }
  ],
  "probe_interval_ms": 1000,
  "update_method": "nsupdate",
  "_update_resolvers": "List of DNS resolvers and port for nsupdate or knsupdate.",
  "_update_credentials": "Token, password or username:password to authenticate updates",
  "update_credentials": "",
  "update_resolvers": [
    "127.0.0.1:5454"
  ]
}

local_node can be overridden using the environment variable LOCAL_NODE

The default config path is /etc/rrdnsd.json and can be overridden using the environment variable CONF
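The conventions described above (keys starting with an underscore are ignored, `{}` in `healthcheck` is replaced with each endpoint IP address, and `CONF` and `LOCAL_NODE` override the defaults) can be sketched in a few lines. This is not rrdnsd's loader: the function names are hypothetical, and the real implementation may validate the file differently.

```python
import json
import os
from typing import Any, List, Mapping


def strip_comments(value: Any) -> Any:
    """Recursively drop dictionary keys that start with an underscore."""
    if isinstance(value, dict):
        return {k: strip_comments(v) for k, v in value.items() if not k.startswith("_")}
    if isinstance(value, list):
        return [strip_comments(v) for v in value]
    return value


def load_conf(environ: Mapping[str, str] = os.environ) -> dict:
    """Load the configuration, honouring the CONF and LOCAL_NODE variables."""
    path = environ.get("CONF", "/etc/rrdnsd.json")
    with open(path, encoding="utf-8") as f:
        conf = strip_comments(json.load(f))
    if "LOCAL_NODE" in environ:
        conf["local_node"] = environ["LOCAL_NODE"]
    return conf


def healthcheck_urls(service: Mapping[str, Any]) -> List[str]:
    """Expand the healthcheck template once per endpoint IP address."""
    return [service["healthcheck"].replace("{}", ip) for ip in service["ipaddrs"]]
```

With the example file above, the single service expands to two probe URLs, `http://127.0.0.2:8778/` and `http://127.0.0.3:8778/`.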

Journald logging

Logs are sent to the local journald instance and are often prefixed by tags like [main], [load_conf], [updater] to identify where they come from.

.Log example:

Dec 05 10:57:19 tux rrdnsd[2432198]: [main] config overridden by envvar CONF: integ/rrdnsd.json
Dec 05 10:57:19 tux rrdnsd[2432198]: [load_conf] reading integ/rrdnsd.json
Dec 05 10:57:19 tux rrdnsd[2432198]: [main] config overridden by envvar LOCAL_NODE: 127.0.0.1:3333
Dec 05 10:57:19 tux rrdnsd[2432198]: [updater] started

Additionally, the source file, module and line number are included in each log entry. Use journalctl -o verbose to display them.

.Example:

[source,console]

CODE_FILE=src/main.rs
CODE_MODULE=rrdnsd
CODE_LINE=823

To follow logs from the systemd service:

[source,console]

sudo journalctl -u rrdnsd -f

To follow logs from an rrdnsd instance started manually, e.g. during development:

[source,console]

sudo journalctl --identifier rrdnsd -f

Development

rrdnsd is under development; contributions and testing are welcome.

Codebase documentation is published at https://rrdnsd.eu/codebase_doc/rrdnsd/

Bugtracker: https://codeberg.org/FedericoCeratto/rrdnsd/issues

.Roadmap:

  • Optimize connection reuse and lifetime
  • Log a warning if a whole service goes down
  • Add end-to-end healthchecks e.g. /health
  • Fail-open: if more than 50% of endpoints are down keep them in DNS
  • Support active-standby pattern
  • Support IPv6 (AAAA records)
  • Exit gracefully
  • Support running custom scripts
  • Support calling webhooks
  • Add an API to:
      ◦ fetch status
      ◦ fetch events
      ◦ add/remove services and endpoints
      ◦ add/remove services nodes
      ◦ manual failover
      ◦ drain endpoints
      ◦ flag services as under maintenance
      ◦ "refresh" an FQDN by adding again all live endpoints and then deleting the unreachable ones
  • Add TCP testing
  • Service-specific probing interval
  • Support DNS APIs
      ◦ Dynu
      ◦ Digital Ocean

Testing

Run basic unit tests using:

[source,sh]

cargo test

Run a test instance:

[source,sh]

RUST_BACKTRACE=1 CONF=integ/rrdnsd.json LOCAL_NODE=127.0.0.1:8000 target/debug/rrdnsd

Run full integration tests using the following command. Warning: this starts/stops knotd, runs sudo tc ...

[source,sh]

cargo test --test integration_test

Dependencies

~12–26MB
~410K SLoC