
How some Let's Encrypt renewal failures pointed to an AWS traffic hijacking issue

tl;dr A BGP-based feature of the AWS Direct Connect service allowed a third party to inject an incorrect route for an external IP assigned to me, effectively hijacking my AWS-sourced traffic.

The certificate renewal problem...

It started innocently enough. The FreeBSD VPS that I’ve had since the still-mostly-pre-cloud days of 2010 (ARP Networks is fantastic!) hosts a few web sites that use Let’s Encrypt certificates for TLS.

Sometime in March the certbot-driven renewal cronjob started failing.

There were 2 variations on the error message being logged, both referring to “secondary validation”:

During secondary validation: 174.136.109.18: Fetching http://chair6.net/.well-known/acme-challenge/5zie8rgT52uTNnKmy2jMndZDSOx8Wg5QyfBqF0vWi7w: Error getting validation data

During secondary validation: 174.136.109.18: Fetching http://chair6.net/.well-known/acme-challenge/s5QIfvbA3v_zrBGZ7qWzTUcDRijO4bQCUY5j3YXAIyQ: Timeout during connect (likely firewall problem)

I checked, and was able to connect to these (or similar) URLs from an outside system.

Using tcpdump on the VPS, I confirmed that I was seeing inbound HTTP traffic on port 80 from Let’s Encrypt sources when certbot renew was run. The tcpdump capture also showed an HTTP 200 OK response being returned - but the renewal kept failing with the same errors.
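
For reference, that check was roughly the following (a minimal sketch: em0 is an assumed FreeBSD interface name, and --dry-run runs the renewal against the Let's Encrypt staging environment rather than waiting for the real cronjob):

# On the VPS, watch for inbound validation traffic on port 80 (em0 is an assumed interface name)
$ sudo tcpdump -ni em0 'tcp port 80'

# In a second session, trigger a renewal attempt
$ sudo certbot renew --dry-run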

So what does this “secondary validation” reference mean? 🤔

After a little Google’ing:

  • A 2020 Let's Encrypt blog post talked about multiple perspective domain validation, which is essentially doing HTTP-01 challenge validation requests from various diverse network connections instead of just one.

  • A 2024 Let's Encrypt community post explained that 2 new remote perspectives were being added, and that domain validation would now require 5 validation requests from different locations.

  • Another 2024 post explained how additional validations from geographically diverse locations were causing problems for a subset of Let’s Encrypt customers.

  • A Let's Encrypt FAQ entry stated that the IP addresses of validation sources would not be shared, but an older 2022 post had input from Let’s Encrypt team members who explained that secondary validation processes ran from AWS region eu-central-1.

It seemed that the Let's Encrypt traffic I was seeing was some but not all of the required validation requests, and that some secondary validation requests - probably from eu-central-1 - were not making it to the VPS. I spun up an EC2 instance in eu-central-1 and confirmed I couldn’t curl http://chair6.net from there. Huzzah, it seems we’re narrowing things down.
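
The test itself was nothing fancy - just something like this from the eu-central-1 instance (the timeout value is arbitrary):

$ curl -sv --connect-timeout 10 http://chair6.net/ -o /dev/null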

I opened a ticket with my VPS provider, asking them if any other customers had experienced problems (Let’s Encrypt is fairly widely used, I figured others would’ve probably run into it). They hadn’t had any other reports but said they’d check in with their upstream provider.

We exchanged output of various ping, curl, and mtr commands, checking routes from both directions. It looked like there was some filtering at an intermediary step that was dropping traffic associated with my IP address in both directions. My provider kept pushing upstream, because something was clearly wrong.


At this point, it wasn’t looking like we were going to resolve this quickly. I wanted to get the certificates renewed as they were close to expiration, so I set up an nginx reverse proxy with another, unrelated provider (different IP space) & pointed my domains there temporarily. The next time certbot ran, the secondary validations passed and certificate renewal succeeded, 🎉 then I undid the DNS change & we were back with fresh certificates.
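
The temporary proxy itself was nothing special - roughly a server block like the one below, forwarding HTTP (including the /.well-known/acme-challenge/ paths) back to the VPS. A sketch only; the config path will vary by system:

$ sudo tee /etc/nginx/conf.d/chair6.conf > /dev/null <<'EOF'
server {
    listen 80;
    server_name chair6.net;

    # Forward everything, including /.well-known/acme-challenge/, to the real VPS
    location / {
        proxy_pass http://174.136.109.18;
        proxy_set_header Host $host;
    }
}
EOF
$ sudo nginx -t && sudo nginx -s reload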


I kept poking around occasionally, when I had time. I couldn’t see anything abnormal for my IP / range in various looking glasses and BGP route views. I also checked various reputation lists, but didn’t see anything of concern.
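
For the record, the command-line equivalent of those checks is roughly this (a sketch; the RIPEstat query wants the origin ASN and announced prefix returned by the first command, so the AS0 / 0.0.0.0/0 values below are just placeholders):

# Map the IP to its origin ASN and covering announced prefix (Team Cymru IP-to-ASN service)
$ whois -h whois.cymru.com " -v 174.136.109.18"

# Check whether an RPKI ROA covers that ASN/prefix pair (RIPEstat public data API)
$ curl -s "https://stat.ripe.net/data/rpki-validation/data.json?resource=AS0&prefix=0.0.0.0/0"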

... that led to an AWS connectivity problem...

Until one day, I had an instance in another AWS region for an unrelated reason, did a quick curl, and realized that us-west-2 had the same connectivity issue! Turns out, the problem wasn’t just one AWS region... it was all of them. 💥

[Image: Cloudy Banff Mountains]

(Here's a picture of some cloudy mountains from a recent trip to Banff National Park, just because.)

I figured I might as well try attacking the problem from the other direction, and opened a ticket with AWS Support.

We traded mtr results again ($ mtr -4rnc 1 174.136.109.18 was the magic stanza), both confirmed there was a problem, and the case was escalated to the VPC team.

They checked ACLs, security groups, and route tables, and everything looked okay. But we still had connectivity weirdness.

After a bit more back & forth, we figured out that traffic destined for my IP was being routed out to an unrelated third party whose 2 destination IPs just happened to be close to - but not the same as - my IP, with a flipped digit in the 3rd octet.

This wasn't immediately obvious, because the network path and the point of failure / drop seemed to vary between mtr runs. Most of the time we'd get something like the first result below, but occasionally we'd get results more like the 2nd and 3rd examples.

[ec2-user@ip-172-31-24-210 ~]$ mtr -4rnc 1 174.136.109.18
Start: 2024-05-09T16:22:46+0000
HOST: ip-172-31-24-210.us-west-2. Loss% Snt Last Avg Best Wrst StDev
1.|-- 244.5.0.189 0.0% 1 0.8 0.8 0.8 0.8 0.0
2.|-- 108.166.228.68 0.0% 1 0.4 0.4 0.4 0.4 0.0
3.|-- 240.5.4.6 0.0% 1 0.5 0.5 0.5 0.5 0.0
4.|-- 100.100.2.122 0.0% 1 0.5 0.5 0.5 0.5 0.0
5.|-- 100.91.29.117 0.0% 1 47.4 47.4 47.4 47.4 0.0
6.|-- 52.95.62.48 0.0% 1 45.2 45.2 45.2 45.2 0.0
7.|-- 52.93.249.56 0.0% 1 44.4 44.4 44.4 44.4 0.0
8.|-- 52.95.8.197 0.0% 1 47.4 47.4 47.4 47.4 0.0
9.|-- ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
[ec2-user@ip-172-31-24-210 ~]$ mtr -4rnc 1 174.136.109.18
Start: 2024-05-09T16:23:30+0000
HOST: ip-172-31-24-210.us-west-2. Loss% Snt Last Avg Best Wrst StDev
1.|-- 244.5.0.189 0.0% 1 33.4 33.4 33.4 33.4 0.0
2.|-- 108.166.228.68 0.0% 1 0.4 0.4 0.4 0.4 0.0
3.|-- 240.5.4.6 0.0% 1 0.6 0.6 0.6 0.6 0.0
4.|-- 100.100.2.122 0.0% 1 1.5 1.5 1.5 1.5 0.0
5.|-- 100.91.29.117 0.0% 1 47.3 47.3 47.3 47.3 0.0
6.|-- 52.95.62.48 0.0% 1 44.5 44.5 44.5 44.5 0.0
7.|-- 52.93.249.56 0.0% 1 47.5 47.5 47.5 47.5 0.0
8.|-- 52.95.8.197 0.0% 1 47.6 47.6 47.6 47.6 0.0
9.|-- 174.136.xyz.124 0.0% 1 44.1 44.1 44.1 44.1 0.0
[ec2-user@ip-172-31-24-210 ~]$ mtr -4rnc 1 174.136.109.18
Start: 2024-05-09T16:24:49+0000
HOST: ip-172-31-24-210.us-west-2. Loss% Snt Last Avg Best Wrst StDev
1.|-- 244.5.0.189 0.0% 1 4.5 4.5 4.5 4.5 0.0
2.|-- 108.166.228.68 0.0% 1 0.4 0.4 0.4 0.4 0.0
3.|-- 240.5.4.6 0.0% 1 0.6 0.6 0.6 0.6 0.0
4.|-- 100.100.2.122 0.0% 1 0.9 0.9 0.9 0.9 0.0
5.|-- 100.91.29.117 0.0% 1 48.3 48.3 48.3 48.3 0.0
6.|-- 52.95.62.48 0.0% 1 45.7 45.7 45.7 45.7 0.0
7.|-- 52.93.249.56 0.0% 1 43.6 43.6 43.6 43.6 0.0
8.|-- 52.95.8.197 0.0% 1 47.6 47.6 47.6 47.6 0.0
9.|-- ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
10.|-- ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
11.|-- ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
12.|-- ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
13.|-- ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
14.|-- ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
15.|-- ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
16.|-- ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
17.|-- ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
18.|-- 174.136.xyz.124 0.0% 1 44.4 44.4 44.4 44.4 0.0

A simple ICMP ping also showed some variability: a handful of packets drew explicit "Packet filtered" errors from the third-party address, while the rest were silently dropped, and none received a reply:

[ec2-user@ip-172-31-24-210 ~]$ ping 174.136.109.18
PING 174.136.109.18 (174.136.109.18) 56(84) bytes of data.
From 174.136.xyz.124 icmp_seq=15 Packet filtered
From 174.136.xyz.124 icmp_seq=34 Packet filtered
From 174.136.xyz.124 icmp_seq=44 Packet filtered
From 174.136.xyz.124 icmp_seq=63 Packet filtered
From 174.136.xyz.124 icmp_seq=73 Packet filtered
From 174.136.xyz.124 icmp_seq=92 Packet filtered
From 174.136.xyz.124 icmp_seq=102 Packet filtered
From 174.136.xyz.124 icmp_seq=121 Packet filtered
From 174.136.xyz.124 icmp_seq=131 Packet filtered
^C
--- 174.136.109.18 ping statistics ---
148 packets transmitted, 0 received, +9 errors, 100% packet loss, time 152885ms

Me to AWS Support:

Those [third-party] IPs are interestingly-close to the IP of the system I'm having problems connecting to.

174.136.xyz.abc and 174.136.xyz.abd both just have that 3rd octet difference (xyz vs 109) from 174.136.109.18.

I wonder if there's a typo/misconfiguration here that's routing the traffic to somewhere it shouldn't be? I've emailed them, I'll let you know what I hear back.

... that led to a third-party routing problem...

On May 10, I emailed [thirdparty] a description of the problem (seems whois contact information is still useful, sometimes), and they responded on May 11:

We do not allow access over our AWS PublicVif direct connect handoffs. What application are you trying to access and what company are you with?

Hrrm.. Direct Connect? The documentation looks interesting - it lets you use BGP peering to route traffic between your AWS networks and your non-AWS networks. Further documentation states “[y]ou must own the IP address prefixes that you advertise to the AWS network in the public VIF. To advertise IP address prefixes that are owned by third parties or Internet Service Providers (ISPs), provide AWS Support with a Letter of Authorization (LOA).”
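
To make the mechanism concrete, here's a rough sketch of how a customer declares the prefixes they intend to advertise on a public VIF via the AWS CLI. Every identifier, address, and prefix below is a placeholder, and the field names are from memory rather than verified against current docs - treat it as illustrative only:

$ cat > public-vif.json <<'EOF'
{
  "virtualInterfaceName": "example-public-vif",
  "vlan": 101,
  "asn": 65000,
  "amazonAddress": "203.0.113.1/30",
  "customerAddress": "203.0.113.2/30",
  "addressFamily": "ipv4",
  "routeFilterPrefixes": [ { "cidr": "198.51.100.0/24" } ]
}
EOF
$ aws directconnect create-public-virtual-interface \
    --connection-id dxcon-xxxxxxxx \
    --new-public-virtual-interface file://public-vif.json

The routeFilterPrefixes entry is the interesting part: it's the set of public prefixes the customer is asking AWS to accept over the BGP session, which is exactly the layer where a typo'd octet - if it slips past the ownership check - can pull someone else's traffic onto that connection.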

I replied:

I am not a [thirdparty] customer, I am not trying to access any of your applications, and I do not know what your direct connect handoffs are.

I am an AWS EC2 customer who is trying to be able to get traffic from AWS EC2 instances to an external system, chair6.net / 174.136.109.18.

For some reason, my traffic from EC2 to 174.136.109.18 seems to be ending up at a [thirdparty]-registered 174.136.x.y or 174.136.x.z, where it is being dropped / filtered. I don't know why this is happening, and don't want this to happen.. I want my traffic to take whatever direct / default route is available from AWS to 174.136.109.18.

I am not sure how the AWS <-> [thirdparty] connectivity / routing is done, but I suspect a typo in a configuration somewhere, given how close the first 3 octets of those 3 IP addresses are.

They responded:

Yes, I understand what you are saying now.

That network isn’t registered to [thirdparty] but it is being routed within our network and out to AWS. I’m guessing someone meant 174.136.x.y since that is registered to us. I’ll have to get back to you on this one.

And a couple of days later, on May 14:

This will be taken care of tonight.

Me, back to AWS Support:

[thirdparty] tell me that the issue should be resolved after they deploy a change tonight, so fingers crossed! I'll see how it's looking tomorrow.

I've attached a screenshot of the email thread so far.

(I'm curious why an unrelated external party would be in a position to inject incorrect routes for networks they don't own into AWS, such that they can affect an EC2-wide egress path? Feels like a traffic hijacking vector that could be open to abuse.)

The next day, May 15, the connectivity problem was gone! 🎉🎉🎉

I had one more interaction with [thirdparty]:

I just checked from my side & that change seems to have worked.. my AWS EC2 instances can send traffic to 174.136.109.18 again.

Just out of curiosity, what was the problem? It sounds like there was perhaps a route with a typo'd prefix being advertised by [thirdparty] to AWS via their Direct Connect service, which was affecting my AWS EC2 egress traffic?

Could you please share the prefix in question, so I can let my provider know what other IPs in their range may have been affected?

Their reply:

We have a test environment that had a typo on the third octet which we fixed last night. The prefix was 174.136.109.0/26.

Suspicion confirmed.
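
The arithmetic lines up, too: 174.136.109.0/26 covers 174.136.109.0 through 174.136.109.63, which includes my .18 - so once that prefix was accepted and preferred, EC2 egress traffic for my address followed it. A quick sanity check:

$ python3 -c "import ipaddress as ip; n = ip.ip_network('174.136.109.0/26'); print(n[0], '-', n[-1], ip.ip_address('174.136.109.18') in n)"
174.136.109.0 - 174.136.109.63 True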

I let my VPS provider know the problem was resolved. His reply:

That's an incredible find. And also quite scary 😳 So AWS doesn't validate in any (reasonable) way that routes they ingest are legitimate. Not even an RPKI lookup. Maaaaaaaaaan.

I burned so much time, and I'm sure you did too, and our ISP will hate us now 😂 I gotta tell them to close the ticket and admit, "Yeah, it really was AWS' fault, soooooorry".

... that pointed to an AWS security issue.

With his "quite scary" concern in mind, I figured I'd keep pushing a little on the security angle. Me again, to AWS Support, on May 15:

Yup, looks like we're all set now... thanks for following up!

Do you have any idea why a misconfiguration by this third party would've been able to affect my AWS egress traffic like this? I guess there's some trusted relationship in place such that AWS accepts & prioritizes routes they're receiving from them via Direct Connect?

AWS Support, on May 16:

Glad to hear the confirmation that everything works! Unfortunately, as we take the data and confidentiality of our customers very seriously, it is not possible for me to provide information about the configuration and resources of other customers or AWS accounts.

But generally speaking, advertized IP prefixes can take precedence if not withdrawn, especially if there's a longer prefix match.

I will be setting this case as resolved, as reachability has been confirmed now.

After Support closed out the ticket, I still wasn't quite comfortable with the answer from a security perspective.

I could have done some more testing myself (setting up Direct Connect, creating a public virtual interface, establishing peering, and trying to advertise routes for IPs I don't own), but I've spent enough time on this already.

I checked out https://aws.amazon.com/security/vulnerability-reporting/ & on May 16 fired off an email to aws-security@ with subject “Possible traffic hijacking issue for AWS EC2 egress paths”:

I came across an interesting situation over the past few weeks where it seems that AWS Direct Connect may expose the potential for third-party injection of incorrect routes that affect AWS EC2 egress traffic for all customers. While in my case this was a mistake on the part of that third party, it seems like you have potential for malicious exploitation here.

The general situation:

  • I run an external, non-AWS system, at a.b.xyz.c, and found that I couldn't get network connectivity to that system from EC2 instances in multiple AWS regions (I tested us-west-2, us-east-1, and eu-central-2).

  • I talked to my external provider, they talked to their upstream, and we couldn't see a problem. But on checking in with AWS Support, we observed that EC2-sourced traffic for my external system, a.b.xyz.c, was being routed to and then dropped by a similar-but-different IP range, a.b.xzy.d (note that flipped ordering in that 3rd octet) in a different & unrelated AS.

  • On communicating with the unrelated third party and owner of that similar-but-different range and AS, it seemed that they had typo'd a configuration that was related to AWS Direct Connect in some way, and were somehow advertising an incorrect prefix (including my external system's IP) that was being applied with preference to AWS egress traffic.

In this situation, the unrelated third party corrected their configuration and my connectivity issues were resolved immediately after they deployed. They didn't share many details, but did refer to their "AWS PublicVif direct connect handoffs" at one point, and "a test environment that had a typo on the third octet".

Looking at the docs (https://docs.aws.amazon.com/directconnect/latest/UserGuide/routing-and-bgp.html), and reading between the lines, it sounds like this was likely related to Direct Connect's support for BGP route advertisement.

While my case seems to have been due to an honest mistake, this feels potentially bad! How can an unrelated third party inject incorrect routes (seemingly via Direct Connect) that are given preference for general AWS EC2 egress traffic? I suspect there's a potential for malicious exploitation here, as an attacker could use the same vector to hijack traffic by injecting bad routes & redirecting traffic for networks they don't own. AWS has talked previously about using RPKI to secure your BGP usage, but that doesn't seem to be the case here.

They acknowledged immediately on May 16, then replied again on May 17:

We have notified the relevant team of your concern and they will be taking appropriate action. The service team is currently working on the fix and the fix will be implemented in the near future.

We exchanged a few more short emails, then they closed it out on June 19:

Hope you are doing well! I'm happy to report that we have completed our investigation for your reported issue.

AWS DirectConnect customers can configure a public virtual interface (VIF), which allows them to use their connection to access public AWS resources [1] from their on-premise location. This requires that AWS services be able to send the traffic destined to the customer's public IP through their connection. As such, customers need to advertise their Public IPs directly to AWS over BGP on their connections; these routes are added to the AWS Network with a higher preference than the path over the internet. This enables any AWS-sourced traffic to be directed to the customer over their DirectConnect connection.

Given that these routes are given preference over the internet-received routes, customers wanting to use this feature need to provide proof of ownership of the prefixes. When setting-up a new public VIF, AWS will not accept any prefix advertised by the customer until the prefix ownership has been validated [2]. This prevents customers from receiving traffic destined to any arbitrary prefix. In the instance you reported, there was an issue with our process for validating the ownership of the IP prefix, which led to the traffic being sent to an unintended destination. We have since improved the process by expanding the checks being performed.

AWS has adopted Resource Public Key Infrastructure (RPKI) in its public peering and transit facing infrastructure [3]. However, RPKI had not yet been adopted in DirectConnect due to the increased burden RPKI would put on DirectConnect users. We are actively investigating improvements to the customer experience by adopting more streamlined mechanisms to verify prefix ownership, similar to the Bring your own IP address (BYOIP) features used with EC2 and Amazon Global Accelerator [4].

[1] https://docs.aws.amazon.com/vpc/latest/userguide/aws-ip-ranges.html

[2] https://repost.aws/knowledge-center/public-vif-stuck-verifying

[3] https://aws.amazon.com/blogs/networking-and-content-delivery/how-aws-is-helping-to-secure-internet-routing/

[4] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-byoip.html

Once again, I want to thank you for reaching out to us with this report and collaborating with us. While we do not plan to publish a security bulletin at this time; we would be happy to provide technical feedback on your content if you choose to publish.

And that’s it!

One takeaway here is that networks and processes still break in interesting ways (I can't say I'd expected another AWS customer to be able to affect my traffic in this way), and that you shouldn't just assume a cloud provider is giving you a clean path to the public internet.

A second takeaway, based on a comment I received - let's be explicit, this is a good example of the broader system operating as intended! 🏆 While the situation here may have been slightly different from an assumed threat model, as BGP hijacking wasn't being directly used to redirect traffic & obtain an unauthorized certificate, the multi-perspective validation performed by Let's Encrypt did prevent issuance of a certificate for an endpoint in a nebulous networking state.

I shared a draft of this post with AWS on July 15 & they provided feedback on August 13. Hopefully the process changes they referred to help prevent future exposure.
