Weird IPSEC VPN Routing Issue

enki

This isn't a cert prep question per se; it's a production network issue. I've been banging my head against the wall for over a month trying to figure out the cause and have done a bunch of research, but I've come up empty-handed, so I figured I would ask here. Maybe @donpezet will be able to shed some light on what the issue is.

Without delving too much into the details, we were trying to add a new location into an existing multi-site Cisco network infrastructure. For cost reasons, they went with a Sonicwall firewall for the new site, though the other locations and central data center hub are all Cisco equipment (routers). The network architecture is a very simple hub and spoke design, with the central DC router being the hub. The other locations all have VPN tunnels (with tunnel interfaces on both sides) going over the Internet. They utilize EIGRP for interior routing, and each site uses its own Internet WAN for the default gateway. Simple enough.

Due to some limitations I read about online regarding Cisco<->Sonicwall VPN integration, we did not utilize tunnel interfaces for this VPN and instead set up a simple IPsec link on both sides to handle traffic between them. No problems there, everything came up as expected. We set up the necessary ACLs on both sides to handle the traffic and I confirmed everything was working properly. That's when things started to get weird.

The primary issue was that the other sites weren't able to reach the new one, while the data center subnets (going through the hub directly) and Azure traffic (using a VPN to the hub with hard-coded static routes to all site networks) could. All other site-to-site traffic worked fine. I began troubleshooting and noticed that the subnet for the new site was not in the DC router's routing table, whereas all the other sites were. As a result, it was not being distributed via EIGRP. I tried various troubleshooting steps, but nothing would get it to show up, so I gave up for the day and did some research that night.

I came back the next day and, voila, the route for the new location was in the routing table and being distributed properly. Nothing had changed config-wise. I chalked it up to some delay issue and figured we were good to go. I forgot to mention that this was being done in our office prior to deployment on-premises for the customer, but outside of the site's WAN address changing, everything else was equal.

We went to install it on-site, and once again the routing issue came back. No route in the table, no EIGRP distribution to the other sites, but the network was reachable to and from the data center. In other words, if the data center network is 192.168.0.0/24 and the new site is 192.168.1.0/24, I could ping between those subnets, but not between the new site and other locations through the DC. Once again, I troubleshot it and waited to see if it would magically fix itself, but nope.

Eventually, I gave up and just set static routes on the other sites to direct traffic to the DC router if it was destined for the new site's network. While not ideal, it's a simple and small network and it works fine. The fact that it does work is what is confusing. If the route to the new site isn't in the router's routing table, how is it routing the traffic there in the first place? Why did it show up once, and even stay up after I made some general network optimizations (taking the tunnel up/down a few times in the process), but then disappear again, never to show back up? I even tried setting the route on the DC router statically, mimicking the settings from when it did show up, but that didn't work either.
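To make the static-route workaround concrete, here is a minimal Python sketch of a longest-prefix-match lookup on one of the other sites' routers. Only the two /24 subnets come from the post above; the next-hop labels, the host address, and the `lookup` helper are hypothetical, and this models the behavior rather than any actual router config.

```python
import ipaddress

def lookup(table, dst):
    """Return the next hop of the longest-prefix match for dst, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in table.items() if addr in net]
    if not matches:
        return None
    # The most specific (longest) prefix wins, as in a real routing table.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Hypothetical spoke routing table. The next-hop names are made up.
spoke = {
    ipaddress.ip_network("0.0.0.0/0"): "local-internet-wan",   # site's own default gateway
    ipaddress.ip_network("192.168.0.0/24"): "tunnel-to-dc",    # DC subnet learned via EIGRP
}

new_site_host = "192.168.1.10"  # hypothetical host in the new site's subnet

# Without a route for 192.168.1.0/24, traffic falls through to the default route.
before = lookup(spoke, new_site_host)

# The workaround: a static route pointing the new site's subnet at the DC router.
spoke[ipaddress.ip_network("192.168.1.0/24")] = "tunnel-to-dc"
after = lookup(spoke, new_site_host)

print(before, "->", after)
```

In this model the spoke sends new-site traffic out its own Internet connection until the static route is added, which matches the symptom of the other sites not reaching the new one.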

Anyway, like I said, this has been a puzzle to me for a couple months. It just doesn't make any sense. So I figured I would run it by people here to see if anyone else has any ideas.

Thanks.

Daniel Espinal

@enki Whenever you mix vendors, strange things happen. If I remember correctly, you said it worked on-site when you first set it up but didn't work off-site when the Sonicwall was installed at the remote site. Sounds to me like a carrier issue. What's different between the WAN connections on the sites that work and the one that doesn't? I would give your ISP a ring and see if they have any insight on this issue. They might be blocking some important traffic. Or the Sonicwall itself might be blocking some EIGRP-related traffic.

Let me tell you, all of my dealings with Sonicwall have been nightmares. I take care of home and small office networks. I never see any Cisco equipment because of the cost. A few of my clients have their office techs install Sonicwalls at their homes so they can VPN into their office from home. And every one has had issues, either with their Verizon service no longer working or Apple TV not working, always something. I just started suggesting Meraki security appliances. It's trivial to set up VPNs and the stuff works. Plus the GUI gives me very high visibility into what is going on. Plus it's comparatively cheap.

Very interested in hearing what the problem was and how you eventually figured it out.

enki

@Daniel-Espinal
Technically both situations were 'off-site'. We initially set up the Sonicwall in our office, but the connectivity was always to the client's data center. We ran into the issue during initial setup, which magically fixed itself, and then it recurred at the client's site and never worked properly there. In both cases, outside of a different IP address (which we obviously changed in the configs once it moved), everything else was identical. Even the carrier.
It's not an EIGRP issue because we weren't using EIGRP at the new site (since it's Sonicwall). While they are using EIGRP internally between the other sites, it wasn't really necessary since it's a simple hub and spoke topology with no redundant routes or connectivity.
I'm pretty sure that the underlying issue is with the Cisco router in the hub. I say this because, even though the VPN tunnel is up to the new site, and even though it routes traffic to/from those subnets without issue, the new site's subnets aren't showing up in the routing table and, as a result, aren't being distributed out via EIGRP.
Under normal circumstances, I would look into rebooting the hub router since the uptime was something crazy and it might just be a hang-up. It's also running an older IOS, but they don't have a SmartNet contract on it. But since this was a one-off project for a non-managed client, with a T&M rate, eventually we just did static routes on the other sites back to the hub and everything worked fine with that.
But it's one of those things that still bothers me and I would eventually like to figure out why it didn't work properly all the time. My assumption is just a glitch in something.
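
One possible reading, not confirmed anywhere in this thread: with a policy-based IPsec VPN (no tunnel interface), a router generally forwards by its normal routing table first and then encrypts whatever matches the VPN's ACL on the outgoing interface. Under that assumption the hub could reach the new site via its default route alone, with no specific route ever installed for EIGRP to advertise. The sketch below models that idea in Python. The interface names, the ACL contents, and the helper functions are all hypothetical; only the two /24 subnets come from the posts above.

```python
import ipaddress

# Hypothetical hub state. Note there is no entry for 192.168.1.0/24.
HUB_ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "wan"),        # default route to the Internet
    (ipaddress.ip_network("192.168.0.0/24"), "lan"),   # data center subnet
]

# Hypothetical VPN ACL: (source network, destination network) pairs to encrypt.
CRYPTO_ACL = [
    (ipaddress.ip_network("192.168.0.0/24"), ipaddress.ip_network("192.168.1.0/24")),
]

def egress_interface(dst):
    """Longest-prefix match; the default route guarantees at least one hit."""
    addr = ipaddress.ip_address(dst)
    hits = [(net, iface) for net, iface in HUB_ROUTES if addr in net]
    return max(hits, key=lambda h: h[0].prefixlen)[1]

def forward(src, dst):
    """Route first, then encrypt if the packet leaves via WAN and matches the ACL."""
    iface = egress_interface(dst)
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    if iface == "wan" and any(s in a and d in b for a, b in CRYPTO_ACL):
        return "wan (encrypted to new site)"
    return iface

def has_specific_route(prefix):
    """True only if the exact prefix is present in the routing table."""
    return any(net == ipaddress.ip_network(prefix) for net, _ in HUB_ROUTES)

print(forward("192.168.0.5", "192.168.1.10"))
print(has_specific_route("192.168.1.0/24"))
```

In this model, traffic from the data center to the new site is delivered even though the table holds no entry for 192.168.1.0/24, which would be consistent with the symptom described. Whether the actual hub behaves this way depends on its configuration, which the thread does not show.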