Ancestors

Toot

Written by gjm on 2024-09-27 at 00:40

Clever internet people! I have a weird network problem.

My local machine is a Linux box. It runs Exim, a mail transport agent. When I send an email, my mail user agent talks to Exim, and Exim passes the message on via SMTP to one of two remote smarthosts for further delivery.

For one of those two, but not the other, something times out when trying to make the SMTP connection. The message in the Exim logs is "SMTP timeout after initial connection" but I am not certain there has actually been an "initial connection" in any useful sense.

When it's trying to deliver such a message, netstat shows the connection to the relevant server as in state SYN_SENT.

Usually I think this would indicate some rather low-level network problem, like a broken firewall rule.

BUT

  1. If I try to make a connection (from the same local machine) to the same server using smtplib in Python, it connects without problems! (And I can successfully send mail that way.)

  1. Between the last time this all worked without problems and now, I have not knowingly changed anything in my configuration.

Yes, I have tried turning it off and then on again.

Anyone got any idea what might be happening?

=> More informations about this toot | More toots from gjm@mathstodon.xyz

Descendants

Written by mathew on 2024-09-27 at 00:48

@gjm Is it possible that exim supports STARTTLS, the Python library doesn’t by default, and the connection is timing out when TLS is attempted?

(If so, the obvious followup question will be why TLS is making the connection fail with a timeout rather than something more informative.)

=> More informations about this toot | More toots from mathew@universeodon.com

Written by gjm on 2024-09-27 at 00:51

@mathew I'm explicitly invoking STARTTLS when I do it in Python, and I get a response from the server indicating that it worked. So I don't think it can be that.

=> More informations about this toot | More toots from gjm@mathstodon.xyz

Written by mathew on 2024-09-27 at 00:55

@gjm Hmm, OK, next random guess: is one using IPv4 and the other IPv6 for whatever reason?

=> More informations about this toot | More toots from mathew@universeodon.com

Written by Tony Finch on 2024-09-27 at 01:02

@mathew @gjm i suspect some magic firewall rules on your box that are dropping packets depending on which process is sending them

=> More informations about this toot | More toots from fanf@mendeddrum.org

Written by gjm on 2024-09-27 at 01:06

@fanf @mathew I cannot think of any reason why I would have such firewall rules on my box, and still less any reason why they would suddenly have appeared (or suddenly started applying to Exim where they hadn't before) a few hours ago. But I guess it's not impossible.

(I think I have no firewall rules at all on this box. There is a separate box between it and the internet with plenty of firewall rules on it, but obviously that box has no knowledge of what process on this box any packet is coming from. This box is running Ubuntu 22.04, I am fairly sure I have never knowingly configured a firewall on it. The default firewall on recent-ish Ubuntu is ufw, and ufw status says Status: inactive.)

=> More informations about this toot | More toots from gjm@mathstodon.xyz

Written by Tony Finch on 2024-09-27 at 01:19

@gjm @mathew hmm i no longer suspect firewall magic

does your mua submit to exim via smtp or exec(sendmail)? can you reproduce the problem by feeding exim a test message from a shell prompt? that would make it easier to turn on debug tracing so you can see if there’s anything suspicious in the logs

=> More informations about this toot | More toots from fanf@mendeddrum.org

Written by gjm on 2024-09-27 at 01:50

@fanf @mathew MUA passes mail to MTA via SMTP.

If I do exim -i -t < mail-test.txt where mail-test.txt contains an RFC822 mail message, I get the same behaviour.

Unfortunately, exim -d+all -i -t < mail-test.txt (run as root) doesn't generate debug info for any of the relevant stuff -- it hands the message off to a separate process for delivery.

However, once delivery has failed once, if I clear the retry database, stop my already-running exim process, and do exim -d+all -bd -q30m (i.e., the same thing that usually runs in the background, but requesting debug output), the relevant-looking bits go like this (summarized because of character limit):

hostlist:

'smtp.pobox.com' IP 103.168.172.64 -1

'smtp.pobox.com' IP 103.168.172.49 -1

[stuff involving hints database; "no host retry record"; "no message retry record"; "retry-status = usable"]

103.168.72.64 in serialize_hosts? no (option unset)

delivering [messageID] to smtp.pobox.com [103.168.172.64] (...)

[checks hosts_require_dane and hosts_pipe_connect; no to both]

Connecting to smtp.pobox.com [103.168.172.64]:25 ... 103.168.172.64 in hosts_try_fastopen? yes (matched "*")

TFO mode sendto, no data: EINPROGRESS

connected

[stuff to do with $primary_hostname]

selecting on subprocess pipes

selecting on subprocess pipes

SMTP(Connection timed out)<<

SMTP(close)>>

LOG:MAIN

H=smtp.pobox.com [103.168.172.64]: SMTP timeout after initial connection: Connection timed out

The bit that takes all the time is where it says "selecting on subprocess pipes" twice.

(Some comments to follow.)

=> More informations about this toot | More toots from gjm@mathstodon.xyz

Written by gjm on 2024-09-27 at 01:56

@fanf @mathew What strikes me here is the "fastopen" bit, so I'm going to try disabling that and see whether it makes anything better. (Initial indications are that it doesn't, but it's possible I screwed something up.)

=> More informations about this toot | More toots from gjm@mathstodon.xyz

Written by gjm on 2024-09-27 at 02:18

@fanf @mathew I had screwed it up, but unfortunately after unscrewing (and checking that fastopen was then actually disabled) it fails in the same way as before.

The debug output looks much the same as it did before. The failure message it displays after the timeout is now "failed: Connection timed out (timeout=5m)" which isn't quite the same as before.

netstat still shows a connection in SYN_SENT state while it's busy timing out, as before.

Thanks for looking at this, btw!

=> More informations about this toot | More toots from gjm@mathstodon.xyz

Written by mathew on 2024-09-27 at 02:28

@gjm @fanf At this point I’d be inclined to try WireSharking it to see exactly what’s going on.

Does Exim give you a full transcript of everything sent by the remote server? Wondering if there could be a clue there.

=> More informations about this toot | More toots from mathew@universeodon.com

Written by gjm on 2024-09-27 at 02:50

@mathew @fanf I don't think it gets as far as anything being sent by the remote server.

I'm not an Exim expert but I'd have expected that telling it "turn on all debugging" would make it dump out any full transcripts it might be capable of dumping.

If I run tshark while making it retry one of my failed messages, the only things involving the relevant IP address are (1) querying a DNS server and getting those addresses back, and (2) the following lines which look to me as if my box is sending a SYN but it's never getting a response:

14 1.188041335 172.16.123.1 → 103.168.172.64 TCP 78 37218 → 25 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=2465038006 TSecr=0 WS=128 TFO=R

16 2.191082237 172.16.123.1 → 103.168.172.64 TCP 74 [TCP Retransmission] [TCP Port numbers reused] 37218 → 25 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=2465039010 TSecr=0 WS=128

19 3.215323691 172.16.123.1 → 103.168.172.64 TCP 74 [TCP Retransmission] [TCP Port numbers reused] 37218 → 25 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=2465040034 TSecr=0 WS=128

20 4.239356867 172.16.123.1 → 103.168.172.64 TCP 74 [TCP Retransmission] [TCP Port numbers reused] 37218 → 25 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=2465041058 TSecr=0 WS=128

35 5.263335677 172.16.123.1 → 103.168.172.64 TCP 74 [TCP Retransmission] [TCP Port numbers reused] 37218 → 25 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=2465042082 TSecr=0 WS=128

and a few more retransmits looking very like the ones above.

=> More informations about this toot | More toots from gjm@mathstodon.xyz

Written by mathew on 2024-09-27 at 03:12

@fanf @gjm Hmm, OK, that’s pretty baffling. Paging @rk in case he might have some ideas. If I think of anything in my sleep I’ll let you know.

=> More informations about this toot | More toots from mathew@universeodon.com

Written by gjm on 2024-09-27 at 09:52

@mathew @fanf @rk Thanks!

I bet that I've somehow done some incredibly boneheaded thing, but at the moment I can't imagine what it might be.

=> More informations about this toot | More toots from gjm@mathstodon.xyz

Written by gjm on 2024-09-27 at 10:32

@mathew @fanf @rk "Fixed" now by explicitly putting "port = 587" in Exim's SMTP smarthost transport config. I still don't know what changed (I don't think that was ever there before), and I suspect this isn't the right way to fix it, but it seems like it's working now.

=> More informations about this toot | More toots from gjm@mathstodon.xyz

Written by gjm on 2024-09-27 at 10:07

@mathew @fanf Wait, I just noticed something extremely damning in those packets. It's trying to talk to port 25 on the remote server, not port 587. That's got to be wrong, and somehow it's got to be my fault, though I don't know what I did that would cause it. But this feels like something that should be solvable.

[EDITED to add:] No, I'm a moron, and the initial SMTP connection is meant to be on port 25. Right? I am so ignorant. I will try it with Python smtplib again and watch exactly what it does.

[EDITED again:] Nope, Python/smtplib connects on port 587 right from the start and never attempts port 25. This must be what's broken.

[EDITED a last time:] Aaaand the ghastly hack of adding "port = 587" to the smtp smarthost section of my Exim configuration makes it work again. I suspect I should be doing something less ghastly. I still don't know what made this break.

=> More informations about this toot | More toots from gjm@mathstodon.xyz

Written by mathew on 2024-09-27 at 14:09

@gjm @fanf Plain text SMTP should be on port 25; STARTTLS on port 587; and “always on” implicit TLS on port 465.

My guess is that pobox used to allow port 25, maybe even with STARTTLS, for maximum compatibility with badly written customer SMTP clients. They probably stopped doing so because both plaintext SMTP and STARTTLS are considered deprecated at this point (per RFC 8314), and there just aren’t that many flavors of email client out there any more.

The other possibility that springs to mind is that there’s some config file that specifies a default port for exim, and a routine version update resulted in that config file getting changed.

=> More informations about this toot | More toots from mathew@universeodon.com

Written by gjm on 2024-09-27 at 15:10

@mathew @fanf Yeah, sounds plausible. Though I'm fairly sure I didn't accidentally update exim, and looking at things like file modification times I'm now really quite sure I didn't.

=> More informations about this toot | More toots from gjm@mathstodon.xyz

Written by gjm on 2024-09-27 at 01:02

@mathew Exim is using IPv4, or at least it explicitly lists IPv4 addresses for the things it tries to talk to (including in particular the server it's failing to talk to). For the Python, I tried connecting using the DNS name (which for all I know might be doing IPv6) but also using the explicit IPv4 address. Exim failed, Python worked. I'm pretty sure the above is good evidence that both were using IPv4.

(The DNS name actually has records for two different IPv4 addresses. I only tried one of them in Python, but I have observed Exim failing to talk to that same one.)

=> More informations about this toot | More toots from gjm@mathstodon.xyz

Proxy Information
Original URL
gemini://mastogem.picasoft.net/thread/113206730182580564
Status Code
Success (20)
Meta
text/gemini
Capsule Response Time
333.169688 milliseconds
Gemini-to-HTML Time
9.884569 milliseconds

This content has been proxied by September (3851b).