Discussion:
Separate transport for retried recipients
Patrik Rak
2013-05-11 10:59:13 UTC
Hello,

Some time ago I was setting up yet another postfix deployment, and I was
once again thinking about the case when (temporarily) undeliverable
recipients block most or all of the available delivery agents.

In enterprise environments this problem has traditionally been solved by
using the fallback_relay feature to pass these recipients to a standalone
postfix server (or at least a separate instance). This is perhaps not going
to change, as on high traffic sites such recipients can clog not only the
available delivery agents, but the entire active queue as well.

However, on medium traffic sites, I was thinking that there might be a
more convenient solution which would not require a standalone server or
a multi-instance setup. The idea is that qmgr could automatically use a
different transport for recipients from the deferred queue (as opposed to
those coming from the incoming queue). This wouldn't eliminate the active
queue bottleneck, but other than that it would keep the retried recipients
out of the way of the new ones entirely. We have already used a similar
approach in the past to separate the inbound and outbound recipients by
introducing the relay transport class, for example.

Configuration-wise, it might work like this:

- in master.cf, clone the smtp transport, call it "slow" for example
- in main.cf, set smtp_retry_transport = slow
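
For illustration, a minimal sketch of what that could look like (the
"slow" service and the smtp_retry_transport parameter are proposed
names here, not existing Postfix features):

    # master.cf: a clone of the smtp transport (process limit chosen arbitrarily)
    slow      unix  -       -       n       -       20      smtp
    # main.cf: route recipients retried from the deferred queue to "slow"
    smtp_retry_transport = slow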

Implementation-wise, the following changes would be necessary:

- when creating the message structure, qmgr would need to keep track
of whether it came from the incoming or the deferred queue
- in qmgr_message_resolve(), just before looking up the transport,
when the message originates from deferred queue, qmgr would
replace the transport name in the reply with the configured retry
variant if it is defined and such transport exists.
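
A rough standalone sketch of that substitution step, for illustration
only (the names below are invented; this is not actual qmgr code, and
the check that the retry transport actually exists in master.cf is
omitted):

    /*
     * Pick the transport name for a resolved recipient: use the
     * configured retry variant only for mail that was read from the
     * deferred queue, and only if such a variant is configured.
     */
    static const char *pick_transport(const char *resolved_transport,
                                      const char *retry_transport,
                                      int from_deferred_queue)
    {
        if (from_deferred_queue && retry_transport && *retry_transport)
            return retry_transport;             /* e.g. "slow" */
        return resolved_transport;              /* e.g. "smtp" */
    }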

What do you think? Is it an idea worth implementing?

Patrik
Wietse Venema
2013-05-11 13:00:09 UTC
Post by Patrik Rak
Hello,
Some time ago I was setting up yet another postfix deployment, and I was
once again thinking about the case when (temporarily) undeliverable
recipients block most or all of the available delivery agents.
Low-level comments:

- What common use case has different per-recipient (not: per-sender,
etc.) soft reject rates for a mail stream between two sites? Does
it matter whether some portion of a mail stream between two sites
is deferred because of the recipient, sender or other cause?

- Postfix has multiple transports configured: the required ones
such as local_transport (local), default_transport (smtp),
relay_transport (relay), plus the ones that aren't selected with
$local/virtual/relay/default_transport such as retry, error, and
ad-hoc transports. It would be wrong to hard-code the "alternate
retry transport" feature to just one transport. Even if it were
used only with smtp-like transports, there may be more than one.

- Postfix tries to play "nice" by not overwhelming remote servers
with many connections. This is scheduled per transport, not across
transports. I'm not claiming that Postfix concurrency scheduling
solves all problems, but having two transports sending to the same
destination would complicate things a little (but no more than
having sender-dependent source IP addresses).

Wietse
Post by Patrik Rak
In enterprise environments this problem has traditionally been solved by
using the fallback_relay feature to pass these recipients to a standalone
postfix server (or at least a separate instance). This is perhaps not going
to change, as on high traffic sites such recipients can clog not only the
available delivery agents, but the entire active queue as well.
However, on medium traffic sites, I was thinking that there might be a
more convenient solution which would not require a standalone server or
a multi-instance setup. The idea is that qmgr could automatically use a
different transport for recipients from the deferred queue (as opposed to
those coming from the incoming queue). This wouldn't eliminate the active
queue bottleneck, but other than that it would keep the retried recipients
out of the way of the new ones entirely. We have already used a similar
approach in the past to separate the inbound and outbound recipients by
introducing the relay transport class, for example.
- in master.cf, clone the smtp transport, call it "slow" for example
- in main.cf, set smtp_retry_transport = slow
- when creating the message structure, qmgr would need to keep track
of whether it came from the incoming or the deferred queue
- in qmgr_message_resolve(), just before looking up the transport,
when the message originates from deferred queue, qmgr would
replace the transport name in the reply with the configured retry
variant if it is defined and such transport exists.
What do you think? Is it an idea worth implementing?
Patrik
Patrik Rak
2013-05-11 14:20:51 UTC
Post by Wietse Venema
Post by Patrik Rak
Some time ago I was setting up yet another postfix deployment, and I was
once again thinking about the case when (temporarily) undeliverable
recipients block most or all of the available delivery agents.
- What common use case has different per-recipient (not: per-sender,
etc.) soft reject rates for a mail stream between two sites? Does
it matter whether some portion of a mail stream between two sites
is deferred because of the recipient, sender or other cause?
The use case I am interested in is basically some service sending
registration confirmation messages to its users, where some users decide
to fill in bogus addresses which result in temporary errors until the
message expires and bounces. Such messages tend to stockpile in the
deferred queue and can quite dominate the active queue and adversely
affect the deliveries to proper recipients, especially when these bogus
recipients are not deferred immediately, but only after a considerably
long timeout.

I don't think the other standard scenarios of message deferral, like when
the network connection really goes down temporarily, would benefit much
from the separate transport I propose. OTOH, it shouldn't affect them
terribly either.

And as for the deferral cause, I think that from the point of view of
blocked delivery agents it doesn't really matter why exactly the message
was blocked. In my experience, if something was deferred, it is more
likely than not to be deferred again, which seems reason enough to try
to make sure it doesn't take resources away from fresh deliveries...

However, I would very much like to hear what other people think and how
it might affect their setups...
Post by Wietse Venema
- Postfix has multiple transports configured: the required ones
such as local_transport (local), default_transport (smtp),
relay_transport (relay), plus the ones that aren't selected with
$local/virtual/relay/default_transport such as retry, error, and
ad-hoc transports. It would be wrong to hard-code the "alternate
retry transport" feature to just one transport. Even if it were
used only with smtp-like transports, there may be more than one.
Definitely. In the example I gave, the smtp_retry_transport was meant to
work as the usual per-transport <transport>_retry_transport fallback.

And I would not mind if you want to call it something different either,
to prevent any possible confusion with the retry transport based on the
error delivery agent.
Post by Wietse Venema
- Postfix tries to play "nice" by not overwhelming remote servers
with many connections. This is scheduled per transport, not across
transports. I'm not claiming that Postfix concurrency scheduling
solves all problems, but having two transports sending to the same
destination would complicate things a little (but no more than
having sender-dependent source IP addresses).
Hmm, right. I haven't considered this explicitly. However:

- the impact on the target site doesn't seem to be worse than if the
fallback_relay feature was used to deal with the problem in the first place.

- the concurrency window limit of that alternate transport can be
explicitly configured to be small, which should minimize the difference
of the load caused on the target site.

- in the usual case when the recipients are deferred again, the
concurrency window wouldn't even grow in the first place. (In my use
case there is often not even a proper site to connect to as such, but
that's not an argument, of course).

- the current cohort based concurrency algorithm should play nicely even
with an unknown number of other hosts connecting to the target site, so
it should work reasonably well with two separate transports each trying
on their own as well.

I am all eager to hear what Victor has to say about this one, though...
He has a lot of experience with problematic sites using small
concurrency windows, from what I remember...
Post by Wietse Venema
Post by Patrik Rak
- in master.cf, clone the smtp transport, call it "slow" for example
- in main.cf, set smtp_retry_transport = slow
I forgot to explicitly mention that the parameters of the "slow"
transport can be configured independently of the original smtp
transport, but I guess that's obvious.
Post by Wietse Venema
Post by Patrik Rak
- when creating the message structure, qmgr would need to keep track
of whether it came from the incoming or the deferred queue
- in qmgr_message_resolve(), just before looking up the transport,
when the message originates from deferred queue, qmgr would
replace the transport name in the reply with the configured retry
variant if it is defined and such transport exists.
Hmm, or shall this perhaps be made part of trivial rewrite resolving
instead? Which one would you prefer?

Patrik
Wietse Venema
2013-05-11 15:00:20 UTC
Post by Patrik Rak
Post by Patrik Rak
- when creating the message structure, qmgr would need to keep track
of whether it came from the incoming or the deferred queue
- in qmgr_message_resolve(), just before looking up the transport,
when the message originates from deferred queue, qmgr would
replace the transport name in the reply with the configured retry
variant if it is defined and such transport exists.
Hmm, or shall this perhaps be made part of trivial rewrite resolving
instead? Which one would you prefer?
This is already how the qmgr chooses between transports for normal
delivery, and transports for address verification. I don't expect
that adding a third request type (for retried deliveries) would
break resolver requests from other Postfix daemons.

Also, this orthogonal approach guarantees that there will be no
surprising interactions between address verification and retried
deliveries. A request is one of "normal", "verification" or "retried",
(i.e. there is no overlap).

With this approach, each additional resolver request type results
in a number of additional configuration parameters:

XXX_default_transport = $default_transport
XXX_local_transport = $local_transport
XXX_relay_transport = $relay_transport
XXX_virtual_transport = $virtual_transport

XXX_relayhost = $relayhost
XXX_transport_maps = $transport_maps

XXX_sender_dependent_default_transport_maps = $sender_dependent_default_transport_maps
XXX_sender_dependent_relayhost_maps = $sender_dependent_relayhost_maps

This may appear unwieldy at first, but it does give a lot of control
over how Postfix handles delayed mail deliveries.
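
For example, if the new request type ended up being named "deferred"
(both the prefix and the values below are hypothetical), a site could say:

    deferred_default_transport = slow
    deferred_transport_maps = hash:/etc/postfix/transport_deferred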

Wietse
Patrik Rak
2013-05-11 16:37:54 UTC
Post by Wietse Venema
Post by Patrik Rak
Hmm, or shall this perhaps be made part of trivial rewrite resolving
instead? Which one would you prefer?
This is already how the qmgr chooses between transports for normal
delivery, and transports for address verification. I don't expect
that adding a third request type (for retried deliveries) would
break resolver requests from other Postfix daemons.
Also, this orthogonal approach guarantees that there will be no
surprising interactions between address verification and retried
deliveries. A request is one of "normal", "verification" or "retried",
(i.e. there is no overlap).
With this approach, each additional resolver request type results in a
number of additional configuration parameters:
XXX_default_transport = $default_transport
XXX_local_transport = $local_transport
XXX_relay_transport = $relay_transport
XXX_virtual_transport = $virtual_transport
XXX_relayhost = $relayhost
XXX_transport_maps = $transport_maps
XXX_sender_dependent_default_transport_maps = $sender_dependent_default_transport_maps
XXX_sender_dependent_relayhost_maps = $sender_dependent_relayhost_maps
This may appear unwieldy at first, but it does give a lot of control
over how Postfix handles delayed mail deliveries.
Looks great.

I'll see if others have some additional comments, but other than that
this sounds like a plan...

Patrik
Viktor Dukhovni
2013-05-11 16:40:01 UTC
Post by Patrik Rak
Post by Wietse Venema
- What common use case has different per-recipient (not: per-sender,
etc.) soft reject rates for a mail stream between two sites? Does
it matter whether some portion of a mail stream between two sites
is deferred because of the recipient, sender or other cause?
The use case I am interested in is basically some service
sending registration confirmation messages to its users, where some
users decide to fill in bogus addresses which result in temporary
errors until the message expires and bounces. Such messages tend to
stockpile in the deferred queue and can quite dominate the active
queue and adversely affect the deliveries to proper recipients,
especially when these bogus recipients are not deferred immediately,
but only after a considerably long timeout.
The only way to deal with high latency is with high concurrency,
thus maintaining a reasonable throughput (concurrency/latency).
Most cases of high latency due to bogus domains, non-responding MX
hosts, ... are cases in which the concurrency at the receiving
system is zero, since no SMTP connection is ever made. So in this
case you want at least a high process limit for the transport. If
the bogus destinations are many, then this is enough.

One would need to size the active queue limits for some multiple
of the expected 5-days of bad addresses so that such mail rarely
fills the active queue. Since Postfix 1.0 was released in 2001,
the price of RAM has fallen considerably. It is now quite
cost-effective to build servers with 1-4 GB of RAM or more. So an
MTA with this problem should have a large active queue size to avoid
running out of queue slots.
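
(As a rough sketch of that kind of sizing, using queue manager limits
that do exist today, with purely illustrative values:

    qmgr_message_active_limit = 100000
    qmgr_message_recipient_limit = 100000

plus a correspondingly higher process limit for the smtp transport in
master.cf.)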

I think such tuning is a pain in a single instance of Postfix,
and monitoring such a queue is needlessly complex with a single
instance. I find all the fear and loathing of multiple instances
perplexing. Multiple instances are *simpler* than intricately
tuned single instances.
Post by Patrik Rak
- the concurrency window limit of that alternate transport can be
explicitly configured to be small, which should minimize the
difference of the load caused on the target site.
That would be a mistake. You want a high concurrency, which is
problematic for retries to some legitimate destinations (say Yahoo
after greylisting). Therefore, what one really wants to know is:

- Did the message fail via a 4XX reply or connection failure?

- Is this the first failure, or has delivery failed multiple times?
(though with greylisting, one's own retry time may be sooner than
the receiver's minimum delay).

Thus one may want to keep messages that fail for the first time or
with a 4XX reply rather than a timeout or connection failure in
the same queue as regular mail, while sending messages that time
out after being deferred into a fallback queue (remote or second
instance).

For this one would need to change the SMTP delivery agent to use
a conditional fallback relay. This would be added to the delivery
request by the queue manager when processing messages from the
deferred queue, and used by the SMTP delivery agent only when the
"last" regular MX host "site-fails" (not 4XX reply).

The effect is to separate slow mail that times out multiple times,
whose delivery could clog the queue, from other mail that is in
the queue briefly, or whose delivery failures are in any case fast
enough to not be a big problem.
Post by Patrik Rak
I am all eager to hear what Victor has to say about this one,
though... He has a lot of experience with problematic sites using
small concurrency windows, from what I remember...
I don't think that additional transports in the same instance are
a good idea here. Too much complexity, and still a high risk of
a full active queue. With a second downstream instance that holds
only the slow mail, one can tune concurrency up, and tune any queue
monitoring more appropriately to the content in hand. One can also
adjust queue lifetimes more sensibly, ...

So I propose:

- No changes in trivial-rewrite.

- No additional transport personalities.

- One additional parameter to define a queue-manager signalled
fallback relay, included with delivery requests for messages
that come from the deferred queue.

- This fallback relay is ignored by default by all delivery agents, and
is optionally available in the smtp(8) delivery agent, which needs
a second non-default parameter to enable its use.

- The second parameter would be set by administrators of affected
sites in the "smtp" transport, and likely not set in the "relay"
transport.

- Sludge (connection timeout or failure possibly combined with a minimum
message age) goes to a remote or second instance queue.
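
In configuration terms this could look something like the following
(both parameter names are made up for illustration; neither exists today):

    # main.cf: fallback relay that qmgr includes with delivery requests
    # for messages coming from the deferred queue
    deferred_mail_fallback_relay = [sludge.example.com]
    # second, non-default parameter that lets the smtp transport use it
    # (it would typically stay unset for the relay transport)
    smtp_use_deferred_fallback_relay = yes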

The main difficulty is that this meshes somewhat poorly with
"defer_transports", since some deferred mail may be "innocent" and
could be sent to the slow queue when the transport is no longer
deferred if the first delivery fails, but this edge case is likely
not significant. Similar considerations for mail released from
the "hold" to the "deferred" queues.

We could extend the queue-file format to define a new record type
which is a variant of 'R' (recipient): this would be a recipient
that failed slowly on the last delivery, and should become sludge
on the next failed delivery. It behaves just like 'R', except
in the smtp(8) delivery agent, which sends it to the sludge fallback.

That way, the queue-manager is even simpler: just treat 'R' and
the new record identically, and let smtp(8) do all the work, but
now defer_append() would need to be able to update the recipient
record type just like sent().
--
Viktor.
Patrik Rak
2013-05-11 17:13:57 UTC
Post by Viktor Dukhovni
Post by Patrik Rak
Post by Wietse Venema
- What common use case has different per-recipient (not: per-sender,
etc.) soft reject rates for a mail stream between two sites? Does
it matter whether some portion of a mail stream between two sites
is deferred because of the recipient, sender or other cause?
The use case I am interested in is basically some service
sending registration confirmation messages to its users, where some
users decide to fill in bogus addresses which result in temporary
errors until the message expires and bounces. Such messages tend to
stockpile in the deferred queue and can quite dominate the active
queue and adversely affect the deliveries to proper recipients,
especially when these bogus recipients are not deferred immediately,
but only after a considerably long timeout.
The only way to deal with high latency is with high concurrency,
thus maintaining a reasonable throughput (concurrency/latency).
Most cases of high latency due to bogus domains, non-responding MX
hosts, ... are cases in which the concurrency at the receiving
system is zero, since no SMTP connection is ever made. So in this
case you want at least a high process limit for the transport. If
the bogus destinations are many, then this is enough.
One would need to size the active queue limits for some multiple
of the expected 5-days of bad addresses so that such mail rarely
fills the active queue. Since Postfix 1.0 was released in 2001,
the price of RAM has fallen considerably. It is now quite
cost-effective to build servers with 1-4 GB of RAM or more. So an
MTA with this problem should have a large active queue size to avoid
running out of queue slots.
I think such tuning is a pain in a single instance of Postfix,
and monitoring such a queue is needlessly complex with a single
instance. I find all the fear and loathing of multiple instances
perplexing. Multiple instances are *simpler* than intricately
tuned single instances.
Post by Patrik Rak
- the concurrency window limit of that alternate transport can be
explicitly configured to be small, which should minimize the
difference of the load caused on the target site.
That would be a mistake. You want a high concurrency, which is
problematic for retries to some legitimate destinations (say Yahoo
after greylisting). Therefore, what one really wants to know is:
- Did the message fail via a 4XX reply or connection failure?
- Is this the first failure, or has delivery failed multiple times?
(though with greylisting, one's own retry time may be sooner than
the receiver's minimum delay).
Thus one may want to keep messages that fail for the first time or
with a 4XX reply rather than a timeout or connection failure in
the same queue as regular mail, while sending messages that time
out after being deferred into a fallback queue (remote or second
instance).
For this one would need to change the SMTP delivery agent to use
a conditional fallback relay. This would be added to the delivery
request by the queue manager when processing messages from the
deferred queue, and used by the SMTP delivery agent only when the
"last" regular MX host "site-fails" (not 4XX reply).
The effect is to separate slow mail that times out multiple times,
whose delivery could clog the queue, from other mail that is in
the queue briefly, or whose delivery failures are in any case fast
enough to not be a big problem.
Post by Patrik Rak
I am all eager to hear what Victor has to say about this one,
though... He has a lot of experience with problematic sites using
small concurrency windows, from what I remember...
I don't think that additional transports in the same instance are
a good idea here. Too much complexity, and still a high risk of
a full active queue. With a second downstream instance that holds
only the slow mail, one can tune concurrency up, and tune any queue
monitoring more appropriately to the content in hand. One can also
adjust queue lifetimes more sensibly, ...
- No changes in trivial-rewrite.
- No additional transport personalities.
- One additional parameter to define a queue-manager signalled
fallback relay, included with delivery requests for messages
that come from the deferred queue.
- This fallback relay is ignored by default by all delivery agents, and
is optionally available in the smtp(8) delivery agent, which needs
a second non-default parameter to enable its use.
- The second parameter would be set by administrators of affected
sites in the "smtp" transport, and likely not set in the "relay"
transport.
- Sludge (connection timeout or failure possibly combined with a minimum
message age) goes to a remote or second instance queue.
The main difficulty is that this meshes somewhat poorly with
"defer_transports", since some deferred mail may be "innocent" and
could be sent to the slow queue when the transport is no longer
deferred if the first delivery fails, but this edge case is likely
not significant. Similar considerations for mail released from
the "hold" to the "deferred" queues.
We could extend the queue-file format to define a new record type
which is a variant of 'R' (recipient): this would be a recipient
that failed slowly on the last delivery, and should become sludge
on the next failed delivery. It behaves just like 'R', except
in the smtp(8) delivery agent, which sends it to the sludge fallback.
That way, the queue-manager is even simpler: just treat 'R' and
the new record identically, and let smtp(8) do all the work, but
now defer_append() would need to be able to update the recipient
record type just like sent().
Hmm, all good points, Victor. But this seems to be addressing the
situation in the context of high traffic sites. In such environments, a
second server or instance is definitely a necessity, and what you propose
would help with the active-queue-size bottleneck. Such servers are
dedicated to their cause and have enough power to run thousands of
delivery agents in parallel. Some alternate transport would make little
difference in such a case.

However, what I was proposing was targeting sites with much less traffic
than what you are used to. In those contexts the problem is not critical
enough to make me want to set up another instance just to deal with that
- OTOH, it is enough to cause noticeable delays in mail deliveries. I
thought it might be a good idea if postfix was able to deal with that
almost out-of-the-box, without resorting to multiple instances. For you,
setting up multiple instances is second nature, but I am afraid that for
someone who just wants to get their web app mail delivered in a timely
manner it would be too much hassle and they would just frown upon
postfix's inability to do that instead...

Anyway, I am not pushing it at all. I just wanted to mention it in case
someone else would find that feature appealing.

Patrik
Patrik Rak
2013-05-11 17:22:48 UTC
Post by Viktor Dukhovni
Post by Patrik Rak
The use case I am interested in is basically some service
sending registration confirmation messages to its users, where some
users decide to fill in bogus addresses which result in temporary
errors until the message expires and bounces. Such messages tend to
stockpile in the deferred queue and can quite dominate the active
queue and adversely affect the deliveries to proper recipients,
especially when these bogus recipients are not deferred immediately,
but only after a considerably long timeout.
...
Post by Viktor Dukhovni
One would need to size the active queue limits for some multiple
of the expected 5-days of bad addresses so that such mail rarely
fills the active queue. Since Postfix 1.0 was released in 2001,
the price of RAM has fallen considerably. It is now quite
cost-effective to build servers with 1-4 GB of RAM or more. So an
MTA with this problem should have a large active queue size to avoid
running out of queue slots.
BTW, perhaps I wasn't quite clear above when I said those messages
dominate the queue. I meant it like their amount in the active queue is
significant and prevailing, but not that they are really filling the
active queue entirely. Rather than addressing the full-active-queue
bottleneck (which you have most likely responded to), I was addressing
the all-delivery-agents-busy bottleneck. Especially in more limited
environments where increasing the delivery agent limit tenfold or more
is not a viable option.

Patrik
Viktor Dukhovni
2013-05-11 18:11:03 UTC
Post by Patrik Rak
BTW, perhaps I wasn't quite clear above when I said those messages
dominate the queue. I meant it like their amount in the active queue
is significant and prevailing, but not that they are really filling
the active queue entirely. Rather than addressing the
full-active-queue bottleneck (which you have most likely responded
to), I was addressing the all-delivery-agents-busy bottleneck.
Especially in more limited environments where increasing the
delivery agent limit tenfold or more is not a viable option.
So you want to desludge the outbound delivery agents, by dedicating
a subset of the process slots to fresh mail. This does not need
a second transport. Rather, just do that within a single transport!

The queue manager knows how many active deliveries it has scheduled
for each transport. Suppose we give the queue manager a per-transport
upper bound on the number of concurrent deliveries of previously
deferred messages. Then the remaining process slots are for fresh
mail only, and fresh deliveries cannot be starved.

This largely solves the problem, and is much simpler to configure:

# Out of a total of $default_process_limit (100), leaving 20
# for fresh mail. Adjust appropriately when master.cf or
# default_process_limit are changed.
#
smtp_deferred_concurrency_limit = 80

If the queue-manager were changed to read master.cf and thus know
delivery agent process limits, then the above could be a percentage,
or a fixed number of slots could be reserved for new mail.

This is much simpler than configuring additional transports.
--
Viktor.
Patrik Rak
2013-05-11 22:13:22 UTC
I have considered this solution as well.

It sounds simple enough, but once you start thinking about how to
implement it, it's not as easy. The per-transport scheduling is already
fairly complex as it is, with concurrency windows and blocker jobs and
whatnot - introducing yet another threshold into the mix doesn't seem to
simplify that. I am not saying it's impossible - but reconsidering all
the corner cases seems far enough from trivial that I am reluctant to
undertake it. But maybe that's only because I'm too familiar with the
qmgr internals, which makes it seem more difficult than it really is?

The separate transport OTOH provides clear separation. More importantly,
though, it allows you to configure many things like timeouts etc.
differently for first-time and retried deliveries - something which can't
be trivially achieved with a single transport.

What would really be different when using a single transport as opposed
to two transports? Wietse pointed out the concurrency limits - does
anything else come to mind? Is there something which the separation into
two transports would make worse? I am trying, but I can't think of
anything right now. I'll keep trying...

Another option to consider is to keep the transports separate for most
operations but bind them together somehow so that some aspects could be
shared where such behavior is desirable. But I am not sure that it is
really necessary as of now - and it wouldn't make things simpler either.

Anyway, thank you both for your comments so far. Much appreciated.

Patrik
Post by Viktor Dukhovni
Post by Patrik Rak
BTW, perhaps I wasn't quite clear above when I said those messages
dominate the queue. I meant it like their amount in the active queue
is significant and prevailing, but not that they are really filling
the active queue entirely. Rather than addressing the
full-active-queue bottleneck (which you have most likely responded
to), I was addressing the all-delivery-agents-busy bottleneck.
Especially in more limited environments where increasing the
delivery agent limit tenfold or more is not a viable option.
So you want to desludge the outbound delivery agents, by dedicating
a subset of the process slots to fresh mail. This does not need
a second transport. Rather, just do that within a single transport!
The queue manager knows how many active deliveries it has scheduled
for each transport. Suppose we give the queue manager a per-transport
upper bound on the number of concurrent deliveries of previously
deferred messages. Then the remaining process slots are for fresh
mail only, and fresh deliveries cannot be starved.
# Out of a total of $default_process_limit (100), leaving 20
# for fresh mail. Adjust appropriately when master.cf or
# default_process_limit are changed.
#
smtp_deferred_concurrency_limit = 80
If the queue-manager were changed to read master.cf and thus know
delivery agent process limits, then the above could be a percentage,
or a fixed number of slots could be reserved for new mail.
This is much simpler than configuring additional transports.
--
Viktor.
Wietse Venema
2013-05-11 22:33:22 UTC
Post by Viktor Dukhovni
# Out of a total of $default_process_limit (100), leaving 20
# for fresh mail. Adjust appropriately when master.cf or
# default_process_limit are changed.
#
smtp_deferred_concurrency_limit = 80
Even simpler: stop reading the deferred queue when more than N% of
the maximal number of recipient slots is from deferred mail.
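
As a sketch, with a made-up parameter name (no such knob exists today):

    qmgr_deferred_recipient_percent = 50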

Wietse
Viktor Dukhovni
2013-05-11 22:49:09 UTC
Post by Wietse Venema
Post by Viktor Dukhovni
# Out of a total of $default_process_limit (100), leaving 20
# for fresh mail. Adjust appropriately when master.cf or
# default_process_limit are changed.
#
smtp_deferred_concurrency_limit = 80
Even simpler: stop reading the deferred queue when more than N% of
the maximal number of recipient slots is from deferred mail.
This does not address Patrick's stated goal of avoiding process
saturation in the smtp transport by slow mail to bogus destinations.
(Similar to my 2001 analysis that motivated "relay" for inbound mail
and taught me the importance of recipient validation).

Rather, it addresses active limit exhaustion. The idea is perhaps
a good one anyway. Reserve some fraction of the active queue limits
for new mail, so that when enough deferred mail is in core, only new
mail is processed along with the already deferred mail.

A separate mechanism is still needed to avoid using all ~100 smtp
transport delivery processes for deferred mail. This means that
Patrick would need to think about whether the existing algorithm
can be extended to take limits on process allocation to deferred
mail into account.

I sympathise with the concern about the internal cost, but if the
solution adds substantial user-visible complexity I contend that
it is pointless, and the users who need this (the sites that accept
subscriptions via HTTP, ...) can just create a multi-instance
config, it is simple enough to do.
--
Viktor.
Wietse Venema
2013-05-11 23:15:39 UTC
Post by Viktor Dukhovni
Post by Wietse Venema
Post by Viktor Dukhovni
# Out of a total of $default_process_limit (100), leaving 20
# for fresh mail. Adjust appropriately when master.cf or
# default_process_limit are changed.
#
smtp_deferred_concurrency_limit = 80
Even simpler: stop reading the deferred queue when more than N% of
the maximal number of recipient slots is from deferred mail.
This is the "less input" approach.
Post by Viktor Dukhovni
This does not address Patrick's stated goal of avoiding process
saturation in the smtp transport by slow mail to bogus destinations.
Indeed, by far the simplest solution for that is to configure more
SMTP client processes. That is the "more output" approach.

If you do neither "less input" nor "more output" then Postfix has
to jump through tricky hoops to enforce policies.

Wietse
Patrik Rak
2013-05-12 09:22:22 UTC
Post by Viktor Dukhovni
Post by Wietse Venema
Even simpler: stop reading the deferred queue when more than N% of
the maximal number of recipient slots is from deferred mail.
This does not address Patrick's stated goal of avoiding process
saturation in the smtp transport by slow mail to bogus destinations.
(Similar to my 2001 analysis that motivated "relay" for inbound mail
and taught me the importance of recipient validation).
Rather, it addresses active limit exhaustion. The idea is perhaps
a good one anyway. Reserve some fraction of the active queue limits
for new mail, so that when enough deferred mail is in core, only new
mail is processed along with the already deferred mail.
I too agree that this one would be really nice to have.

Postfix already prefers new mail when the active queue is full (which is
usually already too late, though), so it shouldn't be too difficult to
change either.
Post by Viktor Dukhovni
A separate mechanism is still needed to avoid using all ~100 smtp
transport delivery processes for deferred mail. This means that
Patrick would need to think about whether the existing algorithm
can be extended to take limits on process allocation to deferred
mail into account.
I have really tried, but unless I separated the two internally
considerably, I always wound up with the deferred recipients somehow
affecting the normal recipients. There are so many memory limits to deal
with, and once you let the deferred recipients in-core, it's hard to get
rid of them. The "less input" approach is a solution here, but I am
afraid it might affect the "real" deferred mail too adversely to be
generally recommended...

The fact that qmgr doesn't know how many delivery agents there are for
each transport doesn't help either. It only knows var_proc_limit, which
is not good enough for this. I recall we had a discussion with Wietse
about this a long time ago, and IIRC we decided at that time that it is
better if qmgr doesn't depend on that value...
Post by Viktor Dukhovni
I sympathise with the concern about the internal cost, but if the
solution adds substantial user-visible complexity I contend that
it is pointless, and the users who need this (the sites that accept
subscriptions via HTTP, ...) can just create a multi-instance
config, it is simple enough to do.
Hmm, if the visible configuration is what bothers you, it would be
equally trivial to implement it so that qmgr splits the transport only
internally, and to the outside world it looks as if there were only one
transport. But I considered this a worse solution, as it would do
something behind the scenes without allowing it to be configured
properly...

Patrik
Viktor Dukhovni
2013-05-12 16:46:18 UTC
Post by Patrik Rak
The fact that qmgr doesn't know how many delivery agents there are
for each transport doesn't help either. It only knows var_proc_limit,
which is not good enough for this. I recall we had a discussion with
Wietse about this a long time ago, and IIRC we decided at that time
that it is better if qmgr doesn't depend on that value...
Yes, of course, I covered this in my earlier post: it would need to
be told an upper bound on the number of processes for deferred entries,
leaving the rest for new entries.

smtp_deferred_concurrency_limit = 0 | limit
Post by Patrik Rak
Post by Viktor Dukhovni
I sympathise with the concern about the internal cost, but if the
solution adds substantial user-visible complexity I contend that
it is pointless, and the users who need this (the sites that accept
subscriptions via HTTP, ...) can just create a multi-instance
config, it is simple enough to do.
Hmm, if the visible configuration is what bothers you, it would be
equally trivial to implement it so that qmgr splits the transport only
internally, and to the outside world it looks as if there were only
one transport. But I considered this a worse solution, as it would do
something behind the scenes without allowing it to be configured
properly...
The "configuring it properly" part raises the complexity cost to a
level where I would suggest that the tiny fraction of sites taking
a high volume of new recipients via HTTP subscription forms can
implement a fallback instance. The explicit parallel transports
are not much simpler. A bulk mail MTA probably needs a fallback
instance anyway.
Post by Patrik Rak
Post by Viktor Dukhovni
Post by Wietse Venema
Even simpler: stop reading the deferred queue when more than N% of
the maximal number of recipient slots is from deferred mail.
This does not address Patrick's stated goal of avoiding process
saturation in the smtp transport by slow mail to bogus destinations.
(Similar to my 2001 analysis that motivated "relay" for inbound mail
and taught me the importance of recipient validation).
Rather, it addresses active limit exhaustion. The idea is perhaps
a good one anyway. Reserve some fraction of the active queue limits
for new mail, so that when enough deferred mail is in core, only new
mail is processed along with the already deferred mail.
I too agree that this one would be really nice to have.
We need to be a bit careful: starving the deferred queue can lead
to an ever-growing deferred queue, with more messages coming in and
getting deferred and never retried. If we are to impose a separate
deferred queue ceiling while continuing to take in new mail, we'll
need a much stiffer coupling between the output rate and the
input rate to avoid congested MTAs becoming bottomless pits for
ever more mail.

The current inflow_delay mechanism does not push back hard enough.
When the inflow_delay timer is exhausted, cleanup goes ahead and
accepts the message. We could consider having cleanup tempfail
when deferred mail hits the ceiling in the active queue.

- Suspend deferred queue scans when we hit a high water mark
on deferred mail in the active queue.

- Resume deferred queue scans when we hit a low water mark on
deferred mail in the active queue.

- On queue manager startup generate a set of default process
limit tokens.

- Generate one token per message moved from incoming into the
active queue, provided deferred queue scans are not suspended.

- Generate one token per message delivered or bounced (removed
rather than deferred) when deferred queue scans are suspended.

- Generate another set of default process limit tokens each
time the queue manager completes a full scan of the incoming
queue, provided deferred queue scans are not suspended.

- Cleanup (based on request flag from local, bounce, pickup vs.
smtpd/qmqpd) either ignores inflow_delay (not much point in
enforcing this with local sources; the mail is already in the
queue) or tempfails after the inflow_delay timer expires.
With remote sources a full queue, as evidenced by lots of
deferred mail in the active queue, exerts stiff back-pressure
on the sending systems.

- We probably need a longer token wait delay if the coupling
is stiffer. This would be a new parameter that turns on
the new behaviour if set non-zero.
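
As a toy model of the bookkeeping sketched above (the names and the
structure are invented for illustration, not a proposed implementation):

    typedef struct {
        int tokens;              /* available intake tokens */
        int deferred_in_active;  /* deferred-queue messages now in the active queue */
        int high_water;          /* suspend deferred scans at or above this */
        int low_water;           /* resume deferred scans at or below this */
        int scans_suspended;     /* non-zero while deferred scans are suspended */
    } flow_state;

    /* Token generation: on incoming->active moves while deferred scans
     * run, or on completed removals while deferred scans are suspended. */
    static void flow_add_token(flow_state *fs)
    {
        fs->tokens += 1;
    }

    /* Re-evaluate the water marks whenever deferred_in_active changes. */
    static void flow_update_scans(flow_state *fs)
    {
        if (fs->deferred_in_active >= fs->high_water)
            fs->scans_suspended = 1;
        else if (fs->deferred_in_active <= fs->low_water)
            fs->scans_suspended = 0;
    }

    /* Intake side: accept a new message only when a token is available;
     * otherwise the caller waits or tempfails. */
    static int flow_try_accept(flow_state *fs)
    {
        if (fs->tokens <= 0)
            return 0;
        fs->tokens -= 1;
        return 1;
    }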

This is not yet a complete design, and requires more thought. We
need to better understand how this behaves when the queue is not
congested and a burst of mail arrives from some source. We also
need to understand how it behaves when the deferred queue is large
and the input rate is stiffly coupled to the output rate.

Unless the active queue is completely full, we're not coupled to
the output rate; rather, we're coupled to the queue-manager's ability
to move mail from incoming into active, with excess tokens acting as
a buffer that is refreshed on each complete incoming queue scan.
If this buffer is not too small (should it be a multiple of the
deferred process limit?) we should be able to accommodate bursts of
mail when not congested without tempfailing any of them, but with
some increase in latency to allow the queue manager to keep up.

What changes is the behaviour when we already have lots of deferred
mail. No new tokens are generated even when incoming queue scans
are completed, and the MTA only accepts as many messages as are
delivered until deferred messages in the active queue fall below
the low water mark.

The gap between the high and low water marks needs to be large
enough to not be significantly impacted by the minimum backoff time
quantization of deferred queue scans.
Post by Patrik Rak
Post by Viktor Dukhovni
A separate mechanism is still needed to avoid using all ~100 smtp
transport delivery processes for deferred mail. This means that
Patrick would need to think about whether the existing algorithm
can be extended to take limits on process allocation to deferred
mail into account.
I have really tried, but unless I separated the two internally
considerably, I always wound up with the deferred recipients
somehow affecting the normal recipients. There are so many memory
limits to deal with, and once you let the deferred recipients
in-core, it's hard to get rid of them. The "less input" approach is
a solution here, but I am afraid it might affect the "real" deferred
mail too adversely to be generally recommended...
Yes, the best internal implementation is to internally separate the
transports for which one defines a process ceiling into two twins;
I arrived at the same conclusion yesterday, before reading your
post, so we're in agreement there.

My point is that this is by far the simpler interface for the user.

Also with both internal logical transports talking to a single pool
of delivery agents, in the absence of deferred mail, new mail can
use the full set of delivery agent processes.
--
Viktor.
Patrik Rak
2013-05-13 17:27:36 UTC
Post by Viktor Dukhovni
Yes, of course, I covered this in my earlier post, it would need to
be told an upper bound on the number of processes for deferred entries,
leaving the rest for new entries.
smtp_deferred_concurrency_limit = 0 | limit
OK, I am perfectly fine with this. In the context of the later mails,
I would perhaps call it <transport>_slow_transport_limit.
Post by Viktor Dukhovni
Post by Wietse Venema
Even simpler: stop reading the deferred queue when more than N% of
the maximal number of recipient slots is from deferred mail.
We need to be a bit careful, starving the deferred queue can lead
to an ever growing deferred queue, with more messages coming and
getting deferred and never retried. If we are to impose a separate
deferred queue ceiling while continuing to take in new mail, we'll
need a much more stiff coupling between the output rate and the
input rate to avoid congested MTAs becoming bottomless pits for
ever more mail.
...
Interesting analysis. A few points, though:

- While I am not against revising the token based flow control, I would
just like to point out that adding the explicit limit which Wietse
suggests doesn't change things much from how it works now (in a bad way,
I mean). Let me elaborate a bit:

Currently, when the active queue becomes full, postfix favors new
mail (otherwise it round robins the queues). Now if the deferred mail
flows away about as fast as new mail, there's no big issue. However, if
the deferred mail delivery takes much longer than new mail delivery, the
deferred queue starts to dominate the active queue up to the point when
it fills it entirely, at which point new mail delivery suffers badly, the
incoming queue starts to fill up and deferred mail may start to pile up
as well. I am just recapping things here; I am pretty sure you know this.

Now imagine you increase the active queue size by, say, 20% more message
slots, and allow only new mail to use those. The situation wrt the
deferred queue doesn't change at all. But you allow more new mail to flow
by, reducing the growth of the incoming queue or perhaps eliminating it
entirely. Overall, the situation improves.

Now if you instead reserve 20% of the active queue size for new mail
only, it's exactly the same as the example above, just scaled down by a
constant. And as the default active queue size is just a quite
arbitrarily chosen constant, I wouldn't be afraid of saying that for
most users it wouldn't make any difference. And those who would perhaps
need to hand-tune this would likely want to hand-tune it anyway.

- I also don't think we need any hysteresis in this case. When the
number of deferred messages in the active queue hits a certain threshold,
simply scan the incoming mail queue only; otherwise round robin the
queues (see the sketch at the end of this mail). This quite naturally
makes sure that whenever the number of deferred messages drops below the
threshold again, we use the deferred queue either next time or the time
after that, so we don't starve either queue.

Also note that this solution works fine regardless of whether deferred
mail is actually slow or not, and likewise for the new mail. Each group
should automatically adjust the ratio of used slots over time according
to the ratio of the corresponding delivery speeds, with the deferred
queue being artificially capped slightly lower than the entire active queue.

- The whole point is to prevent deferred mail from blocking mail
deliveries. Therefore, we definitely do not want to stop accepting new
mail when the deferred mail in the active queue reaches some level, as
you suggest. That would just be self-imposing yet another bottleneck. The
idea is that a lot of the new mail can flow by regardless of how much of
the active queue is used by deferred mail.

- If you are afraid of the deferred queue piling up indefinitely, that's
not going to happen. As I just described above, it will get its share of
the active queue. The token flow mechanism shall take care of the rest.
And if it keeps piling up anyway, the free space check will eventually kick
in and stop accepting incoming mail. :)
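
A tiny sketch of the scan-selection rule from the hysteresis point above
(illustrative names only, not qmgr identifiers):

    enum queue_choice { SCAN_INCOMING, SCAN_DEFERRED };

    /* Round robin the incoming and deferred queues, except that the
     * deferred queue is skipped while its share of the active queue
     * is at or above the configured ceiling. */
    static enum queue_choice pick_queue(int deferred_in_active,
                                        int deferred_ceiling,
                                        int last_scan_was_incoming)
    {
        if (deferred_in_active >= deferred_ceiling)
            return SCAN_INCOMING;
        return last_scan_was_incoming ? SCAN_DEFERRED : SCAN_INCOMING;
    }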

Patrik
Wietse Venema
2013-05-12 16:47:39 UTC
As Postfix maintainer, my interest is to provide a system that meets
a wide range of needs. At the same time the system also has to be
implementable and maintainable. This means that some functionality
will not be implemented, no matter how desirable it might be. The
goal is to find a set of robust mechanisms that provides the most
bang for the buck. Not: to find a set of mechanisms that is perfect.
Perfection can be expensive and fragile.

In the context of this thread, the idea is that new-mail delivery
requests complete in less time on average than deferred-mail delivery
requests (for example because some deferred-mail requests time out),
and that therefore new-mail delivery performance can be improved
by separating new mail from deferred mail, and/or by somehow
prioritizing new mail over deferred mail.

Below is a list (not complete) of causes for mail delays. I also
list non-invasive changes that extend existing functionality
incrementally, and that are unlikely to require lengthy verification
or complicate future developments.

I'm not listing solutions that change nqmgr scheduling within a
per-destination queue. Cool as it might be to also group jobs by
criteria such as sender or incoming/deferred queue, support for
multi-criteria job grouping would have a high implementation and
maintenance cost.

And I also do not list solutions that depend on the cause of delays:
slow or quick handshake failure (for example SMTP site-fail) versus
slow or quick failure after handshake (for example SMTP message-fail
or SMTP recipient-fail). This would require invasive change: the
current infrastructure distinguishes only between "handshake failure"
(i.e. dead site) and "failure after handshake" but does not account
for the amount of time (though the implementation and maintenance
cost would be less than multi-criteria job grouping in the nqmgr
scheduler).

No solution should be rejected because it is imperfect (i.e. there
exists a high-cost change that covers some part of the problem
space, or that covers it better than some other change). Keep in
mind the goal, to find a set of robust mechanisms that provides the
most bang for the buck. Not: to find a set of mechanisms that is
perfect.

Bottlenecks:

- Receiver back-pressure. Make bilateral arrangements.

- Network capacity. Get a bigger pipe or move closer.

- Delivery agents. Increase output and/or process input selectively.
Specifically, configure more delivery agents, and/or separate mail
streams so that one aggregate stream (deferred mail) can't starve
other aggregate streams.

For example, when delivery agents are saturated with deferred
mail, introduce "slow path" / "fast path" delivery. Add a
trivial-rewrite "slow path" personality for deferred mail, and
use the existing ("fast path") personality for new mail. This
is a low-cost change that allows deferred mail to use different
transports (with different concurrencies, timeouts, etc.).

This idea generalizes to other aggregates, such as messages
from the same sender, from the same client, messages larger
than some threshold, and so on. For that we could let the
administrator decide a many->2 mapping from sender, client, or
size to slow/fast path. Initially, the mapping could be based
only on incoming versus deferred queue.

Based on this mapping the queue manager can propagate back
pressure to the incoming/deferred queue-scanning code (see
next).

- Active queue. Process input selectively.

For example, when the active queue becomes congested with "slow
path" mail and the above measures are exhausted, prioritize the
queue scan so that "slow path" mail cannot starve other mail.

- Queue manager. Process address resolution requests in parallel
(event-driven resolver client). This could increase the intrinsic
limit (2500 msg/sec in ~2007) further.

- Disk I/O. Use a solid-state disk, or use a RAID with lots of
battery-backed cache.

Wietse
Patrik Rak
2013-05-12 08:12:24 UTC
Post by Wietse Venema
Post by Viktor Dukhovni
# Out of a total of $default_process_limit (100), leaving 20
# for fresh mail. Adjust appropriately when master.cf or
# default_process_limit are changed.
#
smtp_deferred_concurrency_limit = 80
Even simpler: stop reading the deferred queue when more than N% of
the maximal number of recipient slots is from deferred mail.
Hmm, that could work. It would definitely be a useful knob for dealing
with the active-queue-full bottleneck. And if I wanted, I could set it
so low that it would even solve the delivery-agents bottleneck. Sure, it
might perhaps hamper the throughput a bit for "real" retried recipients
compared to a standalone transport, but it would solve my use case as
well...

Patrik
Patrik Rak
2013-05-12 17:04:08 UTC
Post by Wietse Venema
Post by Viktor Dukhovni
# Out of a total of $default_process_limit (100), leaving 20
# for fresh mail. Adjust appropriately when master.cf or
# default_process_limit are changed.
#
smtp_deferred_concurrency_limit = 80
Even simpler: stop reading the deferred queue when more than N% of
the maximal number of recipient slots is from deferred mail.
Hmm, I have spent quite some more time thinking about all this while
gardening today. Let me put it like this:

I am very well aware of the performance postfix is able to achieve, so I
find it quite embarrassing how surprisingly few bogus recipients it takes
to stall legitimate mail. Whatever the solution, postfix should know
better than that out of the box.

Can we agree on that?

Now we know of two problems which cause this:

A) Congested active queue, which can currently be solved either by
increasing its size considerably, or by passing the deferred mail to a
separate server or instance,

B) Congested delivery agents, which happens much earlier and which can
be somewhat remedied by increasing the transport process limit
sufficiently, or again by using a separate server or instance.

In both cases we are talking about congestion by mail coming from the
deferred queue here. While it can theoretically happen with new mail
from the incoming queue as well, we have already spent considerable
effort in the past to make sure that doesn't happen, so I dare say
that it is not an issue anymore.

The proposals for solving these two problems that we have heard so far are:

0) Using separate server or instance.

1) Limiting amount of deferred mail moved into the active queue.

2) Allowing delivery of mail from deferred queue via dedicated transport.

3) Dedicating some amount of delivery agents to new mail only.

Now, how do these fare compared to each other:

- Solution 0 solves both problems A and B, and is something I would
definitely recommend for any enterprise setup anyway. Unfortunately, it
doesn't work out of the box and requires a nontrivial amount of work to
set up, so as such it is not a proper solution for the problem I described.

- Solution 1 solves problem A quite elegantly, and I believe we are
all in agreement that it would be worth implementing. It can IMO
easily be made the default setting as well.

Technically speaking, it could be used to solve problem B as well, by
making the deferred message limit smaller than the delivery agent
process limit, but imposing such a low limit would quite certainly
affect the throughput of deferred mail, so it is definitely not something
we would want to make the default, and it is therefore out of the
question as the out-of-the-box solution for problem B as well.

- Solution 2 doesn't address problem A at all. It merely solves
problem B. While I was quite glad to see that it would separate the
deferred mail from new mail entirely, it's true that it's perhaps too
pervasive to be made the out-of-the-box default either.

- Solution 3 is similar to solution 2 in that it only solves problem
B. The advantage here is that it doesn't introduce a new transport or
anything, which makes me feel it can be used quite safely as a default
setting as well.

So, from a summary like this, it seems to me that we would want both
solutions 1 and 3, as that would solve both problems A and B out of the box.

Therefore, I suggest that I'll do my best to review the current
algorithm and see how solution 3 could be implemented. Every time I
tried before I concluded that separating the two as much as possible
would be best for the sake of simplicity, but today I was trying to see
the whole problem as yet another limit similar to concurrency window
applied to jobs originating from deferred messages only. And as qmgr has
quite sophisticated support for dealing with blocker jobs and getting
them out of the way already, perhaps it's not that difficult to try to
approach it this way in the end. I'll see...

Any objections?

Patrik

P.S. As I finish this, your most recent responses to my earlier mails
just hit my mailbox. Unfortunately I will have to read them a bit later,
so my apologies for not taking them into account yet in this mail.
Wietse Venema
2013-05-12 18:52:05 UTC
Post by Patrik Rak
Therefore, I suggest that I'll do my best to review the current
algorithm and see how solution 3 could be implemented. Every time I
tried before I concluded that separating the two as much as possible
would be best for the sake of simplicity, but today I was trying to see
the whole problem as yet another limit similar to concurrency window
applied to jobs originating from deferred messages only. And as qmgr has
quite sophisticated support for dealing with blocker jobs and getting
them out of the way already, perhaps it's not that difficult to try to
approach it this way in the end. I'll see...
Any objections?
Please consider not hard-coding your two-class solution to new/deferred
mail only, but allowing one level of indirection so that we can
insert a many-to-2 mapping from message property (now: from queue
to delivery class; later: sender, client or size to delivery class).

The idea is that some part of Postfix may clog up due to mail with
properties other than having failed the first delivery attempt.

Wietse
Post by Patrik Rak
Patrik
P.S. As I finish this, your most recent responses to my earlier mails
just hit my mailbox. Unfortunately I will have to read them a bit later,
so my apologies for not taking them into account yet in this mail.
Viktor Dukhovni
2013-05-13 02:08:36 UTC
Permalink
Post by Wietse Venema
Please consider not hard-coding your two-class solution to new/deferred
mail only, but allowing one level of indirection so that we can
insert a many-to-2 mapping from message property (now: from queue
to delivery class; later: sender, client or size to delivery class).
The idea is that some part of Postfix may clog up due to mail with
properties other than having failed the first delivery attempt.
Since we're addressing congestion caused by slow mail, perhaps
we're going about it the wrong way. The heuristic that deferred
(or selected via some other a-priori characteristic) mail is likely
slow is a very crude approximation, and may be entirely wrong.

Instead, I think we can apply a specific remedy for the actual
congestion as it happens.

- Enhance the master status protocol to add a new state in addition
to busy and idle:

* Blocked. The delivery agent is busy, but blocked waiting to
complete connection setup (the "c" time in delays=...).

The SMTP delivery agent would enter this state at the beginning
of the delivery attempt, and exit it before calling smtp_xfer(),
when another session will be attempted to deliver tempfailed
recipients, the state is re-entered at the completion of
smtp_xfer().

- Add two companion parameters:

# Typically higher than the process limit by some multiple.
#
default_blocked_process_limit = 500

# Processes blocked for more than this time yield their slot
# to new processes, dynamically inflating the process limit.
#
blocked_process_time = 5s

When a process stays blocked for more than blocked_process_time,
master(8) decrements the service busy count and increments
the service blocked count, provided the maximum blocked count
has not been reached. This allows master(8) to create more
processes to handle mail that is not slow.

When a delivery agent that has been blocked for more than
blocked_process_time completes a delivery, it does not
go back to the accept loop. Rather it exits. The process
start-up cost is amortized by the long delay.

- The master.cf maxproc column is optionally extended to allow
setting both the process limit and the blocked process limit.

# service type private unpriv chroot wakeup maxproc command + args
smtp unix - - n - 200/900 smtp

The syntax is "m[/n]" where m is the process limit or "-" for default,
and "n" is the blocked process limit or is not specified for default.

This directly addresses process starvation via slow processes, and
does not require any queue manager changes (the queue manager is the
most expensive to support with complex features).
--
Viktor.
Wietse Venema
2013-05-13 02:12:06 UTC
Permalink
Post by Viktor Dukhovni
(the queue manager is the
most expensive to support with complex features).
That's only because I haven't let anyone else hack around in the
master daemon. Changing this code is incredibly expensive.

Wietse
Viktor Dukhovni
2013-05-13 03:20:36 UTC
Permalink
Post by Wietse Venema
Post by Viktor Dukhovni
(the queue manager is the
most expensive to support with complex features).
That's only because I haven't let anyone else hack around in the
master daemon. Changing this code is incredibly expensive.
OK, master(8) is not trivial, and yet the busy/idle logic is
relatively simple. All we need is a new per-service counter (the
blocked count), and a timer event turned on by children entering
the "blocked" state and turned off by children leaving the blocked
state (including process termination).

No change is free, but I would like to suggest that your response
(if it is a negative gut reaction to my proposal, rather than a
simple observation) is a bit hasty.

The reasonable response to latency spikes is creating concurrency
spikes. This limits the maximum latency for the busy state (normal
mail) to ~5s (which is commensurate with normal delivery latency
to remote sites that do various DNS lookups, ...), which gives a
throughput floor of process_limit/5 rather than
process_limit/30 (or possibly even higher denominators with additional
delays in DNS lookups and delays between connect and "MAIL").
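
To put rough numbers on that (an illustration only, using the default
process limit of 100 mentioned later in this thread), the two floors
work out to:

$ echo "1k 100 5 / p" | dc
20.0
$ echo "1k 100 30 / p" | dc
3.3

i.e. roughly 20 versus 3 deliveries per second per transport.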

If my suggestion is accepted we need to include the DNS resolution
delay in the "blocked" time. The DNS timeouts are controlled by
the DNS library and can be long on some systems (dominating connection
timeouts).

Please consider this option; I think it addresses the issue head-on.
It also avoids creating destination concurrency spikes, which
may be frowned upon by remote destinations. With multiple transports
the concurrencies add up; with my proposal delivery goes on for
unblocked destinations, but blocked destinations don't get increased
per-destination concurrency; rather, the added concurrency is an
aggregate across many destinations.
--
Viktor.
Wietse Venema
2013-05-13 10:55:12 UTC
Permalink
Post by Viktor Dukhovni
The reasonable response to latency spikes is creating concurrency
spikes.
By design, Postfix MUST be able to run in a fixed resource budget.
Your on-demand concurrency spikes break this principle and will
result in unexpected resource exhaustion.

If you want to run more processes increase the process limit. Then,
Postfix's resource budget can be validated with smtp-sink/source
like tools.

Recall that the automatic response to smtpd overload was NOT running
more smtpd processes. Instead, the solution was allowing them to
run with shorter timeouts. That approach respects the "fixed budget"
requirement.

Please consider a "stress" like equivalent for delivery agents.

Wietse
Viktor Dukhovni
2013-05-13 11:42:37 UTC
Permalink
Post by Wietse Venema
Post by Viktor Dukhovni
The reasonable response to latency spikes is creating concurrency
spikes.
By design, Postfix MUST be able to run in a fixed resource budget.
Your on-demand concurrency spikes break this principle and will
result in unexpected resource exhaustion.
No, there are two different process limits: one for non-slow deliveries,
which protects against excessive network concurrency, and another for
slow deliveries, which protects against memory exhaustion. We can set the
blocked process limit to zero for backwards-compatible process ceilings
if you like. That restores legacy behaviour.
Post by Wietse Venema
If you want to run more processes increase the process limit. Then,
Postfix's resource budget can be validated with smtp-sink/source
like tools.
I want to run more processes (up to a limit) when deliveries are
slow than when they are not. Just increasing the process limit,
while effective for the slow case, risks too much contention for
the fast case.
Post by Wietse Venema
Recall that the automatic response to smtpd overload was NOT running
more smtpd processes. Instead, the solution was allowing them to
run with shorter timeouts. That approach respects the "fixed budget"
requirement.
Please consider a "stress" like equivalent for delivery agents.
Is a shorter timeout wise here? With remote connections we don't
have any control over the connection rate, so we set lower idle
timeouts under stress.

With delivery agents, we have full control of the concurrency, so
there is no need to drop timeouts and we're never idle deliberately,
only blocked on DNS lookups and connection attempts, ...

So raising process concurrency to a higher ceiling is quite ok.
There is little risk of memory issues, each additional smtp(8)
consumes a trivial amount of RAM. The text pages are shared,
and the data footprint of smtp(8) is low.

Memory pressure based on the smtp(8) delivery agent process count is
not an issue anymore. Each process uses O(100KB) of RAM. We can
run 1,000 delivery agents in 100MB of RAM, which is rather tiny
these days, and the default process limit is still 100, mostly
because 100 parallel outbound streams is reasonable from a network
viewpoint on bandwidth-constrained links, which are still common.

If you feel that under stress smtp(8) connection timeouts should
be low, perhaps that's reasonable, but we have no control over DNS
timeouts, and short HELO timeouts may be unwise; poorly connected sites
may never get their mail from a loaded MTA.

In any case, we are at least talking about solving the right problem,
managing concurrency and latency of actual deliveries as they happen.
--
Viktor.
Wietse Venema
2013-05-13 13:10:13 UTC
Permalink
Post by Viktor Dukhovni
Post by Wietse Venema
Post by Viktor Dukhovni
The reasonable response to latency spikes is creating concurrency
spikes.
By design, Postfix MUST be able to run in a fixed resource budget.
Your on-demand concurrency spikes break this principle and will
result in unexpected resource exhaustion.
No, there are two different process limits one for non-slow deliveries,
No. It is a mistake to have an overload resource budget that is
different for different kinds of overload. This is fundamental to
the design of Postfix. Resources are not just memory but also file
handles, protocol blocks, process slots, and so on.

The overload resource budget must be easy to validate with tools
like smtp-source/sink and the like: just max out the number of
connections and verify that things don't come crashing down (I have
to admit that postscreen(8) complicates this a little; one may have
to disable its cache temporarily to perform full validation).

If the overload resource budget depends on overload context, then
it becomes non-trivial to validate, and one introduces an unexpected
failure mode into Postfix.

For example, I don't need an MTA that works under all conditions
as long as the network is up, but that comes crashing down when the
network hiccups for a minute, just because the MTA decides to run
6x the number of SMTP client processes (your example).

Instead of introducing a context-dependent overload resource budget,
I have a proposal that addresses the real problem (slow or
non-responding DNS and SMTP servers) and that requires no changes
to qmgr(8) or master(8), and minor changes to smtp(8).

If we want to address the real problem: slow or non-responding DNS
and SMTP servers, then we should not waste an entire SMTP client
process blocking on DNS lookup and TCP connection handshake in the
first place. Instead it is more efficient to interpose a prescreen(8)
process between the qmgr(8) and smtp(8) processes. This process
can look up DNS, create the initial TCP connection, peek() at the
remote server greeting, and keep the bogons away without wasting
any smtp(8) processes. Just like postscreen(8) can keep bogus SMTP
clients away without wasting smtpd(8) processes.

In the mean time, "stress"-like configuration can be a short-term
solution to temporarily tighten timing and other limits, and to
relax those limits when conditions return to normal (Patrik's
scenario did not concern persistent overload). And yes, it would
be a good idea to have an option to control the amount of time that
smtpd(8) and other Postfix components spend on DNS lookups.

Wietse
Viktor Dukhovni
2013-05-13 14:18:10 UTC
Permalink
Post by Wietse Venema
Post by Viktor Dukhovni
No, there are two different process limits one for non-slow deliveries,
No. It is a mistake to have an overload resource budget that is
different for different kinds of overload. This is fundamental to
the design of Postfix. Resources are not just memory but also file
handles, protocol blocks, process slots, and so on.
This is right in principle, not so much in practice. Were Postfix
delivery agent concurrency tuned to the limit of local system
resources, indeed one should be careful about overload, but this
too is easy to test: just raise the process limit to the combined
ceiling before testing.

In practice the smtp(8) process limit is far below the system
resource limit; the reason I don't configure 10,000 delivery agents
is not lack of RAM or kernel resources. My $300 Asus laptop has
4GB of RAM.

Typically it is unwise to run even 1,000 parallel deliveries,
because the network delays would be unfortunate. However, 1,000
parallel blocked delivery agents are not unreasonable, and I can
test at that load level if I am worried about resource limits.
Post by Wietse Venema
The overload resource budget must be easy to validate with tools
like smtp-source/sink and the like: just max out the number of
connections and verify that things don't come crashing down (I have
to admit that postscreen(8) complicates this a little; one may have
to disable its cache temporarily to perform full validation).
Or just by knowing that 1,000 processes is an easy fit.
Post by Wietse Venema
Instead of introducing a context-dependent overload resource budget,
I have a proposal that addresses the real problem (slow or
non-responding DNS and SMTP servers) and that requires no changes
to qmgr(8) or master(8), and minor changes to smtp(8).
If we want to address the real problem: slow or non-responding DNS
and SMTP servers, then we should not waste an entire SMTP client
process blocking on DNS lookup and TCP connection handshake in the
first place. Instead it is more efficient to interpose a prescreen(8)
process between the qmgr(8) and smtp(8) processes. This process
can look up DNS, create the initial TCP connection, peek() at the
remote server greeting, and keep the bogons away without wasting
any smtp(8) processes. Just like postscreen(8) can keep bogus SMTP
clients away without wasting smtpd(8) processes.
Sadly the smtp(8) delivery agent makes multiple connections, supports
fallback destinations, has SASL and TLS dependent connection cache
re-use barriers, ... The high latency can happen on a second
connection after a fast 4XX with the first MX host, ... A prescreen
would be very difficult to implement.

The kernel resources of prescreen would still need to be commensurate
(socket control blocks, ...) with the various smtp(8) processes I
proposed.

Stress-dependent timers could be more realistic if we can get DNS
under control; that may need a new client library (ldns or similar). I
am wary of aggressively low client timeouts: we could end up treading
water by timing out over and over when waiting a bit longer would
get the mail through.

Finally, the original proposal of parallel transports doubles or more
the process concurrency (Patrik would probably tune the slow path
with a high process limit). The same objections apply even more strongly
there, since we may send fast mail down the slow path and stress the system
even more.

All I'm doing is allocating slow path processes on the fly, by
doing it when delivery is actually slow. Think of this as 2 master.cf
entries in one. You don't object to users adding master.cf entries,
so there's little reason to object to implicit ones.
--
Viktor.
Patrik Rak
2013-05-13 19:51:41 UTC
Permalink
Post by Wietse Venema
Instead of introducing a context-dependent overload resource budget,
I have a proposal that addresses the real problem (slow or
non-responding DNS and SMTP servers) and that requires no changes
to qmgr(8) or master(8), and minor changes to smtp(8).
If we want to address the real problem: slow or non-responding DNS
and SMTP servers, then we should not waste an entire SMTP client
process blocking on DNS lookup and TCP connection handshake in the
first place. Instead it is more efficient to interpose a prescreen(8)
process between the qmgr(8) and smtp(8) processes. This process
can look up DNS, create the initial TCP connection, peek() at the
remote server greeting, and keep the bogons away without wasting
any smtp(8) processes. Just like postscreen(8) can keep bogus SMTP
clients away without wasting smtpd(8) processes.
Hmm, this is definitely an interesting idea. Looks pretty cool. But
there are a few problems as I see it:

First, it seems to work when the servers you talk to are slow; you can
talk to thousands of them at the same time easily. But what happens if
you initiate thousands of connections and then find out that the
servers do not respond slowly, but are actually pretty fast? You still
have only a hundred or so smtp clients to deal with that. And the
message delivery can take considerable time. This spells trouble to me...

Second, it doesn't prevent the "blocked delivery agents" problem as
such. It merely shifts the point at which it happens - in this regard it
achieves the same thing as if the transport limit were increased, only
using far fewer resources. Even if such a prescreen were able to handle a
few thousand connections at a time, the deferred queue can easily grow
much larger than that.

Therefore I believe that trying to classify the mail as "hopefully fast"
and "perhaps slow" and capping the ratio of the resources we allocate to
the slow one is the best we can do, for both the active queue as well as
the delivery agent slots.

Patrik
Wietse Venema
2013-05-13 20:28:33 UTC
Permalink
Post by Patrik Rak
Post by Wietse Venema
If we want to address the real problem: slow or non-responding DNS
and SMTP servers, then we should not waste an entire SMTP client
process blocking on DNS lookup and TCP connection handshake in the
first place. Instead it is more efficient to interpose a prescreen(8)
process between the qmgr(8) and smtp(8) processes. This process
can look up DNS, create the initial TCP connection, peek() at the
remote server greeting, and keep the bogons away without wasting
any smtp(8) processes. Just like postscreen(8) can keep bogus SMTP
clients away without wasting smtpd(8) processes.
Hmm, this is definitely an interesting idea. Looks pretty cool. But
First, it seems to work when the servers you talk to are slow, you can
talk to thousands of them at the same time easily. But what happens if
you initiate thousands of connections but then you find out that the
servers do not respond slowly, but are actually pretty fast? You still
I am not that stupid. Just like postscreen(8) handles a LIMITED
number of connections at any point in time, so would prescreen(8)
handle only a limited number of delivery requests at any point in
time, giving back pressure to qmgr(8).

The main benefit is that prescreen(8) parallelizes the waiting
for DNS lookup, TCP handshake, and server greeting.

Wietse
Patrik Rak
2013-05-13 20:50:14 UTC
Permalink
Post by Wietse Venema
Post by Patrik Rak
Post by Wietse Venema
If we want to address the real problem: slow or non-responding DNS
and SMTP servers, then we should not waste an entire SMTP client
process blocking on DNS lookup and TCP connection handshake in the
first place. Instead it is more efficient to interpose a prescreen(8)
process between the qmgr(8) and smtp(8) processes. This process
can look up DNS, create the initial TCP connection, peek() at the
remote server greeting, and keep the bogons away without wasting
any smtp(8) processes. Just like postscreen(8) can keep bogus SMTP
clients away without wasting smtpd(8) processes.
Hmm, this is definitely an interesting idea. Looks pretty cool. But
First, it seems to work when the servers you talk to are slow, you can
talk to thousands of them at the same time easily. But what happens if
you initiate thousands of connections but then you find out that the
servers do not respond slowly, but are actually pretty fast? You still
I am not that stupid. Just like postscreen(8) handles up a LIMITED
number of connections at any point in time, so would prescreen(8)
handle only a limited number of delivery requests at any point in
time, giving back pressure to qmgr(8).
The main benefit is that presecreen(8) parallizes the waiting
for DNS lookup, TCP handshake, and server greeting.
That's exactly what I had in mind, too - the thousands I mentioned were the (upper) limit of what prescreen could IMO handle. It may be lower, but there will definitely be some limited number.

My assumption was that it would be bigger than the number of delivery agents available (if it were about the same, there would be little point). But then I can't see how you can prevent opening too many connections against fast servers, unless you risk opening too few connections against the slow servers at the same time…

Patrik
Wietse Venema
2013-05-13 21:18:05 UTC
Permalink
Post by Patrik Rak
Post by Wietse Venema
I am not that stupid. Just like postscreen(8) handles up a LIMITED
number of connections at any point in time, so would prescreen(8)
handle only a limited number of delivery requests at any point in
time, giving back pressure to qmgr(8).
The main benefit is that presecreen(8) parallizes the waiting
for DNS lookup, TCP handshake, and server greeting.
That's exactly what I had in mind, too - the thousands I mentioned
was the (upper) limit of what prescreen could IMO handle. It may
be lower, but there will definitely some limited number.
My assumption was that it will be bigger than the amount of delivery
agents available (if it was about the same, it would make little
point). But then I can't see how you can prevent opening too many
connections against fast servers, unless you risk opening too
little connections against the slow servers at the same time?
The qmgr(8) concurrency scheduler limits the concurrency per nexthop.
That does not change when prescreen is inserted between qmgr(8) and
smtp(8) processes.

For each nexthop:

number of qmgr-prescreen connections + number of qmgr-smtp connections
<= concurrency limit

Wietse
Viktor Dukhovni
2013-05-14 03:58:00 UTC
Permalink
Post by Wietse Venema
The qmgr(8) concurrency scheduler limits the concurrency per nexthop.
That does not change when prescreen is inserted between qmgr(8) and
smtp(8) processes.
number of qmgr-prescreen connections + number of qmgr-smtp connections
<= concurrency limit
Patrik does raise a valid new concern about the prescreen design.
Suppose that all ~100 smtp(8) delivery agents are busy, and that
prescreen is willing to accept ~500 simultaneous qmgr(8) delivery
requests in the expectation that many of these DNS lookups and/or
initial connections will incur high latency.

Suppose that instead all 500 DNS lookups and initial connections
complete quickly, giving us 500 parallel connections to some set
of remote servers. In the mean time the 100 currently busy SMTP
deliveries are taking their time.

We now have a problem, since we have lots of connections we can't
immediately start using. And by the time we have capacity to use
them the remote servers may well drop the idle connections.

So while doing pre-emptive DNS lookups is quite safe, doing
pre-emptive connections is more risky. A similar issue exists
in principle with postscreen, in that more connections might be
all accepted for "pass OLD" than backend smtpd(8) processes to
serve them, leaving clients "stranded" for some time.

The impedance mismatch is less severe with postscreen since so
much mail is spam, and because clients are generally more willing
to wait out delays than servers.

So the prescreen design may not pan out. And my contention is that
in any case it is a bit pricey in terms of implementation cost relative
to benefit.

If we limit prescreen to initial DNS lookups, the cost to implement
gets much lower, and much of the initial latency is avoided for
dead sites with broken DNS, so that could be of some use, and we
don't tie up remote resources by prefetching DNS results. So a
DNS-only prescreen could be added; I'm still not sure it is worth it.
We'd need lots of data on how much of the latency of dead destinations
is DNS latency vs. connection timeout latency.
--
Viktor.
Wietse Venema
2013-05-14 12:47:50 UTC
Permalink
Post by Viktor Dukhovni
Post by Wietse Venema
The qmgr(8) concurrency scheduler limits the concurrency per nexthop.
That does not change when prescreen is inserted between qmgr(8) and
smtp(8) processes.
number of qmgr-prescreen connections + number of qmgr-smtp connections
<= concurrency limit
Patrick does raise a valid new concern about the prescreen design.
Yes, I had some time to ponder over this during the night.
Parallelization is easy enough for DNS, which does not expire
immediately, but the number of TCP connections in progress must be
proportional to (but not necessarily equal to) the number of idle
delivery agents.

Suppose that we set aside a pool of 10 of the 100 delivery agents.
Then prescreen can safely have 10 connection requests in progress
(connection request here = DNS lookup, TCP handshake and receiving
the initial server response).

It can be a little smarter than that. It can group SMTP/TCP connection
requests by nexthop, effectively creating a connection request queue.

If a nexthop is quick, then its connection requests clear the
connection request queue quickly, so this queue could be given
favorable scheduling preference. If a nexthop is slow, then this
really wants a back-channel to the queue manager to provide selective
scheduler or queue-reader feedback of some kind.

This is a rough design.

Wietse
Patrik Rak
2013-05-14 14:13:49 UTC
Permalink
Post by Wietse Venema
Yes, I had some time to ponder over this in during the night.
Parallelization is easy enough for DNS which does not expire
immediately, but the number of TCP connections in progress must be
proportional to (but not necessarily equal to) the number of idle
delivery agents.
Suppose that we set aside a pool of 10 of the 100 delivery agents.
Then prescreen can safely have 10 connection requests in progress
(connection request here = DNS lookup, TCP handshake and receiving
the initial server response).
Hmm, I am afraid that such a small window makes nearly no difference as
far as the delivery agent bottleneck is concerned.
Post by Wietse Venema
It can be a little smarter than that. It can group SMTP/TCP connection
requests by nexthop, effectively creating a connection request queue.
If a nexthop is quick, then its connection requests clear the
connection request queue quickly, so this queue could be given
favorable scheduling preference. If a nexthop is slow, then this
really wants a back-channel to the queue manager to provide selective
scheduler or queue-reader feedback of some kind.
Isn't that just duplicating the per-destination concurrency window /
dead site detection mechanism which is already in place?

Patrik
Wietse Venema
2013-05-14 14:37:07 UTC
Permalink
Patrik Rak:
Post by Patrik Rak
Post by Wietse Venema
Yes, I had some time to ponder over this in during the night.
Parallelization is easy enough for DNS which does not expire
immediately, but the number of TCP connections in progress must be
proportional to (but not necessarily equal to) the number of idle
delivery agents.
Suppose that we set aside a pool of 10 of the 100 delivery agents.
Then prescreen can safely have 10 connection requests in progress
(connection request here = DNS lookup, TCP handshake and receiving
the initial server response).
Hmm, I am afraid that such a small window makes nearly no difference as
far as the delivery agent bottleneck is concerned.
It should be OK for fast connections.
Post by Patrik Rak
Post by Wietse Venema
It can be a little smarter than that. It can group SMTP/TCP connection
requests by nexthop, effectively creating a connection request queue.
If a nexthop is quick, then its connection requests clear the
connection request queue quickly, so this queue could be given
favorable scheduling preference. If a nexthop is slow, then this
really wants a back-channel to the queue manager to provide selective
scheduler or queue-reader feedback of some kind.
Isn't that just duplicating the per-destination concurrency window /
dead site detection mechanism which is already in place?
Unlike qmgr, prescreen knows that a destination is slow, so it
can prioritize connection requests.

Wietse

Viktor Dukhovni
2013-05-14 17:34:38 UTC
Permalink
Post by Wietse Venema
Post by Patrik Rak
Hmm, I am afraid that such a small window makes nearly no difference as
far as the delivery agent bottleneck is concerned.
It should be OK for fast connections.
Post by Patrik Rak
Post by Wietse Venema
It can be a little smarter than that. It can group SMTP/TCP connection
requests by nexthop, effectively creating a connection request queue.
If a nexthop is quick, then its connection requests clear the
connection request queue quickly, so this queue could be given
favorable scheduling preference. If a nexthop is slow, then this
really wants a back-channel to the queue manager to provide selective
scheduler or queue-reader feedback of some kind.
Isn't that just duplicating the per-destination concurrency window /
dead site detection mechanism which is already in place?
Unlike qmgr, prescreen knows that a destination is slow, so it
can prioritize connection requests.
How does it know this? Remembering which destinations took a long
time to connect (typically timed out) in the past?

What is the concurrency for trying to make such connections? It
should be higher, but there is some risk that the heuristic data
is stale, and connections are made too fast, which leads to slow
mail building up in the active queue.

What is the resulting output rate for the slow path? Does prescreen
help this exceed the input rate from the deferred queue?
--
Viktor.
Wietse Venema
2013-05-14 17:47:26 UTC
Permalink
Post by Viktor Dukhovni
Post by Wietse Venema
Post by Patrik Rak
Hmm, I am afraid that such a small window makes nearly no difference as
far as the delivery agent bottleneck is concerned.
It should be OK for fast connections.
Post by Patrik Rak
Post by Wietse Venema
It can be a little smarter than that. It can group SMTP/TCP connection
requests by nexthop, effectively creating a connection request queue.
If a nexthop is quick, then its connection requests clear the
connection request queue quickly, so this queue could be given
favorable scheduling preference. If a nexthop is slow, then this
really wants a back-channel to the queue manager to provide selective
scheduler or queue-reader feedback of some kind.
Isn't that just duplicating the per-destination concurrency window /
dead site detection mechanism which is already in place?
Unlike qmgr, prescreen knows that a destination is slow, so it
can prioritize connection requests.
How does it know this? Remembering which destinations took a long
time to connect (typically timed out) in the past?
What is the concurrency for trying to make such connections? It
should be higher, but there is some risk that the heuristic data
is stale, and connections are made too fast. Which leads to a slow
mail building up in the active queue.
What is the resulting output rate for the slow path? Does prescreen
help this exceed the input rate from the deferred queue?
Among the text in the complete message was mention of feedback from
prescreen based on per-nexthop DNS/TCP/SMTPgreet latency, back to
the (concurrency) scheduler and queue reader.

The scheduler would prevent slow mail from overtaking all delivery
agents, and the queue reader would prevent slow mail from overtaking
the active queue.

Basically, allow slow mail to use up most, but not all, deliveries.

Wietse
Patrik Rak
2013-05-13 15:25:44 UTC
Permalink
Uh oh, lots of mails to respond to. I'll start with this one and get to
the others later.
Post by Wietse Venema
Post by Viktor Dukhovni
The reasonable response to latency spikes is creating concurrency
spikes.
By design, Postfix MUST be able to run in a fixed resource budget.
Your on-demand concurrency spikes break this principle and will
result in unexpected resource exhaustion.
I'd second Wietse on this one.

If you throw in more resources for everyone, the bad guys are gonna
claim it sooner or later. You have to make sure you give it only to the
good guys, which is the same as giving less to the bad guys in the first
place. No need to throw in yet more additional resources on demand.

And that's also why it is important to classify ahead of time, as once
you give something away, it's hard to take it back.

Patrik
Viktor Dukhovni
2013-05-14 04:55:45 UTC
Permalink
Post by Patrik Rak
Post by Wietse Venema
Post by Viktor Dukhovni
The reasonable response to latency spikes is creating concurrency
spikes.
By design, Postfix MUST be able to run in a fixed resource budget.
Your on-demand concurrency spikes break this principle and will
result in unexpected resource exhaustion.
I'd second Wietse on this one.
And yet you're missing the point.
Post by Patrik Rak
If you throw in more resources for everyone, the bad guys are gonna
claim it sooner or later. You have to make sure you give it only to
the good guys, which is the same as giving less to the bad guys in
the first place. No need to throw in yet more additional resources
on demand.
We don't know who the "good guys" are and who the "bad guys" are.

- A deferred message may simply be greylisted and may deserve
timely delivery on its 2nd or 3rd (if the second was a bit too
early) delivery attempt.

- A small burst of fresh messages may be a pile of poop destined
to dead domains, and may immediately clog the queue for 30-300
seconds.
Post by Patrik Rak
And that's also why it is important to classify ahead of time, as
once you give something away, it's hard to take it back.
There is no "giving away" to maintain throughput, high latency
tasks warrant higher concurrency, such concurrency is cheap since
the delivery agents spend most of their time just sitting there
waiting.

By *moving* the process count from the fast column to the slow
column in real time (based on actual delivery latency, not some
heuristic prediction), we free up precious slots for fast deliveries,
which are fewer in number. Nothing I'm proposing creates less
opportunity for delivery of new mail; rather, I'm proposing dynamic
(up to a limit) higher concurrency that soaks up a bounded amount
of high-latency traffic (ideally all of it most of the time).

To better understand the factors that impact the design we need
to distinguish between burst pressure and steady-state pressure.

When a burst of bad new mail arrives, your proposal takes it through
the fast path which gets congested "once" (by each message anyway,
but if the burst is large enough, the effect can last quite some
time). If the mail is simply slow to deliver, but actually leaves
the queue, that's all. Otherwise the burst gets deferred, and now
gets the slow path, which does not further congest delivery of new
mail, but presumably makes multiple trips through the deferred queue,
causing congestion there each time, amplified if you allocate fewer
processes to the slow than the fast path (I would strongly discourage
that idea).

In any case the fast/slow path fails to completely deal with bursts.

So let's consider steady state. Suppose bad mail trickles in as a
fraction "0 < b < 1" of the total new mail stream, at a rate that
does not by itself congest the fast-path processes with new mail
alone. What happens after that?

Well, in steady state, each initially deferred message (which we
assume, as the worst case, continues to tempfail until it expires) gets
retried N times, where N grows with the maximum queue lifetime and
shrinks with the maximal backoff time (details later). Therefore,
the rate at which bad messages enter the active queue from the
deferred queue is approximately N * b * new_mail_input_rate.

When is that a problem? When N * b >> 1, because now a small
trickle of bad new mail becomes a steady stream of retried bad
mail whose volume is "N * b" times higher. So what can we do to
reduce the impact?
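
As a quick sanity check (the 2% bad-mail fraction is an invented
figure, purely for illustration; the N of ~108 is the value computed
from the default timers further below):

$ echo "1k 108 50 / p" | dc
2.1

so a mere 2% trickle of bad new mail turns into retried bad mail
arriving at roughly twice the rate of the entire new mail stream.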

I am proposing raising concurrency for just the bad mail, without
subtracting concurrency for the good mail, thereby avoiding collateral
damage to innocent bystanders (greylisted mail for example). This
also deals with the initial burst (provided the higher concurrency
for slow mail is high enough to absorb the most common bursts and
low enough to not run out of RAM or kernel resources). This does
no harm! It can only help.

You're proposing a separate transport for previously deferred mail;
this can help, but it can also hurt if the concurrency for the slow path
is lower than for the fast path. Otherwise it is just a variant of
my proposal, in which we guess who's good and who's bad in advance,
and avoid spillover from the bad processes into the good when the
bad limit is reached. In both cases total output concurrency should
rise. Each performs better in some cases and worse in others.

The two are composable, we could have a dedicated transport for
previously deferred mail with a separate process limit for slow
vs. fast mail if we really wanted to get fancy. We could even
throw in Wietse's prescreen for DNS prefetching, making a further
dent in latency. All three would be a lot of work of course.

So what have we not looked at yet? We've not considered trying to
reduce "N * b", which amounts to reducing "N" since "b" is outside
our control to some degree (though if you can accept less junk,
that's by far the best place to solve the problem, e.g. by validating
the destination domain interactively while the user is submitting
the form).

So what controls "N"? With exponential backoff we rapidly reach
the maximum backoff time in a small number of retries, especially
because the this backoff time is actually a lower bound in the
spacing between deliveries, actual deliveries are typically spaced
wider and so the time grows faster than a simpler power of two.
Therefore, to good approximation we can assume that the retry
count for steady-state bad mail is queue_lifetime/maximal_backoff.

Let's plug-in the defaults:

$ echo "1k 86400 5 * 4000 / p" | dc
108.0

That's ~100 retries in 5 days. This concentrates bad mail when
the bad fraction is > 1% of the total. Suppose a site with unavoidable
garbage entering the queue, whose users are happier to find out
sooner rather than after perhaps a full 5 days that their mail did
not get to its recipient, adjusts the queue lifetime down
to 2 days (I did that at Morgan Stanley, where this worked well
for the user community, RFCs to the contrary notwithstanding).

Then we get:

$ echo "1k 86400 2 * 4000 / p" | dc
43.2

Now N drops to ~40, which could make the difference between deferred
mail concentrating initial latency spikes and diluting them (at the
2.5% bad mail mark). What else can we do? Clearly, raise the maximal
backoff time. How does that help? Consider raising the maximal
backoff time from 4000s to 14400s (4 hours). Now we get:

$ echo "1k 86400 2 * 14400 / p" | dc
12.0

Now N is ~12, and we've won almost a factor of 10 from the default
settings. Unless the bad mail is ~8% of the total input there is
no concentration and we don't need to discriminate against deferred
mail.

Is it reasonable to push the max backoff time this high? I think
so. By the time we have tried:

5m (default first retry)
10m
20m
40m
80m (20% higher than the current ceiling of 4000s)

the message has been in the queue for 155 minutes (about 2.6 hours)
and has been tried 6 times. The next retry would normally be about
66 minutes later, but I'd delay it to 160 minutes, so such a message
would leave (if that is its fate, however unlikely) after 6 hours instead
of 5. Is that sufficiently better? Otherwise, with the message already
6 hours late, do we have to try every hour or so? Or is every 4 hours
enough? I think it is.

So the simplest improvement we can make is just to tune the backoff
and queue lifetime timers. If we then add process slots for blocked
messages (another factor of 5 in many cases), we are looking at raw
sewage (40% bad) entering the queue before the deferred queue looks
any different from fresh mail.

Since we've managed 12 years with few complaints about this issue,
I think that the timer adjustment is the easiest first step. Users
can tune their timers today with no new code.
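
For reference, that tuning maps onto existing main.cf parameters; the
values below are just the illustrative ones from the calculation
above, not general recommendations:

maximal_backoff_time = 14400s
maximal_queue_lifetime = 2d

with everything else left at its defaults.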
--
Viktor.
Wietse Venema
2013-05-14 12:24:16 UTC
Permalink
Nothing I'm proposing creates less opportunity for delivery of new
mail, rather I'm proposing dynamic (up to a limit) higher concurrency
that soaks up a bounded amount of high latency traffic (ideally
all of it most of the time).
This is no better than having a static process limit at that larger
maximum. Your on-demand additional process slots cannot prevent
slow mail from using up all delivery agents.

To prevent slow mail from using up all delivery agents, one needs to
limit the amount of slow mail in the active queue. Once a message
is in the active queue the queue manager has no choice. It has to
be delivered ASAP.

How do we limit the amount of slow mail in the active queue? That
requires prediction. We seem to agree that once mail has been
deferred a few times, it is likely to be deferred again. We have one
other predictor: the built-in dead-site list. That's it as far as
I know.

As for after-the-fact detection, it does not help if a process
informs the master dynamically that it is blocked. That is too
late to prevent slow mail from using up all delivery agents,
regardless of whether the process limit is dynamically increased
up to some maximum, or whether it is frozen at that same inflated
maximum.

[detailed analysis]

Thanks. This underscores that longer maximal_backoff_time can be
beneficial, by reducing the number of times that a delayed message
visits the active queue. This reflects a simple heuristic: once
mail has been deferred a few times, it is likely to be deferred
again.

Wietse
Viktor Dukhovni
2013-05-14 14:14:03 UTC
Permalink
Post by Wietse Venema
Nothing I'm proposing creates less opportunity for delivery of new
mail, rather I'm proposing dynamic (up to a limit) higher concurrency
that soaks up a bounded amount of high latency traffic (ideally
all of it most of the time).
This is no better than having a static process limit at that larger
maximum. Your on-demand additional process slots cannot prevent
slow mail from using up all delivery agents.
The difference is that the static larger maximum does not prevent
a thundering herd of fast deliveries using the high limit to thrash
the network link and process scheduler.
Post by Wietse Venema
To prevent slow mail from using up all delivery agents, one needs
limit the amount of slow mail in the active queue. Once a message
is in the active queue the queue manager has no choice. It has to
be delivered ASAP.
My goal was not preventing congestion under all conditions; this
is simply not possible. Once some heuristically identified mail
is substantially delayed, we've lost already, since the proposed
heuristics are rather crude.

I am proposing a means of having sustainably higher process limits,
without thrashing. The higher process limits substantially reduce
steady-state congestion frequency. As you said, we don't need
perfection. Simply higher limits are a bit problematic when the
slow path is in fact full of fast mail.
Post by Wietse Venema
How do we limit the amount of slow mail in the active queue?
I would prefer to process it at higher concurrency, to the extent
possible, maintaining reasonable throughput even for the plausibly
slow mail, unless our predictors become much more precise.
Post by Wietse Venema
That
requires prediction. We seem to agree that once mail has been
deferred a few times, it is likey to be deferred again. We have one
other predictor: the built-in dead-site list. That's it as far as
I know.
Provided the reason is an unreachable destination, and not a deferred
transport, a certificate expiration, or any fast repeated deferral
via local policy, ...
Post by Wietse Venema
As for after-the-fact detection, it does not help if a process
informs the master dynamically that it is blocked. That is too
late to prevent slow mail from using up all delivery agents,
regardless of whether the process limit is dynamically increased
up to some maximum, or whether it is frozen at that same inflated
maximum.
The above is a misreading of intent. It does help: it enables safe
support for higher concurrency levels, which modern hardware and
O/S combinations can easily handle.
Post by Wietse Venema
[detailed analysis]
Thanks. This underscores that longer maximal_backoff_time can be
beneficial, by reducing the number of times that a delayed message
visits the active queue. This reflects a simple heuristic: once
mail has been deferred a few times, it is likey to be deferred
again.
That, plus, for many sites, a not too aggressively reduced queue
lifetime. Often an email delayed for more than 1 or 2 days is
effectively too late; with a bounce the sender can resend to a
better address or try another means to reach the recipient. I
found 2 days rather than 5 to be largely beneficial, with no
complaints of lost mail because some site was down for ~3-4 days.
--
Viktor.
Patrik Rak
2013-05-14 14:18:03 UTC
Permalink
Post by Wietse Venema
Nothing I'm proposing creates less opportunity for delivery of new
mail, rather I'm proposing dynamic (up to a limit) higher concurrency
that soaks up a bounded amount of high latency traffic (ideally
all of it most of the time).
This is no better than having a static process limit at that larger
maximum. Your on-demand additional process slots cannot prevent
slow mail from using up all delivery agents.
To prevent slow mail from using up all delivery agents, one needs
limit the amount of slow mail in the active queue. Once a message
is in the active queue the queue manager has no choice. It has to
be delivered ASAP.
How do we limit the amount of slow mail in the active queue? That
requires prediction. We seem to agree that once mail has been
deferred a few times, it is likey to be deferred again. We have one
other predictor: the built-in dead-site list. That's it as far as
I know.
As for after-the-fact detection, it does not help if a process
informs the master dynamically that it is blocked. That is too
late to prevent slow mail from using up all delivery agents,
regardless of whether the process limit is dynamically increased
up to some maximum, or whether it is frozen at that same inflated
maximum.
This is exactly as I see it, too. Couldn't have put it better.

Thanks,

Patrik
Patrik Rak
2013-05-14 14:00:27 UTC
Permalink
Post by Viktor Dukhovni
Post by Patrik Rak
If you throw in more resources for everyone, the bad guys are gonna
claim it sooner or later. You have to make sure you give it only to
the good guys, which is the same as giving less to the bad guys in
the first place. No need to throw in yet more additional resources
on demand.
We don't know who the "good guys" are and who the "bad guys" are.
Exactly. That's the problem. And all we can do is either try to detect
them, or merely guess.

You try to detect them by initiating the request and measuring how long
it takes. No matter what exactly the test is, that means allocating some
resources for each such test (giving away). However, unless you are
willing to tear down the connection if it doesn't complete on time
(taking away), the resource is wasted until the test completes. And the
bad guys will eventually take it all. Therefore, you have merely moved
the bottleneck elsewhere - instead of competing for the delivery agents,
they compete for the bad/good guy test resources.
The fact that they are the same resource in what you describe makes no
difference.

Therefore, I say the guess is the best we have got. All we have to
focus on is making sure it doesn't backfire if we misclassify someone.
And what I proposed (solutions 1 and 3), and now repeat below, does not,
AFAICT.
Post by Viktor Dukhovni
Post by Patrik Rak
And that's also why it is important to classify ahead of time, as
once you give something away, it's hard to take it back.
There is no "giving away" to maintain throughput, high latency
tasks warrant higher concurrency, such concurrency is cheap since
the delivery agents spend most of their time just sitting there
waiting.
You say it's cheap. I believe it is in your environment, as well as in
mine and in many others. However, I believe that no one is willing
to pay for more RAM every month for their cloud servers just so they
can dedicate it to dealing with mail that never gets delivered. That
seems like a total waste of money, especially if there is another
solution which works with the fixed resources at hand.

The problem with your approach is that whatever the bad guys want, you
give it to them. That's meek. They want more, you give it to them. They
take it and ask for even more. You give it to them again. Until the
point when you can't give them more, or until the point when they are
finally happy with what they have. That's no better than setting the
transport limit this high in the first place. If there is no demand, it
remains unused; if there is, it will be consumed exactly the same way.

In my approach, I instead tell the bad guys: "You want more? No way!
This is all you'll get, now shut up and move along as well as you can."
Post by Viktor Dukhovni
You're proposing a separate transport for previously deferred mail,
Not anymore. What I suggest is solutions 1 and 3 from my previous
mail, which both merely restrain how much of the available resources we
are willing to give to the bad guys in the worst case. Note that there
is no slow/fast path either, which would somehow affect either group. We
just make sure that those who we think are bad never get everything.
Post by Viktor Dukhovni
Each group should automatically adjust the ratio of used slots over
time according to the ratio of the corresponding delivery speeds
That's the key reason why it doesn't matter if we classify someone
incorrectly. How does it work?

We classify every mail into one of two groups. We can call them fast
and slow for simplicity, but in fact they are "hopefully fast" and
"presumably slow". To start with, these can be equal to new mail and
deferred mail, but they don't have to be, as Wietse pointed out before.

Now let's explore what share of the available resources each group gets.
When both groups contain some mail and the mail is delivered equally
fast, they get a 1:1 split. That seems fair. If the slow group becomes,
say, 4 times slower on average, they will get a 4:1 split over time. The
same holds if the fast group becomes 4 times slower; they will get a 1:4
split. So far, so good.

Now if one group becomes really slow, like 30 or 60 times slower than
the other one, it is effectively the case where it starts starving the
other one. If it is the slow group which becomes this slow, it gets a
60:1 split, which with ~100 delivery agents available is obviously not
enough to get new mail delivered fast enough. If we were willing to
increase the transport limit considerably, the 1/61 share would eventually
amount to enough delivery agents for fast mail delivery. However,
what I say is that it's enough if we simply do not allow the ratio to go
this high. We can fairly easily limit the amount of resources we give to
the bad guys to 80% or 90%, allowing them to get no more than a 4:1 or 9:1
split. That can leave quite enough for the fast group while not wasting
too much on the bad group. Seems like a good trade, especially when we
presume that most of the bad mail won't get delivered anyway (if it
were, it wouldn't likely be this slow and demand so many resources in
the first place).
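
To make the cap concrete (the ~100 delivery agents and the 9:1 cap are
just the example figures from above): even if the slow group would
otherwise claim a 60:1 share, capping it at 9:1 always leaves

$ echo "100 10 / p" | dc
10

delivery agents for the fast group, no matter how slow the slow group gets.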

Finally, what happens if the fast group becomes terribly slow instead,
and the slow group not? I'd conclude that that doesn't have to bother
us. It makes no sense to try to take resources away from new mail which
suddenly became slow, just so we can try more new mail which can be just
as slow. And taking resources away from slow new mail just so we can
retry some deferred mail seems equally pointless. So I would say it's
perfectly fine to wait until the situation gets back to normal, should
this ever happen.

Now, does this make it any clearer what I have in mind?

Patrik
Viktor Dukhovni
2013-05-14 14:50:11 UTC
Permalink
Post by Patrik Rak
You try to detect them, by initiating the request and measuring how
long it takes. No matter what exactly the test is, that means
allocating some resources for each such test (giving away). However,
unless you are willing to tear down the connection if it doesn't
complete on time (taking away), the resource is wasted until the
test completes. And the bad guys will eventually take it all.
Therefore, you have merely moved the bottleneck elsewhere - instead
of competing for the delivery agents, they compete for the bad/good
guy test resources.
The fact that they are the same resourcein what you describe makes
no difference.
Repeating the same non-fact without analytical backup does not make
it true. There is no infinite supply of "bad guys" to take it all.

There is a finite arrival rate of mail that takes a long time to
process (perhaps to time out, perhaps to be delivered); this arrival rate
creates a demand for equal or higher output throughput, otherwise
such mail *accumulates* (until the arrival rate drops or perhaps the
supply is exhausted).

Throughput = concurrency / latency. If we provide enough concurrency,
the throughput exceeds the arrival rate and there is no congestion.
Queueing systems are bimodal. When the demand is below the critical
level, the queue is empty. When it spikes above, the queue grows
(until demand levels off).
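
A concrete (invented) example: if slow mail arrives at 2 messages per
second and each attempt blocks for about 30 seconds, keeping up
requires on the order of

$ echo "2 30 * p" | dc
60

concurrent delivery slots - most of a default 100-process pool, but
only a small fraction of the several hundred blocked-process slots
proposed earlier.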

You must switch from gut reaction to detailed analysis or we are
wasting time.

I am assuming that the case that led to this thread is accumulation
of repeatedly timed-out mail in the deferred queue. With a
sufficiently narrow slow output channel, and enough such mail, at
least half of it will be in the active queue at any given time,
once the time it takes for a message at the back of the queue to
reach the front is equal to the maximum backoff time or higher.

You can attempt to quench supply by suppressing deferred queue
scans, but this hurts innocent bystanders. The crude deferred
== slow heuristic is not by itself a good solution.
Post by Patrik Rak
Therefore, I say the guess is the best what we have got.
We must increase concurrency for such mail, perhaps in a separate
transport, but then risk a "thundering herd" when the guess is wrong,
hence bi-modal process limits.
Post by Patrik Rak
Post by Viktor Dukhovni
There is no "giving away" to maintain throughput, high latency
tasks warrant higher concurrency, such concurrency is cheap since
the delivery agents spend most of their time just sitting there
waiting.
You say it's cheap. I believe it is in your environment, as well as
it is in mine and in many others. However, I believe that no one is
willing to pay for more RAM every month for their cloud servers,
just so they can dedicate it for dealing with mail that never gets
delivered. That seems like total waste of money. Especially if there
is another solution which works with fixed resources at hand.
Most people don't have this problem. The RAM required is rather
light. You have not thought carefully through the dynamics of the
queue or the impact on deferred queue processing, which is also
important; it is not all junk, or we would not bother deferring
mail at all.
Post by Patrik Rak
The problem with your approach is that whatever the bad guys want,
you give it to them.
Nonsense. There are no "bad guys", just an arrival rate of mail
that takes longer to process. If someone wants to DoS your MTA,
there are far easier ways than trying to starve SMTP output with slow mail.

Also, for most sites, I can't inject arbitrary recipients into their
queue.
Post by Patrik Rak
That's meek. They want more, you give it to them.
There is no "them". This is not an analysis.
Post by Patrik Rak
Not anymore. What I suggest is the solution 1 and 3 from my previous
mail, which both merely restrain how much of the available resources
we are willing to give to the bad guys in the worst case. Note that
there is no slow/fast way either, which would somehow affect either
group. We just make sure that those who we think are bad never get
everything.
But if this is an issue, the slow path *needs more* concurrency or
else the slow path is starved and it contains real mail to be
delivered.

Please come back with a model to which we can apply thought experiment
analyses (even simulations if we wanted to write code for that).
Without a model, there is no way to validate the design.
Post by Patrik Rak
Now if one group becomes really slow, like 30 or 60 times slower
than the other one, it's effectively the case when it starts
starving the other one.
This is too crude. The deferred queue emits mail at a fixed rate,
proportional to its population and dependent on the maximal backoff
time, the messages in the deferred queue that are just sitting
there do no harm. If this rate is below the throughput for slow
messages, we have no congestion problem.

We can:

- Raise the throughput (expanded concurrency).

- Reduce the rate (increase the maximal backoff time,
reduce the queue lifetime).
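As a rough illustration of that balance, the following sketch (Python,
illustrative numbers, and the simplifying assumption that each deferred
message retries about once per maximal backoff time) compares the
deferred queue's emission rate with the output capacity:

    def deferred_emission_rate(population, maximal_backoff_s):
        # retries per second generated by the deferred queue
        return population / maximal_backoff_s

    def output_capacity(concurrency, latency_s):
        # attempts per second the transport can absorb
        return concurrency / latency_s

    rate = deferred_emission_rate(10000, 4000.0)   # 2.5 retries/s
    cap  = output_capacity(20, 30.0)               # ~0.67 attempts/s
    print("congested" if rate > cap else "ok")

Raising the concurrency raises the capacity; raising the maximal
backoff time lowers the rate; either side of the inequality can be
moved.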

If you want to guarantee a reserved pool of processes for new mail,
we can do that too (most easily with a second instance and smtp_fallback_relay).
Post by Patrik Rak
If it is the slow group which becomes this
slow, it gets 60:1 split, which with ~100 delivery agents available
is obviously not enough to get new mail delivered fast enough.
Only if the throughput is not high enough to exceed the arrival rate.
A healthy MTA has a small deferred queue, and most deferred queue
scans finish quickly, thus at any given moment the deferred queue
is not a source of new mail (every 5 minutes a new scan injects
a small amount of previously deferred mail).

The flow from the deferred queue is a trickle, not a stream. It does
not take much additional concurrency to provide an output rate equal
to the deferred queue input rate.
Post by Patrik Rak
Finally, what happens if the fast group becomes terribly slow
instead, and the slow group not? I'd conclude that that doesn't have
to bother us. It makes no sense to try to take resources away from
new mail which suddenly became slow, just so we can try more new
mail which can be just as slow. And taking resources away from slow
new mail just so we can retry some deferred mail seems equally
pointless. So I would say it's perfectly fine to wait until the
situation gets back to normal, shall this ever happen.
There is no taking resources away. I run slow deliveries at high
concurrency (since the network can afford this), and fast deliveries
at lower concurrency (since that's enough, and we don't want thrashing).
Post by Patrik Rak
Now, does this make it any clearer what I have in mind?
No, because all this was already clear. You need a more complete
model of the queue dynamics.

Yes, separate output pools can help. But once the deferred queue
is creating congestion, starving it is not a good idea. Provide
more concurrency in a separate pool if you want some sort of
guaranteed capacity for new mail (which may itself include bursts
of slow messages that clog the queue).

Don't starve the deferred mail, it needs to be processed too.
Grey-listed mail should not be delayed for hours or days.
Reduce the deferred queue output rate, by tuning parameters.
Implement DNS prefetch.

All or some of these are reasonable; just starving the deferred
queue with reduced concurrency is not.
--
Viktor.
Patrik Rak
2013-05-15 10:22:41 UTC
Permalink
Post by Viktor Dukhovni
Post by Patrik Rak
The problem with your approach is that whatever the bad guys want,
you give it to them.
Nonsense. There are no "bad guys", just an arrival rate of mail
that takes longer to process. If someone wants to DoS your MTA,
it is far easier than trying to starve SMTP output with slow mail.
Also for most sites, I can't inject arbitary recipients into their
queue.
Post by Patrik Rak
That's meek. They want more, you give it to them.
There is no "them". This is not an analysis.
The "bad guys" was a metaphor for slow mail which consumes available
resources.

Without any offense, maybe you should reread everything that was
already written and give it more thought. Then you might realize why your
after-the-fact-testing solution is flawed, and why your
boost-the-concurrency solution works but is a needless waste. Wietse
explained the former pretty clearly, IMHO, and I tried my best about the
latter.

Patrik
Viktor Dukhovni
2013-05-15 15:19:46 UTC
Permalink
Post by Patrik Rak
Post by Viktor Dukhovni
There is no "them". This is not an analysis.
The "bad guys" was a metaphor for slow mail which consumes available
resources.
The metaphor is flawed, you need a model that considers message
rates entering the active queue from incoming and deferred. A
queue whose output rate exceeds its input rate is nearly empty.

If the blocked process limit is 500, and minimal backoff time is
the default 300s and blocked messages are done in <= 300s, you'd
need to bring in 600 or more messages from the deferred queue in
a single scan. If the maximum backoff time is raised from the
default 4,000 to just 19,200 (5.33 hours) then the odds of any
single deferred message being eligible during a given 5 minutes
are 1:64. Thus you'd need O(38,400) of the bad messages in the
deferred queue to see starvation of new mail. If the queue lifetime
is 5 days, that's ~8,000 arriving each day, if it is 2 days, that's
20,000 each day.
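A small sketch that just reproduces the arithmetic above; the inputs
are the figures quoted in the text, not a recommendation:

    scan_interval_s = 300      # deferred queue scan roughly every 5 minutes
    max_backoff_s   = 19200    # raised from the default 4000s
    needed_per_scan = 600      # messages/scan to keep a 500-process limit saturated

    eligibility = scan_interval_s / max_backoff_s    # 1/64 chance per scan
    population  = needed_per_scan / eligibility      # ~38,400 bad messages
    # total, then daily arrivals for 5-day and 2-day queue lifetimes
    print(int(population), int(population / 5), int(population / 2))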

I'd say a site in that situation should focus on not admitting that
much junk. Otherwise, given a sensibly low load of bad addresses,
improved tuning, plus dynamic capacity to soak up concurrency demand
spikes, the scenario you're imagining just does not happen.

I faced this very problem in 2001, and:

- Switched to nqmgr since without "relay" it does a better job
of fairness between inbound and outbound mail, since it is FIFO
not round-robin by destination.

- In fact increased the maximal backoff time, and reduced the
minimum.

- Increased the smtp delivery agent process limit.

- Enabled recipient validation, which was the real solution.

Some of my observations back then were part of the motivation for
Postfix to evolve:

- The default process limit has been raised from 50 to 100

- The minimum backoff time is smaller by default.

- The default qmgr became "nqmgr" (thanks); FIFO is good, and
making it work with mega-recipient messages is a win. [ We
still don't have a way to deal with senders who flood the
queue with lots of single-recipient messages. There was a
thread about that recently. If you want to make the queue
manager even more flexible, you could support mechanisms to
group related messages into a logical scheduling unit. ]

- The relay transport is standard.

- Recipient validation is on by default for local recipients.
Post by Patrik Rak
Without any offense, maybe you should reread all what was already
written and put it all more thought. Then you might realize why your
after-the-fact-testing solution is flawed, and why your
boost-the-concurrency solution works but is a needless waste. Wietse
explained the former pretty clearly, IMHO, and I tried my best about
the latter.
And yet it moves.

It is a fallacy to claim that increasing the output capacity of a
queue does not reduce congestion. Any queue is congested under
high enough load; there might simply be too much mail, slow or fast.

Your magical 60:1 model ignores the "supply" of deferred mail; it
is generally equal to the supply of new mail and should be tuned
to generate retries at a rate below the output capacity. I'm
providing that tuning.

I explained a safe way to dynamically raise concurrency when the
load creates pressure via latency and we can afford more "threads",
because they are mostly all sleeping, provided the machine is not
tuned with exceedingly tight RAM limits. An extra smtp(8) process
is quite cheap. Just a bit of space for the stack and a small
heap, plus process structure. Look at the process map and see how
much of the address space is private.

This is a quantitative issue, do the numbers.
--
Viktor.
Wietse Venema
2013-05-15 15:44:17 UTC
Permalink
Post by Viktor Dukhovni
Post by Patrik Rak
Post by Viktor Dukhovni
There is no "them". This is not an analysis.
The "bad guys" was a metaphor for slow mail which consumes available
resources.
The metaphor is flawed, you need a model that considers message
rates entering the active queue from incoming and deferred. A
queue whose output rate exceeds its input rate is nearly empty.
With all due respect, it seems to me that we all three have a handle
on a part of the problem, but none of us appears to be versed in
queuing theory.

What I recall is that queue lengths depend not only on AVERAGE
arrival rates. The variations in arrival rates make a huge difference,
as experienced daily with queues before ladies' bathrooms (yes I
am aware that ladies, unlike email, don't back off exponentially).

Short of adding extra concurrency nothing is going to clear a
persistent source of slow mail (or a sufficiently-large deferred
queue). However, this is not the scenario that I have in mind,
and I think the same holds for Patrik.

We just don't want to dedicate too many mail delivery resources to
the slowest messages. Faster messages (or an approximate proxy:
new mail) should be scheduled soon for delivery. They should not
have to wait at the end of the line.

Now we could take advantage of the fact that in many cases the
"slow" and "fast" messages cluster around different sites, thus
their recipients will end up in different in-memory queues. If
there were feedback of fine-grained delivery agent latencies to
qmgr(8), then it could rank nexthop destinations. Not to starve slow
mail, but only to ensure that slow mail does not starve new mail.

Wietse
Viktor Dukhovni
2013-05-15 16:27:04 UTC
Permalink
Post by Wietse Venema
We just don't want to dedicate too many mail delivery resources to
the slowest messages. Faster messages (or an approximate proxy:
new mail) should be scheduled soon for delivery. They should not
have to wait at the end of the line.
Providing more capacity when we:

- Have the memory resources.

- Are in no risk of creating excessive contention.

is unconditionally going to help, even when some of the processes
are reserved for new mail. As I've said before, these approaches
are *composable*. One can dynamically yield more slots to the slow
mail, and use separate transports for slow vs. fast mail.

We could have multiple deferred queues, where mail that took a
long time to deliver goes into the slow deferred queue, while
mail that was simply greylisted, ... goes into the regular
deferred queue, thus giving us better proxies for fast/slow.

There are many possible knobs. Not letting slow mail accumulate
in the active queue, by adding concurrency is one of them. Tuning
retries is another...
Post by Wietse Venema
Now we could take advantage of the fact that in many cases the
"slow" and "fast" messages cluster around different sites, thus
their recipients will end up in different in-memory queues. If
there was a feedback of fine-grained delivery agent latencies to
qmgr(8), then could rank nexthop destinations. Not to starve slow
mail, but only to ensure that slow mail does not starve new mail.
The queue manager tends to forget everything about a queue when it
becomes empty. This is needed to conserve memory. With intermittent
queue scans, if the blockages clear before the next queue scan, the
knowledge that a source is slow may be flushed.

It takes multiple dead destinations to create a problem, since the
initial destination concurrency is 5, and we don't raise it when
deliveries site-fail (as they typically do for the bogus destinations).

So it takes 20+ such destinations to saturate the process limit.
If there are many, the queue starvation happens before there is
time for any feedback to rank slow destinations. In Patrik's
case with made-up domains on webforms, does the bogus mail in fact
tend to have lots of instances of the same invented destination
domain? Or are users sufficiently inventive to make collisions
rare?

In all probability all we need to do is raise the maximal backoff
time from 4000s to ~4-5 hours, and advise the users in question to
reduce the maximal queue lifetime to ~2 days from 5. This will
reduce the stream from the deferred queue to a trickle.

It may also help to add a random number between 1 and minimal_backoff_time
to the next retry time of a deferred message (on top of the
exponential backoff). This will help to diffuse clusters of deferred
mail.
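A minimal sketch of that jitter idea, with invented names (these are
not actual Postfix variables):

    import random

    def next_retry_time(now_s, exponential_backoff_s, minimal_backoff_s):
        # spread retries by a random offset on top of the exponential backoff
        return now_s + exponential_backoff_s + random.randint(1, minimal_backoff_s)

    # two messages deferred at the same moment no longer retry in lock-step
    print(next_retry_time(0, 4000, 300), next_retry_time(0, 4000, 300))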
--
Viktor.
Patrik Rak
2013-05-15 16:52:52 UTC
Permalink
Post by Wietse Venema
Short of adding extra concurrency nothing is going to clear a
persistent source of slow mail (or a sufficiently-large deferred
queue). However I this is not the scenario that I have in mind.
and I think the same holds for Patrik.
Right.
Post by Wietse Venema
We just don't want to dedicate too many mail delivery resources to
the slowest messages. Faster messages (or an approximate proxy:
new mail) should be scheduled soon for delivery. They should not
have to wait at the end of the line.
Exactly.

I would also like to point out that in my case, the "slow mail" is not
slow as in "mail which goes to sites behind slow links". It is slow
as in "it takes a long time before the delivery agent times out".

Therefore, the 60:1 example is not unrealistic at all - in fact, as
normal mail delivery gets faster, this ratio easily gets even worse,
because (and as long as) the timeout remains the same.

And that's also why throwing in more delivery agents in this case is
such a waste - no matter how much I throw in, this mail doesn't get
delivered, period. That's why I am reluctant to spend any extra
resources on that.
Post by Wietse Venema
Now we could take advantage of the fact that in many cases the
"slow" and "fast" messages cluster around different sites, thus
their recipients will end up in different in-memory queues. If
there was a feedback of fine-grained delivery agent latencies to
qmgr(8), then could rank nexthop destinations. Not to starve slow
mail, but only to ensure that slow mail does not starve new mail.
Ditto, in my case I don't really need to measure remote site speed, as
there is often no site as such anyway. It just times out.

For normal sites, as long as their respective "delivery time speeds"
ratios are reasonable, none of them is dominating the queue so terribly
that it would bother me that much...

Patrik
Viktor Dukhovni
2013-05-15 17:04:23 UTC
Permalink
Post by Patrik Rak
I would also like to point out that in my case, the "slow mail" is
not a slow mail as in "mail which goes to sites behind slow links".
It is slow as in "it takes long time before the delivery agent times
out".
Clear from the outset.
Post by Patrik Rak
Therefore, the 60:1 example is not unrealistic at all - in fact, as
normal mail delivery gets faster, this ratio easily get's even
worse, because (and as long as) the timeout remains the same.
My issue with 60:1 is not with the latency ratio, but with the
assumption that there is an unlimited supply of such mail to soak
up as many delivery agents as one may wish to add. In practice
the input rate of such mail is finite, if the output rate (via high
concurrency) exceeds the input rate, there is no accumulation and
no process exhaustion.
Post by Patrik Rak
And that's also why throwing in more delivery agents in this case is
such a waste - no matter how much I throw in, this mail doesn't get
delivered, period. That's why I am reluctant to spend any extra
resources on that.
It is not a waste, each message *will* eventually be allocated a
process and will be tried. All I want to do is widen the pipe and
deal with congestion quickly! If you keep the pipe narrow you risk
overflowing the queue capacity. A wider pipe is useful. You want
to not starve new mail; we can do both.

In order to drain slow mail quickly (allocate a bunch of sleeping
processes via a bit of memory capacity without thrashing) without
starving new mail, we need separate process pools for the slow and
fast path, each of which can use the blocked delivery agent process
limit balloon. Then there is never any contention between the two
flows.

Be careful to not starve the deferred queue without back-pressure
on new mail. Let new mail find a less-congested MX host.
--
Viktor.
Patrik Rak
2013-05-16 08:22:17 UTC
Permalink
Post by Viktor Dukhovni
My issue with 60:1 is not with the latency ratio, but with the
assumption that there is an unlimited supply of such mail to soak
up as many delivery agents as one may wish to add. In practice
the input rate of such mail is finite, if the output rate (via high
concurrency) exceeds the input rate, there is no accumulation and
no process exhaustion.
No doubt about any of this.

No one is talking about unlimited, though. You need only a few times
as many deferred messages as you have delivery agents available to
experience delays in new mail deliveries. Probability-wise it all works
well, but in practice it does not.
Post by Viktor Dukhovni
In order to drain slow mail quickly, (allocate a buch of sleeping
processes via a bit of memory capacity without thrashing) without
starving new mail we need separate process pools for the slow and
fast path. Each of which can use the blocked delivery agent process
limit baloon. Then there is never any contention between the two
flows.
I think we are beyond this split model already. It increases the overall
resource cost and yet doesn't allow the groups to share resources. It also
doesn't seem to deal as well with the situation when you misclassify
something. I would say the shared resource pool is better.

Patrik
Patrik Rak
2013-05-16 07:54:45 UTC
Permalink
Post by Wietse Venema
What I recall is that queue lengths depend not only on AVERAGE
arrival rates. The variations in arrival rates make a huge difference,
as experienced daily with queues before ladies' bathrooms (yes I
am aware that ladies, unlike email, don't back off exponentially).
BTW, this one, no matter how humorous it may seem, is really spot on.

Like the ladies, I don't need a solution which works well on average.
New mail delivered quickly on average is not good enough. I need it
delivered quickly, period. If any new mail is delayed several minutes,
the harm is already done, however low the average delay might be.

That's the reason why Viktor's analysis about average rates, no matter
how enlightening and insightful it may be, doesn't provide a usable
solution for this.

During the night I have also realized another problem regarding the
retry times. By adjusting those to defeat congestion, for lack of
better controls, Viktor deliberately and unconditionally affects the
delivery times of legitimate deferred mail at the same time. By preparing for
the worst, he may as well needlessly delay graylisted mail, regardless
of whether there is any congestion happening or not.

With the common presence of graylisting these days, I can't afford to
use the retry times in a questionable attempt to fight congestion. I
need them for what they are supposed to do - to control the retry times
of deferred mail. My goal is to deliver legitimate mail as quickly as
possible, be it graylisted or not. The graylisting timeouts range from
tens of seconds to several minutes, so I need short retry times to deal
with that. If I leave the minimum retry time set to 300s, I have already
lost the case.

Just thought I'd mention this, in case there is someone else still
following this thread who hasn't been bored to death by now... :)

Patrik
Patrik Rak
2013-05-15 16:01:42 UTC
Permalink
Post by Viktor Dukhovni
It is a fallacy to claim that increasing the output capacity of a
queue does not reduce congestion. Any queue is congested under
high enough load, there might simply be too much mail slow or fast.
No one is saying that. Well, definitely I am not.
Post by Viktor Dukhovni
Your magical 60:1 model ignores the "supply" of deferred mail, it
What's magical about it?
Post by Viktor Dukhovni
is generally equal to the supply of new mail and should be tuned
to generate retries at a rate below the output capacity. I'm
providing that tuning.
Yes, you explain how you can achieve reasonable throughput by throwing
in lots of delivery agents and by carefully tuning a handful of secondary
parameters so you get what you need. Well, at least as long as your
input and output rates don't change much.

I instead provide an additional knob which doesn't harm delivery of
deferred mail in a healthy mail system (such as mail deferred by
greylisting) at all and is able to prevent congestion by undeliverable
mail without doing either of those.

Still waiting to hear some reason why what I propose is bad.

Patrik
Viktor Dukhovni
2013-05-15 16:35:58 UTC
Permalink
Post by Patrik Rak
Still waiting to hear some reason why what I propose is bad.
The various proposals are largely complementary. If we restrict
the slow path to 80% of the process limit, that's not too dramatic
a reduction (though slow mail should get more processes if possible,
especially if this can be done *without* starving fast mail).

I am more concerned with the idea to limit deferred queue scans
when 80% of the active queue is previously deferred, while continuing
to take in new mail exclusively. This is not a good idea.

Postfix already exerts too little back-pressure when the queue
fills; ignoring the deferred queue while taking more new mail
quickly will eliminate most of that (when the incoming queue is
not growing, there is no inflow_delay).

So I would definitely NOT cap the deferred queue fraction of the
active queue and favour new mail, unless stiffer back-pressure is
applied to new input (cleanup, ...) upstream. Yes we should quickly
process what we accepted, but we really should reduce our appetite
for new mail when the queue is already very large.
--
Viktor.
Wietse Venema
2013-05-15 16:54:20 UTC
Permalink
Post by Viktor Dukhovni
Postfix already exerts too little back-pressure when the queue
fills,
Agreed.
Post by Viktor Dukhovni
ignoring the deferred queue while taking more new mail
quickly will eliminate most of that (when the incoming queue is
You are mis-representing.

There is no intent to IGNORE the deferred queue. After all it is
allowed to occupy 80% of all the delivery agents! The intent is
to give it only 80% or whatever. As soon as a deferred message
clears the queue it is replaced with another one.

Wietse
Viktor Dukhovni
2013-05-15 17:25:07 UTC
Permalink
Post by Wietse Venema
Post by Viktor Dukhovni
Postfix already exerts too little back-pressure when the queue
fills,
Agreed.
Post by Viktor Dukhovni
ignoring the deferred queue while taking more new mail
quickly will eliminate most of that (when the incoming queue is
You are mis-representing.
There is no intent to IGNORE the deferred queue. After all it is
allowed to occupy 80% of all the delivery agents! The intent is
to give it only 80% or whatever. As soon as a deferred message
clears the queue it is replaced with another one.
Yes, but the effect is the same, the input queue continues to drain
quickly with a substantial reduction in the already light back-pressure,
and the deferred queue grows.

This growth without back-pressure is arguably a feature for a backup
MX host with piles of disk that is willing to queue 5 days of mail
for a dead primary, but in most other cases back-pressure is useful,
to avoid making a bad situation worse. (I would add a fallback
queue for a large-capacity backup MX, and disable inflow controls
on the fallback input).

We could also handle deferred processing of dead (or temporarily
down) destinations more efficiently, by using a lower *initial*
destination concurrency for deferred mail on a per-destination basis.

Turn the initial concurrency of 5 with a cohort count of 1, sideways
to an initial concurrency of 1 and a cohort limit of 5.

On a per-destination basis, the same five messages fail and throttle
the site, but they do so in series, leaving more room for other
concurrent deliveries. If they don't fail, the concurrency rises.
A destination with some previously deferred mail that is currently
active is treated as though all the mail is high risk, and gets
the adjusted concurrency limits.
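Purely as a toy illustration of that trade-off (this is not the actual
qmgr scheduling code, and the 30-second failure latency is an assumed
value): with the current 5/1 setting the five probing deliveries to a
dead destination fail in parallel, with 1/5 they fail one at a time.

    def probe_cost(initial_concurrency, cohort_limit, failure_latency_s):
        # failures needed before the destination is throttled, expressed
        # as total agent-seconds spent and peak agents tied up at once
        failures = initial_concurrency * cohort_limit
        return failures * failure_latency_s, initial_concurrency

    print(probe_cost(5, 1, 30))   # (150 agent-seconds, 5 agents at once)
    print(probe_cost(1, 5, 30))   # (150 agent-seconds, 1 agent at a time)

The total work is the same, but the serial variant trades a longer
wall-clock time to throttle the destination for a much lower peak
demand on the process pool.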
--
Viktor.
Wietse Venema
2013-05-15 18:30:06 UTC
Permalink
Post by Viktor Dukhovni
Post by Wietse Venema
Post by Viktor Dukhovni
Postfix already exerts too little back-pressure when the queue
fills,
Agreed.
Post by Viktor Dukhovni
ignoring the deferred queue while taking more new mail
quickly will eliminate most of that (when the incoming queue is
You are mis-representing.
There is no intent to IGNORE the deferred queue. After all it is
allowed to occupy 80% of all the delivery agents! The intent is
to give it only 80% or whatever. As soon as a deferred message
clears the queue it is replaced with another one.
Yes, but the effect is the same, the input queue continues to drain
quickly with a substantial reduction in the already light back-pressure,
and the deferred queue grows.
[discussion of what I perceive as a many-knob solution involving
initial concurrency, cohort sizes, and more]

I don't want to disparage your contribution to the discussion, but
I think that software's job is to change one complex problem into
a simpler problem (examples from other domains: driving a car with
automatic transmission; controlling temperature with a thermostat).

I have seen (too) many examples of software that does too little
in this respect, leaving the user with a complex multi-knob solution.
Thus, my reluctance to implement Postfix solutions that rely on
many-knob solutions.

Said otherwise, the typical Postfix operator should not need a math
degree from Princeton. However it may well take a math degree to
figure out how one complex problem can be mapped onto one parameter.

Patrik appears to have a source of mail that will never be delivered.
He does not want to run a huge number of daemons; that is just
wasteful. Knowing that some mail will never clear the queue, he just
doesn't want such mail to bog down other deliveries.
From that perspective, the natural solution is to reserve some fraction
X of resources to the delivery of mail that is likely to be deliverable
(as a proxy: mail that is new).

Wietse
Patrik Rak
2013-05-16 08:40:38 UTC
Permalink
Post by Wietse Venema
Patrik appears to have a source of mail that will never be delivered.
He does not want to run a huge number of daemons; that is just
wasteful. Knowing that some mail will never clear the queue, he just
doesn't want such mail to bog down other deliveries.
From that perspective, the natural solution is to reserve some fraction
X of resources to the delivery of mail that is likely to be deliverable
(as a proxy: mail that is new).
Very well said. Describes my thoughts exactly.

So, if you don't mind, I would like to go ahead and try to implement
this limit, for both the delivery agent slots as well as the active
queue slots. I think that enough has been said about this to provide
evidence that adding such a knob doesn't put us in any worse position
than we are in now, nor does it preclude us from using other solutions.

The only remaining objection seems to be the amount of back pressure
postfix applies to incoming mail, depending on the growth of the queue.
I believe this problem exists regardless of whether this new knob is in
place or not, so it may as well be a good idea to discuss this independently if
you feel like doing so now...

Patrik
Wietse Venema
2013-05-16 11:13:16 UTC
Permalink
Post by Patrik Rak
Post by Wietse Venema
Patrik appears to have a source of mail that will never be delivered.
He does not want to run a huge number of daemons; that is just
wasteful. Knowing that some mail will never clear the queue, he just
doesn't want such mail to bog down other deliveries.
From that perspective, the natural solution is to reserve some fraction
X of resources to the delivery of mail that is likely to be deliverable
(as a proxy: mail that is new).
Very well said. Describes my thoughts exactly.
So, if you don't mind, I would like to go ahead and try to implement
this limit, for both the delivery agent slots as well as the active
queue slots. I think that enough has been said about this to provide
evidence that adding such knob doesn't put us in any worse position than
we are at now, nor does it preclude us from using other solutions.
You can try. I hope you can also document the result! Neither
Viktor nor I have been able to fully absorb the subtle details
of nqmgr in a reasonable time span (like a long weekend or so).

Just be aware that persistent backlog will also affect greylisted
mail, as the time to make one pass over the deferred queue increases
with backlog size. Increasing the maximal_backoff_time only delays
the onset of trouble.
Post by Patrik Rak
The only remaining objection seems to be the amount of back pressure
postfix applies to incoming mail, depending on the growth of the queue.
I believe this problem exists regardless of if this new knob is in place
or not, so it may as well be good idea to discuss this independently if
you feel like doing so now...
Agreed. Back-pressure to "sources outside Postfix" is orthogonal to this.

Wietse
Patrik Rak
2013-05-16 12:57:16 UTC
Permalink
Post by Wietse Venema
You can try. I hope you can also document the result! Neither
I'll do my best. Fortunately it seems this knob is pretty
straightforward to explain to the end users.
Post by Wietse Venema
Victor nor I have been able to fully absorb the subtle details
of nqmgr in a reasonable time span (like a long weekend or so).
We can try to arrange some kind of online meeting about this someday if
you wish. It would be even better to have a real whiteboard around, but
we can try.

It's not the first time it has occurred to me that some interactive
discussion might be helpful. OTOH, being able to think about something
over longer periods of time caused by time zone differences sometimes
helps, too. :)
Post by Wietse Venema
Just be aware that persistent backlog will also affect geylisted
mail, as the time to make one pass over the deferred queue increases
with backlog size. Increasing the maximal_backoff_time only delays
the onset of trouble.
I was thinking about a related issue today, too. The enhancement I was
considering was that rather than classifying all deferred mail as
"slow", we could exclude deferred mail which has not been in the deferred
queue for longer than some configurable time (explicit retry count might be
even better, but we don't keep track of that now). That would prevent
the greylisted mail from being put into the same class as the entire
backlog, which might be too harsh for fresh deferred mail.
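A minimal sketch of that classification rule; the names and the
one-hour threshold are invented for illustration, not existing Postfix
parameters:

    def is_slow(time_in_deferred_queue_s, slow_age_threshold_s=3600):
        # only mail that has been deferred for a while joins the "slow" class
        return time_in_deferred_queue_s > slow_age_threshold_s

    print(is_slow(600))     # False - freshly greylisted mail stays "fast"
    print(is_slow(86400))   # True  - day-old backlog is treated as slow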

Patrik
Wietse Venema
2013-05-16 13:34:48 UTC
Permalink
Patrik Rak:
Post by Patrik Rak
Post by Wietse Venema
You can try. I hope you can also document the result! Neither
I'll do my best. Fortunately it seems this knob is pretty
straightforward to explain to the end users.
Post by Wietse Venema
Victor nor I have been able to fully absorb the subtle details
of nqmgr in a reasonable time span (like a long weekend or so).
We can try to arrange some kind of online meeting about this someday if
you wish. It would be even better to have a real whiteboard around, but
we can try.
How much time do you have? I estimate this will take 3-4 sessions
of about 3-4 hours each, 1-2 per weekend. Obviously that can't be
done primarily by keyboard. In the worst case we may have to set
up some temporary account to keep international phone charges
under control.
Post by Patrik Rak
Post by Wietse Venema
Just be aware that persistent backlog will also affect geylisted
mail, as the time to make one pass over the deferred queue increases
with backlog size. Increasing the maximal_backoff_time only delays
the onset of trouble.
I was thinking about related issue today, too. The enhancement I was
considering was that rather than classifying all deferred mail as
"slow", we can exclude deferred mail which is not in the deferred queue
for longer than some configurable time (explicit retry count might be
even better, but we don't keep track of that now). That would prevent
the greylisted mail from being put into the same class as the entire
backlog, which might be too harsh for fresh deferred mail.
That is an optimization for near-empty deferred queues, but does
not address the concern that I raise above. Aside from that, your
optimization does not seem to introduce new worst-case behavior so
it is mostly harmless.

Wietse
Patrik Rak
2013-05-20 15:30:32 UTC
Permalink
Post by Wietse Venema
Post by Patrik Rak
We can try to arrange some kind of online meeting about this someday if
you wish. It would be even better to have a real whiteboard around, but
we can try.
How much time do you have? I estimate this will take 3-4 sessions
of about 3-4 hours each, 1-2 per weekend. Obviously that can't be
done primarily by keyboard. In the worst case we may have to set
up some temporary account to keep international phone charges
under control.
Very little these days, but it's something I can plan for in the longer
term. I'll get in touch off-list about this...
Post by Wietse Venema
Post by Patrik Rak
Post by Wietse Venema
Just be aware that persistent backlog will also affect geylisted
mail, as the time to make one pass over the deferred queue increases
with backlog size. Increasing the maximal_backoff_time only delays
the onset of trouble.
I was thinking about related issue today, too. The enhancement I was
considering was that rather than classifying all deferred mail as
"slow", we can exclude deferred mail which is not in the deferred queue
for longer than some configurable time (explicit retry count might be
even better, but we don't keep track of that now). That would prevent
the greylisted mail from being put into the same class as the entire
backlog, which might be too harsh for fresh deferred mail.
That is an optimization for near-empty deferred queues, but does
not address the concern that I raise above. Aside from that, your
optimization does not seem to introduce new worst-case behavior so
it is mostly harmless.
Right.

That huge backlog issue could theoretically be solved by introducing an
additional queue which would keep the deferred mail only for some
limited time. I say theoretically, as in practice I don't think it's
very likely we would want to introduce another queue at this moment.

What we might do, though, is to introduce a time-dependent fallback relay,
which would be used for deferred mail only after it has been in the queue
for some time, similar to what Viktor mentioned. That way, the amount of
mail in the deferred queue is considerably limited, and only older mail
is moved to the fallback relay. This way the greylisted mail would have
a much better chance of getting delivered, compared to the current state,
and both machines could IMO be tuned more appropriately. One could even
set up a cascade of a few more if desired...
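A minimal sketch of the selection rule being described; the relay name,
cutoff and function are invented for illustration:

    def choose_nexthop(time_in_queue_s, default_nexthop,
                       fallback_relay="[fallback.example.com]",
                       fallback_after_s=6 * 3600):
        # older deferred mail is handed to the fallback relay,
        # fresh (e.g. greylisted) mail keeps retrying locally
        if time_in_queue_s > fallback_after_s:
            return fallback_relay
        return default_nexthop

    print(choose_nexthop(1200, "example.org"))        # example.org
    print(choose_nexthop(2 * 86400, "example.org"))   # [fallback.example.com]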

Patrik
Patrik Rak
2013-06-11 09:11:35 UTC
Permalink
Post by Patrik Rak
What we might do, though, is to introduce time-dependent fallback relay,
which would be used for deferred mail only after it was in the queue for
some time, similar to what Viktor mentioned. That way, the amount of
mail in the deferred queue is considerably limited, and only older mail
is moved to the fallback relay. This way the greylisted mail would have
much better chance of getting delivered, compared to current state, and
both machines could be IMO tuned more appropriately. One could even
setup a cascade of few more if desired...
BTW, any opinions about this feature?

Patrik
Wietse Venema
2013-06-11 10:49:36 UTC
Permalink
Post by Patrik Rak
Post by Patrik Rak
What we might do, though, is to introduce time-dependent fallback relay,
which would be used for deferred mail only after it was in the queue for
some time, similar to what Viktor mentioned. That way, the amount of
mail in the deferred queue is considerably limited, and only older mail
is moved to the fallback relay. This way the greylisted mail would have
much better chance of getting delivered, compared to current state, and
both machines could be IMO tuned more appropriately. One could even
setup a cascade of few more if desired...
BTW, any opinions about this feature?
What if we were to group jobs by attribute X, where X can
be sender or size or something else?

Wietse
Patrik Rak
2013-06-11 11:00:29 UTC
Permalink
Post by Wietse Venema
Post by Patrik Rak
Post by Patrik Rak
What we might do, though, is to introduce time-dependent fallback relay,
which would be used for deferred mail only after it was in the queue for
some time, similar to what Viktor mentioned. That way, the amount of
mail in the deferred queue is considerably limited, and only older mail
is moved to the fallback relay. This way the greylisted mail would have
much better chance of getting delivered, compared to current state, and
both machines could be IMO tuned more appropriately. One could even
setup a cascade of few more if desired...
BTW, any opinions about this feature?
What if we were to group jobs by attribute X, where X can
be sender or size or something else?
Hmm, I am afraid I don't understand what you mean exactly. How would
that help with reducing the deferred backlog so graylisted mail isn't
delayed excessively by a deferred queue scan taking a looong time?

Patrik
Wietse Venema
2013-06-11 12:51:47 UTC
Permalink
Post by Patrik Rak
Post by Wietse Venema
Post by Patrik Rak
Post by Patrik Rak
What we might do, though, is to introduce time-dependent fallback relay,
which would be used for deferred mail only after it was in the queue for
some time, similar to what Viktor mentioned. That way, the amount of
mail in the deferred queue is considerably limited, and only older mail
is moved to the fallback relay. This way the greylisted mail would have
much better chance of getting delivered, compared to current state, and
both machines could be IMO tuned more appropriately. One could even
setup a cascade of few more if desired...
BTW, any opinions about this feature?
What if we were to group jobs by attribute X, where X can
be sender or size or something else?
Hmm, I am afraid I don't understand what you mean exactly. How would
that help with reducing the deferred backlog so graylisted mail isn't
delayed excessively by deferred queue scan taking looong time?
... where X can be time in queue.

I am looking for general principles, rather than special cases.

Wietse
Patrik Rak
2013-06-12 07:57:04 UTC
Permalink
Post by Wietse Venema
Post by Patrik Rak
Post by Wietse Venema
Post by Patrik Rak
Post by Patrik Rak
What we might do, though, is to introduce time-dependent fallback relay,
which would be used for deferred mail only after it was in the queue for
some time, similar to what Viktor mentioned. That way, the amount of
mail in the deferred queue is considerably limited, and only older mail
is moved to the fallback relay. This way the greylisted mail would have
much better chance of getting delivered, compared to current state, and
both machines could be IMO tuned more appropriately. One could even
setup a cascade of few more if desired...
BTW, any opinions about this feature?
What if we were to group jobs by attribute X, where X can
be sender or size or something else?
Hmm, I am afraid I don't understand what you mean exactly. How would
that help with reducing the deferred backlog so graylisted mail isn't
delayed excessively by deferred queue scan taking looong time?
... where X can be time in queue.
I am looking for general principles, rather than special cases.
Ah, I see. For some reason I got it as "what we could do instead",
rather than "how we could extend the idea". Thus my confusion. Sorry.

Well, we can definitely add other attributes into the mix. I was
originally thinking about using the fallback relay, so the deferred mail
gets one more try before getting passed to a separate instance, but
looking at it in this broader context, we might perhaps override the
nexthop directly instead.

The problem with the current fallback relay is that it only allows us to
separate deferred mail from new mail after one try. There are ways of
separating mail based on senders and recipients already, but there is no
mechanism for separating mail queued for longer than X, nor messages
bigger than X (although the latter might be already achievable with a
bit of filtering).

Patrik

Viktor Dukhovni
2013-05-16 15:47:22 UTC
Permalink
Post by Patrik Rak
Post by Wietse Venema
Patrik appears to have a source of mail that will never be delivered.
He does not want to run a huge number of daemons; that is just
wasteful. Knowing that some mail will never clear the queue, he just
doesn't want such mail to bog down other deliveries.
From that perspective, the natural solution is to reserve some fraction
X of resources to the delivery of mail that is likely to be deliverable
(as a proxy: mail that is new).
Very well said. Describes my thoughts exactly.
What Patrik may not yet appreciate is that I was advocating
tackling a *related* problem. I was not claiming that concurrency
ballooning (let's give my approach that name) prevents starvation of
new mail under all conditions.

Rather, concurrency ballooning can:

- More quickly dispose of bursts of slow messages that can congest
the queue when they first arrive as new mail. A separate transport
for deferred mail does not address this.

- More quickly dispose of bursts of slow messages in the deferred
path when bad mail is mixed with greylisting, ...

The downside is that new mail is not "protected" from bursts of
bad mail that fill the balloon.

The Postfix "sitting there doing nothing" problem is not new, that's
what got me on the list posting comments and patches in June of 2001.

My view then and now is that when an idle system with plenty of
CPU, RAM and networking resources is just sitting there waiting
for timers to fire, what's wasteful (even if much of the mail is
ultimately discarded) is not using more of the system's resources
to have more timers expiring concurrently.

It is fine if you don't think the related problem is worth addressing,
but at least understand that my perspective is different: I strive
for higher throughput, and then congestion mostly takes care of itself.

I am not completely sold on the 80/20 reservation, since it too
will get blocked with slow mail when a concurrent bunch of slow
mail is new, or when the deferred queue is a mixture of likely
never deliverable, and recently deferred mail. So the approach is
not perfect either. Tweaking it to exclude messages that are not
sufficiently old (one maximal backoff time) perhaps addresses most
of my concern about mixed deferred mail, since a sufficiently
delayed message can reasonably tolerate a bit more delay.
Post by Patrik Rak
So, if you don't mind, I would like to go ahead and try to implement
this limit, for both the delivery agent slots as well as the active
queue slots. I think that enough has been said about this to provide
evidence that adding such knob doesn't put us in any worse position
than we are at now, nor does it preclude us from using other
solutions.
Go ahead.
Post by Patrik Rak
The only remaining objection seems to be the amount of back pressure
postfix applies to incoming mail, depending on the growth of the
queue. I believe this problem exists regardless of if this new knob
is in place or not, so it may as well be good idea to discuss this
independently if you feel like doing so now...
Back-pressure is about the behaviour of the 80%-of-queue ceiling (rather
than the 80%-of-agents one), and its likely impact is to largely eliminate
inflow_delay (which is already fairly weak). So the issue is whether
and how to slow down input (smtpd + cleanup) when the queue is large
(as evidenced by a lot of deferred mail in the active queue).

So your work will have an impact on back-pressure (it will further
reduce it), but perhaps since the existing back-pressure is fairly
weak, we can live with it becoming a bit weaker still for now.

The current back-pressure mostly addresses the "stupid performance
test" case rather than the persistent MTA overload case. So if we
want to address persistent overload (perhaps as a result of output
collapse, as with a broken second network card) we can design that
separately. It would perhaps be useful to shed load onto healthier
MX hosts in a cluster in which one MX host is struggling.
--
Viktor.
Viktor Dukhovni
2013-05-16 16:39:56 UTC
Permalink
Post by Viktor Dukhovni
The Postfix "sitting there doing nothing" problem is not new, that's
what got me on the list posting comments and patches in June of 2001.
For the record, it was July.

http://archives.neohapsis.com/archives/postfix/2001-07/0871.html
--
Viktor.
Wietse Venema
2013-05-14 14:58:00 UTC
Permalink
Post by Patrik Rak
We classify every mail into one of the two groups. We can call them fast
and slow for simplicity, but in fact they are "hopefully fast" or
"presumably slow". For the start it can be equal to new mail and
deferred mail, but doesn't have to, as Wietse pointed out before.
Now let's explore what share of the available resources each group gets.
When both groups contain some mail and the mail is delivered equally
fast, they get 1:1 split. That seems fair. If the slow group becomes say
4 times slower on average, they will get 4:1 split over time. The same
holds if the fast group becomes 4 times slower, they will get 1:4 split.
So far, so good.
Now if one group becomes really slow, like 30 or 60 times slower than
the other one, it's effectively the case when it starts starving the
other one. If it is the slow group which becomes this slow, it gets
60:1 split, which with ~100 delivery agents available is obviously not
enough to get new mail delivered fast enough. If we were willing to
increase the transport limit considerably, the 1/61 will eventually
become enough delivery agents available for fast mail delivery. However,
what I say is that it's enough if we simply do not allow the ratio go
this high. We can fairly easily limit the amount of resources we give to
the bad guys to 80% or 90%, allowing them to get no more than 4:1 or 9:1
split. That can leave quite enough for the fast group while not wasting
too much on the bad group. Seems like good trade, especially when we
presume that most of the bad mail won't get delivered anyway (if it
were, it wouldn't likely be this slow and demand so much resources in
the first place).
With 100 delivery agents, this means you can have 80 slow messages
in the active queue, right?

Wietse
Patrik Rak
2013-05-15 08:46:00 UTC
Permalink
Post by Wietse Venema
With 100 delivery agents, this means you can have 80 slow messages
in the active queue, right?
Not exactly.

It means that

- with 100 delivery agents, only up to 80 can be tied up by the slow mail,
and 20 always remain dedicated to the "fast" mail,

and also that

- with active queue size of 10000, only up to 8000 of those can be from
the deferred queue, and 2000 are always dedicated to the incoming queue.

The same principle is applied to both bottlenecks. That should be enough
for qmgr to keep the mail flowing.
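A minimal sketch of the reservation described above (the 80% figure is
the one from the discussion; the code is illustrative, not the proposed
qmgr change):

    def may_take_slot(slow_in_use, total_slots, slow_share=0.80):
        # allow another slot for previously-deferred ("slow") mail only
        # while it occupies less than its share; the rest stays reserved
        return slow_in_use < int(total_slots * slow_share)

    print(may_take_slot(79, 100))   # True  - slow mail may take one more slot
    print(may_take_slot(80, 100))   # False - the remaining 20 stay reserved

The same check applies to the active queue slots, just with a larger
total.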

If I decide I need more delivery agents dedicated to the new mail, I
have the option of either decreasing the ratio limit (even as little as
50% is still fair), or increasing the total number of delivery agents.
The major advantage here is that if I double the total number of
delivery agents, I am guaranteed to get twice as many for the fast mail.

In comparison, without this ratio limit knob available, if I were to
follow Viktor's advice of bumping the transport limit through the roof,
I would need to set the limit to 1220 to get the same 20 delivery agents
available for the fast mail, as 1200 would be blocked by the slow mail
(assuming 60:1 split, e.g., 30sec vs 0.5sec delivery times, as an example).

Patrik
Viktor Dukhovni
2013-05-11 22:41:55 UTC
Permalink
Post by Patrik Rak
I have considered this solution as well.
It sounds simple enough, but once you start thinking about how
to implement that, it's not as easy.
Then I believe you're solving the wrong problem. Your claim is
on the one hand that multiple instances are too complex, and users
want something simple to avoid congested output queues.

Then you decide that the simple thing is too hard, and you want
them to configure a second set of transports for deferred mail and
hand tune these all within a single Postfix instance.

Somewhere the original motivation is completely lost, and the
solution is *more* complex than setting up another instance as a
fallback queue and solves a subset of the possible issues (e.g.
not active queue saturation, just high transport latency).

You need to decide which problem you're solving. If you want
something simple enough for the naive user, it needs to actually
be simple. If you want a power tool for the experienced user, give
it power and flexibility.
--
Viktor.
Patrik Rak
2013-05-12 08:48:41 UTC
Permalink
Post by Viktor Dukhovni
Post by Patrik Rak
I have considered this solution as well.
It sounds simple enough, but once you start thinking about how
to implement that, it's not as easy.
Actually, that's how I reached the separate transport solution - the
more I was thinking about how to limit the deferred recipients within a
single transport, the more it seemed logical to internally split the transport
in half so each part can work without interfering with the other. Then I
thought "well, why not give the control to the user right away, so they
can have it with all the knobs and benefits that come with it".
Post by Viktor Dukhovni
Then I believe you're solving the wrong problem. Your claim is
on the one hand that multiple instances are too complex, and users
want something simple to avoid congested output queues.
Not congested queues, but congested delivery agents.

For the same reason why you want the deferred recipients in separate
instance, so they don't affect the rest and you can control it
independently, I want them in separate transport, so they don't
interfere with the rest. Is it wrong to want that?
Post by Viktor Dukhovni
Then you decide that the simple thing is too hard, and you want
them to configure a second set of transports for deferred mail and
hand tune these all within a single Postfix instance.
Somewhere the original motivation is completely lost, and the
solution is *more* complex than setting up another instance as a
fallback queue and solves a subset of the possible issues (e.g.
not active queue saturation, just high transport latency).
I don't think so. Adding a second instance consumes more resources,
requires more configuration and doubles any further maintenance costs.
I don't get how you can claim it puts less burden on the user than
adding a single line to two config files?

I also don't understand what you consider so difficult about adding
one extra transport, especially in the stock config? Adding the relay
transport didn't seem like a problem either. And I have used different
transports for a subset of target sites many times, even long before the
relay class was added, and it never seemed difficult to manage.
Would you mind elaborating a bit?
Post by Viktor Dukhovni
You need to decide which problem you're solving. If you want
something simple enough for the naive user, it needs to actually
be simple. If you want a power tool for the experienced user, give
it power and flexibility.
That's exactly what I am trying to do. Adding one line to master.cf and
one line to main.cf is IMO as simple as it gets (other than the option of
making it the default setting in the first place). The experienced user
then has the option of tuning it further if he wants to, and besides he
always has the option of using the fallback relay setup as before...
Horses for courses, it's not like I am forcing anyone to have
to configure both ways at the same time...

Patrik