Discussion:
single-threaded vs. multithreaded servers
j***@comcast.net
2006-05-02 22:13:31 UTC
I realize this question isn't directly related to winsock programming,
please let me know if there's a more appropriate group.

I'm involved in a debate about the best way to implement a low-latency,
high-throughput, high-number-of-simultaneous-clients server. The
primary platform we'll support is the Windows Server family. Everything
I've read indicates these kinds of servers should use multiple threads,
non-blocking sockets, overlapped I/O, and I/O completion ports (hence
why I'm posting to this group). However, some influential people here
claim using the IOCP approach brings more pain than it's worth.

Their main point is that multiple threads don't provide much benefit
for an I/O-bound application, since most server configurations will
have only one NIC, and that is the "great serializer". On the surface,
that seems convincing. Reading and writing to I/O devices is slower
than processing data and reading and writing memory. So, if the
application receives a client request, and then can process it and send
a response before the next request is available to be read, the app
might as well be single-threaded. Is this true?

How typical is that scenario (that a server can prepare a response
before the next request is read from the wire) in the server universe?

My main argument for using multiple threads, IOCP, etc. is that almost
every reference I've seen indicates that's the right way, and using
single-threaded select() apps is wrong. Is there any documentation that
says otherwise?

Currently, the app in question is multi-threaded, but all network I/O
is handled in a single thread in a select() loop. Is there anything I
can measure to determine if this is a problem, say, socket read buffer
sizes, number of ready-for-read handles returned by select, or
profiling the thread state (running vs. waiting for I/O)?

Thanks.
Peter Duniho
2006-05-02 23:50:40 UTC
Post by j***@comcast.net
I realize this question isn't directly related to winsock programming,
please let me know if there's a more appropriate group.
It seems relevant enough. :)
Post by j***@comcast.net
I'm involved in a debate about the best way to implement a low-latency,
high-throughput, high-number-of-simultaneous-clients server. The
primary platform we'll support is the Windows Server family. Everything
I've read indicates these kinds of servers should use multiple threads,
non-blocking sockets, overlapped I/O, and I/O completion ports (hence
why I'm posting to this group). However, some influential people here
claim using the IOCP approach brings more pain than it's worth.
I assume that by "here" you mean in your own work team, and not this
newsgroup. I've never seen anyone here make any negative comments about
IOCP, from a cost-benefit point of view (yes, it's more complicated, but
everyone seems to be in agreement that it's the right way to go for
large-scale servers).
Post by j***@comcast.net
Their main point is that multiple threads don't provide much benefit
for an I/O-bound application, since most server configurations will
have only one NIC, and that is the "great serializer". On the surface,
that seems convincing. Reading and writing to I/O devices is slower
than processing data and reading and writing memory. So, if the
application receives a client request, and then can process it and send
a response before the next request is available to be read, the app
might as well be single-threaded. Is this true?
If all of your assumptions are true, then yes. However, from what I have
read here (and note that I personally haven't implemented an IOCP
server...all my network programming is small-scale stuff), there are some
key problems with the assumptions you state above. In particular, for a
server handling tens of thousands of clients, the assumption that processing
a client request can be done before there is another request available to
service is incorrect, especially if the server has a fast Internet
connection (as would be expected for a server handling tens of thousands of
clients).

Without knowing your particular situation, it's hard to answer your specific
question. But there seems to be no disagreement among the Winsock experts
that IOCP is *the* scalable, high-bandwidth paradigm to use in Winsock.
Post by j***@comcast.net
How typical is that scenario (that a server can prepare a response
before the next request is read from the wire) in the server universe?
For a large server handling tens of thousands of clients with a sufficiently
fast connection to the Internet, I would guess the scenario is not very
typical. The individual processing would have to be trivial. Anything
interesting could easily take enough time to make that the bottleneck,
rather than the network i/o.
Post by j***@comcast.net
My main argument for using multiple threads, IOCP, etc. is that almost
every reference I've seen indicates that's the right way, and using
single-threaded select() apps is wrong. Is there any documentation that
says otherwise?
For sure, using select() is wrong on Windows. As has been mentioned here
many times before, it's provided only for compatibility with Berkeley
sockets. Even for low-bandwidth applications, there are better choices.
I've never seen any documentation that suggests that select() is a
preferable solution for *any* Winsock application.
Post by j***@comcast.net
Currently, the app in question is multi-threaded, but all network I/O
is handled in a single thread in a select() loop. Is there anything I
can measure to determine if this is a problem, say, socket read buffer
sizes, number of ready-for-read handles returned by select, or
profiling the thread state (running vs. waiting for I/O)?
IMHO, the easiest way to tell is to look at your network utilization. If
your server can keep its network connection at something very close to 100%
of its capacity under load, then your other stuff isn't a bottleneck.

If, on the other hand, when you try to max out the load on your server, you
find that you cannot get the network utilization above (for example) 50%,
then your bottleneck is in the server and you should look at some other
solutions.

In other words, if you have a 10Mbps connection, and enough clients to
theoretically maximize the use of that connection, but you only see a data
throughput of 5Mbps, then your server could probably use a more efficient
paradigm. If you see a throughput of 9Mbps or better, then you're unlikely
to achieve better results with any changes to the server.

Pete
Michael K. O'Neill
2006-05-03 01:39:49 UTC
Post by Peter Duniho
[snip]
For some representative numbers, see the quotation (and the table) in this
thread: http://www.codeguru.com/forum/showthread.php?p=1182005#post1182005

The quote is from a book, "Network Programming for Microsoft Windows, Second
Edition", by Anthony Jones and Jim Ohlund, Microsoft Press, 2002. It gives
a comparison of network performance for various different socket
architectures, each implementing a simple echo server.

The numbers clearly show that IOCP is the winning architecture. Blocking
select() is the worst, and non-blocking select() is only slightly better.

IOCP is not supported on Win98, but that won't be an issue with the platform
you're targeting.

Mike
jGoodchild
2006-05-04 07:40:05 UTC
Interesting that there aren't any numbers for Overlapped operation
using completion routines.

Anyway, I recently raised a similar issue - see:

http://groups.google.com/group/alt.winsock.programming/browse_frm/thread/c592603db56091ff

My view is that, if you have just a single CPU, and if a worker thread
doesn't block on anything other than the network events, a
single-threaded design based on Overlapped completion routines is just
as effective as IOCP, as IOCP would only allow a single worker thread to
run in those circumstances.

Once you introduce a second CPU, or if the worker threads need to wait
for some other sort of I/O (e.g. disk access), then IOCP is clearly
more efficient.

My own interest comes from the fact that I was designing a network
interface framework to allow programs to be written in such a way as to
allow common code on both Windows and Linux, and I found it very
difficult to make the design work with IOCP.

Jonathan
j***@comcast.net
2006-05-05 22:23:21 UTC
Post by Michael K. O'Neill
The quote is from a book, "Network Programming for Microsoft Windows, Second
Edition", by Anthony Jones and Jim Ohlund, Microsoft Press, 2002. It gives
a comparison of network performance for various different socket
architectures, each implementing a simple echo server.
New copies are going for $150 through Amazon's resellers. Almost 3
times its price when it was still in print. Guess not everyone has
moved to .NET yet <g>
j***@comcast.net
2006-05-05 22:13:34 UTC
Peter Duniho wrote:

[snip]
Post by Peter Duniho
IMHO, the easiest way to tell is to look at your network utilization. If
your server can keep its network connection at something very close to 100%
of its capacity under load, then your other stuff isn't a bottleneck.
If, on the other hand, when you try to max out the load on your server, you
find that you cannot get the network utilization above (for example) 50%,
then your bottleneck is in the server and you should look at some other
solutions.
In other words, if you have a 10Mbps connection, and enough clients to
theoretically maximize the use of that connection, but you only see a data
throughput of 5Mbps, then your server could probably use a more efficient
paradigm. If you see a throughput of 9Mbps or better, then you're unlikely
to achieve better results with any changes to the server.
Pete
Thanks for the reply. Pretty much confirms what I've read. But there
are some pretty stubborn people around here <g>...when they start
talking about how, since the x86 lock instruction prefix is inefficient
and separate processors use "preferred" ranges of memory (anyone
hear of that one before? at least the first one makes sense),
multithreading costs more than any benefit you'd gain, I start feeling
like I'm wasting my time. But then again, Microsoft uses different
binaries for single-processor machines vs. SMP machines, so they might
be on to something.

I'll try to prove the difference with perfmon's network counters and a
heavy load.
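
For example, something along these lines could sample the NIC's
throughput counter programmatically during the load test (a rough
sketch; the instance name in the counter path is an assumption -- list
the real ones with 'typeperf -q "Network Interface"'):

// Rough sketch: sample "Bytes Total/sec" for one NIC via the PDH API,
// then compare against the link's capacity (10Mbps ~= 1,250,000 B/s).
#include <windows.h>
#include <pdh.h>
#include <stdio.h>
#pragma comment(lib, "pdh.lib")

int main()
{
    PDH_HQUERY   query;
    PDH_HCOUNTER counter;
    PdhOpenQuery(NULL, 0, &query);
    PdhAddCounter(query,
        TEXT("\\Network Interface(YOUR NIC INSTANCE)\\Bytes Total/sec"),
        0, &counter);

    PdhCollectQueryData(query);           // prime the first sample
    for (int i = 0; i < 60; ++i)          // one sample/sec for a minute
    {
        Sleep(1000);
        PdhCollectQueryData(query);
        PDH_FMT_COUNTERVALUE val;
        PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &val);
        printf("%.0f bytes/sec\n", val.doubleValue);
    }
    PdhCloseQuery(query);
    return 0;
}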
Peter Duniho
2006-05-06 00:41:12 UTC
Post by j***@comcast.net
Thanks for the reply. Pretty much confirms what I've read. But there
are some pretty stubborn people around here <g>...when they start
talking about how, since the x86 lock instruction prefix is inefficient
Inefficient compared to what? In general, multi-threading on a multi-CPU
computer, when coded correctly, provides the expected performance increase.
Clearly synchronization (whatever method one chooses to use) does not HAVE
to have a significant effect on performance.
Post by j***@comcast.net
and separate processors use "preferred" ranges of memory (anyone
hear of that one before? at least the first one makes sense),
Never heard about the "preferred ranges" thing, but then there are probably
some esoteric things I have yet to run across. It sounds a little like some
of the issues I discovered doing multi-threaded code on a Hyperthreaded CPU.
With HT (and I believe with the new Intel "dual-core" CPUs), the execution
units share a cache. Because of the way that cache lines are assigned to
specific memory addresses, and because Windows by default always starts each
thread's stack at the same alignment, you can really kill performance on a
HT CPU if each thread is doing practically the same thing (as is common with
certain kinds of multi-threaded applications :( ). Each thread winds up
trashing the cache of the other thread, killing the performance of both
threads.

You can imagine my surprise the first time I ran into this, where I wound up
with my throughput being reduced by about a third running with two threads
versus one on a HT CPU, rather than getting some performance gain. (In the
end, it turned out that even optimizing for the HT case, my best performance
gain was under 20%...I decided it was better to not bother trying to treat a
HT CPU as a true dual-proc setup). Quite a head-scratcher until I did some
research.
Post by j***@comcast.net
multithreading costs more than any benefit you'd gain, I start feeling
like I'm wasting my time. But then again, Microsoft uses different
binaries for single-processor machines vs. SMP machines, so they might
be on to something.
There are multiple reasons to use different binaries, but the biggest is
licensing. Customers pay more for SMP-enabled software, so it doesn't make
sense to hand them the SMP-enabled version if they've only paid for the
single-CPU version. Otherwise, you wouldn't really need multiple
binaries...you'd just need to check the number of CPUs and/or let the user
tell you whether to try to take advantage of multiple CPUs.

The fact that there are SMP-enabled versions of Microsoft software *ought*
to tell you that it's worthwhile writing multi-threaded code. :) You might
point that out to your co-workers. :)

As far as whether multi-threading costs more than any benefit you'd gain,
that depends entirely on the situation. It is true that there are many
applications for which multi-threading is unlikely to provide much benefit
at all. But it is also true that there are many applications for which
multi-threading provides a significant benefit, either performance-wise,
architecturally, or both.

As with any software paradigm or technology, multi-threading is no silver
bullet. It's a specific tool to be used in specific situations, where it's
practical and worthwhile. It's not always the right thing to do, and
neither is it always the wrong thing to do.
Post by j***@comcast.net
I'll try to prove the difference with perfmon's network counters and a
heavy load.
Good idea. :)

Pete
Chris Thomasson
2006-05-08 01:27:29 UTC
Post by Peter Duniho
Post by j***@comcast.net
Thanks for the reply. Pretty much confirms what I've read. But there
are some pretty stubborn people around here <g>...when they start
talking about how, since the x86 lock instruction prefix is inefficient
Inefficient compared to what? In general, multi-threading on a multi-CPU
computer, when coded correctly, provides the expected performance
increase. Clearly synchronization (whatever method one chooses to use)
does not HAVE to have a significant effect on performance.
However, sometimes there are situations where one may need to synchronize
"critical" shared data-structure(s).


FWIW, I have "run into" this issue a couple of times in the past. For
instance, during the development of an "enterprise-type" server that had to
handle 15,000-40,000+ concurrent connections. It also had to handle sporadic
periods of heavy load. The server did all it could to efficiently manage the
load(s). Let's see if I can very briefly describe the setup.

One part of the load management scheme was based on a per-socket timeout
algorithm that was able to quickly disconnect timed-out connections, with no
extra network overhead; a "timeout protocol" was avoided by keeping things
"local". The server stored "actively connected" per-socket state in a
hashed-linked data-structure that was based on a distributed per-node
locking algorithm. The locks were implemented as a single global array of
rw-mutexes, and access into them was managed by hashing a pointer to the
to-be-locked structure into an index into the global array; similar to what
I did here:


http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011baf08844c4/3ca11e0c3dcf762c?q=multi-mutex&rnum=1#3ca11e0c3dcf762c
(very early and crude version of my "multi-mutex" algorithm)
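
The gist of it is roughly this (a minimal sketch; the table size, the
hash, and the use of plain critical sections instead of rw-mutexes are
all illustrative assumptions, not the real algorithm):

// Sketch of a pointer-hashed global lock array ("multi-mutex").
#include <windows.h>

const size_t kLockCount = 128;              // power of two, tunable
CRITICAL_SECTION g_locks[kLockCount];

void InitLocks()
{
    for (size_t i = 0; i < kLockCount; ++i)
        InitializeCriticalSection(&g_locks[i]);
}

// Hash the address of the structure to be locked into a lock index;
// shifting right discards low bits that are identical due to alignment.
CRITICAL_SECTION* LockFor(const void* obj)
{
    size_t h = ((size_t)obj) >> 4;
    return &g_locks[h & (kLockCount - 1)];
}

void TouchPerSocket(void* perSocket)
{
    CRITICAL_SECTION* cs = LockFor(perSocket);
    EnterCriticalSection(cs);
    // ... mutate the per-socket structure ...
    LeaveCriticalSection(cs);
}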


Each per-socket structure had a pair of monotonic counters (e.g., old, cur)
that got non-atomically incremented by worker threads every time there was
any IO activity; luckily the count did not have to be accurate at all, so
simple increments worked fine. The timeout logic basically had to traverse
through the active per-socket list, sample each one's counter, and compare
it to the previously read value and a runtime-configurable per-socket timeout
value/timestamp. I did NOT want the worker threads themselves to have to
traverse through the per-sockets. Any traversal could visit
tens of thousands of per-sockets. That's tens of thousands of expensive
rw-mutex acquire/release operations. That is a boatload of atomic
operations and memory barriers.
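
In code, my reading of that two-counter scheme is roughly the following
(a sketch; all names are invented):

// Workers bump 'cur' on any I/O; a dedicated timeout thread compares
// 'cur' against the value it sampled on its previous pass.
#include <winsock2.h>

struct PerSocket {
    SOCKET        s;
    volatile long cur;   // incremented (non-atomically) by workers on I/O
    long          old;   // timeout thread's previous sample
};

// Called by workers on every send/recv completion; a plain, unlocked
// increment is fine here because the count doesn't need to be accurate.
inline void NoteActivity(PerSocket* ps) { ps->cur = ps->cur + 1; }

// One pass of the dedicated timeout thread over the active list.
void TimeoutPass(PerSocket* list, int n) {
    for (int i = 0; i < n; ++i) {
        long now = list[i].cur;      // plain load: no lock, no barrier
        if (now == list[i].old) {
            // No I/O since the last pass: idle for at least one interval.
            // Real code would lock/defer before tearing the socket down.
            closesocket(list[i].s);
        }
        list[i].old = now;
    }
}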


http://groups.google.com/group/comp.programming.threads/msg/7c15b3488642d1a6?hl=en
(lock-free vs. lock-based overheads)

http://groups.google.com/group/comp.programming.threads/msg/a3e38e27df971e0d?hl=en
(lock-free vs. lock-based overheads)

http://groups.google.com/group/comp.programming.threads/msg/ec103670e323111a?hl=en
(zero-overhead queues vs. atomic-op/membar-based queues)

http://www.rdrop.com/users/paulmck/RCU/lockperf.2004.01.17a.pdf
(RCU vs. locking schemes)


So, I moved the timeout system to a dedicated thread. However, I did not
want the timeout thread to have to lock the rw-mutex for each per-socket.


http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce
(rw-mutexs. when to use?)

http://groups.google.ca/group/comp.lang.c++.moderated/msg/a8ef173bee4f0c03
("real" costs of locking)


The server's per-socket timeout "polling" intervals could be adjusted at
runtime by an administrator via a GUI, so the locking frequency was "random".
If the timeout was fairly small, the thread would be persistently
interrupting lock-holders that were doing "real work", as opposed to
simply loading a monotonic counter. The side effects of blocking (e.g.,
context-switching, lock-convoy, possible priority-inversion, etc.) and/or
frequent atomic-op/membar usage would have a negative impact on your server's
scalability/throughput/performance characteristics in general.

So, I solved the problem with a solution to the reader/writer problem:


http://groups.google.ca/group/comp.lang.c++.moderated/msg/e8ed4f5f000c336a
(contains brief description of reader/writer problem)

http://home.comcast.net/~vzoom/demos/pc_sample.c
(one of my lock-free collectors)

http://appcore.home.comcast.net/
(my old AppCore library; Sun and Intel both reference my code/algorithms wrt
lock-free algorithms in general:

- http://www.intel.com/cd/ids/developer/asmo-na/eng/151201.htm?page=4
(An Intel Press reference to my AppCore library)

- http://developers.sun.com/solaris/articles/chip_multi_thread.html#2
(A Sun reference, section 2.2.1.1.2, to my AppCore library)


http://groups.google.ca/group/comp.lang.c++.moderated/msg/0032f3cc0a6d75f0




Here is some generic info on proxy gc:

http://groups.google.com/group/comp.programming.threads/msg/41f29efe33e7f124?hl=en
(brief description)

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f80fb2901b42e447/9fe3ee93d3729321?lnk=raot&hl=en#9fe3ee93d3729321
(discussion on GC)

http://groups.google.ca/group/comp.lang.c++/msg/b3e1d10e219908f1?hl=en&
(I describe a technique for virtually zero-overhead garbage collection at
the end of this msg)
Post by Peter Duniho
Post by j***@comcast.net
and separate processors use "preferred" ranges of memory (anyone
hear of that one before? at least the first one makes sense),
Never heard about the "preferred ranges" thing, but then there are
probably some esoteric things I have yet to run across. It sounds a
little like some of the issues I discovered doing multi-threaded code on a
Hyperthreaded CPU. With HT (and I believe with the new Intel "dual-core"
CPUs), the execution units share a cache. Because of the way that cache
lines are assigned to specific memory addresses, and because Windows by
default always starts each thread's stack at the same alignment, you can
really kill performance on a HT CPU if each thread is doing practically
the same thing (as is common with certain kinds of multi-threaded
applications :( ). Each thread winds up trashing the cache of the other
thread, killing the performance of both threads.
You can imagine my surprise the first time I ran into this, where I wound
up with my throughput being reduced by about a third running with two
threads versus one on a HT CPU, rather than getting some performance gain.
(In the end, it turned out that even optimizing for the HT case, my best
performance gain was under 20%...I decided it was better to not bother
trying to treat a HT CPU as a true dual-proc setup). Quite a
head-scratcher until I did some research.
A simple hack for this is to offset your worker threads' stacks via alloca.
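
E.g., something like this (a sketch; the 1 KB step size is an
assumption you'd tune to the cache geometry):

// Stagger each worker thread's stack with _alloca so identical threads
// don't land their hot stack data on the same cache lines.
#include <windows.h>
#include <malloc.h>     // _alloca

extern unsigned RunWorkerLoop(int index);   // hypothetical worker body

unsigned __stdcall Worker(void* arg)
{
    int index = (int)(INT_PTR)arg;          // 0, 1, 2, ... per thread
    volatile char* pad = (volatile char*)_alloca((index + 1) * 1024);
    pad[0] = 0;                             // touch it so it isn't elided
    return RunWorkerLoop(index);
}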
Post by Peter Duniho
Post by j***@comcast.net
multithreading costs more than any benefit you'd gain, I start feeling
like I'm wasting my time. But then again, Microsoft uses different
binaries for single-processor machines vs. SMP machines, so they might
be on to something.
There are multiple reasons to use different binaries, but the biggest is
licensing. Customers pay more for SMP-enabled software, so it doesn't
make sense to hand them the SMP-enabled version if they've only paid for
the single-CPU version. Otherwise, you wouldn't really need multiple
binaries...you'd just need to check the number of CPUs and/or let the user
tell you whether to try to take advantage of multiple CPUs.
The fact that there are SMP-enabled versions of Microsoft software *ought*
to tell you that it's worthwhile writing multi-threaded code. :) You
might point that out to your co-workers. :)
The fact that they have a separate atomic operation implementation for
uni-processor (i.e., NO lock prefix) should tell you that the lock prefix is
expensive.

;)


http://groups.google.com/group/comp.programming.threads/msg/aeb7e857ef440520
("real" costs of lock-prefix)


:O
Post by Peter Duniho
As far as whether multi-threading costs more than any benefit you'd gain,
that depends entirely on the situation. It is true that there are many
applications for which multi-threading is unlikely to provide much benefit
at all. But it is also true that there are many applications for which
multi-threading provides a significant benefit, either performance-wise,
architecturally, or both.
As with any software paradigm or technology, multi-threading is no silver
bullet. It's a specific tool to be used in specific situations, where
it's practical and worthwhile. It's not always the right thing to do, and
neither is it always the wrong thing to do.
Agreed.




Any thoughts?
Peter Duniho
2006-05-08 02:04:57 UTC
Post by Chris Thomasson
[...]
However, sometimes there are situations where one may need to synchronize
"critical" shared data-structure(s).
An analysis of your specific situation is outside the scope of this
newsgroup, and would not offer much insight into the general issue anyway.

Suffice to say, yes...locking can be expensive, but there are plenty of
multi-threaded servers that gain a sufficient throughput advantage to
justify the difficulties of dealing with the various issues that
multi-threaded code causes.
Post by Chris Thomasson
[...]
Simple hack for this is to offset your worker threads stacks via alloca.
Yes, there are ways to work around the HT issues. As I mentioned, I was
able to optimize my code for the HT case, avoiding the performance problems.
That should have been a clue that such workarounds exist. :)

I didn't bother to state them here, because -- again -- that's not really
within the scope of this newsgroup and wasn't relevant to my points.
Post by Chris Thomasson
[...]
The fact that they have a separate atomic operation implementation for
uni-processor (i.e., NO lock prefix) should tell you that the lock prefix
is expensive.
;)
As I mentioned, there isn't just one reason for providing separate binaries
for single-threaded and multi-threaded implementations. You can't infer
from the existence of a multi-threaded implementation that locking is
prohibitively expensive. Obviously in the multi-threaded case, it's not.

The point here is that anyone who claims that multi-threaded code is never
worth the trouble of implementation is full of it. Sometimes it's worth it,
sometimes it's not.

Though, I suppose that one exception would be if the programmer is a
terrible programmer. In that case, probably best to stay away from
multi-threaded code altogether in all cases. Winsock might be something to
be avoided as well. There's already too much sucky multi-threaded network
code out there as it is.

Pete
codemutt
2006-05-09 16:17:10 UTC
"Reading and writing to I/O devices is slower
than processing data and reading and writing memory."

I guess that's a server that doesn't do very much processing on data or
data requests.

Anyway at the company I work for IOCP is used for all servers that have
to move large amounts of data, whether using TCP or UDP or UDP
multicast.

For large numbers of clients it's a mistake to use any of the less
scalable I/O models. Using select or WSAAsyncSelect would be a catastrophe
if there were 10000 clients (select doesn't scale worth a dang in
Windows and WSAAsyncSelect relies on the message pump for network I/O
notification, and who wants a GUI mechanism in their network I/O code
path?).

Microsoft advertises IOCP as the best solution for socket I/O on
high-usage servers. They own the code, not the user (i.e., us). Debating
locking mechanism latency for IOCP threads outside of having real
source code knowledge is academic at best.

But the best teacher is experience. Try an IOCP implementation and
select(). See how many clients you can connect. See how much data you
can actually jam through.
Dieter Goepferich
2006-05-11 08:15:38 UTC
Hi,

I personally can confirm the performance results stated in the MS book
about network programming in Windows as I did some performance testing
on my own.

IOCP is by far the best method for servers that have to deal with a big
number of clients and need to have highest throughput.

The IOCP code, though, might not be as easy to maintain as an "easy"
server that spawns a thread for every client (but that's not feasible
for maybe 100 or more clients).

IOCP is *the* scalable approach, though some pitfalls may come into
play when using multiple I/O-worker threads.

To be honest, I didn't find any satisfactory solution for the issue when
two or more threads send to the same (target) socket. That's a really
hard thing for me to solve:
- thread #1 sends a block of data which cannot be sent in one call.
Partially sent data has to be re-sent. If thread #1 gets preempted
when attempting to resend the rest of the partially sent data, and
thread #2 sends another bucket of data to the very same socket,
the data will be mixed up at the receiver's end.

But that's not the only problem with IOCP....

Regardless: IOCP is the best-performing approach for Windows!
b***@gmail.com
2006-05-11 08:56:12 UTC
Hi,

Taking this on a slight tangent here.
Post by Dieter Goepferich
To be honest, I didn't find any satisfactory solution for the issue when
two or more threads send to the same (target) socket. That's a really
hard thing for me to solve:
- thread #1 sends a block of data which cannot be sent in one call.
Partially sent data has to be re-sent. If thread #1 gets preempted
when attempting to resend the rest of the partially sent data, and
thread #2 sends another bucket of data to the very same socket,
the data will be mixed up at the receiver's end.
So how do you ever do multiple overlapped sends on the same socket?
Let's say you do 2 overlapped sends on the same socket, and the first
one is only partially done....what happens to the second send?

Even if you impose packetization on your data, the only way you can
handle it is at the protocol level (if the protocol is designed so the
client checksums incoming packets and rejects garbled data and requests
a resend...which obviously won't work if your client app is a web
browser for instance, and is terribly wasteful anyway).


If this is true, it looks to me like posting multiple overlapped sends
makes absolutely no sense. And since posting multiple overlapped recvs
is subject to the overhead of determining which one of them was posted
first (i.e. reordering the data since recv completion notifications may
come out of order, the recvs being posted in order), overlapped I/O
ends up looking pretty crippled!

Could anybody comment on this situation? I've been thinking about it
for a couple of days now and seized the opportunity to post when Dieter
made his comment on a related topic.

Thanks!
Dieter Goepferich
2006-05-11 09:33:01 UTC
All multiple send/recv operations can run into trouble only when more
than one IOCP-worker thread is used, i.e. threads that issue
GetQueuedCompletionStatus() and thereafter make WSARecv()/WSASend() calls.
Post by b***@gmail.com
So how do you ever do multiple overlapped sends on the same socket?
I don't post multiple sends on the *same* socket, but different sockets
are perfectly ok!
Post by b***@gmail.com
Even if you impose packetization on your data, the only way you can
handle it is at the protocol level (if the protocol is designed so the
client checksums incoming packets and rejects garbled data and requests
One could handle this at the app layer, but that would degrade
throughput and performance more than simply preventing the server from
doing multiple sends on the very same socket.
Post by b***@gmail.com
If this is true, it looks to me like posting multiple overlapped sends
makes absolutely no sense.
Yes it does. But care must be taken when doing so to the same sockets. I
decided to post only one send to a socket at a time (so I don't
encounter the problem)

And since posting multiple overlapped recvs
Multiple recvs ain't a problem:
- with tcp you have to apply some packetizing scheme anyway.
- data are delivered in the order they were received. But the preempting
problem still persists
Post by b***@gmail.com
Could anybody comment on this situation? I've been thinking about it
I would appreciate that, too, indeed!
b***@gmail.com
2006-05-11 10:16:06 UTC
Hi,
Post by Dieter Goepferich
All multiple send/recv operations can run into trouble only when more
than one IOCP-worker thread is used, i.e. threads that issue
GetQueuedCompletionStatus() and thereafter make WSARecv()/WSASend() calls
Not sure about that....let's say I'm writing a proxy server, then it may
well post a read to a remote server, do a WSASend to the client, and
immediately post another read to the remote server and end up doing
another overlapped WSASend (subject to bandwidth throttling, and
perhaps a limit of outstanding WSASends on a socket). That would result
in a similar situation.
Post by Dieter Goepferich
Post by b***@gmail.com
So how do you ever do multiple overlapped sends on the same
socket?
Post by Dieter Goepferich
I don't post multiple sends on the *same* socket, but different sockets
are perfectly ok!
Right....but then that misses out on the basic purpose of Overlapped
I/O...we're back to good old Async I/O when we do that :)
Post by Dieter Goepferich
One could handle this at the app layer, but that would degrade
throughput and performance more than simply preventing the server from
doing multiple sends on the very same socket.
Very much so!
Post by Dieter Goepferich
Post by b***@gmail.com
If this is true, it looks to me like posting multiple overlapped sends
makes absolutely no sense.
Yes it does. But care must be taken when doing so to the same sockets. I
decided to post only one send to a socket at a time (so I don't
encounter the problem)
Yep...so far I have only been doing that, but now, since I'm considering
posting multiple overlapped sends on the same socket...the question
arises :)
Post by Dieter Goepferich
- with tcp you have to apply some packetizing scheme anyway.
- data are delivered in the order they were received. But the preempting
problem still persists

Right....the problem is that one needs to *cache* the read data that
comes out of order until all prior recvs posted have completed, so you
can send all of it.

Adds needless complexity so I just skipped it.
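
For what it's worth, the bookkeeping would look roughly like this (a
sketch with invented names; per-socket state and locking omitted):

// Tag each posted WSARecv with a sequence number; hold completed
// buffers back until every earlier sequence number has completed.
#include <winsock2.h>
#include <map>
#include <vector>

extern void Deliver(const std::vector<char>& data);  // hypothetical

struct RecvOp {
    WSAOVERLAPPED ov;   // first member, so OVERLAPPED* casts back to RecvOp*
    DWORD         seq;  // order in which this recv was posted
    char          buf[4096];
};

DWORD g_nextToDeliver = 0;
std::map<DWORD, std::vector<char> > g_pending;   // completed but early

void OnRecvComplete(RecvOp* op, DWORD bytes)
{
    g_pending[op->seq] = std::vector<char>(op->buf, op->buf + bytes);
    // Deliver everything that is now contiguous.
    std::map<DWORD, std::vector<char> >::iterator it;
    while ((it = g_pending.find(g_nextToDeliver)) != g_pending.end()) {
        Deliver(it->second);
        g_pending.erase(it);
        ++g_nextToDeliver;
    }
}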


Thanks for the response :)
codemutt
2006-05-11 13:36:43 UTC
Post by Dieter Goepferich
To be honest, I didn't find any satisfactory solution for the issue when
two or more threads send to the same (target) socket. That's a really
hard thing for me to solve:
- thread #1 sends a block of data which cannot be sent in one call.
Partially sent data has to be re-sent. If thread #1 gets preempted
when attempting to resend the rest of the partially sent data, and
thread #2 sends another bucket of data to the very same socket,
the data will be mixed up at the receiver's end.
For multiple threads to the same socket try this: dump the data into a
queue or growable buffer from the multiple threads. Then send from that
queue from a different single thread. That way the source data remains
in order even if there is a temporary send buffer shortage in Winsock.
During that temporary buffer shortage you can also continue to put data
into the queue. Some might argue that you now have some contention
issues, but ultimately Winsock also has to control access to the socket
buffers so contention is unavoidable in either case when multiple
threads are used to write to a single socket (which is a common
requirement in servers, by the way). You can handle the contention or
Winsock will handle it for you. Of course there is an extra context
switch but in busy servers using IOCP that doesn't seem to have a
dramatic effect.
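
In sketch form, that looks something like this (names invented; a
blocking send is shown for brevity where a real server would post
overlapped sends):

// Worker threads enqueue; one dedicated thread sends, so partial sends
// can never interleave and the stream stays in order.
#include <winsock2.h>
#include <windows.h>
#include <deque>
#include <vector>

struct SendQueue {
    CRITICAL_SECTION cs;
    HANDLE           hasData;              // auto-reset event
    std::deque<std::vector<char> > bufs;

    SendQueue() {
        InitializeCriticalSection(&cs);
        hasData = CreateEvent(NULL, FALSE, FALSE, NULL);
    }
    void Push(const char* p, int n) {      // called from any worker thread
        EnterCriticalSection(&cs);
        bufs.push_back(std::vector<char>(p, p + n));
        LeaveCriticalSection(&cs);
        SetEvent(hasData);                 // wake the sender
    }
};

// Pops one buffer at a time and loops until every byte of it is on the
// wire before touching the next one.
void SenderLoop(SOCKET s, SendQueue& q) {
    for (;;) {
        WaitForSingleObject(q.hasData, INFINITE);
        for (;;) {
            EnterCriticalSection(&q.cs);
            if (q.bufs.empty()) { LeaveCriticalSection(&q.cs); break; }
            std::vector<char> buf = q.bufs.front();
            q.bufs.pop_front();
            LeaveCriticalSection(&q.cs);

            size_t sent = 0;
            while (sent < buf.size()) {    // finish any partial send
                int n = send(s, &buf[sent], (int)(buf.size() - sent), 0);
                if (n == SOCKET_ERROR) return;   // error handling omitted
                sent += n;
            }
        }
    }
}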

Also, queuing received data to a single thread (from however many IOCP
threads) separates pulling data off the socket from the processing of
that data, and allows you to process in a single thread, avoiding
multithreading issues that develop if you process in the context of an
IOCP thread. This is surprisingly efficient, and I have seen servers
implemented this way process incredible amounts of data MORE
EFFICIENTLY than a server that does the same work but is implemented
with multiple threads.

It's fun to experiment -- which also defeats mere opinion and provides
the real answer.
bluethought
2006-05-12 00:50:28 UTC
Interesting to hear...I was using a single thread to queue the posting
of Recvs/Sends, but not the processing of Recvs...

However you're still stuck to a single send-per socket from your single
thread, aren't you? I.e. if you do say two sends per socket from your
single thread, and the first send partially succeeds, what will happen
to the second? Would that send fail, or would it succeed, thus garbling
data?
codemutt
2006-05-18 12:52:07 UTC
Post by bluethought
Interesting to hear...I was using a single thread to queue the posting
of Recvs/Sends, but not the processing of Recvs...
However you're still stuck with a single send per socket from your single
thread, aren't you? I.e. if you do, say, two sends per socket from your
single thread, and the first send partially succeeds, what will happen
to the second? Would that send fail, or would it succeed, thus garbling
data?
I guess in theory it's possible when using overlapped sending that you
could see one overlapped result show that less data was delivered to
the kernel (AFD.sys) than you requested, and then the next overlapped
result shows that all the data was delivered to the kernel.

Are you thinking that you could get a WSAENOBUFS on the first send
after some of the data was copied, but then AFD clears some space and
the next overlapped send comes through and succeeds?

I don't see a remedy for such a case.

I have to say though that I haven't seen evidence of this causing major
problems in production environments. I have seen lack of buffer space
cause problems, but no buffer space usually indicates serious backups
of data flow with the peer, and the seriousness of that winds up
causing definite (not theoretical) issues.

I think not much time will elapse between getting a WSAENOBUFS on an
overlapped result and seeing that return value directly on the send
call. A queuing object can be used (and is used where I work) to begin
collecting unsent data the moment WSAENOBUFS pops up, either on the
send or the overlapped result call. Properly synchronizing this means
that you can preserve your stream.

Unless what you are theorizing about actually happens. Then your stream
is hosed. And I can think of thread scheduling scenarios that could
bust even an attempt to synchronize access to a temporary holding
buffer (as I'm suggesting above) if error notification is not
coordinated between GetOverlappedResult and the Winsock functions.

I don't want to say what you're suggesting can't happen. But I will say
experience with IOCP in our servers for tens of thousands of hours in
production indicates the problem as you pose it doesn't seem to crop
up. Buffer exhaustion happens but our application queuing system seems
to handle it without stream corruption.

Perhaps the error notification with Winsock calls is indeed tied in to
error notification on GetOverlappedResult. If Microsoft didn't
coordinate that, I think what you're theorizing might show up a lot,
rather than rarely or maybe not at all.

I'm just waving my arms though. I've got no definitive proof.
j***@comcast.net
2006-05-12 00:21:12 UTC
Post by Peter Duniho
Inefficient compared to what? In general, multi-threading on a multi-CPU
computer, when coded correctly, provides the expected performance increase.
Clearly synchronization (whatever method one chooses to use) does not HAVE
to have a significant effect on performance.
Inefficient compared to not using a lock prefix if you don't need it.
Here's a thread that gets into lock vs non-lock performance:

http://softwareforums.intel.com/ids/board/message?board.id=42&message.id=219
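
A crude way to see it locally (a sketch; note that a single-threaded
loop measures only the instruction cost, not cross-CPU contention):

// Compares a locked read-modify-write against an unsynchronized one.
#include <windows.h>
#include <stdio.h>

int main()
{
    const int N = 10 * 1000 * 1000;
    volatile LONG locked = 0;
    volatile LONG plain = 0;

    DWORD t0 = GetTickCount();
    for (int i = 0; i < N; ++i)
        InterlockedIncrement(&locked);     // lock-prefixed RMW
    DWORD t1 = GetTickCount();
    for (int i = 0; i < N; ++i)
        plain = plain + 1;                 // no lock prefix
    DWORD t2 = GetTickCount();

    printf("locked: %lu ms, plain: %lu ms\n",
           (unsigned long)(t1 - t0), (unsigned long)(t2 - t1));
    return 0;
}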
Post by Peter Duniho
As with any software paradigm or technology, multi-threading is no silver
bullet. It's a specific tool to be used in specific situations, where it's
practical and worthwhile. It's not always the right thing to do, and
neither is it always the wrong thing to do.
It'd be easy to prove that the current app, using a single-threaded
select loop, isn't a high-performance server. But your "when coded
correctly" and "does not HAVE to have a significant effect on
performance" caveats concern me too. I'd look really bad if I convinced
mgmt to let us write a multi-threaded IOCP server that turned out to be
not much faster than the current one, whether because of bad
programming decisions or because MT isn't a practical, worthwhile tool
in our case and I didn't know that before starting.

Guess I'm trying to think of the tradeoffs involved in MT programming,
and how you'd know when you're tipping too far in one direction. You
need synchronization, but you want to minimize it. That impacts your
decision on which objects/data structures can be concurrently accessed.
Your locks can't be too coarse, but they can't be too fine either. IOCP
manages a set of worker threads, which are free to block on other I/O
or locks while processing requests, but sometimes you need dedicated
threads that never make it on to the IOCP's list. When does the number
of these dedicated threads and the IOCP's runnable threads get to the
point where there's too much context switching overhead? And the list
goes on.

I know, hard to say without actually building it...just thinking out
loud.
Peter Duniho
2006-05-12 03:11:14 UTC
Post by j***@comcast.net
[...]
I know, hard to say without actually building it...just thinking out
loud.
Well, for what it's worth, at least you're coming up with the questions that
need to be asked. Too many programmers just dive right in to adding
multi-threading without understanding that there are costs involved, and
that doing it wrong could result in worse results (incorrect and/or lower
throughput) compared to a single-threaded solution.

For me personally, if my server was operating at 90% of the theoretical
maximum (once network overhead is accounted for), I'd call that good and
forget about trying to improve the server throughput. IOCP and
multi-threaded code in general have a lot of potential pitfalls. None are
insurmountable, obviously, but unless you're doing it just for your own
personal edification (and it doesn't sound like you are :) ), you need to
weigh the significant cost of design and implementation against potential
benefits. A 10% gain in throughput probably doesn't warrant the development
cost, since you can easily get that kind of gain just by spending a little
more money on hardware (at a far lower cost).

Conversely, a 100% gain in performance (ie a server currently running at 50%
of theoretical capacity) could easily justify the development costs, since
doubling performance with hardware improvements costs a LOT, and the
investment in the technology will benefit you as your client count expands
and as you inevitably improve your hardware and network connection.

Somewhere in between is a break-even point, and where that lies specifically
depends on your comfort level with the multi-threading and IOCP aspects,
your employer's ability to incur a short-term cost for a long-term benefit,
and other factors.

IMHO, I wouldn't worry too much about the overhead of multi-threading. If
your server is underperforming, and you can run it on a multi-processor
computer, it will benefit from multi-threading. Yes, you'll need to be
careful to learn how to write correct multi-threaded code, but for a
competent programmer these are not major problems. It just requires
learning some new parameters for your existing coding skills.

One thing I will point out: as far as I know, IOCP is the only Winsock
paradigm that offers excellent performance for arbitrarily large numbers of
clients. Someone else mentioned completion routines, and I have to admit
that I wonder why those aren't used more often (as near as I can tell,
everyone avoids them, but I've never seen anyone say why). But otherwise,
all of the other notification schemes either rely on using the message queue
(which is not optimized for dealing with network i/o...that may or may not
be a real problem), on using Windows event handles (limited to 64 per
thread, which results in an excessively large number of threads as the
number of clients goes up), or on an extremely inefficient method of
determining which socket was actually in need of servicing (that is, using
select()).

If you know that you will always have a small number of clients, it's
possible you can get acceptable throughput with one of the other paradigms.
But if you're expecting to have to deal with tens of thousands of clients, you
need something that won't bog down at those numbers, and as near as I can
tell, that's IOCP. At that point, the question of having to learn IOCP and
multi-threading becomes moot; it's something you simply have to do if you
want a server that doesn't fall apart when heavily loaded.

One last thing, just to complicate matters: I finally got around to taking a
look at the quoted text mentioned earlier in this thread. One thing they
discuss is the difference between their implementation of the server and one
in which the server attempts to drain the receive buffer for a given client
before moving on. IMHO, this *does* affect the results, and it means that
their results are mostly relevant for a situation where one expects all of
the clients to be busy all (or nearly so) of the time.

One area in which WSAAsyncSelect() suffers is dealing with processing a
large number of window messages, and it's likely that a situation where the
server can dramatically reduce the number of window messages by looping on a
receive until getting WSAEWOULDBLOCK (it would have to disable posting of
FD_READs by calling WSAAsyncSelect() without FD_READ enabled, while it
looped on the receive, and reenabling it after) could result in much better
performance.
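
That drain-the-buffer idea would look roughly like this (a sketch;
WM_SOCKET and OnData() are invented names and error paths are trimmed):

// Drain a socket until WSAEWOULDBLOCK, with FD_READ posting disabled
// while we loop so the message queue doesn't fill with notifications.
#include <winsock2.h>

#define WM_SOCKET (WM_USER + 1)
extern void OnData(SOCKET s, const char* p, int n);   // hypothetical

void HandleFdRead(HWND hwnd, SOCKET s)
{
    // Stop further FD_READ postings while we drain this socket.
    WSAAsyncSelect(s, hwnd, WM_SOCKET, FD_CLOSE);

    char buf[4096];
    int  n;
    while ((n = recv(s, buf, sizeof(buf), 0)) > 0)
        OnData(s, buf, n);

    // WSAEWOULDBLOCK just means we've drained it; anything else is real.
    if (n == SOCKET_ERROR && WSAGetLastError() != WSAEWOULDBLOCK)
        { /* handle/close */ }

    // Re-enable FD_READ; Winsock reposts a message if data arrived
    // while notifications were disabled.
    WSAAsyncSelect(s, hwnd, WM_SOCKET, FD_READ | FD_CLOSE);
}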

I don't know that fully optimizing a WSAAsyncSelect() implementation would
improve things enough to make it comparable to IOCP in that case (that is,
where clients are more "bursty", and not constantly streaming data), but
it's quite possible that it could. In any case, it's important to
understand the parameters of the quoted data, and whether and/or how it
applies to your own situation.

Pete
Chris Thomasson
2006-05-12 03:54:17 UTC
Post by Peter Duniho
Post by j***@comcast.net
[...]
I know, hard to say without actually building it...just thinking out
loud.
[...]
Post by Peter Duniho
One thing I will point out: as far as I know, IOCP is the only Winsock
paradigm that offers excellent performance for arbitrarily large numbers
of clients. Someone else mentioned completion routines, and I have to
admit that I wonder why those aren't used more often (as near as I can
tell, everyone avoids them, but I've never seen anyone say why).
Perhaps it had something to do with this:

http://support.microsoft.com/default.aspx?scid=kb;en-us;192569

or this:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/winsock/transmitfile_2.asp

The docs for the TF_USE_KERNEL_APC flag clearly state that there is a
possibility that APC's will be prevented from being called during certain
periods... Humm...
Peter Duniho
2006-05-12 06:12:38 UTC
Post by Chris Thomasson
Post by Peter Duniho
One thing I will point out: as far as I know, IOCP is the only Winsock
paradigm that offers excellent performance for arbitrarily large numbers
of clients. Someone else mentioned completion routines, and I have to
admit that I wonder why those aren't used more often (as near as I can
tell, everyone avoids them, but I've never seen anyone say why).
http://support.microsoft.com/default.aspx?scid=kb;en-us;192569
Only applies to Win95 and Win98, on which IOCP isn't an option anyway. If
you're coding for Win95, performance isn't your highest priority in the
first place, and this whole conversation is moot.
Post by Chris Thomasson
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/winsock/transmitfile_2.asp
The docs for the TF_USE_KERNEL_APC flag clearly state that there is a
possibility that APC's will be prevented from being called during certain
periods... Humm...
I suspect something *like* this is more likely to be the issue. It's hard
to tell, since even the TransmitFile docs suggest that in most cases, using
the kernel APC works fine and presumably there's some way to take advantage
of that for the non-TF case (haven't bothered to look into it myself).

Regardless, it's one of those things that ought to be better documented. If
there's some drawback to using completion routines (and it seems that there
probably is), then the Winsock documentation should say so!

Pete
jGoodchild
2006-05-13 08:13:27 UTC
Speaking as someone who has implemented a WinSock provider, not just
WinSock applications, I don't understand why there should be a problem
with completion routines unless there's an inherent problem in APCs.
And surely APCs are used much more widely for a variety of different
applications?

I can't believe I'm the only one who has implemented applications that
rely on completion routines! But I must admit that my applications are
not currently being used for tens of thousands of simultaneous
connections - hundreds would be more like it.
bluethought
2006-05-23 22:35:24 UTC
Post by Chris Thomasson
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winsock/winsock/transmitfile_2.asp
The docs for the TF_USE_KERNEL_APC flag clearly state that there is a
possibility that APC's will be prevented from being called during certain
periods... Humm...
[quote]
Use of TF_USE_KERNEL_APC can deliver significant performance benefits.
It is possible (though unlikely), however, that the thread in which
context TransmitFile is initiated is being used for heavy computations;
this situation may prevent APCs from launching. Note that the Winsock
kernel mode driver uses normal kernel APCs, which launch whenever a
thread is in a wait state (which differs from user-mode APCs, which
launch whenever a thread is in an alertable wait state initiated in
user mode).
[/quote]

Doesn't the above only hold for kernel APCs? And hence not be an issue,
since using completion routines in the context that we're talking about
would imply the use of alertable waits (SleepEx/WFMO etc), and hence
user-mode APCs?
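
For reference, the pattern in question looks roughly like this (a
sketch with invented names) -- the completion routine is a user-mode
APC and only ever runs while the posting thread is in an alertable
wait:

#include <winsock2.h>

static void CALLBACK OnRecvDone(DWORD err, DWORD bytes,
                                LPWSAOVERLAPPED ov, DWORD flags)
{
    // Runs in the thread that issued the WSARecv, but only while that
    // thread is sitting in an alertable wait.
}

void PumpOneSocket(SOCKET s, char* buf, ULONG len)
{
    WSAOVERLAPPED ov = {0};
    WSABUF wb;
    wb.len = len;
    wb.buf = buf;
    DWORD flags = 0;
    WSARecv(s, &wb, 1, NULL, &flags, &ov, OnRecvDone);

    // SleepEx with bAlertable=TRUE lets queued user-mode APCs fire; a
    // non-alertable wait would never run the completion routine.
    while (SleepEx(INFINITE, TRUE) == WAIT_IO_COMPLETION)
        ;   // real code would repost the recv from inside OnRecvDone
}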

This newer thread on the same topic is perhaps also worth a look:
http://groups.google.com/group/alt.winsock.programming/browse_frm/thread/61618a770c9a92e4
b***@gmail.com
2006-05-11 09:06:23 UTC
Post by j***@comcast.net
Their main point is that multiple threads don't provide much benefit
for an I/O-bound application, since most server configurations will
have only one NIC, and that is the "great serializer". On the surface,
that seems convincing. Reading and writing to I/O devices is slower
than processing data and reading and writing memory. So, if the
application receives a client request, and then can process it and send
a response before the next request is available to be read, the app
might as well be single-threaded. Is this true?
To add to what the other people here have to say, IOCP *can be* and
very often *is* used in a mode which sets the maximum count of
simultaneous threads active at any given instant to be the number of
logical CPUs on the system (2 for dual core/dual cpu as an example).

This allows a single thread to be running on a single CPU at any given
time (except for a rare condition in which two threads run on a single
CPU when one of the threads has just woken up from a wait state).

The basic problem here is that threads do enter wait states...and IOCP
ensures that when a thread enters a wait state, another thread from the
thread pool can wake up and handle other responses while the first
thread waits. Other tricks like using LIFO threads in the thread pool
to minimize overheads optimize this approach.
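
As a sketch (error handling trimmed), the setup and worker loop
described above look like this:

#include <winsock2.h>
#include <windows.h>

HANDLE g_iocp;

void InitIocp()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    // Passing 0 also means "number of processors"; shown explicitly.
    g_iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0,
                                    si.dwNumberOfProcessors);
}

DWORD WINAPI IocpWorker(LPVOID)
{
    DWORD       bytes;
    ULONG_PTR   key;
    OVERLAPPED* ov;
    for (;;)
    {
        // Windows wakes at most "concurrency value" threads at once,
        // in LIFO order, which keeps the active threads cache-warm.
        BOOL ok = GetQueuedCompletionStatus(g_iocp, &bytes, &key, &ov,
                                            INFINITE);
        if (!ok && ov == NULL)
            break;              // port closed (or the wait itself failed)
        // !ok with ov != NULL means that particular I/O failed.
        // ... dispatch on 'key'/'ov' to the per-socket handler ...
    }
    return 0;
}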

All the other network methods mentioned (select, WSAAsyncSelect,
WSAEventSelect etc) are subject to problems of scale - their internals
(the way they work and are implemented) make them very difficult to use
to handle thousands of connections (as mentioned by other posters in
this thread).

So IOCP is pretty much the first framework (perhaps with the exception
of completion routines - I don't have much experience with them, so I
can't say) to even offer such possibilities of scalable execution using
just one thread *at a time* and minimizing context switches.

Hope that gives you a better idea/more ideas to examine...while my
grasp of the internals / nuances isn't very high, my above
observations/conclusions from what I've seen and read should offer you
some starting points to work with.
d***@webmaster.com
2006-05-16 22:17:23 UTC
I have pretty much spent the last six years on just this question. If
it helps, you're all right and all wrong.

It is harder to develop a high-performance multi-threaded networking
application on WIN32. However, if you need the highest performance that
platform is capable of, you have no choice. IOCP is it.

The main thing the influential people are missing is major page
faults. They happen more often than you might think, you can't avoid
them, and they are fatal to single-threaded servers.

Consider a server that's at close to capacity. One client somehow
triggers an unusual condition in the server code. The server code has
to run a handler that very rarely runs -- in fact, code that has either
never been faulted in or has been faulted out. The entire server is now
useless until
the disk can read in the page(s) for that piece of code, and it repeats
when one rarely-used function calls another.

Meanwhile, load starts backing up. Due to the backup, even once the
page faults are serviced, the server will run slower than it did
before. Data that was normally processed twice while still hot in the
cache is now out of the cache by the time you get back to it. The
server as a whole is less efficient, and the load backs up more,
causing the code to run slower, and boom, you've spun out.

Multi-threaded servers are not immune to increasing load causing
decreasing efficiency. They just resist it much better than
single-threaded ones, allowing the maximum useful load of a
multi-threaded server to be much higher than a single-threaded one.

Similar problems can occur with reads from local disks. For example, a
web server is quite a challenge to develop with a single thread. The
details may depend upon the precise load characteristics of the
proposed server.

However, once you get your multi-threaded core working well and using
IOCP, you can put any load mix on it and not have to worry.

Note that the single-thread versus multi-thread issue is often
conflated with UP versus SMP. They are two different issues. The OS
will be running a version based on the hardware, and if you have two
CPUs, it's pretty dumb (hyper-threading is a different issue, though)
to run a UP kernel just because a UP kernel is more efficient -- it's
not twice as efficient.

DS
codemutt
2006-05-18 13:01:09 UTC
Permalink
Great point!

Page faults in Windows servers are a considerable performance issue.
And it doesn't have to be paging to disk either; just having to jump
from one page of physical memory to another causes significant
performance degradation.

The sad part is that the same server process can have different rates
of page faults between runs, even when the same code paths are being
executed. It's a very frustrating mystery when you're trying to
maximize performance.

Certain kinds of servers (depending on the job they are doing) will run
better with a single thread processing data that has been taken off the
network by multiple IOCP threads. As long as the server isn't waiting
on I/O to process requests, single thread processing can be more
effective.
Peter Duniho
2006-05-18 15:11:08 UTC
Permalink
[...] As long as the server isn't waiting
on I/O to process requests, single thread processing can be more
effective.
IMHO, the *only* time you'll see a performance advantage from
correctly-written multi-threaded code versus correctly-written
single-threaded code is when you have additional CPUs (one or more) to take
advantage of the additional thread. Otherwise, you just waste time doing
things that are irrelevant.

Note my liberal use of the phrase "correctly-written". This means that in
the single-threaded case, your code never gets stuck waiting on something
(like i/o, as you mention :) ) when it could have been doing useful work.
In the multi-threaded case, it means that and much more (as already
described elsewhere).

Yes, paging is an issue, but that's not unique to multi-threaded code.
Multi-threading just gives you more opportunity to have different execution
paths fighting for physical RAM.

Pete
d***@webmaster.com
2006-05-23 03:50:12 UTC
Permalink
Post by Peter Duniho
Yes, paging is an issue, but that's not unique to multi-threaded code.
Multi-threading just gives you more opportunity to have different execution
paths fighting for physical RAM.
I'm not sure whether you misunderstood my argument or just ignored it.
You are wrong.

A single-threaded server, no matter how well written, will suffer from
poor performance when code has to fault in. This can occur when a
client triggers a very infrequently executed code path.

A multi-threaded server will continue to run the frequently-executed
code even as one thread runs slowly because it has to fault in code.

DS
Peter Duniho
2006-05-23 04:26:52 UTC
Permalink
Post by d***@webmaster.com
I'm not sure whether you misunderstood my argument or just ignored it.
By the time codemutt got around to replying, I had already forgotten the
topic you had written about. My apologies. Still...
Post by d***@webmaster.com
You are wrong.
You sure like saying that. Does it make you feel good?

I agree that the issue you describe exists. I also agree multithreading can
address it as you describe.

However, it's not incorrect to point out that multithreading can also cause
contention issues, including contention for physical RAM. If you have
sufficient RAM, you can take advantage of one thread continuing to work
while the other pages its needed code in. If you don't, then having
multiple threads fighting with each other for physical RAM can hurt
performance even more than allowing one code path to fully complete
before switching (in a single thread) to another. Sure, the server *ought*
to have enough RAM, but not everyone lives in a perfect world.

But thanks for your ever-diplomatic approach to conversations about coding.
You are nothing if not consistent.

Pete
d***@webmaster.com
2006-05-23 11:35:54 UTC
Permalink
I don't get it. I said that you were wrong, you agreed that you were
wrong, yet you are hostile that I pointed out that you are wrong. You
wrote:
Post by Peter Duniho
IMHO, the *only* time you'll see a performance advantage from
correctly-written multi-threaded code versus correctly-written
single-threaded code is when you have additional CPUs (one or more) to take
advantage of the additional thread. Otherwise, you just waste time doing
things that are irrelevant.
This is wrong, you will also see a performance advantage when the
correctly-written single-threaded code is slowed down by page faults.
Post by Peter Duniho
I agree that the issue you describe exists. I also agree multithreading can
address it as you describe.
So additional CPUs are not the only time you will see a performance
benefit with multi-threaded code. What you said is fundamentally
wrong. It is a serious misunderstanding of the whole point of
multi-threading. Seriously, making accidental blocking non-fatal is a
*major* benefit of multi-threading -- in many cases, it's the main
benefit. To completely ignore it, to act as if it didn't exist, is as
wrong as wrong can be.
Post by Peter Duniho
But thanks for your ever-diplomatic approach to conversations about coding.
You are nothing if not consistent.
I don't understand why stating that you are wrong when you are in fact
wrong is not diplomatic. Would you prefer I said "That is not true."?
Is that somehow nicer than "You are wrong"?

All the supposed problems with multi-threaded code are solvable. Some
easily, some not so easily. But the problems with single-threaded code
are fundamental, and you can do nothing about them. You must write every
last tiny bit of code such that it cannot ever block, on pain of a
disastrous loss of performance. This is, in many cases, far more work
than getting multi-threading right.

With multi-threading, it is nice to minimize blocking. You get more
performance if you let each thread run for as long as possible,
minimizing context switches. But for the 80% of the code that has no
measurable effect on performance, you needn't avoid even the most
remote possibility that you might block.

This coding ease more than pays back any difficulty in getting
multi-threading right. The fact that accidental or unavoidable blocking
does not kill performance is quite possibly the single largest benefit
of multi-threading. For non-CPU-bound tasks, it is the largest benefit.

DS
Peter Duniho
2006-05-23 16:48:30 UTC
Permalink
Post by d***@webmaster.com
I don't understand why stating that you are wrong when you are in fact
wrong is not diplomatic. Would you prefer I said "That is not true."?
Is that somehow nicer than "You are wrong"?
Well, perhaps if you had been clear about what part of my post you disagreed
with, I wouldn't have taken offense. I didn't realize you were only
applying your comment to the one incorrect *opinion* stated in my post,
rather than to the whole thing.

Again, it's the difference between making a blanket statement, and being
considerate enough (part of diplomacy) to restrict your statement to what
you actually mean.
Post by d***@webmaster.com
All the supposed problems with multi-threaded code are solvable.
I don't see how you address the overhead of synchronization. How is that
solvable?
Post by d***@webmaster.com
Some
easily, some not so easily. But the problems with single-threaded code
are fundamental and you can do nothing about it. You must write every
last tiny bit of code such that it cannot ever block, on pain of a
disastrous loss of performance. This is, in many cases, far more work
than getting multi-threading right.
In reality, the entire PC is "single-threaded" when there is one CPU. It is
theoretically possible to construct a single-threaded application such that
it initiates the paging in of code that would otherwise cause it to block before
actually trying to execute that code, and then goes off and does real work.
In the olden days of Windows, this is what we did when performance was an
issue.

I will grant that on an operating system, it's much easier and less
error-prone to just use multiple threads. But that doesn't mean that it's
impossible to solve the issue with a single-threaded server.
Post by d***@webmaster.com
[...]
This coding ease more than pays back any difficulty with getting
multi-threading right.
That is an entirely subjective claim, and it depends a lot on who is doing
the coding and on the type of server.

Pete
d***@webmaster.com
2006-05-23 20:50:47 UTC
Permalink
Post by Peter Duniho
I don't see how you address the overhead of synchronization. How is that
solvable?
You minimize it by using the right synchronization primitives and
minimizing shared state to only that which needs to be shared. For the
most part, you have to seriously mis-design an application for
synchronization overhead to be noticeable.
Post by Peter Duniho
In reality, the entire PC is "single-threaded" when there is one CPU. It is
theoretically possible to construct a single-threaded application such that
it initializes the paging in of the code that would cause it to block before
actually trying to execute that code, and then goes off and does real work.
In the olden days of Windows, this is what we did when performance was an
issue.
You can play with definitions so that it becomes impossible to
distinguish single-threaded code from multi-threaded code if you want.
Code such as you suggest does blur the boundaries. I submit that in
almost any real world case, you're better off using the threading
functions preferred by the OS.
Post by Peter Duniho
I will grant that on an operating system, it's much easier and less
error-prone to just use multiple threads. But that doesn't mean that it's
impossible to solve the issue with a single-threaded server.
The types of solutions you suggest are fundamentally equivalent to
putting your own threading in your application.

I think we're getting buried in trying to prove that we are right while
ignoring what's really important here. What is important is that in
many common real-world applications, multi-threading solves many major
issues even on a single-CPU system. A single-threaded server
application must be painfully engineered so that no bit of code
anywhere can ever block.

DS
Eric Jensen
2006-05-23 10:31:01 UTC
Permalink
I haven't read all the posts in this thread, but I would go for the
IOCP server no matter what.

I saw you mentioned select(), and select() is evil. select() is only
implemented in Windows sockets for portability. I never used select()
on a Windows-based system, only on Linux. Passing the socket arrays to
the kernel with select() so it can scan the sockets and return a result
to you is very slow. If you work out the amount of time it takes for
the kernel to scan a single socket, and then look at the time slicing
the OS does, it's clear that select() does not scale well.

When using IOCP with one or more threads there are some advantages.
Unlike select(), with IOCP we only post the receive and we get a
notification when the operation has completed; we don't need to pass
the socket array to the kernel each time we check. If two or more
threads serve the same IOCP, they all wait for a completion packet to
process; if there are no completion packets in the queue, the OS will
not give any time slices to the waiting threads.

I run my own server; it has about 9000 clients and it uses IOCP.
I see daily DDoS/DoS attempts, command flooding, and such, and I found
that an IOCP server handles this better. In my server I use IOCP worker
threads only for socket I/O; they don't handle anything else, because
their purpose is to make sure that new clients are accepted fast and
pending incoming I/O is removed from the OS as fast as possible, so we
don't run out of resources even when attacked. When there is inbound
data, it's put in a queue, and a single thread then processes the queue
by parsing the data and replying or dispatching the message. Finally, I
have one last thread that scans all sockets and closes the ones that
have timed out (ghosted connections and such).

All this works very well and scales well, and IOCP is actually very
easy to work with.
However, some influential people here claim using the IOCP
approach brings more pain than it's worth.
I would say the same about select(). IMHO, IOCP is far better to work
with than select().

//eric
Jeremy Chen
2006-05-24 22:49:37 UTC
Permalink
Post by Eric Jensen
[...] In my server I use IOCP worker threads only for socket I/O; they
don't handle anything else [...] When there is inbound data, it's put
in a queue, and a single thread then processes the queue by parsing the
data and replying or dispatching the message. [...]
Interesting... so when the server logic tries to send a piece of data,
how does it pass the operation to the IOCP threads -- through the IOCP
port again? If the logic submits multiple sends for the same connection,
how do you make sure the same IOCP thread handles them all (or do you
have only one IOCP thread)?

Thanks,
Jeremy
d***@webmaster.com
2006-05-24 23:30:29 UTC
Permalink
Post by Jeremy Chen
Interesting... so when the server logic tries to send a piece of data,
how does it pass the operation to the IOCP threads -- through the IOCP
port again? If the logic submits multiple sends for the same connection,
how do you make sure the same IOCP thread handles them all (or do you
have only one IOCP thread)?
You don't need to pass an operation to an IOCP thread, you can just
start it in whatever thread happens to be running. And you don't need
operations to be handled by the same IOCP thread since all IOCP threads
are the same. The problems you are mentioning just don't exist with
IOCP.

Any thread can start an operation with a non-blocking operation such
that the completion indication will be queued to the IOCP. Any IOCP
thread can pull the completion indication. So it doesn't matter.

DS
Jeremy Chen
2006-05-25 17:55:24 UTC
Permalink
Post by d***@webmaster.com
Post by Jeremy Chen
Interesting... so when the server logic tries to send a piece of data,
how does it pass the operation to the IOCP threads -- through the IOCP
port again? If the logic submits multiple sends for the same connection,
how do you make sure the same IOCP thread handles them all (or do you
have only one IOCP thread)?
You don't need to pass an operation to an IOCP thread, you can just
start it in whatever thread happens to be running. And you don't need
operations to be handled by the same IOCP thread since all IOCP threads
are the same. The problems you are mentioning just don't exist with
IOCP.
My post was based on the following statement in Eric's:
"In my server I use IOCP worker threads only for socket I/O".
Maybe I misunderstood his design, because I have seen such designs before.
Post by d***@webmaster.com
Any thread can start an operation with a non-blocking operation such
that the completion indication will be queued to the IOCP. Any IOCP
thread can pull the completion indication. So it doesn't matter.
DS
If I/O operations are not performed inside the IOCP threads, this is
obvious. But I was trying to understand how beneficial it is to have
multiple IOCP threads when they are only used to process completion
packets.

Jeremy
Eric Jensen
2006-05-25 09:39:48 UTC
Permalink
Post by Jeremy Chen
Interesting... so when the server logic tries to send a piece of data,
how does it pass the operation to the IOCP threads -- through the IOCP
port again?
IOCP threads are only used to process notifications upon completion.
In the case of a send, when the notification arrives the thread simply
deletes the OVERLAPPED object used for the operation, along with its
buffers. With IOCP you can have a per-handle pointer, which means you
can make a class or any object -- let's call it a user object -- and
this pointer is passed to you on each completion. In that object I have
in and out buffers, states, and such.
My protocol is string based, so when there is data on the socket and
one of the IOCP threads processes the notification, it checks for bad
characters and appends to the in-buffer in this object (or kills the
connection if there are bad characters), then it reposts the receive
call using the same overlapped object.
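
A minimal sketch of that repost step (hypothetical field names; this is
not Eric's actual code):

#include <winsock2.h>
#include <string.h>

typedef struct UserObject {
    SOCKET     s;
    OVERLAPPED recvOv;       /* reused for every receive on this socket */
    WSABUF     wsaBuf;
    char       recvBuf[4096];
    /* ... in/out buffers, parser state, etc. ... */
} UserObject;

void RepostRecv(UserObject *u)
{
    DWORD flags = 0, bytes = 0;

    memset(&u->recvOv, 0, sizeof(OVERLAPPED)); /* must be zeroed per call */
    u->wsaBuf.buf = u->recvBuf;
    u->wsaBuf.len = sizeof(u->recvBuf);

    if (WSARecv(u->s, &u->wsaBuf, 1, &bytes, &flags, &u->recvOv, NULL)
            == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING) {
        /* real failure: close the socket and let outstanding I/O drain */
    }
}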

When you're sending data on a socket that is bound to an IOCP port, you
use WSASend(). The kernel (the driver that handles this) will simply
copy the data you send into some internal buffer. In the case of
multiple sends from more than one thread, the data will be merged into
the system's internal buffer, or something like that (maybe not if you
disable the Nagle algorithm on the socket).
Post by Jeremy Chen
If the logic submits multiple sends for the same connection, how do you
make sure the same IOCP thread handles them all (or do you have only
one IOCP thread)?
In my case I have two IOCP ports, one used only for accepting
connections with AcceptEx(). The other IOCP is for send/recv
notifications only, plus notifications from my own async gethostbyname.
I have more than one thread, and the number of threads used is a
configuration option. Normally I have two threads on the listen IOCP
and two on the send/recv IOCP. As I mentioned, I have a single thread
that processes all data from clients. So basically the core of the
program is "single threaded", and only network I/O is handled by
multiple threads.

In addition, in some cases with IOCP where there are worker threads
that only do I/O, as in my case, the per-handle key passed to the
thread for the current socket can be dangerous to use. Let's say my
core thread detects a flood attack (command spam) and decides to kill
the user (disconnect and ban the IP address). When this happens there
is an outstanding recv (and maybe a send) pending. When the core thread
kills the user it will also delete its user object (the per-handle
key), and when the socket is closed, the outstanding I/O will complete
with an error. Or, in rare cases, a completed recv is already being
handled. This is dangerous, because if the object pointed to by the
per-handle key has already been deleted, you get a crash/access
violation. I handle this with a "garbage collector": when the core
thread deletes the object, it is simply passed on to the garbage
collector, which keeps the object until there are no references to it
from any outstanding I/O. After that it will decide to keep the object
for the next new user, or, if the pool already has many objects, it
might just delete it.
That is almost the only "problem" with IOCP, but it's easy to handle.
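
The underlying idea -- tying the object's lifetime to a count of
outstanding operations -- can be sketched roughly like this
(hypothetical names; Eric keeps his objects in a pool rather than
always deleting them):

#include <winsock2.h>
#include <stdlib.h>

typedef struct PerSocketContext {
    SOCKET s;
    volatile LONG outstandingOps;  /* one per posted WSASend/WSARecv */
    /* ... buffers, parser state ... */
} PerSocketContext;

/* Call before posting an overlapped operation on the socket. */
void AddRef(PerSocketContext *ctx)
{
    InterlockedIncrement(&ctx->outstandingOps);
}

/* Call when a completion (success or error) is dequeued from GQCS;
   the context is reclaimed only when the last operation drains. */
void Release(PerSocketContext *ctx)
{
    if (InterlockedDecrement(&ctx->outstandingOps) == 0) {
        closesocket(ctx->s);
        free(ctx);                 /* or return it to a pool instead */
    }
}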

Another thing to be attentive to is the listen socket. If the backlog
gets filled, the OS will stop accepting users and send an RST packet to
the peer. When using a socket placed in listen mode with IOCP, the
listen part is a bit different, because when the backlog is filled, the
OS stops "listening". Normally a call to accept() will
restore/re-activate the listen, but that is not the case with
AcceptEx(). If the backlog gets filled you must have code in the worker
thread that calls listen() again, or else the server stops accepting
new clients.

Those two are the only "pains" I had with IOCP.

One thing that is a bit annoying is AcceptEx(): with
WSASend()/WSARecv() a notification is posted to the IOCP even if the
call completes immediately, but AcceptEx() does not post a completion
packet to the IOCP if it completed immediately. Some people think it
does, since the others do. MSDN is not clear about this; actually, it
doesn't mention it at all for AcceptEx(). There are two ways of
handling it:
1. Use PostQueuedCompletionStatus()
2. Jump back in the code (I do this)

e.g. for 2:

LABEL_ACCEPT:

    /* here is code to handle the new client */

    /* ------------------- Repost the accept call */
    memset(&OverlapPlus->ol, 0, sizeof(OVERLAPPED));
    OverlapPlus->sclient = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    /* code to handle a socket() failure goes here */

    if (lpfnAcceptEx(OverlapPlus->s, OverlapPlus->sclient,
            OverlapPlus->wbuf.buf,
            OverlapPlus->wbuf.len - ((sizeof(sockaddr_in) + 16) * 2),
            sizeof(sockaddr_in) + 16, sizeof(sockaddr_in) + 16,
            &dwBytes, &OverlapPlus->ol)) {
        /* case 1: PostQueuedCompletionStatus(hIocpListen, 0, 0,
           &OverlapPlus->ol); */
        goto LABEL_ACCEPT;    /* case 2: probably faster than posting */
    } else {
        if (WSAGetLastError() != WSA_IO_PENDING) {
            /* this is bad: the accept could not be reposted */
        }
    }

//eric
Jeremy Chen
2006-05-25 19:49:21 UTC
Permalink
Post by Eric Jensen
Post by Jeremy Chen
Interesting... so when the server logic tries to send a piece of data,
how does it pass the operation to the IOCP threads -- through the IOCP
port again?
IOCP threads are only used to process notifications upon completion.
Thanks for the clarification; I thought you were using IOCP threads for
I/O operations.
Post by Eric Jensen
In my case I have two IOCP ports, one used only for accepting
connections with AcceptEx(). The other IOCP is for send/recv
notifications only, plus notifications from my own async gethostbyname.
I have more than one thread, and the number of threads used is a
configuration option. Normally I have two threads on the listen IOCP
and two on the send/recv IOCP. As I mentioned, I have a single thread
that processes all data from clients.
Since the IOCP threads are handling completion packets only, have you
measured how it performs if you use one thread instead of two? (I am
assuming the one-processor case.)
Post by Eric Jensen
In addition, in some cases with IOCP where there are worker threads
that only do I/O, as in my case, the per-handle key passed to the
thread for the current socket can be dangerous to use. [...] I handle
this with a "garbage collector" [...] That is almost the only "problem"
with IOCP, but it's easy to handle.
If I understand correctly, you have the overlapped structure inside the
user object, and have the IOCP thread free the overlapped structure,
but the upper-layer worker thread frees the user object.

Won't it be possible to just let your IOCP thread free the user object
on the error completion?
Post by Eric Jensen
Another thing to be attentive to is the listen socket. If the backlog
gets filled, the OS will stop accepting users and send an RST packet to
the peer. [...] If the backlog gets filled you must have code in the
worker thread that calls listen() again, or else the server stops
accepting new clients.
I guess you are talking about how to keep pending AcceptEx calls on the
listen socket?
Post by Eric Jensen
Those two are the only "pains" I had with IOCP.
With WSASend()/WSARecv() a notification is posted to the IOCP even if
the call completes immediately, but AcceptEx() does not post a
completion packet to the IOCP if it completed immediately.
Did you personally experience this?

Thanks,
Jeremy
d***@webmaster.com
2006-05-25 23:37:56 UTC
Permalink
The advantage of having a few more IOCP threads than you have CPUs is
that you don't have to be super careful to make sure the IOCP threads
can never, ever block. Even if the IOCP threads put incoming data on a
queue, there's always the possibility that a non-IOCP thread will hold
a lock on that queue when the IOCP thread goes to service it.

So making sure the IOCP threads never, ever block also means making
sure that every thread that accesses a resource an IOCP thread might
need also never, ever blocks while holding that resource.

The whole advantage of IOCP is that you *don't* have to live with these
limitations. So why impose them on yourself?
Jeremy Chen
2006-05-25 23:58:37 UTC
Permalink
Post by d***@webmaster.com
The advantage of having a few more IOCP threads than you have CPUs is
that you don't have to be super careful to make sure the IOCP threads
can never, ever block. Even if the IOCP threads put incoming data on a
queue, there's always the possibility that a non-IOCP thread will hold
a lock on that queue when the IOCP thread goes to service it.
Well, I do not disagree, but this is more like a textbook answer :).
Say you set the concurrent thread count to 2 when creating the IOCP;
the actual active thread count can go up to 4 if those 2 block and are
then unblocked after 2 extra threads kick in. This also means increased
context-switching overhead... but by how much I don't know, nor have I
seen any published results on tuning these IOCP thread numbers.
Post by d***@webmaster.com
So making sure the IOCP threads never, ever block also means making
sure that every thread that accesses a resource an IOCP thread might
need also never, ever blocks while holding that resource.
The whole advantage of IOCP is that you *don't* have to live with these
limitations. So why impose them on yourself?
d***@webmaster.com
2006-05-26 05:53:15 UTC
Permalink
Post by Jeremy Chen
Well, I do not disagree, but this is more like a textbook answer :).
Say you set the concurrent thread count to 2 when creating the IOCP;
the actual active thread count can go up to 4 if those 2 block and are
then unblocked after 2 extra threads kick in. This also means increased
context-switching overhead... but by how much I don't know, nor have I
seen any published results on tuning these IOCP thread numbers.
More active threads do not increase context-switching overhead unless
each thread only does a small amount of work. Under normal
circumstances, each thread will use up its full timeslice doing useful
work and the number of context switches will be constant.

For example, consider two machines, each with two CPUs and otherwise
identical. On one, there are 10 threads always ready to run. On the
other, 100. Why would the context switch rate be higher on the one with
100 threads? The timeslice is the same.

The problem with a large number of active threads is when each thread
does a small amount of work. For example, a thread-per-client web
server cannot do a little work for each of its 1,000 clients without
1,000 context switches.

DS
bluethought
2006-05-26 12:16:48 UTC
Permalink
Post by Jeremy Chen
Well, I do not disagree, but this is more like a textbook answer :).
Say you set the concurrent thread count to 2 when creating the IOCP;
the actual active thread count can go up to 4 if those 2 block and are
then unblocked after 2 extra threads kick in. This also means increased
context-switching overhead... but by how much I don't know, nor have I
seen any published results on tuning these IOCP thread numbers.
So using more threads obviously spares you from having to keep your
code absolutely clear of situations in which the time spent blocking is
greater than the time spent on context switching.

Which I believe is pretty much what the post you were replying to
implied.
David Gravereaux
2006-05-27 02:51:21 UTC
Permalink
Post by d***@webmaster.com
The advantage of having a few more IOCP threads than you have CPUs is
that you don't have to be super careful to make sure the IOCP threads
can never, ever block.
Yes, but then you have to manage WSABUF ordering problems by having to
put later buffers in a holdoff area and incur additional processing
time reconstructing, thus causing more queue contention.

I reverted my code when I tried managing that. The beauty of
CompletionPorts as I see it is FIFO, not like CompletionRoutines using
APCs. If I post many WSARecv calls, the order they get posted is truly
the order they come back out of GQCS. I like that!
Post by d***@webmaster.com
Even if the IOCP threads put incomind data on a
queue, there's always the possibility that a non-IOCP thread will hold
a lock on that queue when the IOCP thread goes to service it.
Yup. Using well-debugged queue code is very important. Then there's the
mythical non-blocking, wait-free, lock-free AtomicPtr queue SenderX was
talking about. Remember him?

I got as far as looking into '__asm cmpxchg64', but got scared :)
Post by d***@webmaster.com
So making sure the IOCP threads never, ever block also means making
sure that every thread that accesses a resource an IOCP thread might
need also never, ever blocks while holding that resource.
The whole advantage of IOCP is that you *don't* have to live with these
limitations. So why impose them on yourself?
Well, if you structure your code so your socket stuff uses a single
thread on GQCS in a way that it becomes just a producer (it doesn't
drive the rest of your application) and alerts the user side of your
API [which can be viewed as consumers], you'll find the complexity of
your socket code goes way down, and it is ripping fast, too. And your
consumers can be in multiple threads if you like.

IMO, of course.
d***@webmaster.com
2006-05-27 05:26:20 UTC
Permalink
Post by David Gravereaux
I reverted my code when I tried managing that. The beauty of
CompletionPorts as I see it is FIFO, not like CompletionRoutines using
APCs. If I post many WSARecv calls, the order they get posted is truly
the order they come back out of GQCS. I like that!
Operations always complete in order. You just may not see the
completion indications in order. But you can still process them in
order. If you start operations 1, 2, and 3, and get a completion
indication for 2, just lock the object, and process all outstanding
operations (in this direction) up to and including 2.

I personally recommend avoiding more than one operation in the same
direction per connection. It generally impairs scalability anyway by
requiring more memory to be locked.

DS
Eric Jensen
2006-05-27 10:15:48 UTC
Permalink
Post by d***@webmaster.com
I personally recommend avoiding more than one operation in the same
direction per connection. It generally impairs scalability anyway by
requiring more memory to be locked.
If the application posts overlapped recv's in a sequential manner, it wastes
the time window between one recv completion and the posting of the next
recv. If it had another buffer already posted, the transport would be able
to use that buffer immediately and not wait for the application's next recv
operation.

//eric
David Gravereaux
2006-05-27 18:04:31 UTC
Permalink
Post by Eric Jensen
Post by d***@webmaster.com
I personally recommend avoiding more than one operation in the same
direction per connection. It generally impairs scalability anyway by
requiring more memory to be locked.
If the application posts overlapped recv's in a sequential manner, it wastes
the time window between one recv completion and the posting of the next
recv. If it had another buffer already posted, the transport would be able
to use that buffer immediately and not wait for the application's next recv
operation.
I tried that, but didn't get an improvement in throughput for WSARecv.
I tried three approaches:

1) zero-byte WSARecvs --
OK, yes, this was slower, but it is useful for keeping resource usage
down when there are many open sockets.

2) flow-controlled WSARecvs --
Only one WSARecv posted to a socket, and only replaced when my upper level
comes around to do a recv() on the socket. WSABUF size is equal to my
upper layer buffer size.

3) burst-mode WSARecvs using page-sized buffers --
Only one WSARecv is initially posted to a socket, with a configurable
limit on the outstanding count. The replacement is posted in the GQCS
thread. If the replacement WSARecv completes immediately rather than
being posted, recursion happens until WSA_IO_PENDING is returned or the
outstanding limit is reached.


In testing, I found no "long term" improvement between approaches 2 and
3 for effective throughput. I could just as easily achieve wire speed
on both, though in the short term I could see burst mode kicking in.

Approach #1 was not made for speed, and it was a bit slower, but I
didn't collect any numbers for it.

But what I did find important for throughput was the number of allowed
outstanding calls on WSASend: the more the better. But do use a limit,
as you don't want to allocate all of a 5 GB file in a chain of WSABUFs
at once :) For me, with a limit of 20 or so (4 KB each), I was able to
hit wire speed.
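
The send-side cap might be sketched like this (Conn, NextChunk, and
HaveMoreData are hypothetical placeholders, not my actual code; the
GQCS thread decrements outstandingSends as each send completion is
dequeued and calls PumpSends() again):

#include <winsock2.h>
#include <string.h>

#define SEND_CAP 20              /* ~20 outstanding WSASends, 4 KB each */

typedef struct SendChunk { OVERLAPPED ol; WSABUF wsaBuf; } SendChunk;
typedef struct Conn { SOCKET s; int outstandingSends; } Conn;

SendChunk *NextChunk(Conn *c, int size); /* hypothetical: next 4 KB */
BOOL HaveMoreData(Conn *c);              /* hypothetical: data left? */

void PumpSends(Conn *c)
{
    while (c->outstandingSends < SEND_CAP && HaveMoreData(c)) {
        SendChunk *ch = NextChunk(c, 4096);
        DWORD sent = 0;

        memset(&ch->ol, 0, sizeof(OVERLAPPED));
        if (WSASend(c->s, &ch->wsaBuf, 1, &sent, 0, &ch->ol, NULL)
                == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING) {
            /* tear down the connection and release the chunk */
            break;
        }
        c->outstandingSends++;
    }
}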

YMMV
David Gravereaux
2006-05-27 20:31:06 UTC
Permalink
Post by David Gravereaux
For me, with a limit of 20 or so (4 KB each), I was able to hit wire speed.
Here's the graph, FWIW: [image link not preserved in the archive]

As I increased the "sendcap" on the right-hand local control for the
sender, throughput increased. What I found interesting is that, with
burst mode on, querying the receiver's concurrent WSARecv count showed
it nearly equaled the "sendcap".

Later, when I experimented with my flow-controlled receive logic, I
found no improvement in throughput. It was quite rewarding.
d***@webmaster.com
2006-05-28 22:46:24 UTC
Permalink
Post by Eric Jensen
If the application posts overlapped recv's in a sequential manner, it wastes
the time window between one recv completion and the posting of the next
recv. If it had another buffer already posted, the transport would be able
to use that buffer immediately and not wait for the application's next recv
operation.
In typical cases, there's nothing the transport can do in that time
interval anyway. The time between when you process one receive
completion and post the next should generally be smaller than the
inter-packet arrival time.

However, it's fair to point out that as load goes up, performance will
get worse due to the lack of an available application buffer.

DS
bluethought
2006-05-28 10:22:51 UTC
Permalink
">The advantage of having a few more IOCP threads than you have CPUs is
Post by d***@webmaster.com
that you don't have to be super careful to make sure the IOCP threads
can never, ever block.
Yes, but then you have to manage WSABUF ordering problems by having to
put
later buffers in a holdoff area and incure additional processing time
reconstructing, thus causing more queue contention. "

That seems likely only if you're chaining multiple WSABUFs in a WSARecv
call, or posting multiple overlapped WSARecvs on the same socket.
Without either of those two conditions holding, I don't see this as an
issue if you have more IOCP threads than CPUs (I usually use
2 * NumberOfLogicalProcessors, since I'm only running on AMD64 servers,
no HT).



"Yup. Using well debugged queue code is very important. Then there's
the
mythical non-blocking, wait-free, lock-free AtomicPtr queue SenderX was
talking about. Remember him? "

I do believe that if you look carefully you'll see that he posted on
this thread :)



"I got as far as looking into '__asm cmpxchg64', but got scared :) "

I've found that very carefully programmed lock-free queues, or
finely-but-not-too-finely grained locking, seem to do just fine. For
instance, I use hash functions to break up unordered lists into N
disjoint sets, each with its guardian critical section. That reduces
contention greatly, and since critical sections first spin (spinlock
style) before transitioning to kernel mode, most of the time I don't
incur transitions to kernel mode.
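
That hashing scheme can be sketched roughly like this (hypothetical
names; a sketch of the idea, not my actual code):

#include <winsock2.h>

#define NSTRIPES 64   /* N disjoint sets, each with its guardian lock */

static CRITICAL_SECTION g_stripe[NSTRIPES];

void InitStripes(void)
{
    for (int i = 0; i < NSTRIPES; ++i)
        /* spin first; fall back to a kernel wait only under contention */
        InitializeCriticalSectionAndSpinCount(&g_stripe[i], 4000);
}

/* Map an item (here a socket handle) to its stripe and run 'fn'
   while holding only that stripe's lock. */
void WithItemLocked(SOCKET s, void (*fn)(SOCKET))
{
    CRITICAL_SECTION *cs = &g_stripe[(UINT_PTR)s % NSTRIPES];
    EnterCriticalSection(cs);
    fn(s);
    LeaveCriticalSection(cs);
}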
David Gravereaux
2006-05-28 18:47:55 UTC
Permalink
Post by bluethought
Post by David Gravereaux
Post by d***@webmaster.com
The advantage of having a few more IOCP threads than you have CPUs is
that you don't have to be super careful to make sure the IOCP threads
can never, ever block.
Yes, but then you have to manage WSABUF ordering problems by having to
put later buffers in a holdoff area and incur additional processing
time reconstructing, thus causing more queue contention.
That seems likely only if you're chaining multiple WSABUFs in a WSARecv
call, or posting multiple overlapped WSARecvs on the same socket.
Without either of those two conditions holding, I don't see this as an
issue if you have more IOCP threads than CPUs (I usually use
2 * NumberOfLogicalProcessors, since I'm only running on AMD64 servers,
no HT).
That'll guarantee proper order, yes. Personally, I didn't find that more
than one concurrent WSARecv improved throughput.

I did discover one thing... If I allowed my IOCP code to receive faster
than my consumers could consume it, I made a DoS hole. So there I was,
in the middle of testing the server, having slowed down the recv()
processing, watching my recvBuffer queue count climb and climb... with
no bounds.

Then it dawned on me what an idiot I was :) flow control should not be
disconnected, and for good reason..
Post by bluethought
Post by David Gravereaux
Yup. Using well-debugged queue code is very important. Then there's the
mythical non-blocking, wait-free, lock-free AtomicPtr queue SenderX was
talking about. Remember him?
I do believe that if you look carefully you'll see that he posted on
this thread :)
My newsreader only goes back a month, guess I'll check groups.google.com
Post by bluethought
"I got as far as looking into '__asm cmpxchg64', but got scared :) "
I've found that very carefully programmed lock-free queues, or
finely-but-not-too-finely grained locking seems to do just fine. For
instance, I use hash functions to break up unordered lists into N
disjoint sets, each with its guardian critical section. That reduces
contention greatly, and since critical sections first spin (spinlock
style) before transitioning to kernel mode, most of the time I don't
incur transitions to kernel mode.
For a time, I did something stupid in the close() routine I give to my
upper API. In it, I tried to destroy my per-socket info right then, and
tried to manage WSA_OPERATION_ABORTED results coming in off GQCS with a
dangling per-socket info, and just plain made a mess of things :)

After removing a large EnterCriticalSection/LeaveCriticalSection block
from the completion thread (matched in my close) and reorganizing
around a posted DisconnectEx, so that the per-socket struct is only
reclaimed in the GQCS thread when the ref count hits zero, my troubles
went away and GQCS iterations per second went way up..

I was stupid, but I found my mistake :) and real closure should only
happen in the completion thread.
Eric Jensen
2006-05-26 10:47:11 UTC
Permalink
Post by Jeremy Chen
Thanks for the clarification, I thought you were using IOCP threads for
IO operations.
Im not because the client collection/list would need a lock and the IOCP
threads would have to fight for it.
Then there is the locking on the queue that could take away speed. Normally
you should not worrie
so much about locks in my app, because there is no blocking operations at
all
(except for hostname lookup, but that has own thread so no problem) and the
threads would not wait long. However there is no reason to have the threads
fight for a lock. the queue is a chained list build of pointers. Each object
in queue has a pointer to next object. So when they are processed
the core thread will only get pointer to first object in queue, it will not
get them one by one. So here is the iocp threads not blocked for long.
To remove the issue about the the IOCP threads fighting, they simply
each have a "queue". And the core thread will process them all.
The queue is just a class, the class needs to know how many threads needs
a queue, and internally it have a sub queue for each thread.
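
The "take the whole chain at once" handoff can be sketched with a
single atomic swap (hypothetical names; note that per-thread sub-queues
like mine avoid even this small amount of contention):

#include <windows.h>

typedef struct Msg { struct Msg *next; /* ... payload ... */ } Msg;

static Msg *volatile g_head = NULL;

/* Called by an IOCP worker: push one message (lock-free LIFO). */
void Enqueue(Msg *m)
{
    Msg *old;
    do {
        old = g_head;
        m->next = old;
    } while (InterlockedCompareExchangePointer(
                 (PVOID volatile *)&g_head, m, old) != old);
}

/* Called by the core thread: one swap detaches the entire chain.
   The chain comes back newest-first and may need reversing. */
Msg *DrainAll(void)
{
    return (Msg *)InterlockedExchangePointer((PVOID volatile *)&g_head,
                                             NULL);
}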
Post by Jeremy Chen
Since the IOCP threads are handling completion packets only, have you
measured how it performs if you use one thread instead of two? (I am
assuming the one-processor case.)
The server runs perfectly with only one thread to handle both listen
and send/recv notifications. However, with around 9000 online clients,
many get banned from the server every day, and many of those people
also have TCP flood tools. So when an SPS user gets banned I get a nice
DoS attack. When the server is attacked, one thread is not enough on
the listen IOCP. Two threads seem to handle it well, though, and not
that many connections are refused compared to only one thread. Another
reason for having two threads on the listen IOCP is that I have more
than one listen port.

I'm running on a single CPU, and the multiple threads would perform
better on a dual-processor system. However, when a thread waits for an
IOCP packet, the system does not give a timeslice to the thread unless
there is work for it. So having too many threads does not hurt the
program at all. However, if a thread has been sleeping too long it
might be slow to start up again, because the OS might have moved it
into swap and such. I handle that by not having threads wait infinitely
for notifications. Instead, a thread breaks out and checks whether the
stop-event handle has been signaled; if so, it means the server program
wants to exit. This check keeps the thread fresh and alive :)

The reasoning for having more than one thread on the send/recv
notifications: with this many clients, the OS can easily run out of
resources. The number of concurrent clients I can have connected is
directly proportional to how many resources the OS has left, so it's
important to process this as fast as possible. Let's say I can have
30,000 clients if they are only connected and there are no sends and
receives. If there are also sends and receives, I might only be able to
have 15,000, because the OS needs its internal resources for the
sends/receives, and that leaves fewer resources for keeping the rest
connected. If the OS runs out of resources, it can mean that all new
clients get an RST packet because the OS does not have the resources.

However, there are many things you can do to avoid this, like removing
the send buffers in the kernel and disabling the Nagle algorithm and
such. But removing these send buffers in the kernel (AFD.SYS) means
that your program needs to have multiple outstanding sends and must not
wait for the rest to complete, so that when one send is done the OS can
start sending the next data without having to wait for your program to
repost it. By disabling send buffers you also avoid having AFD.SYS copy
the buffer. Disabling the recv buffer is not really good -- there is no
gain -- because it will just cause the OS to buffer at a lower level
than Winsock, and the buffer copying will take place anyway.
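
Both of those knobs are set with setsockopt(); a minimal sketch (error
checking omitted):

#include <winsock2.h>

void TuneSocket(SOCKET s)
{
    int zero = 0, one = 1;

    /* SO_SNDBUF = 0: AFD sends straight from the application's locked
       WSABUFs, so keep multiple overlapped sends outstanding. */
    setsockopt(s, SOL_SOCKET, SO_SNDBUF,
               (const char *)&zero, sizeof(zero));

    /* TCP_NODELAY disables the Nagle algorithm. */
    setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
               (const char *)&one, sizeof(one));
}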
Post by Jeremy Chen
If I understand correctly, you have the overlapped structure inside the
user object, and have the IOCP thread free the overlapped structure,
but the upper-layer worker thread frees the user object.
I only have a reference (not really a ref) to the overlapped
structures. When a send is started it requires an OVERLAPPED structure.
I simply keep track of the count of these structs per user, and the
user object is only freed when there are no outstanding objects. It
would be okay to handle the user objects directly in the IOCP thread
that receives the notification, but then I risk (very rarely) that one
IOCP thread is working with the user object while another thread wants
to delete it. That must not happen, because it would mean a
crash/access violation. Of course I could have a lock on the user
object, but this other way scales better, at least with my code.
Post by Jeremy Chen
Won't it be possible to just let your IOCP thread free the user object
on the error completion?
Let's say I have two outstanding overlapped operations, a send and a
receive. I call closesocket() and those two operations will fail, and I
get two error notifications on some IOCP thread. Both completions
return the pointer to the user object. So the first completion packet
would result in deletion of the object, but when the second completion
arrives, it would cause an access violation by deleting the same object
again. I could use an API call to check whether the pointer is bad, but
I use my garbage collection instead. And since this server dispatches a
lot of small messages and there are many logins/logouts, it saves
resources to keep the objects instead of new-ing and deleting them all
the time, and I'm sure the object is only deleted when it's perfectly
safe.
Post by Jeremy Chen
I guess you are talking about how to keep pending AcceptEx calls on the
listen socket?
Correct.
Post by Jeremy Chen
Post by Eric Jensen
Those two are the only "pains" I had with IOCP.
With WSASend()/WSARecv() a notification is posted to the IOCP even if
the call completes immediately, but AcceptEx() does not post a
completion packet to the IOCP if it completed immediately.
Did you personally experience this?
Yes I did. When I started out with this server it was not using IOCP
but WSASelect. I found that IOCP might be better, so I rewrote the
"socket handling" part to use IOCP. Then one of the popular DoS attacks
started, and the backlog (I set it to the number of threads) was
filled, and the server stopped accepting new connections. Normally a
call to accept() would restore the listening, but this is not the case
with AcceptEx(). And since the WSASend()/WSARecv() functions give a
notification no matter what, I counted on it being the same for
AcceptEx(), but it was not :) However, this only gives me problems when
the server is under attack, and it's no problem to handle in your code
as long as you know it can happen.

//eric
David Gravereaux
2006-05-27 01:39:47 UTC
Permalink
Post by Eric Jensen
Post by Jeremy Chen
Post by Eric Jensen
AcceptEx() does not post a completion packet to the IOCP if it
completed immediately.
Did you personally experience this?
Yes I did. When I started out with this server it was not using IOCP
but WSASelect. I found that IOCP might be better, so I rewrote the
"socket handling" part to use IOCP. Then one of the popular DoS attacks
started, and the backlog (I set it to the number of threads) was
filled, and the server stopped accepting new connections.
Intriguing.. Why not just post more AcceptEx calls? I do about 200 for
my web server, though I didn't find much improvement over 25. It handles
a spoofed IP SYN flood like it isn't there. I only use a single thread
recursing on GQCS. I found the count of pending AcceptEx calls on the
listening socket *is* your backlog and you have direct control over the
size of it.

I've never seen AcceptEx *ever* complete immediately. It always returns
WSA_IO_PENDING for me.

/*
 * Use the special function for overlapped accept() that is provided
 * by the LSP of this socket type.
 */

    rc = infoPtr->proto->AcceptEx(infoPtr->socket, bufPtr->socket,
            bufPtr->buf, bufPtr->buflen - (addr_storage * 2),
            addr_storage, addr_storage, &bytes, &bufPtr->ol);

    if (rc == FALSE) {
        if ((WSAerr = winSock.WSAGetLastError()) != WSA_IO_PENDING) {
            InterlockedDecrement(&infoPtr->outstandingOps);
            InterlockedDecrement(&infoPtr->outstandingAccepts);
            bufPtr->WSAerr = WSAerr;
            return WSAerr;
        }
    } else if (useBurst) {
        /*
         * Tested under extreme listening-socket abuse, it was found
         * that this logic condition is never met. AcceptEx _never_
         * completes immediately. It always returns WSA_IO_PENDING.
         * Too bad, as this was looking like a good way to detect and
         * handle burst conditions.
         */

        BufferInfo *newBufPtr;

        /*
         * The AcceptEx() has completed now, and was posted to the port.
         * Keep giving more AcceptEx() calls to drain the internal
         * backlog. Why should we wait for the time when the completion
         * routine is run if we know the listening socket can take
         * another right now? IOW, keep recursing until the
         * WSA_IO_PENDING condition is achieved.
         */

        newBufPtr = GetBufferObj(infoPtr, buflen);
        if (PostOverlappedAccept(infoPtr, newBufPtr, 1) != NO_ERROR) {
            FreeBufferObj(newBufPtr);
        }
    }

    return NO_ERROR;
}
Eric Jensen
2006-05-27 10:30:14 UTC
Permalink
Post by David Gravereaux
Intriguing.. Why not just post more AcceptEx calls? I do about 200 for
my web server, though I didn't find much improvement over 25. It handles
a spoofed IP SYN flood like it isn't there. I only use a single thread
recursing on GQCS. I found the count of pending AcceptEx calls on the
listening socket *is* your backlog and you have direct control over the
size of it.
Actually, I do not have a good answer for that. For recv operations I
do have more outstanding recvs -- and I should have that for AcceptEx()
as well, because with my current approach I'm actually wasting the time
window :/

I'm going to change that in my server now; I should have thought of that :)
Post by David Gravereaux
I've never seen AcceptEx *ever* complete immediately. It always returns
WSA_IO_PENDING for me.
I only see that when there is already a connection in the queue
(backlog) and I call AcceptEx(). However, I'm going to change my code
to post more AcceptEx() calls, and I guess that will remove this
immediate completion. I guess you don't see it due to your 200
outstanding operations.
//eric
David Gravereaux
2006-05-27 17:25:25 UTC
Permalink
Post by Eric Jensen
Post by David Gravereaux
Intriguing.. Why not just post more AcceptEx calls? I do about 200 for
my web server, though I didn't find much improvement over 25. It handles
a spoofed IP SYN flood like it isn't there. I only use a single thread
recursing on GQCS. I found the count of pending AcceptEx calls on the
listening socket *is* your backlog and you have direct control over the
size of it.
Actually, I do not have a good answer for that. For recv operations I
do have more outstanding recvs -- and I should have that for AcceptEx()
as well, because with my current approach I'm actually wasting the time
window :/
I'm going to change that in my server now; I should have thought of that :)
Post by David Gravereaux
I've never seen AcceptEx *ever* complete immediately. It always returns
WSA_IO_PENDING for me.
I only see that when there is already a connection in the queue
(backlog) and I call AcceptEx(). However, I'm going to change my code
to post more AcceptEx() calls, and I guess that will remove this
immediate completion. I guess you don't see it due to your 200
outstanding operations.
Oddly enough, no, and I don't know why, but not at first though...

During initial experiments with my new IOCP code [ported from the ol'
WSAAsyncSelect code], when creating a listening socket I varied the
count of posted AcceptEx calls looking for a "pocket". If the count was
low, like around two or three, I could still get the "classic" Winsock
bug of dropped accept notices [the peer got its ACK, but no AcceptEx
came back for it]. A couple dozen of those (20 minutes or so of maximum
abuse with ab.exe) and the listening socket stops returning any
AcceptEx calls, and I have to close and reopen the socket for the
server to work again. The count did matter, and larger was better to
avoid dropped accepts.

But later on in development after I perfected my thread procedure on GQCS,
I went back to test the old bug, and couldn't get it to happen with even
just one. I didn't try a SYN flood test, just ab.exe requesting a test
page from multiple machines to my one web server.

To this day, I don't know what I did, but must have done it right..

The best guess I have is that my thread on GQCS has little blocking, if
any, and reposts a new AcceptEx early after determining the socket isn't
closing. That thread doesn't do any application work either, and is
solely for the purpose of sucking winsock dry and stuffing it back with
fresh new buffers. I alert the thread that the socket is bound to that
it has a new job, so it can come and grab the new work off a queue when
it is ready to do so. Classic producer/consumer.

Maybe my attention to avoiding contention in the GQCS thread worked.. I
don't have a clear answer. I wish I had a clear answer.
bluethought
2006-05-28 10:27:50 UTC
Permalink
"Maybe my attention to avoid contention in the GQCS thread worked.. I
don't have a clear answer. I wish I had a clear answer. "

Surely easily done by running some test code, cutting down the # of
AcceptEx-es to maybe 20, and bombarding the server?
David Gravereaux
2006-05-28 17:55:25 UTC
Permalink
Post by bluethought
"Maybe my attention to avoid contention in the GQCS thread worked.. I
don't have a clear answer. I wish I had a clear answer. "
Surely easily done by running some test code, cutting down the # of
AcceptEx-es to maybe 20, and bombarding the server?
I did that. I brought the AcceptEx count down to even one, and the old
bug wasn't showing itself anymore like it had originally. Why? I have no
idea.
bluethought
2006-05-28 10:25:40 UTC
Permalink
"Actually, I do not have good answer for that. For recv operations i do
have
more outstanding recv's - And i should have that as well for AcceptEx()
because with my current way, im actually wasting the time window :/

Im gonna change that in my server now, i should have thought of that :)
"

Would be interesting to know if you still face issues with SYN floods
on your production server once you implement this.

Like David, I use 200+ AcceptEx-es pending at any given time (I refill
when the count goes down to less than 100, by keeping a thread that
posts AcceptEx-es waiting on an event; see the sketch below). My code
is still not deployed on a full-load production server though, so some
real-world results would be interesting.
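
A rough sketch of that refill strategy (names such as PostOneAcceptEx
are hypothetical placeholders, not my actual code):

#include <winsock2.h>
#include <windows.h>

#define HIGH_WATER 200
#define LOW_WATER  100

HANDLE g_refillEvent;               /* auto-reset event */
volatile LONG g_pendingAccepts;

void PostOneAcceptEx(void);         /* hypothetical: posts one AcceptEx */

unsigned __stdcall RefillThread(void *arg)
{
    for (;;) {
        WaitForSingleObject(g_refillEvent, INFINITE);
        while (g_pendingAccepts < HIGH_WATER) {
            PostOneAcceptEx();
            InterlockedIncrement(&g_pendingAccepts);
        }
    }
}

/* In the GQCS thread, after consuming an accept completion:
       if (InterlockedDecrement(&g_pendingAccepts) < LOW_WATER)
           SetEvent(g_refillEvent);
*/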
Eric Jensen
2006-05-28 14:52:01 UTC
Permalink
Post by bluethought
Would be interesting to know if you still face issues with SYN floods
on your production server once you implement this.
I have changed the code now and will do some tests later today.
However, I will only install the updated code on my server at a later
point, since it's located in a data center 400 kilometers from here,
and remoting in is too risky if something goes wrong.
Post by bluethought
Like David, I use 200+ AcceptEx-es pending at any given time (I refill
when the count goes down to less than 100, by keeping a thread that
posts AcceptEx-es waiting on an event). My code is still not deployed
on a full-load production server though, so some real world results
would be interesting.
I changed to post 200 AcceptEx's when the server goes online.
Then like you do, i repost if the count goes below some configurable limit,
default is 150. This is handled by same thread that scans the socket pool
for ghosted conns and such.

My problem is not so much SYN floods, because the ISP in my case (UNI2)
protects the server against those. However, some people found that flooding
with full connections is smarter, because these full connections slip through
the packet inspection. The packet inspection does only allow x new
connections per x seconds per host, but these attacks are often distributed.
And when they are distributed, the purpose is to use up the sockets
available on my server. To prevent that, only 1 (configurable) connection
per IP address is allowed, and if the handshake is not completed within 5
seconds (also configurable) the client is disconnected.
However, I must accept every connection before I know whether there is more
than one from a host, and that's where it hurts when accepting... But I
think posting more AcceptEx's will help. But I will of course let you know
here in the group how it turns out when real clients connect to the updated
server version.
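
As a sketch of the bookkeeping just described (invented names; the real
server's structures will differ), assuming IPv4 addresses as 32-bit keys:

#include <chrono>
#include <cstdint>
#include <map>
#include <mutex>

struct IpPolicy {
    int maxPerIp = 1;                            // configurable
    std::chrono::seconds handshakeTimeout{5};    // configurable
    std::mutex m;
    std::map<uint32_t, int> perIp;               // addr -> live connections

    // Called right after an accept completes; false means drop at once.
    bool OnAccepted(uint32_t addr) {
        std::lock_guard<std::mutex> lock(m);
        int &n = perIp[addr];
        if (n >= maxPerIp)
            return false;
        ++n;
        return true;
    }
    void OnClosed(uint32_t addr) {
        std::lock_guard<std::mutex> lock(m);
        if (--perIp[addr] <= 0)
            perIp.erase(addr);
    }
};

// The pool-scanning thread can then drop any socket whose handshake
// hasn't completed within handshakeTimeout of its accept time.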

To go a bit off topic, the next step for me is to implement some compression
for messages dispatched to all clients, so that I can save some outbound
bandwidth (which is quite expensive). I'm considering using ZLine, so that
this can be done without being a CPU hog. But I would like to hear if any of
you have a better suggestion for compression?
People can chat on this server and send private messages, so let's say a
message of 100 bytes is sent to the main chat; then my server has to transmit
this message to all clients, which rounds up to about 800 kbyte or so, and
9000 people chat fast, so it's easily an even larger amount to send often.
However, most chatter is short messages, and most goes in private messages.
Besides that, other messages are dispatched: searches, connection
requests, version requests and more - it's very IRC-alike.
The outbound average is ca. 13 mbit/s over 24 hours, but most of this data
is transferred between 16:00 - 01:00 CET, when most users are active; during
the night and early day most are idle and the load is small. And it's the peaks
in this time I want to bring down with compression. Sending the channel list
and user lists and such is very bandwidth-consuming, even more than normal
client activities. And therefore I also have anti login-flooding, so a tool can't
log in, get the client list + channel list, and then log off and do it in a loop.
Well, it can, but not very fast. The reason that my average is only 13 mbit/s is
that I limit the number of commands a client can send, e.g. only 1 chat
message per 2 secs, else it's not dispatched.

//eric
bluethought
2006-06-02 00:45:51 UTC
Permalink
Post by Eric Jensen
But I will of course let you know here in the group
how it turns out when real clients connect to the updated server
version.
Thanks.
Post by Eric Jensen
in this time I want to bring down with compression. Sending the channel list
and user lists and such is very bandwidth-consuming, even more than normal
client activities. And therefore I also have anti login-flooding, so a tool can't
How about snapshot-based updates of channel list/user list? Update the
channel list snapshot every second, and store all changes till the next
snapshot update. Send the user only the compressed snapshot (one
compression per second, rather than per send), and the uncompressed
changes.
Eric Jensen
2006-06-03 11:00:07 UTC
Permalink
Post by bluethought
How about snapshot-based updates of channel list/user list? Update the
channel list snapshot every second, and store all changes till the next
snapshot update. Send the user only the compressed snapshot (one
compression per second, rather than per send), and the uncompressed
changes.
I already do snapshots, but they are not timed.
The channel list is just a simple collection of pointers
(a linked list).

Core->ChanList.SendChannelList(clsClient *client);
or
Core->ChanList.SendChannelListBig(clsClient *client);

That function will send the snapshot, unless a new channel has formed since
the last snap. If there is a new channel, a new snapshot will be created and
sent, unless flood control forbids it. SendChannelList uses an optimized
list, where only popular channels and channels with a reasonable number of
users are listed. SendChannelListBig is the complete version.

So what I'm looking for is the best algo to do the compression, where the
compression lib is not a CPU hog and still offers a good compression ratio.
So far ZLine is my best option.

I will compress a lot of data. I have one buffer used for messages
dispatched to all clients. It will be compressed and flushed to the clients
at a timed interval (often; I don't want to create too much lag on chat).

// eric
bluethought
2006-06-03 17:20:22 UTC
Permalink
Post by Eric Jensen
I already do snapshots, but they are not timed.
The channel list is just a simple collection of pointers
(a linked list).
Core->ChanList.SendChannelList(clsClient *client);
or
Core->ChanList.SendChannelListBig(clsClient *client);
That function will send the snapshot, unless a new channel has formed since
the last snap. If there is a new channel, a new snapshot will be created and
sent, unless flood control forbids it. SendChannelList uses an optimized
list, where only popular channels and channels with a reasonable number of
users are listed. SendChannelListBig is the complete version.
Let's count compressions per second... let's say your channel list is
updated N times per second, and sent to M users in that time - so that
would involve min(M, N) compressions. If this is greater than 1, then
just do one compression per second, and send out that compressed list +
the changes since the compressed list was built. That's a time/memory
tradeoff: you save computation time at the expense of a little memory
(bandwidth in this case). Then the compression time becomes less
critical.

I can't advise on a specific algorithm to use for your compression - the
above is more a scheme to use independent of your compression method.
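
A sketch of that once-per-second scheme, with zlib's one-shot API
standing in for whatever library is finally chosen (the += on the
snapshot is a simplification; real deltas may include removals):

#include <string>
#include <vector>
#include <zlib.h>

static std::vector<unsigned char> Compress(const std::string &plain) {
    std::vector<unsigned char> out(compressBound(plain.size()));
    uLongf outLen = out.size();
    compress2(out.data(), &outLen,
              reinterpret_cast<const Bytef *>(plain.data()),
              plain.size(), Z_DEFAULT_COMPRESSION);
    out.resize(outLen);
    return out;
}

struct ChannelListCache {
    std::string snapshot;            // full list
    std::vector<unsigned char> compressed;
    std::string pendingChanges;      // uncompressed deltas since last tick

    // Called by a 1 Hz timer: one compression per second, no matter how
    // many clients request the list in between.
    void Tick() {
        snapshot += pendingChanges;  // fold deltas into the snapshot
        pendingChanges.clear();
        compressed = Compress(snapshot);
    }
};

// Per request: send `compressed` plus the small uncompressed tail in
// pendingChanges.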
d***@webmaster.com
2006-06-06 02:55:44 UTC
Permalink
Post by Eric Jensen
So what I'm looking for is the best algo to do the compression, where the
compression lib is not a CPU hog and still offers a good compression ratio.
So far ZLine is my best option.
I will compress a lot of data. I have one buffer used for messages
dispatched to all clients. It will be compressed and flushed to the clients
at a timed interval (often; I don't want to create too much lag on chat).
// eric
How much data are you compressing? If you want to be able to send the
same compressed buffer to all clients, then the compression algorithm
has to be stateless or operate on blocks rather than streams. Most
likely, a block compression algorithm will be what you want, but many
algorithms cannot effectively compress blocks that are small. If you're
talking less than about 2Kb, I wouldn't waste your time.

Make a snapshot of a typical block that you would want to compress, try
various algorithms on it, and note how much CPU time was used and the
final output size. Then you can make an intelligent decision.
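
A skeleton of that measurement, shown here with zlib's compress2 at a
chosen level (other libraries slot into the same timing harness; link
with -lz):

#include <chrono>
#include <cstdio>
#include <vector>
#include <zlib.h>

void Benchmark(const std::vector<unsigned char> &block, int level) {
    std::vector<unsigned char> out(compressBound(block.size()));
    uLongf outLen = out.size();
    auto t0 = std::chrono::steady_clock::now();
    int rc = compress2(out.data(), &outLen,
                       block.data(), block.size(), level);
    auto t1 = std::chrono::steady_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    if (rc == Z_OK)
        std::printf("level %d: %lu -> %lu bytes in %.0f us\n",
                    level, (unsigned long)block.size(),
                    (unsigned long)outLen, us);
}

// Run it on a saved snapshot at, say, levels 1, 6 and 9, then compare.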

DS
Eric Jensen
2006-06-06 05:15:27 UTC
Permalink
Post by d***@webmaster.com
How much data are you compressing? If you want to be able to send the
same compressed buffer to all clients, then the compression algorithm
has to be stateless or operate on blocks rather than streams. Most
likely, a block compression algorithm will be what you want, but many
algorithms cannot effectively compress blocks that are small. If you're
talking less than about 2Kb, I wouldn't waste your time.
The user list alone when the server is full grows to over 150 kbyte...
I already implemented zlib (http://www.zlib.net/) in my test server code,
and it operates on blocks. The first byte the client sends being 0x1B means
this is compressed; the next 2 bytes are the length of the block, and then
the compressed data follows.
I'm talking about a lot more than 2 kb, and the server only compresses if
the buffer is > 4 kb; otherwise it sends the data uncompressed. If the
buffer is < 4 kb the server might choose to delay the dispatch, unless the
data is getting "old"...

//eric
d***@webmaster.com
2006-06-06 23:03:11 UTC
Permalink
Post by Eric Jensen
The user list alone when the server is full grows to over 150 kbyte...
I already implemented zlib (http://www.zlib.net/) in my test server code,
and it operates on blocks. The first byte the client sends being 0x1B means
this is compressed; the next 2 bytes are the length of the block, and then
the compressed data follows.
I'm talking about a lot more than 2 kb, and the server only compresses if
the buffer is > 4 kb; otherwise it sends the data uncompressed. If the
buffer is < 4 kb the server might choose to delay the dispatch, unless the
data is getting "old"...
//eric
Then it's definitely worth compressing. Probably zlib is what you want.
LZMA will get you a higher compression ratio, but it takes a lot more
CPU to compress.

One caution with zlib, though I don't think it matters in your case
because you don't accept compressed input -- small amounts of data
maliciously constructed can expand to huge amounts of data. In some
cases, this can lead to a denial of service attack or flood.

DS
Eric Jensen
2006-06-07 07:35:29 UTC
Permalink
<***@webmaster.com> wrote in message news:***@g10g2000cwb.googlegroups.com...

ej> The first byte the client sends being 0x1B means this is compressed

I meant server, not client here :)
Post by d***@webmaster.com
Then it's definitely worth compressing. Probably zlib is what you want.
LZMA will get you a higher compression ratio, but it takes a lot more
CPU to compress.
One caution with zlib, though I don't think it matters in your case
because you don't accept compressed input -- small amounts of data
maliciously constructed can expand to huge amounts of data. In some
cases, this can lead to a denial of service attack or flood.
The data that needs compressing is text, and it can be compressed at a
good ratio compared to other data types. Yes, it's only the server that
will do compression; the amount of data from clients is small and not
worth compressing.
Clients can establish a connection to each other, and client<->client already
uses compression. Even though I'd like an algo that goes easy on the CPU, I'd
rather activate the server link and set up an extra server so the load is
shared. The cost of bandwidth that can be saved justifies the cost of adding
an extra server, and therefore more resources to do the compression.

I guess I will be doing some benchmark tests on algos today...

//eric
d***@webmaster.com
2006-06-09 21:11:28 UTC
Permalink
Post by Eric Jensen
Clients can establish a connection to each other, and client<->client already
uses compression. Even though I'd like an algo that goes easy on the CPU, I'd
rather activate the server link and set up an extra server so the load is
shared. The cost of bandwidth that can be saved justifies the cost of adding
an extra server, and therefore more resources to do the compression.
Then use *extreme* caution in the client code that accepts compressed
input. With many compression algorithms, artificially constructed
compressed data can expand to millions of times its compressed size.
Naive decompression code can lead to crashes, memory exhaustion, disks
filling up with billions of zeroes, and so on.
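
A sketch of that guard, using zlib's streaming API with a hard cap on
output so a crafted block can't balloon:

#include <cstring>
#include <vector>
#include <zlib.h>

bool SafeInflate(const unsigned char *in, size_t inLen,
                 std::vector<unsigned char> &out, size_t maxOut) {
    z_stream zs;
    std::memset(&zs, 0, sizeof(zs));
    if (inflateInit(&zs) != Z_OK)
        return false;
    zs.next_in  = const_cast<Bytef *>(in);
    zs.avail_in = (uInt)inLen;
    unsigned char buf[16384];
    int rc;
    do {
        zs.next_out  = buf;
        zs.avail_out = sizeof(buf);
        rc = inflate(&zs, Z_NO_FLUSH);
        if (rc != Z_OK && rc != Z_STREAM_END)
            break;                         // corrupt or truncated input
        size_t produced = sizeof(buf) - zs.avail_out;
        if (out.size() + produced > maxOut) {
            inflateEnd(&zs);               // expansion bomb: bail out
            return false;
        }
        out.insert(out.end(), buf, buf + produced);
    } while (rc != Z_STREAM_END);
    inflateEnd(&zs);
    return rc == Z_STREAM_END;
}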

DS
