Post by Peter Duniho
Post by j***@comcast.net
Thanks for the reply. Pretty much confirms what I've read. But there
are some pretty stubborn people around here <g>...when they start
talking about how since the x86 lock instruction prefix is inefficient
Inefficient compared to what? In general, multi-threading on a multi-CPU
computer, when coded correctly, provides the expected performance
increase. Clearly synchronization (whatever method one chooses to use)
does not HAVE to have a significant effect on performance.
However, sometimes there are situations where one may need to synchronize
"critical" shared data-structure(s).
FWIW, I have "run into" this issue a couple of times in the past. For
instance, during the development of an "enterprise-type" server that had to
handle 15,000-40,000+ concurrent connections. It also had to handle sporadic
periods of heavy load. The server did all it could to efficiently manage the
load(s). Let's see if I can very briefly describe the setup.
One part of the load-management scheme was based on a per-socket timeout
algorithm that was able to quickly disconnect timed-out connections, with no
extra network overhead; a "timeout protocol" was avoided by keeping things
"local". The server stored "actively connected" per-socket state in a
hashed-linked data-structure that was based on a distributed per-node
locking algorithm. The locks were implemented as a single global array of
rw-mutexes, and access into them was managed by hashing a pointer to the
to-be-locked structure into an index into the global array; similar to what
I did here:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011baf08844c4/3ca11e0c3dcf762c?q=multi-mutex&rnum=1#3ca11e0c3dcf762c
(very early and crude version of my "multi-mutex" algorithm)
Each per-socket structure had a pair of monotonic counters (e.g., old, cur)
that got non-atomically incremented by worker threads every time there was
any I/O activity; luckily the count did not have to be accurate at all, so
simple increments worked fine. The timeout logic basically had to traverse
the active per-socket list, sample each one's counter, and compare it to the
previously read value and a runtime-configurable per-socket timeout
value/timestamp. I did NOT want the worker threads themselves to have to
traverse the per-sockets. Any traversal could visit tens of thousands of
per-sockets. That's tens of thousands of expensive rw-mutex acquire/release
operations: a boatload of atomic operations and memory barriers.
http://groups.google.com/group/comp.programming.threads/msg/7c15b3488642d1a6?hl=en
(lock-free vs. lock-based overheads)
http://groups.google.com/group/comp.programming.threads/msg/a3e38e27df971e0d?hl=en
(lock-free vs. lock-based overheads)
http://groups.google.com/group/comp.programming.threads/msg/ec103670e323111a?hl=en
(zero-overhead queues vs. atomic-op/member based queues)
http://www.rdrop.com/users/paulmck/RCU/lockperf.2004.01.17a.pdf
(RCU vs. locking schemes)
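A minimal sketch of that counter-sampling idea (field and function names are
hypothetical, and a real version needs to handle the very first pass):

```c
/* Per-connection state: workers bump `cur` non-atomically (staleness
   is tolerated by design), and the timeout thread remembers the last
   value it saw in `old`. */
typedef struct per_socket {
    unsigned long cur; /* incremented by worker threads on any I/O */
    unsigned long old; /* last sample taken by the timeout thread */
} per_socket;

/* Worker side: plain non-atomic increment; the count need not be exact. */
static void on_io_activity(per_socket *s)
{
    ++s->cur;
}

/* Timeout-thread side, run once per polling interval. If the counter
   has not moved since the previous pass, the connection was idle for a
   whole interval and can be disconnected. */
static int timed_out(per_socket *s)
{
    unsigned long sample = s->cur; /* unsynchronized read, by design */
    int idle = (sample == s->old);
    s->old = sample;
    return idle;
}
```

The workers never take a lock and never issue an atomic op or memory barrier
on the hot path; all the comparison work lands on the one dedicated timeout
thread.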
So, I moved the timeout system to a dedicated thread. However, I did not
want the timeout thread to have to lock the rw-mutex for each per-socket.
http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce
(rw-mutexes: when to use?)
http://groups.google.ca/group/comp.lang.c++.moderated/msg/a8ef173bee4f0c03
("real" costs of locking)
The server's per-socket timeout "polling" intervals could be adjusted at
runtime by an administrator via a GUI, so the locking frequency was "random".
If the timeout was moderately/fairly small, the thread would be persistently
interrupting lock-holders that were doing "real work", as opposed to simply
loading a monotonic counter. The side effects of blocking (e.g.,
context-switching, lock-convoy, possible priority-inversion, etc.) and/or
frequent atomic-op/memory-barrier usage would have a negative impact on the
server's scalability/throughput/performance characteristics in general.
So, I solved the problem with a solution to the reader/writer problem:
http://groups.google.ca/group/comp.lang.c++.moderated/msg/e8ed4f5f000c336a
(contains brief description of reader/writer problem)
http://home.comcast.net/~vzoom/demos/pc_sample.c
(one of my lock-free collectors)
http://appcore.home.comcast.net/
(my old AppCore library; Sun and Intel both reference my code/algorithms wrt
lock-free algorithms in general)
- http://www.intel.com/cd/ids/developer/asmo-na/eng/151201.htm?page=4
(An Intel Press reference to my AppCore library)
- http://developers.sun.com/solaris/articles/chip_multi_thread.html#2
(A Sun reference, section 2.2.1.1.2, to my AppCore library)
http://groups.google.ca/group/comp.lang.c++.moderated/msg/0032f3cc0a6d75f0
Here is some generic info on proxy gc:
http://groups.google.com/group/comp.programming.threads/msg/41f29efe33e7f124?hl=en
(brief description)
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/f80fb2901b42e447/9fe3ee93d3729321?lnk=raot&hl=en#9fe3ee93d3729321
(discussion on GC)
http://groups.google.ca/group/comp.lang.c++/msg/b3e1d10e219908f1?hl=en&
(I describe a technique for virtually zero-overhead garbage collection at
the end of this msg)
Post by Peter Duniho
Post by j***@comcast.net
and how separate processors use "preferred" ranges of memory (anyone
hear of that one before? at least the first one makes sense),
Never heard about the "preferred ranges" thing, but then there are
probably some esoteric things I have yet to run across. It sounds a
little like some of the issues I discovered doing multi-threaded code on a
Hyperthreaded CPU. With HT (and I believe with the new Intel "dual-core"
CPUs), the execution units share a cache. Because of the way that cache
lines are assigned to specific memory addresses, and because Windows by
default always starts each thread's stack at the same alignment, you can
really kill performance on a HT CPU if each thread is doing practically
the same thing (as is common with certain kinds of multi-threaded
applications :( ). Each thread winds up trashing the cache of the other
thread, killing the performance of both threads.
You can imagine my surprise the first time I ran into this, where I wound
up with my throughput being reduced by about a third running with two
threads versus one on a HT CPU, rather than getting some performance gain.
(In the end, it turned out that even optimizing for the HT case, my best
performance gain was under 20%...I decided it was better to not bother
trying to treat a HT CPU as a true dual-proc setup). Quite a
head-scratcher until I did some research.
A simple hack for this is to offset your worker threads' stacks via alloca.
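Something along these lines (the 128-byte stride and the modulus are guesses;
tune them to the cache geometry of the CPU in question):

```c
#include <alloca.h> /* glibc; MSVC would use _alloca from <malloc.h> */
#include <string.h>

/* Skew each worker thread's stack by a per-thread amount before it
   enters its main loop, so identical stack frames in different threads
   no longer alias to the same cache lines on an HT CPU. */
static void *worker(void *arg)
{
    size_t id = (size_t)arg; /* worker index, 0..N-1 */
    void *skew = alloca(((id % 8) + 1) * 128);
    memset(skew, 0, 1); /* touch it so the alloca is not optimized away */
    /* ... the real worker loop would run here, on the skewed stack ... */
    return NULL;
}
```

Because Windows (and most allocators) start every thread's stack at the same
alignment, this per-thread skew is usually enough to stop the two logical
processors from evicting each other's lines.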
Post by Peter Duniho
Post by j***@comcast.net
multithreading costs more than any benefit you'd gain, I start feeling
like I'm wasting my time. But then again, Microsoft uses different
binaries for single-processor machines vs. SMP machines, so they might
be on to something.
There are multiple reasons to use different binaries, but the biggest is
licensing. Customers pay more for SMP-enabled software, so it doesn't
make sense to hand them the SMP-enabled version if they've only paid for
the single-CPU version. Otherwise, you wouldn't really need multiple
binaries...you'd just need to check the number of CPUs and/or let the user
tell you whether to try to take advantage of multiple CPUs.
The fact that there are SMP-enabled versions of Microsoft software *ought*
to tell you that it's worthwhile writing multi-threaded code. :) You
might point that out to your co-workers. :)
The fact that they have a separate atomic operation implementation for
uni-processor (e.g., NO lock prefix) should tell you that the lock prefix is
expensive.
;)
http://groups.google.com/group/comp.programming.threads/msg/aeb7e857ef440520
("real" costs of lock-prefix)
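The idea can be illustrated with a build-time switch (the macro name is made
up; real systems often patch the lock prefix in place at boot instead of
shipping two binaries):

```c
/* On a uniprocessor x86, a single read-modify-write instruction cannot
   be torn by another CPU, so the expensive lock prefix (bus/cache-line
   lock) can be compiled out. SMP builds keep the fully atomic version. */
static long counter_fetch_add(volatile long *p, long n)
{
#ifdef UNIPROCESSOR_BUILD
    /* UP: plain RMW. (Preemption between these two C statements is
       still possible, so real UP code relies on a single instruction
       or disabled preemption - this is only a sketch of the idea.) */
    long old = *p;
    *p = old + n;
    return old;
#else
    /* SMP: full atomic RMW (GCC/Clang builtin; emits `lock xadd` on x86). */
    return __sync_fetch_and_add(p, n);
#endif
}
```

Measuring the two variants in a tight loop makes the cost of the lock prefix
(and its implied memory barrier) quite visible.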
:O
Post by Peter Duniho
As far as whether multi-threading costs more than any benefit you'd gain,
that depends entirely on the situation. It is true that there are many
applications for which multi-threading is unlikely to provide much benefit
at all. But it is also true that there are many applications for which
multi-threading provides a significant benefit, either performance-wise,
architecturally, or both.
As with any software paradigm or technology, multi-threading is no silver
bullet. It's a specific tool to be used in specific situations, where
it's practical and worthwhile. It's not always the right thing to do, and
neither is it always the wrong thing to do.
Agreed.
Any thoughts?