Discussion:
[fossil-users] The backoffice. Was: strange delay and messages during `fossil (uv) sync'
joerg van den hoff
2018-08-06 21:52:12 UTC
Here is the technical problem I am having difficulty with. My
current solution is (probably) what is causing the delays and errors
you are seeing. Suggestions from you or any mailing list reader on
how to solve this problem are appreciated.
I've read through your explanations below, but I'm afraid there is no
suggestion forthcoming from my side... I hardly understand the nature of the
problem. Especially, I am at a loss as to where all this backoffice machinery
interferes with CLI/ssh communication between two repos (or with cloning, see
below).

however, I would like to augment the list of problems I am experiencing
right now (this is with version [71260ba2] as proposed by you earlier
today):

* fossil uv sync no longer works correctly. I did try it this evening
(still accessing our server via ssh, but now over the wire from home
rather than from within the file system, if this makes any difference)
and was not able to get a new uv file that was added to the server
repo to sync to my laptop.

* I then tried to re-clone the whole repo. Cloning failed completely with:

multiple calls to backoffice_run() received: 113

server did not reply
Clone done, sent: 676 received: 364935 ip: {our_server}
server returned an error - clone aborted

I also want to emphasize that these problems all happen without ever
opening the fossil ui or trying to communicate via https. this is a pure
CLI/ssh scenario.

So all of this is rather bad. It is also the first time in years that I can
remember the tip of trunk not working essentially without any
serious problems.

I would appreciate being pointed to the most recent (or thereabouts)
checkin on trunk that does not exhibit these problems, in order to
get fossil back to working as nicely as it usually does.

If I can provide any further information or help otherwise in tracking
the present problems down, I would of course be happy to do so.

Overall, I really hope that not too much complexity is currently being added
to fossil that could lead to a situation where fossil no longer excels
in ease of use and stability/absence of problems or serious bugs, as it
thankfully has done for so many years.

thank you,
joerg
Basically: Fossil needs to run some operations in the background.
With the introduction of email notification, some computations need to
be performed in the background on the Fossil server. Right now, the
only background operation is sending email alerts to subscribers. In
the future, I'll probably want to add automatic synchronization to
peer servers and maybe other things.
I call the module that does background processing the "backoffice".
The name comes from the analogy of a business where an order is placed
at the counter in the front of the store, but the actual fulfillment
of the order takes place in the "back office", out of sight of the
customer.
(1) Only a single process should be running backoffice processing at a
time. There can be multiple processes serving up web pages, but there
is only a single process sending email notifications. Serialization of
backoffice processing is handled by making atomic updates to the entry
in the repository CONFIG table with name='backoffice'. (A sketch of
this idea follows the list below.)
(2) Backoffice processing should happen independently of webpage
generation. The results of an HTTP request should not need to wait
for some backoffice process to complete.
(3) Once one backoffice process completes, no other should run for
another 60 seconds or so. In other words, backoffice processes should
be rate limited.
(4) At least one backoffice process should be run within about 60
seconds after any successful HTTP request. A single backoffice
run can satisfy this requirement for any number of HTTP requests. So,
for example, if there is a flurry of 1000 HTTP requests and then 5 seconds
later a single backoffice run, then that one run of the
backoffice is sufficient for all 1000 HTTP requests.
(5) If there are no HTTP requests in the past minute or two, then there
should not be any backoffice processes running or waiting to run. The
idea here is that a website like sqlite.org has literally about a
hundred separate Fossil repositories. SQLite is very busy and it
would be ok to have a persistent backoffice process running for it.
But many of the other 99 repos are accessed much less frequently, and
we don't want 99 processes waiting around for activity for days on
end. Also, I want to be able to upgrade the Fossil executable by
simply overwriting it, and then have any backoffice processes
eventually (within a minute or two) switch over to using the new
executable.
(6) In keeping with the easy-to-setup goal of Fossil, running the
backoffice process should not require any special setup or
configuration by the site administrator. It should just work.
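
To make requirement (1), together with the 60-second rate limit in (3),
more concrete, here is a minimal sketch of one way such an atomic claim
could look. It is written against the plain SQLite C API rather than
Fossil's internal wrappers, and the value stored under name='backoffice'
(a unix-time expiration) as well as the CONFIG schema details shown here
are assumptions for illustration, not Fossil's actual format:

    /* Sketch only: claim the backoffice "lease" via an atomic update to
    ** the CONFIG table.  Returns 1 if this process should do backoffice
    ** work, 0 if another process holds the lease or the rate limit is
    ** still in effect. */
    #include <sqlite3.h>
    #include <time.h>

    #define BACKOFFICE_LEASE_SEC 60

    static int backoffice_try_claim(sqlite3 *db){
      sqlite3_stmt *pStmt = 0;
      sqlite3_int64 now = (sqlite3_int64)time(0);
      sqlite3_int64 expires = 0;
      int rc, gotLease = 0;

      /* BEGIN IMMEDIATE takes the write lock up front, so the
      ** read-check-write sequence below is atomic across processes. */
      if( sqlite3_exec(db, "BEGIN IMMEDIATE", 0, 0, 0)!=SQLITE_OK ) return 0;

      sqlite3_prepare_v2(db,
        "SELECT CAST(value AS INTEGER) FROM config WHERE name='backoffice'",
        -1, &pStmt, 0);
      if( pStmt && sqlite3_step(pStmt)==SQLITE_ROW ){
        expires = sqlite3_column_int64(pStmt, 0);
      }
      sqlite3_finalize(pStmt);

      if( expires < now ){
        /* Lease is absent or expired: claim it for the next 60 seconds. */
        rc = sqlite3_prepare_v2(db,
          "REPLACE INTO config(name,value,mtime) VALUES('backoffice',?1,?2)",
          -1, &pStmt, 0);
        if( rc==SQLITE_OK ){
          sqlite3_bind_int64(pStmt, 1, now + BACKOFFICE_LEASE_SEC);
          sqlite3_bind_int64(pStmt, 2, now);
          if( sqlite3_step(pStmt)==SQLITE_DONE ) gotLease = 1;
          sqlite3_finalize(pStmt);
        }
      }
      sqlite3_exec(db, "COMMIT", 0, 0, 0);
      return gotLease;
    }

The BEGIN IMMEDIATE transaction is what makes the check-and-update atomic
across processes, which is the property that requirement (1) relies on.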
The way the above is implemented now is that after every successful
HTTP request is completed, the process that generated the reply checks
to see if a backoffice process is needed, and if so it morphs itself
into a backoffice process. If there is already one backoffice process
running, then the new one might become the "on-deck" process that is
to run next. The HTTP reply has been sent, so you would think that
would be sufficient to avoid long delays in the user interface. But
somehow, keeping that process around for a little longer to do
backoffice processing is causing delays in the HTTP reply. A lot
depends on the web server and (surprisingly) the web browser.
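For concreteness, the flow just described looks roughly like the following
compilable schematic. Every helper name here is a hypothetical stand-in
(backoffice_try_claim() is stubbed out), not Fossil's real code:

    #include <stdio.h>

    /* Hypothetical placeholders, not Fossil's real functions. */
    static void send_http_reply(void){ printf("HTTP/1.0 200 OK\r\n\r\n"); }
    static int  backoffice_needed(void){ return 1; }
    static int  backoffice_try_claim(void){ return 1; }
    static void backoffice_work(void){ /* e.g. send pending email alerts */ }

    int main(void){
      send_http_reply();        /* the reply is finished first...           */
      fflush(stdout);
      if( backoffice_needed() && backoffice_try_claim() ){
        backoffice_work();      /* ...then this same process lingers and    */
      }                         /* morphs into the backoffice               */
      return 0;
    }

The point of the sketch is that backoffice_work() only starts after the
reply has been written, and yet the client-visible delays still show up in
practice.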
Anyhow, I'm looking for a better and more reliable way to implement
the backoffice such that it satisfies the six requirements listed
above.
Alternative design ideas are welcomed.
Richard Hipp
2018-08-06 22:12:17 UTC
Post by joerg van den hoff
this is a pure
CLI/ssh scenario.
The ssh transport works by invoking the same HTTP processing engine as
is used on a website, just on the far end of an ssh tunnel. It's all
the same under the covers.

That said, I do remember making some minor changes to the ssh
transport as part of the refactoring that is currently taking place.
Maybe something broke. Can you bisect back to the last release and
figure out where the problem was introduced? That would be a big
help.
Post by joerg van den hoff
overall, I really hope that not too much complexity is currently added
to fossil that could lead to a situation where fossil no longer excels
by ease of use and stability/absence of problems or serious bugs as it
has done now for so many years, thankfully.
There is a lot of refactoring going on right now, in an effort to
avoid introducing unnecessary complexity.
--
D. Richard Hipp
***@sqlite.org
joerg van den hoff
2018-08-07 10:04:38 UTC
question: the observation that it seemingly is related specifically to
repos holding uv files is unimportant/irrelevant? or does that have
implications for where to look?
This is not much help in debugging. But it is helpful to you in
bisecting, because it allows you to quickly and easily determine if a
particular version is working or not.
BTW, when doing the bisect, be sure to use the same Fossil version on
both ends of the SSH connection. We do not know on which end the
problem exists, so it is better to eliminate that variable.
then I do wish you much success with achieving that, of course.
Your assistance in bisecting the delay problem is a step in that
direction. Thanks.
this is what I've got:

bisect complete
1 BAD 2018-07-31 23:38:39 71260ba25e79f4aa
# error message, clone fails
5 BAD 2018-07-24 22:01:12 42d821a714d092a8
# error message, clone fails
7 BAD 2018-07-19 18:54:39 ac6657e2d35c974d
# clone hangs infinitely, no error message
9 BAD 2018-07-19 17:22:04 ada7ecde5bf91e18
# clone hangs infinitely, no error message
10 BAD 2018-07-19 16:27:51 f525b6d5e9dcb775
# clone hangs infinitely, no error message
11 BAD 2018-07-19 15:58:36 a32a92d227be5663 CURRENT
# clone hangs infinitely, no error message
8 GOOD 2018-07-19 15:52:43 aa17077eafbbad37
6 GOOD 2018-07-18 19:22:31 752ea432d1cf20fa
4 GOOD 2018-07-14 20:11:37 023ce4edde8ceb2d
3 GOOD 2018-06-23 15:48:49 aeb98be80f1d51f5
2 GOOD 2018-01-07 21:38:26 5b558bc76bb9fb5f

so the last good one is aa17077eafbbad37.

NOTE:
1.
I ran this with another repo not holding any uv files and still got the
errors. So my previous observation (that only uv sync is affected) seems
spurious, or at least not true for all repos.
2.
the bisect is true for the scenario: using ssh transport from a local
machine over the wire to the remote server. It does *not* happen (with
this repo) if I do the same while being logged in to the server holding
the remote repo. In the latter case the clone succeeds with tip of trunk...
3.
I have added comments in the bisect log which refer to the checkin
_above_ the comment. The point I want to stress is that the BAD
behaviour changes during the bisect: initially, right after the last GOOD
checkin, the clone just hangs and never comes back; later/more recently,
the error messages and the explicit "clone aborted/failed" messages appear.
4.
the bisect was performed on the (linux/ubuntu) server while keeping
the local (osx) laptop version of fossil constant (on 71260ba25e79f4aa,
I believe). So I think this proves that the problem is happening "on the
other side", not locally.

hth,
joerg
Richard Hipp
2018-08-07 13:53:35 UTC
Please build from the tip of the forum-v2 branch and let me know
whether or not it is working for you.
--
D. Richard Hipp
***@sqlite.org
Andy Bradford
2018-08-07 14:06:02 UTC
Please build from the tip of the forum-v2 branch and let me know
whether or not it is working for you.
I was just about to submit a similar fix but you beat me to it.

The reason this doesn't work the same as a traditional HTTP
connection is that the SSH transport uses a single connection for all
communications, whereas the HTTP transport uses one connection per
request.
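
Purely schematically, the difference looks like this; every name and
signature below is made up for illustration, including the stand-in for
backoffice_run():

    /* Schematic only; not the real transport code. */
    static int  read_one_request(void){ return 0; }  /* more data on the pipe? */
    static void process_and_reply(void){ }
    static void backoffice_run(void){ }              /* signature assumed */

    static void ssh_transport_loop(void){
      /* One process, one SSH connection, many requests. */
      while( read_one_request() ){
        process_and_reply();
        backoffice_run();   /* a second iteration means a second call in
                            ** the SAME process, unlike plain HTTP where
                            ** each request gets a fresh process */
      }
    }

    int main(void){ ssh_transport_loop(); return 0; }

The second iteration of that loop is exactly the "multiple calls to
backoffice_run()" condition seen in the clone error earlier in this thread.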

If we want to continue to have backoffice_run() work with an SSH client
we would need a counter to keep track of how many times a sync has
happened (there may already be one available but I would have to look at
the code).

Does it make sense to have backoffice_run() in the SSH transport? If
not, then your fix is apropos.

Thanks,

Andy
--
TAI64 timestamp: 400000005b69a76f
joerg van den hoff
2018-08-07 14:10:15 UTC
Post by Richard Hipp
Please build from the tip of the forum-v2 branch and let me know
whether or not it is working for you.
If the server machine is running that version, yes, it does indeed work. Thanks a lot for looking into
this issue ...

I see from the checkin message that the "fix" is achieved by "Disable the backoffice for SSH
client". Whatever the actual interaction between backoffice and ssh client: why did I see
the problem only when actually cloning from another machine, whereas a clone using ssh while being
logged in to the server machine still worked? In both cases the ssh communication should be the same, no?
Andy Bradford
2018-08-07 14:16:13 UTC
why did I see the problem only when actually cloning from another
machine, whereas a clone using ssh while being logged in to the server
machine still worked? in both cases the ssh communication should be
the same, no?
How were you cloning while logged in to the server machine? What
commands did you use while logged in to the server to clone?

Thanks,

Andy
--
TAI64 timestamp: 400000005b69a9d2
joerg van den hoff
2018-08-07 14:28:38 UTC
Post by Andy Bradford
why did I see the problem only when actually cloning from another
machine, whereas a clone using ssh while being logged in to the server
machine still worked? in both cases the ssh communication should be
the same, no?
How were you cloning while logged in to the server machine? What
commands did you use while logged in to the server to clone?
oops, you are right, sorry for spreading "misinformation":

actually the clone while being logged in to the server (via ssh ... ;)) was done via file system
transport, _not_ via ssh transport (I simply used `fossil clone /path/to/repo.fossil ...'). that of
course explains it, I presume...

again, sorry for the noise,

joerg
Richard Hipp
2018-08-07 14:21:34 UTC
Post by Andy Bradford
Does it make sense to have backoffice_run() in the SSH transport? If
not, then your fix is apropos.
Yes, we do want backoffice to run for SSH transport. My "fix" is
really a work-around, not a true fix. We need to devise a proper fix
for this, but I don't yet have a vision of how to do that. I'm in the
process of writing a long post on the "forumtest1" site now that tries
to discuss the problems and asks for ideas for a solution. I'll send a
follow-up email to this list when that post is up.
--
D. Richard Hipp
***@sqlite.org
Andy Bradford
2018-08-07 14:36:18 UTC
Post by Richard Hipp
yes, we do want backoffice to run for SSH transport.
Then I suggest that we simply make backoffice_run() smart enough to know
that it has already run once:

For example, perhaps instead of a panic here, backoffice_run() should
just return?

http://www.fossil-scm.org/index.html/artifact?udc=1&ln=225-227&name=d2cf5ceb442267a8
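
In other words, something along these lines (a hypothetical sketch of the
guard, not the actual code behind the artifact link above; the flag name and
the signature are assumptions):

    /* Sketch: remember that backoffice_run() was already invoked in this
    ** process and quietly return on later calls, instead of treating the
    ** second call as a fatal error. */
    static int backofficeAlreadyRun = 0;

    void backoffice_run(void){
      if( backofficeAlreadyRun ){
        return;             /* e.g. the second request on the same SSH pipe */
      }
      backofficeAlreadyRun = 1;
      /* ... do the actual backoffice processing here ... */
    }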

Thanks,

Andy
--
TAI64 timestamp: 400000005b69ae87
Richard Hipp
2018-08-07 15:02:03 UTC
Post by Andy Bradford
Then I suggest that we simply make backoffice_run() smart enough to know
that it has already run once.
That would delay the second HTTP request coming over the SSH connection.

I have the post up now at:

https://www.fossil-scm.org/forumtest1/forumpost/9ef9bd2d47

Please reply there. All suggestions are welcomed.
--
D. Richard Hipp
***@sqlite.org
Andy Bradford
2018-08-07 18:52:11 UTC
Post by Richard Hipp
That would delay the second HTTP request coming over the SSH connection.
When I suggested that, I didn't understand enough about the backoffice
design---specifically that it was a long-running task. After reading the
forum page you sent, I see that what you're looking for is a way to
fork() off a backend process and only have one of those operating at a
given time.
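
For what it's worth, the fork()-and-detach shape could look roughly like the
sketch below. This is only an illustration of the general pattern (a double
fork so the worker is fully detached and the parent does not wait on it); the
single-instance guarantee would still have to come from something like the
CONFIG-table claim discussed earlier in the thread:

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void backoffice_work(void){ /* e.g. send pending email alerts */ }

    /* Fork twice so the worker is detached and the request-handling
    ** parent can finish its HTTP reply without waiting on it. */
    void backoffice_fork_off(void){
      pid_t pid = fork();
      if( pid<0 ) return;            /* fork failed: just skip this round  */
      if( pid>0 ){
        waitpid(pid, 0, 0);          /* first child exits immediately, so  */
        return;                      /* the parent does not block here     */
      }
      setsid();                      /* first child: new session, then     */
      if( fork()!=0 ) _exit(0);      /* fork again and exit                */
      backoffice_work();             /* grandchild: the actual worker      */
      _exit(0);
    }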

Andy
--
TAI64 timestamp: 400000005b69ea80