Discussion:
Proposal: new fsfs.conf properties
Paul Hammant
2017-07-07 23:51:12 UTC
Permalink
1. compression-exempt-suffixes = mp3,mp4,jpeg

2. deltification-exempt-suffixes = mp3,mp4,jpeg
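As a sketch of how the proposal might look (the two *-exempt-suffixes names are the proposal itself, not existing options, and the section name is illustrative), the entries would sit in the repository's db/fsfs.conf next to the existing compression knob:

```ini
[fsfs]
# Existing knob (0 = none ... 9 = maximum); placement here is illustrative.
compression-level = 5

# Proposed: skip the compression attempt entirely for these suffixes.
compression-exempt-suffixes = mp3,mp4,jpeg

# Proposed: store full-texts rather than deltas for these suffixes.
deltification-exempt-suffixes = mp3,mp4,jpeg
```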

Regardless of the setting of 'compression-level', #1 above would mean certain
things can skip the compression attempt entirely. It must give up at a certain
point right?

Same for deltification re #2

I'm assuming debate happens now. Then y'all let me go off and diligently
file a Jira ticket for this feature request, or I slink away suitably
admonished...

- Paul
Pavel Lyalyakin
2017-07-11 12:36:10 UTC
Permalink
Hello Paul,
Post by Paul Hammant
1. compression-exempt-suffixes = mp3,mp4,jpeg
2. deltification-exempt-suffixes = mp3,mp4,jpeg
Regardless of the setting of 'compression-level', #1 above two mean certain things can skip the compression attempt. It must give up at a certain point right?
Same for deltification re #2
I'm assuming debate happens now. Then y'all let me go off and diligently file a Jira ticket for this feature request, or I slink away suitably admonished...
- Paul
I'm not sure whether this is going to be useful. How do you expect
these exemptions to help Subversion users? What's the story?
--
With best regards,
Pavel Lyalyakin
VisualSVN Team
Markus Schaber
2017-07-11 12:45:54 UTC
Permalink
Hi,


Best regards

Markus Schaber

CODESYS® a trademark of 3S-Smart Software Solutions GmbH

Inspiring Automation Solutions

3S-Smart Software Solutions GmbH
Dipl.-Inf. Markus Schaber | Product Development Core Technology
Memminger Str. 151 | 87439 Kempten | Germany
Tel. +49-831-54031-979 | Fax +49-831-54031-50

E-Mail: ***@codesys.com | Web: http://www.codesys.com | CODESYS store: http://store.codesys.com
CODESYS forum: http://forum.codesys.com

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915

This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received
this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorised copying, disclosure
or distribution of the material in this e-mail is strictly forbidden.
Markus Schaber
2017-07-11 12:53:45 UTC
Permalink
Hi,

(Sorry, it seems my previous message was sent _very_ prematurely :-( )
Post by Pavel Lyalyakin
Hello Paul,
Post by Paul Hammant
1. compression-exempt-suffixes = mp3,mp4,jpeg
2. deltification-exempt-suffixes = mp3,mp4,jpeg
Regardless of the setting of 'compression-level', #1 above two mean certain
things can skip the compression attempt. It must give up at a certain point
right?
Post by Paul Hammant
Same for deltification re #2
I'm assuming debate happens now. Then y'all let me go off and diligently
file a Jira ticket for this feature request, or I slink away suitably
admonished...
Post by Paul Hammant
- Paul
I'm not sure whether this is going to be useful. How do you expect these
exemptions to help Subversion users? What's the story?
I agree partly. Skipping compression for known "incompressible" formats like mpX, png or gif can come with performance benefits, saving some CPU cycles (see the recent performance discussions on this list).
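The cycle-saving argument is easy to demonstrate outside Subversion: a second compression pass over incompressible data buys nothing but still burns CPU. An illustrative Python sketch (not Subversion code; zlib stands in for whatever the server uses):

```python
import os
import time
import zlib

def compression_report(label, data):
    """Compress a blob once and report the size ratio and time taken."""
    start = time.perf_counter()
    compressed = zlib.compress(data, 5)  # mid-level, like a typical compression-level
    elapsed = time.perf_counter() - start
    ratio = len(compressed) / len(data)
    print(f"{label}: {len(data)} -> {len(compressed)} bytes "
          f"(ratio {ratio:.2f}, {elapsed * 1000:.1f} ms)")
    return ratio

# Text shrinks dramatically; random bytes (a stand-in for mp3/jpeg payloads)
# barely shrink or even grow -- yet both cost CPU time.
text_ratio = compression_report("text-like", b"the quick brown fox " * 50_000)
random_ratio = compression_report("mp3-like (random)", os.urandom(1_000_000))
```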

However, I'm not sure whether the same applies to deltification. There are editing tasks which do not reencode the whole image / movie, and they can profit from deltification, for example:

- Lossless rotation / cropping of JPEG images.
- Editing / stripping the EXIF data of JPEG images.
- Embedding / dropping the preview thumbnail of JPEG images.
- Lossless MP3 editing (e.g. via mp3DirectCut).
- Editing MP3 metadata (e.g. song title).
(... and more ...)

In all those cases, skipping deltification can drastically increase storage use.
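To make the storage argument concrete, here is a toy experiment (plain Python, not svndiff): change only a small "tag" region of a large binary blob and compare a naive delta's size with the full-text size the exemption would force.

```python
import os

SIZE = 1024 * 1024  # simulate a 1 MB media file

original = bytearray(os.urandom(SIZE))
edited = bytearray(original)
edited[0:20] = b"New Song Title......"  # metadata-only edit, like retagging an MP3

# Naive delta: store just the span of bytes that actually differ.
changed = [i for i in range(SIZE) if original[i] != edited[i]]
delta_size = (max(changed) - min(changed) + 1) + 16  # payload plus a small header
fulltext_size = len(edited)

print(f"delta ~{delta_size} bytes vs. full-text {fulltext_size} bytes")
```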


Best regards

Markus Schaber

Paul Hammant
2017-07-11 13:20:52 UTC
Permalink
Markus - I've read your section on deltification, and from what you wrote you
seem to be simultaneously in favor of and against it (the file-suffix
exclusion idea). Can you re-read and clarify?

Thanks,

- Paul
Post by Markus Schaber
Hi,
(Sorry, it seems my previous message was sent _very_ prematurely :-(
Post by Pavel Lyalyakin
Hello Paul,
Post by Paul Hammant
1. compression-exempt-suffixes = mp3,mp4,jpeg
2. deltification-exempt-suffixes = mp3,mp4,jpeg
Regardless of the setting of 'compression-level', #1 above would mean certain
things can skip the compression attempt. It must give up at a certain point
right?
Same for deltification re #2
I'm assuming debate happens now. Then y'all let me go off and diligently
file a Jira ticket for this feature request, or I slink away suitably
admonished...
- Paul
I'm not sure whether this is going to be useful. How do you expect these
exemptions to help Subversion users? What's the story?
I agree partly. Skipping compression for known "incompressible" formats
like mpX, png or gif can come with performance benefits, saving some CPU
cycles (see the recent performance disccussions on this list)
However, I'm not sure whether the same applies to deltification. There
are editing tasks which do not reencode the whole image / movie, and they
can profit from deltification, for example:
- Lossless rotation / cropping of jpeg images.
- Editing / stripping the EXIF data of jpeg images.
- Embedding / dropping the preview thumbnail of jpeg images.
- Lossless MP3 editing (e. G. via mp3DirectCut).
- Editing MP3 meta data (e. G. Song Title)
(... and more...)
In all those cases, skipping deltification can drastically increase storage.
Markus Schaber
2017-07-11 13:39:56 UTC
Permalink
Hi, Paul,
Markus - I've read your section on deltification and I can see evidence in what you wrote that you're concurrently in favor of and against it (the file-suffix exclusion idea). Can you re-read and clarify?
Post by Markus Schaber
I agree partly. Skipping compression for known "incompressible" formats like mpX, png or gif can come with performance benefits, saving some CPU cycles (see the recent performance disccussions on this list)
- Lossless rotation / cropping of jpeg images.
- Editing / stripping the EXIF data of jpeg images.
- Embedding / dropping the preview thumbnail of jpeg images.
- Lossless MP3 editing (e. G. via mp3DirectCut).
- Editing MP3 meta data (e. G. Song Title)
(... and more...)
In all those cases, skipping deltification can drastically increase storage.
To sum it up:

I expect significant benefits in some use cases by skipping the compression, thus I'm +1 if benchmarks prove it's worth the effort.

I see the danger of drastically increased bandwidth and storage size (transferring/storing the whole MP3 instead of just some changed metadata bytes) in some common use cases when deltification is skipped. Thus, I'm skeptical (count it as -0), and I'd suggest benchmarking those cases before implementation, plus clear documentation of the possible negative effects if it's implemented.


Best regards

Markus Schaber

Branko Čibej
2017-07-11 19:11:58 UTC
Permalink
Post by Markus Schaber
Hi, Paul,
Markus - I've read your section on deltification and I can see evidence in what you wrote that you're concurrently in favor of and against it (the file-suffix exclusion idea). Can you re-read and clarify?
Post by Markus Schaber
I agree partly. Skipping compression for known "incompressible" formats like mpX, png or gif can come with performance benefits, saving some CPU cycles (see the recent performance disccussions on this list)
- Lossless rotation / cropping of jpeg images.
- Editing / stripping the EXIF data of jpeg images.
- Embedding / dropping the preview thumbnail of jpeg images.
- Lossless MP3 editing (e. G. via mp3DirectCut).
- Editing MP3 meta data (e. G. Song Title)
(... and more...)
In all those cases, skipping deltification can drastically increase storage.
I expect significant benefits in some use cases by skipping the compression, thus I'm +1 if benchmarks prove it's worth the effort.
I see the danger of drastically increased bandwith and storage size (transferring/storing the whole mp3 instead of just some changed meta data bytes) in some common use cases when deltification is skipped. Thus, I'm skeptical (count it as -0), and I'd kindly suggest to do some benchmarks for those cases before implementation, and clear documentation of the possible negative effects if it's implemented.
So, first of all, if this is server-side configuration, it has _no_
effect on the client so the client will continue to send (compressed)
deltas. This will have exactly zero effect on bandwidth or client CPU
utilization.

Another issue I have with the proposal is the idea to use file suffixes.
That's usually the wrong way to go about things (case in point: Windows
does it, with disastrous results). It's much better to determine file
format by inspection, as libmagic does, for example. We already have
optional support for libmagic in the client (to set svn:mime-type).

-- Brane
Paul Hammant
2017-07-11 19:52:59 UTC
Permalink
I'm perfectly happy for the solution to be mime-type based.

Maybe we can take the mime-type to *suffix table* from Apache itself to do
the translation :- https://svn.apache.org/repos/asf/httpd/httpd/trunk/docs/conf/mime.types :-P
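For illustration, that file uses a simple line format ('type/subtype' followed by whitespace-separated extensions, with '#' comments), so inverting it into a suffix-to-mime-type table takes only a few lines; a sketch over a handful of sample entries:

```python
SAMPLE = """\
# Comment lines and blank lines are skipped.
application/zip                 zip
audio/mpeg                      mpga mp2 mp2a mp3 m2a m3a
image/jpeg                      jpeg jpg jpe
video/mp4                       mp4 mp4v mpg4
"""

def suffix_to_mime(mime_types_text):
    """Build a {suffix: mime-type} dict from Apache mime.types-style text."""
    table = {}
    for line in mime_types_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        mime, *suffixes = line.split()
        for suffix in suffixes:
            table[suffix] = mime
    return table

table = suffix_to_mime(SAMPLE)
print(table["mp3"])   # audio/mpeg
print(table["jpeg"])  # image/jpeg
```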

I used it (implicitly) in a Subversion-backed wysi-wiki *ten years ago* -
*yeesh!* (26 seconds of your time: Svn, DAV, auto-increment, a Java web app
to add on a site experience via Sitemesh, and Mozilla's SeaMonkey editing a
page and browsing it).
Stefan Sperling
2017-07-11 20:50:37 UTC
Permalink
Post by Branko Čibej
Another issue I have with the proposal is the idea to use file suffixes.
That's usually the wrong way to go about things (case in point: Windows
does it, with didastrous results). It's much better to determine file
format by inspection, such as, e.g., libmagic does. We already have
optional support for libmagic in the client (to set svn:mime-type).
I would not feel comfortable having the server parse arbitrary data with
libmagic. The libmagic code is not very safe to run on untrusted input.
I have seen libmagic crash my svn client on several occasions even on
text files I wrote.

At the client side it's a bit less dangerous because users have already
told svn to add the files in question to version control, and a libmagic
exploit running on the client machine can do less harm than a server-side one.

Granted, commits are usually authenticated. If we did this we should at
least make really sure that no unauthenticated access can trigger this code.
Ideally, it would be sandboxed somehow if we started using it on the server.
Branko Čibej
2017-07-12 10:09:27 UTC
Permalink
Post by Stefan Sperling
Post by Branko Čibej
Another issue I have with the proposal is the idea to use file suffixes.
That's usually the wrong way to go about things (case in point: Windows
does it, with didastrous results). It's much better to determine file
format by inspection, such as, e.g., libmagic does. We already have
optional support for libmagic in the client (to set svn:mime-type).
I would not feel comfortable having the server parse arbitrary data with
libmagic. The libmagic code is not very safe to run on untrusted input.
I have seen libmagic crash my svn client on several occasions even on
text files I wrote.
At the client side it's a bit less dangerous because users have already
told svn to add the files in question to version control, and a libmagic
exploit running on the client machine can do less harm than a server-side one.
Granted, commits are usually authenticated. If we did this we should at
least make really sure that no unauthenticated access can trigger this code.
Ideally, it would be sandboxed somehow if we started using it on the server.
I wasn't really proposing to use libmagic on the server. My point is
that instead of using file name suffixes (which the compression and
deltification code don't know about), we'd do some sort of inspection
instead. Detecting ZIP files, or gzip/bzip2/xz-compressed files, etc.,
is fairly easy just from looking at a few bytes of headers. Same goes
for most image and video formats.
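A sketch of that kind of header sniffing (the magic numbers below are the well-known ones; the function itself is illustrative, not Subversion code):

```python
# A few well-known magic numbers; real detection would cover more formats.
SIGNATURES = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"PK\x03\x04", "zip"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]

def looks_precompressed(header: bytes) -> bool:
    """True if the first few bytes match a known compressed/media signature."""
    return any(header.startswith(magic) for magic, _ in SIGNATURES)

print(looks_precompressed(b"\xff\xd8\xff\xe0" + b"\x00" * 12))  # JPEG header
print(looks_precompressed(b"the quick brown fox"))              # plain text
```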

Of course, one could always concoct a file that would trick such
inspection, but at least that's marginally harder to do than committing a
large text file full of spaces and calling it 'spaceinvaders.jpg'. :)

Random binary data is harder to detect, but we already deal with that
after the fact by using the plain text if the delta is too large.

-- Brane
Branko Čibej
2017-07-12 10:17:06 UTC
Permalink
Oh and another thing: I'd prefer to _not_ make such a feature
configurable with yet another knob. We have too many knobs ... either
make it safe to use and always-on, or forget about it. IMO.

-- Brane
Johan Corveleyn
2017-07-12 10:24:10 UTC
Permalink
Post by Branko Čibej
I wasn't really proposing to use libmagic on the server. My point is
that instead of using file name suffixes (which the compression and
deltification code don't know about), we'd do some sort of inspection
instead. Detecting ZIP files, or gzip/bzip2/xz-compressed files, etc.,
is fairly easy just from looking at a few bytes of headers. Same goes
for most image and video formats.
Of course, one could always concoct a file that would trick such
inspection, but at least that's marginally harder to do than commit a
large text file full of spaces and calling it 'spaceinvaders.jpg'. :)
Random binary data is harder to detect, but we already deal with that
after the fact by using the plain text if the delta is too large.
We could also make the process driven by a "client-side suggestion".
Driving it from the client-side also gives us the possibility to
eliminate the client-side deltification overhead.

I.e. the client has some logic (libmagic, suffix, looking at the first
100 bytes, ...) to determine that it's not worth deltifying and/or
compressing. It doesn't do deltification itself, and lets the server
know that it probably shouldn't either (or the server sees that the
client hasn't deltified, so accepts the content automatically as
"non-deltifiable / non-compressible"?).

Maybe needs a server-side config setting to make it respect or ignore
the client-side suggestion.
--
Johan
Branko Čibej
2017-07-12 10:27:55 UTC
Permalink
Post by Johan Corveleyn
We could also make the process driven by a "client-side suggestion".
Driving it from the client-side also gives us the possibility to
eliminate the client-side deltification overhead.
I.e. the client has some logic (libmagic, suffix, looking at the first
100 bytes, ...) to determine that it's not worth deltifying and/or
compressing. It doesn't do deltification itself, and lets the server
know that it probably shouldn't either (or the server sees that the
client hasn't deltified, so accepts the content automatically as
"non-deltifiable / non-compressable"?).
Maybe needs a server-side config setting to make it respect or ignore
the client-side suggestion.
That's such an easy way to make a malicious client explode the
repository size. And ... there's really no reason to complicate this. The
server's storage layer can cheaply do all the necessary checks without
having to believe the client, and without adding yet another
(dangerous!) config knob.

-- Brane
Johan Corveleyn
2017-07-12 10:33:18 UTC
Permalink
Post by Branko Čibej
That's such an easy way to make a malicious client explode the
repository size. And ... there's realy no reason to complicate. The
server's storage layer can cheaply do all the necessary checks without
having to believe the client, and without adding yet another
(dangerous!) config knob.
Yes, well in any case allowing this by server-side inspection will
also open up possibilities for blowing up the repository by a
malicious client.

In fact, coupling it with "client also non-deltifies" forces the
client to also send those huge files over the wire, making it a little
bit more difficult to DoS the server by blowing it up. If the client
can still deltify (only sending a few bytes), but trick the server
into storing those as full-texts, the attack can be more powerful I
guess.
--
Johan
Markus Schaber
2017-07-12 11:43:10 UTC
Permalink
Hi,
Post by Branko Čibej
That's such an easy way to make a malicious client explode the
repository size. And ... there's realy no reason to complicate. The
server's storage layer can cheaply do all the necessary checks without
having to believe the client, and without adding yet another
(dangerous!) config knob.
Yes, well in any case allowing this by server-side inspection will also open
up possibilities for blowing up the repository by a malicious client.
A malicious user can always "explode" the server by just uploading/overwriting huge random files. Using svnmucc and a unix pipe, he doesn't even need a local file or working copy for that.

Thus, I think listening to a client hint in general will not open a completely new security hole. SVN repositories are a kind of data storage, and we cannot prevent users from abusing it by storing data...
In fact, making it coupled with "client also non-deltifies" forces the client
to also send those huge files over the wire, making it a little bit more
difficult to DoS the server by blowing it up. If the client can still deltify
(only sending a few bytes), but trick the server into storing those as full-
texts, the attack can be more powerful I guess.
Yes, I think allowing deltification for the client while storing non-deltified on the server amplifies the possible attack, so we should be careful.

Could the server use the already pre-deltified and -compressed representation coming from the client, without compressing and re-deltifying itself (but still verifying it, of course)?

On the other hand, I'd also hesitate to automatically skip deltification and compression just because the client delivers uncompressed or non-deltified content. This would effectively disable deltification and compression for svnmucc, DAV autoversioning and maybe some other use cases.


Best regards

Markus Schaber

Paul Hammant
2017-07-12 19:50:38 UTC
Permalink
You know, in all seriousness I think the (empty by default) list of
exempted file suffixes is the best way forward. If suffixes are good
enough for Apache itself to use (link provided earlier), they're good enough
in this scenario on the server side of Svn. If the function in question
doesn't know the file name, then I think that param should be added to the
function's args (and backwards through all the methods in the stack until
it reaches the place where the resource name is known).

I'd be cranking up JetBrains' CLion myself to do the refactoring and
giving you *cough* a pull request, but I've not done any C since 1991.
Anyone in NYC want to bring me up to speed with the build, and acclimate
me to the source? Or by ScreenHero (will send invites).

- Paul
Post by Branko Čibej
Post by Branko Čibej
Post by Johan Corveleyn
Post by Branko Čibej
Post by Stefan Sperling
Post by Branko Čibej
Another issue I have with the proposal is the idea to use file
suffixes.
Windows
Post by Branko Čibej
Post by Johan Corveleyn
Post by Branko Čibej
Post by Stefan Sperling
Post by Branko Čibej
does it, with didastrous results). It's much better to determine file
format by inspection, such as, e.g., libmagic does. We already have
optional support for libmagic in the client (to set svn:mime-type).
I would not feel comfortable having the server parse arbitrary data
with
Post by Branko Čibej
Post by Johan Corveleyn
Post by Branko Čibej
Post by Stefan Sperling
libmagic. The libmagic code is not very safe to run on untrusted
input.
Post by Branko Čibej
Post by Johan Corveleyn
Post by Branko Čibej
Post by Stefan Sperling
I have seen libmagic crash my svn client on several occasions even on
text files I wrote.
At the client side it's a bit less dangerous because users have
already
Post by Branko Čibej
Post by Johan Corveleyn
Post by Branko Čibej
Post by Stefan Sperling
told svn to add the files in question to version control, and a
libmagic
Post by Branko Čibej
Post by Johan Corveleyn
Post by Branko Čibej
Post by Stefan Sperling
exploit running on the client machine can do less harm than a
server-side one.
Post by Branko Čibej
Post by Johan Corveleyn
Post by Branko Čibej
Post by Stefan Sperling
Granted, commits are usually authenticated. If we did this we should
at
Post by Branko Čibej
Post by Johan Corveleyn
Post by Branko Čibej
Post by Stefan Sperling
least make really sure that no unauthenticated access can trigger
this code.
Post by Branko Čibej
Post by Johan Corveleyn
Post by Branko Čibej
Post by Stefan Sperling
Ideally, it would be sandboxed somehow if we started using it on the
server.
Post by Branko Čibej
Post by Johan Corveleyn
Post by Branko Čibej
I wasn't really proposing to use libmagic on the server. My point is
that instead of using file name suffixes (which the compression and
deltification code don't know about), we'd do some sort of inspection
instead. Detecting ZIP files, or gzip/bzip2/xz-compressed files, etc.,
is fairly easy just from looking at a few bytes of headers. Same goes
for most image and video formats.
Of course, one could always concoct a file that would trick such
inspection, but at least that's marginally harder to do than commit a
large text file full of spaces and calling it 'spaceinvaders.jpg'. :)
Random binary data is harder to detect, but we already deal with that
after the fact by using the plain text if the delta is too large.
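For illustration, header sniffing along those lines is only a few lines of code. This is a rough sketch, not anything from the SVN tree, and the signature table is illustrative rather than exhaustive:

```python
# Sketch of header-based detection of already-compressed formats,
# as an alternative to file-name suffixes. Not actual FSFS code;
# the signature list below is illustrative only.

# Magic-byte prefixes of formats that rarely benefit from
# recompression or deltification.
INCOMPRESSIBLE_SIGNATURES = [
    b"\x50\x4b\x03\x04",        # ZIP (also docx/jar/apk)
    b"\x1f\x8b",                # gzip
    b"BZh",                     # bzip2
    b"\xfd7zXZ\x00",            # xz
    b"\xff\xd8\xff",            # JPEG
    b"\x89PNG\r\n\x1a\n",       # PNG
    b"ID3",                     # MP3 with an ID3 tag
]

def looks_precompressed(first_bytes):
    """Return True if the first few bytes match a known
    compressed/media container signature."""
    return any(first_bytes.startswith(sig)
               for sig in INCOMPRESSIBLE_SIGNATURES)
```

A check like this needs only the first dozen or so bytes of the rep, so it never parses untrusted data the way libmagic would.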
We could also make the process driven by a "client-side suggestion".
Driving it from the client-side also gives us the possibility to
eliminate the client-side deltification overhead.
I.e. the client has some logic (libmagic, suffix, looking at the first
100 bytes, ...) to determine that it's not worth deltifying and/or
compressing. It doesn't do deltification itself, and lets the server
know that it probably shouldn't either (or the server sees that the
client hasn't deltified, so accepts the content automatically as
"non-deltifiable / non-compressable"?).
Maybe needs a server-side config setting to make it respect or ignore
the client-side suggestion.
That's such an easy way to make a malicious client explode the
repository size. And ... there's really no reason to complicate. The
server's storage layer can cheaply do all the necessary checks without
having to believe the client, and without adding yet another
(dangerous!) config knob.
Yes, well in any case allowing this by server-side inspection will
also open up possibilities for blowing up the repository by a
malicious client.
In fact, coupling it with "client also non-deltifies" forces the
client to also send those huge files over the wire, making it a little
bit more difficult to DoS the server by blowing it up. If the client
can still deltify (only sending a few bytes), but trick the server
into storing those as full-texts, the attack can be more powerful I
guess.
--
Johan
Branko Čibej
2017-07-13 06:21:10 UTC
Permalink
Post by Paul Hammant
You know, in all seriousness I think the (empty by default) list of
exempted file suffixes is the best way forward. If suffixes are good
enough for Apache itself to use (link provided earlier), it is good
enough in this scenario on the server side of Svn.
The cases are completely different. Yes, httpd (optionally) uses file
name suffixes to set the Content-Type header, but the file names and
their content are completely controlled by the server administrator.

That's not the case with Subversion; quite the opposite.
Post by Paul Hammant
If the function in question doesn't know the file name, then that
param should be added to the function's args (and backwards through all
the methods in the stack until it's reached the place where the
resource name was known).
Mmm ... interesting proposition. Also huge -1 because it'd be a really
awesome abstraction leak. :)

-- Brane
Paul Hammant
2017-07-13 10:34:17 UTC
Permalink
Post by Branko Čibej
Mmm ... interesting proposition. Also huge -1 because it'd be a really
awesome abstraction leak. :)
Is there any way to implement the (reasonable in my opinion) feature
request, without transgressing on programming standards? I think turning
off all deltification is a bit brute-force, when I only really want it
turned off for files I'd sync that would be over 50MB (.mp4 files and the like).

Is that the alternate request?

1. compression-exempt-size-limit = 50MB
2. deltification-exempt-size-limit = 100KB
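If size limits like those two (hypothetical) settings existed, the server-side decision would be simple; a sketch, with made-up option names matching the proposal:

```python
# Sketch of the decision the two proposed settings imply.
# Neither 'compression-exempt-size-limit' nor
# 'deltification-exempt-size-limit' exists in fsfs.conf today.

def parse_size(value):
    """Parse '50MB' / '100KB' style values into bytes."""
    units = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
    for suffix, mult in units.items():
        if value.upper().endswith(suffix):
            return int(value[:-len(suffix)]) * mult
    return int(value)  # bare number: already bytes

def should_compress(size_bytes, exempt_limit="50MB"):
    return size_bytes < parse_size(exempt_limit)

def should_deltify(size_bytes, exempt_limit="100KB"):
    return size_bytes < parse_size(exempt_limit)
```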

Or am I really wanting Svn's backend compression and deltification to be
out of band ?

1. compression-strategy = defer-to-idle-time-even-if-days-later
2. deltification-strategy = defer-to-idle-time-even-if-days-later

- Paul
Johan Corveleyn
2017-07-13 12:27:42 UTC
Permalink
Post by Paul Hammant
Post by Branko Čibej
Mmm ... interesting proposition. Also huge -1 because it'd be a really
awesome abstraction leak. :)
Is there any way to implement the (reasonable in my opinion) feature
request, without transgressing on programming standards? I think turning
off all deltification is a bit brute-force, when I only really want it
turned off for files I'd sync that would be over 50MB (.mp4 files and the like).
Is that the alternate request?
1. compression-exempt-size-limit = 50MB
2. deltification-exempt-size-limit = 100KB
Or am I really wanting Svn's backend compression and deltification to be out
of band ?
1. compression-strategy = defer-to-idle-time-even-if-days-later
2. deltification-strategy = defer-to-idle-time-even-if-days-later
- Paul
It occurred to me that there are two more fsfs.conf settings that you
can try to reduce the cpu cost on the server:

- max-deltification-walk: try setting that to 1 or even to 0 ("Very
small values may be useful in repositories that are dominated by
large, changing binaries. Should be a power of two minus 1. A value of
0 will effectively disable deltification"). ==> i.e. you can use this
to test what happens if you disable deltification entirely.

- max-linear-deltification (default=16): try setting that to a low
value, perhaps 1.

Also: for compression maybe you should try "compression-level=1" too,
to see if it gives you a good tradeoff between "not too much wasted
time for large binaries, and still useful if someone commits text
files". If that tradeoff is acceptable, maybe that's good enough for
now, without the need for the new features you've been proposing.

From fsfs.conf:
[[[
### During commit, the server may need to walk the whole change history
### of a given node to find a suitable deltification base. This linear
### process can impact commit times, svnadmin load and similar operations.
### This setting limits the depth of the deltification history. If the
### threshold has been reached, the node will be stored as fulltext and a
### new deltification history begins.
### Note, this is unrelated to svn log.
### Very large values rarely provide significant additional savings but
### can impact performance greatly - in particular if directory
### deltification has been activated. Very small values may be useful in
### repositories that are dominated by large, changing binaries.
### Should be a power of two minus 1. A value of 0 will effectively
### disable deltification.
### For 1.8, the default value is 1023; earlier versions have no limit.
# max-deltification-walk = 1023
###
### The skip-delta scheme used by FSFS tends to repeatedly store redundant
### delta information where a simple delta against the latest version is
### often smaller. By default, 1.8+ will therefore use skip deltas only
### after the linear chain of deltas has grown beyond the threshold
### specified by this setting.
### Values up to 64 can result in some reduction in repository size for
### the cost of quickly increasing I/O and CPU costs. Similarly, smaller
### numbers can reduce those costs at the cost of more disk space. For
### rarely read repositories or those containing larger binaries, this may
### present a better trade-off.
### Should be a power of two. A value of 1 or smaller will cause the
### exclusive use of skip-deltas (as in pre-1.8).
### For 1.8, the default value is 16; earlier versions use 1.
# max-linear-deltification = 16
]]]
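For the experiment, those knobs can be flipped with a few lines of configparser. Section and option names below are assumed from a 1.9-era fsfs.conf; verify against your own file before writing:

```python
# Sketch: apply the experimental settings suggested above to a
# repository's fsfs.conf. The [deltification] section name is an
# assumption based on a 1.9-era fsfs.conf layout.
import configparser

def tune_fsfs_conf(path):
    cfg = configparser.ConfigParser()
    cfg.read(path)
    if not cfg.has_section("deltification"):
        cfg.add_section("deltification")
    # 0 effectively disables deltification, per the comment block.
    cfg.set("deltification", "max-deltification-walk", "0")
    cfg.set("deltification", "max-linear-deltification", "1")
    cfg.set("deltification", "compression-level", "1")
    with open(path, "w") as f:
        cfg.write(f)
```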
--
Johan
Daniel Shahaf
2017-07-13 14:07:03 UTC
Permalink
Post by Paul Hammant
Or am I really wanting Svn's backend compression and deltification to be
out of band ?
1. compression-strategy = defer-to-idle-time-even-if-days-later
2. deltification-strategy = defer-to-idle-time-even-if-days-later
That's indeed conceivable: compression/deltification could, in principle,
be deferred to 'svnadmin pack' time, so a commit would create PLAIN reps
(or DELTA reps against whatever base the client happened to choose) and
a subsequent 'svnadmin pack' would convert them to skip-deltas DELTA reps.

Wouldn't even require a format bump :-).
Branko Čibej
2017-07-13 18:11:17 UTC
Permalink
Post by Daniel Shahaf
Post by Paul Hammant
Or am I really wanting Svn's backend compression and deltification to be
out of band ?
1. compression-strategy = defer-to-idle-time-even-if-days-later
2. deltification-strategy = defer-to-idle-time-even-if-days-later
That's indeed conceivable: compression/deltification could, in principle,
be deferred to 'svnadmin pack' time, so a commit would create PLAIN reps
(or DELTA reps against whatever base the client happened to choose) and
a subsequent 'svnadmin pack' would convert them to skip-deltas DELTA reps.
Wouldn't even require a format bump :-)
I agree, I've been thinking for a long time that compression and/or
deltification is a waste of time during commit. I'm not sure we'd really
want to defer it to 'svnadmin pack'; but, e.g., spawning off a daemon
process to post-process the commit might not be a completely silly idea.
Especially as we're not exactly good at using up all available cores on
the server.
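A post-commit hook can already approximate the spawn-a-daemon idea by detaching a worker and returning immediately; a minimal sketch, where the worker's actual job (deferred deltification/compression) is a hypothetical placeholder:

```python
# Sketch of detaching a per-commit worker from a post-commit hook so
# the commit response isn't delayed. The worker body here is a stub;
# real post-processing of the revision is hypothetical.
import subprocess
import sys

def spawn_post_processor(repo_path, revision):
    # start_new_session detaches the child from the hook's process
    # group, so it keeps running after the hook exits (POSIX only).
    return subprocess.Popen(
        [sys.executable, "-c",
         "import sys; print('would post-process', sys.argv[1], sys.argv[2])",
         repo_path, str(revision)],
        start_new_session=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```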

-- Brane
Paul Hammant
2017-07-13 18:24:45 UTC
Permalink
Is Mod_Dav_Svn a TSR* thing? Or wholly re-entered for each request that
would require it?

* https://en.wikipedia.org/wiki/Terminate_and_stay_resident_program
Post by Branko Čibej
Post by Daniel Shahaf
Post by Paul Hammant
Or am I really wanting Svn's backend compression and deltification to be
out of band ?
1. compression-strategy = defer-to-idle-time-even-if-days-later
2. deltification-strategy = defer-to-idle-time-even-if-days-later
That's indeed conceivable: compression/deltification could, in principle,
be deferred to 'svnadmin pack' time, so a commit would create PLAIN reps
(or DELTA reps against whatever base the client happened to choose) and
a subsequent 'svnadmin pack' would convert them to skip-deltas DELTA
reps.
Post by Daniel Shahaf
Wouldn't even require a format bump :-)
I agree, I've been thinking for a long time that compression and/or
deltification is a waste of time during commit. I'm not sure we'd really
want to defer it to 'svnadmin pack'; but, e.g., spawning off a daemon
process to post-process the commit might not be a completely silly idea.
Especially as we're not exactly good at using up all available cores on
the server.
-- Brane
Branko Čibej
2017-07-13 18:28:59 UTC
Permalink
Post by Paul Hammant
Is Mod_Dav_Svn a TSR* thing? Or wholly re-entered for each request
that would require it?
* https://en.wikipedia.org/wiki/Terminate_and_stay_resident_program
Well since it doesn't run on DOS, it's not TSR :)

mod_dav_svn is a shared library loaded into the httpd process, it's not
a process itself. How long httpd's request handler process (or thread)
lives depends on a number of things, e.g., which MPM is in use and how
it's configured.

-- Brane
Post by Paul Hammant
Post by Daniel Shahaf
Post by Paul Hammant
Or am I really wanting Svn's backend compression and
deltification to be
Post by Daniel Shahaf
Post by Paul Hammant
out of band ?
1. compression-strategy = defer-to-idle-time-even-if-days-later
2. deltification-strategy =
defer-to-idle-time-even-if-days-later
Post by Daniel Shahaf
That's indeed conceivable: compression/deltification could, in
principle,
Post by Daniel Shahaf
be deferred to 'svnadmin pack' time, so a commit would create
PLAIN reps
Post by Daniel Shahaf
(or DELTA reps against whatever base the client happened to
choose) and
Post by Daniel Shahaf
a subsequent 'svnadmin pack' would convert them to skip-deltas
DELTA reps.
Post by Daniel Shahaf
Wouldn't even require a format bump :-)
I agree, I've been thinking for a long time that compression and/or
deltification is a waste of time during commit. I'm not sure we'd really
want to defer it to 'svnadmin pack'; but, e.g., spawning off a daemon
process to post-process the commit might not be a completely silly idea.
Especially as we're not exactly good at using up all available cores on
the server.
-- Brane
Paul Hammant
2017-07-13 18:42:47 UTC
Permalink
A Janitor process that runs independently of Apache isn't a *paradigm shift*
but it is a huge amount of work to production harden, even if it was done
in today's new hotness, Go or Rust. As is the minefield of maintaining a
persistent list of things yet to do post-response. I'm on the Fossil
list too, and I see infrequent emails about how SQLite can barf in some
situations.

A cron-type thing that's allowed to bisect commits for the repo looking for
the earliest one that has a property set (deferred-delta=true), and
removing that property when it's made the delta?
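The scan half of that cron idea is simple; a sketch, with the probe abstracted out. In practice it would shell out to `svn propget --revprop`, and `deferred-delta` is a hypothetical property that nothing sets today:

```python
# Sketch of the cron-driven scan for the hypothetical
# 'deferred-delta' revision property. The probe is injected so this
# stays testable; a real probe would run something like
#   svn propget --revprop -r REV deferred-delta REPO_URL
# and the actual re-deltification step is waved away entirely.

def find_earliest_deferred(head, is_deferred):
    """Linear scan for the earliest revision still flagged
    deferred-delta=true; returns None if there is none."""
    for rev in range(1, head + 1):
        if is_deferred(rev):
            return rev
    return None
```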

- Paul
Branko Čibej
2017-07-13 18:46:07 UTC
Permalink
A Janitor process that runs independently of Apache isn't a /paradigm
shift/ but it is a huge amount of work to production harden, even if
it was done in today's new hotness, Go or Rust. As is the minefield of
maintaining a persistent list of things yet to do post-response. I'm on
the Fossil list too, and I see infrequent emails about how SQLite can
barf in some situations.
Agreed. That's why I didn't propose this. I proposed spawning off a
daemon that would post-process _one_ commit and exit. It could do all
sorts of analysis of the content and finding the best (for some
definition of "best") source for the delta, etc.
A cron type thing that's allowed to bisect commits for the repo
looking for the earliest one that has a property set
(deferred-delta=true), and removing that property when it's made the
delta?
That'd work too, but needs administrator intervention. Spawning a daemon
process doesn't.

-- Brane
Philip Martin
2017-07-13 19:57:27 UTC
Permalink
Post by Branko Čibej
Agreed. That's why I didn't propose this. I proposed spawning off a
daemon that would post-process _one_ commit and exit. It could do all
sorts of analysis of the content and finding the best (for some
definition of "best") source for the delta, etc.
One difficulty is that FSFS makes the assumption that revision files
are immutable. The pack operation can get around this as it deletes the
immutable revision files and adds distinct immutable pack files.

A separate deltification operation would probably need to do something
similar, introduce a second form of revision file distinguished by name
or path. Commit would write immutable pre-delta revision files and
deltification would write immutable post-delta revision files before
deleting the pre-delta file. All the FSFS code would have to fall over
from pre-delta revision file to post-delta revision file much like it
falls over from revision file to pack file.
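The fall-over lookup would look roughly like this; the on-disk names (`N.predelta`, `N.postdelta`) are invented for illustration and don't match FSFS's real layout:

```python
# Sketch of the lookup order described above: post-delta first, then
# pre-delta, then the pack file, much like the existing rev-file to
# pack-file fallback. File names are hypothetical.
import os

def revision_file_path(repo, rev, shard_size=1000):
    shard = rev // shard_size
    candidates = [
        os.path.join(repo, "revs", str(shard), f"{rev}.postdelta"),
        os.path.join(repo, "revs", str(shard), f"{rev}.predelta"),
        os.path.join(repo, "revs", f"{shard}.pack", "pack"),
    ]
    for path in candidates:
        if os.path.exists(path):
            return path
    raise FileNotFoundError(f"r{rev} not found in any form")
```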
--
Philip
Branko Čibej
2017-07-13 19:59:59 UTC
Permalink
Post by Philip Martin
Post by Branko Čibej
Agreed. That's why I didn't propose this. I proposed spawning off a
daemon that would post-process _one_ commit and exit. It could do all
sorts of analysis of the content and finding the best (for some
definition of "best") source for the delta, etc.
One difficulty is that FSFS makes the assumption that revision files
are immutable. The pack operation can get around this as it deletes the
immutable revision files and adds distinct immutable pack files.
A separate deltification operation would probably need to do something
similar, introduce a second form of revision file distinguished by name
or path. Commit would write immutable pre-delta revision files and
deltification would write immutable post-delta revision files before
deleting the pre-delta file. All the FSFS code would have to fall over
from pre-delta revision file to post-delta revision file much like it
falls over from revision file to pack file.
Agreed on all points.

Whether this actually forces a format bump or not is a different
question which I don't know the answer to.

-- Brane
Philip Martin
2017-07-13 20:36:04 UTC
Permalink
Post by Branko Čibej
Whether this actually forces a format bump or not is a different
question which I don't know the answer to.
I think we would have to bump. The old code could either read the
pre-delta or the post-delta files, depending on how we decided to name
things, but not both. Either way the old code would not be able to read
all the revision files and the repository would look broken.
--
Philip
Daniel Shahaf
2017-07-13 21:04:56 UTC
Permalink
Post by Philip Martin
Post by Branko Čibej
Whether this actually forces a format bump or not is a different
question which I don't know the answer to.
I think we would have to bump. The old code could either read the
pre-delta or the post-delta files, depending on how we decided to name
things, but not both. Either way the old code would not be able to read
all the revision files and the repository would look broken.
If we invent a "second form of revision file distinguished by name
or path", then yes, we would require a format bump, to ensure all
readers know to cope with the situation that the revision file has been
unlinked from the currently-well-known name. It would also require us
to figure out how to update all codepaths that open a revision file, to
do the correct triple lookup (old name, new name, packed name).

When I said format bump wouldn't be required, I envisioned that the rev
file that contains a PLAIN rep could be replaced by a rev file that
contains a DELTA rep, *if the DELTA rep is shorter*. A replacement rev
file could be prepared (and atomically renamed into place) that replaces
the PLAIN rep by the shorter DELTA rep, and updates the unexpanded-len
member of the node-rev header. That would result in some never-read
padding bytes, but FSFS f7's packing operation could regain them. (If
the number of digits of unexpanded-len changed, the replacement rev file
would need to add some padding to ensure the number of bytes in the
node-rev header — and hence, offsets to the remainder of the file —
don't change.)
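The padding constraint in that last parenthetical can be sketched like so; the header syntax is simplified and the space-padding choice is an assumption:

```python
# Sketch of the constraint above: a rewritten unexpanded-len field
# must occupy exactly as many bytes as the original, so offsets to
# the rest of the file stay valid. Header format simplified; the
# left-padding with spaces is a hypothetical choice.

def replace_len_in_place(header, old_len, new_len):
    old_field = str(old_len).encode()
    # Pad the smaller value so the field width is unchanged.
    new_field = str(new_len).encode().rjust(len(old_field), b" ")
    if len(new_field) != len(old_field):
        raise ValueError("new length needs more digits than the old")
    return header.replace(old_field, new_field, 1)
```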

Existing readers don't care whether a rep is a DELTA rep or a PLAIN rep;
they just care that it starts at the given byte offset, has "ENDREP\n"
after the given length, the resulting file checksums to the given value.

Now that I write this down I realize that rep-sharing complicates
matters. The replacement would only be sound if the rep is not the
target of rep-sharing from another revision; that is easily handled by
only adding the rep to rep-cache.db after replacing the PLAIN rep by the
equivalent, shorter DELTA rep. The remaining problem is what to do if
the rep is shared between two noderevs inside a single revision, but
<handwave>that's solvable</handwave>.

Regarding the recompress-at-pack alternative, I note that we (= the 1.9
release notes) recommend to pack FSFS f7 repositories regularly.

Cheers,

Daniel
Daniel Shahaf
2017-07-12 13:37:07 UTC
Permalink
Post by Branko Čibej
I wasn't really proposing to use libmagic on the server. My point is
that instead of using file name suffixes (which the compression and
deltification code don't know about), we'd do some sort of inspection
instead. Detecting ZIP files, or gzip/bzip2/xz-compressed files, etc.,
is fairly easy just from looking at a few bytes of headers. Same goes
for most image and video formats.
That's an option, but it would mean re-solving the problem libmagic
solves. Is there a way for us to use libmagic securely?

E.g., we could give to libmagic only the first 10 or 20 bytes of the
file (which is enough for it to recognise mpeg/jpeg/xz files, in my
testing), or we could ask libmagic to provide an API that only runs
'safe' magic file tests (e.g., strcmp/memcmp-based tests only)…

Cheers,

daniel
Daniel Shahaf
2017-07-11 20:00:03 UTC
Permalink
Post by Markus Schaber
I expect significant benefits in some use cases by skipping the
compression, thus I'm +1 if benchmarks prove it's worth the effort.
It is easy to have deltification without compression, either by using
svndiff0 (instead of svndiff1) or by using svndiff1 with zlib
compression level set appropriately.
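zlib level 0 illustrates the second option in miniature: the data gets framed but not actually deflated, which is roughly what turning the compression level down to nothing amounts to for svndiff1 reps:

```python
# zlib level 0 stores data essentially verbatim (a few bytes of
# framing), while level 9 actually compresses. A quick demonstration
# of the trade-off being discussed.
import zlib

data = b"A" * 10000
stored = zlib.compress(data, 0)     # level 0: stored, not deflated
squeezed = zlib.compress(data, 9)   # level 9: maximum effort

assert len(stored) >= len(data)     # no size win at level 0
assert len(squeezed) < len(data)    # trivially compressible input
assert zlib.decompress(stored) == data
```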
Post by Markus Schaber
I see the danger of drastically increased bandwith and storage size
(transferring/storing the whole mp3 instead of just some changed meta
data bytes) in some common use cases when deltification is skipped.
Thus, I'm skeptical (count it as -0), and I'd kindly suggest to do
some benchmarks for those cases before implementation, and clear
documentation of the possible negative effects if it's implemented.
Regarding compression, would it make sense for the server to compute the
compressed delta and, if it turns out to be larger than X% of the
uncompressed & undeltified file, just store the latter? I.e., to
compute the DELTA rep but use a PLAIN rep if the DELTA rep would be
larger than X% (in bytes) of the PLAIN rep?

IIRC this is already so with X=100, but for some filetypes it might make
sense to set X lower.
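The X% rule reduces to a one-line decision; a sketch, with `pick_rep` as a hypothetical helper name:

```python
# Sketch of the X% rule: keep the DELTA rep only if it is at most
# X percent of the PLAIN rep's size (in bytes). X=100 matches the
# behaviour described as current.

def pick_rep(plain_size, delta_size, x_percent=100):
    if delta_size * 100 <= plain_size * x_percent:
        return "DELTA"
    return "PLAIN"
```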

Cheers,

Daniel
Paul Hammant
2017-07-11 20:07:24 UTC
Permalink
So I'm after a time saving. I'm perfectly happy for the backend to waste
space (in my configuration), I just don't want it to take 15 mins to
transfer a single 15GB file into Subversion.

In my configuration, I'd like to pre-advise Subversion to save as much time
as possible for uploads, by skipping steps that are known in advance to be
meaningless for the use case.

- Paul
Post by Daniel Shahaf
Post by Markus Schaber
I expect significant benefits in some use cases by skipping the
compression, thus I'm +1 if benchmarks prove it's worth the effort.
It is easy to have deltification without compression, either by using
svndiff0 (instead of svndiff1) or by using svndiff1 with zlib
compression level set appropriately.
Post by Markus Schaber
I see the danger of drastically increased bandwith and storage size
(transferring/storing the whole mp3 instead of just some changed meta
data bytes) in some common use cases when deltification is skipped.
Thus, I'm skeptical (count it as -0), and I'd kindly suggest to do
some benchmarks for those cases before implementation, and clear
documentation of the possible negative effects if it's implemented.
Regarding compression, would it make sense for the server to compute the
compressed delta, and it turns out to be larger than X% of the
uncompressed&undeltified file, to just store the latter? I.e., to
compute the DELTA rep but use a PLAIN rep if the DELTA rep would be
larger than X% (in bytes) of the PLAIN rep?
IIRC this is already so with X=100, but for some filetypes it might make
sense to set X lower.
Cheers,
Daniel
Mark Phippard
2017-07-12 20:10:17 UTC
Permalink
I cannot find it in archives so maybe this happened in IRC, but I remember
one time suggesting we add a new versioned svn:XXX property to control
this. This could then be set by the client based on extension if desired.
I recall my suggestion was a compression on|off property that when turned
off would cause us to omit the deltification the client does when sending
to the server. If it makes sense for the server to do something similar
when it stores the file, great.

I suggested if we wanted to get really fancy we could also let the property
control the size of the window used when running xdelta or whatever we
do. I recall that for some binary files, if you gave it a larger window,
it could do a better job of compressing. I am no expert here; I just
recall this being mentioned in the past as a tuning option that could
make a difference. So
a property would be a way to expose that option.

I think the main use case though is to just disable it for files where
someone knows they would be better off just skipping deltification because
it is not going to reduce the size and just use a lot of time and CPU.
--
Thanks

Mark Phippard
http://markphip.blogspot.com/
Paul Hammant
2017-07-12 20:40:57 UTC
Permalink
I'd be fine with that too if it is *also* settable via curl --header
"svn:compression: no" for non-client auto-increment operations.
Post by Mark Phippard
I cannot find it in archives so maybe this happened in IRC, but I remember
one time suggesting we add a new versioned svn:XXX property to control
this. This could then be set by the client based on extension if desired.
I recall my suggestion was a compression on|off property that when turned
off would cause us to omit the deltification the client does when sending
to the server. If it makes sense for the server to do something similar
when it stores the file, great.
I suggested if we wanted to get really fancy we could also let the
property control the size of the window used when running xdelta or
whatever we do. I recall that for some binary files, if you gave it a
larger window, it could do a better job of compressing. I am no expert
here; I just recall this being mentioned in the past as a tuning option
that could make a
difference. So a property would be a way to expose that option.
I think the main use case though is to just disable it for files where
someone knows they would be better off just skipping deltification because
it is not going to reduce the size and just use a lot of time and CPU.
--
Thanks
Mark Phippard
http://markphip.blogspot.com/
Johan Corveleyn
2017-07-12 21:45:58 UTC
Permalink
I'd be fine with that too if it is also settable via curl --header
"svn:compression: no" for non-client auto-increment operations.
I'm wondering whether you'll still need this. You ended up with
curl+autoversioning (at Philip's suggestion) because that eliminates
the client-side deltification and compression overhead. But if the
normal svn client grows support to skip those (with the svn: property
Mark pointed to, or through some other technique), you might be able
to use a normal client with the same (or very close) performance as
curl+autoversioning.
--
Johan
Paul Hammant
2017-07-13 02:07:30 UTC
Permalink
I flipped *back* to Python's requests.put(..) in my solution - from a
regular Svn client. That relies on 'autoversioning=on' for it to work over
DAV, I mean. In that configuration it functions like curl, of course.

*Commit Messages*

I'd love a --header "svn:message: my message" too. I raised it before in
https://issues.apache.org/jira/browse/SVN-4454 but the team was split back
in 2013 on yes vs no. I'm thinking I'd like it again, having reviewed all
the comments. Can I ask for that issue/request to be reopened, please ?

*An Aside*

My CMS thing from 11 years ago used SeaMonkey (or MS Word 2004, not sure
of the year, before Microsoft killed DAV support in Office) for a vanilla
'PUT' of a page. I would have needed to have contributed to *SeaMonkey's*
codebase to have an inline *commit message* added to the 'publish' dialog
box for the page editor. The code's in GitHub these days if anyone is
interested, as all of Codehaus' repos were copied there before shutdown of
that fondly remembered portal. Codehaus being the first OSS code portal to
provision Subversion (pre 1.0).
Post by Johan Corveleyn
I'd be fine with that too if it is also settable via curl --header
"svn:compression: no" for non-client auto-increment operations.
I'm wondering whether you'll still need this. You ended up with
curl+autoversioning (at Philip's suggestion) because that eliminates
the client-side deltification and compression overhead. But if the
normal svn client grows support to skip those (with the svn: property
Mark pointed to, or through some other technique), you might be able
to use a normal client with the same (or very close) performance as
curl+autoversioning.
--
Johan
Branko Čibej
2017-07-13 06:27:25 UTC
Permalink
I flipped _back_ to Python's requests.put(..) in my solution - from a
regular Svn client. That relies on 'autoversioning=on' for it to work
over DAV, I mean. In that configuration it functions like curl, of
course.
_Commit Messages_
I'd love a --header "svn:message: my message" too. I raised it before
in https://issues.apache.org/jira/browse/SVN-4454
<https://issues.apache.org/jira/browse/SVN-4454> but the team was
split back in 2013 on yes vs no. I'm thinking I'd like it again,
having reviewed all the comments. Can I ask for that issue/request to
be reopened, please ?
I'm wondering what you gain with curl and autoversioning over, e.g.,
svnmucc or using our bindings (or even our libraries)? Other than that
you can't set the log message or any other properties.

FWIW, I strongly disagree with the idea of adding this feature, given
that there are already _two_ ways of doing this without having a working
copy.

-- Brane
Julian Foad
2017-07-13 07:21:20 UTC
Permalink
I'd love a --header "svn:message: my message" [...]
I'm wondering what you gain [...]
Paul,

I encourage you to start a new thread if any further discussion of this
point is needed.

- Julian
Johan Corveleyn
2017-07-13 07:36:31 UTC
Permalink
Post by Branko Čibej
I flipped _back_ to Python's requests.put(..) in my solution - from a
regular Svn client. That relies on 'autoversioning=on' for it to work
over DAV, I mean. In that configuration it functions like curl, of
course.
_Commit Messages_
I'd love a --header "svn:message: my message" too. I raised it before
in https://issues.apache.org/jira/browse/SVN-4454
<https://issues.apache.org/jira/browse/SVN-4454> but the team was
split back in 2013 on yes vs no. I'm thinking I'd like it again,
having reviewed all the comments. Can I ask for that issue/request to
be reopened, please ?
I'm wondering what you gain with curl and autoversioning over, e.g.,
svnmucc or using our bindings (or even our libraries)? Other than that
you can't set the log message or any other properties.
FWIW, I strongly disagree with the idea of adding this feature, given
that there are already _two_ ways of doing this without having a working
copy.
As I said in my previous response here, I think the reason for Paul to
go for curl+autoversioning is speed, because it eliminates client-side
deltification. It was suggested and demonstrated by Philip here:

https://svn.haxx.se/dev/archive-2017-07/0040.shtml

But I'm wondering if that curl advantage won't disappear if we develop
a solution for a normal svn client to skip deltification.
--
Johan
Jacek Materna
2017-07-13 09:09:27 UTC
Permalink
Post by Johan Corveleyn
Post by Branko Čibej
I flipped _back_ to Python's requests.put(..) in my solution - from a
regular Svn client. That relies on 'autoversioning=on' for it to work
over DAV, I mean. In that configuration it functions like curl, of
course.
_Commit Messages_
I'd love a --header "svn:message: my message" too. I raised it before
in https://issues.apache.org/jira/browse/SVN-4454
<https://issues.apache.org/jira/browse/SVN-4454> but the team was
split back in 2013 on yes vs no. I'm thinking I'd like it again,
having reviewed all the comments. Can I ask for that issue/request to
be reopened, please ?
I'm wondering what you gain with curl and autoversioning over, e.g.,
svnmucc or using our bindings (or even our libraries)? Other than that
you can't set the log message or any other properties.
FWIW, I strongly disagree with the idea of adding this feature, given
that there are already _two_ ways of doing this without having a working
copy.
As I said in my previous response here, I think the reason for Paul to
go for curl+autoversioning is speed, because it eliminates client-side
https://svn.haxx.se/dev/archive-2017-07/0040.shtml
But I'm wondering if that curl advantage won't disappear if we develop
a solution for a normal svn client to skip deltification.
Keen to understand this - speeding up commits in a supported way (for
specific workflows) would be a major win.
Post by Johan Corveleyn
--
Johan
Paul Hammant
2017-07-13 09:46:37 UTC
Permalink
On rationale ...

This is a file sync agent. 600 lines of Python that I'm going to open
source. The same thing I was working on in 2016 when I was chatting on
this list.

Goals: 1) *not have two copies of everything on the client*, 2) have
different semantics for pushing back changed files.
Secondary goal: 3) not require an svn install on the client.

Users: families (movies, music, photos). Corporates (docs, s/sheets etc).

The breakthrough last year was the set of techniques for leaning on the
server-side Merkle tree, plus the config settings needed to make them
useful.

I'd paused the project because I was busy with other stuff, and I picked it
up again recently - coinciding with the purchase of an SFF computer that
spoke USB3. Last year the performance of the thing was unacceptable on a
range of Raspberry Pi sized things, and all of them spoke USB2 to my 4TB
Seagate external drive. Performance is adequate on the new combo (this
thread, and others in the last two weeks), but it could be better with a
few config choices, I think.

Functionally, I'm nearly there (one bug that's hard to reproduce). Later I
want to get into permissions on directories.
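The autoversioning upload path described above boils down to one HTTP PUT
per file. A minimal stdlib sketch (hypothetical URL; the server needs
autoversioning turned on, and a real client would stream rather than slurp
a large file):

```python
import urllib.request

def build_put(url, data):
    """Build a PUT request; with autoversioning on, one PUT = one commit."""
    return urllib.request.Request(url, data=data, method="PUT")

def autoversion_put(url, path):
    # Slurping the file is fine for a sketch; a 15GB file needs streaming.
    with open(path, "rb") as f:
        req = build_put(url, f.read())
    with urllib.request.urlopen(req) as resp:
        return resp.status  # typically 201 on create, 204 on update

# usage (hypothetical): autoversion_put("http://localhost/repo/u1.dat", "u1.dat")
```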
Post by Branko Čibej
I'm wondering what you gain with curl and autoversioning over, e.g.,
svnmucc or using our bindings (or even our libraries)? [...]
Post by Johan Corveleyn
As I said in my previous response here, I think the reason for Paul to
go for curl+autoversioning is speed, because it eliminates client-side
deltification. [...]
But I'm wondering if that curl advantage won't disappear if we develop
a solution for a normal svn client to skip deltification.
Philip Martin
2017-07-13 11:01:43 UTC
Permalink
Post by Johan Corveleyn
As I said in my previous response here, I think the reason for Paul to
go for curl+autoversioning is speed, because it eliminates client-side
Using a trunk build, I'm uploading 500MB urandom files:

dd if=/dev/urandom bs=1M count=500 > u1.dat
dd if=/dev/urandom bs=1M count=500 > u2.dat
dd if=/dev/urandom bs=1M count=500 > u3.dat

using

curl -T u1.dat http://localhost/repo/curl.dat
curl -T u2.dat http://localhost/repo/curl.dat
curl -T u3.dat http://localhost/repo/curl.dat

and

svnmucc -mm put u1.dat http://localhost/repo/svn.dat
svnmucc -mm put u2.dat http://localhost/repo/svn.dat
svnmucc -mm put u3.dat http://localhost/repo/svn.dat

to a repository with compression-level=0 and enable-rep-sharing=false.

The upload that creates the file is faster than the upload that updates
the file:

curl create time: 4.2s
curl update time: 6.7s
curl CPU in both cases: 0.2s

svnmucc create time: 5.2s
svnmucc update time: 7.7s
svnmucc CPU in both cases: 1.2s

The difference between curl and svnmucc matches the extra CPU used by
svnmucc.

The difference between create and replace is due to the extra server CPU
used by the replace (2.5s in each case: 6.7 - 4.2 and 5.5 - 3.0):

httpd CPU create: 3.0s
httpd CPU replace: 5.5s

The fact that the create is faster than the update means that if
uploading the data is the only thing that matters then breaking the
history and changing the update into a delete+recreate is faster, i.e.

svnmucc -mm -U http://localhost/repo put u1.dat svn.dat
svnmucc -mm -U http://localhost/repo rm svn.dat put u2.dat svn.dat
svnmucc -mm -U http://localhost/repo rm svn.dat put u3.dat svn.dat

The delete+recreate takes the same 5.2s as the original create and is
faster than curl's 6.7s update.
--
Philip
Paul Hammant
2017-07-13 11:07:36 UTC
Permalink
Post by Philip Martin
The delete+recreate takes the same 5.2s as the original create and is
faster than curl's 6.7s update.
New fsfs config prop?

delete_and_recreate_instead_of_change_threshold = 50MB

-ph
Branko Čibej
2017-07-13 11:54:24 UTC
Permalink
Post by Philip Martin
The delete+recreate takes the same 5.2s as the original create and is
faster than curl's 6.7s update.
New fsfs config prop ?
delete_and_recreate_instead_of_change_threshold = 50MB
You forgot the smiley here, surely you're joking. :)

-- Brane
Paul Hammant
2017-07-13 12:00:17 UTC
Permalink
I don't mind how it's implemented, and agree this one is weak. I am after a
solution though and am open to alternate ideas if compression-exempt-suffixes
and its sibling cannot be implemented for some technical reason.
Post by Branko Čibej
Post by Philip Martin
The delete+recreate takes the same 5.2s as the original create and is
faster than curl's 6.7s update.
New fsfs config prop ?
delete_and_recreate_instead_of_change_threshold = 50MB
You forgot the smiley here, surely you're joking. :)
-- Brane
Philip Martin
2017-07-13 11:48:43 UTC
Permalink
Post by Philip Martin
The fact that the create is faster than the update means that if
uploading the data is the only thing that matters then breaking the
history and changing the update into a delete+recreate is faster, i.e.
svnmucc -mm -U http://localhost/repo put u1.dat svn.dat
svnmucc -mm -U http://localhost/repo rm svn.dat put u2.dat svn.dat
svnmucc -mm -U http://localhost/repo rm svn.dat put u3.dat svn.dat
The delete+recreate takes the same 5.2s as the original create and is
faster than curl's 6.7s update.
Even faster is to use curl to talk DAV:

curl -XPOST -H 'Content-Type: application/vnd.svn-skel' -d @post.txt \
'http://localhost/repo/!svn/me'
curl -XDELETE 'http://localhost/repo/!svn/txr/1-1/svn.dat'
curl -T u2.dat 'http://localhost/repo/!svn/txr/1-1/svn.dat'
curl -XMERGE -d @merge.txt http://localhost/repo

which takes 4.3s, more or less the same as the 4.2s curl create.

Note I have cheated here: I knew the txn name in advance while a real
implementation would need to parse the POST response and adjust the
subsequent requests.

svnmucc is implemented on top of the delta editor; I don't know how easy
it would be to get it to send a plain PUT rather than a delta.
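Philip's four-request sequence can also be driven from stdlib Python. A
sketch only: the !svn paths and the SVN-Txn-Name response header are taken
from the HTTPv2 protocol notes, and the MERGE body is a simplified
approximation of what ra_serf sends, so treat all of it as an assumption to
verify against a real server:

```python
import http.client

def txn_paths(repo, txn, item):
    """Pure helper: the !svn URLs for one HTTPv2 transaction."""
    return {"me":   f"{repo}/!svn/me",
            "item": f"{repo}/!svn/txr/{txn}/{item}",
            "txn":  f"{repo}/!svn/txn/{txn}"}

def delete_and_recreate(host, repo, item, data):
    conn = http.client.HTTPConnection(host)
    # 1. POST a create-txn skel; the server names the new txn in a header.
    conn.request("POST", f"{repo}/!svn/me", body=b"( create-txn )",
                 headers={"Content-Type": "application/vnd.svn-skel"})
    resp = conn.getresponse()
    resp.read()
    txn = resp.getheader("SVN-Txn-Name")  # assumed header name
    p = txn_paths(repo, txn, item)
    # 2 + 3. Delete and re-create the file inside the open transaction.
    for method, body in (("DELETE", None), ("PUT", data)):
        conn.request(method, p["item"], body=body)
        conn.getresponse().read()
    # 4. MERGE commits the whole transaction atomically (body simplified).
    merge = ('<?xml version="1.0"?><D:merge xmlns:D="DAV:"><D:source>'
             f'<D:href>{p["txn"]}</D:href></D:source>'
             '<D:no-auto-merge/><D:no-checkout/></D:merge>')
    conn.request("MERGE", repo, body=merge.encode())
    status = conn.getresponse().status
    # On failure a polite client would DELETE p["txn"] so the server-side
    # transaction state doesn't linger.
    return status
```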
--
Philip
Paul Hammant
2017-07-13 11:52:32 UTC
Permalink
Sorry I'm gunna be a bit slow here, Philip, but you're saying that sequence
of four is a single atomic commit in Subversion?

- Paul
Post by Philip Martin
Post by Philip Martin
The fact that the create is faster than the update means that if
uploading the data is the only thing that matters then breaking the
history and changing the update into a delete+recreate is faster, i.e.
svnmucc -mm -U http://localhost/repo put u1.dat svn.dat
svnmucc -mm -U http://localhost/repo rm svn.dat put u2.dat svn.dat
svnmucc -mm -U http://localhost/repo rm svn.dat put u3.dat svn.dat
The delete+recreate takes the same 5.2s as the original create and is
faster than curl's 6.7s update.
curl -XPOST -H 'Content-Type: application/vnd.svn-skel' -d @post.txt \
'http://localhost/repo/!svn/me'
curl -XDELETE 'http://localhost/repo/!svn/txr/1-1/svn.dat'
curl -T u2.dat 'http://localhost/repo/!svn/txr/1-1/svn.dat'
curl -XMERGE -d @merge.txt http://localhost/repo
which takes 4.3s, more or less the same as the 4.2s curl create.
Note I have cheated here: I knew the txn name in advance while a real
implementation would need to parse the POST response and adjust the
subsequent requests.
svnmucc is implemented on top of the delta editor, I don't know how easy
it would be to get it to send a plain PUT rather than a delta.
--
Philip
Philip Martin
2017-07-13 12:03:02 UTC
Permalink
Post by Paul Hammant
Sorry I'm gunna be a bit slow here, Philip, but you're saying that sequence
of four is a single atomic commit in Subversion?
Yes.
--
Philip
Paul Hammant
2017-07-13 12:12:44 UTC
Permalink
Sweet - I'll operationalize that ASAP. The client side of the sync agent
already knows whether the item is new or to be changed.

- Paul
Post by Philip Martin
Post by Paul Hammant
Sorry I'm gunna be a bit slow here, Philip, but you're saying that sequence
of four is a single atomic commit in Subversion?
Yes.
Branko Čibej
2017-07-13 12:17:09 UTC
Permalink
Post by Paul Hammant
Sweet - I'll operationalize that ASAP. The client side of the sync
agent knows whether the item is new or to be changed, already.
To be quite clear here: If you don't need history, you might consider
using a different DAV backend than Subversion. mod_dav_fs uses the file
system as a back-end and avoids all problems with deltification on
client or server, etc.
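For comparison, a minimal mod_dav_fs setup looks like this (illustrative
Apache config; the lock-db path, alias, and directory are placeholders):

```apacheconf
# Plain WebDAV onto the filesystem: no versioning, no deltification.
DavLockDB /var/lock/apache2/DavLockDB
Alias /files /srv/files
<Directory /srv/files>
    Dav On
    Require all granted
</Directory>
```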
Post by Paul Hammant
- Paul
Post by Paul Hammant
Sorry I'm gunna be a bit slow here, Philip, but you're saying that sequence
of four is a single atomic commit in Subversion?
Yes.
Philip Martin
2017-07-13 12:29:16 UTC
Permalink
Post by Paul Hammant
Sweet - I'll operationalize that ASAP. The client side of the sync agent
knows whether the item is new or to be changed, already.
Protocol details:

http://svn.apache.org/repos/asf/subversion/trunk/notes/http-and-webdav/http-protocol-v2.txt

POST is only available with the v2 protocol from 1.7 on; older servers
only support MKACTIVITY. A proper v2 client, such as the Subversion
libraries, will send OPTIONS and parse the response to identify
features. Things like the various "magic" !svn strings, whether to use
create-txn or create-txn-with-props, etc.
--
Philip
Philip Martin
2017-07-13 12:38:54 UTC
Permalink
Post by Philip Martin
A proper v2 client, such as the Subversion
libraries, will send OPTIONS and parse the response to identify
features.
Each transaction has server-side state on disk. A Subversion client
will attempt to delete the transaction if an error occurs; otherwise
failed transactions take up disk space until manually deleted on the
server.

Once you start using transactions other things become possible: you
could choose to PUT multiple files in one atomic commit.
--
Philip
Evgeny Kotkov
2017-07-14 11:18:27 UTC
Permalink
Post by Paul Hammant
1. compression-exempt-suffixes = mp3,mp4,jpeg
2. deltification-exempt-suffixes = mp3,mp4,jpeg
Regardless of the setting of 'compression-level', #1 above two mean certain
things can skip the compression attempt. It must give up at a certain point
right?
Hi everyone,

To improve the situation with slow commits of large binary and, possibly,
incompressible files I committed a patch (http://svn.apache.org/r1801940)
that adds initial support for LZ4 compression in the backend. While still
providing a decent compression ratio, LZ4 offers much faster compression
even than zlib with level=1, and can skip incompressible data chunks.
(Presumably, LZ4 is used for on-the-fly compression in different file
systems for these reasons.)

I have seen significant (up to 3 times) speed improvement for svn import
and commit with large binary files. Sometimes, using LZ4 compression can
even outperform a configuration with disabled compression, if the file is
at least somehow compressible, as it reduces the time required to write
the data to disk.

Currently, LZ4 compression is enabled if the fsfs.conf file specifies
compression-level=1, and all other levels still use zlib for compression.
Right now, the support for LZ4 is only implemented for the file:// protocol,
but support/negotiation for other protocols can be added later.
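The underlying issue - deflate spends CPU but saves nothing on
already-random data such as mp3/mp4/jpeg payloads - is easy to demonstrate
with stdlib zlib alone (illustrative only; the patch itself uses LZ4 in the
FSFS backend):

```python
import os
import time
import zlib

def compress_stats(data, level=1):
    """Return (compression ratio, seconds) for zlib at the given level."""
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    return len(out) / len(data), time.perf_counter() - t0

incompressible = os.urandom(4 << 20)            # stands in for media payloads
compressible = b"all work and no play " * 200_000

r_rand, t_rand = compress_stats(incompressible)
r_text, t_text = compress_stats(compressible)
# Random bytes: ratio ~1.0 - CPU spent for no saving at all.
# Repetitive text: tiny ratio - compression earns its keep here.
```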


Regards,
Evgeny Kotkov
Johan Corveleyn
2017-07-14 11:38:08 UTC
Permalink
On Fri, Jul 14, 2017 at 1:18 PM, Evgeny Kotkov
Post by Evgeny Kotkov
Post by Paul Hammant
1. compression-exempt-suffixes = mp3,mp4,jpeg
2. deltification-exempt-suffixes = mp3,mp4,jpeg
Regardless of the setting of 'compression-level', #1 above two mean certain
things can skip the compression attempt. It must give up at a certain point
right?
Hi everyone,
To improve the situation with slow commits of large binary and, possibly,
incompressible files I committed a patch (http://svn.apache.org/r1801940)
that adds initial support for LZ4 compression in the backend. While still
providing a decent compression ratio, LZ4 offers much faster compression
even than zlib with level=1, and can skip incompressible data chunks.
(Presumably, LZ4 is used for on-the-fly compression in different file
systems for these reasons.)
I have seen significant (up to 3 times) speed improvement for svn import
and commit with large binary files. Sometimes, using LZ4 compression can
even outperform a configuration with disabled compression, if the file is
at least somehow compressible, as it reduces the time required to write
the data to disk.
Currently, LZ4 compression is enabled if the fsfs.conf file specifies
compression-level=1, and all other levels still use zlib for compression.
Right now, the support for LZ4 is only implemented for the file:// protocol,
but support/negotiation for other protocols can be added later.
Nice!

But how about deltification? Has anyone tried / benchmarked the effect
of turning off deltification (with or without compression), to see
what the effect would be on the commit time? Like I suggested in this
thread yesterday (i.e. set max-deltification-walk to 0 in fsfs.conf --
or perhaps play with both max-deltification-walk and
max-linear-deltification) ...
--
Johan
Paul Hammant
2017-07-14 11:59:43 UTC
Permalink
I'd like to weigh in with some perf stats on that ever-changing 15GB file,
but I can't:

With TMPDIR on the same drive as Svn's root dir: sometimes 6 mins,
sometimes 14 mins for the PUT.
With TMPDIR on a different drive from Svn's root dir: sometimes 6 mins,
sometimes 14 mins for the PUT.
I can't prove or disprove the benefit of that env var setting.

I have compression-level = 0 at the moment, and have filled the drive to
the 3.4TB level (now full) which isn't pertinent to the question posed, I
guess.

With such oscillation in times for back-to-back PUTs on that very large
resource, when nothing else is changing, I can't come back with meaningful
stats after a build of Svn from HEAD of trunk/ :-(

Great work Evgeny. Is the implementation toggled somehow in the build, or
by another property? I looked through the diffs but couldn't see anything
that stuck out.

- Paul
Post by Johan Corveleyn
On Fri, Jul 14, 2017 at 1:18 PM, Evgeny Kotkov
Post by Evgeny Kotkov
Post by Paul Hammant
1. compression-exempt-suffixes = mp3,mp4,jpeg
2. deltification-exempt-suffixes = mp3,mp4,jpeg
Regardless of the setting of 'compression-level', #1 above two mean certain
things can skip the compression attempt. It must give up at a certain point
right?
Hi everyone,
To improve the situation with slow commits of large binary and, possibly,
incompressible files I committed a patch (http://svn.apache.org/r1801940)
that adds initial support for LZ4 compression in the backend. While still
providing a decent compression ratio, LZ4 offers much faster compression
even than zlib with level=1, and can skip incompressible data chunks.
(Presumably, LZ4 is used for on-the-fly compression in different file
systems for these reasons.)
I have seen significant (up to 3 times) speed improvement for svn import
and commit with large binary files. Sometimes, using LZ4 compression can
even outperform a configuration with disabled compression, if the file is
at least somehow compressible, as it reduces the time required to write
the data to disk.
Currently, LZ4 compression is enabled if the fsfs.conf file specifies
compression-level=1, and all other levels still use zlib for compression.
Right now, the support for LZ4 is only implemented for the file:// protocol,
but support/negotiation for other protocols can be added later.
Nice!
But how about deltification? Has anyone tried / benchmarked the effect
of turning off deltification (with or without compression), to see
what the effect would be on the commit time? Like I suggested in this
thread yesterday (i.e. set max-deltification-walk to 0 in fsfs.conf --
or perhaps play with both max-deltification-walk and
max-linear-deltification) ...
--
Johan
Branko Čibej
2017-07-14 12:18:09 UTC
Permalink
Post by Paul Hammant
I'd like to weigh in with some perf stats on that ever-changing 15GB
With TMPDIR on the same drive as Svn's root dir: sometimes 6 mins,
sometimes 14 mins for the PUT.
With TMPDIR on a different drive from Svn's root dir: sometimes 6 mins,
sometimes 14 mins for the PUT.
I can't prove or disprove the benefit of that env var setting.
I have compression-level = 0 at the moment, and have filled the drive
to the 3.4TB level (now full) which isn't pertinent to the question
posed, I guess.
In fact it might be. A full filesystem may have a harder time finding
contiguous empty space, which may translate to more separate
(non-contiguous) writes for the same amount of data, which will have a
measurable effect even on an SSD where seek time is effectively zero,
but IOPS are not free.

-- Brane
Paul Hammant
2017-07-14 12:19:44 UTC
Permalink
Well anecdotally, it wasn't oscillating like that in earlier phases.

I'll do p4 obliterate and start over. I mean svn obliterate. Err :-P
Post by Branko Čibej
In fact it might be. A full filesystem may have a harder time finding
contiguous empty space, which may translate to more separate
(non-contiguous) writes for the same amount of data, which will have a
measurable effect even on an SSD where seek time is effectively zero,
but IOPS are not free.
Philip Martin
2017-07-14 14:08:03 UTC
Permalink
Post by Paul Hammant
With TMPDIR on the same drive as Svn's root dir: sometimes 6 mins,
sometimes 14 mins for the PUT.
With TMPDIR on a different drive from Svn's root dir: sometimes 6 mins,
sometimes 14 mins for the PUT.
I can't prove or disprove the benefit of that env var setting.
I don't believe Apache/mod_dav_svn uses TMPDIR when processing a PUT.
--
Philip
Paul Hammant
2017-07-14 19:47:38 UTC
Permalink
Philip,
Post by Philip Martin
I don't believe Apache/mod_dav_svn uses TMPDIR when processing a PUT.
I can prove that either Apache or ModDavSvn (or the OS) uses TMPDIR during
a PUT of a very large resource.

Test 1: leave 3GB free on the system drive, try to PUT a 15GB file through
DAV into its ultimate destination on a drive with 3TB of space, and observe
an error (as I did before).

Test 2: as #1 but with a TMPDIR export in /etc/apache2/envvars. No error
during the PUT will be observed.

* this is me retracing steps *

- Paul
Philip Martin
2017-07-14 20:42:02 UTC
Permalink
Post by Paul Hammant
Post by Philip Martin
I don't believe Apache/mod_dav_svn uses TMPDIR when processing a PUT.
I can prove that either Apache or ModDavSvn (or the OS) uses TMPDIR during
a PUT of a very large resource.
Test 1: leave 3GB free on the system drive, try to PUT a 15GB file through
DAV into its ultimate destination on a drive with 3TB of space, and observe
an error (as I did before).
Test 2: as #1 but with a TMPDIR export in /etc/apache2/envvars. No error
during the PUT will be the observation.
So what data gets written there? Which process writes it? When I use
curl to put a 15GB file into apache my /tmp stays at 92KB used. I don't
have TMPDIR set, I don't set TMPDIR for apache.

I have also run

strace -fetrace=open -p nnn

on the apache process to see all the files opened by the apache process.
All the files opened are within the repository itself except for
/dev/urandom.

Incidentally, the 15GB upload takes 2min 4s. This is on a machine with
a core i5 4570 (circa 2013) processor and with the repository on a SATA
SSD.
--
Philip
Paul Hammant
2017-07-15 13:09:01 UTC
Permalink
Hey Philip, starting from scratch, I can reproduce "Can't find a temporary
directory" errors in /var/log/apache2/error.log for PUT when the boot drive
is pretty much full, and the repository's drive has 3 spare terabytes.

Can you tell me what to run concurrently that would help you be convinced
too? Bear in mind I'm really a 22-year enterprise Java/JavaScript/Python
developer who likes cosseted JetBrains edit/debug experiences, and isn't at
all savvy with the long list of tools that gcc developers know well.

Also, my machine is an Atom CPU with 4GB RAM and a less-than-stellar
64GB SSD.

- Paul
Post by Philip Martin
Post by Paul Hammant
Post by Philip Martin
I don't believe Apache/mod_dav_svn uses TMPDIR when processing a PUT.
I can prove that either Apache or ModDavSvn (or the OS) uses TMPDIR during
a PUT of a very large resource.
Test 1: leave 3GB free on the system drive, try to PUT a 15GB file through
DAV into its ultimate destination on a drive with 3TB of space, and observe
an error (as I did before).
Test 2: as #1 but with a TMPDIR export in /etc/apache2/envvars. No error
during the PUT will be the observation.
So what data gets written there? Which process writes it? When I use
curl to put a 15GB file into apache my /tmp stays at 92KB used. I don't
have TMPDIR set, I don't set TMPDIR for apache.
I have also run
strace -fetrace=open -p nnn
on the apache process to see all the files opened by the apache process.
All the files opened are within the repository itself except for
/dev/urandom.
Incidentally, the 15GB upload takes 2min 4s. This is on a machine with
a core i5 4570 (circa 2013) processor and with the repository on a SATA
SSD.
--
Philip
Philip Martin
2017-07-15 16:00:57 UTC
Permalink
Post by Paul Hammant
Hey Philip, starting from scratch, I can reproduce "Can't find a temporary
directory" errors in /var/log/apache2/error.log for PUT when the boot drive
is pretty much full, and the repository's drive has 3 spare terabytes.
Ah, it's not running out of space in tmp, rather it is failing to find
tmp. There is a behaviour difference between Subversion 1.9 and 1.10
here: each Apache child process using 1.9 will create two zero-length
files in tmp: the first is APR finding tmp, the second is Subversion
getting the default file permissions. 1.10 doesn't use tmp for this
purpose and the file is created in the repository.

Note: the files created in tmp by 1.9 are zero length so should not
affect performance of the PUT.

APR should detect most tmp dirs automatically via apr_temp_dir_get:

http://svn.apache.org/repos/asf/apr/apr/branches/1.5.x/file_io/unix/tempdir.c

Your system must have some more exotic tmp dir setup, a per-user tmp
perhaps? If you tell us what your tmp looks like then perhaps we can
fix APR to detect it automatically. Do other commands such as 'mktemp'
locate your tmp dir?

If you want to trace the file IO of Apache you can use

strace -f -p NNN

where NNN is the process ID of an Apache child process. (Apache has a
top level process listening on the socket and child processes handling
the requests.) You may want to trace every child process, or configure
Apache to use only one child process, or repeat the operation until the
chosen child is triggered. strace shows all system calls by default; you
can restrict it with things like:

strace -f -etrace=open,write,close,rename -p NNN
--
Philip
Paul Hammant
2017-07-17 09:52:49 UTC
Permalink
Philip,
Post by Philip Martin
Ah, it's not running out of space in tmp, rather it is failing to find
tmp. There is a behaviour difference between Subversion 1.9 and 1.10
here: each Apache child process using 1.9 will create two zero-length
files in tmp, the first is APR finding tmp the second is Subversion
getting the default file permissions. 1.10 doesn't use tmp for this
purpose and the file is created in the repository.
Note: the files created in tmp by 1.9 are zero length so should not
affect performance of the PUT.
I agree that they're not affecting performance when everything is working.
It's just that PUTs stopped working altogether. Of course 'system drive
full' will make many Linux services stop working incrementally, and the
administrator has bigger problems than just PUTs into Subversion.

Specifying the location of TMPDIR in /etc/apache2/envvars alleviates the
block on PUT operations, BUT who'd want to operate a Linux machine with a
100% full system drive?

This is XY problem territory: I shouldn't have been asking how to set the
TMPDIR when I was operating a system at 'full'. Sorry!
Post by Philip Martin
Your system must have some more exotic tmp dir setup, a per-user tmp
perhaps? If you tell us what your tmp looks like then perhaps we can
fix APR to detect it automatically. Do other commands such as 'mktemp'
locate your tmp dir?
It is just a stock Ubuntu 17.04 Server install. The from-scratch install
was 10 mins before the Apache/Subversion setup specifically for this
file-sync application, and nothing else other than some ancillary apt-get
installs for creature comforts.
Post by Philip Martin
If you want to trace the file IO of Apache you can use
strace -f -p NNN
Thanks for the info on strace.

- ph
Paul Hammant
2017-07-17 10:09:02 UTC
Permalink
I think, as this thread draws towards a close, that the LZ4 modifications
Evgeny made in part address my speed concerns for gigabyte-sized files, and
that if any more speed for PUT is wanted, one would set:

max-deltification-walk = 0
compression-level = 0
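In fsfs.conf terms that would be (section name as in the 1.9-era template
in a repository's db/fsfs.conf - worth double-checking against the file
your server version ships):

```ini
[deltification]
# Don't walk back through predecessors looking for delta bases.
max-deltification-walk = 0
# 0 disables compression; per Evgeny's change, 1 would select LZ4.
compression-level = 0
```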

It also looks like there is dev-team consensus that the properties that
already exist should *not* be supplemented by new ones pertaining to
compression/deltification on a per-file-suffix basis, and that this
consensus holds *regardless* of whether per-file-suffix is considered safe
or not. Therefore I should not go ahead and raise a Jira feature request
for this - right?

(thanks for the great participation in this conversation, gang)

-ph

Evgeny Kotkov
2017-07-14 14:31:22 UTC
Permalink
Post by Johan Corveleyn
Nice!
But how about deltification? Has anyone tried / benchmarked the effect
of turning off deltification (with or without compression), to see
what the effect would be on the commit time? Like I suggested in this
thread yesterday (i.e. set max-deltification-walk to 0 in fsfs.conf --
or perhaps play with both max-deltification-walk and
max-linear-deltification) ...
I haven't tested commit with deltification enabled/disabled in config, but
a synthetic benchmark of the deltification itself (create two big random
files, run vdelta-test -q over them) shows a rate of around 340-350 MiB/s
on my machine.

That's probably enough to not slow down commits in an immediately visible way.


Regards,
Evgeny Kotkov
Johan Corveleyn
2017-07-14 14:53:56 UTC
Permalink
On Fri, Jul 14, 2017 at 4:31 PM, Evgeny Kotkov
Post by Evgeny Kotkov
Post by Johan Corveleyn
Nice!
But how about deltification? Has anyone tried / benchmarked the effect
of turning off deltification (with or without compression), to see
what the effect would be on the commit time? Like I suggested in this
thread yesterday (i.e. set max-deltification-walk to 0 in fsfs.conf --
or perhaps play with both max-deltification-walk and
max-linear-deltification) ...
I haven't tested commit with deltification enabled/disabled in config, but
a synthetic benchmark of the deltification itself (create two big random
files, run vdelta-test -q over them) shows a rate of around 340-350 MiB/s
on my machine.
That's probably enough to not slow down commits in an immediately visible way.
That's good to know. Though that's a bit at odds with what Philip
found when he eliminated deltification on the client side (first by
using svnmucc (delta against empty file only); then by using curl-PUT
+ autoversioning (no deltification at all)):

https://svn.haxx.se/dev/archive-2017-07/0040.shtml

Is it slower / more intensive to deltify on the client side?
--
Johan
Evgeny Kotkov
2017-07-14 15:08:20 UTC
Permalink
Post by Johan Corveleyn
That's good to know. Though that's a bit at odds with what Philip
found when he eliminated deltification on the client side (first by
using svnmucc (delta against empty file only); then by using curl-PUT
https://svn.haxx.se/dev/archive-2017-07/0040.shtml
Is it slower / more intensive to deltify on the client side?
The reason curl PUT is faster is that the svn client first stores
the delta to a temporary file, and only then sends it to the server
(and there's also compression and checksumming involved). Curl streams
the data to the server, and the server can immediately start processing
the incoming data.

If the svn client would be streaming data as well, it would make the server
and the client process data simultaneously and also avoid the overhead
associated with writing the temporary file to disk. Perhaps, that would
make the difference much less visible.

There is a sketch of the solution available in:

https://svn.apache.org/repos/asf/subversion/branches/ra_serf-stream-commit

but I wasn't able to finish the work on it.


Regards,
Evgeny Kotkov
Philip Martin
2017-07-14 14:03:46 UTC
Permalink
Post by Evgeny Kotkov
Right now, the support for LZ4 is only implemented for the file:// protocol,
but support/negotiation for other protocols can be added later.
That's not right, is it? This is a backend change that already affects
all protocols.
--
Philip
Evgeny Kotkov
2017-07-14 14:38:04 UTC
Permalink
Post by Philip Martin
Post by Evgeny Kotkov
Right now, the support for LZ4 is only implemented for the file:// protocol,
but support/negotiation for other protocols can be added later.
That's not right is it? This is a backend change that already affects
all protocols.
Indeed, this wording is imprecise.

What I was trying to say is that there is no way (yet) to negotiate the use
of svndiff2 / LZ4 over the wire where it would make sense, e.g., over
http:// if the server has SVNCompressionLevel set to 1.


Regards,
Evgeny Kotkov