Discussion:
[fossil-users] rebuild scale-ability/data written/repo size ratio
Karel Gardas
2016-10-28 09:45:21 UTC
Permalink
Hello,

first of all, I know that Fossil was written with the idea of serving the
SQLite project and projects of similar size well, and that it does a
great job at this task.

I'm just curious whether there are people here tinkering with the idea
of making it more scalable and allowing its real usage also for
projects of a bigger size.

Now that the git -> fossil (incremental) mirror functionality seems to
be working, this may be even more interesting or tempting IMHO.

Let's talk about some real numbers to illustrate the situation. Let's
clone NetBSD src tree kindly provided by Jörg Sonnenberger by
following command:

$ time /opt/fossil-head/bin/fossil clone
http://netbsd.sonnenberger.org/ netbsd-src.fossil

It takes:

real 323m2.323s
user 42m0.262s
sys 13m18.003s

on my E5-2620 Sandy Bridge workstation. Of course part of this time is
perhaps spent on not-so-efficient network send/receive, but the
majority of the time, at least as observed from the output of the
command, is spent on the DB rebuild. I know that from the example of
the OpenBSD src tree, which is comparable in size to NetBSD, and where
the rebuild alone takes around 250 minutes on the same hardware and
with the same fossil.

So much for the time spent on the rebuild. What may be even more
important is how much data the rebuild writes. I don't have exact
numbers here, but this is on my workstation, so I can see what's going
on by keeping an eye on the drive meters. Let's assume I'm not far off
in claiming that the rebuild was writing data at ~40 MB/s for 2 or
even more hours. In sum this may be around 300 GB of data written
during this rebuild (rounded up). This is for a repository whose final
file size is:

$ ls -lha netbsd-src.fossil
-rw-r--r-- 1 karel karel 2.6G Oct 28 01:11 netbsd-src.fossil

and which results in the source tree of size of 2.7 GB.
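The ~300 GB figure above is a back-of-the-envelope estimate. The arithmetic, under the stated assumptions (~40 MB/s sustained for roughly two hours, ~2.6 GiB final repo), can be sketched like this:

```shell
# Estimate of data written during rebuild and the resulting write
# amplification, using the observed figures above.
# Assumptions: 40 MB/s sustained for ~7200 s; final repo ~2662 MiB (2.6 GiB).
written_mb=$(( 40 * 7200 ))   # 288000 MB written, roughly 281 GiB
repo_mb=2662                  # final netbsd-src.fossil size in MiB (approx.)
echo "data written: ${written_mb} MB"
echo "write amplification: ~$(( written_mb / repo_mb ))x the final repo size"
```

That is, the rebuild writes on the order of 100x the final repository size to disk.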

Now, just to show that this rebuild may be the biggest scalability
obstacle, I'd like to compare it with the open/status/diff/commit operations:

- open: results in 2.7GB of data written to disk in the resulting
NetBSD source tree. It takes:
real 4m38.843s
user 1m44.221s
sys 1m58.553s

IMHO a very nice result for a source tree of this size

- status/diff -- one random file modified: both run for 4-5 seconds.
Also very nice results for a source tree of this size

- commit: this is a little bit harder. With one file modified, a commit takes:
real 4m0.765s
user 1m55.442s
sys 1m11.892s

IMHO not so nice, but still kind of acceptable even for development on
a source tree of this size. But commit may certainly be another target
for speedup hacking.

So that's it. Fossil used in those tests is:

This is fossil version 1.37 [0fa60142eb] 2016-10-26 21:45:52 UTC

and the tests were performed on a ZFS mirror of two SSDs (1TB Crucial
MX200 and 1TB Samsung 850 Evo) on Solaris 11.2 running on an E5-2620 with
32GB RAM -- in case anybody is interested in this info for verifying
the numbers.

Cheers,
Karel
jungle Boogie
2016-10-28 16:40:55 UTC
Permalink
Post by Karel Gardas
I'm just curious if there are people here tinkering with the idea to
make it more scale-able and allow its real usage also for projects of
bigger size.
There has been such a discussion. I have an email with the subject
"Fossil 2.1: Scaling" from March 2015. The archive at
http://www.mail-archive.com/ doesn't go back that far, though.


Ah, just found it on MARC:
http://marc.info/?l=fossil-users&m=144565850643439&w=2
--
-------
inum: 883510009027723
sip: ***@sip2sip.info
Warren Young
2016-10-28 17:02:56 UTC
Permalink
Post by Karel Gardas
make it more scale-able and allow its real usage also for projects of
bigger size.
How many projects are there bigger than SQLite, percentage-wise?

Has anyone done something like produce a SLOC histogram for all projects on GitHub or Sourceforge, so that we can say something like, “SQLite is in the top 2nd percentile for open source C projects based on SLOCCount’s line counting algorithm”?

I’m intrigued enough to want to do the project, but I don’t think I really want to clone the entirety of GitHub onto my HDD in order to find out, even if it’s just one project at a time. That sounds like a great way to blow through my Comcast data cap.
Post by Karel Gardas
Let's talk about some real numbers to illustrate the situation.
Yes, let’s. :)
Post by Karel Gardas
$ time /opt/fossil-head/bin/fossil clone
http://netbsd.sonnenberger.org/ netbsd-src.fossil
real 323m2.323s
user 42m0.262s
sys 13m18.003s
Okay, but compared to what?

If you compare to the checkout time from NetBSD’s main CVS repository, you aren’t comparing apples to apples, since you’re transferring only the tip of the trunk. You have to go back to the CVS server for any history. I suspect if you checked out each CVS revision one at a time, it would take a lot longer than pulling the whole project history with Fossil.

If you want to compare with some other DVCS, post those numbers.
Post by Karel Gardas
rebuild alone
takes around 250 minutes on the same hardware and with the same
fossil.
Would a --skip-rebuild option for fossil clone solve your major problem, then?

Rebuilding is strictly optional. It just makes Fossil operations run faster post-clone.

Also realize that cloning is a one-time activity per development machine, for anyone active enough in the project to maintain their local clone.

A cute option would be if --skip-rebuild would look for a local at(1) command, then offer to schedule the rebuild for a later time, after you’ve left off work for the day.
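A hypothetical sketch of that idea (note: `--skip-rebuild` is named here for illustration only and does not exist as a fossil clone flag; the clone URL is the one from this thread):

```shell
# Hypothetical: clone without the post-transfer rebuild. The
# --skip-rebuild flag is illustrative, not an existing fossil option.
fossil clone --skip-rebuild http://netbsd.sonnenberger.org/ netbsd-src.fossil

# Then defer the expensive rebuild to after working hours via at(1):
echo "cd $PWD && fossil rebuild netbsd-src.fossil" | at 22:00
```

The point is that the rebuild needs no interaction, so nothing prevents batching it for off-hours.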

Although there may be casual clients who clone, do something with the source, throw the clone repo away when done, then clone again a year later when they need the source again, we should not optimize Fossil for that case.

We’ve already discussed shallow clones, which would make Fossil more like CVS in terms of clone size. See the “Fossil 2.0” document Mr. Boogie linked to.
Post by Karel Gardas
real 4m0.765s
user 1m55.442s
sys 1m11.892s
That seems like a much more important problem to solve. 4 minutes per commit is simply *painful*, and it may happen multiple times per day, rather than once per development box.

Here, I occasionally see commit times of 10 seconds or so, and that’s painful enough already.
Nikita Borodikhin
2016-10-28 17:33:14 UTC
Permalink
Hi Karel,

I have quite a big repository (3.4G) imported from svn by a custom tool.
It also took several minutes to commit, and most of the time was spent in
MD5 hash computation. It is an extra precaution to ensure checkout file
integrity, which can be turned off with the repo-cksum setting.

With that setting off, it takes 4 to 6 seconds to commit.
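For reference, the setting can be toggled from within a checkout; a minimal sketch (repo-cksum is the real fossil setting name; run this inside the open checkout directory):

```shell
# Show the current value of the checkout-verification setting
# (it defaults to on).
fossil settings repo-cksum

# Disable the full-checkout MD5 verification that dominates commit time
# on very large trees. Trade-off: fossil no longer cross-checks every
# checkout file against the repository on each commit.
fossil settings repo-cksum off
```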

My hardware is ext4 on Samsung 850 Pro 512 SSD, i7-3770

Nikita
_______________________________________________
fossil-users mailing list
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
Karel Gardas
2016-10-29 20:34:21 UTC
Permalink
Hi Nikita,

your advice indeed helped a lot and brings commits down to 20 seconds here.
Now the question is whether I may really leave file integrity to the
file system, or whether even ZFS/Btrfs is not enough here and fossil
does some other magic?

Thanks!
Karel
Nikita Borodikhin
2016-10-30 04:38:58 UTC
Permalink
Hi Karel,

As I understand this option, it is indeed for extra integrity checking.
It checks all the files in your checkout, not only those involved in the
commit, and allows you to find data corruption on the file system.

As other SCMs (svn and git) don't do that check, I think it is safe to
leave it off.

Nikita.
Karel Gardas
2016-10-30 07:57:46 UTC
Permalink
Post by Warren Young
Post by Karel Gardas
make it more scale-able and allow its real usage also for projects of
bigger size.
How many projects are there bigger than SQLite, percentage-wise?
And does it really matter? The question was just a question: is anybody
working on scaling fossil better? I thought fossil is just another
open-source DVCS where people are free to hack on their ideas, at
least to test some new directions.
Post by Warren Young
Post by Karel Gardas
$ time /opt/fossil-head/bin/fossil clone
http://netbsd.sonnenberger.org/ netbsd-src.fossil
real 323m2.323s
user 42m0.262s
sys 13m18.003s
Okay, but compared to what?
For example Git, on the same source tree:

$ time git clone https://github.com/jsonn/src.git
Cloning into 'src'...
remote: Counting objects: 3725278, done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 3725278 (delta 52), reused 0 (delta 0), pack-reused 3725166
Receiving objects: 100% (3725278/3725278), 2.18 GiB | 773.00 KiB/s, done.
Resolving deltas: 100% (2782525/2782525), done.
Checking connectivity... done.
Checking out files: 100% (176388/176388), done.

real 55m20.926s
user 9m30.362s
sys 4m50.320s
Post by Warren Young
Post by Karel Gardas
rebuild alone
takes around 250 minutes on the same hardware and with the same
fossil.
Would a --skip-rebuild option for fossil clone solve your major problem, then?
There is no such option in the current fossil. Some commands (import)
support --no-rebuild, but clone is not among them.
Post by Warren Young
Rebuilding is strictly optional. It just makes Fossil operations run faster post-clone.
That's news to me; I thought a rebuild was strictly necessary to have
the other fossil commands working. If not, hmm, allowing the rebuild
to run over night may indeed be an option.
Post by Warren Young
We’ve already discussed shallow clones, which would make Fossil more like CVS in terms of clone size. See the “Fossil 2.0” document Mr. Boogie linked to.
Indeed, I've seen this, much appreciated. But originally I was thinking
along the lines of a better-optimized rebuild implementation in the
current fossil, without the need to wait for a future fossil version.
I've been tempted to think in this direction by seeing the excessive
amount of data fossil writes on rebuild relative to the resulting repo size.
Post by Warren Young
Post by Karel Gardas
real 4m0.765s
user 1m55.442s
sys 1m11.892s
That seems like a much more important problem to solve. 4 minutes per commit is simply *painful*, and it may happen multiple times per day, rather than once per development box.
Here, I occasionally see commit times of 10 seconds or so, and that’s painful enough already.
Switching off repo checksumming as suggested by Nikita Borodikhin helps
a lot here. The question is whether to leave it to the file system or
to fossil itself.

Thanks,
Karel
Warren Young
2016-10-31 21:50:50 UTC
Permalink
Post by Karel Gardas
Post by Warren Young
Post by Karel Gardas
make it more scale-able and allow its real usage also for projects of
bigger size.
How many projects are there bigger than SQLite, percentage-wise?
And does it really matter?
Sure it does. Fossil is fast enough for SQLite, so if SQLite is “very large” compared to most other projects that could usefully use it, then speeding up Fossil amounts to spending effort on a tiny minority of users.

All of that is predicated on that first “if,” however.
Post by Karel Gardas
Post by Warren Young
Post by Karel Gardas
$ time /opt/fossil-head/bin/fossil clone
http://netbsd.sonnenberger.org/ netbsd-src.fossil
real 323m2.323s
user 42m0.262s
sys 13m18.003s
Okay, but compared to what?
$ time git clone https://github.com/jsonn/src.git
Cloning into 'src'...
remote: Counting objects: 3725278, done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 3725278 (delta 52), reused 0 (delta 0), pack-reused 3725166
Receiving objects: 100% (3725278/3725278), 2.18 GiB | 773.00 KiB/s, done.
Resolving deltas: 100% (2782525/2782525), done.
Checking connectivity... done.
Checking out files: 100% (176388/176388), done.
real 55m20.926s
user 9m30.362s
sys 4m50.320s
So given your other report, that rebuild takes 250 minutes of that time, then Fossil is within about 25% of the speed of Git, if you don’t rebuild.
Post by Karel Gardas
Post by Warren Young
Post by Karel Gardas
takes around 250 minutes on the same hardware and with the same
fossil.
Would a --skip-rebuild option for fossil clone solve your major problem, then?
There is no such option in the current fossil. Some commands (import)
support --no-rebuild, but clone is not among them.
I didn’t tell you to use that option, I asked if you would like that option to exist.
Post by Karel Gardas
Post by Warren Young
Rebuilding is strictly optional. It just makes Fossil operations run faster post-clone.
That's news to me, I've thought rebuild is strictly necessary to have
other fossil commands working
Nope. The only reason Fossil rebuilds by default is that the clone operation results in a sub-optimal DB, because each cloned artifact is checked into the new DB separately. You end up with a series of incremental states, none of which are equal to the final DB state once the clone is finished.

Rebuilding forces the SQLite instance inside Fossil to take a new look at all the cloned artifacts as a whole and optimize the DB for that completed post-clone state, rather than the series of incremental states that exist at each point during the clone.
Post by Karel Gardas
Switching off repo checksumming as suggested by Nikita Borodikhin helps
a lot here. The question is whether to leave it to the file system or
to fossil itself.
If your filesystem has strong data checksumming (as opposed to just metadata checksumming) then I see no reason to leave repo-cksum turned on. Keep in mind that the vast majority of filesystems in common use do *not* have strong data checksumming, so letting repo-cksum default on is a good idea.
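On ZFS, for example, you can confirm data checksumming is active before relying on it in place of repo-cksum; a sketch, where "tank/repos" is a placeholder dataset name:

```shell
# ZFS checksums data blocks, not just metadata; "checksum" should not
# be "off" for the dataset holding the repository.
# "tank/repos" is an illustrative dataset name.
zfs get checksum tank/repos

# Checksums only detect corruption when blocks are read; scrub the pool
# periodically to verify everything, including cold data:
zpool scrub tank
```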