Discussion:
[fossil-users] rebuild scale-ability/data written/repo size ratio
Karel Gardas
2016-10-28 09:45:21 UTC
Permalink
Hello,

first of all, I know that Fossil was written with the idea of serving the
SQLite project and projects of similar size well, and that it does a
great job at this task.

I'm just curious whether there are people here tinkering with the idea
of making it more scalable and allowing its real usage also for
projects of a bigger size.

Now that the git -> fossil (incremental) mirror functionality seems to
be working, this may be even more interesting or tempting IMHO.

Let's talk about some real numbers to illustrate the situation. Let's
clone NetBSD src tree kindly provided by Jörg Sonnenberger by
following command:

$ time /opt/fossil-head/bin/fossil clone
http://netbsd.sonnenberger.org/ netbsd-src.fossil

It takes:

real 323m2.323s
user 42m0.262s
sys 13m18.003s

on my E5-2620 Sandy Bridge workstation. Of course part of this time is
perhaps spent on not-so-efficient network send/receive, but the
majority of the time, at least as observed from the output of the
command, is spent on the DB rebuild. I know that from the example of
the OpenBSD src tree, which is comparable in size to NetBSD, and where
the rebuild alone takes around 250 minutes on the same hardware and
with the same fossil.

So much for the time spent on the rebuild. What may be even more
important is how much data the rebuild writes. I don't have exact
numbers here, but this is on my workstation, so I can see what's going
on by keeping an eye on the drive meters. Let's assume I'm not far off
in claiming that the rebuild was writing data at ~40 MB/s for 2 or
even more hours. In sum this may be around 300 GB of data written
during this rebuild (rounded up). This is for a repository whose final
file size is:

$ ls -lha netbsd-src.fossil
-rw-r--r-- 1 karel karel 2.6G Oct 28 01:11 netbsd-src.fossil

and which results in the source tree of size of 2.7 GB.
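The ~300 GB figure above is a back-of-the-envelope estimate. The arithmetic, under the stated assumptions (~40 MB/s sustained for roughly two hours, ~2.6 GiB final repo), can be sketched like this:

```shell
# Estimate of data written during rebuild and the resulting write
# amplification, using the observed figures above.
# Assumptions: 40 MB/s sustained for ~7200 s; final repo ~2662 MiB (2.6 GiB).
written_mb=$(( 40 * 7200 ))   # 288000 MB written, roughly 281 GiB
repo_mb=2662                  # final netbsd-src.fossil size in MiB (approx.)
echo "data written: ${written_mb} MB"
echo "write amplification: ~$(( written_mb / repo_mb ))x the final repo size"
```

That is, the rebuild writes on the order of 100x the final repository size to disk.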

Now, just to show that this rebuild may be the biggest scalability
obstacle, I'd like to compare it with the open/status/diff/commit operations:

- open: results in 2.7GB of data written to disk in the resulting
NetBSD source tree. It takes:
real 4m38.843s
user 1m44.221s
sys 1m58.553s

IMHO a very nice result for a source tree of this size

- status/diff -- one random file modified: both run for 4-5 seconds.
Also very nice results for a source tree of this size

- commit: this is a little bit harder. With one file modified, a commit takes:
real 4m0.765s
user 1m55.442s
sys 1m11.892s

IMHO not so nice, but still kind of acceptable even for development on
a source tree of this size. But commit may certainly be another target
for speedup hacking.

So that's it. Fossil used in those tests is:

This is fossil version 1.37 [0fa60142eb] 2016-10-26 21:45:52 UTC

and the tests were performed on a ZFS mirror of two SSDs (1TB Crucial
MX200 and 1TB Samsung 850 Evo) on Solaris 11.2 running on an E5-2620 with
32GB RAM -- in case anybody is interested in this info for verifying
the numbers.

Cheers,
Karel
jungle Boogie
2016-10-28 16:40:55 UTC
Permalink
Post by Karel Gardas
I'm just curious if there are people here tinkering with the idea to
make it more scale-able and allow its real usage also for projects of
bigger size.
There has been such a discussion. I have an email with the subject
"Fossil 2.1: Scaling" from March 2015. The archive at
http://www.mail-archive.com/ doesn't go back that far, though.


Ah, just found it on MARC:
http://marc.info/?l=fossil-users&m=144565850643439&w=2
--
-------
inum: 883510009027723
sip: ***@sip2sip.info
Warren Young
2016-10-28 17:02:56 UTC
Permalink
Post by Karel Gardas
make it more scale-able and allow its real usage also for projects of
bigger size.
How many projects are there bigger than SQLite, percentage-wise?

Has anyone done something like produce a SLOC histogram for all projects on GitHub or Sourceforge, so that we can say something like, “SQLite is in the top 2nd percentile for open source C projects based on SLOCCount’s line counting algorithm”?

I’m intrigued enough to want to do the project, but I don’t think I really want to clone the entirety of GitHub onto my HDD in order to find out, even if it’s just one project at a time. That sounds like a great way to blow through my Comcast data cap.
Post by Karel Gardas
Let's talk about some real numbers to illustrate the situation.
Yes, let’s. :)
Post by Karel Gardas
$ time /opt/fossil-head/bin/fossil clone
http://netbsd.sonnenberger.org/ netbsd-src.fossil
real 323m2.323s
user 42m0.262s
sys 13m18.003s
Okay, but compared to what?

If you compare to the checkout time from NetBSD’s main CVS repository, you aren’t comparing apples to apples, since you’re transferring only the tip of the trunk. You have to go back to the CVS server for any history. I suspect if you checked out each CVS revision one at a time, it would take a lot longer than pulling the whole project history with Fossil.

If you want to compare with some other DVCS, post those numbers.
Post by Karel Gardas
rebuild alone
takes around 250 minutes on the same hardware and with the same
fossil.
Would a --skip-rebuild option for fossil clone solve your major problem, then?

Rebuilding is strictly optional. It just makes Fossil operations run faster post-clone.

Also realize that cloning is a one-time activity per development machine, for anyone active enough in the project to maintain their local clone.

A cute option would be if --skip-rebuild would look for a local at(1) command, then offer to schedule the rebuild for a later time, after you’ve left off work for the day.
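A hypothetical sketch of that idea (note: `--skip-rebuild` is named here for illustration only and does not exist as a fossil clone flag; the clone URL is the one from this thread):

```shell
# Hypothetical: clone without the post-transfer rebuild. The
# --skip-rebuild flag is illustrative, not an existing fossil option.
fossil clone --skip-rebuild http://netbsd.sonnenberger.org/ netbsd-src.fossil

# Then defer the expensive rebuild to after working hours via at(1):
echo "cd $PWD && fossil rebuild netbsd-src.fossil" | at 22:00
```

The point is that the rebuild needs no interaction, so nothing prevents batching it for off-hours.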

Although there may be casual clients who clone, do something with the source, throw the clone repo away when done, then clone again a year later when they need the source again, we should not optimize Fossil for that case.

We’ve already discussed shallow clones, which would make Fossil more like CVS in terms of clone size. See the “Fossil 2.0” document Mr. Boogie linked to.
Post by Karel Gardas
real 4m0.765s
user 1m55.442s
sys 1m11.892s
That seems like a much more important problem to solve. 4 minutes per commit is simply *painful*, and it may happen multiple times per day, rather than once per development box.

Here, I occasionally see commit times of 10 seconds or so, and that’s painful enough already.
Nikita Borodikhin
2016-10-28 17:33:14 UTC
Permalink
Hi Karel,

I have quite a big repository (3.4G) imported from svn by a custom tool.
It also took several minutes to commit, and most of the time was spent in
MD5 hash computation. It is an extra precaution to ensure checkout file
integrity, which can be turned off with the repo-cksum setting.

With that setting off, it takes 4 to 6 seconds to commit.
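For reference, the setting can be toggled from within a checkout; a minimal sketch (repo-cksum is the real fossil setting name; run this inside the open checkout directory):

```shell
# Show the current value of the checkout-verification setting
# (it defaults to on).
fossil settings repo-cksum

# Disable the full-checkout MD5 verification that dominates commit time
# on very large trees. Trade-off: fossil no longer cross-checks every
# checkout file against the repository on each commit.
fossil settings repo-cksum off
```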

My hardware is ext4 on Samsung 850 Pro 512 SSD, i7-3770

Nikita
_______________________________________________
fossil-users mailing list
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
Karel Gardas
2016-10-29 20:34:21 UTC
Permalink
Hi Nikita,

your advice indeed helped a lot and brings commits down to 20 seconds here.
Now the question is whether I may really leave file integrity to the
file system, or whether even ZFS/Btrfs is not enough here and fossil
does some other magic?

Thanks!
Karel
Nikita Borodikhin
2016-10-30 04:38:58 UTC
Permalink
Hi Karel,

As I understand this option, it is indeed for extra integrity checking.
It checks all the files in your checkout, not only those involved in the
commit, and allows you to find data corruption on the file system.

As other SCMs (svn and git) don't do that check, I think it is safe to
leave it off.

Nikita.
Karel Gardas
2016-10-30 07:57:46 UTC
Permalink
Post by Warren Young
Post by Karel Gardas
make it more scale-able and allow its real usage also for projects of
bigger size.
How many projects are there bigger than SQLite, percentage-wise?
And does it really matter? The question was just a question: is anybody
working on scaling fossil better? I thought fossil is just another
open-source DVCS where people are free to hack on their ideas, at
least to test some new directions.
Post by Warren Young
Post by Karel Gardas
$ time /opt/fossil-head/bin/fossil clone
http://netbsd.sonnenberger.org/ netbsd-src.fossil
real 323m2.323s
user 42m0.262s
sys 13m18.003s
Okay, but compared to what?
For example Git, on the same source tree:

$ time git clone https://github.com/jsonn/src.git
Cloning into 'src'...
remote: Counting objects: 3725278, done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 3725278 (delta 52), reused 0 (delta 0), pack-reused 3725166
Receiving objects: 100% (3725278/3725278), 2.18 GiB | 773.00 KiB/s, done.
Resolving deltas: 100% (2782525/2782525), done.
Checking connectivity... done.
Checking out files: 100% (176388/176388), done.

real 55m20.926s
user 9m30.362s
sys 4m50.320s
Post by Warren Young
Post by Karel Gardas
rebuild alone
takes around 250 minutes on the same hardware and with the same
fossil.
Would a --skip-rebuild option for fossil clone solve your major problem, then?
There is no such option in the current fossil. Some commands (import)
support --no-rebuild, but clone is not among them.
Post by Warren Young
Rebuilding is strictly optional. It just makes Fossil operations run faster post-clone.
That's news to me; I thought a rebuild was strictly necessary to have
the other fossil commands working. If not, hmm, allowing the rebuild
to run over night may indeed be an option.
Post by Warren Young
We’ve already discussed shallow clones, which would make Fossil more like CVS in terms of clone size. See the “Fossil 2.0” document Mr. Boogie linked to.
Indeed, I've seen this, much appreciated. But originally I was thinking
along the lines of a better-optimized rebuild implementation in the
current fossil, without the need to wait for a future fossil version.
I've been tempted to think in this direction by seeing the excessive
amount of data fossil writes on rebuild relative to the resulting repo size.
Post by Warren Young
Post by Karel Gardas
real 4m0.765s
user 1m55.442s
sys 1m11.892s
That seems like a much more important problem to solve. 4 minutes per commit is simply *painful*, and it may happen multiple times per day, rather than once per development box.
Here, I occasionally see commit times of 10 seconds or so, and that’s painful enough already.
Switching off repo checksumming as suggested by Nikita Borodikhin helps
a lot here. The question is whether to leave it to the file system or
to fossil itself.

Thanks,
Karel
Warren Young
2016-10-31 21:50:50 UTC
Permalink
Post by Karel Gardas
Post by Warren Young
Post by Karel Gardas
make it more scale-able and allow its real usage also for projects of
bigger size.
How many projects are there bigger than SQLite, percentage-wise?
And does it really matter?
Sure it does. Fossil is fast enough for SQLite, so if SQLite is “very large” compared to most other projects that could usefully use it, then speeding up Fossil amounts to spending effort on a tiny minority of users.

All of that is predicated on that first “if,” however.
Post by Karel Gardas
Post by Warren Young
Post by Karel Gardas
$ time /opt/fossil-head/bin/fossil clone
http://netbsd.sonnenberger.org/ netbsd-src.fossil
real 323m2.323s
user 42m0.262s
sys 13m18.003s
Okay, but compared to what?
$ time git clone https://github.com/jsonn/src.git
Cloning into 'src'...
remote: Counting objects: 3725278, done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 3725278 (delta 52), reused 0 (delta 0), pack-reused 3725166
Receiving objects: 100% (3725278/3725278), 2.18 GiB | 773.00 KiB/s, done.
Resolving deltas: 100% (2782525/2782525), done.
Checking connectivity... done.
Checking out files: 100% (176388/176388), done.
real 55m20.926s
user 9m30.362s
sys 4m50.320s
So given your other report, that rebuild takes 250 minutes of that time, then Fossil is within about 25% of the speed of Git, if you don’t rebuild.
Post by Karel Gardas
Post by Warren Young
Post by Karel Gardas
takes around 250 minutes on the same hardware and with the same
fossil.
Would a --skip-rebuild option for fossil clone solve your major problem, then?
There is no such option in the current fossil. Some commands (import)
support --no-rebuild, but clone is not among them.
I didn’t tell you to use that option, I asked if you would like that option to exist.
Post by Karel Gardas
Post by Warren Young
Rebuilding is strictly optional. It just makes Fossil operations run faster post-clone.
That's news to me, I've thought rebuild is strictly necessary to have
other fossil commands working
Nope. The only reason Fossil rebuilds by default is that the clone operation results in a sub-optimal DB, because each cloned artifact is checked into the new DB separately. You end up with a series of incremental states, none of which are equal to the final DB state once the clone is finished.

Rebuilding forces the SQLite instance inside Fossil to take a new look at all the cloned artifacts as a whole and optimize the DB for that completed post-clone state, rather than the series of incremental states that exist at each point during the clone.
Post by Karel Gardas
Switching off repo checksumming as suggested by Nikita Borodikhin helps
a lot here. The question is whether to leave it to the file system or
to fossil itself.
If your filesystem has strong data checksumming (as opposed to just metadata checksumming) then I see no reason to leave repo-cksum turned on. Keep in mind that the vast majority of filesystems in common use do *not* have strong data checksumming, so letting repo-cksum default on is a good idea.
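On ZFS, for example, you can confirm data checksumming is active before relying on it in place of repo-cksum; a sketch, where "tank/repos" is a placeholder dataset name:

```shell
# ZFS checksums data blocks, not just metadata; "checksum" should not
# be "off" for the dataset holding the repository.
# "tank/repos" is an illustrative dataset name.
zfs get checksum tank/repos

# Checksums only detect corruption when blocks are read; scrub the pool
# periodically to verify everything, including cold data:
zpool scrub tank
```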