"Performance" is perhaps the area where you'll get the biggest volume of FUD from various tool vendors regarding source control tools, which in many cases isn't expected because it is difficult to characterize the various projects. The project I am working on has a source code pull of "404 File(s) 19,362,615 bytes" As the current version control for the team is non-client/server, it uses considerable network bandwidth to pull the code. (~6 minutes during the day, ~3 minutes on weekends or late evenings). I checked a baseline version of our code into CVS, and was able with -z9 to pull the code from my home linux box via the cable modem uplink in about 2.5 minutes, and pull from a machine on the local 100Mbit ethernet in about 40 seconds. However, this doesn't account for stuff like back revisions, etc. The last time I did any investigation of benchmarking CVS, it had a full version of the head, and then backreferenced all changes from that point. We have files that are version 2.1380, which on this file would be a nightmare to grab -r2.1. Also, the algorithm doesn't save any history of parsing when comparing 2 different versions, so doing a compare of "-r2.1" and "-r2.2" is the worst case for performance -- it involves 2 linear searches from 2.1380. A sample worst case check can be done by taking 2 completely different source files, and alternatively checking them in over the top of each other as the revision numbers grow: copy test-a.txt test.txt cvs commit -m "test" test.txt copy test-b.txt test.txt cvs commit -m "test" test.txt and so on... For this sort of reason, I am not sure how well CVS necessarilly scales to very large implementations like in the presentation you mentioned. I have heard complaints about tagging speed too, though I haven't done any investigation in that matter myself. A "modern" version control system (something like, say, Rational, PVCS Dimensions, or any of the others) uses a database to hold all their changes, and keeps their archive in a format with more checkpoints, as well as a data representation of what the revision history tree for any given file looks like, that allow it to make more intelligent decisions regarding how to parse the changes as fast as possible when differencing two files, pulling out a specific version, etc. They pay 2 penalties for doing their versioning in this manner: 1) unless you have 1000+ revisions of a single file, the database access penalty exceeds the time to just perform a linear algorithm on the file to get what you want -- the "overkill" factor. And 2) storing full checkpoints is space intensive. However, the size of disk drives is definitely growing faster than our ability to fill them with lines of source code. (I do not write 5x more lines now than I did 2 years ago) With space getting close to free for most people, there is little reason not to use a more space intensive algorithm that saves CPU cycles. A simple database with hash-based searches, something like MySQL, can still sort millions of entries in a matter of seconds on modest hardware. (I wrote a bug tracking database for fun one spring break (like bugzilla, but a bit simpler) and with 500K bugs in the database, each bug having 3-8 events attached to it. It took me about 40 seconds on a P3-450 running linux to sort the database ordered by priority then ordered by age to the second, then return the top 200 results to me via a web page... 
If anyone else has thoughts on this, or any CVS performance numbers, I'd be interested to hear them too. I am not trying to suggest that CVS isn't a great package; I would just need to see some proof that it could handle an NT-esque development team before I would try to implement it in one.

--eric

> -----Original Message-----
> From: Kari Hoijarvi [mailto:hoijarvi at me.wustl.edu]
> Sent: Saturday, March 23, 2002 9:39 AM
> To: cvsnt
> Subject: [Cvsnt] CVS(NT) with huge repositories?
>
> There is an interesting slide show about NT development:
> http://www.usenix.org/events/usenix-win2000/invitedtalks/lucovsky_html/
> especially the slide about version control:
> http://www.usenix.org/events/usenix-win2000/invitedtalks/lucovsky_html/sld015.htm
>
> I wonder if CVS(NT) actually could handle it? They wrote their own
> tool. I was on the Outlook team 1995-1998 and used an earlier version
> of it. It was able to handle Office 2000 fairly well, maybe 1/6th of
> the size of Win2000. My experience with the synchronization effort is
> consistent with that, about 15 minutes vs. 2 hours.
>
> Has someone benchmarked CVS(NT) with 200 projects (250 MB each) and
> about 1000 updated files per day?
>
> Kari

_______________________________________________
Cvsnt mailing list
Cvsnt at cvsnt.org
http://www.cvsnt.org/cgi-bin/mailman/listinfo/cvsnt