
Re: [PATCH] Speed-up of libsvn_diff by restarts of LCS algorithm

From: Johan Corveleyn <jcorvel_at_gmail.com>
Date: Tue, 14 Jun 2011 23:12:30 +0200

On Tue, Jun 14, 2011 at 7:32 PM, Morten Kloster <morklo_at_gmail.com> wrote:
> On Tue, Jun 14, 2011 at 2:49 AM, Johan Corveleyn <jcorvel_at_gmail.com> wrote:
>> On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <morklo_at_gmail.com> wrote:
>>>
> []
>>> Here's an updated patch based on the current HEAD, and with
>>> one or two minor improvements. While, as stated, the patch
>>> can give many-fold speed increases for ideal cases, it also slows
>>> down the LCS algorithm significantly for poorly matching files; I
>>> got about 18% increase in running time on my system for files
>>> that were very poor matches but had no lines unique to either file.
>>
>> Hi Morten,
>>
>> Hmmm, that sounds like a lot of overhead in some cases (I actually
>> wonder why -- I haven't fully understood the implementation, but on
>> first sight it doesn't seem that computationally intensive). But the
>> most important question that's bugging me: do those "ideal cases"
>> occur often in real-world examples?
>>
>
> It's not surprising to get that kind of speed reduction, since it uses
> an extra test within the loops that take the bulk of the time when
> the files don't match.

Indeed, that might be it :-). Just wondering though: could it be
that, in the worst case, clear_fp is called too often (more often than
is useful, each time resetting the LCS run for only marginal gain)?

Can you verify that the 18% really comes from those extra
comparisons, and not from calls to clear_fp? IIUC, with those
poorly-matching files there should never be a call to clear_fp,
because the unique-matching optimization never kicks in, right?

Also, the 18% you measured: is that the increase in
svn_diff_file_diff time, or in lcs time? Or are those approximately
the same here, because the poor matching makes the lcs time dominate?
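
Just so we're talking about the same thing, here's a stripped-down
sketch of the O(NP) loop shape I have in mind. It's plain edit
distance on C strings, not lcs.c itself, and the comment about your
extra test is only my reading of your description:

    /* onp.c: minimal Wu/Manber/Myers O(NP) edit distance, for
       illustration only (not lcs.c).  Build: cc -o onp onp.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int max_int(int a, int b) { return a > b ? a : b; }

    /* Follow the diagonal ("snake") from y on diagonal k while the
       inputs keep matching; return the furthest y reached. */
    static int snake(int k, int y, const char *a, int m,
                     const char *b, int n)
    {
      int x = y - k;
      while (x < m && y < n && a[x] == b[y])
        {
          x++;
          y++;
        }
      return y;
    }

    /* Edit distance between a (length m) and b (length n), m <= n. */
    static int onp_distance(const char *a, int m, const char *b, int n)
    {
      int delta = n - m;
      int *buf = malloc((m + n + 3) * sizeof(int));
      int *fp = buf + m + 1;        /* valid indices -(m+1) .. n+1 */
      int k, p;

      for (k = -(m + 1); k <= n + 1; k++)
        fp[k] = -1;

      for (p = 0; ; p++)
        {
          /* As I understand the patch, each snake in the loops below
             would also check something like
                 if (fp[k].uniquematchcount > limit - abs(k))
                   restart the LCS from this reset point;
             When the files match poorly, p grows large, these loops
             dominate, and that one extra comparison per snake is
             where I'd expect the ~18% to go. */
          for (k = -p; k < delta; k++)
            fp[k] = snake(k, max_int(fp[k - 1] + 1, fp[k + 1]),
                          a, m, b, n);
          for (k = delta + p; k > delta; k--)
            fp[k] = snake(k, max_int(fp[k - 1] + 1, fp[k + 1]),
                          a, m, b, n);
          fp[delta] = snake(delta,
                            max_int(fp[delta - 1] + 1, fp[delta + 1]),
                            a, m, b, n);
          if (fp[delta] >= n)       /* reached the end of b */
            break;
        }
      free(buf);
      return delta + 2 * p;         /* edit distance = delta + 2p */
    }

    int main(void)
    {
      const char *a = "cbabac";     /* the shorter string goes first */
      const char *b = "abcabba";
      printf("edit distance = %d\n",
             onp_distance(a, (int)strlen(a), b, (int)strlen(b)));
      return 0;                     /* prints 5 */
    }

If the per-snake test really is the only extra work on the no-reset
path, 18% still feels like a lot for one comparison, which is why I'm
asking.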

I have two more, possibly stupid, questions on the implementation in the patch:

- You mentioned in your theoretical explanation that you still have to
scan the entire d-range for every p, even after you've found a "reset
point". But can't you just "break" from the for-loops whenever you
encounter a reset point? (There's a toy sketch of what I mean after
these two questions.)

- You calculated 'limit' as:
        limit = (d >= 0 ? 2 * p + d : 2 * p - d);
and made the "reset condition" be:
        if (fp[k].uniquematchcount > limit + k) /* with k < 0 */
or:
        if (fp[k].uniquematchcount > limit - k) /* with k >= 0 */

I don't fully understand this. Is 'limit' simply the number of edit
operations spent by phase p (|d| + 2p, the edit-distance bound in the
O(NP) scheme)? Even if so, I don't yet see why comparing
uniquematchcount against limit - |k| is the correct "reset
condition". Can you explain a bit more?
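
For what it's worth, the shape I have in mind for question 1 is
something like this toy fragment (uniquematchcount and the limit test
are from your patch; all other names are invented, and the actual
snake step is elided):

    /* Toy shape of what I mean by "just break". */
    struct fp_entry { int y; int uniquematchcount; };

    /* Scan diagonals -p .. d-1 for one phase; return the first reset
       point found, or d if there is none. */
    static int scan_lower_band(struct fp_entry *fp, int d, int p,
                               int limit)
    {
      int k;
      for (k = -p; k < d; k++)
        {
          /* ... advance the snake on diagonal k as usual ... */
          if (fp[k].uniquematchcount > (k < 0 ? limit + k : limit - k))
            return k;  /* reset point: why finish the rest of the band? */
        }
      return d;
    }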

>> I'm sure you could craft good examples, but how likely am I to have a
>> benefit of this in my average repository (given all of the other
>> optimizations taking away already a lot of the work (prefix/suffix
>> elimination; elimination of non-matching lines due to token-counting;
>> ...))?
>>
>> Can you give examples from subversion's own repository, or other
>> public repositories, where it can have a big impact?
>>
>
> I ran it on merge.c (in libsvn_client), between revisions 1127198 (HEAD,
> or close enough) and 1036419 (semi-randomly chosen to have about
> the right fraction of changes from 1127198 - it was the first one I looked
> at, and seemed reasonable for the test). When running all of
> svn_diff_file_diff, the patch was about 15% faster, however, the LCS
> algorithm itself was about 3.5 times faster; most of the time was
> spent reading the tokens from file.

That's quite neat! It's a lot more than I expected in an average case.
Maybe some more testing should be done (not saying you have to do this
per se, just in general) to see how representative this is :-). But
for now it's already great to see a single real-world example like the
one you just gave.

> The "problem" is that the files have to be quite big before the LCS
> algorithm itself dominates the running time even when the files still
> match reasonably well, whereas it is much easier for the LCS
> algorithm to dominate the running time when the files don't match
> at all, in which case this patch degrades performance.

Yeah, I know :-). The token-scanning/hashing phase is still quite a
heavy factor in the diff performance (especially when the matching of
the files is reasonably good (yielding a fast LCS-run), which is
typically the case with changes in source code).

Stefan Fuhrmann (stefan2) has some great ideas for optimizing the
token-scanning phase, which could help a lot for that. We had some
nice discussions about it during the Berlin Hackathon last month (but
I didn't have the chance yet to dig into it further). But that's all
definitely 1.8 or later stuff.

Also, in the same vein, I think this current patch will have to wait
for post-1.7. I like the idea a lot, and would definitely like to see
it in svn one day, but it's not such a trivial change. With 1.7 just
around the corner, we're trying to avoid changes that are not directly
related to getting 1.7 (finally) out the door :-).

> It is quite easy to do a quick heuristic test for whether this patch
> would help or not, so we could have alternate versions of the LCS
> algorithm depending on whether that test checks out. That would
> make the code somewhat harder to maintain, of course.

That might be a good idea, to really get the best of both worlds (a
rough sketch of what I'd imagine follows below). Though I think it
would also be acceptable to optimize for the common case (reasonably
well-matching files, modulo all the "non-matching lines"), at a slight
cost for the rare case (files that match terribly badly, even after
skipping all the non-matching lines). Not sure right now; I guess it
depends on the cost of the added code complexity.
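
For concreteness, I'd picture the dispatch roughly like this (every
name and the 10% threshold are made up; presumably the real signal
would come from the token-counting phase we already run):

    /* Hypothetical pre-test: pick the LCS variant up front, based on
       how many unique line-matches the token-counting phase found. */
    enum lcs_variant { LCS_PLAIN, LCS_WITH_RESTARTS };

    static enum lcs_variant
    choose_lcs_variant(int tokens_a, int tokens_b, int unique_matches)
    {
      int shorter = tokens_a < tokens_b ? tokens_a : tokens_b;

      /* Restarts only pay off when unique matches are frequent enough
         that a reset point is likely; otherwise we'd pay the extra
         per-snake comparison for nothing.  The 10% cut-off is purely
         a placeholder that would need measuring. */
      return (10 * unique_matches >= shorter) ? LCS_WITH_RESTARTS
                                              : LCS_PLAIN;
    }

But whether that's worth the extra maintenance burden is exactly the
question, of course.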

Definitely something to look at post-1.7 :-).

>> Or are you looking at this for some specific purpose, some repository
>> where this situation occurs regularly and has a big impact (huge
>> LCS'es ...)?
>>
>
> Nope, I just like to optimize stuff, and since we were already
> looking at diff optimization I thought I might as well point out the
> possibility.

Cool :-). Me too.

-- 
Johan