Closed Bug 585258 Opened 14 years ago Closed 13 years ago

Performance regression on the testcase in bug 424715

Categories

(Core :: Web Painting, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- final+

People

(Reporter: bzbarsky, Assigned: roc)

References

(Depends on 1 open bug, Blocks 1 open bug, )

Details

(Keywords: perf, regression)

Attachments

(4 files, 1 obsolete file)

We've got to add this to some regression suite...

I profiled the testcase in bug 424715 after bug 424715 comment 35, and a large part of what seems to be happening is that we're doing a LOT of nsRegion::Sub operations in nsDisplayItem::RecomputeVisibility.  I would assume that we in fact do one Sub() call for each div, so each painting pass ends up O(N^2) in number of divs.  Is this a region we can sanely simplify outward if it gets too complicated or something?

Note that this code was introduced in "Bug 564991. Part 9: Retain layer trees. r=tnikkel,sr=mats"
Blocks: 467263
blocking2.0: --- → ?
Keywords: perf, regression
Yes, we should simplify there.
Attached file zipped up testcase
I made a scrolling test out of this. These are the results:

Fx3.6.8: 46154ms
2010-07-14: 54926ms
2010-07-16: 35751ms
2010-08-30: 43296ms
Seamonkey 1.9.1 build of 2010-07-01: 8376ms
Opera: 10703ms

Between 2010-07-14 and 2010-07-16 is when bug 564991 landed.

Btw, notice how 1.9.1 builds seem to be that much quicker.

Is this test useful for this bug? Is this useful for a regression testsuite? What regression suite was commented about in comment 0?
> Is this test useful for this bug?

No, but it's a useful test....

> What regression suite was commented about in comment 0?

We don't have one like that yet.  That's the problem.

roc, this should probably block a beta, right?
Attachment #471538 - Attachment mime type: application/zip → application/java-archive
I don't think it needs to block beta, since all we need to do here is back off some analysis when it's getting expensive and repaint the entire window. But it definitely blocks!

Martijn's test should probably be added to Tscroll. But we don't need to use a huge file, we can document.write() our way to victory.
(In reply to comment #4)
> I don't think it needs to block beta, since all we need to do here is back off
> some analysis when it's getting expensive and repaint the entire window.

In fact we won't even need to repaint the window.
blocking2.0: ? → final+
I know this is "blocking final+" but my understanding is there are a large number of blocking bugs and not very much time to ship and cuts must come.

Can anyone really fix this fairly large performance regression (in terms of effort people made the test case performant in the first place) or should it really be considered a more long term gaol?
I think we can fix it fairly easily.
Doing a quick test of the original benchmark on my fairly fast system the current browsers look like this to me:

Firefox nightly build: 160 seconds at about 110/120 frames and then crash http://crash-stats.mozilla.com/report/index/bp-3f3378ae-890e-480d-b5a5-7e13a2101219
IE 9 Beta 2: 46.774 seconds
Chrome latest Canary Build: 10.877 seconds
Opera 11: 8.17 seconds
Yes, there's no point retesting until this bug is fixed.  :(
Sorry for the bugspam, I get a little excited sometimes as I don't have much spare time to do Firefox testing these days. Good luck for Fx4 release.
My time is 53692ms in my debug build.
Actually that was 53s in an opt build.
OK, so I applied that patch.  Things are still Really Slow, though.  Compared to bug 424715 comment 26, I see something like 87% of the time in painting (used to be 42%).  20% is spent building the display list (14%) and computing visibility (6%).  But painting itself is 65%: 10% building layers and calling WillEndTransaction on the layer builder, and 55% painting backgrounds...

Quartz debug (insofar as I can trust it) claims we repaint a box bounded by the text in the upper right corner and the most-recent line of the raytrace, so we would be repainting all the previously-painted things.  I bet we didn't use to do that before...
On this hardware, FF 3.6 is 70s, Chrome is 13s. I'll see what else we can do, but this is a rare sort of testcase; we hardly ever see real pages with such an absurd number of element visible at the same time.
Oh, I'm testing the scroll link from the testcase in comment #2. Are you testing pageload in the original test?
At least during scrolling, we're repainting exactly the areas you'd expect.
3.6 is before we started fixing the deps of bug 424715, as I recall....

I just tried modifying the testcase by removing the innerHTML set for the thing at the top, and time I see dropped from 61s (with the attached patch) to about 37s.  Safari 5 and Chrome are both at about 11s (with the innerHTML set).

We still have the fix for bug 516740 in, so that's not it.  :(
> Oh, I'm testing the scroll link from the testcase in comment #2.

Oh.  See comment 3!

I'm testing the original test.  Load http://nontroppo.org/timer/progressive_raytracer.html and click "Full Render".
Just FYI... Early betas of FF 3.7 had this test case down to 11 seconds on my Core2Duo, so this appears to have been more or less completely resolved at one point.
Yes, we know...  See comment 0.  ;)
OK thanks for clarifying that.

Now I think that the problem is in FrameLayerBuilder::InvalidateThebesLayerContents. The invalid region we accumulate gets complicated because the rects we invalidate aren't simply in top-to-bottom, left-to-right order; we get some holes. When it gets complicated, nsRegion::SimplifyOutward fluffs out to the bounding box, which includes the text at the top-right, so we invalidate a big chunk of the layer.

I think we need to make nsRegion::SimplifyOutward smarter.
This takes the "full render" from 45s to 30s on my machine. We repaint a constant number of display items in every pass --- 732, which seems high. I'll look into that next.
Attachment #499257 - Flags: review?(bzbarsky)
Oh, 732 is about right actually. 3 rows, each 240 pixels (elements) wide.
Hmm, xperf is still not nearly as good as Shark for this work, if I'm using it right...
Maybe --enable-profiling would help...
Indeed, xperf looks quite good now.

Painting time still dominates (74% of total profile in nsLayoutUtils::PaintFrame). Of the painting time, roughly 33% is display-list construction (BuildDisplayListForStackingContext), 28% is ComputeVisibilityForRoot, 34% is layer construction (under FrameLayerBuilder::BuildContainerLayerFor and FrameLayerBuilder::WillEndTransaction), and 5% is actual painting (under LayerManagerD3D9::EndTransaction).
Hmm.  So I tried applying the second patch, but it doesn't seem to help anything here...  In fact, even if I make nsRegion::SimplifyOutward a no-op that changes nothing.  I still see over 50% of the time spent in actual painting, and over 87% spent in painting-related stuff (including painting, layer construction, etc)...

Maybe we're simplifying the region somewhere else on Mac too or something?  Or maybe our painting really has gotten that much slower?  I'm seeing this crap land inside CoreGraphics color management stuff, for one thing.
Comment on attachment 499257 [details] [diff] [review]
Part 2: Make SimplifyOutward smarter

>+           pRect->YMost () >= pRect->next->y)

Drop the space char after "YMost" here?

>+      pRect->UnionRect(*pRect, *pRect->next);
>+      delete Remove (pRect->next);

And here after "Remove"?

r=me with those nits.
Attachment #499257 - Flags: review?(bzbarsky) → review+
But again, part 2 seems to do nothing for me on Mac.
(In reply to comment #30)
> Comment on attachment 499257 [details] [diff] [review]
> Part 2: Make SimplifyOutward smarter
> 
> >+           pRect->YMost () >= pRect->next->y)
> 
> Drop the space char after "YMost" here?
> 
> >+      pRect->UnionRect(*pRect, *pRect->next);
> >+      delete Remove (pRect->next);
> 
> And here after "Remove"?
> 
> r=me with those nits.

Those are consistent with the rest of nsRegion.cpp.
If you count the number of times we reach here:
http://mxr.mozilla.org/mozilla-central/source/layout/base/FrameLayerBuilder.cpp#1672
in each call to DrawThebesLayer, you should see that number growing linearly with each paint before part 2, and not growing with each paint after part 2. At least I did.

(In reply to comment #28)
> Painting time still dominates (74% of total profile in
> nsLayoutUtils::PaintFrame). Of the painting time, roughly 33% is display-list
> construction (BuildDisplayListForStackingContext), 28% is
> ComputeVisibilityForRoot, 34% is layer construction (under
> FrameLayerBuilder::BuildContainerLayerFor and
> FrameLayerBuilder::WillEndTransaction), and 5% is actual painting (under
> LayerManagerD3D9::EndTransaction).

Of those, the layer construction time (34%) is new. The rest should be pretty much unchanged since 3.6. I guess it's possible that Quartz painting has slowed down ... we've switched from drawing to a CGBitmapContext to a CGLayer, and maybe that adds overhead that shows up when you're drawing 59,000 1x1 pixel rects.

I'm willing to do small tweaks to make this testcase faster but not major restructuring. This testcase has 100x more visible elements than any page you'll ever see that's not doing some kind of one-DIV-per-pixel hack ... and <canvas> is the right API for this use-case.
> Those are consistent with the rest of nsRegion.cpp.

Ah, ok.

I agree we shouldn't spend too much time on this except insofar as real pages might be affected.  I'll take a look at the counts I see.
If the counts keep growing, you'd want to look at calls to ThebesLayerOGL::InvalidateRegion to see which region(s) are being invalidated at each paint. It should be something reasonable, a rectangle for the text and one or more rectangles for the strip of DIVs.
Over here (on Mac), unless I'm screwing something up, I see the number of times we reach that line growing linearly (adding 729 per paint) no matter what I do with nsRegion::SimplifyOutward (including just putting a return at the very beginning of the function)...

Lemme take a look at those regions.
Yeah, those look broken to me.  I see two InvalidateRegion calls per paint.  The first is passed an empty region.  The second is passed a region with three rects. Typical rects for that second call:

  x=502 y=-1  width=745 height=231
  x=502 y=230 width=243 height=2
  x=745 y=230 width=502 height=1

On each call the y values for rects 2 and 3 and the height for rect 1 increase by 3.

So I guess the question is where that first rect comes from.
Comment on attachment 499257 [details] [diff] [review]
Part 2: Make SimplifyOutward smarter

>+    // Combine with the following rectangle if they have the same YMost
>+    // or if they overlap vertically. This ensures that all overlapping
>+    // rectangles are merged, preserving the invariant that rectangles
>+    // don't overlap.

"... don't intersect" would distinguish from other kinds of overlapping.
Real page load times are always most important, but it would be cool if there existed a suite of 'chaos' benchmarks that reacted very badly to unintended performance regressions in very important functions, that way they could be treated as less important but give developers of an idea of potential unintended regressions.

Also it's a bit of a shame between Fx 3.6 and 4.0 this benchmark has had Fx as the slowest modern browser, to the fastest, to the slowest again. Much appreciate your attention roc!
I _think_ it's already present in the invalidThebesContent region we get in FrameLayerBuilder::BuildContainerLayerFor.  Though that region has 5 rects, and the region that's returned from ToOutsidePixels only has 3...
Comment on attachment 499230 [details] [diff] [review]
Part 1: Don't make visible region arbitrarily complex in RecomputeVisibility

r=dbaron
Attachment #499230 - Flags: review?(dbaron) → review+
Keywords: checkin-needed
Whiteboard: [needsd landing]
(In reply to comment #40)
> I _think_ it's already present in the invalidThebesContent region we get in
> FrameLayerBuilder::BuildContainerLayerFor.  Though that region has 5 rects, and
> the region that's returned from ToOutsidePixels only has 3...

That region is added to in FrameLayerBuilder::InvalidateThebesLayerContents. If you log the incoming aRect and the resulting invalidThebesContent region, you should be able to understand how the region ends up with that oversized first rectangle.

The SimplifyOutward(20) call there is what was causing trouble for me, which patch 2 fixed. Everything you're saying makes it sound like patch 2 was not applied ...
> Everything you're saying makes it sound like patch 2 was not applied ...

It really truly is!

I did this with printfs just now instead of a debugger, and in an opt build to avoid whatever noise a debug build was introducing.  I can no longer reproduce comment 37 on, but I do still see comment 36.  Here's what a log looks like:

Our region is: Rect count: 2
  [(52261,5130) 22559x930]
  [(30120,26976) 14580x180]
And will become: Rect count: 2
  [(871,85) 376x16]
  [(502,449) 243x4]
Invaliding region: Rect count: 2
  [(871,-1) 376x16]
  [(502,363) 243x4]
Resulting valid region: Rect count: 6
  [(0,0) 871x15]
  [(0,15) 502x409]
  [(502,15) 745x348]
  [(745,363) 502x4]
  [(502,367) 745x57]
  [(0,424) 1247x420]
Painted item count: 45293

The logging for regions logs the rect count and then logs [(x,y) width x height] for each rect.

Those first two regions above are before and after the ToOutsidePixels call in FrameLayerBuilder::BuildContainerLayerFor.  The next two are from ThebesLayerOGL::InvalidateRegion.  The "painted item count" is the number of times in FrameLayerBuilder::DrawThebesLayer that we get past the cdi->mItem->GetVisibleRect().IsEmpty() check...

So that looks like the valid region we end up with in ThebesLayerOGL::InvalidateRegion is correct (I drew those 6 rects out, and it's the whole viewport except a 376px wide by 16px tall strip at top right and a 243px wide by 4px tall strip in the middle).  But we're still painting too much in DrawThebesLayer for some reason.
OK.  Looks like the aRegionToDraw passed to FrameLayerBuilder::DrawThebesLayer is too big.  Typical values:

region to draw: Rect count: 1
  [(502,0) 745x199]
region to draw: Rect count: 1
  [(502,0) 745x202]
region to draw: Rect count: 1
  [(502,0) 745x205]
etc
Ah, I found it!  BasicTextureImage::BeginUpdate in GLContext.cpp always fluffs out aRegion to its bounding rect, with this comment:

    // the basic impl can only upload updates to rectangles

So in BasicBufferOGL::BeginPaint we call BeginUpdate, and that sets result.mRegionToDraw to the bounding rect of the two-rect region we started with.  With GL disabled, I see about 38 seconds for this testcase over here after applying the patches in this bug; with GL enabled it's closer to 60 seconds.  Do we need a separate bug on this?

If I disable opengl layers, then the number of items I'm seeing painted per call to DrawThebesLayer stops growing and stabilizes at 1137 (no idea why; since we're invalidating 4px strips for some reason, I would have expected 4 * 243 = 972 plus maybe a few for the text, so 1137 is still a bit high, but it's less than 5*243...).
And with gl disabled my "paint-related" time is about 75% instead of 87%; most of this is display list management and such; only 4% is actual painting.  There seem to be no obvious hotspots in the sense that all the display list stuff is just linear in the number of all things on the page, not just the things in the dirty area.
(In reply to comment #45)
> So in BasicBufferOGL::BeginPaint we call BeginUpdate, and that sets
> result.mRegionToDraw to the bounding rect of the two-rect region we started
> with.  With GL disabled, I see about 38 seconds for this testcase over here
> after applying the patches in this bug; with GL enabled it's closer to 60
> seconds.  Do we need a separate bug on this?

Yes, thanks. Fixing that would bring Mac in line with Windows here, and after these patches are applied, the problem boils down to "processing display items for all 59,000 visible rectangles in the window is slow".
File bug 621778.
This compiles, which is nice, but sensible people like a little more than "philor blindly unrotted it with absolutely no understanding of the patch or the change that bitrotted it" for things that are going to land under their name.
Attachment #499230 - Attachment is obsolete: true
Attachment #500629 - Flags: feedback?(roc)
Keywords: checkin-needed
Whiteboard: [needsd landing] → [needs landing]
The reason the original didn't compile is not because it is bitrotted, but because it depends on other changes in roc's queue.
(specifically bug 602757)
Attachment #500629 - Attachment is obsolete: true
Attachment #500629 - Flags: feedback?(roc)
Attachment #499230 - Attachment is obsolete: false
Depends on: 602757
Whiteboard: [needs landing] → [needs landing][part 1 applies on top of attachment 491151 from bug 602757]
Yes, sorry, I added checkin-need prematurely. Very sorry!
http://hg.mozilla.org/mozilla-central/rev/35013af94ec7
http://hg.mozilla.org/mozilla-central/rev/c3825c079c00
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [needs landing][part 1 applies on top of attachment 491151 from bug 602757]
Does this resolve the actual reported problem or just get it back to Firefox 3.6 levels? Which bug 424715 orriginal solved by making Firefox about 20x faster at this testcase than Firefox 3.6 was. Just need to know if I need to open another bug on the issue.
Gets us back to Firefox 3.6 levels I guess.

We're building a complete display list for the entire window on every paint to simplify the construction and update of our layer tree. That works pretty well when the window has hundreds of visible elements, not well when the window has 60,000 visible elements like here. I don't think we should be designing around this rare case.
I concur with the reasoning and only hopped the absurdity of the testcase would help find underlying problems when firefox has to deal with a large number of visible elements. And as other browsers still do thisin excess 10x faster I think it is still valid in that sense, will make another low priority bug for this.
When I was testing this, I found out that on my machine, each next test takes longer (first 3.5 seconds, second about 6 seconds, third 9 seconds, etc). Anyone can reproduce this? Or is this already a known bug?
All browsers will do this as it is related to the way the code is written.
Depends on: dlbi
Depends on: 705561
Component: Layout: View Rendering → Layout: Web Painting
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: