Skip to content

Fix t5516 flakiness in Visual Studio builds #428

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

dscho
Copy link
Member

@dscho dscho commented Oct 25, 2019

Among the flaky tests, it seems that the Azure Pipeline suffers relatively frequently from t5516 failing with the Visual Studio builds. Essentially, we grep for an error message, but that error message is produced twice, once by a fetch and once by the upload-pack spawned from it, and those error messages are usually interleaved because of MSVC runtime fprintf() idiosyncracies.

As a consequence, I have to re-run about half a dozen failed builds a day, which I would like to avoid. My plan is therefore to merge this patch into Git for Windows v2.24.0-rc2.

The commit message of this patch is based, in part, on #407.

This fixes #240.

Changes since v3:

  • Reworded the part of the commit message that talks about the need for fflush(stderr); for accuracy, as suggested by Junio.
  • Simplified the flow of the prefix handling as well as providing a proper BUG message and abort()ing when the prefix is too long. Thanks Junio!

Changes since v2:

  • Using write_in_full() instead of xwrite() again (to make sure that the entire message is printed).
  • When vsnprintf() fails, now we at least print the prefix.
  • The code to check whether prefix was too long no longer tests an inequality between pointers, but between sizes.

Changes since v1:

  • Changed the oneline to be more accurate (thanks Junio).
  • Improved the commit message (e.g. talking about the xwrite() function this patch uses, rather than the write_in_full() function used by an earlier iteration, thanks Gábor).
  • Revamped the actual code to account for insanely long prefixes (thanks for the advice, Junio).

Cc: Jeff King [email protected], SZEDER Gábor [email protected], Alexandr Miloslavskiy [email protected]

@dscho
Copy link
Member Author

dscho commented Oct 25, 2019

FYI @SyntevoAlex

@dscho
Copy link
Member Author

dscho commented Oct 25, 2019

So apparently it is not only t0001.21 init allows insanely long --template, but also t3510.1[23] that fail due to this change (in the latter case, it would appear that a multi-line message is broken apart, something I thought not possible with write()).

Worse, t5516.81 now possibly fails with a "Broken pipe" on macOS.

Will investigate more on Monday, unless you beat me to it, @SyntevoAlex.

@SyntevoAlex
Copy link

SyntevoAlex commented Oct 26, 2019

Please let me finish my patches instead of jumping in and reauthoring them.

@dscho
Copy link
Member Author

dscho commented Oct 26, 2019

Please let me finish my patches instead of jumping in and reauthoring them.

Sure.

FWIW this here PR is mostly an experiment to see how quickly I could get a fix or t5516 without the additional complexity of allowing unlimited message lengths, in the hope to get this in before v2.24.0 (which is now unlikely).

It just looked to me as if you were ready to wait for others to confirm my suggestion that the unlimited message length is irrelevant, or at least substantially less relevant.

What I would have loved for you to do was to disentangle the concerns that your single patch tangled up, and to do a much better job at separating what information to put into code comments and what information to put into the commit message.

But as I said, it did not appear to me as if you were eager to do these things, but rather wanted to wait, and I was not ready to join you in the waiting room.

@SyntevoAlex
Copy link

SyntevoAlex commented Oct 26, 2019

No, I'm actually aimed at getting things done. Just wanted to make sure before discarding a few days worth of work and going into another effort. Trimming is harder then it seems. You've already run into segfault, because vsnprintf() return value is desired buffer size and not what it has written. Ok, that's easy to fix, but appending '\n' is harder: need to avoid issues with partial multibyte characters.

@dscho
Copy link
Member Author

dscho commented Oct 26, 2019

You've already run into segfault, because vsnprintf() return value is desired buffer size and bot what it has written.

I've seen a broken pipe and an unexpected re-ordering of output lines, where did you see the segmentation fault?

This raises more interesting questions like avoiding issues with partial multibyte characters.

Valid point.

@SyntevoAlex
Copy link

As for commit message, I actually think that having info about t5516 in code comment is better. This is what people are going to read, and commit message is often forgotten for archeologists.

@SyntevoAlex
Copy link

I've seen a broken pipe and an unexpected re-ordering of output lines, where did you see the segmentation fault?

If vsnprintf returns desired buffer size, where do you append '\n', exactly?

@SyntevoAlex
Copy link

SyntevoAlex commented Oct 26, 2019

On Monday, I will rework my patch. Please let me submit it in my name.

If you figure what's wrong with broken pipe and/or reordering before that, I would be happy to co-author you. Otherwise, I will deal with all issues on Monday.

FYI, https://public-inbox.org/git/[email protected]/

could reproduce the failure both with and without the patch before it
timeouted.

@SyntevoAlex
Copy link

I suspect that broken pipe arises from parent process exiting before child process. If so, that is a separate issue and will not block my patch.

@dscho
Copy link
Member Author

dscho commented Oct 26, 2019

I suspect that broken pipe arises from parent process exiting before child process.

Well, that's almost certainly true.

But that also puts a dent into the hope of fixing the flakiness of t5516 any time soon.

@dscho dscho force-pushed the fix-t5516-flakiness branch from a6bb6d1 to 455026c Compare October 26, 2019 21:09
@dscho
Copy link
Member Author

dscho commented Oct 26, 2019

As for commit message, I actually think that having info about t5516 in code comment is better.

I already pointed out that a code comment this far removed from the actual problem is prone to become stale. Whereas the commit message is the exact spot to describe a breakage that is being fixed.

Besides, it is poor practice to talk about an already-fixed problem in a code comment.

This is what people are going to read, and commit message is often forgotten for archeologists.

Not in the Git project: there, contributors (such as yourself, as witnessed in your very own commit message) do investigate the commit history of the code in question.

On Monday, I will rework my patch. Please let me submit it in my name.

Sure.

If you figure what's wrong with broken pipe before that, I would be happy to co-author you.

If you take (most of) my alternative patch, I would like to be mentioned as co-author, too.

FWIW I am not sure that it is worth doing this fix if it does not address the broken pipe issue on macOS: the test would be just as flaky, just in a different job of the Pipeline. That does not buy me much.

And my patch still results in two failures in t3510, although I think I figured it out (I force-pushed, and the Azure Pipeline will tell me whether I managed to fix things).

https://public-inbox.org/git/[email protected]/

That is a very illuminating thread.

@SyntevoAlex
Copy link

I already pointed out that a code comment this far removed from the actual problem is prone to become stale. Whereas the commit message is the exact spot to describe a breakage that is being fixed.

The comment, and whole patch, isn't about fixing t5516. It's about fixing a real problem where user will observe interleaved outputs. Code comment isn't talking about fixing t5516, it's talking about various problems that could arise in fprintf(), and explains the need to preformat into memory buffer. This is what makes me believe that it's well placed.

Not in the Git project: there, contributors (such as yourself, as witnessed in your very own commit message) do investigate the commit history of the code in question.

That's good to hear.

FWIW I am not sure that it is worth doing this fix if it does not address the broken pipe issue on macOS: the test would be just as flaky, just in a different job of the Pipeline. That does not buy me much.

Let's first see what investigation shows.

And my patch still results in two failures in t3510, although I think I figured it out (I force-pushed, and the Azure Pipeline will tell me whether I managed to fix things).

You've ran into partial multibyte character issue that I talked about before.

@dscho
Copy link
Member Author

dscho commented Oct 26, 2019

And my patch still results in two failures in t3510, although I think I figured it out (I force-pushed, and the Azure Pipeline will tell me whether I managed to fix things).

You've ran into partial multibyte character issue that I talked about before.

Since the tests pass here, locally, without any multi-byte character stunt, I believe that such a stunt is not necessary. AFAICT most consoles (probably even all of them) have code in place to handle invalid/incomplete multi-byte characters gracefully.

@dscho
Copy link
Member Author

dscho commented Oct 26, 2019

Seems that the tests now all pass with my patch.

@SyntevoAlex
Copy link

fflush I can understand, because you used xwrite which I guess is lower level and skips any FILE* caching.

But how to explain broken pipe?

@dscho
Copy link
Member Author

dscho commented Oct 26, 2019

But how to explain broken pipe?

It is a flaky test, that's why it only breaks the build sometimes.

@SyntevoAlex
Copy link

Yes, that's what I meant, you expressed concern about broken pipe, that it's an obstacle in getting this patch merged, and it seems to still be there.

@SyntevoAlex
Copy link

Any reason why xwrite could be better then fwrite ?

@dscho
Copy link
Member Author

dscho commented Oct 26, 2019

you expressed concern about broken pipe,

Yes.

that it's an obstacle in getting this patch merged,

No.

It is not an obstacle to getting it merged, but it is an obstacle to help with the flakiness of t5516 in general: that broken pipe problem is still there.

Any reason why xwrite could be better then fwrite ?

Since stderr is supposed to be unbuffered, what would be the use in calling fwrite() (which is supposed to buffer)?

@SyntevoAlex
Copy link

that it's an obstacle in getting this patch merged,

No.

I see, I misunderstood it.

Any reason why xwrite could be better then fwrite ?

Since stderr is supposed to be unbuffered, what would be the use in calling fwrite() (which is supposed to buffer)?

Didn't want to mix up fd and FILE* writing. Apparently, it actually causes issues, even though I didn't expect anything specific. However, in hindsight, you're probably right and xwrite is better, because it involves less CRT magic.

@dscho
Copy link
Member Author

dscho commented Oct 28, 2019

@SyntevoAlex since this PR essentially does what I want it to do, I plan on submitting it in about 2h. Did you still plan on working on #407?

@SyntevoAlex
Copy link

Yes, I am working on #407 and I kindly ask you to let me do that in peace and without rush.

@dscho
Copy link
Member Author

dscho commented Oct 28, 2019

Yes, I am working on #407 and I kindly ask you to let me do that in peace and without rush.

Well, I kind of have something here that works, you know? And that is clean enough to get maybe even code-reviewed before v2.24.0 is due.

I have no idea what you want to do on top of what I already did, so I am a bit at a loss why I should be waiting, i.e. what I should be waiting for.

@SyntevoAlex
Copy link

First, it's pretty impolite to ignore issue for months and then mess with someone else's work when he volunteers.
Second, you're rushing it and your code has at least two bugs, after I pointed to third bug and you fixed it. Rushing is never good.

@dscho
Copy link
Member Author

dscho commented Oct 28, 2019

First, it's pretty impolite to ignore issue for months

I am sorry that it took so long for this patch to get reviewed. We're all busy.

and then mess with someone else's work when he volunteers.

I apologize if you have the impression that I mess with your work.

From my point of view, I came up with a much smaller, much more elegant solution, and since you had already provided a good description of the "prior art" in your commit message, I copy-edited that, and made you a co-author.

Further, I only started to work for real on this when I sensed reluctance on your part to accept that my suggestion had merit. You asked for a second opinion about the truncation, after all, when it was pretty obvious from the commit history you yourself found that there is a strong preference for truncating the error message rather than risking OOMs, and you never addressed my concerns about the unnecessary complexity of your code and about the inappropriate location of the comments about t5516, instead trying to double-down on commenting on a particular test case far away from that test case, in the code that implements a generic error reporting mechanism.

If I had had the impression that we could reach consensus where I was purely reviewing your code and providing suggestions how to improve it, I would have gladly refrained from implementing a fix myself.

It does not appear to me that I was wrong in my assessment.

Second, you're rushing it and your code has at least two bugs,

It still has two bugs? Please point them out to me.

after I pointed to third bug and you fixed it.

I pushed a tentative version and let the PR build figure out what still needs to be dealt with, which is how I work. I am sorry if you don't like that style.

Rushing is never good.

On my side, I would place a higher priority in reducing complexity. Complexity is never good, and I see way too much complexity for no benefit in #407.

As to "rushing"... Sending a patch to the mailing list for discussion is not at all rushing anything. It is opening a discussion among interested and highly competent Git developers, and I fully intend on benefittin from that expertise to the fullest.

Finally, it is not exactly "rushing" if I hold back a perfectly fine patch for over a day just because you said you wanted to work on your version (without any indication what you find wrong in my version, which I tolerated, but find unsatisfying, to say the least).

I am beginning to suspect that the time I waited was not well spent.

@SyntevoAlex
Copy link

SyntevoAlex commented Oct 28, 2019

From my point of view, I came up with a much smaller, much more elegant solution, and since you had already provided a good description of the "prior art" in your commit message, I copy-edited that, and made you a co-author.

when it was pretty obvious from the commit history you yourself found that there is a strong preference for truncating the error message rather than risking OOMs

I actually saw that patch was implemented with a fixed buffer as most simple and then it repeatedly failed to work well, until there was an attempt to remove the limit.

and you never addressed my concerns about the unnecessary complexity of your code

On my side, I would place a higher priority in reducing complexity. Complexity is never good, and I see way too much complexity for no benefit in #407.

This is dependent on decision about limit, which wasn't clear to me.

If I had had the impression that we could reach consensus where I was purely reviewing your code and providing suggestions how to improve it, I would have gladly refrained from implementing a fix myself.

I am beginning to suspect that the time I waited was not well spent.

You jumped to conclusions without giving me a single working day to react.

It still has two bugs? Please point them out to me.

Finally, it is not exactly "rushing" if I hold back a perfectly fine patch for over a day just because you said you wanted to work on your version (without any indication what you find wrong in my version, which I tolerated, but find unsatisfying, to say the least).

This doesn't feel right to me when I'm trying to finish my work and you jump in and offer your patch, also not perfect, and I'm expected to switch to polishing your patch instead of polishing mine.

As to "rushing"... Sending a patch to the mailing list for discussion is not at all rushing anything. It is opening a discussion among interested and highly competent Git developers, and I fully intend on benefittin from that expertise to the fullest.

It's best to submit perfect patches firsthand.

@SyntevoAlex
Copy link

Can we please reset the discussion? Let's get it back into a relaxed flow. I'm working on my contribution to ensure perfect quality, give me some time, and then you can see it.

The MSVC runtime behavior differs from glibc's with respect to
`fprintf(stderr, ...)` in that the former writes out the message
character by character.

In t5516, this leads to a funny problem where a `git fetch` process as
well as the `git upload-pack` process spawned by it _both_ call `die()`
at the same time. The output can look like this:

	fatal: git uploadfata-lp: raemcokte :error:  upload-pnot our arcef k6: n4ot our ea4cr1e3f 36d45ea94fca1398e86a771eda009872d63adb28598f6a9
	8e86a771eda009872d6ab2886

Let's avoid this predicament altogether by rendering the entire message,
including the prefix and the trailing newline, into the buffer we
already have (and which is still fixed size) and then write it out via
`write_in_full()`.

We still clip the message to at most 4095 characters.

The history of `vreportf()` with regard to this issue includes the
following commits:

d048a96 (2007-11-09) - 'char msg[256]' is introduced to avoid interleaving
389d176 (2009-03-25) - Buffer size increased to 1024 to avoid truncation
625a860 (2009-11-22) - Buffer size increased to 4096 to avoid truncation
f4c3edc (2015-08-11) - Buffer removed to avoid truncation
b5a9e43 (2017-01-11) - Reverts f4c3edc to be able to replace control
                        chars before sending to stderr
9ac13ec (2006-10-11) - Another attempt to solve interleaving.
                        This is seemingly related to d048a96.
137a0d0 (2007-11-19) - Addresses out-of-order for display()
34df8ab (2009-03-10) - Switches xwrite() to fprintf() in recv_sideband()
                        to support UTF-8 emulation
eac14f8 (2012-01-14) - Removes the need for fprintf() for UTF-8 emulation,
                        so it's safe to use xwrite() again
5e5be9e (2016-06-28) - recv_sideband() uses xwrite() again

Note that we print nothing if the `vsnprintf()` call failed to render
the error message; There is little we can do in that case, and it should
not happen anyway.

The process may have written to `stderr` and there may be something left
in the buffer kept in the stdio layer. Call `fflush(stderr)` before
writing the message we prepare in this function.

Helped-by: Jeff King <[email protected]>
Helped-by: Alexandr Miloslavskiy <[email protected]>
Helped-by: SZEDER Gábor <[email protected]>
Helped-by: Junio C Hamano <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
@dscho dscho force-pushed the fix-t5516-flakiness branch from fce0894 to f6d6c81 Compare October 30, 2019 08:43
@dscho
Copy link
Member Author

dscho commented Oct 30, 2019

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 30, 2019

Submitted as [email protected]

@dscho dscho mentioned this pull request Oct 30, 2019
@gitgitgadget
Copy link

gitgitgadget bot commented Nov 2, 2019

This patch series was integrated into pu via git@80f5cd0.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 4, 2019

This patch series was integrated into pu via git@41faec3.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 6, 2019

This patch series was integrated into pu via git@69eb89d.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 7, 2019

This patch series was integrated into pu via git@33193cc.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 8, 2019

This patch series was integrated into pu via git@4c8b5bf.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 10, 2019

This patch series was integrated into pu via git@a11a2c9.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 11, 2019

This patch series was integrated into pu via git@7090f4e.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 12, 2019

This patch series was integrated into pu via git@6ae1307.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 19, 2019

This patch series was integrated into pu via git@00c0217.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 19, 2019

This patch series was integrated into next via git@cae61b9.

@gitgitgadget gitgitgadget bot added the next label Nov 19, 2019
@gitgitgadget
Copy link

gitgitgadget bot commented Nov 20, 2019

This patch series was integrated into pu via git@137e3f0.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 21, 2019

This patch series was integrated into pu via git@f3df0fc.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 22, 2019

This patch series was integrated into pu via git@ee27365.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 25, 2019

This patch series was integrated into pu via git@c0c6e16.

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 27, 2019

This patch series was integrated into pu via git@9139157.

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 2, 2019

This patch series was integrated into pu via git@5444d52.

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 2, 2019

This patch series was integrated into next via git@5444d52.

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 2, 2019

This patch series was integrated into master via git@5444d52.

@gitgitgadget gitgitgadget bot added the master label Dec 2, 2019
@gitgitgadget gitgitgadget bot closed this Dec 2, 2019
@gitgitgadget
Copy link

gitgitgadget bot commented Dec 2, 2019

Closed via 5444d52.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix flaky "t5516.81 deny fetch unreachable SHA1, allowtipsha1inwant=false"
2 participants