-
Notifications
You must be signed in to change notification settings - Fork 9
PT-2358: Normalize USFM coming from file and going to file #1896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
438b2ec to
769d753
Compare
tjcouch-sil
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 46 files reviewed, 1 unresolved discussion
c-sharp/Projects/ParatextProjectDataProvider.cs line 541 at r2 (raw file):
// system, and we want to match Paratext 9.4's whitespace. // ScrText.PutText runs other private methods to standardize the text before saving to file // as well. Maybe sometime we should see if we can get ScrText.PutBook created or something
As I was investigating how USX books are imported (UsxImporter.cs), I discovered that ImportSfmText.WriteChaptersToBook has a way to import whole books using ScrText.PutText. It would probably be wise for us to consider using ImportSfmText.ImportBooks instead of File.WriteAllBytes.
We will probably still need to standardize CRLFs and normalize out here, though, because CRLFs affect normalization
8e6d6f3 to
282ab3a
Compare
bbac23b to
d0d2a36
Compare
b97d65a to
fd4fbe9
Compare
fd4fbe9 to
8e1826d
Compare
tjcouch-sil
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tjcouch-sil reviewed 39 of 46 files at r3.
Reviewable status: 0 of 46 files reviewed, all discussions resolved
c-sharp/Projects/ParatextProjectDataProvider.cs line 541 at r2 (raw file):
Previously, tjcouch-sil (TJ Couch) wrote…
As I was investigating how USX books are imported (
UsxImporter.cs), I discovered thatImportSfmText.WriteChaptersToBookhas a way to import whole books usingScrText.PutText. It would probably be wise for us to consider usingImportSfmText.ImportBooksinstead ofFile.WriteAllBytes.We will probably still need to standardize CRLFs and normalize out here, though, because CRLFs affect normalization
Discovered that directly using scrText.PutText is likely more suitable for this situation after all. Seems UsxImporter and ImportSfmText does a lot of work to read multiple books in and set them, whereas we know which book we are setting. Much simpler this way.
irahopkinson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@irahopkinson reviewed 46 files and all commit messages, and made 3 comments.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @tjcouch-sil).
c-sharp/Projects/ParatextProjectDataProvider.cs line 1055 at r4 (raw file):
/// <summary> /// Copied from `ScrText.StandardizeCrLfsIfNecessary`. We need to do this when setting USFM /// because we need to normalize USFM with CrLfs before we run `ScrText.PutText`, and we expect
Nit there are 3 different casings for CR in this doc comment. Perhaps standardize to uppercase?
Code quote:
CrLfsc-sharp/Projects/ParatextProjectDataProvider.cs line 1058 at r4 (raw file):
/// CR not to be present on the USFM received for setting. /// /// Some programs (include cc which is used for mapin/mapout) strip out cr's.
BTW what is cc?
Code quote:
ccc-sharp/Projects/ParatextProjectDataProvider.cs line 1088 at r4 (raw file):
var scrText = LocalParatextProjects.GetParatextProject(ProjectDetails.Metadata.Id); // Make newlines have CRLF because Paratext 9.4 always does this regardless of operating
BTW I'm guessing there is a reason this doesn't cause issues for PT9 on Linux and mac (other than PT9 doesn't run on Linux and mac)?
tjcouch-sil
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tjcouch-sil made 3 comments.
Reviewable status: 45 of 46 files reviewed, 3 unresolved discussions (waiting on @irahopkinson).
c-sharp/Projects/ParatextProjectDataProvider.cs line 1055 at r4 (raw file):
Previously, irahopkinson (Ira Hopkinson) wrote…
Nit there are 3 different casings for CR in this doc comment. Perhaps standardize to uppercase?
Yeah, it's a mess. The first paragraph is mine, and the second paragraph is copied directly from ScrText. But I did something that hopefully is a reasonable enough thing to do to make it more consistent...?
c-sharp/Projects/ParatextProjectDataProvider.cs line 1058 at r4 (raw file):
Previously, irahopkinson (Ira Hopkinson) wrote…
BTW what is
cc?
Honestly no idea. I couldn't figure it out in the small time I spent investigating, either 😂 I suppose P9 people know ;)
c-sharp/Projects/ParatextProjectDataProvider.cs line 1088 at r4 (raw file):
Previously, irahopkinson (Ira Hopkinson) wrote…
BTW I'm guessing there is a reason this doesn't cause issues for PT9 on Linux and mac (other than PT9 doesn't run on Linux and mac)?
I theorize maybe they just didn't care and wanted to have the exact same thing on all OSes so they can assume it will be there throughout the code, but I don't know. Maybe it has to do with Mercurial (maybe it doesn't or didn't have a good way to handle the difference? Or maybe they thought it was too likely users would copy files between their systems without going through Mercurial and wanted them to be the same?)
Normalize when coming out from
ParatextProjectDataProviderand when saving to file. This is a significant product-level change that has been approved by Todd, Glenn, and Mike Lothers. I will discuss with Karina once this is merged so she can document this change in the list of significant things to note about using Paratext 10 Studio.Benefits:
UsjReaderWriterUSFM location <-> USJ location transformation) will still work on all USFM served fromParatextProjectDataProviderDrawbacks:
camarker with an extra newline before the space before the marker. This means S/Ring and editing a project withcamarkers will cause this extra newline to pop in and out every time the team edits the same chapter in the two different tools. The following image gives an example of what this would look like (Paratext 9 normalized USFM and proposed Paratext 10 Studio USFM on the left; Paratext 9 Standard View USFM on the right):This change is