@tjcouch-sil tjcouch-sil commented Oct 30, 2025

Normalize USFM when it comes out of ParatextProjectDataProvider and when it is saved to file. This is a significant product-level change that has been approved by Todd, Glenn, and Mike Lothers. I will discuss it with Karina once this is merged so she can document it in the list of significant things to note about using Paratext 10 Studio.

Benefits:

  • Files with non-normalized USFM will not be changed unless edited in Platform.Bible
  • All the code that relies on USFM being normalized according to Paratext's rules (including UsjReaderWriter USFM location <-> USJ location transformation) will still work on all USFM served from ParatextProjectDataProvider
  • Any project that uses USFM 2.0 will be transformed to USFM 3.0 (the transformation happens automatically during normalization, and the USFM is transformed back to USFM 2.0 when saving to file), so no one has to deal with USFM 2.0
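
To illustrate the first two benefits, here is a minimal sketch of normalizing at the read and write boundaries. Everything in it is hypothetical: `NormalizingBookStore`, its methods, and the placeholder `Normalize` rule (collapse whitespace runs) are stand-ins for illustration, not the real ParatextProjectDataProvider or Paratext's actual normalization rules.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for a project data provider; not the real ParatextProjectDataProvider.
public sealed class NormalizingBookStore
{
    // Simulates the USFM files on disk, keyed by book ID.
    private readonly Dictionary<string, string> _files = new Dictionary<string, string>();

    // Placeholder normalizer (an assumption): collapses each run of whitespace into a
    // single space and trims the ends. Paratext's real rules are more involved.
    public static string Normalize(string usfm)
    {
        return Regex.Replace(usfm, @"\s+", " ").Trim();
    }

    // Seeds the store with raw content, as if another tool had written the file.
    public void LoadRaw(string bookId, string usfm)
    {
        _files[bookId] = usfm;
    }

    // Returns the exact stored content, for comparison.
    public string ReadRaw(string bookId)
    {
        return _files[bookId];
    }

    // Reading normalizes in memory only; the stored file is left untouched.
    public string GetBookUsfm(string bookId)
    {
        return Normalize(_files[bookId]);
    }

    // Writing normalizes before the content reaches the file.
    public void SetBookUsfm(string bookId, string usfm)
    {
        _files[bookId] = Normalize(usfm);
    }
}
```

In this sketch, calling `GetBookUsfm` on a book that was loaded with irregular whitespace returns normalized text while `ReadRaw` still returns the original bytes; the stored content only changes once `SetBookUsfm` is called, which mirrors "will not be changed unless edited".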

Drawbacks:

  • If a team edits their project with both Platform.Bible and another tool that does not normalize whitespace the way Paratext 9 does (including Paratext 9 itself in some cases, as mentioned below), the USFM they edit in Platform.Bible will be normalized, meaning non-normalized whitespace will be removed.
    • John Wickberg knows of at least one team who uses the Paratext 9 unformatted view, which preserves non-normalized whitespace. However, he mentioned it is worth considering removing this ability (meaning normalizing the USFM even in the unformatted view), since it is not officially supported and does not always work properly. See this discussion in Discord for more information.
    • This should have almost no effect on teams who use the Paratext 9 standard view. The Paratext 9 standard view already saves USFM to file almost exactly normalized, so their files have essentially no risk of changing (and no risk of meaningful changes).
      • TJ knows of one case where the Paratext 9 standard view saves to file differently from how Paratext 9 normalizes, so this situation would cause churn: the Paratext 9 standard view saves the ca marker with an extra newline before the space before the marker. This means that sending/receiving and editing a project with ca markers will cause this extra newline to pop in and out every time the team edits the same chapter in the two different tools. The following image gives an example of what this would look like (Paratext 9 normalized USFM and proposed Paratext 10 Studio USFM on the left; Paratext 9 standard view USFM on the right):
[Image: side-by-side USFM comparison of the ca marker whitespace]


@tjcouch-sil tjcouch-sil force-pushed the 2358-translate-offsets branch 2 times, most recently from 438b2ec to 769d753 Compare November 5, 2025 18:57

@tjcouch-sil tjcouch-sil left a comment


Reviewable status: 0 of 46 files reviewed, 1 unresolved discussion


c-sharp/Projects/ParatextProjectDataProvider.cs line 541 at r2 (raw file):

        // system, and we want to match Paratext 9.4's whitespace.
        // ScrText.PutText runs other private methods to standardize the text before saving to file
        // as well. Maybe sometime we should see if we can get ScrText.PutBook created or something

As I was investigating how USX books are imported (UsxImporter.cs), I discovered that ImportSfmText.WriteChaptersToBook has a way to import whole books using ScrText.PutText. It would probably be wise for us to consider using ImportSfmText.ImportBooks instead of File.WriteAllBytes.

We will probably still need to standardize CRLFs and normalize out here, though, because CRLFs affect normalization.
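
As a rough illustration of what standardizing line endings to CRLF can look like, here is a minimal sketch. The `LineEndings` class and `ToCrLf` method are hypothetical names, and this is not the copied `ScrText.StandardizeCrLfsIfNecessary` logic; in particular, treating a lone CR as a line ending is an assumption of this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of ParatextData or Platform.Bible.
public static class LineEndings
{
    // Matches a CR that is not followed by LF, or an LF that is not preceded by CR.
    private static readonly Regex LoneNewline = new Regex(@"\r(?!\n)|(?<!\r)\n");

    // Converts every lone CR or lone LF to CRLF and leaves existing CRLF pairs alone.
    public static string ToCrLf(string text)
    {
        return LoneNewline.Replace(text, "\r\n");
    }
}
```

Running the conversion before normalizing means the normalizer always sees one line-ending convention, whichever operating system produced the text.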

@tjcouch-sil tjcouch-sil force-pushed the 2358-translate-offsets branch from 8e6d6f3 to 282ab3a Compare November 6, 2025 17:07
@tjcouch-sil tjcouch-sil force-pushed the 2358-translate-offsets branch from bbac23b to d0d2a36 Compare November 18, 2025 18:27
Base automatically changed from 2358-translate-offsets to main November 19, 2025 18:08

@tjcouch-sil tjcouch-sil left a comment


@tjcouch-sil reviewed 39 of 46 files at r3.
Reviewable status: 0 of 46 files reviewed, all discussions resolved


c-sharp/Projects/ParatextProjectDataProvider.cs line 541 at r2 (raw file):

Previously, tjcouch-sil (TJ Couch) wrote…

As I was investigating how USX books are imported (UsxImporter.cs), I discovered that ImportSfmText.WriteChaptersToBook has a way to import whole books using ScrText.PutText. It would probably be wise for us to consider using ImportSfmText.ImportBooks instead of File.WriteAllBytes.

We will probably still need to standardize CRLFs and normalize out here, though, because CRLFs affect normalization

I discovered that using scrText.PutText directly is likely more suitable for this situation after all. It seems UsxImporter and ImportSfmText do a lot of work to read in multiple books and set them, whereas we already know which book we are setting. It is much simpler this way.

@tjcouch-sil tjcouch-sil marked this pull request as ready for review December 16, 2025 23:52
irahopkinson previously approved these changes Dec 17, 2025

@irahopkinson irahopkinson left a comment


@irahopkinson reviewed 46 files and all commit messages, and made 3 comments.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @tjcouch-sil).


c-sharp/Projects/ParatextProjectDataProvider.cs line 1055 at r4 (raw file):

    /// <summary>
    /// Copied from `ScrText.StandardizeCrLfsIfNecessary`. We need to do this when setting USFM
    /// because we need to normalize USFM with CrLfs before we run `ScrText.PutText`, and we expect

Nit: there are 3 different casings for CR in this doc comment. Perhaps standardize to uppercase?

Code quote:

CrLfs

c-sharp/Projects/ParatextProjectDataProvider.cs line 1058 at r4 (raw file):

    /// CR not to be present on the USFM received for setting.
    ///
    /// Some programs (include cc which is used for mapin/mapout) strip out cr's.

BTW what is cc?

Code quote:

cc

c-sharp/Projects/ParatextProjectDataProvider.cs line 1088 at r4 (raw file):

        var scrText = LocalParatextProjects.GetParatextProject(ProjectDetails.Metadata.Id);

        // Make newlines have CRLF because Paratext 9.4 always does this regardless of operating

BTW I'm guessing there is a reason this doesn't cause issues for PT9 on Linux and Mac (other than that PT9 doesn't run on Linux and Mac)?


@tjcouch-sil tjcouch-sil left a comment


@tjcouch-sil made 3 comments.
Reviewable status: 45 of 46 files reviewed, 3 unresolved discussions (waiting on @irahopkinson).


c-sharp/Projects/ParatextProjectDataProvider.cs line 1055 at r4 (raw file):

Previously, irahopkinson (Ira Hopkinson) wrote…

Nit: there are 3 different casings for CR in this doc comment. Perhaps standardize to uppercase?

Yeah, it's a mess. The first paragraph is mine, and the second paragraph is copied directly from ScrText. But I did something that hopefully is a reasonable enough thing to do to make it more consistent...?


c-sharp/Projects/ParatextProjectDataProvider.cs line 1058 at r4 (raw file):

Previously, irahopkinson (Ira Hopkinson) wrote…

BTW what is cc?

Honestly, no idea. I couldn't figure it out in the short time I spent investigating, either 😂 I suppose the P9 people know ;)


c-sharp/Projects/ParatextProjectDataProvider.cs line 1088 at r4 (raw file):

Previously, irahopkinson (Ira Hopkinson) wrote…

BTW I'm guessing there is a reason this doesn't cause issues for PT9 on Linux and Mac (other than that PT9 doesn't run on Linux and Mac)?

My theory is that they may simply have wanted the exact same thing on all OSes so the code can assume CRLF is there throughout, but I don't know. It may also have to do with Mercurial: maybe it doesn't (or didn't) have a good way to handle the difference, or maybe they thought users were too likely to copy files between their systems without going through Mercurial and wanted the files to be the same.
