-
Notifications
You must be signed in to change notification settings - Fork 237
Long-string parsing for CR
LF
vs LF
eol sequences differ
#1302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
It may be that neither of the patches work. Here's a case I think may not be handled: (def long-string-2
``
first line
second line
``)
(pp long-string-2) Expected:
Actual:
|
Updated diffs below. These seem to work better. Top version: diff --git a/src/core/parse.c b/src/core/parse.c index 40ccfbf2..66046e3f 100644 --- a/src/core/parse.c +++ b/src/core/parse.c @@ -368,7 +368,8 @@ static int stringend(JanetParser *p, JanetParseState *state) { int reindent = 1; while (reindent && (r < end)) { if (*r++ == '\n') { - for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++) { + for (int32_t j = 0; (r < end) && (*r != '\n') && + (*r != '\r') && (j < indent_col); j++, r++) { if (*r != ' ') { reindent = 0; break; @@ -376,6 +377,12 @@ static int stringend(JanetParser *p, JanetParseState *state) { } } } + /* Adjust things a bit if there is a leading \r\n eol sequence. */ + if (buflen > 1 && (bufstart[0] == '\r') && + (bufstart[1] == '\n')) { + buflen--; + bufstart++; + } /* Now reindent if able to, otherwise just drop leading newline. */ if (!reindent) { if (buflen > 0 && bufstart[0] == '\n') { @@ -393,16 +400,20 @@ static int stringend(JanetParser *p, JanetParseState *state) { } else { *w++ = *r++; } - for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++); + for (int32_t j = 0; (r < end) && (*r != '\n') && + (*r != '\r') && (j < indent_col); j++, r++); } else { *w++ = *r++; } } buflen = (int32_t)(w - bufstart); } - /* Check for trailing newline character so we can remove it */ + /* Check for end of line sequence so we can remove it */ if (buflen > 0 && bufstart[buflen - 1] == '\n') { buflen--; + if (buflen > 0 && bufstart[buflen - 1] == '\r') { + buflen--; + } } } if (state->flags & PFLAG_BUFFER) { Bottom version: diff --git a/src/core/parse.c b/src/core/parse.c index 40ccfbf2..ee678814 100644 --- a/src/core/parse.c +++ b/src/core/parse.c @@ -368,7 +368,8 @@ static int stringend(JanetParser *p, JanetParseState *state) { int reindent = 1; while (reindent && (r < end)) { if (*r++ == '\n') { - for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++) { + for (int32_t j = 0; (r < end) && (*r != '\n') && + (*r != '\r') && (j < indent_col); j++, r++) { if (*r != ' ') { reindent = 0; break; @@ -376,32 +377,35 @@ static int stringend(JanetParser *p, JanetParseState *state) { } } } - /* Now reindent if able to, otherwise just drop leading newline. */ - if (!reindent) { - if (buflen > 0 && bufstart[0] == '\n') { - buflen--; - bufstart++; - } - } else { + /* Reindent if needed. */ + if (reindent) { uint8_t *w = bufstart; r = bufstart; + uint8_t at_newline = 0; while (r < end) { - if (*r == '\n') { - if (r == bufstart) { - /* Skip leading newline */ - r++; - } else { - *w++ = *r++; - } - for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++); - } else { - *w++ = *r++; + at_newline = (*r == '\n') ? 1 : 0; + *w++ = *r++; + if (at_newline) { + for (int32_t j = 0; (r < end) && (*r != '\n') && + (*r != '\r') && (j < indent_col); j++, r++); } } buflen = (int32_t)(w - bufstart); } - /* Check for trailing newline character so we can remove it */ - if (buflen > 0 && bufstart[buflen - 1] == '\n') { + /* Skip leading eol sequence if needed. */ + if (buflen > 0 && bufstart[0] == '\n') { + buflen--; + bufstart++; + } else if (buflen > 1 && (bufstart[0] == '\r') && + (bufstart[1] == '\n')) { + buflen -= 2; + bufstart += 2; + } + /* Remove trailing eol sequence if needed. */ + if (buflen > 1 && (bufstart[buflen - 2] == '\r') && + (bufstart[buflen - 1] == '\n')) { + buflen -= 2; + } else if (buflen > 0 && bufstart[buflen - 1] == '\n') { buflen--; } } Thanks to @pyrmont for the fix! |
So thinking over it some more, it seems problematic for literal long-strings to yield different values depending on the end-of-line sequence used within a source file. As additional data, the REPL behavior is consistent between Linux and Windows:
Perhaps each That doesn't seem completely problem-free [1], but may be it's a better behavior than the current situation. [1] May be for nearly every case (and likely case) it would be fine? |
So perhaps something like this: diff --git a/src/core/parse.c b/src/core/parse.c
index 40ccfbf2..59908f24 100644
--- a/src/core/parse.c
+++ b/src/core/parse.c
@@ -368,7 +368,8 @@ static int stringend(JanetParser *p, JanetParseState *state) {
int reindent = 1;
while (reindent && (r < end)) {
if (*r++ == '\n') {
- for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++) {
+ for (int32_t j = 0; (r < end) && (*r != '\n') &&
+ (*r != '\r') && (j < indent_col); j++, r++) {
if (*r != ' ') {
reindent = 0;
break;
@@ -376,6 +377,12 @@ static int stringend(JanetParser *p, JanetParseState *state) {
}
}
}
+ /* Adjust things a bit if there is a leading \r\n eol sequence. */
+ if (buflen > 1 && (bufstart[0] == '\r') &&
+ (bufstart[1] == '\n')) {
+ buflen--;
+ bufstart++;
+ }
/* Now reindent if able to, otherwise just drop leading newline. */
if (!reindent) {
if (buflen > 0 && bufstart[0] == '\n') {
@@ -393,16 +400,24 @@ static int stringend(JanetParser *p, JanetParseState *state) {
} else {
*w++ = *r++;
}
- for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++);
+ for (int32_t j = 0; (r < end) && (*r != '\n') &&
+ (*r != '\r') && (j < indent_col); j++, r++);
+ } else if ((*r == '\r') &&
+ (r + 1 < end) && (*(r + 1) == '\n')) {
+ /* Skip carriage return that is right before a newline */
+ r++;
} else {
*w++ = *r++;
}
}
buflen = (int32_t)(w - bufstart);
}
- /* Check for trailing newline character so we can remove it */
+ /* Check for end of line sequence so we can remove it */
if (buflen > 0 && bufstart[buflen - 1] == '\n') {
buflen--;
+ if (buflen > 0 && bufstart[buflen - 1] == '\r') {
+ buflen--;
+ }
}
}
if (state->flags & PFLAG_BUFFER) { |
I think this is how Python3's multiline string literals work [1] too. For these two bits of code: s = """a
^Mj
z"""
print(len(s)) s = """a^M
^Mj^M
z"""^M
^M
print(len(s))^M where That is, whether the line endings are I think the patch in the previous comment implements this behavior. [1] Haven't tracked down the source, but this bit might be relevant. I think this is part of "Universal Newline Support" defined in PEP 278. According to Wikipedia (see near end of the linked section):
|
I found summaries regarding some programming languages line termination handling starting from here. Languages mentioned include:
One unexpected issue for me was potential security implications of Note that most of this is from around 10 years ago. Also of referential interest might be this Wikipedia article about Newline that was mentioned in the above issue. Of particular note might be the "In programming languages" section. Apparently in Rust:
That seems similar to the Python3 behavior described above. |
Trying to spell out some rationale for long-strings (and long-buffers). So far I have:
ATM, the website docs have this text:
My suspicion at the moment is that the use-case of constructing a multi-line source string via a literal where the end-of-line delimiters are not converted to IIUC, in most C implementations(?), end-of-line sequences get converted to
I don't know if this set a precedent which influenced some newer programming languages, but wouldn't be surprised if it did. |
ATM, it seems to me that parsing might be improved so that source produced in a typical Windows environment might work more consistently with that produced in a more Unix-y environment [1]. There are two obvious ways to me (may be I missed some) one might do this:
I think the former is simpler to implement, but it seems less expressive as one would not be able to get a string or buffer containing a Below are a couple of bits about what Rust does for comparative purposes... AFAICT, Rust does 1. (drops For raw string literals though, apparently Rust only replaces I bring these bits up to try to aid in reasoning about what to do for Janet [2]. My previous comment in this issue tries to spell out some rationale for long-strings (which includes some use cases) to further inform thinking about this situation. [1] To give a bit of background, this came up for me because tests I had written behave differently on Windows than they do on Linux. [2] I think impementing 1. or 2. is better than not changing the current situation, but perhaps others have different opinions... |
\r\n
vs \n
eol sequences differCR
LF
vs LF
eol sequences differ
This is all good discussion, and I do think we should strip out in some cases, but a goal of long strings is to be able to encode any binary sequence, so just dropping all CR doesn't quite work here. There is some special case at the beginning and end of the strings to allow for the form: ``
Text here
`` to not have leading and trailing lines, and allow for indentation. So I'm not sure about how I feel about doing anything to the internal contents if there is no indentation |
Do you see problems with a transformation that replaces One thing I think isn't great about this idea is that it makes it awkward (and possibly brittle) to construct multiline strings that include It seems doable to me if one constructs the long-string literal by using I'm not sure I'd want to write such long-string literals if they have many lines because it doesn't seem ergonomic and I wonder if some editors (and perhaps version control systems) might mess with my source, corrupting my data... [1] I haven't come across a programming language that supports ergonomic construction of this type of data -- so may be it doesn't come up so much in practice? May be that's wishful thinking... |
Closing this as a WNF. The behavior gets too complicated and annoying, mostly to deal with editors and system that try to change how your source code is formatted. |
The following code yields different results [1] when it contains different end-of-line sequences (
CR
LF
vsLF
) [2] (tested with c708ff9):When using
CR
LF
, the result I get is:When using just
LF
, the result I get is:I presume the latter form (when only
LF
is used) is correct in that it removes the leading and trailing eol sequences.Presumably when
CR
LF
is used for a file, one would want the result to be:Update: I think we may not want the
CR
in this example. See this comment below.Below are a couple of attempts at adjustment, but perhaps there are better ways :)
A relatively minimal patch that seems to yield consistent results is:
A somewhat clearer (but I have doubts about equivalence) patch is:
Thanks to @pyrmont for aid in investigation and working on patches.
[1] It's the leading and trailing parts differing that's a problem. That end-of-line sequences that are "interior" differ is to be expected.
[2] In this issue, notationally,
CR
and "carriage return" mean the same, and likewise forLF
and "line feed".The text was updated successfully, but these errors were encountered: