Skip to content

Span tracking #1246

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft

Span tracking #1246

wants to merge 1 commit into from

Conversation

robsdedude
Copy link

@robsdedude robsdedude commented Mar 11, 2025

Introducing optional span tracking (line, column, byte offset).

The feature is behind a new feature flag called spanned. It introduces

  • Spanned<T>, with wich you can get serde_json (when calling the appropriate deserializer) to inject span information into custom Deserialize types.
  • SpannedValue a copy of Value that uses Spanned<T> to provide span information.

This PR still needs a bit of fixing up:

  • clean out quick testing code
  • Fix relying on Read::position as its column isn't what we want and its line is slow to compute
    • While doing so, make sure to fix spans including leading white space

And then a good chunk of polishing:

  • DRY up the code (much of SpannedValue functionality is 90% copy-pasta from Value)
  • Write tests
  • Update/write docs
  • Maybe add the feature to the default features for the playground?

However, before I proceed with any of that, and since I've already put a good amount hours into this, I'd first like to get initial feedback on/review of the general design (API, outlined provided functionality).

Here's the gist of how it'd work

Sample code with output
#[cfg(test)]
mod test {
    use crate::spanned::*;
    use crate::*;

    fn show_spanned<T: core::fmt::Debug>(s: &Spanned<T>, source: &str) {
        use codespan_reporting::diagnostic::{Diagnostic, Label};
        use codespan_reporting::files::SimpleFiles;
        use codespan_reporting::term;
        use codespan_reporting::term::termcolor::{ColorChoice, StandardStream};

        let mut files = SimpleFiles::new();
        let file_id = files.add("input", source);
        let diagnostic = Diagnostic::note()
            .with_message(std::format!("Look, it's a {}", std::any::type_name::<T>()))
            .with_notes(std::vec![
                std::format!("{:?}", s.as_ref()),
                std::format!(
                    "From {}:{} ({}) to {}:{} ({})",
                    s.span().start.line,
                    s.span().start.column,
                    s.span().start.byte_offset,
                    s.span().end.line,
                    s.span().end.column,
                    s.span().end.byte_offset,
                )
            ])
            .with_labels(std::vec![Label::primary(file_id, s.byte_span())]);

        let writer = StandardStream::stderr(ColorChoice::Always);
        let config = codespan_reporting::term::Config::default();
        let mut writer_lock = writer.lock();

        term::emit(&mut writer_lock, &config, &files, &diagnostic).unwrap()
    }

    fn show_recursive(v: &Spanned<SpannedValue>, source: &str) {
        show_spanned(v, source);
        match v.get_ref() {
            SpannedValue::Array(values) => values.iter().for_each(|v| show_recursive(v, source)),
            SpannedValue::Object(map) => map.iter().for_each(|(k, v)| {
                show_spanned(k, source);
                show_recursive(v, source)
            }),
            _ => {}
        }
    }

    #[test]
    fn test_spanned_value() {
        let data = r#"
        {
            "name": "John Doe",
            "age": 42,
            "phones":    [
                "+44 1234567",
                "+44 2345678"
            ]
        }"#;

        let v: Spanned<SpannedValue> = from_str_spanned(data).unwrap();
        show_recursive(&v, data);
    }
}

Produces

note: Look, it's a serde_json::spanned::value::SpannedValue
  ┌─ input:1:1
  │  
1 │ ╭ 
2 │ │         {
3 │ │             "name": "John Doe",
4 │ │             "age": 42,
  · │
8 │ │             ]
9 │ │         }
  │ ╰─────────^
  │  
  = Object {Spanned { span: Span { start: SpanPosition { line: 3, column: 13, byte_offset: 23 }, end: SpanPosition { line: 3, column: 18, byte_offset: 29 } }, value: "name" }: Spanned { span: Span { start: SpanPosition { line: 3, column: 20, byte_offset: 30 }, end: SpanPosition { line: 3, column: 30, byte_offset: 41 } }, value: String("John Doe") }, Spanned { span: Span { start: SpanPosition { line: 4, column: 13, byte_offset: 55 }, end: SpanPosition { line: 4, column: 17, byte_offset: 60 } }, value: "age" }: Spanned { span: Span { start: SpanPosition { line: 4, column: 19, byte_offset: 61 }, end: SpanPosition { line: 4, column: 21, byte_offset: 64 } }, value: Number(42) }, Spanned { span: Span { start: SpanPosition { line: 5, column: 13, byte_offset: 78 }, end: SpanPosition { line: 5, column: 20, byte_offset: 86 } }, value: "phones" }: Spanned { span: Span { start: SpanPosition { line: 5, column: 22, byte_offset: 87 }, end: SpanPosition { line: 8, column: 13, byte_offset: 167 } }, value: Array [Spanned { span: Span { start: SpanPosition { line: 6, column: 17, byte_offset: 109 }, end: SpanPosition { line: 6, column: 29, byte_offset: 122 } }, value: String("+44 1234567") }, Spanned { span: Span { start: SpanPosition { line: 7, column: 17, byte_offset: 140 }, end: SpanPosition { line: 7, column: 29, byte_offset: 153 } }, value: String("+44 2345678") }] }}
  = From 1:1 (0) to 9:9 (177)

note: Look, it's a alloc::string::String
  ┌─ input:3:13
  │
3 │             "name": "John Doe",
  │             ^^^^^^
  │
  = "name"
  = From 3:13 (23) to 3:18 (29)

note: Look, it's a serde_json::spanned::value::SpannedValue
  ┌─ input:3:20
  │
3 │             "name": "John Doe",
  │                    ^^^^^^^^^^^
  │
  = String("John Doe")
  = From 3:20 (30) to 3:30 (41)

note: Look, it's a alloc::string::String
  ┌─ input:4:13
  │
4 │             "age": 42,
  │             ^^^^^
  │
  = "age"
  = From 4:13 (55) to 4:17 (60)

note: Look, it's a serde_json::spanned::value::SpannedValue
  ┌─ input:4:19
  │
4 │             "age": 42,
  │                   ^^^
  │
  = Number(42)
  = From 4:19 (61) to 4:21 (64)

note: Look, it's a alloc::string::String
  ┌─ input:5:13
  │
5 │             "phones":    [
  │             ^^^^^^^^
  │
  = "phones"
  = From 5:13 (78) to 5:20 (86)

note: Look, it's a serde_json::spanned::value::SpannedValue
  ┌─ input:5:22
  │  
5 │               "phones":    [
  │ ╭──────────────────────^
6 │ │                 "+44 1234567",
7 │ │                 "+44 2345678"
8 │ │             ]
  │ ╰─────────────^
  │  
  = Array [Spanned { span: Span { start: SpanPosition { line: 6, column: 17, byte_offset: 109 }, end: SpanPosition { line: 6, column: 29, byte_offset: 122 } }, value: String("+44 1234567") }, Spanned { span: Span { start: SpanPosition { line: 7, column: 17, byte_offset: 140 }, end: SpanPosition { line: 7, column: 29, byte_offset: 153 } }, value: String("+44 2345678") }]
  = From 5:22 (87) to 8:13 (167)

note: Look, it's a serde_json::spanned::value::SpannedValue
  ┌─ input:6:17
  │
6 │                 "+44 1234567",
  │                 ^^^^^^^^^^^^^
  │
  = String("+44 1234567")
  = From 6:17 (109) to 6:29 (122)

note: Look, it's a serde_json::spanned::value::SpannedValue
  ┌─ input:7:17
  │
7 │                 "+44 2345678"
  │                 ^^^^^^^^^^^^^
  │
  = String("+44 2345678")
  = From 7:17 (140) to 7:29 (153)

Closes: #637

@robsdedude robsdedude marked this pull request as draft March 11, 2025 08:29
@@ -36,6 +35,8 @@ pub struct Deserializer<R> {
single_precision: bool,
#[cfg(feature = "unbounded_depth")]
disable_recursion_limit: bool,
#[cfg(feature = "spanned")]
spanned_enabled: bool,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design note on why I decided to only add support for spans when explicitly enabled (e.g., by calling from_str_spanned instead of from_str):

Otherwise, this would be a breaking change for anyone passing a struct whose Deserialize implementation calls deserialize_struct on serde_json's Deserializer with arguments satisfying is_spanned. Admittedly, this is highly unlikely, but I'd rather stick to hard SemVer rules. If you disagree, I'm happy to put in this functionality unconditionally.

Another benefit of this design is that when fleshing out the implementation, this allows me to move the position tracking into the Deserializer (as Read's tracking seems to be insufficient) and only engage in the counting/tracking if the user indicated that they're interested in span information in the first place (by calling one of the new APIs).

@197g
Copy link

197g commented Jun 1, 2025

I can't comment as a maintainer of serde_json but as a maintainer of other crates, the best path of action is to dramatically cut down the size and number of concepts. For both review and your own good since there are many open TODO's in that commit and you're not going to want to address them all at once, but over time. Set yourself up for incremental small steps forward, I've seen smaller efforts just lose momentum.

For instance, SpannedValue and the Spanned<T>: Deserialize seem quite orthogonal to each other.


SpannedValue on its own seems very concrete. There is also clear prior reference to its implementation style in the crate. This already calls into question the need for a bool-flag in the deserializer (compare RawValue doesn't need to be turned on in addition to the feature flag). If it gets a feature flag then I don't see why SpannedValue and Spanned<T> should have overlapping ones.

It would also be helpful to expand on the motivating use cases in tests some more. For users that are are deriving Deserialize there are already options:

  • is actually a minimal implementation satisfying the referenced issue
    • for diagnostics, downstream crate want to issue help that refers to spans in the json input after semantically interpreting the json
  • for rewriting, the spans allows us to programmatically replace the value
  • the span allows getting a &'lt RawValue after deserialization

Each of these deserve asserts that demonstrate the additional information is valuable. If you can show that this can do multiple uses cases, it's very convincing. Needing lots of maintainance for niche gains will not be easy to add to any OSS crate. The show call is just there, a proper test is more convincing.

Also probably cut the impl Serializer and impl Deserializer for Spanned<T> initially unless you can link them to one of the above. Those adds hundreds of lines of their own, need separate test coverage, have some complex question with regards to future versions, and should at least be a separate commit.


Then in comparison, I'm not certain the motivation for SpannedValue is nearly as good. Sure, adding span information to all parts of a json Value interesting. But you're littering the overhead everywhere (performance, memory use and thousands of implementation lines) and the question is really for which use case this is the best tool. That doesn't seem nearly as clear to me. Is it even realistic to expect one to build up such a value by hand?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Get line and column number on valid JSON?
2 participants