backcompat parsing changes original document #104

Closed
@kartikprabhu

Since mf2py does class substitutions for backcompat parsing, it changes the original BeautifulSoup document passed to parse. I'm not sure yet whether this is a bug in practice, but it is a "quirk" for sure.

cc: @kevinmarks @snarfed @sknebel @bear

Example

In this example, the html variable holds the following HTML as a string:

<article class="hentry">
    <section class="entry-content">
        <p class="entry-summary">This is a summary</p> 
        <p>This is <a href="/tags/mytag" rel="tag">mytag</a> inside content. </p>
    </section>
</article>

Now run the following in Python:

>>> from bs4 import BeautifulSoup
>>> from mf2py import parse
>>> bs = BeautifulSoup(html)
>>> bs.article

This will output the original HTML

<article class="hentry">\n    <section class="entry-content">\n        <p class="entry-summary">This is a summary</p> \n        <p>This is <a href="/tags/mytag" rel="tag">mytag</a> inside content. </p>\n    </section>\n</article>

Now run

>>> parse(bs)
>>> bs.article

This will output the "modified" HTML

<article class="hentry h-entry">\n    <section class="entry-content">\n        <p class="entry-summary">This is a summary</p> \n        <p>This is <a href="/tags/mytag" rel="tag">mytag</a> inside content. </p>\n    </section>\n</article>

Note the following:

  1. article gets an additional h-entry class from backcompat.
  2. None of the children get the corresponding mf2 class transformations.
  3. Calling parse(bs) again will give erroneous results, as it will skip all the properties! The root now carries an mf2 class (h-entry), while the children still have only their mf1 classes.

Problem code

This happens because of https://github.com/microformats/mf2py/blob/master/mf2py/backcompat.py#L112 in backcompat. That line creates a shallow copy of the element and applies the backcompat rules to the copy (BeautifulSoup does not support deepcopy yet). Somehow, though, the children of the element are not affected.
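
For reference, here is a small stdlib-only sketch of how a shallow copy can leak changes back to the original in general. The Node class and add_class helper are hypothetical stand-ins, not BeautifulSoup or mf2py code, and whether mf2py hits exactly this mechanism is an assumption that still needs checking.

```python
import copy


class Node:
    """Toy stand-in for a parsed element (hypothetical, not a bs4 class)."""

    def __init__(self, name, classes, children=None):
        self.name = name
        self.attrs = {"class": list(classes)}
        self.children = children or []


def add_class(node, new_class):
    # Mutates the class list in place, as a naive rewrite step might.
    node.attrs["class"].append(new_class)


root = Node("article", ["hentry"], [Node("p", ["entry-summary"])])

# Shallow copy: a new Node object that shares the attrs dict and children list.
shallow = copy.copy(root)
add_class(shallow, "h-entry")
print(root.attrs["class"])  # ['hentry', 'h-entry']: the original changed too

root2 = Node("article", ["hentry"], [Node("p", ["entry-summary"])])

# Deep copy: attrs and children are duplicated recursively.
deep = copy.deepcopy(root2)
add_class(deep, "h-entry")
add_class(deep.children[0], "p-summary")
print(root2.attrs["class"])              # ['hentry']
print(root2.children[0].attrs["class"])  # ['entry-summary']
```

In the shallow case the copy and the original are distinct objects, but both point at the same attrs dict, so an in-place edit shows up on both.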

Possible solutions

  1. Treat this as not a problem and leave it as is.
  2. Change the entire document to its mf2 equivalent, i.e. don't make shallow copies while applying the mf1 to mf2 conversion rules. (This was the original behaviour, which I changed! My bad.)
  3. Implement some workaround to make a deep copy and parse that instead, so the original document is completely preserved.
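
One possible shape for option 3, as a rough sketch only: serialize the document and re-parse it, so the working copy shares no nodes with the original. The independent_copy and add_h_entry names are hypothetical, add_h_entry is just a stand-in for a backcompat rewrite (it is not mf2py's code), and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

html = """<article class="hentry">
    <section class="entry-content">
        <p class="entry-summary">This is a summary</p>
    </section>
</article>"""


def independent_copy(soup):
    # Serialize and re-parse, so the result shares no nodes with `soup`.
    return BeautifulSoup(str(soup), "html.parser")


def add_h_entry(soup):
    # Hypothetical stand-in for a backcompat rewrite; not mf2py's code.
    for el in soup.find_all(class_="hentry"):
        el["class"] = el.get("class", []) + ["h-entry"]


original = BeautifulSoup(html, "html.parser")
working = independent_copy(original)
add_h_entry(working)

print(original.article["class"])  # ['hentry']
print(working.article["class"])   # ['hentry', 'h-entry']
```

Re-parsing costs an extra pass over the document, and the copy may differ from the original in small ways if the parser normalizes the markup, so this trades some speed and fidelity for leaving the caller's document untouched.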
