Description
Since mf2py does class substitutions for backcompat parsing, it changes the original BeautifulSoup document passed to parse. I'm not sure yet whether this is a bug in practice, but it is a "quirk" for sure.
cc: @kevinmarks @snarfed @sknebel @bear
Example
The following is an example, with the html variable holding the following HTML as a string:
<article class="hentry">
<section class="entry-content">
<p class="entry-summary">This is a summary</p>
<p>This is <a href="/tags/mytag" rel="tag">mytag</a> inside content. </p>
</section>
</article>
Now run the following in Python:
>>> from bs4 import BeautifulSoup
>>> from mf2py import parse
>>> bs = BeautifulSoup(html)
>>> bs.article
This will output the original HTML
<article class="hentry">\n <section class="entry-content">\n <p class="entry-summary">This is a summary</p> \n <p>This is <a href="/tags/mytag" rel="tag">mytag</a> inside content. </p>\n </section>\n</article>
Now run
>>> parse(bs)
>>> bs.article
This will output the "modified" HTML
<article class="hentry h-entry">\n <section class="entry-content">\n <p class="entry-summary">This is a summary</p> \n <p>This is <a href="/tags/mytag" rel="tag">mytag</a> inside content. </p>\n </section>\n</article>
Note the following:
- article gets an additional h-entry class from backcompat
- None of the children get the corresponding mf2 class transformations
- Using parse(bs) again will give erroneous results, as it will skip all the properties!
Problem code
This happens because of https://github.com/microformats/mf2py/blob/master/mf2py/backcompat.py#L112 in backcompat, which creates a shallow copy of the element to apply the backcompat rules (BeautifulSoup does not support deepcopy yet). For some reason, though, the class change does not reach the children of the element.
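To illustrate how a shallow copy can leak changes back to the original, here is a minimal sketch using a hypothetical Node class as a stand-in for a parsed element. It is not BeautifulSoup's Tag and not the mf2py code. It only assumes that the element keeps its attributes in a dict, which a shallow copy then shares with the original.

```python
import copy


class Node:
    """Hypothetical stand-in for a parsed element (not bs4's Tag)."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs  # mutable dict, e.g. {"class": ["hentry"]}


original = Node("article", {"class": ["hentry"]})

# copy.copy() makes a new Node, but both objects point at the same attrs dict.
shallow = copy.copy(original)
shallow.attrs["class"] = shallow.attrs["class"] + ["h-entry"]

# copy.deepcopy() duplicates the attrs dict as well, so edits stay local.
untouched = Node("article", {"class": ["hentry"]})
deep = copy.deepcopy(untouched)
deep.attrs["class"] = deep.attrs["class"] + ["h-entry"]

print(original.attrs["class"])   # ['hentry', 'h-entry']
print(untouched.attrs["class"])  # ['hentry']
```

Whether this is exactly what happens inside mf2py is an assumption. The sketch only shows that writing to a shared attribute dict through a shallow copy is visible on the original object.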
Possible solutions
- Treat this as not a problem and leave it as is.
- Change the entire document to its mf2 equivalent, i.e. don't make shallow copies while applying the mf1 to mf2 conversion rules. (This was the original behaviour, which I changed! My bad.)
- Implement some workaround to make a deep copy and work on that for parsing to completely preserve the original document.
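
The last option could be sketched as below, assuming that re-parsing the serialized document is an acceptable way to get an independent tree. The helper name independent_copy is hypothetical, the choice of html.parser is an assumption, and the class change only simulates a backcompat rewrite rather than calling mf2py.

```python
from bs4 import BeautifulSoup

html = '<article class="hentry"><p class="entry-summary">This is a summary</p></article>'


def independent_copy(soup):
    """Hypothetical helper: rebuild a separate tree from the serialized markup."""
    return BeautifulSoup(str(soup), "html.parser")


bs = BeautifulSoup(html, "html.parser")
work = independent_copy(bs)

# Simulate a backcompat rewrite on the working copy only.
work.article["class"] = work.article["class"] + ["h-entry"]

print(bs.article["class"])    # ['hentry']
print(work.article["class"])  # ['hentry', 'h-entry']
```

Re-parsing costs an extra pass over the whole document, so it trades some speed for leaving the caller's tree untouched.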