Implement character encoding detection / conversion #18
rust-encoding implements the Encoding spec. We already use it in Servo for CSS. HTML’s “prescan the byte stream to determine its encoding” algorithm is the one that looks for <meta> encoding declarations. I believe html5ever should implement the encoding sniffing algorithm (of which the prescan is a part) independently of the tokenizer and parser (although they might share code internally), so that, conceptually, bytes are decoded to a Unicode stream before the tokenizer and tree builder see them.
(It would also be nice to have some kind of Unicode stream: rust-lang/rfcs#57)
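As a rough illustration of sniffing as a stage that runs before tokenizing, here is a minimal Python sketch of the order in which the sources of encoding information are usually consulted. It is a simplification, not the full spec algorithm: the function name `sniff_encoding`, the `prescan` callback parameter, and the `windows-1252` fallback are assumptions made for this example, and charset labels are not normalised the way the Encoding spec requires.

```python
import codecs
import re

# Byte order marks checked first; names are Python codec names.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Pulls the charset parameter out of a Content-Type header value.
CHARSET_PARAM = re.compile(r"""charset\s*=\s*["']?([^\s;"']+)""", re.I)


def sniff_encoding(head, content_type=None, prescan=None, fallback="windows-1252"):
    """Return (encoding, confidence) for the first bytes of a document.

    `prescan` is a hypothetical callback taking bytes and returning an
    encoding name or None; `fallback` is an assumed default.
    """
    # 1. A byte order mark takes precedence over everything else.
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, "certain"
    # 2. Transport-layer information, e.g. "text/html; charset=utf-8".
    if content_type:
        match = CHARSET_PARAM.search(content_type)
        if match:
            return match.group(1).lower(), "certain"
    # 3. Prescan of the first 1024 bytes for a <meta> declaration.
    if prescan is not None:
        found = prescan(head[:1024])
        if found:
            return found, "tentative"
    # 4. Nothing found: use the default, which may be revised later.
    return fallback, "tentative"
```

The returned encoding would feed a decoder whose output is the only thing the tokenizer sees. The "tentative" confidence is what leaves room for a later change of encoding.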
Per the spec, the tree builder can also reload the document with a different encoding when it encounters a relevant <meta> element. This seems annoying, given that html5ever is not necessarily running in Servo. @hsivonen suggests on #whatwg @ Freenode that we might get away with ignoring that part of the spec.
In the test case below, Chrome 46 dev switches encodings mid-stream (but not exactly at the point of the <meta> tag):

import time
from wsgiref.simple_server import make_server
def simple_app(environ, start_response):
    status = '200 OK'
    headers = [('Content-type', 'text/html')]
    start_response(status, headers)
    yield u"""
<!DOCTYPE html>
<!-- 1024 bytes:
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
123456789012345678901234567890123456789012345678901234567890123
-->
è""".encode('utf8')
    # Pause so that the <meta> declaration arrives in a later chunk.
    time.sleep(2)
    yield u"è<meta charset=utf-8>é".encode('utf8')

httpd = make_server('', 8000, simple_app)
print("Listening on port 8000....")
httpd.serve_forever()
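To see why this test case is awkward for a parser, the sketch below rebuilds the same two chunks offline and checks two things: where the <meta> declaration falls relative to a 1024-byte prescan window, and what the non-ASCII characters look like under a fallback decoding. The `PRESCAN_WINDOW` name and the `windows-1252` fallback are assumptions for illustration; the padding is generated rather than copied, but has the same 16 lines of 63 digits.

```python
# Rebuild the two chunks that the WSGI app yields.
padding = ("1234567890" * 6 + "123\n") * 16  # 16 lines of 64 bytes each
first_chunk = (
    "\n<!DOCTYPE html>\n<!-- 1024 bytes:\n" + padding + "-->\n\u00e8"
).encode("utf8")
second_chunk = "\u00e8<meta charset=utf-8>\u00e9".encode("utf8")
body = first_chunk + second_chunk

PRESCAN_WINDOW = 1024  # assumed size of the prescan window

# The declaration starts after the window, so a prescan cannot see it.
meta_offset = body.index(b"<meta")
meta_in_window = meta_offset < PRESCAN_WINDOW

# The same bytes decoded with an assumed fallback and with UTF-8.
as_fallback = body.decode("windows-1252")
as_utf8 = body.decode("utf-8")

print("meta inside prescan window:", meta_in_window)
print("tail under fallback:", ascii(as_fallback[-2:]))
print("tail under UTF-8:", ascii(as_utf8[-1:]))
```

Because one non-ASCII character precedes the declaration in each chunk and one follows it, the rendered page reveals at which point, if any, a browser switched decoders.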
@SimonSapin what happens when it's in response to a POST request?
Rewrite the high-level API (driver module) to use TendrilSink. This depends on servo/tendril#23. This also adds an API to parse from bytes, which is part of #18. r? @nox
#188 added a [...] It should be possible to modify the [...]
@shinglyu Servo doesn't process meta tags yet.
@jdm: I see, so the parser has the ability to do so but it's not hooked up in Servo yet?
I meant
@SimonSapin Is servo/servo#9730 what you were talking about?
Yes.
No need for a debugger, I can confirm that current Servo always uses UTF-8 for HTML. (servo/servo#9730 never landed.)

What happened with #9730 is that I first tried to build an abstraction in html5ever with a nice simplified API, and then realized it didn’t fit what Servo needs. So this time around I suggest first implementing the encoding sniffing algorithm in Servo, and only later seeing what kind of API we can build to move it into html5ever.

And we should use encoding_rs instead of rust-encoding, now that Gecko ships it. I have some ideas for adding encoding_rs support in Tendril.
See the HTML spec and the WHATWG Encoding spec. This also entails noticing <meta charset=...> and <meta http-equiv="Content-Type"> as we parse.
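A minimal sketch of what "noticing" the two <meta> forms could look like on raw bytes follows. It is a regex-based approximation, not the byte-by-byte prescan the HTML spec defines: the function name `prescan_meta`, the 1024-byte `limit` default, and the regexes are assumptions for this example, and edge cases such as a comment cut off by the window are not handled.

```python
import re

COMMENT_RE = re.compile(rb"<!--.*?-->", re.S)
META_RE = re.compile(rb"<meta[\s/]([^>]*)>", re.I)
# One attribute: name = "double-quoted" | 'single-quoted' | bare value.
ATTR_RE = re.compile(rb"""([^\s=/>"']+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""")
CONTENT_CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([^\s;"']+)""", re.I)


def prescan_meta(head, limit=1024):
    """Return the encoding label declared by a <meta> in the first
    `limit` bytes, or None if no declaration is found there."""
    # Declarations inside complete comments must not count.
    window = COMMENT_RE.sub(b"", head[:limit])
    for match in META_RE.finditer(window):
        attrs = {}
        for name, dq, sq, bare in ATTR_RE.findall(match.group(1)):
            # Keep the first occurrence of a repeated attribute.
            attrs.setdefault(name.lower(), dq or sq or bare)
        # Form 1: <meta charset=...>
        if b"charset" in attrs:
            return attrs[b"charset"].decode("ascii", "replace").strip().lower()
        # Form 2: <meta http-equiv="Content-Type" content="...; charset=...">
        if attrs.get(b"http-equiv", b"").lower() == b"content-type":
            found = CONTENT_CHARSET_RE.search(attrs.get(b"content", b""))
            if found:
                return found.group(1).decode("ascii", "replace").lower()
    return None
```

For example, `prescan_meta(b"<meta charset=utf-8>")` returns `"utf-8"`, while a declaration that starts after the first 1024 bytes yields `None`, which is the situation where the tree builder would have to notice the element instead.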