Use CMake to build uchardet and update upstream submodule #5
Changes from all commits
7c37c65
f319faf
784a47b
69c80ba
e80234a
44553be
11fdb93
5be347f
4a5a4fe
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,3 @@ | ||
[submodule "uchardet-sys/uchardet"] | ||
path = uchardet-sys/uchardet | ||
url = https://github.com/BYVoid/uchardet | ||
url = https://anongit.freedesktop.org/git/uchardet/uchardet.git |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,80 +1,91 @@ | ||
//! A wrapper around the uchardet library. Detects character encodings. | ||
//! A wrapper around the uchardet library. Detects character encodings. | ||
//! | ||
//! Note that the underlying implementation is written in C and C++, and I'm | ||
//! not aware of any security audits which have been performed against it. | ||
//! | ||
//! ``` | ||
//! use uchardet::detect_encoding_name; | ||
//! | ||
//! assert_eq!(Some("windows-1252".to_string()), | ||
//! detect_encoding_name(&[0x66u8, 0x72, 0x61, 0x6e, 0xe7, | ||
//! 0x61, 0x69, 0x73]).unwrap()); | ||
//! assert_eq!("WINDOWS-1252", | ||
//! detect_encoding_name(&[0x46, 0x93, 0x72, 0x61, 0x6e, 0xe7, 0x6f, | ||
//! 0x69, 0x73, 0xe9, 0x94]).unwrap()); | ||
Does the detector still detect For my use case, I encounter a lot of input data in
It outputs
Relevant wiki pages: |
||
//! ``` | ||
//! | ||
//! For more information, see [this project on | ||
//! GitHub](https://github.com/emk/rust-uchardet). | ||
|
||
// Increase the compiler's recursion limit for the `error_chain` crate. | ||
#![recursion_limit = "1024"] | ||
#![deny(missing_docs)] | ||
|
||
#[macro_use] | ||
extern crate error_chain; | ||
extern crate libc; | ||
extern crate uchardet_sys as ffi; | ||
|
||
use libc::size_t; | ||
use std::error::Error; | ||
use std::fmt; | ||
use std::result::Result; | ||
use std::ffi::CStr; | ||
use std::str::from_utf8; | ||
|
||
/// An error occurred while trying to detect the character encoding. | ||
#[derive(Debug)] | ||
pub struct EncodingDetectorError { | ||
message: String | ||
} | ||
pub use errors::*; | ||
|
||
impl Error for EncodingDetectorError { | ||
fn description(&self) -> &str { "encoding detector error" } | ||
fn cause(&self) -> Option<&Error> { None } | ||
} | ||
#[allow(missing_docs)] | ||
mod errors { | ||
error_chain! { | ||
errors { | ||
UnrecognizableCharset { | ||
description("unrecognizable charset") | ||
display("uchardet was unable to recognize a charset") | ||
} | ||
OutOfMemory { | ||
description("out of memory error") | ||
display("uchardet ran out of memory") | ||
} | ||
Other(int: i32) { | ||
description("unknown error") | ||
display("uchardet returned unknown error {}", int) | ||
} | ||
} | ||
} | ||
|
||
impl ErrorKind { | ||
pub fn from_nsresult(nsresult: ::ffi::nsresult) -> ErrorKind { | ||
I don't know how to make this private but still accessible for
I could make a public free function
In that case, we should probably just hoist all the code in Thank you for taking the time to respond to all my excessively detailed review comments, by the way!
I put it in a module so I could It's no problem at all; rather, it's really educational for me to have somebody more experienced guide me.
(This comment is just a remark about something I noticed, not a feature request!) You know, I think it might actually be possible to fix error-chain to support doc comments. Comments starting with
I actually also thought about that possibility but after taking a glance at quick_error.rs I quickly abandoned that idea... 😅 (PR is up: rust-lang-deprecated/error-chain#50) |
||
assert!(nsresult != 0); | ||
match nsresult { | ||
1 => ErrorKind::OutOfMemory, | ||
int => ErrorKind::Other(int), | ||
} | ||
} | ||
|
||
impl fmt::Display for EncodingDetectorError { | ||
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { | ||
write!(f, "{}", &self.message) | ||
} | ||
} | ||
|
||
/// Either a return value, or an encoding detection error. | ||
pub type EncodingDetectorResult<T> = Result<T, EncodingDetectorError>; | ||
|
||
/// Detects the encoding of text using the uchardet library. | ||
/// | ||
/// EXPERIMENTAL: This may be replaced by a better API soon. | ||
struct EncodingDetector { | ||
ptr: ffi::uchardet_t | ||
} | ||
|
||
/// Return the name of the charset used in `data`, or `None` if the | ||
/// charset is ASCII or if the encoding can't be detected. This is | ||
/// the value returned by the underlying `uchardet` library, with | ||
/// the empty string mapped to `None`. | ||
/// Return the name of the charset used in `data` or an error if uchardet | ||
/// was unable to detect a charset. | ||
/// | ||
/// ``` | ||
/// use uchardet::detect_encoding_name; | ||
/// | ||
/// assert_eq!(None, detect_encoding_name("ascii".as_bytes()).unwrap()); | ||
/// assert_eq!(Some("UTF-8".to_string()), | ||
/// detect_encoding_name("français".as_bytes()).unwrap()); | ||
/// assert_eq!(Some("windows-1252".to_string()), | ||
/// detect_encoding_name(&[0x66u8, 0x72, 0x61, 0x6e, 0xe7, | ||
/// 0x61, 0x69, 0x73]).unwrap()); | ||
/// assert_eq!("ASCII", | ||
/// detect_encoding_name("ascii".as_bytes()).unwrap()); | ||
/// assert_eq!("UTF-8", | ||
/// detect_encoding_name("©français".as_bytes()).unwrap()); | ||
/// assert_eq!("WINDOWS-1252", | ||
/// detect_encoding_name(&[0x46, 0x93, 0x72, 0x61, 0x6e, 0xe7, 0x6f, | ||
/// 0x69, 0x73, 0xe9, 0x94]).unwrap()); | ||
/// ``` | ||
pub fn detect_encoding_name(data: &[u8]) -> | ||
EncodingDetectorResult<Option<String>> | ||
{ | ||
pub fn detect_encoding_name(data: &[u8]) -> Result<String> { | ||
let mut detector = EncodingDetector::new(); | ||
try!(detector.handle_data(data)); | ||
detector.data_end(); | ||
Ok(detector.charset()) | ||
detector.charset() | ||
} | ||
|
||
impl EncodingDetector { | ||
|
@@ -85,49 +96,49 @@ impl EncodingDetector { | |
EncodingDetector{ptr: ptr} | ||
} | ||
|
||
/// Pass a chunk of raw bytes to the detector. This is a no-op if a | ||
/// Pass a chunk of raw bytes to the detector. This is a no-op if a | ||
/// charset has been detected. | ||
fn handle_data(&mut self, data: &[u8]) -> EncodingDetectorResult<()> { | ||
let result = unsafe { | ||
fn handle_data(&mut self, data: &[u8]) -> Result<()> { | ||
let nsresult = unsafe { | ||
ffi::uchardet_handle_data(self.ptr, data.as_ptr() as *const i8, | ||
data.len() as size_t) | ||
}; | ||
match result { | ||
match nsresult { | ||
0 => Ok(()), | ||
_ => { | ||
let msg = "Error handling data".to_string(); | ||
Err(EncodingDetectorError{message: msg}) | ||
int => { | ||
Err(ErrorKind::from_nsresult(int).into()) | ||
} | ||
} | ||
} | ||
|
||
/// Notify the detector that we're done calling `handle_data`, and that | ||
/// we want it to make a guess as to our encoding. This is a no-op if | ||
/// we want it to make a guess as to our encoding. This is a no-op if | ||
/// no data has been passed yet, or if an encoding has been detected | ||
/// for certain. From reading the code, it appears that you can safely | ||
/// for certain. From reading the code, it appears that you can safely | ||
/// call `handle_data` after calling this, but I'm not certain. | ||
fn data_end(&mut self) { | ||
unsafe { ffi::uchardet_data_end(self.ptr); } | ||
} | ||
|
||
/// Reset the detector's internal state. | ||
//fn reset(&mut self) { | ||
// fn reset(&mut self) { | ||
// unsafe { ffi::uchardet_reset(self.ptr); } | ||
//} | ||
// } | ||
|
||
/// Get the decoder's current best guess as to the encoding. Returns | ||
/// `None` on error, or if the data appears to be ASCII. | ||
fn charset(&self) -> Option<String> { | ||
/// Get the decoder's current best guess as to the encoding. May return | ||
/// an error if uchardet was unable to detect an encoding. | ||
fn charset(&self) -> Result<String> { | ||
unsafe { | ||
let internal_str = ffi::uchardet_get_charset(self.ptr); | ||
assert!(!internal_str.is_null()); | ||
let bytes = CStr::from_ptr(internal_str).to_bytes(); | ||
let charset = from_utf8(bytes); | ||
match charset { | ||
Err(_) => | ||
panic!("uchardet_get_charset returned invalid value"), | ||
Ok("") => None, | ||
Ok(encoding) => Some(encoding.to_string()) | ||
panic!("uchardet_get_charset returned a charset name \ | ||
containing invalid characters"), | ||
Ok("") => Err(ErrorKind::UnrecognizableCharset.into()), | ||
Ok(encoding) => Ok(encoding.to_string()) | ||
} | ||
} | ||
} | ||
|
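The error mapping introduced in this file can be sketched without `error_chain`, using only the standard library. This is a minimal illustration rather than the PR's code: `DetectError`, `check_status`, and `charset_from_c_str` are hypothetical names, and unlike `from_nsresult` (which asserts that the code is non-zero) the sketch treats `0` as success inside the same function.

```rust
use std::ffi::CStr;
use std::fmt;

/// Hypothetical stand-in for the error kinds declared with `error_chain!`.
#[derive(Debug, PartialEq)]
enum DetectError {
    UnrecognizableCharset,
    OutOfMemory,
    Other(i32),
}

impl fmt::Display for DetectError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match *self {
            DetectError::UnrecognizableCharset => {
                write!(f, "uchardet was unable to recognize a charset")
            }
            DetectError::OutOfMemory => write!(f, "uchardet ran out of memory"),
            DetectError::Other(code) => {
                write!(f, "uchardet returned unknown error {}", code)
            }
        }
    }
}

/// Map a status code to a `Result`: 0 is success, 1 is out of memory,
/// and anything else is reported with its raw value.
fn check_status(code: i32) -> Result<(), DetectError> {
    match code {
        0 => Ok(()),
        1 => Err(DetectError::OutOfMemory),
        other => Err(DetectError::Other(other)),
    }
}

/// Turn a C string holding a charset name into an owned `String`.
/// An empty name becomes an error, mirroring the `Ok("")` arm in the diff.
fn charset_from_c_str(name: &CStr) -> Result<String, DetectError> {
    match name.to_str() {
        Ok("") => Err(DetectError::UnrecognizableCharset),
        Ok(charset) => Ok(charset.to_string()),
        // As in the diff, a non-UTF-8 name is treated as a bug, not an error.
        Err(_) => panic!("charset name contained invalid characters"),
    }
}
```

With this shape, a caller can match on the error value instead of checking for `None`, which is the main behavioural change the new `Result<String>` signature brings.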
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,3 +18,4 @@ libc = "*" | |
|
||
[build-dependencies] | ||
pkg-config = '*' | ||
cmake = "*" | ||
I'm quite happy to see this dependency, especially if it helps us build on Windows and stay in sync with upstream. |
Do we want to always require the latest stable Rust, or do we want to support back to some specific version? I could be convinced to go either way.
I don't really have an opinion on that (I'm normally running nightly). In theory, I believe 1.2 should suffice, as the incompatibility was caused by the use of debug builders, which have been stable since 1.2.
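
For readers unfamiliar with the term, the debug builders are presumably the `fmt::Formatter` helpers such as `debug_struct`. A minimal sketch of a hand-written `Debug` impl that uses one follows; `DetectorError` is a hypothetical type for illustration, not something taken from this PR.

```rust
use std::fmt;

/// Hypothetical error type used only to demonstrate the builder.
struct DetectorError {
    message: String,
}

impl fmt::Debug for DetectorError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // `debug_struct` returns a builder; each `field` call adds one
        // name/value pair and `finish` writes the closing brace.
        f.debug_struct("DetectorError")
            .field("message", &self.message)
            .finish()
    }
}
```

Code written this way does not compile on toolchains older than the release that stabilized the builders, which is the kind of incompatibility the comment describes.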