Skip to content

Use C-style IO instead of ifstream for parsing#193

Merged
ToruNiina merged 4 commits intoToruNiina:masterfrom
lukash:replace-fstream
Sep 23, 2022
Merged

Use C-style IO instead of ifstream for parsing#193
ToruNiina merged 4 commits intoToruNiina:masterfrom
lukash:replace-fstream

Conversation

@lukash
Copy link
Contributor

@lukash lukash commented Jun 30, 2022

The fstream classes are notorious for their non-existent error handling. The stream error bits weren't even checked in the code, resulting in arbitrary malfunctions or crashing in case of an error.

This replaces the ifstream with C-style file IO (fopen(), etc.) and adds an exception class that is thrown in case of errors, including a descriptive error message.

It preserves the parse() variant for an istream, though I'd say using it with an ifstream is for the large part detrimental.

I'm introducing a new file_io_error exception class, let me know if it needs adjusting. It inherits from std::runtime_error, not from toml::exception, as that didn't seem to be a good fit. On a side note I think the whole exception class hierarchy should inherit from std::runtime_error, mainly to preserve a distinction from std::logic_error, as the API caller may or may not want to catch those.

As for how I've run into this, I've had the ifstream code crash on me due to mistakenly passing a directory path on a tmpfs filesystem. It failed on the assert(fsize >= 0); and left me dumbfounded for a while. I can't easily add that to the unittests, but the new code presents a reasonable error message for that case.

@lukash
Copy link
Contributor Author

lukash commented Jul 26, 2022

Hello, @ToruNiina, any feedback? I see the Windows CI has failed, unfortunately I am not able to test on Windows.

@ToruNiina
Copy link
Owner

This problem is difficult to reproduce and I haven't actually reproduced it, but I assume that this problem will occur (under very limited conditions).
So we should provide a way to workaround this problem. But using FILE* and fopen everywhere seems, IMHO, to be too much.

I looked several well-known (than mine) c++ libraries that parses a file and found that most of the libraries uses fstream internally after taking a filename. Some supports a parse function that directly takes FILE* not a filename, in addition to stream version. So I think using a stream internally is a common approach and least-surprise behavior.

Also, there are some trouble around FILE* and fopen on non-POSIX environment. Windows uses wchar_t as its filesystem::path::value_type so filesystem::path::string_type will be wstring. Its c_str will be a wchar_t* but fopen takes char*. Because I don't know much about development on windows, I'm not sure this is safe (I have not had a Windows environment either, so all Windows-specific problems have been solved by CI so far). And I'm unwilling to Introduce another environment specific macro if it is possible.

So I think it is safer to use fstream if a filename is passed and add a new function to receive a FILE*. By doing so, we can leave it up to the user to open a FILE*. Since std::vector<char> is used in any case, we can use parse(std::vector<char> file_content, std::string file_name) after loading a file content in those functions.

@lukash
Copy link
Contributor Author

lukash commented Jul 28, 2022

The goal of the PR is not to have a FILE*-based IO, the goal is to not have the fstream based one, because its error handling is very bad.

In your code, you check the good() method after opening the ifstream, but the exception you throw gives no description of the error that has occurred. This is very bad, and there is in fact (believe it or not) no way (as far as I know) to retrieve the actual error description from the ifstream.

After the stream is open, you can set the exception mask via the exceptions() method to get exceptions thrown when performing operations on the ifstream. If you don't have that mask set, you should check the fail bits of the stream after every operation (which can result in an error). It's better to set the mask, but if you do, the exceptions you get still only contain generic text messages with no concrete error descriptions whatsoever.

The unfortunate conclusion I have from this is that the fstream classes are unusable if you want to have any sort of reasonable error handling, which I think is very desirable in a library whose main purpose is to work with files.

I'm not at all happy with using FILE*, but until C++ gets a better file IO API, it seems to me the only reliable solution...

Of course, everything works fine as long as you're not getting errors, and if you do, especially for an advanced user it's usually not hard to figure out what's wrong with reading a file (not found and permissions are by far the most common). But (especially) for a newbie user of a tool (that uses toml11 to work with some of its files), getting a concrete error message is much more helpful. And in the <2% of some more cryptic errors, having a concrete error message is very valuable as it can save a ton of debugging.

@lukash
Copy link
Contributor Author

lukash commented Aug 30, 2022

@ToruNiina would you please respond to my last comment? I find poor error handling a serious issue which is important for deciding whether to even use a library or not.

Copy link
Owner

@ToruNiina ToruNiina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you about we need to improve the error messages regarding file I/O.
So, the problem here is about a FILE*.

The policy of using FILE* has some problems, like

  • functions such as fopen are sometimes problematic on non-ASCII environments. So, additional environment-dependent macros must be included, which makes the code less readable.
  • Since filesystem::path is aware of non-ASCII environments, it is also sometimes problematic when we try to use filesystem::path and fopen at the same time.
  • Most C++ libraries use fstream, so using FILE* violates the principle of least surprise.
  • And this code currently does not run on Windows in CI. (You asked me to fix it, but my primary environment is linux and secondary is macOS. Debugging on windows is not easy for me either.)

I think the advantage of a FILE pointer is errno. But, practically, all the libstdc++, libc++, and microsoft/STL uses FILE* inside and standard library does not set errno zero, so the errno corresponding to the error that occurs inside will be kept at the point where we find stream operation fails even if we use fstream. I said "practically" because standard spec does not say anything about errno in the Input/Output section. But since all the most well known implementation uses this, I think we can assume that errno can be obtained. People who use in-house c++ standard library that implements fstream without FILE* may get a useless error message. But it does not differ from the current state.

So, to me, the advantage seems not to be large enough to ignore the disadvantage. If you use fstream and errno in your patch, I will merge it.

@lukash
Copy link
Contributor Author

lukash commented Sep 2, 2022

I agree with you about we need to improve the error messages regarding file I/O. So, the problem here is about a FILE*.

The policy of using FILE* has some problems, like

* functions such as fopen are sometimes problematic on non-ASCII environments. So, additional environment-dependent macros must be included, which makes the code less readable.

* Since `filesystem::path` is aware of non-ASCII environments, it is also sometimes problematic when we try to use `filesystem::path` and `fopen` at the same time.

This is too vague, sorry I'm not aware of the issues, I assume as long as you use the same encoding as the FS is using, it should work? Though, just mentioning it for completeness, since you seem to dislike the solution in general, we maybe don't need to go down this path any further 🙂.

* Most C++ libraries use `fstream`, so using `FILE*` violates the principle of least surprise.

I don't understand this argument at all. What is the least surprise useful for? The fstream has a lot of objective technical shortcomings. The only very much anecdotal observation I can make is my subjectively pleasant surprise if I use toml11 and find out it has decent error handling due to not using fstream.

* And this code currently does not run on Windows in CI. (You asked me to fix it, but my primary environment is linux and secondary is macOS. Debugging on windows is not easy for me either.)

This is indeed quite problematic.

I think the advantage of a FILE pointer is errno. But, practically, all the libstdc++, libc++, and microsoft/STL uses FILE* inside and standard library does not set errno zero, so the errno corresponding to the error that occurs inside will be kept at the point where we find stream operation fails even if we use fstream. I said "practically" because standard spec does not say anything about errno in the Input/Output section. But since all the most well known implementation uses this, I think we can assume that errno can be obtained. People who use in-house c++ standard library that implements fstream without FILE* may get a useless error message. But it does not differ from the current state.

Since it's not a part of the specification, it's only an assumption that errno is being set. It can also only be set for some of the errors, and it can even have a wrong value for some of them, meaning a misleading and confusing error message, which is the worst kind. Even if I wanted to dive into the source and verify the errno is being set correctly for all the cases, in all standard libraries we care about, it's still not guaranteed to stay that way and it's just a fragile approach to take. Sorry, I don't want to do it that way.

So, to me, the advantage seems not to be large enough to ignore the disadvantage. If you use fstream and errno in your patch, I will merge it.

I'd say it has issues which could be solved, and then the advantages would outweigh the disadvantages, but it's ultimately your code.

So a fallback solution, since I need reasonable error handling, would be to read the files myself and just pass the string to toml11. I can do it using a stringstream, but since toml11 reads the whole contents of the stream into a vector<char>, it'd mean the contents of the file would be in memory twice for the duration of the parsing (and, due to the potential \n being appended to the end of the vector, in the unfortunate coincidence where a reallocation would occur, even three times for a short period).

I was wondering if it would be possible to add an interface that would accept an rvalue of the data (along with a const reference for a more well-rounded API), so that it wouldn't need to do the copy.

You're using vector<char>, are there any special reasons for that? a std::string would be the more conventional container for it, wouldn't it?

@ToruNiina
Copy link
Owner

The reason why we use std::vector<char> instead of std::string is that we don't need any of methods provided by std::string. We only need a container of chars, not a string.
If we use std::string, people who read the code may assume that some string-specific method will be used. But actually we don't use any. It is misleading. Since it is an internal value, we don't need to consider about the intuitiveness of public api but the clarity of the implementation detail.
But when we add a public function that takes a file content to be parsed, as you are concerned, passing a std::vector<char> instead of std::string looks a bit strange. One might want to pass a string literal to the function if we have such a function or do some operation on the file content but it is impractical if the function takes a vector.
A problem with adding functions that takes a file content as a string is that it obscures the meaning of the arguments. If we had a function parse(string), one would be confused whether this string is the filename or the contents of the file.

Note that, as I said before, we also have an option to add a parse function that takes a FILE* keeping other functions intact. I still think it is a good way because we can use a common approach by default and can explicitly specify another approach. There are libraries that supports FILE* in this way. A (hopefully minor) problem with this approach is that one cannot free FILE* until parse ends.

So I think it is safer to use fstream if a filename is passed and add a new function to receive a FILE*. By doing so, we can leave it up to the user to open a FILE*. Since std::vector is used in any case, we can use parse(std::vector file_content, std::string file_name) after loading a file content in those functions.

@lukash
Copy link
Contributor Author

lukash commented Sep 7, 2022

The reason why we use std::vector<char> instead of std::string is that we don't need any of methods provided by std::string. We only need a container of chars, not a string. If we use std::string, people who read the code may assume that some string-specific method will be used. But actually we don't use any. It is misleading. Since it is an internal value, we don't need to consider about the intuitiveness of public api but the clarity of the implementation detail.

You are, however, effectively working with a string. Even if you don't use any std::string specific methods.

But when we add a public function that takes a file content to be parsed, as you are concerned, passing a std::vector<char> instead of std::string looks a bit strange. One might want to pass a string literal to the function if we have such a function or do some operation on the file content but it is impractical if the function takes a vector.

Not only impractical, most people use std::string (the very much universal class for working with strings in C++) and for such API they'd need to copy into a std::vector<char>, so it would be detrimental unless there'd be a very good reason to use the vector...

A problem with adding functions that takes a file content as a string is that it obscures the meaning of the arguments. If we had a function parse(string), one would be confused whether this string is the filename or the contents of the file.

Yes, the signature for that function would be the same, we'd need to call it differently, e.g. parse_text(const std::string & text).

Note that, as I said before, we also have an option to add a parse function that takes a FILE* keeping other functions intact. I still think it is a good way because we can use a common approach by default and can explicitly specify another approach. There are libraries that supports FILE* in this way. A (hopefully minor) problem with this approach is that one cannot free FILE* until parse ends.

You think there won't be any further issues on Windows besides the fopen() call? Then this is a viable alternative. You'd be left with two different implementations of file I/O in the codebase (fstream and FILE *). If you don't mind that, maybe this is slightly better from user's perspective...

For me, the two options (parse(File * f) and parse_text(std::string && text)) are close to 50:50 so let me know which one you'd prefer. For the parse_text() one I'd need to change the std::vector<char> to std::string in the internals. (hopefully I didn't miss anything)

@ToruNiina
Copy link
Owner

I prefer parse(FILE*).

The fstream classes are notorious for their non-existent error handling.

This adds a C-style fILE * IO (fopen(), etc.) alternative interface, so
that if a user needs reliable error handling, they can use that, albeit
more inconvenient, but more robust approach.
Set the exceptions mask so that exceptions are thrown when an I/O error
occurs. Also throw the same exception type when the opening fails.
@lukash
Copy link
Contributor Author

lukash commented Sep 9, 2022

It appears the CI needs approval. I've made the change to have the parse(FILE *) variant. I've added a test for it (didn't like the copy-paste, but wasn't sure what else to do), I think that should still fail on Windows as it should have the same fopen() issue.

I've also made small improvements to the ifstream error handling.

@lukash
Copy link
Contributor Author

lukash commented Sep 16, 2022

Curious the CI has passed on Windows, even with the fopen() call still there...

@ToruNiina what do you think about the new approach?

@lukash
Copy link
Contributor Author

lukash commented Sep 22, 2022

Another gentle ping. @ToruNiina please review if time at all allows, we kind of need reliable error handling...

@ToruNiina ToruNiina merged commit c3a3423 into ToruNiina:master Sep 23, 2022
@ToruNiina
Copy link
Owner

It looks good to me. I have merged it.


If you use char, fopen still works on windows (of course). As I mentioned in the first comment, the reason why I don't prefer to use fopen inside the parse function is that it increases maintenance cost when we use it with some relatively modern C++ features like std::filesystem::path. Windows uses wchar as its native character, so path::native returns wchar but fopen takes char. We need to use _wfopen in such a case.

Also, in some countries (including where I live), windows uses character encoding that was developed/specified in that country by default (though these days it seems to be able to change it to UTF-8 via settings). It makes the situation more complicated because character encoding of the filesystem (UTF-16) can differ from that of the source code file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants