Skip to content

[WIP] Monadical's first implementation of python bindings for libzim #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 44 commits into from

Conversation

Monadical-Inc
Copy link

@Monadical-Inc Monadical-Inc commented Apr 1, 2020

Progress

  • implement main ZimReader and ZimWriter interface via cython
  • (partial) finish python packaging setup (setup.py, sdist & bdist setup, tox?, setup README.md md parsing instead of rst)
  • finish README with more examples, usage, dev instructions, API reference docs, etc.
  • setup CI to build, test, and release automatically
  • debug ZimArticle content writes
  • ask about legacy zim format support
  • check libzim tests to reproduce and use their test files
  • make sure all tests are passing (creating, reading & writing)

Goals

  • Will become the canonical python-facing API for creating and interacting with ZIM files, used both internally by the scrapers and externally by users consuming ZIM files
  • Will start by exposing the core libzim API used by sotoki and mwoffliner (described below), with potential to expand to the rest of the libzim functionality
  • Can be partly property-based auto-generated Cython code, based on libzim's .h files
  • Should be threadsafe, usable with async/await, multiprocessing, or Threading
  • Should be built and tested w/ CI/CD and good documentation

API Overview

1. Article

  • ZimArticle constructor (that supports taking binary blobs of arbitrary MIME types)

2. ZimCreator

  • ZimCreator constructor (we've already got this working)
  • addArticle (in-progress, a little tricky to deal with passing article blob object)
  • setMetadata (pretty easy, can just take some key/value pairs)
  • finalise (super easy)

3. ZimFileReader

  • ZimReader constructor (we've already got this working)
  • destroy (can be a method in python land, or we can hook it up to __del__?)
  • getCountArticles (we've already got this working)
  • getArticleByUrl (super easy, not much work to implement)
  • getArticleById (same here)
  • suggest (easy to call, but returns an iterator that may need some work to translate)
  • search (easy to call, but returns an iterator that may need some work to translate)

Our work on this so far was inspired by Matthew G.'s existing code from here:


Resources

@Monadical-Inc Monadical-Inc changed the title Monadical's first implementation of python bindings for libzim [WIP] Monadical's first implementation of python bindings for libzim Apr 1, 2020
@kelson42
Copy link
Contributor

kelson42 commented Apr 2, 2020

I'm not sure to understand why with have Docker stuff in this PR. Could you please explain it?

@rgaudin
Copy link
Member

rgaudin commented Apr 2, 2020

Thank you very much!

I haven't tested it already but I will. In the mean time, just taking a quick look at it, I have a few questions:

  • we should probably have discussed that from day1 but are we all OK with pyzim ? @mgautierfr does that matches other bindings and concurrent projects?
  • Usage of finalize() seems optional? Is that so? Why?
  • I think we want only unicode literrals (so no u"some string") as this is py3 only.
  • why are there .pyc files in the repo?
  • Is that wise to have ZIM files in the repo for tests? It's 25MB already. @kelson42 ?
  • I dislike the add_art() vs add_article() methods. Exposed API should be named clearly.
  • Maybe ZimCreator and ZimReader might benefit from some docstrings.
  • Naming still, this one is lower-level so it doesn't matter much but the following seems odd:
creator = pyzim.Creator(True)
creator.zim_creation('hola.zim',True,"eng",2048)

Again, this is all from a quick read of the PR ; will test properly. Thanks again 👍

@jdcaballerov
Copy link
Contributor

Thank you very much!

I haven't tested it already but I will. In the mean time, just taking a quick look at it, I have a few questions:

  • we should probably have discussed that from day1 but are we all OK with pyzim ? @mgautierfr does that matches other bindings and concurrent projects?
  • Usage of finalize() seems optional? Is that so? Why?
  • I think we want only unicode literrals (so no u"some string") as this is py3 only.
  • why are there .pyc files in the repo?
  • Is that wise to have ZIM files in the repo for tests? It's 25MB already. @kelson42 ?
  • I dislike the add_art() vs add_article() methods. Exposed API should be named clearly.
  • Maybe ZimCreator and ZimReader might benefit from some docstrings.
  • Naming still, this one is lower-level so it doesn't matter much but the following seems odd:
creator = pyzim.Creator(True)
creator.zim_creation('hola.zim',True,"eng",2048)

Again, this is all from a quick read of the PR ; will test properly. Thanks again 👍

I'm not sure to understand why with have Docker stuff in this PR. Could you please explain it?

@kelson42 To compile the python extension libzim is needed, since some people might use Mac and libzim is hard to get to compile on a Mac I set up a docker container that can be used for development and testing.

@jdcaballerov
Copy link
Contributor

jdcaballerov commented Apr 2, 2020

Thank you very much!

I haven't tested it already but I will. In the mean time, just taking a quick look at it, I have a few questions:

@rgaudin

I advice to use the latest HEAD of the branch python3 where I cleaned the code a lot and already addressed some bugs and comments

  • we should probably have discussed that from day1 but are we all OK with pyzim ? @mgautierfr does that matches other bindings and concurrent projects?

Easily changed, up to you, just bear in mind there are some python naming restrictions for packages.

  • Usage of finalize() seems optional? Is that so? Why?

No, it's needed it might be an error at the example. finalise calls
creator->finishZimCreation();

  • I think we want only unicode literrals (so no u"some string") as this is py3 only.

Yes, initially I started developing in python 2.7 but all was moved to python3. It might be a mistake.

  • why are there .pyc files in the repo?

Mistake.

  • Is that wise to have ZIM files in the repo for tests? It's 25MB already. @kelson42 ?

Just included for tests, perhaps the ones included with libzim are more appropriate.

  • I dislike the add_art() vs add_article() methods. Exposed API should be named clearly.

add_art was a test version I used while debugging. Will not be included.

  • Maybe ZimCreator and ZimReader might benefit from some docstrings.

documentation and docstrings are pending once you are ok with the API.

  • Naming still, this one is lower-level so it doesn't matter much but the following seems odd:
creator = pyzim.Creator(True)
creator.zim_creation('hola.zim',True,"eng",2048)

It's was changed, the example file included was outdated.

@mgautierfr
Copy link
Collaborator

I haven't made a review of your code, but just base on previous comments :

@kelson42 To compile the python extension libzim is needed, since some people might use Mac and libzim is hard to get to compile on a Mac I set up a docker container that can be used for development and testing.

To use the python extension you also need libzim. And you wouldn't set a docker container (if you can) to use a library. Why compile libzim is hard on Mac ? kiwix-build is here for that.

No, it's needed it might be an error at the example. finalise calls
creator->finishZimCreation();

I would prefer to keep the method finishZimCreation() alone. And memory management shouldn't be exposed to the user. The cpp instance must be deleted when the gc deletes the python object.

Is that wise to have ZIM files in the repo for tests? It's 25MB already. @kelson42 ?

That is a real question that goes beyond this project. libzim also have some test zim and I somehow dislike this.
We could store them in a server, but then the test process would have to download it each time (or implement some kind of cache).
Anyway, those 25MB are already in the repository. Remove them will not reduce the repository size.

creator.zim_creation('hola.zim',True,"eng",2048)

I agree with @rgaudin comment and I would add that function with to many arguments is difficult to use. A better way would be to force the use of
keyword args : ...creation('hola.zim', lang="eng", clusterSize=2048, verbose=True)

@jdcaballerov
Copy link
Contributor

jdcaballerov commented Apr 9, 2020

FYI, when trying to build on macOS (I have libzim installed), I get an error

Need to check this. Whats the compiler on your Mac ? I guess is related to the virtual destructor or some missing virtual.

clang called as gcc (default macOS setup)

  • better __repr__ of objects like ZimReader (filename?), ZimArticle (title ? url?), ZimCreator (filename? lang?)

Issue #4

I'll add what you consider is better.

OK, please use the following for now:

# ZimArticle
def __repr__(self):
        return f"{self.__class__.__name__}(url={self.longurl}, title=)"

# ZimReader & ZimCreator (requires the filename getter)
def __repr__(self):
        return f"{self.__class__.__name__}(filename={self.filename})"

Done ✅

  • shouldn't we get a generator on reader.search() ? Do the libzim returns a list as-is? It's frequent to have thousands of results on large ZIM files.

I did the same as node-libzim getting 10 results and passing back a string vector. File.h returns Search class unique pointers.

 std::unique_ptr<Search> search(const std::string& query, int start, int end) const;
 std::unique_ptr<Search> suggestions(const std::string& query, int start, int end) const;

Ah ok I see. that's unfortunate. I didn't realized you hardcoded that. We should definitely be able to set those. But we'd need the number of results (from get_matches_estimated()) as well. How do you plan to address that?

@rgaudin I'll first need to understand what you want to do with Search.

I might get a Search pointer from File.h search function, once in python search function I could use the iterator and yield. But please let me know What's the use case of get_matches_estimated() ?

Issue #5

  • already mentioned (upstream bug?) but

I can restrict (or raise an exception) getting those in Cython code if it's not a redirect if you agree.

@mgautierfr should answer that. I'd prefer this situation not to happen but ideally that should be fixed upstream.

I'll recheck tomorrow how redirects are used.

  • probably upstream limitation but requiring main_page on ZimCreator creation seems weird. You allow to set it to blank but can't change it afterwards?

I used the same construct as node-libzim that fixes it from creation.
ZimCreator must implement a function getMainPage so that when finishZimCreation() (C++) is called the header is filled with it fillHeader().
With the current design I can overload the wrapper function and add a setter if needed.

I think it makes sense, yes.

Issue #6

  • you chose to Pascalize meta key names. The documentation doesn't require it. It just happens to be the case (ahah) for currently defined ones but as stated there, it's extendable. @kelson42 ?
  • related, it seems that get_metadata only returns those predefined keys while ZimCreator allows setting any. Confusing.

Yes it's correct, maybe we can put there only the mandatory metadata. Any metadata can be accessed getting an article by url

Yes, let's harmonize to only manage official ones this way and have the article mechanism for the rest ; if @kelson42 and @mgautierfr agrees.

Issue #7

  • crashes when requesting by id on an invalid Id

Upstream. I'll try to capture the exception in Cython.

Yeah we definitely not want the runtime to blow-up on mistakes like this one. Other similar crash scenario, creating a ZimCreator for which the libzim temp folder exists (xxx.tmp) ; or a ZimReader on a non-existing file…

 Article File::getArticle(article_index_type idx) const
  {
    if (idx >= article_index_type(impl->getCountArticles()))
      throw ZimFileFormatError("article index out of range");
    return Article(impl, idx);
  }

That would work but I wonder if there would be a way to catch the actual cpp exception and convert it to a python one. That would be safer than overcoming all the possible incidents from the wrapper.

On that z.tmp folder, it is created on ZimCreator("z.zim") along with z.tmp.idx.tmp and z.tmp_title.idx.tmp but if the creator is not finalized, it stays on disk. Should we remove it on garbage collection?

Done ✅ All exceptions will be catched and translated to Cython
Screen Shot 2020-04-09 at 16 16 45

  • I see finalise() which is british english. I believe we use US English (finalize()). How's it in the libzim? @kelson42 ?

libzim uses finishZimCreation(), finalise is used in node-libzim.

Ah! well that explains it as @ISNIT0 is British. Let's switch it to finalize while it's a minor change.

Done ✅

  • we have no access to params passed to ZimCreator() : main_page, index_language, etc.

I can add properties if you agree on.

Yes, please.

Done ✅

  • there's an unsafe use of memory. if you use the same variable for different articles, the articles count of the zim there are added to gets messed-up
    Now add both to kiwix-serve and you'll see that z1 has 1 article while z2 has 2 articles. If we were to reuse that same article many times, we'd the the counter increase by that number.

I don't understand completely what's kiwix-serve? can you give me the whole example ?

kiwix-serve is a web server serving ZIM files that uses the kiwix-lib which itself used the libzim. On its homepage, there's a counter showing the number of html articles. See reader.cpp

/* Get the count of articles which can be indexed/displayed */
unsigned int Reader::getArticleCount() const
{
  std::map<const std::string, unsigned int> counterMap
      = this->parseCounterMetadata();
  unsigned int counter = 0;

  if (counterMap.empty()) {
    counter = this->nsACount;
  } else {
    auto it = counterMap.find("text/html");
    if (it != counterMap.end()) {
      counter = it->second;
    }
  }

  return counter;
}

Screen Shot 2020-04-09 at 16 02 49

You can find binary here or use docker image kiwix/kiwix-serve:latest. Used kiwix-serve -p 9999 ~/src/python-libzim/z*.zim

I think this is related to how I am constructing the string, perhaps missing a semicolon or =
issue #8

  • last but not least, ZimArticle only accepts content as str. How am I suppose to add binary data to the ZIM ? Are we not there yet or is that a mistake ?

For the first version I was using auto conversion from Cython. Can be changed to accept bytes.

Well we definitely need to be able to set bytes. If we could also set/get text without having to manually encode/decode, that would be great.

Done ✅

The constructor will not type content. Leave bytes untouched if content is str content is encoded to utf-8:

    def __cinit__(self, url="", content="", namespace= "A", mimetype= "text/html", title="", redirect_article_url= "",filename="", should_index=True ):

It will encode if it's a python str or pass directly if bytes are inputed:

bytes_content =b''
        if isinstance(content, str):
            bytes_content = content.encode('UTF-8')
        else:
            bytes_content = content

The getter and setter are also adjusted so that bytes pass transparently if some Unicode Error arises (will need to check this assumption)

@property
    def content(self):
        """Get the article's content"""
        data = self.c_zim_article.content 
        try:
            return data.decode('UTF-8')
        except UnicodeDecodeError:
            return data

    @content.setter
    def content(self, new_content):
        """Set the article's content"""
        if isinstance(new_content,str):
            self.c_zim_article.content = new_content.encode('UTF-8') 
        else:
            self.c_zim_article.content = new_content 

Example use case

Screen Shot 2020-04-09 at 16 35 51

@kelson42
Copy link
Contributor

* we should probably have discussed that from day1 but are we all OK with `pyzim` ? @mgautierfr does that matches other bindings and concurrent projects?

To avoid any confusion, we should call it from the name of the repo I have created.

@kelson42
Copy link
Contributor

kelson42 commented Apr 12, 2020

Is that wise to have ZIM files in the repo for tests? It's 25MB already. @kelson42 ?

This is a bit heavy indeed. Why so big? A small ZIM file should be enough to run all the unit tests.

@rgaudin
Copy link
Member

rgaudin commented Apr 12, 2020

* we should probably have discussed that from day1 but are we all OK with `pyzim` ? @mgautierfr does that matches other bindings and concurrent projects?

To avoid any confusion, we should call it from the name of the repo I have created.

this is not a valid python module name. we could have:

  • python_libzim not recommended to use python in lib name and it’s long.
  • pylibzim sounds a lot better and matches the repo somewhat.
  • libzim seems obvious and definitely matches the repo (minus the platform). has it been used already ?

I like pyzim but there are already 2 other projects with that name I think.

@kelson42
Copy link
Contributor

kelson42 commented Apr 12, 2020

@rgaudin in pypi, it should be libzim

@rgaudin
Copy link
Member

rgaudin commented Apr 12, 2020

@rgaudin in pypi, it should be libzim

then that’s settled. thanks

@pirate
Copy link
Contributor

pirate commented Apr 12, 2020

Ok, the biggest remaining things to do are:

@jdcaballerov
Copy link
Contributor

No need for a deepcopy here. If self._metadata_keys is a dict as you define it self._metadata = dict(self._metadata_keys) is enough.
If it is a list of keys (as it should be) : self._metadata = {k:None for k in METADATA_KEYS}

Agree.

@pirate
Copy link
Contributor

pirate commented Apr 17, 2020

Since all efforts are now focused on the blend approach in #3 , we're closing this PR for now.

@pirate pirate closed this Apr 17, 2020
@mgautierfr
Copy link
Collaborator

Not everything is to throw here.
The reading part (except the article) is pretty good. You can keep it and just rewrite the read article wrapper is a simpler form.

@jdcaballerov
Copy link
Contributor

jdcaballerov commented Apr 20, 2020

Not everything is to throw here.
The reading part (except the article) is pretty good. You can keep it and just rewrite the read article wrapper is a simpler form.

That's the plan. To reuse most of the reader code perhaps naming as ZimFileReader, ZimFileArticle to avoid confusing with article definitions, but without that particular article implementation. Just simple wrappers for File.h and Article.h (zim::)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants