[WIP] Monadical's first implementation of python bindings for libzim #2

Monadical-Inc · 2020-04-01T22:59:19Z

Progress

implement main ZimReader and ZimWriter interface via cython
(partial) finish python packaging setup (setup.py, sdist & bdist setup, tox?, setup README.md md parsing instead of rst)
finish README with more examples, usage, dev instructions, API reference docs, etc.
setup CI to build, test, and release automatically

debug ZimArticle content writes
ask about legacy zim format support
check libzim tests to reproduce and use their test files
make sure all tests are passing (creating, reading & writing)

Goals

Will become the canonical python-facing API for creating and interacting with ZIM files, used both internally by the scrapers and externally by users consuming ZIM files
Will start by exposing the core libzim API used by sotoki and mwoffliner (described below), with potential to expand to the rest of the libzim functionality
Can be partly property-based auto-generated Cython code, based on libzim's .h files
Should be threadsafe, usable with async/await, multiprocessing, or Threading
Should be built and tested w/ CI/CD and good documentation

API Overview

1. Article

ZimArticle constructor (that supports taking binary blobs of arbitrary MIME types)

2. ZimCreator

ZimCreator constructor (we've already got this working)
addArticle (in-progress, a little tricky to deal with passing article blob object)
setMetadata (pretty easy, can just take some key/value pairs)
finalise (super easy)

3. ZimFileReader

ZimReader constructor (we've already got this working)
destroy (can be a method in python land, or we can hook it up to __del__?)
getCountArticles (we've already got this working)
getArticleByUrl (super easy, not much work to implement)
getArticleById (same here)
suggest (easy to call, but returns an iterator that may need some work to translate)
search (easy to call, but returns an iterator that may need some work to translate)

Our work on this so far was inspired by Matthew G.'s existing code from here:

Resources

kelson42 · 2020-04-02T06:09:08Z

I'm not sure to understand why with have Docker stuff in this PR. Could you please explain it?

rgaudin · 2020-04-02T10:05:25Z

Thank you very much!

I haven't tested it already but I will. In the mean time, just taking a quick look at it, I have a few questions:

we should probably have discussed that from day1 but are we all OK with pyzim ? @mgautierfr does that matches other bindings and concurrent projects?
Usage of finalize() seems optional? Is that so? Why?
I think we want only unicode literrals (so no u"some string") as this is py3 only.
why are there .pyc files in the repo?
Is that wise to have ZIM files in the repo for tests? It's 25MB already. @kelson42 ?
I dislike the add_art() vs add_article() methods. Exposed API should be named clearly.
Maybe ZimCreator and ZimReader might benefit from some docstrings.
Naming still, this one is lower-level so it doesn't matter much but the following seems odd:

creator = pyzim.Creator(True)
creator.zim_creation('hola.zim',True,"eng",2048)

Again, this is all from a quick read of the PR ; will test properly. Thanks again 👍

jdcaballerov · 2020-04-02T19:50:01Z

Thank you very much!

I haven't tested it already but I will. In the mean time, just taking a quick look at it, I have a few questions:

we should probably have discussed that from day1 but are we all OK with pyzim ? @mgautierfr does that matches other bindings and concurrent projects?

Usage of finalize() seems optional? Is that so? Why?

I think we want only unicode literrals (so no u"some string") as this is py3 only.

why are there .pyc files in the repo?

Is that wise to have ZIM files in the repo for tests? It's 25MB already. @kelson42 ?

I dislike the add_art() vs add_article() methods. Exposed API should be named clearly.

Maybe ZimCreator and ZimReader might benefit from some docstrings.

Naming still, this one is lower-level so it doesn't matter much but the following seems odd:
creator = pyzim.Creator(True)
creator.zim_creation('hola.zim',True,"eng",2048)
Again, this is all from a quick read of the PR ; will test properly. Thanks again 👍

I'm not sure to understand why with have Docker stuff in this PR. Could you please explain it?

@kelson42 To compile the python extension libzim is needed, since some people might use Mac and libzim is hard to get to compile on a Mac I set up a docker container that can be used for development and testing.

jdcaballerov · 2020-04-02T20:03:28Z

Thank you very much!

I haven't tested it already but I will. In the mean time, just taking a quick look at it, I have a few questions:

@rgaudin

I advice to use the latest HEAD of the branch python3 where I cleaned the code a lot and already addressed some bugs and comments

we should probably have discussed that from day1 but are we all OK with pyzim ? @mgautierfr does that matches other bindings and concurrent projects?

Easily changed, up to you, just bear in mind there are some python naming restrictions for packages.

Usage of finalize() seems optional? Is that so? Why?

No, it's needed it might be an error at the example. finalise calls
creator->finishZimCreation();

I think we want only unicode literrals (so no u"some string") as this is py3 only.

Yes, initially I started developing in python 2.7 but all was moved to python3. It might be a mistake.

why are there .pyc files in the repo?

Mistake.

Is that wise to have ZIM files in the repo for tests? It's 25MB already. @kelson42 ?

Just included for tests, perhaps the ones included with libzim are more appropriate.

I dislike the add_art() vs add_article() methods. Exposed API should be named clearly.

add_art was a test version I used while debugging. Will not be included.

Maybe ZimCreator and ZimReader might benefit from some docstrings.

documentation and docstrings are pending once you are ok with the API.

Naming still, this one is lower-level so it doesn't matter much but the following seems odd:
creator = pyzim.Creator(True)
creator.zim_creation('hola.zim',True,"eng",2048)

It's was changed, the example file included was outdated.

mgautierfr · 2020-04-06T17:20:51Z

I haven't made a review of your code, but just base on previous comments :

@kelson42 To compile the python extension libzim is needed, since some people might use Mac and libzim is hard to get to compile on a Mac I set up a docker container that can be used for development and testing.

To use the python extension you also need libzim. And you wouldn't set a docker container (if you can) to use a library. Why compile libzim is hard on Mac ? kiwix-build is here for that.

No, it's needed it might be an error at the example. finalise calls
creator->finishZimCreation();

I would prefer to keep the method finishZimCreation() alone. And memory management shouldn't be exposed to the user. The cpp instance must be deleted when the gc deletes the python object.

Is that wise to have ZIM files in the repo for tests? It's 25MB already. @kelson42 ?

That is a real question that goes beyond this project. libzim also have some test zim and I somehow dislike this.
We could store them in a server, but then the test process would have to download it each time (or implement some kind of cache).
Anyway, those 25MB are already in the repository. Remove them will not reduce the repository size.

creator.zim_creation('hola.zim',True,"eng",2048)

I agree with @rgaudin comment and I would add that function with to many arguments is difficult to use. A better way would be to force the use of
keyword args : ...creation('hola.zim', lang="eng", clusterSize=2048, verbose=True)

This reverts commit 38d5c7a.

jdcaballerov · 2020-04-09T21:29:45Z

FYI, when trying to build on macOS (I have libzim installed), I get an error

Need to check this. Whats the compiler on your Mac ? I guess is related to the virtual destructor or some missing virtual.

clang called as gcc (default macOS setup)

better __repr__ of objects like ZimReader (filename?), ZimArticle (title ? url?), ZimCreator (filename? lang?)

Issue #4

I'll add what you consider is better.

OK, please use the following for now:

# ZimArticle
def __repr__(self):
        return f"{self.__class__.__name__}(url={self.longurl}, title=)"

# ZimReader & ZimCreator (requires the filename getter)
def __repr__(self):
        return f"{self.__class__.__name__}(filename={self.filename})"

Done ✅

shouldn't we get a generator on reader.search() ? Do the libzim returns a list as-is? It's frequent to have thousands of results on large ZIM files.

I did the same as node-libzim getting 10 results and passing back a string vector. File.h returns Search class unique pointers.
 std::unique_ptr<Search> search(const std::string& query, int start, int end) const;
 std::unique_ptr<Search> suggestions(const std::string& query, int start, int end) const;
Ah ok I see. that's unfortunate. I didn't realized you hardcoded that. We should definitely be able to set those. But we'd need the number of results (from get_matches_estimated()) as well. How do you plan to address that?

@rgaudin I'll first need to understand what you want to do with Search.

I might get a Search pointer from File.h search function, once in python search function I could use the iterator and yield. But please let me know What's the use case of get_matches_estimated() ?

Issue #5

already mentioned (upstream bug?) but

I can restrict (or raise an exception) getting those in Cython code if it's not a redirect if you agree.

@mgautierfr should answer that. I'd prefer this situation not to happen but ideally that should be fixed upstream.

I'll recheck tomorrow how redirects are used.

probably upstream limitation but requiring main_page on ZimCreator creation seems weird. You allow to set it to blank but can't change it afterwards?

I used the same construct as node-libzim that fixes it from creation.
ZimCreator must implement a function getMainPage so that when finishZimCreation() (C++) is called the header is filled with it fillHeader().
With the current design I can overload the wrapper function and add a setter if needed.

I think it makes sense, yes.

Issue #6

you chose to Pascalize meta key names. The documentation doesn't require it. It just happens to be the case (ahah) for currently defined ones but as stated there, it's extendable. @kelson42 ?

related, it seems that get_metadata only returns those predefined keys while ZimCreator allows setting any. Confusing.

Yes it's correct, maybe we can put there only the mandatory metadata. Any metadata can be accessed getting an article by url

Yes, let's harmonize to only manage official ones this way and have the article mechanism for the rest ; if @kelson42 and @mgautierfr agrees.

Issue #7

crashes when requesting by id on an invalid Id

Upstream. I'll try to capture the exception in Cython.

Yeah we definitely not want the runtime to blow-up on mistakes like this one. Other similar crash scenario, creating a ZimCreator for which the libzim temp folder exists (xxx.tmp) ; or a ZimReader on a non-existing file…
 Article File::getArticle(article_index_type idx) const
  {
    if (idx >= article_index_type(impl->getCountArticles()))
      throw ZimFileFormatError("article index out of range");
    return Article(impl, idx);
  }
That would work but I wonder if there would be a way to catch the actual cpp exception and convert it to a python one. That would be safer than overcoming all the possible incidents from the wrapper.

On that z.tmp folder, it is created on ZimCreator("z.zim") along with z.tmp.idx.tmp and z.tmp_title.idx.tmp but if the creator is not finalized, it stays on disk. Should we remove it on garbage collection?

Done ✅ All exceptions will be catched and translated to Cython

I see finalise() which is british english. I believe we use US English (finalize()). How's it in the libzim? @kelson42 ?

libzim uses finishZimCreation(), finalise is used in node-libzim.

Ah! well that explains it as @ISNIT0 is British. Let's switch it to finalize while it's a minor change.

Done ✅

we have no access to params passed to ZimCreator() : main_page, index_language, etc.

I can add properties if you agree on.

Yes, please.

Done ✅

there's an unsafe use of memory. if you use the same variable for different articles, the articles count of the zim there are added to gets messed-up
Now add both to kiwix-serve and you'll see that z1 has 1 article while z2 has 2 articles. If we were to reuse that same article many times, we'd the the counter increase by that number.

I don't understand completely what's kiwix-serve? can you give me the whole example ?

kiwix-serve is a web server serving ZIM files that uses the kiwix-lib which itself used the libzim. On its homepage, there's a counter showing the number of html articles. See reader.cpp
/* Get the count of articles which can be indexed/displayed */
unsigned int Reader::getArticleCount() const
{
  std::map<const std::string, unsigned int> counterMap
      = this->parseCounterMetadata();
  unsigned int counter = 0;

  if (counterMap.empty()) {
    counter = this->nsACount;
  } else {
    auto it = counterMap.find("text/html");
    if (it != counterMap.end()) {
      counter = it->second;
    }
  }

  return counter;
}
You can find binary here or use docker image kiwix/kiwix-serve:latest. Used kiwix-serve -p 9999 ~/src/python-libzim/z*.zim

I think this is related to how I am constructing the string, perhaps missing a semicolon or =
issue #8

last but not least, ZimArticle only accepts content as str. How am I suppose to add binary data to the ZIM ? Are we not there yet or is that a mistake ?

For the first version I was using auto conversion from Cython. Can be changed to accept bytes.

Well we definitely need to be able to set bytes. If we could also set/get text without having to manually encode/decode, that would be great.

Done ✅

The constructor will not type content. Leave bytes untouched if content is str content is encoded to utf-8:

    def __cinit__(self, url="", content="", namespace= "A", mimetype= "text/html", title="", redirect_article_url= "",filename="", should_index=True ):

It will encode if it's a python str or pass directly if bytes are inputed:

bytes_content =b''
        if isinstance(content, str):
            bytes_content = content.encode('UTF-8')
        else:
            bytes_content = content

The getter and setter are also adjusted so that bytes pass transparently if some Unicode Error arises (will need to check this assumption)

@property
    def content(self):
        """Get the article's content"""
        data = self.c_zim_article.content 
        try:
            return data.decode('UTF-8')
        except UnicodeDecodeError:
            return data

    @content.setter
    def content(self, new_content):
        """Set the article's content"""
        if isinstance(new_content,str):
            self.c_zim_article.content = new_content.encode('UTF-8') 
        else:
            self.c_zim_article.content = new_content

Example use case

setup.py

…embers

…adata

kelson42 · 2020-04-12T16:05:04Z

* we should probably have discussed that from day1 but are we all OK with `pyzim` ? @mgautierfr does that matches other bindings and concurrent projects?

To avoid any confusion, we should call it from the name of the repo I have created.

kelson42 · 2020-04-12T16:05:44Z

Is that wise to have ZIM files in the repo for tests? It's 25MB already. @kelson42 ?

This is a bit heavy indeed. Why so big? A small ZIM file should be enough to run all the unit tests.

rgaudin · 2020-04-12T16:17:23Z

* we should probably have discussed that from day1 but are we all OK with `pyzim` ? @mgautierfr does that matches other bindings and concurrent projects?
To avoid any confusion, we should call it from the name of the repo I have created.

this is not a valid python module name. we could have:

python_libzim not recommended to use python in lib name and it’s long.
pylibzim sounds a lot better and matches the repo somewhat.
libzim seems obvious and definitely matches the repo (minus the platform). has it been used already ?

I like pyzim but there are already 2 other projects with that name I think.

kelson42 · 2020-04-12T16:19:29Z

@rgaudin in pypi, it should be libzim

rgaudin · 2020-04-12T16:21:35Z

@rgaudin in pypi, it should be libzim

then that’s settled. thanks

pirate · 2020-04-12T18:44:58Z

Ok, the biggest remaining things to do are:

finalize package naming and make it consistent everywhere
fix last remaining tickets Get the number of results (from `get_matches_estimated() #5
fix last remaining ticket Define Metadata #7 and
merge PRs
setup CI/CD for black, tests, build, and deploy to pypi Setup CI/CD with black, tests, and cython build + autorelease to py #9
start work to hook up sotoki in parallel: Rewrite using cutting-edge python-libzim sotoki#130

jdcaballerov · 2020-04-14T15:04:07Z

No need for a deepcopy here. If self._metadata_keys is a dict as you define it self._metadata = dict(self._metadata_keys) is enough.
If it is a list of keys (as it should be) : self._metadata = {k:None for k in METADATA_KEYS}

Agree.

pirate · 2020-04-17T20:31:46Z

Since all efforts are now focused on the blend approach in #3 , we're closing this PR for now.

mgautierfr · 2020-04-20T11:50:20Z

Not everything is to throw here.
The reading part (except the article) is pretty good. You can keep it and just rewrite the read article wrapper is a simpler form.

jdcaballerov · 2020-04-20T13:15:48Z

Not everything is to throw here.
The reading part (except the article) is pretty good. You can keep it and just rewrite the read article wrapper is a simpler form.

That's the plan. To reuse most of the reader code perhaps naming as ZimFileReader, ZimFileArticle to avoid confusing with article definitions, but without that particular article implementation. Just simple wrappers for File.h and Article.h (zim::)

Monadical-Inc changed the title ~~Monadical's first implementation of python bindings for libzim~~ [WIP] Monadical's first implementation of python bindings for libzim Apr 1, 2020

pirate assigned mgautierfr, pirate, jdcaballerov and kelson42 Apr 1, 2020

pirate requested review from pirate, jdcaballerov, kelson42 and mgautierfr April 1, 2020 23:05

pirate unassigned mgautierfr and kelson42 Apr 1, 2020

jdcaballerov added 11 commits April 6, 2020 15:02

Initial version python 2.7

f67e296

Change Cython language version

537759a

Python 3 corrections

20c32c8

Revert "Change Cython language version"

8353a40

This reverts commit 38d5c7a.

Rename README.rst -> README.md

78480be

Change Readme.rst -> Readme.md in setup.py

1b054f0

Refactor to use Cython properties

9f74ba7

Delete test.py

7a7d442

Update all the examples cookbook

fe7452c

Add get_metadata dict to ZimReader

57d1665

Delete tests files at start and ending

60faf37

pirate force-pushed the dev branch from 0c3f015 to 60faf37 Compare April 6, 2020 19:02

pirate and others added 2 commits April 6, 2020 15:07

fix README formatting

d835898

Delete add_art from example

73a7a64

Add property getters for ZimCreator parameters

00e8eb8

pirate approved these changes Apr 10, 2020

View reviewed changes

setup.py Show resolved Hide resolved

setup.py Outdated Show resolved Hide resolved

jdcaballerov added 6 commits April 10, 2020 09:08

Add feature mainPage setter

9e45d5c

Add independent metadata dict for each Creator

b4efd90

Add virtual default destructor to ZimArticle and missing virtual to m…

51e348c

…embers

Mark get_metadata member function as internal for ZimCreator

5128f91

Change ZimReader get_metadata() name and to return only mandatory met…

9cb954b

…adata

Add c++ 11

6903615

pirate added 2 commits April 12, 2020 12:23

Update setup.py

be3df1d

Update setup.py

49389e8

jdcaballerov added 5 commits April 13, 2020 11:15

WIP new API search and suggestions

40c0035

Merge branch 'dev' of github.com:openzim/python-libzim into dev

aaffabf

Add search, suggestions iterators and counts

054e763

Change pyzim for libzim everywhere

e477e12

Change set_metada for update_metadata for Creator, update docstrings

45f168f

Add except to try

279186b

pirate closed this Apr 17, 2020

jdcaballerov mentioned this pull request Apr 28, 2020

Writing part must be in a submodule writer #17

Closed

kelson42 deleted the dev branch July 6, 2020 08:14

Uh oh!

[WIP] Monadical's first implementation of python bindings for libzim #2

[WIP] Monadical's first implementation of python bindings for libzim #2

Uh oh!

Conversation

Monadical-Inc commented Apr 1, 2020 • edited by pirate Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Goals

API Overview

1. Article

2. ZimCreator

3. ZimFileReader

Resources

Uh oh!

kelson42 commented Apr 2, 2020

Uh oh!

rgaudin commented Apr 2, 2020

Uh oh!

jdcaballerov commented Apr 2, 2020

Uh oh!

jdcaballerov commented Apr 2, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mgautierfr commented Apr 6, 2020

Uh oh!

jdcaballerov commented Apr 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kelson42 commented Apr 12, 2020

Uh oh!

kelson42 commented Apr 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rgaudin commented Apr 12, 2020

Uh oh!

kelson42 commented Apr 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rgaudin commented Apr 12, 2020

Uh oh!

pirate commented Apr 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jdcaballerov commented Apr 14, 2020

Uh oh!

pirate commented Apr 17, 2020

Uh oh!

mgautierfr commented Apr 20, 2020

Uh oh!

jdcaballerov commented Apr 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Monadical-Inc commented Apr 1, 2020 •

edited by pirate

Loading

jdcaballerov commented Apr 2, 2020 •

edited

Loading

jdcaballerov commented Apr 9, 2020 •

edited

Loading

kelson42 commented Apr 12, 2020 •

edited

Loading

kelson42 commented Apr 12, 2020 •

edited

Loading

pirate commented Apr 12, 2020 •

edited

Loading

jdcaballerov commented Apr 20, 2020 •

edited

Loading