Add Page Inputs #2

victor-torres · 2020-08-03T19:46:40Z

No description provided.

codecov-commenter · 2020-08-03T19:47:40Z

Codecov Report

❗ No coverage uploaded for pull request base (master@2915e0a).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master       #2   +/-   ##
=========================================
  Coverage          ?   98.07%           
=========================================
  Files             ?        2           
  Lines             ?      104           
  Branches          ?        0           
=========================================
  Hits              ?      102           
  Misses            ?        2           
  Partials          ?        0

ejulio

LGTM 👏
Just a minor thing with typing :)

ejulio · 2020-08-04T12:52:08Z

autoextract_poet/page_inputs.py

+
+    data: dict
+
+    def to_items(self) -> Optional[List[Item]]:


It seems from_list always returns a dict, so it shouldn't be optional.
Or is there some other use case?

Also, in the other PR we had a conversation on the return type.
So, I guess it follows here, right?
Since this is the base class, it should have no return type, as you can't infer it from the base class.
Even though a product is an item, in the other PR this was opted out of because of the many complications it could cause.

Even though a product is an item, in the other PR this was opted out of because of the many complications it could cause.

This. I don't think we'll need to override this method.

autoextract_poet/page_inputs.py

kmike · 2020-08-06T12:20:52Z

autoextract_poet/page_inputs.py

+    data: dict
+
+    def to_items(self) -> Optional[List[Item]]:
+        return self.item_class.from_list(


A batch request may contain records of different data types, e.g. some records can be articles and some can be products. What do you think about making a single record the input, not a list? I.e. don't put the raw AutoExtract response in self.data, but require a single entry from the list.

Agree. Here it should be just a single entry, no list.

Method was renamed to singular (to_item) and we're using from_dict instead of from_list now as suggested.
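
A minimal sketch of the single-record idea discussed above, assuming each entry of the AutoExtract response is a dict keyed by its page type (for example "article" or "product"). The helper name select_records and the sample payload are hypothetical and not part of this PR.

```python
from typing import Dict, List


def select_records(response: List[Dict], item_key: str) -> List[Dict]:
    """Return only the entries of a batch response that carry `item_key`."""
    return [record for record in response if item_key in record]


# Hypothetical batch response mixing two page types.
batch = [
    {"article": {"headline": "Hello"}},
    {"product": {"name": "Widget"}},
    {"article": {"headline": "World"}},
]

# Each selected entry could then be passed, one at a time, to a page input.
articles = select_records(batch, "article")
products = select_records(batch, "product")
```

With this split, a page input only ever sees one entry that was already selected for its item type.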

autoextract_poet/page_inputs.py

ivanprado · 2020-08-06T12:32:39Z

autoextract_poet/page_inputs.py

+
+
+@attr.s(auto_attribs=True)
+class ResponseData:


I think generics can be leveraged here to have better type checking. Example:

    from typing import TypeVar, Generic, ClassVar, Type


    class Item:
        pass


    class Product(Item):
        pass


    T = TypeVar('T', bound=Item)


    class ResponseData(Generic[T]):
        item_class: ClassVar[Type[T]]

        def to_items(self) -> T:
            return self.item_class()


    class ProductResponseData(ResponseData[Product]):
        item_class = Product


    print(type(ProductResponseData().to_items()))

@victor-torres what about this proposal?

I might have missed something here but in my tests, it didn't present any advantage over a simple type annotation such as Type["Item"].

PyCharm was still identifying return type as Item instead of Product (an Item subclass) when the to_items method was being called from an instance of a Product class.

What happens in your test is that you're checking the type dynamically, while mypy, PyCharm, and other tools perform a static type check. It's not the same thing. I think @ejulio's example followed the same approach as yours.

mypy is able to perform better checking. mypy fails with:

error: "Item" has no attribute "product_method"

on this code:

    from typing import ClassVar, Type


    class Item:
        pass


    class Product(Item):
        def product_method(self):
            pass


    class ResponseData():
        item_class: ClassVar[Type[Item]]

        def to_items(self) -> Item:
            return self.item_class()


    class ProductResponseData(ResponseData):
        item_class = Product


    p = ProductResponseData()
    item = p.to_items()
    item.product_method()

But it doesn't fail with this code:

    from typing import ClassVar, Generic, Type, TypeVar


    class Item:
        pass


    class Product(Item):
        def product_method(self):
            pass


    T = TypeVar('T', bound=Item)


    class ResponseData(Generic[T]):
        item_class: ClassVar[Type[T]]

        def to_items(self) -> T:
            return self.item_class()


    class ProductResponseData(ResponseData[Product]):
        item_class = Product


    p = ProductResponseData()
    item = p.to_items()
    item.product_method()

Also, PyCharm is able to suggest product_method for item in the latter case but not in the first.

I got the idea from https://mypy.readthedocs.io/en/stable/generics.html?highlight=generics

Also, PyCharm is able to suggest product_method for item in the latter case but not in the first.

It will work because we're talking about a class attribute specific to the Product class.

The problem I'm trying to describe here is that AFAIK there's no way to subclass the base class without having to override those class methods to annotate them with the proper return type.

We're also probably mixing two different contexts here: Page Inputs and Items.

https://stackoverflow.com/questions/39205527/can-you-annotate-return-type-when-value-is-instance-of-cls/39205612#39205612

How to add hint to factory method? python/typing#58

SelfType or another way to spell "type of self" (or, How to define a copy() function) python/mypy#1212

Type of the same class inside the class python/mypy#3661

In your example above, PyCharm still shows the return type of ProductResponseData.to_item() as being T and not Product as we would expect.

@victor-torres PyCharm correctly infers that item is a Product in @ivanprado's item = p.to_items() example, even if it shows that to_items() returns T. In @ivanprado's first example PyCharm shows a warning at item.product_method(), but not in the second example. I guess it is the same with mypy. Could you please check it again?

This is kind of important, because without proper typing support we lose many of the advantages of having these Item classes.

@victor-torres PyCharm correctly infers that item is a Product in @ivanprado's item = p.to_items() example, even if it shows that to_items() returns T.

That's true. I was looking at the return type of the to_item function only.

I've just implemented @ivanprado's suggestion.

ivanprado

Thanks @victor-torres. I went through the PR and left some comments. I also wonder what we should do when html is present in data, but this is a difficult question, so let's discuss it separately.

ivanprado · 2020-08-06T13:29:04Z

autoextract_poet/page_inputs.py

+    data: dict
+
+    def to_items(self) -> Optional[List[Item]]:
+        return self.item_class.from_list(


Agree. Here it should be just a single entry, no list.

autoextract_poet/page_inputs.py

tests/__init__.py

kmike · 2020-08-06T16:09:15Z

I think it should be return self.item_class.from_dict(self.data[self.item_key]) as we still want to retain other information which can be useful in data, like html or the future webPage attribute.

I'm not 100% sure about that. If we speak about scrapy-poet, it'd be great to extend providers so that they know how to provide html or WebPage data without making an additional request. So here is more of a question of whether we need an escape hatch for this information in this particular place, by making the input data slightly less convenient. Not against it, but it can be solved in other ways.

ivanprado · 2020-08-06T16:29:39Z

it'd be great to extend providers so that they know how to provide html or WebPage data without making an additional request.

@kmike 100% agree with that. We should then prioritize planning and designing for that, because it is very important, will condition the design of providers in scrapy-poet, and is a requirement for a functional system. Otherwise, we would need the escape hatch for the time being.

victor-torres · 2020-08-06T21:24:50Z

@ivanprado, regarding #2 (comment), I've just updated the source code to store the whole response data and not just the unified schema data.
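
The change described here can be sketched roughly as follows. This is only an illustration under assumptions: dataclasses stand in for the attrs classes used in the PR, the Item, Product and from_dict definitions are simplified stand-ins, and the sample payload is hypothetical.

```python
from dataclasses import dataclass
from typing import ClassVar, Optional, Type


@dataclass
class Item:
    """Simplified stand-in for the library's Item base class."""
    name: Optional[str] = None

    @classmethod
    def from_dict(cls, data: Optional[dict]):
        # Build the item from its own payload; None when the payload is absent.
        return cls(**data) if data is not None else None


@dataclass
class Product(Item):
    pass


@dataclass
class AutoExtractData:
    item_class: ClassVar[Type[Item]]
    item_key: ClassVar[str]

    # The whole response entry is kept, not just the item payload.
    data: dict

    def to_item(self) -> Optional[Item]:
        return self.item_class.from_dict(self.data.get(self.item_key))


@dataclass
class AutoExtractProductData(AutoExtractData):
    item_class = Product
    item_key = "product"


page_input = AutoExtractProductData(
    data={"product": {"name": "Widget"}, "html": "<html></html>"}
)
item = page_input.to_item()
```

Because data holds the full entry, other top-level fields such as html stay reachable through page_input.data after the item is built.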

victor-torres · 2020-08-06T21:25:09Z

@ivanprado @kmike could you guys please check if there's anything else pending on this pull request?

ivanprado · 2020-08-07T08:34:17Z

@victor-torres from my side two considerations:

- I think you missed this comment: Add Page Inputs #2 (comment)
- The discussion @kmike opened here, Add Page Inputs #2 (comment), is really important IMO. Depending on how we solve the question of multiple page inputs from the same request, we should store either only the data relevant to the page input or the whole response data. We should think a little about that question. Meanwhile, we can probably go with the approach in this PR so as not to block it, keeping in mind that the solution may be temporary.

victor-torres · 2020-08-07T14:47:34Z

@ivanprado

I think you missed this comment #2 (comment)

Answered on the comment.

The discussion @kmike opened here #2 (comment) is really important IMO. Depending on how we will solve the question of multiple page-inputs from the same requests, we should store only the data relevant to the page-input, or the whole response data. We should think a little bit about that question. But probably meanwhile we can go with the approach in this PR to not to block it, having in mind that maybe the solution is temporal.

We need to discuss it further. It could probably be addressed in another PR. As we were discussing with @ejulio, we might be trying to cover a use case that doesn't exist yet.

victor-torres · 2020-08-13T20:51:47Z

@kmike @ivanprado Is this pull request ready to be merged?

I think we can improve the validation logic, probably using something other than attrs, such as Pydantic, but this should be done in another pull request since this one was meant to be very basic.

kmike · 2020-08-13T21:33:42Z

autoextract_poet/page_inputs.py

+    """
+
+    item_class: ClassVar[Type[Item]]
+    item_key: ClassVar[str]


I'm still unsure about this item_key. The question is if data should be {"article": {...}} dict or just a dict with article data ({...}, i.e. provider calls data["article"] itself before creating AutoExtractArticleData instance).

As we discussed on a call, escape hatch (e.g. to get html) may be unnecessary, as it can be solved in providers.

So this leaves us with a use case where AutoExtractArticleData (or a similar class) needs to access several top-level fields to create an item. I think that's a valid use case. For example, why not provide the page language in addition to the article language right here? But in this case item_key is not needed, because to_item is going to access several keys.

So, a proposal: keep the current approach, but make item_key private, rename it to _item_key. It looks like an implementation detail, not a part of API intended to be exposed.

An additional argument for not requiring {"article": {...}} in data dicts: if article is missing from the response for some reason, it would fail earlier: in a provider, not in to_item method. Provider would need to do this check explicitly with the current approach, to fail earlier. However, I guess we can discuss & fix this later; it'd be backwards incompatible, but that can be ok. For now my proposal is to just make item_key private.
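
The "fail earlier" argument can be illustrated with a small sketch. It assumes a provider-side helper that pulls the item payload out of a response entry before any page input is created; extract_item_data and MissingItemError are hypothetical names, not part of this PR or of scrapy-poet.

```python
from typing import Dict


class MissingItemError(KeyError):
    """Raised when the expected item key is absent (hypothetical)."""


def extract_item_data(record: Dict, item_key: str) -> Dict:
    """Provider-side check: fail before any page input is built."""
    try:
        return record[item_key]
    except KeyError:
        raise MissingItemError(f"response has no {item_key!r} entry") from None


# The payload is present, so the provider can hand it over directly.
ok = extract_item_data({"article": {"headline": "Hi"}}, "article")

# The payload is missing, so the error surfaces here rather than in to_item.
try:
    extract_item_data({"query": {}}, "article")
except MissingItemError as exc:
    error = exc
```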

The item_key class attribute is being used as a way to dynamically create new AutoExtract data classes; I don't think users are supposed to play with this class attribute. Anyway, I've just renamed it to _item_key for now, as I don't have a strong opinion about it at this point and we're supposed to discuss it further before making additional changes.

In the same way, shouldn't we make item_class private as well?

In the same way, shouldn't we make item_class private as well?

I'm fine with both. Though it could make sense to make it private now, as you're suggesting, because changing it back to public is easier than the other way around.

@kmike, I've reverted the item_key to a public class attribute since it's going to be useful for other libraries when implementing providers for those Page Inputs.

And item_class was converted to a property that uses the information passed to the generic type T.

kmike · 2020-08-13T21:44:52Z

Hey @victor-torres! It'd be good to solve the typing issue before the merge. I've added a comment about the data content, but it is less critical. It also looks like the .json file changes and some of the test changes are not needed now, but it can be more work to revert them than to keep them, so feel free to keep them.

@kmike

victor-torres · 2020-08-14T22:04:36Z

@kmike and @ivanprado, I guess this pull request is ready for another review.

Note that after the latest refactoring, we could replace the item_class attribute with something like self.__orig_bases__[0].__args__[0] to get the AutoExtract Item class and avoid duplication.
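
A rough runtime sketch of that expression, using simplified stand-in classes rather than the PR's actual code. It assumes the parametrized generic is the first base of the subclass; __orig_bases__ and __args__ are runtime details of the typing machinery, and a static checker may not accept the attribute access as written.

```python
from typing import Generic, Type, TypeVar


class Item:
    pass


class Product(Item):
    pass


T = TypeVar("T", bound=Item)


class AutoExtractData(Generic[T]):
    @property
    def item_class(self) -> Type[T]:
        # __orig_bases__ keeps the parametrized base of the subclass,
        # e.g. AutoExtractData[Product]; its first type argument is the item class.
        return self.__orig_bases__[0].__args__[0]

    def to_item(self) -> T:
        return self.item_class()


class AutoExtractProductData(AutoExtractData[Product]):
    # No item_class attribute needed: it is derived from the type argument.
    pass


item = AutoExtractProductData().to_item()
```

The subclass names Product only once, in the type argument, which is the duplication this approach removes.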

kmike · 2020-08-17T08:34:17Z

tests/__init__.py

@@ -9,3 +18,26 @@ def load_fixture(name):
    )
    with open(path, 'r') as f:
        return json.loads(f.read())
+
+
+def compare_item_with_dict(item: Item, data: dict):


A nitpick: what do you think about renaming it, to make it clear what's the output - that it returns True if item and dict are equal? E.g. item_equals_dict. Also, you can annotate the returned value as bool, for completeness:

def item_equals_dict(item: Item, data: dict) -> bool:

kmike · 2020-08-17T08:40:03Z

@victor-torres self.__orig_bases__[0].__args__[0] is an interesting option! I'm fine with doing it (e.g. you can make item_class a property which returns this expression). I'm also fine with keeping the current approach - it is not a lot of duplication, less code, slightly easier to understand, and has less runtime impact.

kmike · 2020-08-17T08:40:46Z

@victor-torres I've left some minor comments, but overall I think this PR is ready to be merged.

victor-torres · 2020-08-17T14:02:25Z

@kmike, I've updated this pull request:

- implementing an item_class property that returns self.__orig_bases__[0].__args__[0]
- improving the test function item_equals_dict (type hinting, better naming, and a simpler docstring)

Although, before merging this pull request, I'd like to ask you again about the _item_key class attribute as it could be used to simplify the code here on this pull request https://github.com/scrapinghub/scrapy-autoextract/pull/13/files#r471466989.

It will be useful for other libraries when developing providers for those Page Inputs.

victor-torres · 2020-08-17T23:47:40Z

@kmike I think this pull request is ready for a final check.

kmike · 2020-08-18T09:45:47Z

Looks good, thanks for addressing all the comments @victor-torres!
Thanks @ivanprado and @ejulio for your helpful comments.

victor-torres · 2020-08-18T13:27:53Z

Thank you for the kind reviews and your patience!
@kmike @ejulio @ivanprado 🚀

Add Page Inputs

a3acbb2

victor-torres requested review from kmike, ivanprado and ejulio August 3, 2020 19:46

ejulio approved these changes Aug 4, 2020

kmike reviewed Aug 6, 2020

kmike reviewed Aug 6, 2020

ivanprado reviewed Aug 6, 2020

victor-torres added 2 commits August 6, 2020 10:48

Turn ResponseData into a private class (_ResponseData)

e175524

This will avoid confusion with web-poet's ResponseData class and make it clearer to the users that it's an internal implementation detail and shouldn't be used directly when implementing Page Objects.

Rename page inputs

485287f

- Remove "Response" from the class name since it means a very different thing in web-poet (url + html). - Include AutoExtract in the class name to avoid ambiguities when importing page inputs from multiple places (it could also improve logging).

ivanprado reviewed Aug 6, 2020

victor-torres added 2 commits August 6, 2020 11:04

Rename class from _ResponseData to _AutoExtractData

be24828

We've just removed the "Response" term from the Article and Product subclasses, it makes no sense to keep the term on the main class as well.

Receive a single response instead of a list

c2fdb76

AutoExtract API returns a JSON Array with responses but it may contain responses with different page types. Therefore, we need to receive individual responses that should have been previously selected for our Item subclass.

Keep the whole response instead of just the page data (unified schema)

fcbd322

We'll probably need to access other metadata such as the response HTML in the future so it's better to save it on the data attribute.

kmike reviewed Aug 13, 2020

victor-torres added 2 commits August 14, 2020 18:42

Turn item_key into a private class attribute _item_key

09542b2

This was proposed by @kmike in an attempt to keep users away from this implementation detail.

Improve type hinting

12554f3

Although PyCharm still shows `AutoExtractProductData.to_item()` as `Optional[T]`, now it shows `p = AutoExtractProductData.to_item()` as `Optional[Product]`.

kmike reviewed Aug 17, 2020

victor-torres added 2 commits August 17, 2020 10:57

Create property to discover item type instead of having to specify it…

aa3797b

… when subclassing This will avoid duplication.

Refactor comparison function

1a76723

Renaming it for better meaning, type hinting and simplifying docstring.

victor-torres added 2 commits August 17, 2020 11:02

Remove not used import

da49366

Revert item_key to a public class attribute

2b3bf60

It will be useful for other libraries when developing providers for those Page Inputs.

kmike merged commit d2fc277 into master Aug 18, 2020

victor-torres deleted the page-inputs branch August 18, 2020 13:27
