Skip to content

Add Page Inputs #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Aug 18, 2020
Merged

Add Page Inputs #2

merged 12 commits into from
Aug 18, 2020

Conversation

victor-torres
Copy link
Contributor

No description provided.

@codecov-commenter
Copy link

codecov-commenter commented Aug 3, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@2915e0a). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master       #2   +/-   ##
=========================================
  Coverage          ?   98.07%           
=========================================
  Files             ?        2           
  Lines             ?      104           
  Branches          ?        0           
=========================================
  Hits              ?      102           
  Misses            ?        2           
  Partials          ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2915e0a...2b3bf60. Read the comment docs.

Copy link

@ejulio ejulio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👏
Just a minor thing with typing :)


data: dict

def to_items(self) -> Optional[List[Item]]:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems from_list always return a dict, so it shouldn't be optional.
Or is there some other use case?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, in the other PR we had a conversation on the return type.
So, I guess it follows here, right?
Since this is the base class, it should've no return type as you can't infer from the base classes.
Even though, a product is an item, in the other it was opted out for the many complications it could cause.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though, a product is an item, in the other it was opted out for the many complications it could cause.

This. I don't think we'll need to override this method.

data: dict

def to_items(self) -> Optional[List[Item]]:
return self.item_class.from_list(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Batch request may contain data of different data types, e.g. some records can be articles, and some can be products. What do you thiink about making a single record an input, not a list? I.e. don't put raw AutoExtract response to self.data, but require a single entry from the list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Here it should be just a single entry, no list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method was renamed to singular (to_item) and we're using from_dict instead of from_list now as suggested.



@attr.s(auto_attribs=True)
class ResponseData:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generics can be leveraged here to have better type checking. Example:

from typing import TypeVar, Generic, ClassVar, Type


class Item:
    pass

class Product(Item):
    pass

T = TypeVar('T', bound=Item)

class ResponseData(Generic[T]):
    item_class: ClassVar[Type[T]]

    def to_items(self) -> T:
        return self.item_class()

class ProductResponseData(ResponseData[Product]):
    item_class = Product

print(type(ProductResponseData().to_items()))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@victor-torres what about this proposal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have missed something here but in my tests, it didn't present any advantage over a simple type annotation such as Type["Item"].

PyCharm was still identifying return type as Item instead of Product (an Item subclass) when the to_items method was being called from an instance of a Product class.

What happens in your test is that you're dynamically checking the type while mypy, PyCharm, and other solutions perform a static type check. It's not the same thing. I think @ejulio's example followed the same approach of yours.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy is able to perform better checking. mypy fails with:

error: "Item" has no attribute "product_method"

on this code:

class Item:
    pass

class Product(Item):
    def product_method(self):
        pass

class ResponseData():
    item_class: ClassVar[Type[Item]]

    def to_items(self) -> Item:
        return self.item_class()

class ProductResponseData(ResponseData):
    item_class = Product

p = ProductResponseData()
item = p.to_items()
item.product_method()

But don't fail with this code:

class Item:
    pass

class Product(Item):
    def product_method(self):
        pass

T = TypeVar('T', bound=Item)

class ResponseData(Generic[T]):
    item_class: ClassVar[Type[T]]

    def to_items(self) -> T:
        return self.item_class()

class ProductResponseData(ResponseData[Product]):
    item_class = Product

p = ProductResponseData()
item = p.to_items()
item.product_method()

Also, PyCharm is able to suggest me product_method for item in the later but not in the first case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, PyCharm is able to suggest me product_method for item in the later but not in the first case.

It will work because we're talking about a class attribute specific to the Product class.

The problem I'm trying to describe here is that AFAIK there's no way to subclass the base class without having to override those class methods to annotate them with the proper return type.

We're also probably mixing two different contexts here: Page Inputs and Items.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example above, PyCharm still shows the return type of ProductResponseData.to_item() as being T and not Product as we would expect.

Copy link
Member

@kmike kmike Aug 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@victor-torres PyCharm correctly infers that item is a Product in @ivanprado's item = p.to_items() example, even if it shows that to_items() returns T. In the first @ivanprado's example PyCharm shows a warning at item.product_method(), but not in the second example. I guess it is the same with mypy. Could you please check it again?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind-of important, because without proper typing support we loose a many of advantages of having these Item classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@victor-torres PyCharm correctly infers that item is a Product in @ivanprado's item = p.to_items() example, even if it shows that to_items() returns T.

That's true. I was looking at the return type of the to_item function only.

I've just implemented @ivanprado's suggestion.

This will avoid confusion with web-poet's ResponseData class and make it clearer to the users that it's an internal implementation detail and shouldn't be used directly when implementing Page Objects.
- Remove "Response" from the class name since it means a very different thing in web-poet (url + html).
- Include AutoExtract in the class name to avoid ambiguities when importing page inputs from multiple places (it could also improve logging).
Copy link
Contributor

@ivanprado ivanprado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @victor-torres . I went through the PR and left some comments. Also wonder what should we do when html is present in data, but this is a difficult question so let's discuss it separately.

data: dict

def to_items(self) -> Optional[List[Item]]:
return self.item_class.from_list(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Here it should be just a single entry, no list.

We've just removed the "Response" term from the Article and Product subclasses,
it makes no sense to keep the term on the main class as well.
AutoExtract API returns a JSON Array with responses but it may contain responses with different page types. Therefore, we need to receive individual responses that should have been previously selected for our Item subclass.
@kmike
Copy link
Member

kmike commented Aug 6, 2020

I think it should be return self.item_class.from_dict(self.data[self.item_key]) as we still want to retain other information which can be useful in data, like html or the future webPage attribute.

I'm not 100% sure about that. If we speak about scrapy-poet, it'd be great to extend providers so that they know how to provide html or WebPage data without making an additional request. So here is more of a question of whether we need an escape hatch for this information in this particular place, by making the input data slightly less convenient. Not against it, but it can be solved in other ways.

@ivanprado
Copy link
Contributor

it'd be great to extend providers so that they know how to provide html or WebPage data without making an additional request.

@kmike 100% agree with that. We should then prioritize planning and designing for that because is very important and will condition the design of providers in scrapy-poet, and is a requirement to have a functional system. Otherwise, we would need the escape hatch for the time being.

We'll probably need to access other metadata such as the response HTML in the future so it's better to save it on the data attribute.
@victor-torres
Copy link
Contributor Author

@ivanprado, regarding #2 (comment), I've just updated the source code to store the whole response data and not just the unified schema data.

@victor-torres
Copy link
Contributor Author

@ivanprado @kmike could you guys please check if there's anything else pending on this pull request?

@ivanprado
Copy link
Contributor

@victor-torres from my side two considerations:

  • I think you missed this comment Add Page Inputs #2 (comment)
  • The discussion @kmike opened here Add Page Inputs #2 (comment) is really important IMO. Depending on how we will solve the question of multiple page-inputs from the same requests, we should store only the data relevant to the page-input, or the whole response data. We should think a little bit about that question. But probably meanwhile we can go with the approach in this PR to not to block it, having in mind that maybe the solution is temporal.

@victor-torres
Copy link
Contributor Author

victor-torres commented Aug 7, 2020

@ivanprado

Answered on the comment.

  • The discussion @kmike opened here #2 (comment) is really important IMO. Depending on how we will solve the question of multiple page-inputs from the same requests, we should store only the data relevant to the page-input, or the whole response data. We should think a little bit about that question. But probably meanwhile we can go with the approach in this PR to not to block it, having in mind that maybe the solution is temporal.

We need to discuss it further. Probably could be addressed in other PR. As we were talking with @ejulio, we might be trying to cover a use case that doesn't exist yet.

@victor-torres
Copy link
Contributor Author

@kmike @ivanprado Is this pull request ready to be merged?

I think we can improve the validation logic, probably using something different than attrs like Pydantic, but this should be done in another pull request since this one was meant to be very basic.

"""

item_class: ClassVar[Type[Item]]
item_key: ClassVar[str]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still unsure about this item_key. The question is if data should be {"article": {...}} dict or just a dict with article data ({...}, i.e. provider calls data["article"] itself before creating AutoExtractArticleData instance).

As we discussed on a call, escape hatch (e.g. to get html) may be unnecessary, as it can be solved in providers.

So this leaves us with an use case where AutoExtractArticleData (or a similar class) needs to access several top-level fields to create an item. I think that's a valid use case. For example, why not provide page language in addition to article language right here. But in this case item_key is not needed, because to_item is going to access several keys.

So, a proposal: keep the current approach, but make item_key private, rename it to _item_key. It looks like an implementation detail, not a part of API intended to be exposed.


An additional argument for not requiring {"article": {...}} in data dicts: if article is missing from the response for some reason, it would fail earlier: in a provider, not in to_item method. Provider would need to do this check explicitly with the current approach, to fail earlier. However, I guess we can discuss & fix this later; it'd be backwards incompatible, but that can be ok. For now my proposal is to just make item_key private.

Copy link
Contributor Author

@victor-torres victor-torres Aug 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The item_key class attribute is being used as a way to dynamically create new AutoExtract data classes, I don't think users are supposed to play with this class attribute. Anyway, I've just renamed it into _item_key for now as I don't have any strong opinion about it at this point and we're supposed to discuss it further before making additional changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the same way, shouldn't we make item_class private as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the same way, shouldn't we make item_class private as well?

I'm fine with both. Though it could make sense to make it private now, as you're suggesting, because changing it back to public is easier than the other way around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmike, I've reverted the item_key to a public class attribute since it's going to be useful for other libraries when implementing providers for those Page Inputs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And item_class was converted to a property that uses the information passed to the generic type T.

@kmike
Copy link
Member

kmike commented Aug 13, 2020

hey @victor-torres! It'd be good to solve typing issue before the merge. I've added a comment about data content, but it is less critical. It also looks like .json file changes and some of the test changes are not needed now, but it can be more work to revert them than to keep them, so feel free to keep.

This was proposed by @kmike in an attempt to keep users away from this implementation detail.
Although PyCharm still shows `AutoExtractProductData.to_item()` as `Optional[T]`,
now it shows `p = AutoExtractProductData.to_item()` as `Optional[Product]`.
@victor-torres
Copy link
Contributor Author

@kmike and @ivanprado, I guess this pull request is ready for another review.

Note that after latest refactoring, we could replace the item_class attribute with something like self.__orig_bases__[0].__args__[0] to get the AutoExtract Item class and avoid duplication.

@@ -9,3 +18,26 @@ def load_fixture(name):
)
with open(path, 'r') as f:
return json.loads(f.read())


def compare_item_with_dict(item: Item, data: dict):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nitpick: what do you think about renaming it, to make it clear what's the output - that it returns True if item and dict are equal? E.g. item_equals_dict. Also, you can annotate the returned value as bool, for completeness:

def item_equals_dict(item: Item, data: dict) -> bool:

@kmike
Copy link
Member

kmike commented Aug 17, 2020

@victor-torres self.__orig_bases__[0].__args__[0] is an interesting option! I'm fine with doing it (e.g. you can make item_class a property which returns this expression). I'm also fine with keeping the current approach - it is not a lot of duplication, less code, slightly easier to understand, and has less runtime impact.

@kmike
Copy link
Member

kmike commented Aug 17, 2020

@victor-torres I've left some minor comments, but overall I think this PR is ready to be merged.

… when subclassing

This will avoid duplication.
Renaming it for better meaning, type hinting and simplifying docstring.
@victor-torres
Copy link
Contributor Author

@kmike, I've updated this pull request:

  • implementing item_class property that returns self.__orig_bases__[0].__args__[0]
  • improving test function item_equals_dict (type hinting, better naming, and simpler docstring)

Although, before merging this pull request, I'd like to ask you again about the _item_key class attribute as it could be used to simplify the code here on this pull request https://github.com/scrapinghub/scrapy-autoextract/pull/13/files#r471466989.

It will be useful for other libraries when developing providers for those Page Inputs.
@victor-torres
Copy link
Contributor Author

@kmike I think this pull request is ready for a final check.

@kmike
Copy link
Member

kmike commented Aug 18, 2020

Looks good, thanks for addressing all the comments @victor-torres!
Thanks @ivanprado and @ejulio for you helpful comments.

@kmike kmike merged commit d2fc277 into master Aug 18, 2020
@victor-torres victor-torres deleted the page-inputs branch August 18, 2020 13:27
@victor-torres
Copy link
Contributor Author

Thank you for the kind reviews and your patience!
@kmike @ejulio @ivanprado 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants