-
Notifications
You must be signed in to change notification settings - Fork 1
Add Page Inputs #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2 +/- ##
=========================================
Coverage ? 98.07%
=========================================
Files ? 2
Lines ? 104
Branches ? 0
=========================================
Hits ? 102
Misses ? 2
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👏
Just a minor thing with typing :)
autoextract_poet/page_inputs.py
Outdated
|
||
data: dict | ||
|
||
def to_items(self) -> Optional[List[Item]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems from_list
always return a dict
, so it shouldn't be optional.
Or is there some other use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, in the other PR we had a conversation on the return type.
So, I guess it follows here, right?
Since this is the base class, it should've no return type as you can't infer from the base classes.
Even though, a product is an item, in the other it was opted out for the many complications it could cause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though, a product is an item, in the other it was opted out for the many complications it could cause.
This. I don't think we'll need to override this method.
autoextract_poet/page_inputs.py
Outdated
data: dict | ||
|
||
def to_items(self) -> Optional[List[Item]]: | ||
return self.item_class.from_list( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Batch request may contain data of different data types, e.g. some records can be articles, and some can be products. What do you thiink about making a single record an input, not a list? I.e. don't put raw AutoExtract response to self.data, but require a single entry from the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Here it should be just a single entry, no list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method was renamed to singular (to_item
) and we're using from_dict
instead of from_list
now as suggested.
autoextract_poet/page_inputs.py
Outdated
|
||
|
||
@attr.s(auto_attribs=True) | ||
class ResponseData: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generics can be leveraged here to have better type checking. Example:
from typing import TypeVar, Generic, ClassVar, Type
class Item:
pass
class Product(Item):
pass
T = TypeVar('T', bound=Item)
class ResponseData(Generic[T]):
item_class: ClassVar[Type[T]]
def to_items(self) -> T:
return self.item_class()
class ProductResponseData(ResponseData[Product]):
item_class = Product
print(type(ProductResponseData().to_items()))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@victor-torres what about this proposal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have missed something here but in my tests, it didn't present any advantage over a simple type annotation such as Type["Item"]
.
PyCharm was still identifying return type as Item
instead of Product
(an Item
subclass) when the to_items
method was being called from an instance of a Product
class.
What happens in your test is that you're dynamically checking the type while mypy, PyCharm, and other solutions perform a static type check. It's not the same thing. I think @ejulio's example followed the same approach of yours.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mypy is able to perform better checking. mypy
fails with:
error: "Item" has no attribute "product_method"
on this code:
class Item:
pass
class Product(Item):
def product_method(self):
pass
class ResponseData():
item_class: ClassVar[Type[Item]]
def to_items(self) -> Item:
return self.item_class()
class ProductResponseData(ResponseData):
item_class = Product
p = ProductResponseData()
item = p.to_items()
item.product_method()
But don't fail with this code:
class Item:
pass
class Product(Item):
def product_method(self):
pass
T = TypeVar('T', bound=Item)
class ResponseData(Generic[T]):
item_class: ClassVar[Type[T]]
def to_items(self) -> T:
return self.item_class()
class ProductResponseData(ResponseData[Product]):
item_class = Product
p = ProductResponseData()
item = p.to_items()
item.product_method()
Also, PyCharm is able to suggest me product_method
for item in the later but not in the first case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got the idea from https://mypy.readthedocs.io/en/stable/generics.html?highlight=generics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, PyCharm is able to suggest me
product_method
for item in the later but not in the first case.
It will work because we're talking about a class attribute specific to the Product
class.
The problem I'm trying to describe here is that AFAIK there's no way to subclass the base class without having to override those class methods to annotate them with the proper return type.
We're also probably mixing two different contexts here: Page Inputs and Items.
- https://stackoverflow.com/questions/39205527/can-you-annotate-return-type-when-value-is-instance-of-cls/39205612#39205612
- How to add hint to factory method? python/typing#58
- SelfType or another way to spell "type of self" (or, How to define a copy() function) python/mypy#1212
- Type of the same class inside the class python/mypy#3661
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your example above, PyCharm still shows the return type of ProductResponseData.to_item()
as being T
and not Product
as we would expect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@victor-torres PyCharm correctly infers that item is a Product in @ivanprado's item = p.to_items()
example, even if it shows that to_items()
returns T. In the first @ivanprado's example PyCharm shows a warning at item.product_method()
, but not in the second example. I guess it is the same with mypy. Could you please check it again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind-of important, because without proper typing support we loose a many of advantages of having these Item classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@victor-torres PyCharm correctly infers that item is a Product in @ivanprado's item = p.to_items() example, even if it shows that to_items() returns T.
That's true. I was looking at the return type of the to_item
function only.
I've just implemented @ivanprado's suggestion.
This will avoid confusion with web-poet's ResponseData class and make it clearer to the users that it's an internal implementation detail and shouldn't be used directly when implementing Page Objects.
- Remove "Response" from the class name since it means a very different thing in web-poet (url + html). - Include AutoExtract in the class name to avoid ambiguities when importing page inputs from multiple places (it could also improve logging).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @victor-torres . I went through the PR and left some comments. Also wonder what should we do when html
is present in data, but this is a difficult question so let's discuss it separately.
autoextract_poet/page_inputs.py
Outdated
data: dict | ||
|
||
def to_items(self) -> Optional[List[Item]]: | ||
return self.item_class.from_list( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Here it should be just a single entry, no list.
We've just removed the "Response" term from the Article and Product subclasses, it makes no sense to keep the term on the main class as well.
AutoExtract API returns a JSON Array with responses but it may contain responses with different page types. Therefore, we need to receive individual responses that should have been previously selected for our Item subclass.
I'm not 100% sure about that. If we speak about scrapy-poet, it'd be great to extend providers so that they know how to provide html or WebPage data without making an additional request. So here is more of a question of whether we need an escape hatch for this information in this particular place, by making the input data slightly less convenient. Not against it, but it can be solved in other ways. |
@kmike 100% agree with that. We should then prioritize planning and designing for that because is very important and will condition the design of providers in scrapy-poet, and is a requirement to have a functional system. Otherwise, we would need the escape hatch for the time being. |
We'll probably need to access other metadata such as the response HTML in the future so it's better to save it on the data attribute.
@ivanprado, regarding #2 (comment), I've just updated the source code to store the whole response data and not just the unified schema data. |
@ivanprado @kmike could you guys please check if there's anything else pending on this pull request? |
@victor-torres from my side two considerations:
|
Answered on the comment.
We need to discuss it further. Probably could be addressed in other PR. As we were talking with @ejulio, we might be trying to cover a use case that doesn't exist yet. |
@kmike @ivanprado Is this pull request ready to be merged? I think we can improve the validation logic, probably using something different than attrs like Pydantic, but this should be done in another pull request since this one was meant to be very basic. |
""" | ||
|
||
item_class: ClassVar[Type[Item]] | ||
item_key: ClassVar[str] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still unsure about this item_key
. The question is if data
should be {"article": {...}}
dict or just a dict with article data ({...}
, i.e. provider calls data["article"]
itself before creating AutoExtractArticleData instance).
As we discussed on a call, escape hatch (e.g. to get html) may be unnecessary, as it can be solved in providers.
So this leaves us with an use case where AutoExtractArticleData (or a similar class) needs to access several top-level fields to create an item. I think that's a valid use case. For example, why not provide page language in addition to article language right here. But in this case item_key
is not needed, because to_item is going to access several keys.
So, a proposal: keep the current approach, but make item_key
private, rename it to _item_key
. It looks like an implementation detail, not a part of API intended to be exposed.
An additional argument for not requiring {"article": {...}}
in data
dicts: if article is missing from the response for some reason, it would fail earlier: in a provider, not in to_item method. Provider would need to do this check explicitly with the current approach, to fail earlier. However, I guess we can discuss & fix this later; it'd be backwards incompatible, but that can be ok. For now my proposal is to just make item_key private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The item_key
class attribute is being used as a way to dynamically create new AutoExtract data classes, I don't think users are supposed to play with this class attribute. Anyway, I've just renamed it into _item_key
for now as I don't have any strong opinion about it at this point and we're supposed to discuss it further before making additional changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the same way, shouldn't we make item_class
private as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the same way, shouldn't we make item_class private as well?
I'm fine with both. Though it could make sense to make it private now, as you're suggesting, because changing it back to public is easier than the other way around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kmike, I've reverted the item_key
to a public class attribute since it's going to be useful for other libraries when implementing providers for those Page Inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And item_class
was converted to a property that uses the information passed to the generic type T.
hey @victor-torres! It'd be good to solve typing issue before the merge. I've added a comment about |
This was proposed by @kmike in an attempt to keep users away from this implementation detail.
Although PyCharm still shows `AutoExtractProductData.to_item()` as `Optional[T]`, now it shows `p = AutoExtractProductData.to_item()` as `Optional[Product]`.
@kmike and @ivanprado, I guess this pull request is ready for another review. Note that after latest refactoring, we could replace the |
tests/__init__.py
Outdated
@@ -9,3 +18,26 @@ def load_fixture(name): | |||
) | |||
with open(path, 'r') as f: | |||
return json.loads(f.read()) | |||
|
|||
|
|||
def compare_item_with_dict(item: Item, data: dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nitpick: what do you think about renaming it, to make it clear what's the output - that it returns True if item and dict are equal? E.g. item_equals_dict. Also, you can annotate the returned value as bool, for completeness:
def item_equals_dict(item: Item, data: dict) -> bool:
@victor-torres |
@victor-torres I've left some minor comments, but overall I think this PR is ready to be merged. |
… when subclassing This will avoid duplication.
Renaming it for better meaning, type hinting and simplifying docstring.
@kmike, I've updated this pull request:
Although, before merging this pull request, I'd like to ask you again about the |
It will be useful for other libraries when developing providers for those Page Inputs.
@kmike I think this pull request is ready for a final check. |
Looks good, thanks for addressing all the comments @victor-torres! |
Thank you for the kind reviews and your patience! |
No description provided.