-
Notifications
You must be signed in to change notification settings - Fork 1
Add Page Inputs #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
12 commits
Select commit
Hold shift + click to select a range
a3acbb2
Add Page Inputs
victor-torres e175524
Turn ResponseData into a private class (_ResponseData)
victor-torres 485287f
Rename page inputs
victor-torres be24828
Rename class from _ResponseData to _AutoExtractData
victor-torres c2fdb76
Receive a single response instead of a list
victor-torres fcbd322
Keep the whole response instead of just the page data (unified schema)
victor-torres 09542b2
Turn item_key into a private class attribute _item_key
victor-torres 12554f3
Improve type hinting
victor-torres aa3797b
Create property to discover item type instead of having to specify it…
victor-torres 1a76723
Refactor comparison function
victor-torres da49366
Remove not used import
victor-torres 2b3bf60
Revert item_key to a public class attribute
victor-torres File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
from typing import ClassVar, Generic, Optional, TypeVar | ||
|
||
import attr | ||
|
||
from autoextract_poet.items import ( | ||
Article, | ||
Item, | ||
Product, | ||
) | ||
|
||
T = TypeVar("T", bound=Item) | ||
|
||
|
||
@attr.s(auto_attribs=True) | ||
class _AutoExtractData(Generic[T]): | ||
"""Container for AutoExtract data. | ||
|
||
Should not be used directly by providers. | ||
Use derived classes like AutoExtractArticleData and similar. | ||
|
||
API responses are wrapped in a JSON array | ||
(this is to facilitate query batching) | ||
but we're receiving single responses here.. | ||
|
||
https://doc.scrapinghub.com/autoextract.html#responses | ||
""" | ||
|
||
item_key: ClassVar[str] | ||
|
||
data: dict | ||
|
||
@property | ||
def item_class(self): | ||
return self.__orig_bases__[0].__args__[0] | ||
|
||
def to_item(self) -> Optional[T]: | ||
return self.item_class.from_dict(self.data[self.item_key]) | ||
|
||
|
||
@attr.s(auto_attribs=True) | ||
class AutoExtractArticleData(_AutoExtractData[Article]): | ||
"""Container for AutoExtract Article data. | ||
|
||
https://doc.scrapinghub.com/autoextract/article.html | ||
""" | ||
|
||
item_key = "article" | ||
|
||
|
||
@attr.s(auto_attribs=True) | ||
class AutoExtractProductData(_AutoExtractData[Product]): | ||
"""Container for AutoExtract Product data. | ||
|
||
https://doc.scrapinghub.com/autoextract/product.html | ||
""" | ||
|
||
item_key = "product" |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,51 +1,57 @@ | ||
{ | ||
[ | ||
{ | ||
"article": { | ||
"headline": "Article headline", | ||
"datePublished": "2019-06-19T00:00:00", | ||
"datePublishedRaw": "June 19, 2019", | ||
"dateModified": "2019-06-21T00:00:00", | ||
"dateModifiedRaw": "June 21, 2019", | ||
"author": "Article author", | ||
"authorsList": [ | ||
"Article author" | ||
], | ||
"inLanguage": "en", | ||
"breadcrumbs": [ | ||
{ | ||
"name": "Level 1", | ||
"link": "http://example.com" | ||
} | ||
], | ||
"mainImage": "http://example.com/image.png", | ||
"images": [ | ||
"http://example.com/image.png" | ||
], | ||
"description": "Article summary", | ||
"articleBody": "Article body ...", | ||
"articleBodyHtml": "<article><p>Article body ... </p> ... </article>", | ||
"articleBodyRaw": "<div id=\"an-article\">Article body ...", | ||
"videoUrls": [ | ||
"https://example.com/video.mp4" | ||
], | ||
"audioUrls": [ | ||
"https://example.com/audio.mp3" | ||
], | ||
"probability": 0.95, | ||
"canonicalUrl": "https://example.com/article/article-about-something", | ||
"url": "https://example.com/article?id=24" | ||
"headline": "Article headline", | ||
"datePublished": "2019-06-19T00:00:00", | ||
"datePublishedRaw": "June 19, 2019", | ||
"dateModified": "2019-06-21T00:00:00", | ||
"dateModifiedRaw": "June 21, 2019", | ||
"author": "Article author", | ||
"authorsList": [ | ||
"Article author" | ||
], | ||
"inLanguage": "en", | ||
"breadcrumbs": [ | ||
{ | ||
"name": "Level 1", | ||
"link": "http://example.com" | ||
} | ||
], | ||
"mainImage": "http://example.com/image.png", | ||
"images": [ | ||
"http://example.com/image.png" | ||
], | ||
"description": "Article summary", | ||
"articleBody": "Article body ...", | ||
"articleBodyHtml": "<article><p>Article body ... </p> ... </article>", | ||
"articleBodyRaw": "<div id=\"an-article\">Article body ...", | ||
"videoUrls": [ | ||
"https://example.com/video.mp4" | ||
], | ||
"audioUrls": [ | ||
"https://example.com/audio.mp3" | ||
], | ||
"probability": 0.95, | ||
"canonicalUrl": "https://example.com/article/article-about-something", | ||
"url": "https://example.com/article?id=24" | ||
}, | ||
"webPage": { | ||
"inLanguages": [ | ||
{"code": "en"}, | ||
{"code": "es"} | ||
] | ||
"inLanguages": [ | ||
{ | ||
"code": "en" | ||
}, | ||
{ | ||
"code": "es" | ||
} | ||
] | ||
}, | ||
"query": { | ||
"id": "1564747029122-9e02a1868d70b7a3", | ||
"domain": "example.com", | ||
"userQuery": { | ||
"pageType": "article", | ||
"url": "http://example.com/article?id=24" | ||
} | ||
"id": "1564747029122-9e02a1868d70b7a3", | ||
"domain": "example.com", | ||
"userQuery": { | ||
"pageType": "article", | ||
"url": "http://example.com/article?id=24" | ||
} | ||
} | ||
} | ||
} | ||
] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,59 +1,65 @@ | ||
{ | ||
[ | ||
{ | ||
"product": { | ||
"name": "Product name", | ||
"offers": [ | ||
{ | ||
"price": "42", | ||
"currency": "USD", | ||
"availability": "InStock" | ||
} | ||
], | ||
"sku": "product sku", | ||
"mpn": "product mpn", | ||
"gtin": [ | ||
{ | ||
"type": "ean13", | ||
"value": "978-3-16-148410-0" | ||
} | ||
], | ||
"brand": "product brand", | ||
"breadcrumbs": [ | ||
{ | ||
"name": "Level 1", | ||
"link": "http://example.com" | ||
} | ||
], | ||
"mainImage": "http://example.com/image.png", | ||
"images": [ | ||
"http://example.com/image.png" | ||
], | ||
"description": "product description", | ||
"aggregateRating": { | ||
"ratingValue": 4.5, | ||
"bestRating": 5.0, | ||
"reviewCount": 31 | ||
}, | ||
"additionalProperty": [ | ||
{ | ||
"name": "property 1", | ||
"value": "value of property 1" | ||
} | ||
], | ||
"probability": 0.95, | ||
"url": "https://example.com/product" | ||
"name": "Product name", | ||
"offers": [ | ||
{ | ||
"price": "42", | ||
"currency": "USD", | ||
"availability": "InStock" | ||
} | ||
], | ||
"sku": "product sku", | ||
"mpn": "product mpn", | ||
"gtin": [ | ||
{ | ||
"type": "ean13", | ||
"value": "978-3-16-148410-0" | ||
} | ||
], | ||
"brand": "product brand", | ||
"breadcrumbs": [ | ||
{ | ||
"name": "Level 1", | ||
"link": "http://example.com" | ||
} | ||
], | ||
"mainImage": "http://example.com/image.png", | ||
"images": [ | ||
"http://example.com/image.png" | ||
], | ||
"description": "product description", | ||
"aggregateRating": { | ||
"ratingValue": 4.5, | ||
"bestRating": 5.0, | ||
"reviewCount": 31 | ||
}, | ||
"additionalProperty": [ | ||
{ | ||
"name": "property 1", | ||
"value": "value of property 1" | ||
} | ||
], | ||
"probability": 0.95, | ||
"url": "https://example.com/product" | ||
}, | ||
"webPage": { | ||
"inLanguages": [ | ||
{"code": "en"}, | ||
{"code": "es"} | ||
] | ||
"inLanguages": [ | ||
{ | ||
"code": "en" | ||
}, | ||
{ | ||
"code": "es" | ||
} | ||
] | ||
}, | ||
"query": { | ||
"id": "1564747029122-9e02a1868d70b7a2", | ||
"domain": "example.com", | ||
"userQuery": { | ||
"pageType": "product", | ||
"url": "https://example.com/product" | ||
} | ||
"id": "1564747029122-9e02a1868d70b7a2", | ||
"domain": "example.com", | ||
"userQuery": { | ||
"pageType": "product", | ||
"url": "https://example.com/product" | ||
} | ||
} | ||
} | ||
} | ||
] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
import pytest | ||
|
||
from autoextract_poet.page_inputs import ( | ||
AutoExtractArticleData, | ||
AutoExtractProductData, | ||
) | ||
|
||
from tests import load_fixture, item_equals_dict | ||
|
||
example_article_result = load_fixture("sample_article.json") | ||
example_product_result = load_fixture("sample_product.json") | ||
|
||
|
||
@pytest.mark.parametrize("cls, results", [ | ||
(AutoExtractArticleData, example_article_result), | ||
(AutoExtractProductData, example_product_result), | ||
]) | ||
def test_response_data(cls, results): | ||
response_data = cls(results[0]) | ||
item = response_data.to_item() | ||
assert isinstance(item, response_data.item_class) | ||
assert item_equals_dict(item, results[0][cls.item_key]) |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still unsure about this
item_key
. The question is ifdata
should be{"article": {...}}
dict or just a dict with article data ({...}
, i.e. provider callsdata["article"]
itself before creating AutoExtractArticleData instance).As we discussed on a call, escape hatch (e.g. to get html) may be unnecessary, as it can be solved in providers.
So this leaves us with an use case where AutoExtractArticleData (or a similar class) needs to access several top-level fields to create an item. I think that's a valid use case. For example, why not provide page language in addition to article language right here. But in this case
item_key
is not needed, because to_item is going to access several keys.So, a proposal: keep the current approach, but make
item_key
private, rename it to_item_key
. It looks like an implementation detail, not a part of API intended to be exposed.An additional argument for not requiring
{"article": {...}}
indata
dicts: if article is missing from the response for some reason, it would fail earlier: in a provider, not in to_item method. Provider would need to do this check explicitly with the current approach, to fail earlier. However, I guess we can discuss & fix this later; it'd be backwards incompatible, but that can be ok. For now my proposal is to just make item_key private.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
item_key
class attribute is being used as a way to dynamically create new AutoExtract data classes, I don't think users are supposed to play with this class attribute. Anyway, I've just renamed it into_item_key
for now as I don't have any strong opinion about it at this point and we're supposed to discuss it further before making additional changes.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the same way, shouldn't we make
item_class
private as well?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with both. Though it could make sense to make it private now, as you're suggesting, because changing it back to public is easier than the other way around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kmike, I've reverted the
item_key
to a public class attribute since it's going to be useful for other libraries when implementing providers for those Page Inputs.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And
item_class
was converted to a property that uses the information passed to the generic type T.