Add Page Inputs #2

Merged
merged 12 commits into from
Aug 18, 2020
57 changes: 57 additions & 0 deletions autoextract_poet/page_inputs.py
@@ -0,0 +1,57 @@
+from typing import ClassVar, Generic, Optional, TypeVar
+
+import attr
+
+from autoextract_poet.items import (
+    Article,
+    Item,
+    Product,
+)
+
+T = TypeVar("T", bound=Item)
+
+
+@attr.s(auto_attribs=True)
+class _AutoExtractData(Generic[T]):
+    """Container for AutoExtract data.
+
+    Should not be used directly by providers.
+    Use derived classes like AutoExtractArticleData and similar.
+
+    API responses are wrapped in a JSON array
+    (this is to facilitate query batching)
+    but we're receiving single responses here.
+
+    https://doc.scrapinghub.com/autoextract.html#responses
+    """
+
+    item_key: ClassVar[str]
Member

I'm still unsure about this item_key. The question is whether data should be a {"article": {...}} dict or just a dict with the article data ({...}, i.e. the provider calls data["article"] itself before creating an AutoExtractArticleData instance).

As we discussed on a call, an escape hatch (e.g. to get the HTML) may be unnecessary, as that can be solved in providers.

So this leaves us with a use case where AutoExtractArticleData (or a similar class) needs to access several top-level fields to create an item. I think that's a valid use case. For example, why not provide the page language in addition to the article language right here? But in this case item_key is not needed, because to_item is going to access several keys.

So, a proposal: keep the current approach, but make item_key private by renaming it to _item_key. It looks like an implementation detail, not a part of the API intended to be exposed.

An additional argument for not requiring {"article": {...}} in data dicts: if article is missing from the response for some reason, it would fail earlier, in the provider rather than in the to_item method. With the current approach, the provider would need to do this check explicitly to fail earlier. However, I guess we can discuss and fix this later; it'd be backwards incompatible, but that can be OK. For now my proposal is just to make item_key private.
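
The explicit provider-side check mentioned in this comment could look roughly like the sketch below. It is only an illustration: `extract_single_result` is a hypothetical helper, not part of autoextract-poet, and it assumes the provider already holds the decoded JSON array returned by the API.

```python
from typing import Any, Dict, List


def extract_single_result(
    response: List[Dict[str, Any]], item_key: str
) -> Dict[str, Any]:
    """Hypothetical provider-side helper: unwrap a single-query response.

    Raises early if the array does not hold exactly one result or if the
    expected top-level key (for example "article") is missing.
    """
    if len(response) != 1:
        raise ValueError(f"expected exactly one result, got {len(response)}")
    result = response[0]
    if item_key not in result:
        raise KeyError(f"{item_key!r} is missing from the response")
    return result


# Assumed, trimmed-down response shape based on the fixtures in this PR.
raw = [
    {
        "article": {"headline": "Article headline"},
        "webPage": {"inLanguages": [{"code": "en"}]},
    }
]
result = extract_single_result(raw, "article")
print(result["article"]["headline"])
```

With a check like this, a missing key surfaces in the provider, while the page input can still read other top-level fields such as `webPage`.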

Contributor Author

@victor-torres victor-torres Aug 14, 2020

The item_key class attribute is used as a way to dynamically create new AutoExtract data classes; I don't think users are supposed to play with it. Anyway, I've just renamed it to _item_key for now, as I don't have a strong opinion about it at this point and we're supposed to discuss it further before making additional changes.

Contributor Author

In the same way, shouldn't we make item_class private as well?

Member

> In the same way, shouldn't we make item_class private as well?

I'm fine with either. Though it could make sense to make it private now, as you're suggesting, because changing it back to public is easier than the other way around.

Contributor Author

@kmike, I've reverted the item_key to a public class attribute since it's going to be useful for other libraries when implementing providers for those Page Inputs.

Contributor Author

And item_class was converted to a property that uses the information passed to the generic type T.
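
For readers unfamiliar with the mechanism: a subclass declared as `Sub(Base[X])` keeps the parametrized base in `__orig_bases__`, so the type argument can be read back at runtime. The sketch below shows this with stand-in classes only; `FakeItem`, `Container` and `FakeItemContainer` are illustrative names, not part of autoextract-poet.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class FakeItem:
    """Stand-in for an item class such as Article (hypothetical)."""


class Container(Generic[T]):
    @property
    def item_class(self):
        # __orig_bases__ holds the bases as written in the class statement,
        # e.g. (Container[FakeItem],); __args__ holds the type arguments.
        return self.__orig_bases__[0].__args__[0]


class FakeItemContainer(Container[FakeItem]):
    pass


container = FakeItemContainer()
print(container.item_class is FakeItem)  # True
```

On the unparametrized base itself the same lookup yields the type variable `T` rather than a concrete class, which fits the docstring's advice to use only the derived classes.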


+    data: dict
+
+    @property
+    def item_class(self):
+        return self.__orig_bases__[0].__args__[0]
+
+    def to_item(self) -> Optional[T]:
+        return self.item_class.from_dict(self.data[self.item_key])
+
+
+@attr.s(auto_attribs=True)
+class AutoExtractArticleData(_AutoExtractData[Article]):
+    """Container for AutoExtract Article data.
+
+    https://doc.scrapinghub.com/autoextract/article.html
+    """
+
+    item_key = "article"
+
+
+@attr.s(auto_attribs=True)
+class AutoExtractProductData(_AutoExtractData[Product]):
+    """Container for AutoExtract Product data.
+
+    https://doc.scrapinghub.com/autoextract/product.html
+    """
+
+    item_key = "product"
29 changes: 29 additions & 0 deletions tests/__init__.py
@@ -1,6 +1,15 @@
 import json
 import os

+from autoextract_poet.items import (
+    AdditionalProperty,
+    Breadcrumb,
+    Item,
+    GTIN,
+    Offer,
+    Rating,
+)
+


 def load_fixture(name):
     path = os.path.join(
@@ -9,3 +18,23 @@ def load_fixture(name):
     )
     with open(path, 'r') as f:
         return json.loads(f.read())
+
+
+def item_equals_dict(item: Item, data: dict) -> bool:
+    """Return True if Item and Dict are equivalent or False otherwise."""
+    for key, value in data.items():
+        if key == 'additionalProperty':
+            value = AdditionalProperty.from_list(value)
+        if key == 'aggregateRating':
+            value = Rating.from_dict(value)
+        if key == 'breadcrumbs':
+            value = Breadcrumb.from_list(value)
+        if key == 'gtin':
+            value = GTIN.from_list(value)
+        if key == 'offers':
+            value = Offer.from_list(value)
+
+        if getattr(item, key) != value:
+            return False
+
+    return True
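
The comparison helper above converts nested values before comparing them with the item's attributes. The same idea can be written with a lookup table of converters, sketched below. This is an illustration under stated assumptions, not the code merged in this PR: `FakeBreadcrumb`, `FakeArticle` and `CONVERTERS` are stand-ins so the example runs without autoextract-poet installed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class FakeBreadcrumb:
    """Stand-in for autoextract_poet.items.Breadcrumb (hypothetical)."""
    name: str
    link: str

    @classmethod
    def from_list(cls, items: List[Dict[str, str]]) -> List["FakeBreadcrumb"]:
        return [cls(**item) for item in items]


@dataclass
class FakeArticle:
    """Stand-in item with one plain field and one nested field."""
    headline: str
    breadcrumbs: List[FakeBreadcrumb]


# Maps a top-level key to the function that turns raw JSON into item values.
CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "breadcrumbs": FakeBreadcrumb.from_list,
}


def item_equals_dict(item: Any, data: Dict[str, Any]) -> bool:
    """Return True if every key in data matches the item's attribute."""
    for key, value in data.items():
        convert = CONVERTERS.get(key)
        expected = convert(value) if convert else value
        if getattr(item, key) != expected:
            return False
    return True


data = {
    "headline": "Article headline",
    "breadcrumbs": [{"name": "Level 1", "link": "http://example.com"}],
}
article = FakeArticle(
    headline="Article headline",
    breadcrumbs=[FakeBreadcrumb("Level 1", "http://example.com")],
)
print(item_equals_dict(article, data))
```

A table keeps the key-to-converter pairs in one place, at the cost of one more module-level name; the chain of `if` statements in the diff is equivalent.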
96 changes: 51 additions & 45 deletions tests/fixtures/sample_article.json
@@ -1,51 +1,57 @@
-{
+[
+    {
         "article": {
-                "headline": "Article headline",
-                "datePublished": "2019-06-19T00:00:00",
-                "datePublishedRaw": "June 19, 2019",
-                "dateModified": "2019-06-21T00:00:00",
-                "dateModifiedRaw": "June 21, 2019",
-                "author": "Article author",
-                "authorsList": [
-                        "Article author"
-                ],
-                "inLanguage": "en",
-                "breadcrumbs": [
-                        {
-                                "name": "Level 1",
-                                "link": "http://example.com"
-                        }
-                ],
-                "mainImage": "http://example.com/image.png",
-                "images": [
-                        "http://example.com/image.png"
-                ],
-                "description": "Article summary",
-                "articleBody": "Article body ...",
-                "articleBodyHtml": "<article><p>Article body ... </p> ... </article>",
-                "articleBodyRaw": "<div id=\"an-article\">Article body ...",
-                "videoUrls": [
-                        "https://example.com/video.mp4"
-                ],
-                "audioUrls": [
-                        "https://example.com/audio.mp3"
-                ],
-                "probability": 0.95,
-                "canonicalUrl": "https://example.com/article/article-about-something",
-                "url": "https://example.com/article?id=24"
+            "headline": "Article headline",
+            "datePublished": "2019-06-19T00:00:00",
+            "datePublishedRaw": "June 19, 2019",
+            "dateModified": "2019-06-21T00:00:00",
+            "dateModifiedRaw": "June 21, 2019",
+            "author": "Article author",
+            "authorsList": [
+                "Article author"
+            ],
+            "inLanguage": "en",
+            "breadcrumbs": [
+                {
+                    "name": "Level 1",
+                    "link": "http://example.com"
+                }
+            ],
+            "mainImage": "http://example.com/image.png",
+            "images": [
+                "http://example.com/image.png"
+            ],
+            "description": "Article summary",
+            "articleBody": "Article body ...",
+            "articleBodyHtml": "<article><p>Article body ... </p> ... </article>",
+            "articleBodyRaw": "<div id=\"an-article\">Article body ...",
+            "videoUrls": [
+                "https://example.com/video.mp4"
+            ],
+            "audioUrls": [
+                "https://example.com/audio.mp3"
+            ],
+            "probability": 0.95,
+            "canonicalUrl": "https://example.com/article/article-about-something",
+            "url": "https://example.com/article?id=24"
         },
         "webPage": {
-                "inLanguages": [
-                        {"code": "en"},
-                        {"code": "es"}
-                ]
+            "inLanguages": [
+                {
+                    "code": "en"
+                },
+                {
+                    "code": "es"
+                }
+            ]
         },
         "query": {
-                "id": "1564747029122-9e02a1868d70b7a3",
-                "domain": "example.com",
-                "userQuery": {
-                        "pageType": "article",
-                        "url": "http://example.com/article?id=24"
-                }
+            "id": "1564747029122-9e02a1868d70b7a3",
+            "domain": "example.com",
+            "userQuery": {
+                "pageType": "article",
+                "url": "http://example.com/article?id=24"
+            }
         }
-}
+    }
+]
112 changes: 59 additions & 53 deletions tests/fixtures/sample_product.json
@@ -1,59 +1,65 @@
-{
+[
+    {
         "product": {
-                "name": "Product name",
-                "offers": [
-                        {
-                                "price": "42",
-                                "currency": "USD",
-                                "availability": "InStock"
-                        }
-                ],
-                "sku": "product sku",
-                "mpn": "product mpn",
-                "gtin": [
-                        {
-                                "type": "ean13",
-                                "value": "978-3-16-148410-0"
-                        }
-                ],
-                "brand": "product brand",
-                "breadcrumbs": [
-                        {
-                                "name": "Level 1",
-                                "link": "http://example.com"
-                        }
-                ],
-                "mainImage": "http://example.com/image.png",
-                "images": [
-                        "http://example.com/image.png"
-                ],
-                "description": "product description",
-                "aggregateRating": {
-                        "ratingValue": 4.5,
-                        "bestRating": 5.0,
-                        "reviewCount": 31
-                },
-                "additionalProperty": [
-                        {
-                                "name": "property 1",
-                                "value": "value of property 1"
-                        }
-                ],
-                "probability": 0.95,
-                "url": "https://example.com/product"
+            "name": "Product name",
+            "offers": [
+                {
+                    "price": "42",
+                    "currency": "USD",
+                    "availability": "InStock"
+                }
+            ],
+            "sku": "product sku",
+            "mpn": "product mpn",
+            "gtin": [
+                {
+                    "type": "ean13",
+                    "value": "978-3-16-148410-0"
+                }
+            ],
+            "brand": "product brand",
+            "breadcrumbs": [
+                {
+                    "name": "Level 1",
+                    "link": "http://example.com"
+                }
+            ],
+            "mainImage": "http://example.com/image.png",
+            "images": [
+                "http://example.com/image.png"
+            ],
+            "description": "product description",
+            "aggregateRating": {
+                "ratingValue": 4.5,
+                "bestRating": 5.0,
+                "reviewCount": 31
+            },
+            "additionalProperty": [
+                {
+                    "name": "property 1",
+                    "value": "value of property 1"
+                }
+            ],
+            "probability": 0.95,
+            "url": "https://example.com/product"
         },
         "webPage": {
-                "inLanguages": [
-                        {"code": "en"},
-                        {"code": "es"}
-                ]
+            "inLanguages": [
+                {
+                    "code": "en"
+                },
+                {
+                    "code": "es"
+                }
+            ]
         },
         "query": {
-                "id": "1564747029122-9e02a1868d70b7a2",
-                "domain": "example.com",
-                "userQuery": {
-                        "pageType": "product",
-                        "url": "https://example.com/product"
-                }
+            "id": "1564747029122-9e02a1868d70b7a2",
+            "domain": "example.com",
+            "userQuery": {
+                "pageType": "product",
+                "url": "https://example.com/product"
+            }
         }
-}
+    }
+]
28 changes: 8 additions & 20 deletions tests/test_items.py
@@ -1,19 +1,19 @@
 import pytest

 from autoextract_poet.items import (
-    Offer,
-    Breadcrumb,
-    Rating,
     AdditionalProperty,
-    GTIN,
     Article,
+    Breadcrumb,
+    GTIN,
+    Offer,
     Product,
+    Rating,
 )

-from tests import load_fixture
+from tests import load_fixture, item_equals_dict

-example_product_result = load_fixture("sample_product.json")
-example_article_result = load_fixture("sample_article.json")
+example_article_result = load_fixture("sample_article.json")[0]
+example_product_result = load_fixture("sample_product.json")[0]


 @pytest.mark.parametrize(
@@ -28,19 +28,7 @@
 ) # type: ignore
 def test_item(cls, data):
     item = cls.from_dict(data)
-    for key, value in data.items():
-        if key == 'breadcrumbs':
-            value = Breadcrumb.from_list(value)
-        if key == 'offers':
-            value = Offer.from_list(value)
-        if key == 'additionalProperty':
-            value = AdditionalProperty.from_list(value)
-        if key == 'gtin':
-            value = GTIN.from_list(value)
-        if key == 'aggregateRating':
-            value = Rating.from_dict(value)
-
-        assert getattr(item, key) == value
+    assert item_equals_dict(item, data)

     # AttributeError: 'cls' object has no attribute 'foo'
     with pytest.raises(AttributeError):
22 changes: 22 additions & 0 deletions tests/test_page_inputs.py
@@ -0,0 +1,22 @@
+import pytest
+
+from autoextract_poet.page_inputs import (
+    AutoExtractArticleData,
+    AutoExtractProductData,
+)
+
+from tests import load_fixture, item_equals_dict
+
+example_article_result = load_fixture("sample_article.json")
+example_product_result = load_fixture("sample_product.json")
+
+
+@pytest.mark.parametrize("cls, results", [
+    (AutoExtractArticleData, example_article_result),
+    (AutoExtractProductData, example_product_result),
+])
+def test_response_data(cls, results):
+    response_data = cls(results[0])
+    item = response_data.to_item()
+    assert isinstance(item, response_data.item_class)
+    assert item_equals_dict(item, results[0][cls.item_key])