Create Page Object Input Providers #13

Merged · 51 commits · Sep 1, 2020

Commits
515e8a5
Create Page Object Input Providers
victor-torres Aug 13, 2020
18940f2
Update scrapy-poet dependency on setup.py
victor-torres Aug 14, 2020
8fea07c
Update scrapinghub-autoextract dependency on setup.py
victor-torres Aug 14, 2020
751d13e
Remove Scrapy from setup.py dependencies
victor-torres Aug 14, 2020
3505fc0
Use current scrapy-poet's providers API
victor-torres Aug 14, 2020
2d5dfba
Remove "Response" from class name
victor-torres Aug 17, 2020
7356d56
Replace "provides" decorator with an install method that makes use of…
victor-torres Aug 17, 2020
adbbd4d
Avoid duplication of provided class name
victor-torres Aug 17, 2020
a3e844b
Add docstrings
victor-torres Aug 17, 2020
c6783db
Add tests
victor-torres Aug 17, 2020
25c4d0d
Temporarily fix setup.py using git while autoextract-poet is not avai…
victor-torres Aug 17, 2020
30af1cb
Temporarily fix setup.py using git while autoextract-poet is not avai…
victor-torres Aug 17, 2020
f612c2f
Change protocol from git to https since it's a public repository
victor-torres Aug 17, 2020
dc5e194
Fix autoextract-poet dependency after its release on PyPI
victor-torres Aug 18, 2020
2365429
Remove page_type class attribute
victor-torres Aug 18, 2020
5de020c
Drop Python 3.5 and add Python 3.8 support
victor-torres Aug 18, 2020
eb99fb7
Calling `.as_dict()` here is not needed, we can just pass the request…
victor-torres Aug 19, 2020
38ce3fa
Create AUTOEXTRACT_MAX_QUERY_ERROR_RETRIES settings which defaults to 3
victor-torres Aug 19, 2020
76a0e9e
Improve documentation on readme file
victor-torres Aug 19, 2020
0e11b1b
Get AutoExtract endpoint from Scrapy settings
victor-torres Aug 19, 2020
b39e6c7
Fix rst links
victor-torres Aug 19, 2020
3676d89
Improve documentation about providers configuration
victor-torres Aug 20, 2020
a705989
Improve documentation about middleware and providers limitations
victor-torres Aug 20, 2020
48f81d8
Update minimum Python version on readme file since Python 3.5 has bee…
victor-torres Aug 21, 2020
3045184
Add documentation about duplicate requests and how DummyResponse can …
victor-torres Aug 21, 2020
7d4fb93
Fix TypeError: 'coroutine' object is not subscriptable
victor-torres Aug 22, 2020
63cddd4
Update scrapinghub-autoextract dependency
victor-torres Aug 22, 2020
96bf868
Include documentation about Scrapy's asyncio support and the need to c…
victor-torres Aug 22, 2020
44836b0
provided_class class attribute should not be optional since we depend…
victor-torres Aug 24, 2020
cda77ac
Remove unused import
victor-torres Aug 24, 2020
69cd2a7
Raise QueryLevelError exceptions when response data contains an error
victor-torres Aug 24, 2020
7c983da
Improve exception message
victor-torres Aug 24, 2020
686c31f
Add comment to provided_class attribute
victor-torres Aug 24, 2020
22ef043
Better documentation for AUTOEXTRACT_USER when using providers
victor-torres Aug 24, 2020
b1464b2
Improve QueryError exception based on @ivanprado's suggestions
victor-torres Aug 24, 2020
2e950f1
Update README.rst
victor-torres Aug 25, 2020
54574f7
Improve documentation about middleware limitations
victor-torres Aug 25, 2020
4de0834
Improve readme file's examples by surrounding parser by a real spider…
victor-torres Aug 26, 2020
5317c75
Improve readme file by incrementing the settings.py example code snip…
victor-torres Aug 26, 2020
5b26ca0
Improve readme file describing common exceptions and how to capture t…
victor-torres Aug 26, 2020
fb6f0c3
Trying to escape single quotes
victor-torres Aug 26, 2020
d3c54fa
Trying to fix single quotes
victor-torres Aug 26, 2020
22078bd
Fix typo with RST links/references
victor-torres Aug 26, 2020
a3dc46b
Add documentation about our QueryError exception
victor-torres Aug 26, 2020
c4bd3d0
Remove QueryError exception because it's never raised to the user
victor-torres Aug 31, 2020
2c930e4
Improve documentation message
victor-torres Aug 31, 2020
76e2679
Remove QueryRetryError from list of exceptions
victor-torres Sep 1, 2020
300a0e2
Update README.rst
victor-torres Sep 1, 2020
581ca7e
Remove outdated example
victor-torres Sep 1, 2020
ac85a96
Update README.rst
victor-torres Sep 1, 2020
aa26041
Remove one of the middleware limitations because according to @kmike,…
victor-torres Sep 1, 2020
6 changes: 3 additions & 3 deletions .travis.yml
@@ -2,13 +2,13 @@ language: python

matrix:
  include:
  - python: 3.5
    env: TOXENV=py35
  - python: 3.6
    env: TOXENV=py36
  - python: 3.7
    env: TOXENV=py37
  - python: 3.7
  - python: 3.8
    env: TOXENV=py38
  - python: 3.8
    env: TOXENV=flake8

# command to install dependencies
192 changes: 174 additions & 18 deletions README.rst
@@ -28,11 +28,32 @@ Installation

pip install scrapy-autoextract

scrapy-autoextract requires Python 3.5+
scrapy-autoextract requires Python 3.6+


Usage
=====

There are two different ways to consume the AutoExtract API with this library:

* using our Scrapy middleware
* using our Page Object providers

The middleware
--------------

The middleware is opt-in and can be explicitly enabled per request,
with the ``{'autoextract': {'enabled': True}}`` request meta.
All the options below can be set either in the project settings file,
or just for specific spiders, in the ``custom_settings`` dict.
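
For example, a minimal sketch of a spider enabling AutoExtract for a single request
(the spider name and URL here are illustrative)::

    import scrapy

    class ArticleSpider(scrapy.Spider):
        name = "article_example"

        def start_requests(self):
            # Opt in to AutoExtract for this request via its meta
            yield scrapy.Request(
                "http://example.com/article",
                meta={'autoextract': {'enabled': True}},
            )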

Within the spider, consuming the AutoExtract result is as easy as::

    def parse(self, response):
        yield response.meta['autoextract']

Configuration
=============
^^^^^^^^^^^^^

Add the AutoExtract downloader middleware in the settings file::

@@ -42,16 +63,132 @@ Add the AutoExtract downloader middleware in the settings file::

Note that this should be the last downloader middleware to be executed.
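
A sketch of what that configuration looks like, assuming the middleware class is exposed as
``scrapy_autoextract.AutoExtractMiddleware`` (the priority value here is only an illustration,
pick one that keeps it last among your downloader middlewares)::

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_autoextract.AutoExtractMiddleware': 543,
    }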

The providers
-------------

Usage
=====
Another way of consuming the AutoExtract API is using the Page Objects pattern
proposed by the `web-poet`_ library and implemented by `scrapy-poet`_.

The middleware is opt-in and can be explicitly enabled per request,
with the ``{'autoextract': {'enabled': True}}`` request meta.
All the options below can be set either in the project settings file,
or just for specific spiders, in the ``custom_settings`` dict.
Page Objects and their returned Items are defined by the `autoextract-poet`_
library.

Within the spider, consuming the AutoExtract result is as easy as::

    import scrapy
    from autoextract_poet.page_inputs import AutoExtractArticleData

    class SampleSpider(scrapy.Spider):

        name = "sample"

        def parse(self, response, article: AutoExtractArticleData):
            # We're making two requests here:
            # - one through Scrapy to build the response argument
            # - another through providers to build the article argument
            yield article.to_item()

Note that in the example above, we're going to perform two requests:

* one goes through Scrapy (it might use Crawlera, Splash or no proxy at all, depending on your configuration)
* another goes through AutoExtract API using `scrapinghub-autoextract`_

If you don't need the additional request going through Scrapy,
you can annotate the response argument of your callback with ``DummyResponse``.
This way, the Scrapy download is skipped and only the AutoExtract API is queried.

For example::

    import scrapy
    from autoextract_poet.page_inputs import AutoExtractArticleData
    from scrapy_poet.utils import DummyResponse

Available settings:
    class SampleSpider(scrapy.Spider):

        name = "sample"

        def parse(self, response: DummyResponse, article: AutoExtractArticleData):
            # We're making a single request here to build the article argument
            yield article.to_item()

Configuration
^^^^^^^^^^^^^

First, you need to configure scrapy-poet as described in `scrapy-poet's documentation`_
and then enable AutoExtract providers by adding the following code to Scrapy's ``settings.py`` file::

    # Install AutoExtract providers
    import scrapy_autoextract.providers
    scrapy_autoextract.providers.install()

    # Enable scrapy-poet's provider injection middleware
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_poet.InjectionMiddleware': 543,
    }

    # Configure Twisted's reactor for asyncio support on Scrapy
    TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

Currently, our providers are implemented using asyncio.
Scrapy has had asyncio support since version 2.0,
but as of Scrapy 2.3 you need to enable it manually by configuring Twisted's default reactor.
Check `Scrapy's asyncio documentation`_ for more information.

Checklist:

* scrapy-poet is installed and downloader/injector middleware is configured
* autoextract-poet is installed (page inputs are imported from this lib)
* providers are installed on settings.py
* Scrapy's asyncio support is enabled on settings.py

Now you should be ready to use our AutoExtract providers.

Exceptions
^^^^^^^^^^

While fetching data from the AutoExtract API, providers might raise some exceptions.
Those exceptions might come from scrapy-autoextract providers themselves,
`scrapinghub-autoextract`_, or Tenacity, the library used to implement retries.
For example:

* ``autoextract.aio.errors.RequestError``: raised when a `Request-level error`_ is returned
* ``tenacity.RetryError``: raised when an error persists even after the retries

Check `scrapinghub-autoextract's async errors`_ for exception definitions.

You can capture those exceptions using an error callback (``errback``)::

    import scrapy
    from autoextract.aio.errors import RequestError
    from autoextract_poet.page_inputs import AutoExtractArticleData
    from scrapy_poet.utils import DummyResponse
    from tenacity import RetryError
    from twisted.python.failure import Failure

    class SampleSpider(scrapy.Spider):

        name = "sample"
        urls = [...]

        def start_requests(self):
            for url in self.urls:
                yield scrapy.Request(url, callback=self.parse_article, errback=self.errback_article)

        def parse_article(self, response: DummyResponse, article: AutoExtractArticleData):
            yield article.to_item()

        def errback_article(self, failure: Failure):
            if failure.check(RequestError):
                self.logger.error(f"RequestError on {failure.request.url}")

            if failure.check(RetryError):
                self.logger.error(f"RetryError on {failure.request.url}")

Contributor:

Maybe it would be nice to show how to get the wrapped exception, as the relevant information is there (and probably it will be a RequestError).

Contributor:

But well, it is OK to leave it as is, given that it is likely that we will remove RetryError altogether.


See `Scrapy documentation <https://docs.scrapy.org/en/latest/topics/request-response.html#using-errbacks-to-catch-exceptions-in-request-processing>`_
for more details on how to capture exceptions using request's errback.

Settings
========

Middleware settings
-------------------

- ``AUTOEXTRACT_USER`` [mandatory] is your AutoExtract API key
- ``AUTOEXTRACT_URL`` [optional] the AutoExtract service url. Defaults to autoextract.scrapinghub.com.
@@ -67,11 +204,12 @@ Available settings:
- If set to ``SlotPolicy.PER_DOMAIN``, then consider setting ``SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'``
to make better use of AutoExtract concurrency and avoid delays.
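
For instance, the priority queue tweak mentioned above, as it would appear in ``settings.py``
(a sketch)::

    SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'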

Within the spider, consuming the AutoExtract result is as easy as::

    def parse(self, response):
        yield response.meta['autoextract']
Provider settings
-----------------

- ``AUTOEXTRACT_USER`` [optional] is your AutoExtract API key. Defaults to ``SCRAPINGHUB_AUTOEXTRACT_KEY`` environment variable.
- ``AUTOEXTRACT_URL`` [optional] the AutoExtract service url. Defaults to the official AutoExtract endpoint.
- ``AUTOEXTRACT_MAX_QUERY_ERROR_RETRIES`` [optional] Max number of retries for Query-level errors. Defaults to ``3``.
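
For instance, a sketch of these provider settings in ``settings.py`` (values are placeholders)::

    # Optional if the SCRAPINGHUB_AUTOEXTRACT_KEY environment variable is set
    AUTOEXTRACT_USER = 'your-autoextract-api-key'

    # Retry Query-level errors more aggressively than the default of 3
    AUTOEXTRACT_MAX_QUERY_ERROR_RETRIES = 5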

Limitations
===========

Member:

I think it'd be nice to mention some of the limitations of the Scrapy middleware, as compared to the scrapy-poet provider:

* 429 errors are handled as standard retries when using Scrapy middleware, but they're handled properly and automatically with scrapy-poet, as it relies on scrapinghub-autoextract. One may lose some responses with the middleware. With scrapy-poet there is no need to change RETRY_HTTP_CODES.
* retrying of query-level errors is only supported in the scrapy-poet integration (?)
* retries overall have a better behavior with the scrapy-poet integration
* with the scrapy-poet integration, retry requests don't go through Scrapy
* not all data types are supported with scrapy-poet; currently only article and product are supported

Contributor Author:

Those are good comments that are worth mentioning in our limitations section. I've just pushed a new commit adding this information. Could you check them, please? If you have further suggestions or would like to change the way something is written, please feel free to submit a commit to this branch making these changes to avoid back and forth.

@@ -89,8 +227,26 @@ When using the AutoExtract middleware, there are some limitations.
so it's best to use ``AUTOTHROTTLE_ENABLED=False`` in the settings.
* Redirects are handled by AutoExtract, not by Scrapy,
so these kinds of middlewares might have no effect
* Retries should be disabled, because AutoExtract handles them internally
  (use ``RETRY_ENABLED=False`` in the settings).
  There is an exception: if too many requests are sent in
  a short amount of time and AutoExtract returns HTTP code 429,
  it's best to use ``RETRY_HTTP_CODES=[429]``.
* 429 errors could be handled as standard retries when using Scrapy middleware,
but they're handled properly and automatically with scrapy-poet integration,
as it relies on `scrapinghub-autoextract`_.
You may lose some responses with the middleware approach.
* Overall, retries have a better behavior with scrapy-poet integration
and it includes support for automatic Query-level errors retries with
no need to change ``RETRY_HTTP_CODES``.

When using the AutoExtract providers, be aware that:

* With scrapy-poet integration, retry requests don't go through Scrapy
* Not all data types are supported with scrapy-poet,
currently only Articles and Products are supported

.. _`web-poet`: https://github.com/scrapinghub/web-poet
.. _`scrapy-poet`: https://github.com/scrapinghub/scrapy-poet
.. _`autoextract-poet`: https://github.com/scrapinghub/autoextract-poet
.. _`scrapinghub-autoextract`: https://github.com/scrapinghub/scrapinghub-autoextract
.. _`scrapinghub-autoextract's async errors`: https://github.com/scrapinghub/scrapinghub-autoextract/blob/master/autoextract/aio/errors.py
.. _`scrapy-poet's documentation`: https://scrapy-poet.readthedocs.io/en/latest/intro/tutorial.html#configuring-the-project
.. _`Scrapy's asyncio documentation`: https://docs.scrapy.org/en/latest/topics/asyncio.html
.. _`Request-level error`: https://doc.scrapinghub.com/autoextract.html#request-level
.. _`Query-level error`: https://doc.scrapinghub.com/autoextract.html#query-level
114 changes: 114 additions & 0 deletions scrapy_autoextract/providers.py
@@ -0,0 +1,114 @@
from typing import ClassVar, Type

from autoextract.aio import request_raw
from autoextract.request import Request as AutoExtractRequest
from autoextract_poet.page_inputs import (
    AutoExtractArticleData,
    AutoExtractProductData,
)
from scrapy import Request
from scrapy.settings import Settings
from scrapy.statscollectors import StatsCollector
from scrapy_poet.page_input_providers import (
    PageObjectInputProvider,
    register,
)


class QueryError(Exception):

    def __init__(self, query: dict, message: str):
        self.query = query
        self.message = message

    def __str__(self):
        return f"QueryError: query={self.query}, message='{self.message}'"


class _Provider(PageObjectInputProvider):
    """An interface that describes a generic AutoExtract Provider.

    It should not be used publicly as it serves the purpose of being a base
    class for more specific providers such as Article and Product providers.
    """

    provided_class: ClassVar[Type]  # needs item_key attr and to_item method

    def __init__(
            self,
            request: Request,
            settings: Settings,
            stats: StatsCollector,
    ):
        """Initialize provider storing its dependencies as attributes."""
        self.request = request
        self.stats = stats
        self.settings = settings

    async def __call__(self):
        """Make an AutoExtract request and build a Page Input of provided class
        based on API response data.
        """
        page_type = self.get_page_type()
        self.stats.inc_value(f"autoextract/{page_type}/total")

        request = AutoExtractRequest(
            url=self.request.url,
            pageType=page_type,
        )
        api_key = self.settings.get("AUTOEXTRACT_USER")
        endpoint = self.settings.get("AUTOEXTRACT_URL")
        max_query_error_retries = self.settings.getint(
            "AUTOEXTRACT_MAX_QUERY_ERROR_RETRIES", 3
        )

        try:
            response = await request_raw(
                [request],
                api_key=api_key,
                endpoint=endpoint,
                max_query_error_retries=max_query_error_retries
            )
        except Exception:
            self.stats.inc_value(f"autoextract/{page_type}/error/request")
            raise

        data = response[0]

        if "error" in data:
            self.stats.inc_value(f"autoextract/{page_type}/error/query")
            raise QueryError(data["query"], data["error"])

        self.stats.inc_value(f"autoextract/{page_type}/success")
        return self.provided_class(data=data)

    @classmethod
    def register(cls):
        """Register this provider for its provided class on scrapy-poet
        registry. This will make it possible to declare provided class as
        a callback dependency when writing Scrapy spiders.
        """
        register(cls, cls.provided_class)

    @classmethod
    def get_page_type(cls) -> str:
        """Page type is defined by the class attribute `item_key` available on
        `autoextract_poet.page_inputs` classes.
        """
        return cls.provided_class.item_key


class ArticleDataProvider(_Provider):

    provided_class = AutoExtractArticleData


class ProductDataProvider(_Provider):

    provided_class = AutoExtractProductData


def install():
    """Register all providers for their respective provided classes."""
    ArticleDataProvider.register()
    ProductDataProvider.register()
7 changes: 6 additions & 1 deletion setup.py
@@ -26,6 +26,11 @@ def get_version():
    long_description=open('README.rst').read(),
    url='https://github.com/scrapinghub/scrapy-autoextract',
    packages=find_packages(),
    install_requires=[
        'autoextract-poet>=0.0.1',

Reviewer:

I wonder if these should be installed by default, even if the user does not want it.
It can help to reduce the end package size on the user side.
Also, we might want to use scrapy-autoextract in a Python version that doesn't support asyncio, so I'm not sure if it can be a problem.

Contributor Author (@victor-torres, Aug 26, 2020):

> I wonder if these should be installed by default, even if the user does not want it.

You mean, if users are going to use the middleware approach, this library will be installed as if they were going to use providers, right?

Well, I'm probably biased, but I believe libraries should include all their dependencies even if users end up not using all of them. It represents less trouble and makes it easier for users when they decide it's time to try the other way around.

The alternative would be something like we have/used to have on Spidermon, something like pip install spidermon[monitoring], but I'm not a big fan of this approach.

> It can help to reduce the end package size, on the user side.

I think the difference in package size would be negligible.

> Also, we might want to use scrapy-autoextract in a python version that doesn't support asyncio, so not sure if it can be a problem

I'm not sure it would be a good idea to use those Scrapy providers in a sync way because it could block our requests cycle. In any case, I believe asyncio is supported at least since Python 3.6, and Python 3.5 is currently deprecated.

Reviewer:

> I'm not sure it would be a good idea to use those Scrapy providers in a sync way because it could block our requests cycle. In any case, I believe asyncio is supported at least since Python 3.6, and Python 3.5 is currently deprecated.

Yeah, but what if I just want to use the middleware version?
And I'm on Python 3.4 and a new version of this library is released with a fix in the middleware?

> I think the difference in package size would be negligible.

No hard feelings on this one.
Just that we count the popularity of a package by downloads 😛

About the installation, it will depend on the first case: whether we want to install async things by default, even if it may break future fixes.

        'scrapinghub-autoextract>=0.5.1',
        'scrapy-poet>=0.0.3',
    ],
    keywords='scrapy autoextract middleware',
    classifiers=[
        'Development Status :: 4 - Beta',
@@ -35,9 +40,9 @@ def get_version():
        'Operating System :: OS Independent',
        'License :: OSI Approved :: BSD License',
        'Programming Language :: Python :: 3 :: Only',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
        'Framework :: Scrapy',
    ],
)