Commit 1d31c6c

fix!: Refactor service usage to rely on service_locator (#691)
### Description

- The `service_container` module has been completely refactored. Its usage has changed, resulting in many changes across the code base.
- While it remains possible to pass "services" directly to components as before, components now rely on the `service_container` internally. They no longer store service instances themselves.
- A new `force` flag has been added to the `service_container`'s setters, which is especially useful for testing purposes.
- This also considerably simplifies the `memory_storage_client`.
- We now have only `set_storage_client`, the same approach as for the `event_manager`. This is more flexible (it allows more environments than just local and cloud), and in the SDK `Actor` we can set them based on `is_at_home`.
- This is a breaking change, but it affects only the `service_container` interface.

### Open discussion

- [x] Should we go further and remove the option to pass configurations, event managers, or storage clients directly to components, requiring them to be set exclusively via the `service_container`? Thoughts are welcome.
  - No
- [x] Better name for `service_container`? `service_locator`?
  - `service_locator`

### Issues

- Closes: #699
- Closes: #539
- Closes: #369
- It also unlocks:
  - #670,
  - and maybe apify/apify-sdk-python#324 (comment).

### Testing

- Existing tests, including those covering the `service_container`, have been updated to reflect these changes.
- New tests verify that the `MemoryStorageClient` respects the `Configuration`.

### Manual reproduction

- The code snippet below demonstrates that the crawler and the `MemoryStorageClient` respect a custom `Configuration`.
- Note: Some fields remain non-working for now. These will be addressed in a future PR, as this refactor is already quite large. However, with the new architecture, those updates should now be easy.
```python
import asyncio

from crawlee.configuration import Configuration
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
from crawlee.service_container import set_configuration


async def main() -> None:
    config = Configuration(persist_storage=False, write_metadata=False)
    set_configuration(config)  # or Crawler(config=config)

    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```

### Checklist

- [x] CI passed
1 parent e44b62b · commit 1d31c6c

34 files changed: +692 −657 lines

.gitignore

Lines changed: 3 additions & 0 deletions

```diff
@@ -13,6 +13,9 @@ __pycache__
 # Poetry
 poetry.toml
 
+# Other Python tools
+.ropeproject
+
 # Mise
 mise.toml
 .mise.toml
```

docs/guides/code/request_storage/purge_explicitly_example.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -4,7 +4,7 @@
 
 
 async def main() -> None:
-    storage_client = MemoryStorageClient()
+    storage_client = MemoryStorageClient.from_config()
     # highlight-next-line
     await storage_client.purge_on_start()
 
```

docs/upgrading/upgrading_to_v0x.md

Lines changed: 9 additions & 0 deletions

```diff
@@ -10,9 +10,18 @@ This page summarizes the breaking changes between Crawlee for Python zero-based
 This section summarizes the breaking changes between v0.4.x and v0.5.0.
 
 ### BeautifulSoupParser
+
 - Renamed `BeautifulSoupParser` to `BeautifulSoupParserType`. Probably used only in type hints. Please replace previous usages of `BeautifulSoupParser` by `BeautifulSoupParserType`.
 - `BeautifulSoupParser` is now a new class that is used in refactored class `BeautifulSoupCrawler`.
 
+### Service locator
+
+- The `crawlee.service_container` was completely refactored and renamed to `crawlee.service_locator`.
+
+### Statistics
+
+- The `crawlee.statistics.Statistics` class do not accept an event manager as an input argument anymore. It uses the default, global one.
+
 ## Upgrading to v0.4
 
 This section summarizes the breaking changes between v0.3.x and v0.4.0.
```

poetry.lock

Lines changed: 21 additions & 21 deletions

Some generated files are not rendered by default.

pyproject.toml

Lines changed: 7 additions & 5 deletions

```diff
@@ -60,13 +60,15 @@ parsel = { version = ">=1.9.0", optional = true }
 playwright = { version = ">=1.27.0", optional = true }
 psutil = ">=6.0.0"
 pydantic = ">=2.8.1, !=2.10.0, !=2.10.1, !=2.10.2"
-pydantic-settings = ">=2.2.0"
+# TODO: relax the upper bound once the issue is resolved:
+# https://github.com/apify/crawlee-python/issues/814
+pydantic-settings = ">=2.2.0 <2.7.0"
 pyee = ">=9.0.0"
 sortedcollections = ">=2.1.0"
 tldextract = ">=5.1.0"
 typer = ">=0.12.0"
 typing-extensions = ">=4.1.0"
-yarl = "^1.18.0"
+yarl = ">=1.18.0"
 
 [tool.poetry.group.dev.dependencies]
 build = "~1.2.0"
@@ -206,9 +208,9 @@ warn_unused_ignores = true
 
 [[tool.mypy.overrides]]
 # Example codes are sometimes showing integration of crawlee with external tool, which is not dependency of crawlee.
-module =[
-    "apify", # Example code shows integration of apify and crawlee.
-    "camoufox" # Example code shows integration of camoufox and crawlee.
+module = [
+    "apify", # Example code shows integration of apify and crawlee.
+    "camoufox", # Example code shows integration of camoufox and crawlee.
 ]
 ignore_missing_imports = true
```

src/crawlee/__init__.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,9 +1,10 @@
 from importlib import metadata
 
 from ._request import Request
+from ._service_locator import service_locator
 from ._types import ConcurrencySettings, EnqueueStrategy, HttpHeaders
 from ._utils.globs import Glob
 
 __version__ = metadata.version('crawlee')
 
-__all__ = ['ConcurrencySettings', 'EnqueueStrategy', 'Glob', 'HttpHeaders', 'Request']
+__all__ = ['ConcurrencySettings', 'EnqueueStrategy', 'Glob', 'HttpHeaders', 'Request', 'service_locator']
```

src/crawlee/_autoscaling/snapshotter.py

Lines changed: 8 additions & 17 deletions

```diff
@@ -10,13 +10,8 @@
 import psutil
 from sortedcontainers import SortedList
 
-from crawlee._autoscaling.types import (
-    ClientSnapshot,
-    CpuSnapshot,
-    EventLoopSnapshot,
-    MemorySnapshot,
-    Snapshot,
-)
+from crawlee import service_locator
+from crawlee._autoscaling.types import ClientSnapshot, CpuSnapshot, EventLoopSnapshot, MemorySnapshot, Snapshot
 from crawlee._utils.byte_size import ByteSize
 from crawlee._utils.context import ensure_context
 from crawlee._utils.docs import docs_group
@@ -26,8 +21,6 @@
 if TYPE_CHECKING:
     from types import TracebackType
 
-    from crawlee.events import EventManager
-
 logger = getLogger(__name__)
 
 T = TypeVar('T')
@@ -45,7 +38,6 @@ class Snapshotter:
 
     def __init__(
         self,
-        event_manager: EventManager,
         *,
         event_loop_snapshot_interval: timedelta = timedelta(milliseconds=500),
         client_snapshot_interval: timedelta = timedelta(milliseconds=1000),
@@ -63,8 +55,6 @@ def __init__(
         """A default constructor.
 
         Args:
-            event_manager: The event manager used to emit system info events. From data provided by this event
-                the CPU and memory usage are read.
             event_loop_snapshot_interval: The interval at which the event loop is sampled.
             client_snapshot_interval: The interval at which the client is sampled.
             max_used_cpu_ratio: Sets the ratio, defining the maximum CPU usage. When the CPU usage is higher than
@@ -90,7 +80,6 @@ def __init__(
         if available_memory_ratio is None and max_memory_size is None:
             raise ValueError('At least one of `available_memory_ratio` or `max_memory_size` must be specified')
 
-        self._event_manager = event_manager
         self._event_loop_snapshot_interval = event_loop_snapshot_interval
         self._client_snapshot_interval = client_snapshot_interval
         self._max_event_loop_delay = max_event_loop_delay
@@ -145,8 +134,9 @@ async def __aenter__(self) -> Snapshotter:
             raise RuntimeError(f'The {self.__class__.__name__} is already active.')
 
         self._active = True
-        self._event_manager.on(event=Event.SYSTEM_INFO, listener=self._snapshot_cpu)
-        self._event_manager.on(event=Event.SYSTEM_INFO, listener=self._snapshot_memory)
+        event_manager = service_locator.get_event_manager()
+        event_manager.on(event=Event.SYSTEM_INFO, listener=self._snapshot_cpu)
+        event_manager.on(event=Event.SYSTEM_INFO, listener=self._snapshot_memory)
         self._snapshot_event_loop_task.start()
         self._snapshot_client_task.start()
         return self
@@ -168,8 +158,9 @@ async def __aexit__(
         if not self._active:
             raise RuntimeError(f'The {self.__class__.__name__} is not active.')
 
-        self._event_manager.off(event=Event.SYSTEM_INFO, listener=self._snapshot_cpu)
-        self._event_manager.off(event=Event.SYSTEM_INFO, listener=self._snapshot_memory)
+        event_manager = service_locator.get_event_manager()
+        event_manager.off(event=Event.SYSTEM_INFO, listener=self._snapshot_cpu)
+        event_manager.off(event=Event.SYSTEM_INFO, listener=self._snapshot_memory)
         await self._snapshot_event_loop_task.stop()
         await self._snapshot_client_task.stop()
         self._active = False
```
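The hunks above follow one pattern: instead of storing an injected event manager, the component resolves it from the global locator inside `__aenter__`/`__aexit__` and attaches or detaches its listeners there. The sketch below reproduces that lifecycle without Crawlee; `Locator`, `Emitter`, and `Component` are hypothetical stand-ins, not the real API:

```python
import asyncio
from collections import defaultdict
from typing import Callable


class Emitter:
    """Tiny stand-in for crawlee's EventManager (on/off/emit)."""

    def __init__(self) -> None:
        self._listeners = defaultdict(list)

    def on(self, event: str, listener: Callable[[], None]) -> None:
        self._listeners[event].append(listener)

    def off(self, event: str, listener: Callable[[], None]) -> None:
        self._listeners[event].remove(listener)

    def emit(self, event: str) -> None:
        for listener in list(self._listeners[event]):
            listener()


class Locator:
    """Stand-in for service_locator: lazily creates and hands out the shared emitter."""

    def __init__(self) -> None:
        self._emitter: Emitter | None = None

    def get_event_manager(self) -> Emitter:
        if self._emitter is None:
            self._emitter = Emitter()
        return self._emitter


locator = Locator()


class Component:
    """Resolves the emitter from the locator instead of storing an injected instance."""

    def __init__(self) -> None:
        self.snapshots = 0

    async def __aenter__(self) -> 'Component':
        # Attach listeners on entry, as Snapshotter does for Event.SYSTEM_INFO.
        locator.get_event_manager().on('SYSTEM_INFO', self._snapshot)
        return self

    async def __aexit__(self, *exc: object) -> None:
        # Detach on exit so the component leaves no dangling listeners behind.
        locator.get_event_manager().off('SYSTEM_INFO', self._snapshot)

    def _snapshot(self) -> None:
        self.snapshots += 1


async def main() -> None:
    component = Component()
    async with component:
        locator.get_event_manager().emit('SYSTEM_INFO')  # listener attached
    locator.get_event_manager().emit('SYSTEM_INFO')  # listener already detached
    print(component.snapshots)  # -> 1


asyncio.run(main())
```

The design benefit is the one the PR description names: the component no longer needs the service passed in, so its constructor shrinks and the service can be swapped globally.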

src/crawlee/_log_config.py

Lines changed: 15 additions & 19 deletions

```diff
@@ -4,13 +4,12 @@
 import logging
 import sys
 import textwrap
-from typing import TYPE_CHECKING, Any
+from typing import Any
 
 from colorama import Fore, Style, just_fix_windows_console
 from typing_extensions import assert_never
 
-if TYPE_CHECKING:
-    from crawlee.configuration import Configuration
+from crawlee import service_locator
 
 just_fix_windows_console()
 
@@ -35,35 +34,32 @@
 _LOG_MESSAGE_INDENT = ' ' * 6
 
 
-def get_configured_log_level(configuration: Configuration) -> int:
-    verbose_logging_requested = 'verbose_log' in configuration.model_fields_set and configuration.verbose_log
+def get_configured_log_level() -> int:
+    config = service_locator.get_configuration()
 
-    if 'log_level' in configuration.model_fields_set:
-        if configuration.log_level == 'DEBUG':
+    verbose_logging_requested = 'verbose_log' in config.model_fields_set and config.verbose_log
+
+    if 'log_level' in config.model_fields_set:
+        if config.log_level == 'DEBUG':
             return logging.DEBUG
-        if configuration.log_level == 'INFO':
+        if config.log_level == 'INFO':
             return logging.INFO
-        if configuration.log_level == 'WARNING':
+        if config.log_level == 'WARNING':
             return logging.WARNING
-        if configuration.log_level == 'ERROR':
+        if config.log_level == 'ERROR':
             return logging.ERROR
-        if configuration.log_level == 'CRITICAL':
+        if config.log_level == 'CRITICAL':
             return logging.CRITICAL
 
-        assert_never(configuration.log_level)
+        assert_never(config.log_level)
 
     if sys.flags.dev_mode or verbose_logging_requested:
         return logging.DEBUG
 
     return logging.INFO
 
 
-def configure_logger(
-    logger: logging.Logger,
-    configuration: Configuration,
-    *,
-    remove_old_handlers: bool = False,
-) -> None:
+def configure_logger(logger: logging.Logger, *, remove_old_handlers: bool = False) -> None:
     handler = logging.StreamHandler()
     handler.setFormatter(CrawleeLogFormatter())
 
@@ -72,7 +68,7 @@ def configure_logger(
             logger.removeHandler(old_handler)
 
     logger.addHandler(handler)
-    logger.setLevel(get_configured_log_level(configuration))
+    logger.setLevel(get_configured_log_level())
 
 
 class CrawleeLogFormatter(logging.Formatter):
```
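The refactored `get_configured_log_level` translates the configured level name into a numeric `logging` constant with a chain of `if` statements (which lets `assert_never` exhaustively check the literal type). The same name-to-constant mapping can be sketched with a plain lookup table; the `level_from_name` helper below is illustrative only, not part of Crawlee:

```python
import logging

# Explicit table mirroring the names Crawlee's Configuration accepts.
_LEVELS = {
    'DEBUG': logging.DEBUG,
    'INFO': logging.INFO,
    'WARNING': logging.WARNING,
    'ERROR': logging.ERROR,
    'CRITICAL': logging.CRITICAL,
}


def level_from_name(name: str) -> int:
    """Translate a level name such as 'WARNING' into its numeric logging constant."""
    try:
        return _LEVELS[name]
    except KeyError as exc:
        raise ValueError(f'Unknown log level: {name!r}') from exc


print(level_from_name('WARNING'))  # -> 30
```

The `if` chain in the diff trades this compactness for static exhaustiveness checking over the `log_level` literal type.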

src/crawlee/_service_locator.py

Lines changed: 98 additions & 0 deletions (new file)

```python
from __future__ import annotations

from crawlee._utils.docs import docs_group
from crawlee.base_storage_client._base_storage_client import BaseStorageClient
from crawlee.configuration import Configuration
from crawlee.errors import ServiceConflictError
from crawlee.events._event_manager import EventManager


@docs_group('Classes')
class ServiceLocator:
    """Service locator for managing the services used by Crawlee.

    All services are initialized to its default value lazily.
    """

    def __init__(self) -> None:
        self._configuration: Configuration | None = None
        self._event_manager: EventManager | None = None
        self._storage_client: BaseStorageClient | None = None

        # Flags to check if the services were already set.
        self._configuration_was_set = False
        self._event_manager_was_set = False
        self._storage_client_was_set = False

    def get_configuration(self) -> Configuration:
        """Get the configuration."""
        if self._configuration is None:
            self._configuration = Configuration()

        return self._configuration

    def set_configuration(self, configuration: Configuration) -> None:
        """Set the configuration.

        Args:
            configuration: The configuration to set.

        Raises:
            ServiceConflictError: If the configuration was already set.
        """
        if self._configuration_was_set:
            raise ServiceConflictError(Configuration, configuration, self._configuration)

        self._configuration = configuration
        self._configuration_was_set = True

    def get_event_manager(self) -> EventManager:
        """Get the event manager."""
        if self._event_manager is None:
            from crawlee.events import LocalEventManager

            self._event_manager = LocalEventManager()

        return self._event_manager

    def set_event_manager(self, event_manager: EventManager) -> None:
        """Set the event manager.

        Args:
            event_manager: The event manager to set.

        Raises:
            ServiceConflictError: If the event manager was already set.
        """
        if self._event_manager_was_set:
            raise ServiceConflictError(EventManager, event_manager, self._event_manager)

        self._event_manager = event_manager
        self._event_manager_was_set = True

    def get_storage_client(self) -> BaseStorageClient:
        """Get the storage client."""
        if self._storage_client is None:
            from crawlee.memory_storage_client import MemoryStorageClient

            self._storage_client = MemoryStorageClient.from_config()

        return self._storage_client

    def set_storage_client(self, storage_client: BaseStorageClient) -> None:
        """Set the storage client.

        Args:
            storage_client: The storage client to set.

        Raises:
            ServiceConflictError: If the storage client was already set.
        """
        if self._storage_client_was_set:
            raise ServiceConflictError(BaseStorageClient, storage_client, self._storage_client)

        self._storage_client = storage_client
        self._storage_client_was_set = True


service_locator = ServiceLocator()
```
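The set-once, lazily-defaulting behaviour of the class above can be exercised in isolation. The sketch below is a minimal, dependency-free re-implementation of the same pattern; `MiniLocator`, the `Config` stand-in, and the plain `RuntimeError` are hypothetical substitutes for the real `ServiceLocator`, `Configuration`, and `ServiceConflictError`:

```python
from __future__ import annotations


class Config:
    """Stand-in for crawlee's Configuration."""

    def __init__(self, label: str = 'default') -> None:
        self.label = label


class MiniLocator:
    """Minimal re-implementation of the set-once, lazily-defaulting locator."""

    def __init__(self) -> None:
        self._configuration: Config | None = None
        self._configuration_was_set = False

    def get_configuration(self) -> Config:
        # Lazily fall back to a default instance on first access.
        if self._configuration is None:
            self._configuration = Config()
        return self._configuration

    def set_configuration(self, configuration: Config) -> None:
        # A second explicit set is a conflict, mirroring ServiceConflictError.
        if self._configuration_was_set:
            raise RuntimeError('Configuration was already set.')
        self._configuration = configuration
        self._configuration_was_set = True


locator = MiniLocator()
locator.set_configuration(Config('custom'))
print(locator.get_configuration().label)  # -> custom
```

Note that, as in the real class, a getter-created default does not trip the conflict flag; only an explicit `set_configuration` does, so a later set can still replace a lazily created default.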
