Confluence Connector Pagination #3320

Open
WildDogOne opened this issue Mar 22, 2025 · 1 comment · May be fixed by #3321
WildDogOne commented Mar 22, 2025

Bug Description

A full sync with the Confluence connector only pulls 50 documents when a CQL query is set.

To Reproduce

Set a CQL query as an "advanced rule" in the connector's "sync rules", for example:
    [
      {
        "query": "created >= now('-5y')"
      }
    ]

Expected behavior

Pull all Confluence content from the last 5 years (obvious overkill, but that is a different story).

Environment

8.17.3

Solution

I have been playing around with the "paginated_api_call" function in "confluence.py" and noticed that it looks for a next link to decide whether another page exists.
However, according to the API documentation, the /api/search call does not seem to return one:
https://docs.atlassian.com/atlassian-confluence/REST/6.6.0/#content-search

It seems that pagination for a search has to be done by moving the start window instead.
Here is a quick proof of concept that still follows the next link when it is present, in case another function needs it:

    async def paginated_api_call(self, url_name, **url_kwargs):
        """Make a paginated API call for Confluence objects using the passed url_name.
        Args:
            url_name (str): URL Name to identify the API endpoint to hit
        Yields:
            response: JSON response.
        """
        base_url = os.path.join(self.host_url, URLS[url_name].format(**url_kwargs))
        # Like the "&start=" suffix itself, this assumes base_url already has a query string.
        url = f"{base_url}&start=0"

        while True:
            try:
                self._logger.debug(f"Starting pagination for API endpoint {url}")
                response = await self.api_call(url=url)
                json_response = await response.json()
                yield json_response

                # "_links" may be missing, so fall back to an empty dict.
                links = json_response.get("_links") or {}
                start = json_response.get("start", 0)
                size = json_response.get("size", 0)
                total_size = json_response.get("totalSize", 0)

                if links.get("next"):
                    # Follow the server-provided next link when there is one.
                    url = os.path.join(self.host_url, links.get("next")[1:])
                elif size > 0 and start + size < total_size:
                    # Otherwise move the start window forward by one page.
                    url = f"{base_url}&start={start + size}"
                else:
                    # No next link and the window reached totalSize: done.
                    return
            except Exception as exception:
                self._logger.warning(
                    f"Skipping data for type {url_name} from {base_url}. Exception: {exception}."
                )
                break
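
To show the start-window idea in isolation, here is a minimal, self-contained sketch. `paginate_by_offset`, `fake_fetch`, and `collect_all` are hypothetical names made up for this illustration and are not part of `confluence.py`; `fake_fetch` is an in-memory stand-in for the HTTP call, and the page size of 50 and the total of 120 hits are assumed values. Only the response fields `start`, `size`, and `totalSize` come from the proof of concept above.

```python
import asyncio


async def paginate_by_offset(fetch_page, limit=50):
    """Yield pages until start + size reaches totalSize.

    fetch_page is a hypothetical coroutine taking (start, limit) and
    returning a dict with "results", "start", "size" and "totalSize".
    """
    start = 0
    while True:
        page = await fetch_page(start, limit)
        yield page
        size = page.get("size", 0)
        total = page.get("totalSize", 0)
        # Stop on an empty page too, so a bad response cannot loop forever.
        if size == 0 or start + size >= total:
            return
        start += size


async def fake_fetch(start, limit, total=120):
    """Hypothetical in-memory stand-in for a search endpoint with 120 hits."""
    results = list(range(start, min(start + limit, total)))
    return {"results": results, "start": start, "size": len(results), "totalSize": total}


async def collect_all():
    """Gather every result across all pages."""
    items = []
    async for page in paginate_by_offset(fake_fetch):
        items.extend(page["results"])
    return items


if __name__ == "__main__":
    print(len(asyncio.run(collect_all())))  # prints 120
```

With a page size of 50 the generator requests start=0, 50, and 100, then stops because 100 + 20 reaches the total. Stopping after the first page would return only the first 50 hits, which matches the behaviour described in this issue.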

While debugging this I also found another issue in the function "search_by_query": it never checks whether "entity_details" exists, so if entity_details is None, the following .get() call fails.
I fixed this with an additional condition:

    async def search_by_query(self, query):
        async for entity in self.confluence_client.search_by_query(query=query):
            # entity can be space or content
            entity_details = entity.get(SPACE) or entity.get(CONTENT)

            if not entity_details:
                continue
            if (entity_details.get("type", "") == "attachment"
                and entity_details.get("container", {}).get("title") is None
            ):
                continue
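
The effect of the extra condition can be checked in isolation with a small sketch. The constant values `"space"` and `"content"` and the helper name `keep_entity` are assumptions made for this example, not taken from the connector source; the checks themselves mirror the snippet above.

```python
# Assumed values for the connector's SPACE and CONTENT constants.
SPACE = "space"
CONTENT = "content"


def keep_entity(entity):
    """Return True if a search hit should be processed (hypothetical helper)."""
    entity_details = entity.get(SPACE) or entity.get(CONTENT)
    if not entity_details:
        # Neither key is present (or it is empty), so there is nothing to read.
        return False
    if (
        entity_details.get("type", "") == "attachment"
        and entity_details.get("container", {}).get("title") is None
    ):
        # Attachments without a titled container are skipped, as in the original check.
        return False
    return True


if __name__ == "__main__":
    print(keep_entity({"title": "hit without space or content"}))  # prints False
    print(keep_entity({"content": {"type": "page"}}))  # prints True
```

Without the `if not entity_details` guard, the first call would raise `AttributeError`, because `None` has no `.get` method.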
@WildDogOne WildDogOne added the bug Something isn't working label Mar 22, 2025
@WildDogOne WildDogOne linked a pull request Mar 22, 2025 that will close this issue
11 tasks
@seanstory
Member

Thanks for filing, @WildDogOne! Indeed, this feels like a bug worth fixing.
