HtmlButton.click() does not work #728

Closed
thedigginman opened this issue Feb 15, 2024 · 4 comments

@thedigginman

I am trying to scrape this website for DATACENTER. I can load the first page and scrape the information I need. The page contains a row of buttons for paginating to the next page of results. I can find the button, but calling button.click() does not give me a new page of data; I get the same data again. Below is the test code I am using. When I paginate manually, the URL stays the same (https://www.datacenters.com/locations) and only the data within the page changes, so the page is doing something I am not aware of or do not understand. I have searched similar questions and answers about this issue, but none of the suggestions I tried worked.

Here is the code that replicates the problem.

package com.mycompany.datacenter.app;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import org.w3c.dom.Node;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class App {


    public static void main(String[] args) throws IOException {
        // get the locations page
        String baseURL = "https://www.datacenters.com";
        String locURL = baseURL + "/locations";

        WebClient client = new WebClient();
        client.getOptions().setTimeout(60000);
        client.getOptions().setRedirectEnabled(true);
        client.getOptions().setJavaScriptEnabled(true);
        client.getOptions().setThrowExceptionOnFailingStatusCode(false);
        client.getOptions().setThrowExceptionOnScriptError(false);
        client.getOptions().setCssEnabled(true);
        client.getOptions().setUseInsecureSSL(true);
        client.setAjaxController(new NicelyResynchronizingAjaxController());

        HtmlPage page = client.getPage(locURL);

        Map<String, DCInfo> dcInfoList = new HashMap<>();
        do {
            // get all the location urls on this page
            DomNodeList<DomNode> list = page.querySelectorAll("a[class^='LocationTile__anchor']");
            for (DomNode node : list) {
                DCInfo dcInfo = new DCInfo();

                // get owner text
                DomNode ownerNode = node.querySelector("div[class^='LocationTile__provider']");
                dcInfo.setDcOwner(ownerNode.getFirstChild().getNodeValue());

                // get location name
                DomNode locationNode = node.querySelector("div[class^='LocationTile__name']");
                dcInfo.setDcName(locationNode.getFirstChild().getNodeValue());

                // get location address
                DomNode locAddressNode = node.querySelector("div[class^='LocationTile__address']");
                dcInfo.setDcAddress(locAddressNode.getFirstChild().getNodeValue());

                // get location specific URL
                Node hRef = node.getAttributes().getNamedItem("href");
                dcInfo.setDcLocationLink(baseURL + hRef.getNodeValue());
                dcInfoList.put(baseURL + '\\' + hRef.getNodeValue(), dcInfo);
                System.out.println(dcInfo);
            }

            // do the pagination
            HtmlButton button = page.querySelector("nav button[class^='Control__control']:last-child");
            HtmlPage newPage = button.click();
            client.waitForBackgroundJavaScript(10000);
            page = newPage;
        } while (page != null);

        client.close();
    }
}
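
A note on the loop above, offered as a sketch and not as a confirmed fix: when a click triggers no navigation, HtmlUnit's click() returns the page that is already in the window, so `page != null` never becomes false and the same tiles can be read repeatedly. One way to make this visible is to stop as soon as a "page" contributes nothing new. The example below uses only the JDK. `fetchPage` is a hypothetical stand-in for "click the next button, wait for background JavaScript, then re-query the tiles", and the paths in `main` are made up.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

class PaginationSketch {

    /**
     * Collects items page by page and stops when a page adds nothing new
     * or when maxPages is reached, whichever comes first.
     *
     * fetchPage is hypothetical: in a scraper it would click the pager button,
     * wait for background JavaScript, and return the hrefs found in the DOM.
     */
    static Set<String> collectAll(IntFunction<List<String>> fetchPage, int maxPages) {
        Set<String> seen = new LinkedHashSet<>();
        for (int pageNo = 1; pageNo <= maxPages; pageNo++) {
            List<String> items = fetchPage.apply(pageNo);
            // addAll returns false when every item was already present,
            // i.e. the click produced the same data again.
            boolean addedAny = seen.addAll(items);
            if (!addedAny) {
                break;
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Fake listing with three pages; any later "click" keeps returning page 3.
        List<List<String>> pages = List.of(
                List.of("/a", "/b"),
                List.of("/c", "/d"),
                List.of("/e"));

        Set<String> all = collectAll(n -> pages.get(Math.min(n, pages.size()) - 1), 100);
        System.out.println(all); // prints [/a, /b, /c, /d, /e]
    }
}
```

If the very first click already returns only known items, the loop ends after one extra fetch. That matches the symptom described above and suggests the in-page update never happened, as opposed to the selector being wrong.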
@fleboulch

fleboulch commented Feb 19, 2024

I'm not sure this is related to the current issue, but migrating from 3.9.0 to 3.11.0 broke a click on a pagination link for me.

URL scraped: https://www.opera-bordeaux.com/saison

Behaviour with 3.9.0: everything works and no errors are logged.
After migrating to 3.11.0: the click on the pagination breaks and errors are logged.

My code looks like this:

fun fetch(now: LocalDateTime): List<EventJpa> {
        val webClient = WebClient().apply {
            options.isCssEnabled = true
            options.isJavaScriptEnabled = true
            cssErrorHandler = SilentCssErrorHandler()
            javaScriptErrorListener = SilentJavaScriptErrorListener()
            options.isThrowExceptionOnFailingStatusCode = false
            options.isThrowExceptionOnScriptError = false
        }
        
        return try {

            val page = webClient.getPage<HtmlPage>("https://www.opera-bordeaux.com/saison")
            webClient.waitForBackgroundJavaScript(4000)
            val container: HtmlElement = page.getFirstByXPath("//div[@class='views-element-container']")
            // Declared as var because it is reassigned after each page is fetched.
            var nextPageNavigation: HtmlElement? =
                container.getFirstByXPath<HtmlElement>("descendant::nav[@class='pager']/ul/li[@class='pager__item pager__item--next ']")

            val lastPage = container.getByXPath<HtmlElement>("descendant::nav[@class='pager']/ul/li")[6].asNormalizedText().toInt()
            val eventsOnOtherPages = IntRange(2, lastPage).flatMap {
                val newPage = nextPageNavigation!!.getFirstByXPath<HtmlAnchor>("a").click<HtmlPage>().body // NullPointerException
                val pair = fetchEvents(newPage, now)
                nextPageNavigation = pair.second
                pair.first
            }
            eventsOnOtherPages

        } catch (e: Exception) {
            emptyList()
        } finally {
            webClient.close()
        }
}

fun fetchEvents(...) {
   ...
}

Stack trace 1:

2024-02-19T19:44:27.925+01:00 ERROR 12461 --- [           main] org.htmlunit.WebConsole                  : Error: Minified React error #423; visit https://reactjs.org/docs/error-decoder.html?invariant=423 for the full message or use the non-minified dev environment for full errors and additional helpful warnings.
@https://embed-cdn.spotifycdn.com/_next/static/chunks/framework-9061fa2704610d1a.js:9
Vk()@https://embed-cdn.spotifycdn.com/_next/static/chunks/framework-9061fa2704610d1a.js:9
@https://embed-cdn.spotifycdn.com/_next/static/chunks/framework-9061fa2704610d1a.js:9
Jk()@https://embed-cdn.spotifycdn.com/_next/static/chunks/framework-9061fa2704610d1a.js:9
Ok()@https://embed-cdn.spotifycdn.com/_next/static/chunks/framework-9061fa2704610d1a.js:9
Hk()@https://embed-cdn.spotifycdn.com/_next/static/chunks/framework-9061fa2704610d1a.js:9
J()@https://embed-cdn.spotifycdn.com/_next/static/chunks/framework-9061fa2704610d1a.js:33
R()@https://embed-cdn.spotifycdn.com/_next/static/chunks/framework-9061fa2704610d1a.js:33

Stack trace 2:

java.lang.NullPointerException: null
	at com.flb.bdxevents.infra.fetcher.GrandTheatreFetcher.fetch(GrandTheatreFetcher.kt:47) ~[main/:na]
	at com.flb.bdxevents.infra.EventFetcher.fetch(EventFetcher.kt:54) ~[main/:na]
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[na:na]
	at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[na:na]
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:351) ~[spring-aop-6.1.3.jar:6.1.3]
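
The thread does not establish why the lookup comes back null after the upgrade. Independent of the cause, getFirstByXPath returns null when nothing matches, and `@class='pager__item pager__item--next '` is an exact string comparison that includes the trailing space. The sketch below uses only the JDK's XPath engine, not HtmlUnit, and hypothetical markup without the trailing space. It shows how such a predicate can stop matching, and how a whitespace-tolerant class test behaves instead. Whether the real page changed in this way is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PagerXPathSketch {

    // Exact string comparison, as in the snippet above (note the trailing space).
    static final String EXACT =
            "//nav[@class='pager']/ul/li[@class='pager__item pager__item--next ']";

    // Token test: matches whenever 'pager__item--next' is one of the class names.
    static final String TOKEN =
            "//nav[@class='pager']/ul/li[contains(concat(' ', normalize-space(@class), ' '), ' pager__item--next ')]";

    /** Returns the first node matching expr, or null when nothing matches. */
    static Node first(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Node) xpath.evaluate(expr, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expr, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical markup: the same pager item, but without the trailing space.
        String xml = "<div><nav class='pager'><ul>"
                + "<li class='pager__item pager__item--next'><a href='?page=1'>next</a></li>"
                + "</ul></nav></div>";

        Node exact = first(xml, EXACT);
        Node token = first(xml, TOKEN);

        // A null guard here turns a bare NullPointerException into a readable message.
        System.out.println("exact match found: " + (exact != null)); // false
        System.out.println("token match found: " + (token != null)); // true
    }
}
```

Logging which of the two lookups returned null (the `li` or the nested `a`) before calling click() would narrow down where line 47 fails.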

@fleboulch

This issue is still present (I tried the latest 4.0.0 snapshot).

@fleboulch

My issue has disappeared with the 4.0.0 version! Thanks a lot 🎉

@rbri
Member

rbri commented Apr 28, 2024

@fleboulch Not really sure why this happens, but great to hear.

rbri closed this as completed on Apr 28, 2024