Skip to content
This repository was archived by the owner on Mar 20, 2024. It is now read-only.

A headless browser manager with multi tasking RESTful API, crawling oriented

License

Notifications You must be signed in to change notification settings

luka-dev/playwright-task-server

Repository files navigation

!!!REPO DEPRACTICATED!!!

Look into backend headless-task-server and php helper headless-task-server-php.

Why? Main reasons is...

- Hero(ex SecretAgent) provide more stable, patched version of Chrome (not Chromium)

- Hero(ex SecretAgent) contains builded-in a lot of techics to make headless browser undetectabe for bot detectors.

PLAYWRIGHT task server

It's a Node.Js server that's hold playwright to process tasks (mainly - crawling)

Concept:

  • Express hold RESTful API and receive authorized(or not) request with task script
  • Add your task to queue (will be runed as soon as any worker will be free)
  • Run separate context(incognito env.)
  • Run your task script in something like isolated context
  • Return to express callback result of task script and answer for your request

Example of request:

POST to "http://server_address:port/task"

Content-Type: application/x-www-form-urlencoded

if in config.json AUTH_KEY is not null, add header

Authorization: HERE_AUTH_KEY

in form, field with name 'script'

Example of request

!!!WARNING!!!

Field script should be a string.

fetch("http://server_address:port/task", {
  "method": "POST",
  "headers": {
    "content-type": "application/x-www-form-urlencoded",
    "authorization": "HERE_AUTH_KEY"
  },
  "body": {
      "options": {
          "proxy": {
              "server": "PROTOCOL://ADDRESS:PORT", 
              "bypass": "", 
              "username": "USERNAME", 
              "password": "PASSWORD"
          }
      }, 
      "script": "HERE_IS_SCRIPT"
  }
});

Example of script (playwright docs)

//Creating page inside context
const page = await context.newPage();

//Preparing key's for data storage
let data = {
        hosts: [],
        res: [],
        ip: null
    };

//Listener, that's catch all requests, block everything except HTML and loging them.
page.route('**', route => {
    
    //Used module.URL (instance of node.js URL)
    data.hosts.push(modules.URL.parse(route.request().url()).hostname);
    
    if (route.request().resourceType() !== 'document') 
    {
       route.abort('aborted');
    }
    else {
      data.res.push(route.request().resourceType());
      route.continue();
    }
});

//Open 2ip main page and waiting for load
await page.goto('https://2ip.ru/');

//Extracting ip from html
data.ip = (await page.$('div.ip')).innerText();

//End script execution and return data
//also can be reject in case of script failure
resolve(data);

Var data locally created and puted throw resolve. Everything from var, will be displayed in response. All manually created var's/const's/e.t.c. inside script will be ignored in response.

Also task server support modeules, custom libs set, that will be available inside runed script context.

config.json

Proxy

In config, proxy property can be null, object or per-context (default: per-context), follow this docs. Example of proxy object

{
        "server": "hostname:port",
        "bypass": "",
        "username": "usernameForProxy",
        "password": "passwordForProxy"
}

Proxy per-context configuration docs

To set GLOBAL proxy, use ENV

In case of unnecessary authorization with username & password, fields username and password can be skipped or can be null

Env

PW_TASK_KEY - Key for Authorization

PW_TASK_PORT - Running port

PW_TASK_PROXY - Proxy hostname:port

PW_TASK_USERNAME - Proxy username

PW_TASK_PASSWORD - Proxy password

Additional

PHP-Lib for generating simple task script. (lib cover min. req.)

todo

  • Cover node inside docker container with xvfb
  • Submit issues with ideas