Crawling multiple URLs in a loop using Puppeteer

I have

urls = ['url','url','url'...]

This is what I'm doing:

urls.map(async (url)=>{
  await page.goto(url);
  await page.waitForNavigation({ waitUntil: 'networkidle' });
})

This seems not to wait for the page to load and visits all the URLs quite rapidly (I even tried using page.waitFor).

I just wanted to know: am I doing something fundamentally wrong, or is this type of functionality not advised/supported?

map, forEach, reduce, etc., do not wait for the asynchronous operation inside them before moving on to the next element of the iterable they are iterating over.
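
To make the failure mode concrete, here is a minimal sketch (using the same page and urls as in the question) of what the map version actually does:

// map() fires every callback immediately, so all the goto() calls race each other
// on the same page object; nothing here runs one after another.
const pendingVisits = urls.map(async (url) => {
  await page.goto(url);
});
// map() only returned an array of promises; awaiting them afterwards still does not
// serialize the visits, it just waits until they have all settled.
await Promise.all(pendingVisits);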

There are multiple ways of going through each item of an iterable sequentially while performing an asynchronous operation, but I think the easiest in this case is to simply use a normal for loop, which does wait for the operation to finish.

const urls = [...]

for (let i = 0; i < urls.length; i++) {
    const url = urls[i];
    await page.goto(`${url}`);
    await page.waitForNavigation({ waitUntil: 'networkidle2' });
}

This will visit one URL after another, as you expect. If you are curious about iterating serially using async/await, you can have a peek at this answer: https://stackoverflow.com/a/24586168/791691
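
The same idea reads a little more naturally with a for...of loop; this sketch is equivalent to the indexed loop above:

for (const url of urls) {
    // await pauses this iteration until the navigation has settled
    await page.goto(url);
    await page.waitForNavigation({ waitUntil: 'networkidle2' });
}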

If you find that you are left waiting on the navigation promise indefinitely, the proposed solution is to start waiting for the navigation before triggering it:

const urls = [...]

for (let i = 0; i < urls.length; i++) {
    const url = urls[i];
    // start listening for the navigation before triggering it
    const promise = page.waitForNavigation({ waitUntil: 'networkidle2' });
    await page.goto(`${url}`);
    await promise;
}

As referenced in this GitHub issue.
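
Note that page.goto() also accepts the waitUntil option directly (the last snippet below uses it this way), so in many cases the separate waitForNavigation() call can be dropped. A minimal sketch, again assuming page and urls already exist:

for (const url of urls) {
    // goto() itself resolves once the waitUntil condition is met
    await page.goto(url, { waitUntil: 'networkidle2' });
}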

The best way I found to achieve this:

const puppeteer = require('puppeteer');

(async () => {
    const urls = ['https://www.google.com/', 'https://www.google.com/'];
    for (let i = 0; i < urls.length; i++) {
        const url = urls[i];
        const browser = await puppeteer.launch({ headless: false });
        const page = await browser.newPage();
        await page.goto(`${url}`, { waitUntil: 'networkidle2' });
        await browser.close();
    }
})();
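
Launching and closing a fresh browser for every URL works, but it is heavy. A common variation (a sketch, not part of the original answer) reuses one browser and one page for the whole list:

const puppeteer = require('puppeteer');

(async () => {
    const urls = ['https://www.google.com/', 'https://www.google.com/'];
    // one browser and one page, reused for every URL
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    for (const url of urls) {
        await page.goto(url, { waitUntil: 'networkidle2' });
        // ...scrape or take a screenshot here before moving on to the next URL
    }
    await browser.close();
})();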

Comments
  • Weird, this gives await page.goto(${url}); Unexpected identifier SyntaxError.
  • @user2875289 Which version of Node are you using? You need 7.6 or higher for async/await to work without transpiling.
  • @tomahaug I'm using Node 8.9. The problem is solved now. I was mixing async/await with promises, which caused the SyntaxError. It works after switching to async/await only. Thanks!
  • Hi @MehranShafqat, it might be better to post this as a new question rather than a comment.