Trouble running a parser created using scrapy with selenium

I've written a scraper in Python Scrapy, in combination with Selenium, to scrape some titles from a website. The CSS selectors defined within my scraper are flawless. I want my scraper to keep clicking on the next page and parse the information embedded in each page. It does fine for the first page, but when the Selenium part comes into play, the scraper keeps clicking on the same link over and over again.

As this is my first time working with Selenium along with Scrapy, I don't have any idea how to move on successfully. Any fix will be highly appreciated.

If I try it like this, it works smoothly (so there is nothing wrong with the selectors):

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self,response):
        self.driver.get(response.url)

        while True:
            for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"h1.faqsno-heading"))):
                name = elem.find_element_by_css_selector("div[id^='arrowex']").text
                print(name)

            try:
                self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
                self.wait.until(EC.staleness_of(elem))
            except TimeoutException:
                break

But my intention is to make my script run this way:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self,link):
        self.driver.get(link)
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # It keeps clicking on the same link over and over again

        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()  
        self.wait.until(EC.staleness_of(elem))


    def parse(self,response):
        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}

            try:
                self.click_nextpage(response.url)  # initiate the method to do the clicking
            except TimeoutException:
                break

These are the titles visible on that landing page (to let you know what I'm after):

INDIA INCLUSION FOUNDATION
INDIAN WILDLIFE CONSERVATION TRUST
VATSALYA URBAN AND RURAL DEVELOPMENT TRUST

I'm not simply after the data from that site, so any alternative approach other than what I've tried above is of no use to me. My only intention is to find a solution related to the approach I took in my second attempt.

Your initial code was almost correct, with one key piece missing from it: you were always using the same response object. The response object needs to be rebuilt from the latest page source.

Also, you were browsing to the link again and again in click_nextpage, which was resetting it to page 1 every time. That is why you only get page 1 and 2 (at most). You need to get the URL only once, in the parse stage, and then let the next-page clicks happen from there.
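
In other words, after Selenium clicks "Next", the spider has to feed the browser's current page source back into a fresh Scrapy response. A minimal sketch of the two key lines, using Scrapy's Response.replace() and the same names as the code below:

self.click_nextpage(response.url)                           # Selenium clicks the "Next" button
response = response.replace(body=self.driver.page_source)   # rebuild the response from the new page source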

Below is the final code, working fine:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self,link):
        # self.driver.get(link)  # do NOT re-fetch the url here; the browser is already on the right page
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # click the "Next" button and wait for the old content to go stale
        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
        self.wait.until(EC.staleness_of(elem))


    def parse(self, response):
        self.driver.get(response.url)

        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}

            try:
                self.click_nextpage(response.url)  # initiate the method to do the clicking
                response = response.replace(body=self.driver.page_source)  # refresh the response with the new page source
            except TimeoutException:
                break

After that change, it works perfectly.
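
One detail none of the snippets here handle: the Chrome window is never closed when the crawl finishes. A minimal sketch, relying on Scrapy calling a spider's closed() method when it shuts down, would be to add this to the class:

    def closed(self, reason):
        # called by Scrapy once the spider finishes; shut the browser down
        self.driver.quit()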

In case you need a pure Selenium solution:

driver.get("https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx")

while True:
    for item in wait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div[id^='arrowex']"))):
        print(item.text)
    try:
        driver.find_element_by_xpath("//input[@text='Next' and not(contains(@class, 'disabledImageButton'))]").click()
    except NoSuchElementException:
        break

Here is another approach: load the page only once, in __init__, and let click_nextpage() just press the "Next" button, waiting a few seconds for the new records to render:

import scrapy
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from scrapy.crawler import CrawlerProcess

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

        link = 'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx'
        self.driver.get(link)

    def click_nextpage(self):
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # click the "Next" button and wait for the old content to go stale
        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
        self.wait.until(EC.staleness_of(elem))
        time.sleep(4)

    def parse(self,response):
        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}

            try:
                self.click_nextpage()  # initiate the method to do the clicking
            except TimeoutException:
                break

process = CrawlerProcess()

process.crawl(IncomeTaxSpider)
process.start()

Whenever the next page gets loaded via the 'Next Page' arrow (using Selenium), it gets reset back to page 1. I am not sure about the reason for this (maybe the JavaScript on the site). Hence I changed the approach to use the input field to enter the required page number and hit the ENTER key to navigate.

Here is the modified code. Hope it is useful for you.

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"
    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]
    def __init__(self):
        self.driver = webdriver.Firefox()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self, link, number):
        self.driver.get(link)
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # type the required page number into the paging textbox and press ENTER to navigate
        inputElement = self.driver.find_element_by_xpath("//input[@id='ctl00_SPWebPartManager1_g_d6877ff2_42a8_4804_8802_6d49230dae8a_ctl00_txtPageNumber']")
        inputElement.clear()
        inputElement.send_keys(number)
        inputElement.send_keys(Keys.ENTER)
        self.wait.until(EC.staleness_of(elem))


    def parse(self,response):
        number = 1
        while number < 10412: #Website shows it has 10411 pages.
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}
                print (name)

            try:
                number += 1
                self.click_nextpage(response.url, number)  # initiate the method to do the clicking
            except TimeoutException:
                break
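
Judging by the comments below, parse() here still iterates over the original response, so the items repeat; presumably the same response.replace() trick from the accepted answer is still needed after each navigation. A minimal sketch of that loop body, under that assumption:

            try:
                number += 1
                self.click_nextpage(response.url, number)
                # rebuild the Scrapy response from whatever Selenium is now showing
                response = response.replace(body=self.driver.page_source)
            except TimeoutException:
                break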

Create a self.page_num or something.

def parse(self,response):
    # read the total number of pages from the footer text (something like "... of 10411]")
    self.pages = self.driver.find_element_by_css_selector("#ctl00_SPWebPartManager1_g_d6877ff2_42a8_4804_8802_6d49230dae8a_ctl00_totalRecordsDiv.act_search_footer span")
    self.pages = int(self.pages.text.split('of ')[1].split(']')[0])

    self.page_num = 1

    while self.page_num <= self.pages:
        for item in response.css("h1.faqsno-heading"):
            name = item.css("div[id^='arrowex']::text").extract_first()
            yield {"Name": name}

        try:
            self.click_nextpage(response.url)  # initiate the method to do the clicking
        except TimeoutException:
            break

def click_nextpage(self,link):
    self.driver.get(link)
    elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

    page_link = 'ctl00_SPWebPartManager1_g_d6877ff2_42a8_4804_8802_6d49230dae8a_ctl00_lnkBtn_' + str(self.page_num)
    self.page_num = self.page_num + 1


    self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()  
    self.wait.until(EC.staleness_of(elem))

Comments
  • is the "next" button generated dynamically?? if not, why not use a Scrapy to traverse from page to page?
  • When I created this post I was seriously expecting your intervention cause I always find your solution very helpful. However, as for your answer: it gives me the content of the first page and then second page and then again second page and so on. It doesn't (or can't) go to the third page, fourth page etc.
  • Let me debug, I saw content changing so didn't go further. But this may be a CSS issue then
  • @Topto, issue resolved, please look at the updated answer
  • I can't wait any longer to see when your answer is going to get the bounty. It did the job perfectly. A one-line clarification as to why you used self.driver.get(response.url) before the while loop in the parse method would be very helpful. Apologies for the ignorance. Thanks and congrats.
  • We need the browser to be on the first page before we start clicking. We could have done that in __init__ as well, or here in parse. The key is to browse to the first page only once, and not every time we move to the next page.
  • Thanks for your input sir. I wish to accomplish the project in combination with scrapy.
  • Thanks @Maxwell77 for your solution. It is very close to what I was expecting. Your provided script is able to click the links incrementally. However, I'm still getting the same data from the first page over and over again. I think it would only work if we could pass self.driver.page_source to the self.parse method from within the click_nextpage(self) method, after time.sleep(4).
  • Sorry, I have tested the script and it's working on my computer. I think it is not necessary to pass the page source because it is always the same url, it's only the content which is reloaded with the new information.
  • You didn't get my point. When the driver clicks on the next-page link, driver.page_source will surely be different. I'm not talking about the driver's url, it's the driver.page_source. However, if you take a closer look at the spider, you can see that the self.parse() method never gets updated with the new response; that is the only reason why the content I'm getting is always the same. Thanks. Btw, I tested again to be sure and found the same result as I've already mentioned.
  • Thanks for your answer @Krishna. Although your script appears to change the page number by entering the number into that input box, it still gives me the same output over and over again (the output from the first page).