Web scraping when scrolling down is needed

I want to scrape, e.g., the titles of the first 200 questions under the web page https://www.quora.com/topic/Stack-Overflow-4/all_questions. I tried the following code:

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"
    print("url")
    print(url)
    r = requests.get(url)  # HTTP request
    print("r")
    print(r)
    html_doc = r.text  # Extracts the html
    print("html_doc")
    print(html_doc)
    soup = BeautifulSoup(html_doc, 'lxml')  # Create a BeautifulSoup object
    print("soup")
    print(soup)

It gave me the text at https://pastebin.com/9dSPzAyX. If we search for href='/, we can see that the HTML does contain the titles of some questions. However, the number is not enough; on the actual web page, a user needs to scroll down manually to trigger the extra load.

Does anyone know how I could mimic "scrolling down" programmatically to load more of the page's content?

Infinite scrolling on a web page is based on JavaScript. Therefore, to find out which URL we need to access and which parameters to use, we need to either thoroughly study the JS code running inside the page or, preferably, examine the requests the browser makes when you scroll down the page. We can inspect those requests using the browser's Developer Tools. Here is an example for Quora.

The more you scroll down, the more such requests are generated. Your requests should then be made to that URL instead of the normal page URL, but keep in mind to send the correct headers and payload.
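
As a rough illustration, replaying such a request with requests might look like the sketch below. The endpoint, headers, and payload here are placeholders, not Quora's real API; copy the actual values from the request you see in the Network tab of the Developer Tools.

    import requests

    # Hypothetical endpoint: substitute the real URL, headers, and payload
    # observed in the Network tab of your browser's Developer Tools.
    AJAX_URL = "https://www.quora.com/ajax/load_more"  # placeholder, not a real API

    headers = {
        "User-Agent": "Mozilla/5.0",  # mimic a real browser
    }
    payload = {"page": 2}  # placeholder; the real payload is usually more complex

    r = requests.post(AJAX_URL, headers=headers, json=payload)
    print(r.status_code)
    print(r.text)  # often JSON or an HTML fragment with the next batch of items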

Another, easier solution is to use Selenium.

I couldn't find a way to do this with requests alone, but you can use Selenium. The code below first prints the number of questions present at the first load, then sends the End key to mimic scrolling down. You can see the number of questions go from 20 to 40 after sending the End key.

I wait 5 seconds before reading the DOM again, in case the script runs faster than the new content loads. You can improve on this by using expected conditions (EC) with Selenium; see the sketch after the code.

The page loads 20 questions per scroll. So if you are looking to scrape 100 questions, then you need to send the End key 5 times.

To use the code below, you need to install ChromeDriver: http://chromedriver.chromium.org/downloads

    import time

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.keys import Keys

    CHROMEDRIVER_PATH = ""  # path to your chromedriver executable
    CHROME_PATH = ""        # path to your Chrome binary
    WINDOW_SIZE = "1920,1080"

    chrome_options = Options()
    # chrome_options.add_argument("--headless")
    chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
    chrome_options.binary_location = CHROME_PATH
    # Disable image loading to speed up page loads
    prefs = {'profile.managed_default_content_settings.images': 2}
    chrome_options.add_experimental_option("prefs", prefs)

    url = "https://www.quora.com/topic/Stack-Overflow-4/all_questions"

    def scrape(url, times):
        if not url.startswith('http'):
            raise Exception('URLs need to start with "http"')

        driver = webdriver.Chrome(
            executable_path=CHROMEDRIVER_PATH,
            chrome_options=chrome_options
        )
        driver.get(url)

        counter = 1
        while counter <= times:
            # Count the questions currently present in the DOM
            q_list = driver.find_element_by_class_name('TopicAllQuestionsList')
            questions = q_list.find_elements_by_xpath('.//div[@class="pagedlist_item"]')
            q_len = len(questions)
            print(q_len)

            # Send the End key to mimic scrolling to the bottom of the page
            html = driver.find_element_by_tag_name('html')
            html.send_keys(Keys.END)

            # Give the newly triggered content time to load
            time.sleep(5)

            # Count again; roughly 20 more questions should have appeared
            questions2 = q_list.find_elements_by_xpath('.//div[@class="pagedlist_item"]')
            print(len(questions2))

            counter += 1

        driver.close()

    if __name__ == '__main__':
        scrape(url, 5)
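
As mentioned above, the fixed time.sleep(5) is fragile: it wastes time when the content arrives quickly and fails when it arrives slowly. A sketch of one way to improve it: WebDriverWait.until() accepts any callable that takes the driver, so you can wait until the question count has actually grown. The more_items_than helper below is my own illustration, not part of Selenium's API.

    from selenium.webdriver.support.ui import WebDriverWait

    def more_items_than(previous_count):
        """Condition that becomes truthy once new questions have loaded."""
        def _condition(driver):
            items = driver.find_elements_by_xpath('//div[@class="pagedlist_item"]')
            return len(items) > previous_count
        return _condition

    # Inside the scroll loop, replace time.sleep(5) with:
    # WebDriverWait(driver, 10).until(more_items_than(q_len))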

I recommend using Selenium rather than BeautifulSoup alone. Selenium can both control the browser and parse the page: scroll down, click buttons, and so on.

This example scrolls down to collect all the users who liked a post on Instagram: https://stackoverflow.com/a/54882356/5611675

If the content only loads on "scrolling down", this probably means that the page is using JavaScript to load the content dynamically.

You can try using a web client such as PhantomJS to load the page and execute the JavaScript in it, and simulate the scroll by injecting some JS such as document.body.scrollTop = sY; (see "Simulate scroll event using Javascript").
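
PhantomJS has since been discontinued, but the same idea works with any JavaScript-capable driver. Here is a rough sketch in Python using Selenium's execute_script to do the injection; the scroll-height loop is a generic pattern, not specific to Quora.

    import time
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
    driver.get("https://www.quora.com/topic/Stack-Overflow-4/all_questions")

    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Inject JS to jump to the bottom, like document.body.scrollTop = sY;
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)  # give the lazy-loaded content time to arrive

        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we have reached the end
        last_height = new_height

    driver.close()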


Comments
  • Possible duplicate of How can I scroll a web page using selenium webdriver in python?
  • Thank you for the full code... But are you sure driver.implicitly_wait(5) works? The browser closes immediately in my test, and questions2 comes out the same as questions.
  • Additionally, we need to scroll down to trigger the extra load, but we don't see any scrolling in your code.
  • I send the End key to mimic scrolling down, and have updated the code with a wait and time.sleep. This may not be the best way to do it, but I couldn't work out how to use EC to wait for the new elements to appear in the DOM.