Selenium Python webscraper really slow
I'm a newbie getting into web scrapers. I've made something that works, but it takes hours and hours to get everything I need. I read something about using parallel processes to process the URLs but I have no clue how to go about it and incorporate it in what I already have. Help is much appreciated!
Here is my, still extremely messy, code. I'm still learning :)
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    from bs4 import BeautifulSoup
    from selenium.common.exceptions import NoSuchElementException
    import time
    import random
    import pprint
    import itertools
    import csv
    import pandas as pd

    start_url = "https://www.nationalevacaturebank.nl/vacature/zoeken?query=&location=&distance=city&limit=100&sort=relevance&filters%5BcareerLevel%5D%5B%5D=Starter&filters%5BeducationLevel%5D%5B%5D=MBO"

    driver = webdriver.Firefox()
    driver.set_page_load_timeout(20)
    driver.get(start_url)
    driver.find_element_by_xpath('//*[@id="form_save"]').click() #accepts cookies

    wait = WebDriverWait(driver, random.randint(1500,3200)/1000.0)
    j = random.randint(1500,3200)/1000.0
    time.sleep(j)
    num_jobs = int(driver.find_element_by_xpath('/html/body/div/div/main/div/div/div/header/h2/span').text)
    num_pages = int(num_jobs/102)
    urls = []
    list_of_links = []

    for i in range(num_pages+1):
        try:
            elements = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="search-results-container"]//article/job/a')))
            for i in elements:
                list_of_links.append(i.get_attribute('href'))
            j = random.randint(1500,3200)/1000.0
            time.sleep(j)
            if 'page=3' not in driver.current_url:
                driver.find_element_by_xpath('//html/body/div/div/main/div/div/div/paginator/div/nav/ul/li/a').click()
            else:
                driver.find_element_by_xpath('//html/body/div/div/main/div/div/div/paginator/div/nav/ul/li/a').click()
            url = driver.current_url
            if url not in urls:
                print(url)
                urls.append(url)
            else:
                break
        except:
            continue

    set_list_of_links = list(set(list_of_links))
    print(len(set_list_of_links), "results")
    driver.close()

    def grouper(n, iterable):
        it = iter(iterable)
        while True:
            chunk = tuple(itertools.islice(it, n))
            if not chunk:
                return
            yield chunk

    def remove_empty_lists(l):
        keep_going = True
        prev_l = l
        while keep_going:
            new_l = remover(prev_l)
            #are they identical objects?
            if new_l == prev_l:
                keep_going = False
            #set prev to new
            prev_l = new_l
        #return the result
        return new_l

    def remover(l):
        newlist = []
        for i in l:
            if isinstance(i, list) and len(i) != 0:
                newlist.append(remover(i))
            if not isinstance(i, list):
                newlist.append(i)
        return newlist

    vacatures = []
    chunks = grouper(100, set_list_of_links)
    chunk_count = 0

    for chunk in chunks:
        chunk_count +=1
        print(chunk_count)
        j = random.randint(1500,3200)/1000.0
        time.sleep(j)

        for url in chunk:
            driver = webdriver.Firefox()
            driver.set_page_load_timeout(20)
            try:
                driver.get(url)
                driver.find_element_by_xpath('//*[@id="form_save"]').click() #accepts cookies
                vacature = []
                vacature.append(url)
                j = random.randint(1500,3200)/1000.0
                time.sleep(j)
                elements = driver.find_elements_by_tag_name('dl')
                p_elements = driver.find_elements_by_tag_name('p')
                li_elements = driver.find_elements_by_tag_name('li')
                for i in elements:
                    if "Salaris:" not in i.text:
                        vacature.append(i.text)
                running_text = list()
                for p in p_elements:
                    running_text.append(p.text)
                text= [''.join(running_text)]
                remove_ls = ['vacatures', 'carrièretips', 'help', 'inloggen', 'inschrijven', 'Bezoek website', 'YouTube', 'Over Nationale Vacaturebank', 'Werken bij de Persgroep', 'Persberichten', 'Autotrack', 'Tweakers', 'Tweakers Elect', 'ITBanen', 'Contact', 'Carrière Mentors', 'Veelgestelde vragen', 'Vacatures, stages en bijbanen', 'Bruto Netto Calculator', 'Salariswijzer', 'Direct vacature plaatsen', 'Kandidaten zoeken', 'Bekijk de webshop', 'Intermediair', 'Volg ons op Facebook']
                for li in li_elements:
                    if li.text not in remove_ls:
                        text.append(li.text)
                text = ''. join(text)
                vacature.append(text)
                vacatures.append(vacature)
                driver.close()
            except TimeoutException as ex:
                isrunning = 0
                print("Exception has been thrown. " + str(ex))
                driver.close()
            except NoSuchElementException:
                continue
The Python Selenium WebDriver is not thread-safe, so a single browser instance cannot correctly handle asynchronous calls from multiple threads. Try scraping the site with requests and bs4 + lxml instead. It's much faster than Selenium. This answer can be helpful.
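Since the question asks how to process the URLs in parallel, here is a rough sketch of one way to do it once the browser is out of the picture. It is only an illustration, not a drop-in fix: `fetch`, `fetch_all` and `max_workers=8` are names and values I made up, and I used the standard library's `urllib` so the snippet runs without extra installs (`requests.get(url).text` could be swapped in for `fetch`). It also assumes the data you need is present in the raw HTML and not generated by JavaScript.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def fetch(url, timeout=20):
    """Download one page and return its HTML as text (standard library only)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Fetch many URLs concurrently.

    Returns a dict mapping each URL to its HTML, or to the exception
    that was raised while fetching it.
    """
    urls = list(urls)  # allow generators; we iterate twice below

    def safe(url):
        try:
            return fetcher(url)
        except Exception as exc:  # one failing page should not stop the rest
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result
        return dict(zip(urls, pool.map(safe, urls)))
```

With the question's code, something like `pages = fetch_all(set_list_of_links)` would replace the loop that opens a new Firefox per URL, and each HTML string could then be handed to `BeautifulSoup(html, "lxml")`. Keep `max_workers` modest so you don't hammer the site.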
- You're using Firefox, which is slower than Chrome in almost all real-life applications.
- XPath is the slowest selector; match by id or class instead. If that is not possible, then by CSS.
- Use headless mode and don't load images unless you need to.
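If you do stay with Selenium, the points above could look roughly like this. Treat it as a sketch under assumptions: `chrome_settings`, `make_driver` and `accept_cookies` are hypothetical helper names, the image-blocking preference is the commonly used Chrome content setting, and you need Chrome plus a matching chromedriver installed. The `form_save` id is taken from the question's own XPath.

```python
def chrome_settings(headless=True, load_images=False):
    """Return (command-line arguments, preferences) for a lean Chrome session."""
    arguments = ["--headless"] if headless else []
    prefs = {}
    if not load_images:
        # 2 means "block" in Chrome's content settings
        prefs["profile.managed_default_content_settings.images"] = 2
    return arguments, prefs


def make_driver(headless=True, load_images=False):
    """Start Chrome with those settings (requires selenium and chromedriver)."""
    # Imported here so chrome_settings() can be used without selenium installed
    from selenium import webdriver

    arguments, prefs = chrome_settings(headless, load_images)
    options = webdriver.ChromeOptions()
    for argument in arguments:
        options.add_argument(argument)
    if prefs:
        options.add_experimental_option("prefs", prefs)
    return webdriver.Chrome(options=options)


def accept_cookies(driver):
    """Click the cookie button by id instead of the XPath '//*[@id="form_save"]'."""
    from selenium.webdriver.common.by import By

    driver.find_element(By.ID, "form_save").click()
```

Creating the driver once and reusing it for every URL, rather than starting a new browser inside the loop as the question does, should also save a lot of time.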
You can use Scrapy, which is much faster and more flexible. See the link for more information.
- If you want to improve working code you'd better post your question on CodeReview
- I'm not really sure, but using Selenium might be the reason this is so slow. Selenium visually renders the page and loads all the images, ads, etc. If you just fetch the HTML and scrape that, it might be a lot faster. For example, I built a script with BeautifulSoup that scrapes all the data (not the images) from Jaap in about 10-15 minutes (3000+ pages). So Nationale Vacaturebank should also be possible in a reasonable time...
- @Andersson I did but got the reaction that they do not help with code that does not exist i.e. how to go about parallel processing
- Oh really? I thought Selenium was normally used for this kind of thing. Thanks :)
- No, like I said, using requests and BeautifulSoup is a lot faster because you only get the HTML.
- @Lunalight you can try using the PhantomJS webdriver with Selenium. It is a headless browser and can be faster than Firefox.
- @NielsHenkens nothing prevents you from inspecting the web page for API calls. But if the data is generated with JS, we need to use a browser, of course.
- But how do I go to a next page with requests?
- Ok, but in the second part I use tag name, is that not a CSS thing?
- That's the native approach which is the fastest. But I see a lot of XPaths.
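On going to the next page with requests: without a browser there is nothing to click, so the usual approach is to build each page's URL yourself. The sketch below assumes the site accepts a `page` query parameter (the question's code checks for `'page=3'` in `driver.current_url`, which suggests it does) and that numbering starts at 1. Both are assumptions to verify in your browser's address bar. `page_url` and `page_urls` are hypothetical helper names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url, page):
    """Return base_url with its `page` query parameter set to `page`."""
    parts = urlsplit(base_url)
    # Keep every existing parameter (including empty ones) except an old `page`
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(base_url, num_pages):
    """Return the URLs for pages 1..num_pages of a search result."""
    return [page_url(base_url, number) for number in range(1, num_pages + 1)]
```

For example, `page_urls(start_url, num_pages + 1)` would give one URL per result page, and each of those can be fetched with `requests.get(...)` and parsed with BeautifulSoup to collect the vacancy links.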