Scrapy - How to initiate multiple instances of same spider process?


I am stuck while initiating multiple instances of the same spider. I want to run it as 1 URL per spider instance. I have to process 50k URLs, and for this I need to initiate a separate instance for each one. In my main spider script, I have set CLOSESPIDER_TIMEOUT to 7 minutes, to make sure that I am not crawling for too long. Please see the code below:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import urlparse  # Python 2; on Python 3 this lives in urllib.parse

for start_url in all_urls:
    domain = urlparse.urlparse(start_url).netloc
    if domain.startswith('ww'):
        domain = domain.split(".", 1)[1]

    # a fresh CrawlerProcess (and therefore a fresh reactor) per URL
    process = CrawlerProcess(get_project_settings())
    process.crawl('textextractor', start_url=start_url, allowed_domains=domain)
    process.start()

It runs completely for the 1st URL, but after that, when the 2nd URL is passed, it gives the error below:

raise error.ReactorNotRestartable()
ReactorNotRestartable

Please suggest what I should do to make it run for multiple instances of the same spider. Also, I am thinking of initiating multiple instances of Scrapy at a time using threads. Would that be a fine approach?


How about this:

# create one CrawlerProcess, schedule every crawl on it, then start it once
process = CrawlerProcess(get_project_settings())

for start_url in all_urls:
    domain = urlparse.urlparse(start_url).netloc
    if domain.startswith('ww'):
        domain = domain.split(".", 1)[1]
    process.crawl('textextractor', start_url=start_url, allowed_domains=domain)

# the reactor is started exactly once, so ReactorNotRestartable never comes up
process.start()
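
For completeness, the Common Practices docs describe an equivalent pattern with CrawlerRunner when you want more control over the Twisted reactor. This is a minimal sketch, assuming the same all_urls list and the 'textextractor' spider from the question:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
import urlparse  # Python 2, as in the question

configure_logging()
runner = CrawlerRunner(get_project_settings())

for start_url in all_urls:
    domain = urlparse.urlparse(start_url).netloc
    if domain.startswith('ww'):
        domain = domain.split(".", 1)[1]
    # schedule each crawl on the same, not-yet-started reactor
    runner.crawl('textextractor', start_url=start_url, allowed_domains=domain)

# stop the reactor once every scheduled crawl has finished
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # blocks until all crawls are done

Either way, the reactor is started only once, which is what avoids ReactorNotRestartable.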

Scrapy supports running multiple spiders per process through its internal API (see the Common Practices docs), but note that they all share a single reactor thread. If one spider blocks inside a callback (for example, its parse method spends a long time printing or computing), the other spiders are not scheduled and have to wait for it.
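
To make that concrete, here is a hypothetical spider (the name and URL are illustrative) whose callback blocks; while it sleeps, nothing else in the same process is serviced:

import time

import scrapy

class BlockingSpider(scrapy.Spider):
    # illustrative only: a blocking call inside a callback stalls the single
    # reactor thread shared by every spider in the process
    name = 'blocking_example'
    start_urls = ['http://example.com']

    def parse(self, response):
        time.sleep(60)  # blocks the reactor thread; other spiders wait here
        yield {'url': response.url}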


Is there a specific reason you want to start 50k instances of spiders? Twisted by default only allows a single reactor to run per process (unless you kill the entire process and restart it).

Secondly, "1 URL for 1 spider instance" will cause a huge overhead in memory. You should instead consider passing all the URLs to the same instance.
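
As a rough sketch of that approach (the urls_file argument and the selector are assumptions, not the asker's actual spider), one instance can read all 50k URLs from a file passed in as a spider argument:

import scrapy

class TextExtractorSpider(scrapy.Spider):
    # run with: scrapy crawl textextractor -a urls_file=urls.txt
    name = 'textextractor'

    def __init__(self, urls_file=None, *args, **kwargs):
        super(TextExtractorSpider, self).__init__(*args, **kwargs)
        # one spider instance gets the whole URL list instead of one URL each
        with open(urls_file) as f:
            self.start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        yield {'url': response.url,
               'text': ' '.join(response.css('p::text').extract())}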

Scrapy 1.0 allows us to run full crawler instances within a single process thanks to its internal API; see http://doc.scrapy.org/en/0.24/topics/practices.html#running for running multiple spiders in the same process.


In my view, this is not needed. In a Scrapy spider, every request you yield is handled asynchronously, so there is no reason to create multiple instances.

The way to speed up a spider is to increase concurrency.

And the way to process 50k URLs is to pass them in through spider arguments.
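
For instance, a minimal settings sketch (the values are illustrative, not recommendations) that raises concurrency and keeps the 7-minute cap from the question:

# settings.py (or custom_settings on the spider)
CONCURRENT_REQUESTS = 100            # total parallel requests (default is 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-domain cap (default is 8)
CLOSESPIDER_TIMEOUT = 420            # close the spider after 7 minutes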


If you are thinking that starting multiple spiders in parallel is going to download or crawl things faster, pause and rethink a bit. Scrapy itself is designed to be fast, based on the Twisted event-driven networking engine: it is not making network requests one by one, waiting for each request to finish before starting the next.


Scrapy schedules the scrapy.Request objects returned by the start_requests method of the spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method), passing the response as an argument.
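
A minimal sketch of that cycle (names and URLs are illustrative): start_requests yields the initial requests, and each callback can yield both items and further requests, all of which Scrapy schedules concurrently:

import scrapy

class FollowLinksSpider(scrapy.Spider):
    name = 'follow_links_example'

    def start_requests(self):
        for url in ['http://example.com/page1', 'http://example.com/page2']:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'url': response.url}
        for href in response.css('a::attr(href)').extract():
            # follow-up requests are queued and fetched concurrently,
            # not one at a time
            yield scrapy.Request(response.urljoin(href), callback=self.parse)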