Scrapy: passing custom_settings to a spider from a script using CrawlerProcess.crawl()
I am trying to programmatically call a spider through a script, but I am unable to override the settings through the constructor using CrawlerProcess. Let me illustrate this with the default spider for scraping quotes from the official Scrapy site (last code snippet at the official scrapy quotes example spider).
from scrapy import Spider, Request

class QuotesSpider(Spider):
    name = "quotes"

    def __init__(self, somestring, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.somestring = somestring
        self.custom_settings = kwargs

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
Here is the script through which I try to run the quotes spider:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings

def main():
    proc = CrawlerProcess(get_project_settings())
    custom_settings_spider = {
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log'
    }
    proc.crawl('quotes', 'dummyinput', **custom_settings_spider)
    proc.start()
Scrapy Settings are a bit like Python dicts, so you can update the settings object before passing it to CrawlerProcess:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def main():
    s = get_project_settings()
    s.update({
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log'
    })
    proc = CrawlerProcess(s)
    proc.crawl('quotes', 'dummyinput')
    proc.start()
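A side note, not part of the original answer: on Scrapy 2.1+ the FEED_URI setting is deprecated in favour of FEEDS, so the equivalent update would look roughly like this:

from scrapy.utils.project import get_project_settings

s = get_project_settings()
s.update({
    # FEEDS maps each output URI to its export options (format, encoding, ...)
    'FEEDS': {'quotes.csv': {'format': 'csv'}},
    'LOG_FILE': 'quotes.log',
})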
Edit following OP's comments:
Here's a variation using CrawlerRunner, with a new CrawlerRunner for each crawl and re-configuring logging at each iteration to write to different files each time:
import logging
from twisted.internet import reactor, defer
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging, _get_handler
from scrapy.utils.project import get_project_settings

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        page = getattr(self, 'page', 1)
        yield scrapy.Request('http://quotes.toscrape.com/page/{}/'.format(page),
                             self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

@defer.inlineCallbacks
def crawl():
    s = get_project_settings()
    for i in range(1, 4):
        s.update({
            'FEED_URI': 'quotes%03d.csv' % i,
            'LOG_FILE': 'quotes%03d.log' % i
        })

        # manually configure logging for LOG_FILE
        configure_logging(settings=s, install_root_handler=False)
        logging.root.setLevel(logging.NOTSET)
        handler = _get_handler(s)
        logging.root.addHandler(handler)

        runner = CrawlerRunner(s)
        yield runner.crawl(QuotesSpider, page=i)

        # reset root handler
        logging.root.removeHandler(handler)

    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
I think you can't override the custom_settings variable of a Spider class when calling it as a script, basically because the settings are loaded before the spider is instantiated.
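To illustrate the timing issue (a minimal sketch of my own, with made-up names and settings): custom_settings is read from the spider class before __init__ runs, so assigning it on the instance comes too late:

import scrapy

class MySpider(scrapy.Spider):
    name = 'settings_demo'
    # Read at the class level, before the spider is instantiated,
    # so it must be defined here for Scrapy to pick it up.
    custom_settings = {'DOWNLOAD_DELAY': 1}

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Too late: the crawler's settings are already populated,
        # so this assignment has no effect on them.
        self.custom_settings = {'DOWNLOAD_DELAY': 5}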
Now, I don't really see the point in changing the custom_settings variable specifically, as it is only a way to override your default settings, and that's exactly what CrawlerProcess offers too. This works as expected:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'simple'
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        for k, v in self.settings.items():
            print('{}: {}'.format(k, v))
        yield {
            'headers': response.body
        }

process = CrawlerProcess({
    'USER_AGENT': 'my custom user agent',
    'ANYKEY': 'any value',
})

process.crawl(MySpider)
process.start()
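As a related point (a small sketch of my own, with made-up names): the extra positional and keyword arguments passed to crawl(), like the OP's 'dummyinput', are forwarded to the spider's constructor and, with the default Spider.__init__, end up as spider attributes rather than settings:

import scrapy
from scrapy.crawler import CrawlerProcess

class ArgsSpider(scrapy.Spider):
    name = 'args_demo'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # self.tag was set from the crawl() keyword argument below
        yield {'tag': self.tag, 'url': response.url}

process = CrawlerProcess()
process.crawl(ArgsSpider, tag='humor')
process.start()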
Hmm, it looks like the setting is read in the scrapy.extensions.spiderstate.SpiderState from_crawler method, which is called before instantiating the spider. So the value from the spider class custom_settings will be used, but I don't think you can do anything with it inside the spider methods before it's used.
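If the goal is just to read the final settings as early as possible in the spider's lifecycle, the documented hook is to override from_crawler() (a minimal sketch under that assumption, not code from this thread):

import scrapy

class MySpider(scrapy.Spider):
    name = 'fromcrawler_demo'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        # crawler.settings is fully populated at this point,
        # including anything merged in from custom_settings
        spider.log_file = crawler.settings.get('LOG_FILE')
        return spider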
You can override a setting from the command line
https://doc.scrapy.org/en/latest/topics/settings.html#command-line-options
For example: scrapy crawl myspider -s LOG_FILE=scrapy.log
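Several settings can be overridden at once by repeating -s, and -o names the feed output (the values here are just illustrative):

scrapy crawl myspider -s LOG_FILE=scrapy.log -s ROBOTSTXT_OBEY=False -o output.csv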
It seems you want to have a custom log for each spider. You need to activate the logging like this:
import scrapy
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # ... rest of the spider omitted ...

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        configure_logging({'LOG_FILE': "logs/mylog.log"})
Comments
- For my use case, I need to pass a .csv file for each run of the spider using proc.crawl(). I want to have 1 crawler process (with the common settings) but call crawl successively with different names for the log and csv feed output. Can I achieve this using scrapy?
- @hAcKnRoCk you can use a for loop when calling CrawlerProcess, updating the settings there, instead of overriding custom_settings
- @hAcKnRoCk, have you looked at the last example in Running multiple spiders in the same process, i.e. running spiders sequentially with CrawlerRunner?
- @eLRuLL: Yes, I already tried with a for loop. The code is at pastebin.com/RTnUWntQ. I receive a 'twisted.internet.error.ReactorNotRestartable' error during the 2nd iteration.
- @paultrmbrth Yes, I did see that example, but I am not sure it will suit my use case. The problem in the question would still persist: I won't be able to run my spider with each run giving me a .csv and a .log file.
- The point of being able to override custom_settings is this. I want to be able to do a crawl('myspider', list1_urlstoscrape, 'list1output.csv', 'list1.log'), then again do a crawl('myspider', list2_urlstoscrape, 'list2output.csv', 'list2.log'). To achieve this I would therefore have to create multiple CrawlerProcess instances, which is not possible due to the Twisted reactor problem (see the sketch after these comments).
- you could change your spider code to receive multiple lists at once, and then process each
- Yes, but the problem would still exist. The issue is not in passing the input lists to be scraped but in specifying how you want the outputs for each of those lists (that is, for each crawl of the same spider).
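Putting the suggestions above together, here is a rough sketch of the workflow described in these comments. It assumes a project spider named 'myspider' and a modern Scrapy (2.1+) where per-crawl output files can be set via FEEDS; both are assumptions, not from the thread, and per-crawl log files would still need the manual logging re-configuration shown in the CrawlerRunner answer above:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

JOBS = [
    # (urls to scrape, csv output, log file) -- placeholder values
    (['http://quotes.toscrape.com/page/1/'], 'list1output.csv', 'list1.log'),
    (['http://quotes.toscrape.com/page/2/'], 'list2output.csv', 'list2.log'),
]

@defer.inlineCallbacks
def crawl_all():
    for urls, csv_file, log_file in JOBS:
        s = get_project_settings()
        s.update({
            'FEEDS': {csv_file: {'format': 'csv'}},
            'LOG_FILE': log_file,  # only effective once logging is (re)configured
        })
        runner = CrawlerRunner(s)
        # 'myspider' is looked up by name in the project; start_urls becomes
        # a spider attribute, like any other crawl() keyword argument
        yield runner.crawl('myspider', start_urls=urls)
    reactor.stop()

crawl_all()
reactor.run()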