Scrapy: passing custom_settings to a spider from a script using CrawlerProcess.crawl()

I am trying to programmatically call a spider through a script. I am unable to override the settings through the constructor using CrawlerProcess. Let me illustrate this with the default spider for scraping quotes from the official Scrapy site (last code snippet of the official Scrapy quotes example spider).

from scrapy import Spider, Request


class QuotesSpider(Spider):

    name = "quotes"

    def __init__(self, somestring, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.somestring = somestring
        self.custom_settings = kwargs


    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Here is the script with which I try to run the quotes spider:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings

def main():
    proc = CrawlerProcess(get_project_settings())

    custom_settings_spider = {
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log'
    }
    proc.crawl('quotes', 'dummyinput', **custom_settings_spider)
    proc.start()


if __name__ == '__main__':
    main()

Scrapy Settings are a bit like Python dicts. So you can update the settings object before passing it to CrawlerProcess:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings

def main():
    s = get_project_settings()
    s.update({
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log'
    })
    proc = CrawlerProcess(s)

    # no need to pass the settings as spider kwargs anymore
    proc.crawl('quotes', 'dummyinput')
    proc.start()


if __name__ == '__main__':
    main()
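
One caveat, based on Scrapy's documented setting priorities rather than anything specific to this example: Settings.update() uses the 'project' priority by default, while a spider's class-level custom_settings is applied with the higher 'spider' priority, so if the spider defines the same keys they would still win. Passing an explicit higher priority avoids that; a sketch:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main():
    s = get_project_settings()
    # 'cmdline' outranks both 'project' and 'spider' priorities,
    # so these values win even over a spider's custom_settings
    s.update({
        'FEED_URI': 'quotes.csv',
        'LOG_FILE': 'quotes.log'
    }, priority='cmdline')

    proc = CrawlerProcess(s)
    proc.crawl('quotes', 'dummyinput')
    proc.start()


if __name__ == '__main__':
    main()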

Edit following OP's comments:

Here's a variation using CrawlerRunner, creating a new CrawlerRunner for each crawl and re-configuring logging at each iteration so that it writes to a different file each time:

import logging
from twisted.internet import reactor, defer

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging, _get_handler
from scrapy.utils.project import get_project_settings


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        page = getattr(self, 'page', 1)
        yield scrapy.Request('http://quotes.toscrape.com/page/{}/'.format(page),
                             self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }


@defer.inlineCallbacks
def crawl():
    s = get_project_settings()
    for i in range(1, 4):
        s.update({
            'FEED_URI': 'quotes%03d.csv' % i,
            'LOG_FILE': 'quotes%03d.log' % i
        })

        # manually configure logging for LOG_FILE
        configure_logging(settings=s, install_root_handler=False)
        logging.root.setLevel(logging.NOTSET)
        handler = _get_handler(s)
        logging.root.addHandler(handler)

        runner = CrawlerRunner(s)
        yield runner.crawl(QuotesSpider, page=i)

        # reset root handler
        logging.root.removeHandler(handler)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

From the Settings documentation: spiders can override project settings through their custom_settings class attribute, and inside a spider the settings are available through self.settings. However, self.settings is only set after the spider is initialized, so if you need the settings during initialization (e.g., in your spider's __init__() method), you have to override the from_crawler() method and read them from the Crawler.settings attribute of the crawler passed to it.
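
For illustration, a minimal sketch of that from_crawler() route (the feed_uri argument is just an example name): the crawler's settings are already merged when from_crawler() runs, so a value can be read there and handed to __init__():

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def __init__(self, feed_uri=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        # this value was read from the already-merged settings in from_crawler()
        self.feed_uri = feed_uri

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        return super(QuotesSpider, cls).from_crawler(
            crawler, *args,
            feed_uri=crawler.settings.get('FEED_URI'),
            **kwargs)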

I think you can't override the custom_settings variable of a Spider class when calling it from a script, basically because the settings are loaded before the spider is instantiated.

Now, I don't really see the point in changing the custom_settings variable specifically, as it is only a way to override your default settings, and that's exactly what CrawlerProcess offers too. This works as expected:

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'simple'
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        for k, v in self.settings.items():
            print('{}: {}'.format(k, v))
        yield {
            'headers': response.body
        }

process = CrawlerProcess({
    'USER_AGENT': 'my custom user agent',
    'ANYKEY': 'any value',
})

process.crawl(MySpider)
process.start()

Hmm, it looks like the setting is read in the scrapy.extensions.spiderstate.SpiderState from_crawler() method, which is called before the spider is instantiated. So the value from the spider class's custom_settings will be used, but I don't think you can do anything with it from inside the spider methods before it is used.
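
In other words, a class-level custom_settings does get picked up in time, because it is merged into the crawler's settings before the crawler builds extensions such as SpiderState; what you cannot do is compute it inside __init__(). A minimal sketch under that assumption (the spider names and JOBDIR paths are made up):

import scrapy
from scrapy.crawler import CrawlerProcess


class FirstSpider(scrapy.Spider):
    name = 'first'
    # merged with 'spider' priority before the extensions are created,
    # so SpiderState sees this JOBDIR
    custom_settings = {'JOBDIR': 'crawls/first'}
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        yield {'url': response.url}


class SecondSpider(scrapy.Spider):
    name = 'second'
    custom_settings = {'JOBDIR': 'crawls/second'}
    start_urls = ['http://quotes.toscrape.com/page/2/']

    def parse(self, response):
        yield {'url': response.url}


process = CrawlerProcess()
process.crawl(FirstSpider)
process.crawl(SecondSpider)
process.start()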

You can override a setting from the command line

https://doc.scrapy.org/en/latest/topics/settings.html#command-line-options

For example: scrapy crawl myspider -s LOG_FILE=scrapy.log
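
Combining -s (settings) and -a (spider arguments), each run can get its own feed and log file; the file names here are just placeholders:

scrapy crawl quotes -a somestring=dummyinput -s LOG_FILE=list1.log -s FEED_URI=list1output.csv -s FEED_FORMAT=csv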

It seems you want to have a custom log for each spider. You need to activate the logging like this:

import scrapy
from scrapy.utils.log import configure_logging


class MySpider(scrapy.Spider):
    # ... (name, start_urls, etc. omitted)
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        configure_logging({'LOG_FILE': "logs/mylog.log"})
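
Building on that, a sketch (the logfile argument and the spider details are made up) that lets each run write to its own log by passing the file name as a spider argument:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.log import configure_logging


class MySpider(scrapy.Spider):
    name = 'mylogspider'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def __init__(self, logfile=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if logfile:
            # route this run's log records to the file passed in as an argument
            configure_logging({'LOG_FILE': logfile})

    def parse(self, response):
        yield {'url': response.url}


process = CrawlerProcess()
# 'logfile' is an ordinary spider argument, so it can differ per crawl() call
process.crawl(MySpider, logfile='logs/run1.log')
process.start()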

Comments
  • For my use case, I need to pass a .csv file for each run of the spider using proc.crawl(). I want to have 1 crawler process (with the common settings) but call crawl successively with different names for the log and csv feed output. Can I achieve this using scrapy?
  • @hAcKnRoCk you can use a for loop when calling CrawlerProcess, and update the settings there, instead of overriding custom_settings
  • @hAcKnRoCk, have you looked at the last example in Running multiple spiders in the same process, i.e. running spiders sequentially with CrawlerRunner?
  • @eLRuLL: Yes, I already tried with a for loop. The code is at pastebin.com/RTnUWntQ. I receive a 'twisted.internet.error.ReactorNotRestartable' error during the 2nd iteration.
  • @paultrmbrth Yes, I did see that example. But I am not sure if it will suit my use case. The problem in the question will still persist. I won't be able to run my spider with each run giving me a .csv and a .log file.
  • The point in being able to override custom_settings is this. I want to be able to do a 'crawl('myspider', list1_urlstoscrape, 'list1output.csv', 'list1.log')', then again do a 'crawl('myspider', list2_urlstoscrape, 'list2output.csv', 'list2.log')'. To achieve this, therefore, I would have to create multiple CrawlerProcess instances, which is not possible due to the Twisted reactor problem (see the sketch after these comments).
  • you could change your spider code to receive multiple lists at once, and then process each
  • Yes, but the problem would still exist. The issue is not in passing the input lists to be scraped but in saying how you want the outputs for each of those lists (that is, for each crawl of the same spider).
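
For reference, a sketch along those lines, building on the CrawlerRunner answer above (the ListSpider name, the urls argument and the file names are made up): each iteration gets its own FEED_URI and LOG_FILE through the settings, and its own URL list through a spider argument.

import logging
from twisted.internet import reactor, defer

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging, _get_handler
from scrapy.utils.project import get_project_settings


class ListSpider(scrapy.Spider):
    # hypothetical spider that takes its start URLs as an argument
    name = 'listspider'

    def __init__(self, urls=None, *args, **kwargs):
        super(ListSpider, self).__init__(*args, **kwargs)
        self.start_urls = urls or []

    def parse(self, response):
        yield {'url': response.url}


@defer.inlineCallbacks
def crawl_all(jobs):
    # jobs: list of (urls, csv_file, log_file) tuples, one per crawl
    for urls, csv_file, log_file in jobs:
        s = get_project_settings()
        s.update({
            'FEED_URI': csv_file,
            'FEED_FORMAT': 'csv',  # make sure the feed is actually written as CSV
            'LOG_FILE': log_file
        })

        # manually configure logging for LOG_FILE, as in the answer above
        configure_logging(settings=s, install_root_handler=False)
        logging.root.setLevel(logging.NOTSET)
        handler = _get_handler(s)
        logging.root.addHandler(handler)

        runner = CrawlerRunner(s)
        yield runner.crawl(ListSpider, urls=urls)

        # reset root handler before the next iteration
        logging.root.removeHandler(handler)
    reactor.stop()


jobs = [
    (['http://quotes.toscrape.com/page/1/'], 'list1output.csv', 'list1.log'),
    (['http://quotes.toscrape.com/page/2/'], 'list2output.csv', 'list2.log'),
]
crawl_all(jobs)
reactor.run()  # blocks here until the last crawl is finished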