How can I trigger a MongoDB import once a Scrapy spider has completed?


I am using Scrapy, the Python library, to scrape websites and generate JSON output files at regular intervals. For efficiency, I want to bulk upsert these JSON files into MongoDB after each spider run completes.

I believe I can do the upsert like so:

mongoimport -c <collection> -d <db> --mode merge --file test.json

However, I am wondering: what is the best way to trigger this import once the spider has completed, and how?

I was hopeful I could use the close_spider method described here: https://doc.scrapy.org/en/latest/topics/item-pipeline.html#writing-your-own-item-pipeline

However, after playing around with it I discovered that the JSON file has only been created, and not yet written to, at the point this method is called.

It would be nice if there was some way for me to listen for a new file in a certain directory and then execute the import statement above.

Perhaps this can all be done in a bash script?

You could write items directly into MongoDB using Item Pipelines. Take a look at this example from Scrapy's documentation:

Write items to MongoDB

In this example we’ll write items to MongoDB using pymongo. MongoDB address and database name are specified in Scrapy settings; MongoDB collection is named after item class.

The main point of this example is to show how to use the from_crawler() method and how to clean up the resources properly:

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
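
If you go this route, you would enable the pipeline and point it at your database in your project's settings.py. A minimal sketch, assuming the pipeline lives at myproject.pipelines.MongoPipeline (the module path and connection values below are placeholders for your own):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,   # placeholder module path
}

MONGO_URI = 'mongodb://localhost:27017'         # read by from_crawler() above
MONGO_DATABASE = 'items'                        # defaults to 'items' if omitted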


This method works for me (in your spider file):

import os

import scrapy
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher  # deprecated; see the note after this snippet

class MySpider(scrapy.Spider):

    name = 'myspider'  # placeholder name

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # call spider_closed() when the spider_closed signal fires
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # replace with the command you want to run once the spider has finished
        os.system("your_command")
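
Note that scrapy.xlib.pydispatch has been deprecated for some time and removed in newer Scrapy releases, so on a recent version you would connect the signal through the crawler instead. Below is a sketch of the same idea using the supported signals API, with the mongoimport call from the question run via subprocess; the spider name, collection, database, and file path are placeholders:

import subprocess

import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):

    name = 'myspider'  # placeholder name

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # connect the handler through the crawler's signal manager
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # runs when the spider closes; confirm in your setup that the feed
        # file is fully written at this point before relying on it
        subprocess.run([
            'mongoimport',
            '-c', 'my_collection',          # placeholder collection
            '-d', 'my_db',                  # placeholder database
            '--mode', 'merge',
            '--file', 'output/test.json',   # placeholder output path
        ], check=True)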


One solution is to use pyinotify to watch a file in the chosen directory. I got the idea from here and adapted it to execute the mongoimport command above.

import os

import pyinotify


class MyEventHandler(pyinotify.ProcessEvent):

    def process_IN_ACCESS(self, event):
        print("ACCESS event:", event.pathname)

    def process_IN_ATTRIB(self, event):
        print("ATTRIB event:", event.pathname)

    def process_IN_CLOSE_NOWRITE(self, event):
        print("CLOSE_NOWRITE event:", event.pathname)

    def process_IN_CLOSE_WRITE(self, event):
        print("CLOSE_WRITE event:", event.pathname)
        result = os.system('mongoimport -c kray4 -d kray4 --mode merge --file /home/kevin/PycharmProjects/spider/krawler/output/test.json')
        print("Result: " + str(result))

    def process_IN_CREATE(self, event):
        print("CREATE event:", event.pathname)

    def process_IN_DELETE(self, event):
        print("DELETE event:", event.pathname)

    def process_IN_MODIFY(self, event):
        print("MODIFY event:", event.pathname)

    def process_IN_OPEN(self, event):
        print("OPEN event:", event.pathname)


def main():
    # watch manager
    wm = pyinotify.WatchManager()

    wm.add_watch('/home/kevin/PycharmProjects/spider/krawler/output/test.json', pyinotify.ALL_EVENTS, rec=True)

    # event handler
    eh = MyEventHandler()

    # notifier
    notifier = pyinotify.Notifier(wm, eh)
    notifier.loop()

if __name__ == '__main__':
    main()
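
Note that pyinotify relies on the Linux inotify API, so this only works on Linux (pip install pyinotify), and the watcher should run as a separate long-running process alongside your crawls. If you only care about the moment the export is finished, you can also narrow the watch to the single event that matters, IN_CLOSE_WRITE, which fires when a file opened for writing is closed:

wm.add_watch('/home/kevin/PycharmProjects/spider/krawler/output/test.json',
             pyinotify.IN_CLOSE_WRITE)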


Comments
  • Thanks for the suggestions @Thiago. This approach, although common, is very slow and inefficient because it hammers the database and makes many round trips over the network. Hence the approach I'm using.
  • I put an inspect_response inside spider_closed and saw that the file was created at that point, but the data wasn't written, so I didn't go any further.