Scrapy - pymongo not inserting items to DB


So I'm playing about with Scrapy, trying to learn, and using MongoDB as my DB I've come to a dead end. Basically the scraping works, as the items I'm fetching are showing in the terminal log, but I can't get the data to publish to my DB. The MONGO_URI is correct, as I tried it in the Python shell, where I can create and store data.
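
The shell check was roughly along these lines (with the real Atlas URI in place of the placeholder); the database and collection names match what the settings and pipeline below use:

import pymongo

# quick sanity check from the Python shell: connect with the Atlas URI,
# insert a test document and read it back
client = pymongo.MongoClient("my MongoDB Atlas address")  # placeholder URI
db = client["materials"]
db["my-prices"].insert_one({"title": "test", "price": "1.00"})
print(db["my-prices"].find_one({"title": "test"}))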

Here are my files

items.py

import scrapy

class MaterialsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
   ## url = scrapy.Field()
    pass

spider.py

import scrapy
from scrapy.selector import Selector

from ..items import MaterialsItem

class mySpider(scrapy.Spider):
    name = "<placeholder for post>"
    allowed_domains = ["..."]
    start_urls = [
   ...
    ]

    def parse(self, response):
        products = Selector(response).xpath('//div[@class="content"]')

        for product in products:
            item = MaterialsItem()
            item['title'] = product.xpath("//a[@class='product-card__title product-card__title-v2']/text()").extract()
            item['price'] = product.xpath("//div[@class='product-card__price-value ']/text()").extract()
            ## product['url'] =
            yield item

settings.py

MONGO_PIPELINES = {
    'materials.pipelines.MongoPipeline': 300,
}


#setup mongo DB
MONGO_URI = "my MongoDB Atlas address"
MONGO_DB = "materials"

pipelines.py

import logging
import pymongo

class MongoPipeline(object):

    collection_name = 'my-prices'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        ## pull in information from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB', '<placeholder-spider name>')
        )

    def open_spider(self, spider):
        ## initializing spider
        ## opening db connection
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        ## clean up when spider is closed
        self.client.close()

    def process_item(self, item, spider):
        ## how to handle each post
        self.db[self.collection_name].insert(dict(item))
        logging.debug("Post added to MongoDB")
        return item
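
(Side note: newer PyMongo releases deprecate Collection.insert() in favour of insert_one()/insert_many(), and PyMongo 4.x removes it entirely, so on a newer PyMongo the same pipeline would need roughly this sketch, keeping the same settings names and collection:)

import logging
import pymongo

class MongoPipeline(object):

    collection_name = 'my-prices'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # pull MONGO_URI / MONGO_DB from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB', 'materials')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() is the non-deprecated replacement for insert()
        self.db[self.collection_name].insert_one(dict(item))
        logging.debug("Post added to MongoDB")
        return item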

Any help would be great!

**Edit**

file structure

materials/
    spiders/
        my-spider
    items.py
    pipelines.py
    settings.py

Shouldn't the line in the MongoPipeline class:

collection_name = 'my-prices'

be:

self.collection_name = 'my-prices'

since you call:

self.db[self.collection_name].insert(dict(item))
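
For what it's worth, a class attribute is still reachable through self (attribute lookup falls back to the class when the instance has no attribute of that name), so collection_name = 'my-prices' should work as written. A tiny standalone check, with a hypothetical Demo class just for illustration:

class Demo(object):
    collection_name = 'my-prices'  # class attribute, shared by all instances

    def lookup(self):
        # falls back to the class attribute because the instance
        # defines no 'collection_name' of its own
        return self.collection_name

print(Demo().lookup())  # prints: my-prices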


I figured it out. With a fresh head I looked over everything again. Turns out in settings.py I had to change the

MONGO_PIPELINES = {
    'materials.pipelines.MongoPipeline': 300,
}

to

ITEM_PIPELINES = {
    'materials.pipelines.MongoPipeline': 300,
}

I guess I shouldn't have changed the standard setting name from ITEM_PIPELINES to MONGO_PIPELINES.
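
For completeness, the relevant part of my settings.py now looks roughly like this (the Atlas URI is still a placeholder):

# register the pipeline under the standard ITEM_PIPELINES setting
ITEM_PIPELINES = {
    'materials.pipelines.MongoPipeline': 300,
}

# custom settings read by MongoPipeline.from_crawler()
MONGO_URI = "my MongoDB Atlas address"
MONGO_DB = "materials"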


What is the code error? I think it needs to be under __init__, if possible. Can you upload it to git? I might try to have a look.


Comments
  • Although this seems correct, it does not work in this example for me. This is the Scrapy example pipeline for MongoDB, which I have just edited with my own data. I tried adding self.collection_name as you suggested, but it breaks the code.
  • Thanks, but as above I think my issue is resolved; it was a misnamed setting in settings.py :)