Scrapy - Importing Excel .csv as start_url

So I'm building a scraper that imports a .csv Excel file containing a single row of ~2,400 websites (each website in its own column) and uses them as the start_urls. I keep getting an error saying that I am passing in a list and not a string. I think this is because my list basically contains one very long list that represents the row. How can I overcome this and put each website from my .csv into the list as its own separate string?

raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
    exceptions.TypeError: Request url must be str or unicode, got list:


import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
from tutorial.items import DanishItem
from scrapy.http import Request
import csv

with open('websites.csv', 'rbU') as csv_file:
  data = csv.reader(csv_file)
  scrapurls = []
  for row in data:
    scrapurls.append(row)

class DanishSpider(scrapy.Spider):
  name = "dmoz"
  allowed_domains = []
  start_urls = scrapurls

  def parse(self, response):
    for sel in response.xpath('//link[@rel="icon" or @rel="shortcut icon"]'):
      item = DanishItem()
      item['website'] = response
      item['favicon'] = sel.xpath('./@href').extract()
      yield item

Thanks!

Joey

Just generating a list for start_urls does not work by itself, as the Scrapy documentation clearly explains.

From the documentation:

You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.
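In other words, the default start_requests behaves roughly like the sketch below (a simplification, not the actual Scrapy source), which is why every entry in start_urls must already be a plain URL string rather than a list:

def start_requests(self):
    # roughly the default behaviour: one Request per start_urls entry,
    # so a nested list of URLs triggers "Request url must be str or unicode"
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse)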

I would rather do it in this way:

def get_urls_from_csv():
    with open('websites.csv', newline='') as csv_file:
        data = csv.reader(csv_file)
        scrapurls = []
        for row in data:
            # each row is a list of cells, so add the individual cells,
            # not the row itself
            scrapurls.extend(row)
        return scrapurls


class DanishSpider(scrapy.Spider):

    ...

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]
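On newer Scrapy versions the same idea reads a little more naturally as a generator (a sketch, assuming get_urls_from_csv() returns plain URL strings):

    def start_requests(self):
        for start_url in get_urls_from_csv():
            yield scrapy.Request(url=start_url, callback=self.parse)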

Try opening the .csv file inside the class (not outside, as you did before) and appending the URLs to start_urls. This solution worked for me. Hope this helps :-)

    class DanishSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = []
        start_urls = []

        f = open('websites.csv', 'r')
        for i in f:
            u = i.split('\n')
            start_urls.append(u[0])
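Note that this reads one URL per line. Since the question's file keeps all ~2,400 URLs as columns of a single row, a variant using the csv module may fit better; a minimal sketch (assuming import csv at the top of the file, placed in the class body in place of the open/for lines above):

        with open('websites.csv', newline='') as f:
            # one row, many columns: flatten the cells into URL strings
            start_urls = [url.strip() for row in csv.reader(f) for url in row if url.strip()]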

I find the following useful when in need:

import csv
import scrapy

class DanishSpider(scrapy.Spider):
    name = "rei"
    with open("output.csv","r") as f:
        reader = csv.DictReader(f)
        start_urls = [item['Link'] for item in reader]

    def parse(self, response):
        yield {"link":response.url}

  for row in data:
    scrapurls.append(row)

row is a list: [column1, column2, ...]. So I think you need to extract the columns and append them to your start_urls.

  for row in data:
      # assuming every column in the row is a URL string
      for column in row:
          scrapurls.append(column)
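The same flattening can be written as a single comprehension (same assumption that every cell is a URL):

  scrapurls = [column for row in data for column in row]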

Try it this way as well:

filee = open("filename.csv", "r")

# remove the trailing '\n' newline character from each url
r = [i for i in filee]
start_urls = [url.replace('\n', '') for url in r]
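A slightly tidier equivalent using a context manager (same filename assumption), which also skips blank lines:

with open("filename.csv") as filee:
    start_urls = [line.strip() for line in filee if line.strip()]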

Comments
  • Update your error log please
  • I always get: KeyError: 'Link'
  • If your csv file doesn't have a column header with the name Link, you should get that error.
  • Thanks @SIM, you helped me!