Web scraping script is returning duplicate values

My web scraping script is returning duplicate results. I've tried so many alternatives, but I just can't get it to work. Can anyone help, please?

import requests
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import csv

soup = [ ]
pages = [ ]

csv_file = open('444.csv', 'w')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Practice', 'Practice Manager'])

for i in range(35899, 35909):
   url = 'https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i)
   pages.append(url)

for item in pages:
   page = requests.get(item)
   soup.append(bs(page.text, 'lxml'))

business = []
for items in soup:
   h1Obj = items.select('[class^=panel]:has([class^="gp notranslate"]:contains(""))')
   for i in h1Obj:
      tagArray = i.findChildren()
   for tag in tagArray:
      if isinstance(tag,Tag) and tag.name in 'h1':
         business.append(tag.text)
      else:
         print('no-business')

names = []
for items in soup:
   h4Obj = items.select('[class^=panel]:not(p):has([class^="staff-title"]:contains("Practice Manager"))')
   for i in h4Obj:
      tagArray = i.findChildren()
      for tag in tagArray:
         if isinstance(tag,Tag) and tag.name in 'h4':
            names.append(tag.text)
         else:
            print('no-name')

print(business, names)
csv_writer.writerow([business, names])
csv_file.close()

It's currently returning duplicate values for every field.

What it needs to do is return one 'business' and one 'names' value per url call. If there is no 'business' or 'name', it needs to return a value of 'no-business' or 'no-name'.

Can anyone please help me?

You could use the `id` attribute to generate the initial list of lists. You could also write each row to csv as you go, rather than appending everything to a final list.

import requests
from bs4 import BeautifulSoup as bs

results = []
with requests.Session() as s:

    for i in range(35899, 35909):
        r = s.get('https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i))
        soup = bs(r.content, 'lxml')
        row = [item.text for item in soup.select('.staff-title:has(em:contains("Practice Manager")) [id]')]
        if not row: row = ['no practice manager']
        practice = soup.select_one('.gp').text if soup.select_one(':has(#org-title)') else 'No practice name'
        row.insert(0, practice)
        results.append(row)
print(results)

I'm not sure how you want multiple names listed out, but here is a version that writes each row straight to csv:

import requests
from bs4 import BeautifulSoup as bs
import csv

with open('output.csv', 'w', newline='') as csvfile:
    w = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)

    with requests.Session() as s:

        for i in range(35899, 35909):
            r = s.get('https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i))
            soup = bs(r.content, 'lxml')
            row = [item.text for item in soup.select('.staff-title:has(em:contains("Practice Manager")) [id]')]
            if not row: row = ['no practice manager']
            practice = soup.select_one('.gp').text if soup.select_one(':has(#org-title)') else 'No practice name'
            row.insert(0, practice)
            w.writerow(row)
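
The `select_one(...) if ... else default` guard above is what gets you exactly one value per page. As a minimal offline sketch of the same pattern (the HTML snippets here are made up for illustration, and `html.parser` is used so no lxml install is needed):

```python
from bs4 import BeautifulSoup as bs

def practice_name(html):
    soup = bs(html, 'html.parser')
    node = soup.select_one('.gp')
    # select_one returns None when nothing matches, so guard before .text
    return node.text if node else 'No practice name'

print(practice_name('<h1 class="gp">Example Practice</h1>'))  # Example Practice
print(practice_name('<p>Profile hidden</p>'))                 # No practice name
```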


I don't know if it's the best way of doing it, but I used a set instead of a list to remove duplicates, and just before saving the file I convert the set back to a list, like this:

import requests
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import csv

soup = [ ]
pages = [ ]

csv_file = open('444.csv', 'w')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Practice', 'Practice Manager'])

for i in range(35899, 35909):
   url = 'https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i)
   pages.append(url)

for item in pages:
   page = requests.get(item)
   soup.append(bs(page.text, 'lxml'))

business = set()
for items in soup:
   h1Obj = items.select('[class^=panel]:has([class^="gp notranslate"]:contains(""))')
   for i in h1Obj:
      tagArray = i.findChildren()
   for tag in tagArray:
      if isinstance(tag, Tag) and tag.name == 'h1':
         business.add(tag.text)
      else:
         print('no-business')


names = set()
for items in soup:
   h4Obj = items.select('[class^=panel]:not(p):has([class^="staff-title"]:contains("Practice Manager"))')
   for i in h4Obj:
      tagArray = i.findChildren()
      for tag in tagArray:
         if isinstance(tag, Tag) and tag.name == 'h4':
            names.add(tag.text)
         else:
            print('no-name')

print(business, names)
csv_writer.writerow([list(business), list(names)])
csv_file.close()
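
A side note on the set approach: a set removes duplicates but also discards insertion order, so the pairing between the i-th business and the i-th name can get scrambled. If order matters, one alternative (a sketch, not part of the original code) is `dict.fromkeys`, which deduplicates while keeping first occurrences, since dicts preserve insertion order in Python 3.7+:

```python
def dedupe_keep_order(items):
    # dict.fromkeys keeps only the first occurrence of each item,
    # and dicts preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))

business = ['Bilbrook Medical Centre',
            'Caversham Group Practice',
            'Caversham Group Practice']
print(dedupe_keep_order(business))
# ['Bilbrook Medical Centre', 'Caversham Group Practice']
```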


Looks like the problem stems from the fact that, in some of these pages, there is no information at all, and you get a "Profile Hidden" error. I modified your code somewhat, to cover the first 5 pages. Aside from saving to file, it looks like this:

[same imports]
pages = [ ]

for i in range(35899, 35904):
   url = 'https://www.nhs.uk/Services/GP/Staff/DefaultView.aspx?id=' + str(i)
   pages.append(url)

soup = [ ]
for item in pages:
   page = requests.get(item)
   soup.append(bs(page.text, 'lxml'))

business = []
for items in soup:
    h1Obj = items.select('[class^=panel]:has([class^="gp notranslate"]:contains(""))')
    for i in h1Obj:
        tagArray = i.findChildren()
    for tag in tagArray:
        if isinstance(tag, Tag) and tag.name == 'h1':
            business.append(tag.text)


names = []
for items in soup:
    h4Obj = items.select('[class^=panel]:not(p):has([class^="staff-title"]:contains("Practice Manager"))')
    for i in h4Obj:
        tagArray = i.findChildren()
    for tag in tagArray:
        if isinstance(tag, Tag) and tag.name == 'h4':
            names.append(tag.text)

for bus, name in zip(business,names):
    print(bus,'---',name)

The output looks like this:

Bilbrook Medical Centre --- Di Palfrey
Caversham Group Practice --- Di Palfrey
Caversham Group Practice --- Di Palfrey
The Moorcroft Medical Ctr --- Ms Kim Stanyer 
Brotton Surgery --- Mrs Gina Bayliss

Notice that only the 2nd and 3rd entries are duplicated; that is (somehow, not sure why) caused by the "Hidden Profile" in the third page. So if you modify the main blocks of the code to:

business = []
for items in soup:
    if "ProfileHiddenError.aspx" in str(items):
        business.append('Profile Hidden')
    else:
        h1Obj = items.select('[class^=panel]:has([class^="gp notranslate"]:contains(""))')
        for i in h1Obj:
            tagArray = i.findChildren()
        for tag in tagArray:
            if isinstance(tag, Tag) and tag.name == 'h1':
                business.append(tag.text)


names = []
for items in soup:
    if "ProfileHiddenError.aspx" in str(items):
        names.append('Profile Hidden')
    elif "Practice Manager" not in str(items):
        names.append('No Practice Manager Specified')
    else:
        h4Obj = items.select('[class^=panel]:not(p):has([class^="staff-title"]:contains("Practice Manager"))')
        for i in h4Obj:
            tagArray = i.findChildren()
        for tag in tagArray:
            if isinstance(tag, Tag) and tag.name == 'h4':
                names.append(tag.text)


for bus, name in zip(business, names):
    print(bus, '---', name)

The output, this time is:

Bilbrook Medical Centre --- Di Palfrey
Caversham Group Practice --- No Practice Manager Specified
Profile Hidden --- Profile Hidden
The Moorcroft Medical Ctr --- Ms Kim Stanyer 
Brotton Surgery --- Mrs Gina Bayliss

Hopefully this would help you to troubleshoot the problem.
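
One caveat with the `zip(business, names)` pairing above (a hedged side note, not a flaw in the fix itself): `zip` silently stops at the shorter list, so if a page ever yields a business without a name, or vice versa, rows get dropped without warning. `itertools.zip_longest` pads the shorter list with a placeholder instead:

```python
from itertools import zip_longest

business = ['Bilbrook Medical Centre', 'The Moorcroft Medical Ctr', 'Brotton Surgery']
names = ['Di Palfrey', 'Ms Kim Stanyer']  # one name short

# zip() would drop 'Brotton Surgery' entirely; zip_longest keeps it
# and fills the missing name with an explicit placeholder
for bus, name in zip_longest(business, names, fillvalue='no-name'):
    print(bus, '---', name)
```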


Comments
  • Do you want just the practice manager for each practice?
  • Basically yes, but I also need it to say which practice they are a manager of; some have multiple managers, some have none at all, so it needs to say 'no-name' for those.
  • So only practice managers and if multiple return multiple?
  • Yes, I need the practice name (business name) too, so i know where they have come from.
  • No worries. Looks like all staff with titles can be pulled with staff = [(item.text, item.next_sibling.next_sibling.text) for item in soup.select('[id^=staff]')]
  • That works a treat, thank you so much for your help. I just need to make the else statement work, do you have any ideas how I could get it to return 'no-name' or 'no-business' when there is nothing found under the given url?
  • Thank you for your help. It seems to have cleaned it up a lot more, but the data is now coming back incorrect, as the Caversham Group Practice should return 'no-name', as there is no manager there at all. Any suggestions? I've been talking to a rubber duck for the last few hours trying to figure out what's wrong :-( Slowly going crazy lol.
  • @MissPepper You're right! I modified the "names" block above to account for that. Let's see if this works...