Beautiful Soup | How to separate multiple attrs within <a> tags

I am trying to scrape a webpage to collect image names and their respective asset URLs, and write them to a CSV in two separate columns. I have not been able to separate the attrs out of the tags.

In BS4, I am able to run:

soup.find_all('a')

It successfully returns the HTML below (repeated once per photo on the page):

<a aria-label="SomeImageName" data-asset-id="10101010101" href="SomeWebsite">
    <img alt="SomeImageName" src="https://SomeImageUrl"/>
</a>

I have tried running the following (and many other variations)

soup.find_all('a', attrs{"aria-label", "src"})

and they return

[]

Anyone know how to extract this data from the tag and write to a CSV?

Cheers!

Welcome to StackOverflow! The data you need lives in two different elements: aria-label is on the a tag and src is on the img tag. Luckily, the img is nested inside the a tag, so iterating is simple.

Store the names and links in a list of dictionaries; with csv.DictWriter you can then easily write them to a CSV file.

import csv
img_data = []
for a_tag in soup.find_all('a'):
    data_dict = dict()
    data_dict['image_name'] = a_tag['aria-label']
    data_dict['url'] = a_tag.img['src']
    img_data.append(data_dict)

# newline='' stops the csv module from emitting blank rows on Windows
with open('urls.csv', 'w', newline='') as csvfile:
    fieldnames = ['image_name', 'url']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for data in img_data:
        writer.writerow(data)

Hope this helps! Cheers!
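
If some <a> tags on the page have no aria-label, or do not wrap an <img>, indexing with a_tag['aria-label'] or a_tag.img['src'] raises an error. The following is a minimal sketch of one way to skip such tags, not a definitive implementation. The SAMPLE_HTML string and the helper names extract_image_rows and rows_to_csv are assumptions made up for illustration, and the CSV is written to an in-memory buffer rather than a file so the example is self-contained.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the snippet in the question,
# plus two links that would break direct indexing.
SAMPLE_HTML = """
<a aria-label="SomeImageName" data-asset-id="10101010101" href="SomeWebsite">
  <img alt="SomeImageName" src="https://SomeImageUrl"/>
</a>
<a href="/about">About</a>
<a aria-label="NoImageHere" href="/text-only">text only</a>
"""


def extract_image_rows(html):
    """Return one dict per <a> that has an aria-label and a nested <img src>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a_tag in soup.find_all("a"):
        name = a_tag.get("aria-label")  # None instead of KeyError when missing
        img = a_tag.find("img")         # None when the link wraps no image
        if name is None or img is None or not img.get("src"):
            continue                    # skip links that lack either piece
        rows.append({"image_name": name, "url": img["src"]})
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["image_name", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_image_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

To write to disk instead, the same DictWriter calls can be pointed at a file opened with open('urls.csv', 'w', newline='').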

Try the code below. It extracts the value of the src attribute of the <img> tag inside each <a> tag that has an aria-label attribute, and writes those links to a CSV file.

## To get the value of src attribute in the <img> tag
tags = soup.find_all('a')
src=[]
for tag in tags:
    if tag.has_attr('aria-label'):
        src.append(tag.img['src'])

##writing to a csv file
with open('csvfile.csv','w') as file:
    for line in src:
        file.write(line)
        file.write('\n')

Or you can use the csv module to write the data

import csv
with open('csvfile1.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    # writerows emits one row per URL; writerow(src) would put them all in one row
    writer.writerows([url] for url in src)
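
With csv.writer, writerow(...) emits its whole argument as a single row, while writerows(...) emits one row per item. The small sketch below shows the difference; the values in src are hypothetical, the to_csv helper is made up for illustration, and an in-memory buffer stands in for the file.

```python
import csv
import io

# Hypothetical URLs standing in for the scraped src values.
src = ["https://SomeImageUrl", "https://OtherImageUrl"]


def to_csv(write):
    """Run `write` against a csv.writer and return the resulting text."""
    buffer = io.StringIO(newline="")
    write(csv.writer(buffer))
    return buffer.getvalue()


# All URLs land in ONE row, separated by commas.
one_row = to_csv(lambda w: w.writerow(src))

# Each URL gets its own single-column row.
one_per_line = to_csv(lambda w: w.writerows([url] for url in src))

print(one_row)
print(one_per_line)
```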

Thank you everyone for the input! I still wasn't able to pull aria-label, and I read on some other forums that this is a BS4 issue when parsing HTML.

I was, however, able to solve this quite easily using @SmashGuy's solution and pulling the alt text description instead of aria-label.

img_data = []
for img_tag in soup.find_all('img'):
    data_dict = dict()
    data_dict['image_name'] = img_tag['alt']
    data_dict['image_url'] = img_tag['src']
    img_data.append(data_dict)

And to write to CSV...

import csv

with open('BCDS1.csv', 'w', newline='') as birddata:
    fieldnames = ['image_name', 'image_url']
    writer = csv.DictWriter(birddata, fieldnames=fieldnames)
    writer.writeheader()
    for data in img_data:
        writer.writerow(data)

Thanks again for everyone's help! Cheers!

For images you need to find the <img> tag; <a> is the markup for links.

<a aria-label="SomeImageName" data-asset-id="10101010101" href="SomeWebsite">
    <img alt="SomeImageName" src="https://SomeImageUrl"/>
</a>

Your search found that image only because, as you can see, the link tag wraps the image tag.

Also, that's not how dictionary syntax works: attrs takes = followed by a dict of key: value pairs, as in attrs={} (see https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments).

So it's soup.find_all('a', attrs={'css': 'value'}) instead of soup.find_all('a', attrs{"aria-label" "SomeImageName"})
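
As a rough illustration of the attrs dictionary form, the sketch below filters <a> tags by an exact attribute value, by the mere presence of an attribute (passing True as the value), and with a CSS selector via select(). The SAMPLE_HTML string is an assumption modelled on the snippet in the question; the page's real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the snippet in the question.
SAMPLE_HTML = """
<a aria-label="SomeImageName" href="SomeWebsite"><img alt="SomeImageName" src="https://SomeImageUrl"/></a>
<a aria-label="OtherImageName" href="OtherWebsite"><img alt="OtherImageName" src="https://OtherImageUrl"/></a>
<a href="/about">About</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# attrs is a dict mapping attribute name -> required value.
exact = soup.find_all("a", attrs={"aria-label": "SomeImageName"})

# True as the value matches any <a> that has the attribute, whatever its value.
labelled = soup.find_all("a", attrs={"aria-label": True})

# CSS selector: every <img> with a src nested inside an <a> with an aria-label.
imgs = soup.select("a[aria-label] img[src]")

for a_tag in labelled:
    print(a_tag["aria-label"], a_tag.img["src"])
```

Note that attrs filters attributes of the tag being searched for, so src cannot be matched on <a>; it belongs to the nested <img>.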

Comments
  • Thanks for the reply! Still having an issue though: it seems it's not picking up aria-label. Here's the error I am getting on line 4: KeyError: 'aria-label'. Thoughts?
  • It seems some of your tags do not have the aria-label attribute. You can handle this either with try/except or with the following code: data_dict['image_name'] = a_tag['aria-label'] if 'aria-label' in str(a_tag) else ''. The latter will add an empty string if the attribute is missing, and the former will skip the blocks which don't have the attribute. P.S.: the if/else is a one-liner and you can add it just like that.
  • Awesome, it looks like that fixed part of it. Now I'm getting the same error as in Hari's example: TypeError: 'NoneType' object is not subscriptable, on the line containing data_dict['url'] = a_tag.img['src']. I'll do some research on this error now. Do you know of a fix? (It seems like these are some pretty common issues with BS. It's my first time using it. Thanks a billion for the help!)
  • Just add a similar if/else there, but the condition must check for the presence of the img tag. Something like if a_tag.img else ''. And instead of thanking me, please upvote my answer, which will motivate me to share more knowledge.
  • Thanks @SmashGuy I got it working. I posted an answer if you'd like to see what I ended up doing!
  • find_all and why not use the csv module?
  • @G_M, what is the difference between find_all() and findAll()?
  • @SmashGuy Link was provided... BeautifulSoup 3 vs 4
  • Thanks for the reply @Hari! I did however run into this error: TypeError Traceback (most recent call last) <ipython-input-199-9ab36d053c0c> in <module>() 3 for tag in tags: 4 if tag.has_attr('aria-label'): ----> 5 src.append(tag.img['src']) TypeError: 'NoneType' object is not subscriptable Thoughts?
  • Downvoted, as the answer did not provide an appropriate solution to the question. Try to understand the question before answering.
  • I pointed out that he targeted the <a> tag when he asked how to capture attributes of <img>, which means he would have missed any images outside of <a> tags. There is no single appropriate solution unless targeting <a> was exactly his intention, for instance because all the images in the HTML file, which was not included, are wrapped in <a> tags. I understood the question well enough to know that he didn't know enough BS4 and HTML.