Python BeautifulSoup: how can I get the data for the latest selector?

After sending a Python HTTP request, its response (data) contains an HTML page with many blocks for ABCD. Here is one snippet:

                   <tr>
                        <td class="success"></td>
                        <td class="truncate">ABCD</td>
                        <td>12/18/2018 21:45</td>
                        <td>12/18/2018 21:46</td>
                        <td>10</td>
                        <td>10</td>
                        <td>100.0</td>
                        <td><span class="label success">Success</span></td>
                        <td>SMS</td>
                        <td>
                            <a data-id="134717" class="btn" title="Go">View</a>
                        </td>
                    </tr>

I need to retrieve the most recent data-id for ABCD (in this case 134717; this number is dynamic). Note that there are many of these ABCD rows with different dates, and I want the most recent one.

I could do it with regular expressions, going through the page line by line, but I think it is better to do it with BeautifulSoup.

I tried this. It finds all the ABCDs, but I don't know how to get the most recent one:

    soup = BeautifulSoup(data, "html.parser")
    for i in soup.select("td.truncate"):
        #print(i.text)
        if i.text == "ABCD":
            print ("Got it ", i.text)
            id1 = soup.select_one("a.data-id")
            print (id1)
            parsed_url1 = urlparse(id1)

You'll need the dateutil parser for this one. There is no way of telling in advance which <td> holds the date, so you iterate over all the <td>s in the matched <tr>s and try to parse each one as a datetime; if parsing succeeds, append the result to the list of dates for that row's ID. Once you have all the dates for each ID, take the max of each list to find the latest.

from collections import defaultdict

from bs4 import BeautifulSoup as BS
from dateutil import parser as du_parser

data = "<tr><td class=\"success\"></td><td class=\"truncate\">ABCD</td><td>12/18/2018 21:45</td><td>12/18/2018 21:46</td><td>10</td><td>10</td><td>100.0</td><td><span class=\"label success\">Success</span></td><td>SMS</td><td><a data-id=\"134717\" class=\"btn\" title=\"Go\">View</a></td></tr>"
b1 = BS(data, "html.parser")

# rows (<tr>) that contain a <td> whose text is exactly "ABCD"
tr_that_contain_our_td = [x.parent for x in b1.find_all("td", string="ABCD")]

ids_dict = defaultdict(list)

# iterate over matched tr's to get their dates
for tr in tr_that_contain_our_td:
    extracted_id = tr.find("a")['data-id']

    for td in tr.find_all("td"):
        # use the cell's text, since the first child may be a tag such as <span> or <a>
        text = td.get_text(strip=True)
        # skip cells without "/": dateutil would read a bare "10" as a day of the current month
        if "/" not in text:
            continue
        try:
            ids_dict[extracted_id].append(du_parser.parse(text))
        except (ValueError, OverflowError):
            pass  # not a date, nothing to do here

# keep only the latest date seen for each id
ids_dict = {k: max(v) for k, v in ids_dict.items()}

print(ids_dict)
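The code above ends with one latest date per data-id, while the question asks for a single id. A minimal sketch of that last step, assuming a dictionary shaped like ids_dict (the second id, 134990, and its date are made up for illustration):

```python
from datetime import datetime

# Hypothetical sample shaped like ids_dict above: data-id -> latest datetime.
# "134990" and its timestamp are invented for this example.
ids_dict = {
    "134717": datetime(2018, 12, 18, 21, 46),
    "134990": datetime(2018, 12, 22, 21, 46),
}

# Pick the key whose datetime value is the largest.
latest_id = max(ids_dict, key=ids_dict.get)
print(latest_id)
```

If two ids share the same latest date, max() returns whichever it encounters first, so a tie-break rule may be needed for real data.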

Assuming the HTML follows the same pattern, given:

html = '''                   <tr>
                        <td class="success"></td>
                        <td class="truncate">ABCD</td>
                        <td>12/18/2018 21:45</td>
                        <td>12/18/2018 21:46</td>
                        <td>10</td>
                        <td>10</td>
                        <td>100.0</td>
                        <td><span class="label success">Success</span></td>
                        <td>SMS</td>
                        <td>
                            <a data-id="134717" class="btn" title="Go">View</a>
                        </td>
                    </tr>


                    <tr>
                        <td class="success"></td>
                        <td class="truncate">ABCD</td>
                        <td>12/20/2018 21:45</td>
                        <td>12/20/2018 21:46</td>
                        <td>99</td>
                        <td>99</td>
                        <td>999.0</td>
                        <td><span class="label success999">Success</span></td>
                        <td>SMS99</td>
                        <td>
                            <a data-id="9913471799" class="btn" title="Go">View</a>
                        </td>
                    </tr>

                                        <tr>
                        <td class="success"></td>
                        <td class="truncate">ABCD</td>
                        <td>12/22/2018 21:45</td>
                        <td>12/22/2018 21:46</td>
                        <td>99</td>
                        <td>99</td>
                        <td>999.0</td>
                        <td><span class="label success999">Success</span></td>
                        <td>SMS99</td>
                        <td>
                            <a data-id="found the latest date" class="btn" title="Go">View</a>
                        </td>
                    </tr>

                                        <tr>
                        <td class="success"></td>
                        <td class="truncate">ABCD</td>
                        <td>12/21/2018 21:45</td>
                        <td>12/21/2018 21:46</td>
                        <td>99</td>
                        <td>99</td>
                        <td>999.0</td>
                        <td><span class="label success999">Success</span></td>
                        <td>SMS99</td>
                        <td>
                            <a data-id="9913471799" class="btn" title="Go">View</a>
                        </td>
                    </tr>'''

Find the latest date:

import bs4
import re
import datetime

dates_list = []

soup = bs4.BeautifulSoup(html, 'html.parser')

for i in soup.select("td.truncate"):
    # first MM/DD/YYYY date in the row's text
    match = re.search(r'\d{2}/\d{2}/\d{4}', i.parent.text)
    if match is None:
        continue  # row has no date, skip it
    # keep real date objects: MM/DD/YYYY strings sort incorrectly across years
    dates_list.append(datetime.datetime.strptime(match.group(), '%m/%d/%Y').date())

# latest date, formatted back the way it appears in the page
most_recent = max(dates_list).strftime('%m/%d/%Y')

rows = soup.find_all('tr')
for row in rows:
    if most_recent in row.text:
        id1 = row.find("a").get('data-id')
        print(id1)

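If the data-id is an increasing number, you can select the tag with the highest data-id value using max(). Compare the values as integers, because as strings "99" would sort above "134717".

recentDataID = max((x.get('data-id') for x in soup.select("a[data-id]")), key=int)
print(recentDataID)

# if you want to select the parent or `tr`
# (the attribute value is quoted because an unquoted value starting with a digit is not valid CSS)
mostRecentRow = soup.select_one('a[data-id="%s"]' % recentDataID).parent.parent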

Comments
  • I think you'll have to iterate through, otherwise how would you know if you have the most recent? I'd probably iterate and just create a dictionary with the date as the key, then the data-id as the value. Then after you have all the dates:id, get the most recent date(key).
  • This worked, thank you Andrei. Very elegant and correct answer.
  • I tried this but didn't get the result; id1 came out as None.
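The dictionary idea from the first comment (date as key, data-id as value, then take the most recent key) can be sketched as follows. This is only an illustration under stated assumptions: the function name latest_data_id, the sample rows, and the ids 100, 200 and 300 are made up, and the dates are assumed to always use the MM/DD/YYYY HH:MM format shown in the question.

```python
from datetime import datetime

from bs4 import BeautifulSoup


def latest_data_id(html, name="ABCD"):
    """Return the data-id of the row for `name` with the latest date, or None."""
    soup = BeautifulSoup(html, "html.parser")
    id_by_date = {}  # datetime -> data-id
    for td in soup.select("td.truncate"):
        if td.get_text(strip=True) != name:
            continue
        row = td.find_parent("tr")
        link = row.select_one("a[data-id]")
        if link is None:
            continue  # row has no link carrying a data-id
        for cell in row.find_all("td"):
            try:
                # Assumed date format, taken from the question's snippet.
                when = datetime.strptime(cell.get_text(strip=True), "%m/%d/%Y %H:%M")
            except ValueError:
                continue  # this cell is not a date
            id_by_date[when] = link["data-id"]
    if not id_by_date:
        return None
    return id_by_date[max(id_by_date)]


# Hypothetical sample data for illustration only.
sample_html = """
<table>
  <tr><td class="truncate">ABCD</td><td>12/18/2018 21:45</td><td>12/18/2018 21:46</td>
      <td><a data-id="100" class="btn">View</a></td></tr>
  <tr><td class="truncate">ABCD</td><td>12/22/2018 21:45</td><td>12/22/2018 21:46</td>
      <td><a data-id="200" class="btn">View</a></td></tr>
  <tr><td class="truncate">WXYZ</td><td>12/25/2018 09:00</td><td>12/25/2018 09:01</td>
      <td><a data-id="300" class="btn">View</a></td></tr>
</table>
"""

result = latest_data_id(sample_html)
print(result)
```

Because the date is the dictionary key, two rows with exactly the same timestamp would overwrite each other; keeping a list of (date, id) pairs avoids that if it matters.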