How to extract the href attribute after a particular th in the wikipage infobox through Selenium or lxml using Python
The problem I have is to get the href of a particular cell in the infobox on a wiki page (please see the image below). Specifically, I would like to get the href of 3M's official website, which sits after the table's row header of "Website". The relevant source code is highlighted in the image. (This infobox format is fairly standard across most firms' wiki pages, and I further plan to collect websites for many firms, so it's not just to collect this one.)
The things I have tried but don't work:
# selenium:
driver.find_element_by_xpath("//table[@class='infoboxvcard']/tr[th/text()='Website']").get_attribute("href")

# lxml:
url = "https://en.wikipedia.org/wiki/3M"
req = requests.get(url)
store = etree.fromstring(req.text)
output = store.xpath("//table[@class='infobox vcard']/tr[th/text()='Website']/td")
Code that works for a particular firm:
driver.get("https://en.wikipedia.org/wiki/3M")
website = driver.find_element_by_xpath("//*[@id='mw-content-text']/div/table/tbody/tr/td/span/a").get_attribute("href")
However, since not all firms have the same number of rows, this code would not work when I loop over hundreds of firms.
Any help would be appreciated! Thanks in advance!
Screenshot from the 3M wiki page:
To extract the href attribute of 3M's official website from Wikipedia, Selenium itself is sufficient. You need to induce WebDriverWait for the desired element to be visible, and you can use the following solution:
website = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//th[@scope='row' and text()='Website']//following::td/span/a[@class='external text']"))).get_attribute("href")
Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
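For readers without a browser handy, the header-anchored idea behind this locator can be checked offline with lxml. The fragment below is a simplified, illustrative stand-in for the real infobox (not the live page), and the XPath is a slightly simplified variant of the one above:

```python
from lxml import html

# Illustrative stand-in for a Wikipedia infobox; the real page would be
# rendered by the browser and queried through Selenium.
PAGE = """
<table class="infobox vcard">
  <tbody>
    <tr><th scope="row">Industry</th><td>Conglomerate</td></tr>
    <tr><th scope="row">Website</th>
        <td><span class="url"><a class="external text"
            href="http://www.3m.com">3M.com</a></span></td></tr>
  </tbody>
</table>
"""

tree = html.fromstring(PAGE)
# Anchor on the <th> that reads "Website", then take the first <td> on
# the following axis -- independent of how many rows precede it.
hrefs = tree.xpath("//th[@scope='row' and text()='Website']"
                   "/following::td[1]//a/@href")
print(hrefs[0] if hrefs else None)  # http://www.3m.com
```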
This is a more robust xpath:
website = driver.find_element_by_xpath('//*[@class="url"]/a').get_attribute("href")
If you know the text you can use:
website = driver.find_element_by_link_text('3M.com').get_attribute("href")
Hope this helps you!
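As a quick offline check of this locator with lxml (the fragment below is illustrative; on the real page the selector runs inside Selenium against the rendered DOM):

```python
from lxml import html

# Illustrative stand-in for the infobox cell holding the website link.
FRAGMENT = ('<table><tr><td><span class="url">'
            '<a href="http://www.3m.com">3M.com</a>'
            '</span></td></tr></table>')

tree = html.fromstring(FRAGMENT)
# Same locator as above: any element carrying class="url", then its <a>.
href = tree.xpath('//*[@class="url"]/a/@href')[0]
print(href)  # http://www.3m.com
```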
What you could do is store all the link texts in an Excel sheet, fetch each string from Excel, and assign it to a variable, as I have done in the example below. Then my code should work:
wb_link_text = "3M.com"
wb_ele_href = driver.find_element_by_xpath("//a[text()[contains(.,'" + wb_link_text + "')]]").get_attribute("href")
print(wb_ele_href)
Let me know if that helps.
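Looping this over many link texts might look like the sketch below. A plain list stands in for the Excel sheet, and lxml on an illustrative fragment stands in for the live driver (with Selenium, the same XPath would go to find_element_by_xpath):

```python
from lxml import html

# Hypothetical list standing in for link texts read from an Excel sheet.
link_texts = ["3M.com", "abbott.com"]

# Illustrative page fragment in place of the live pages.
PAGE = ('<div><a href="http://www.3m.com">3M.com</a>'
        '<a href="https://www.abbott.com">abbott.com</a></div>')
tree = html.fromstring(PAGE)

for wb_link_text in link_texts:
    # Same contains() expression as above; note it assumes the link
    # text contains no single quotes.
    xpath = "//a[text()[contains(.,'" + wb_link_text + "')]]"
    wb_ele_href = tree.xpath(xpath)[0].get("href")
    print(wb_link_text, "->", wb_ele_href)
```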
- So you are looking for a more robust xpath?
- @MosheSlavin yes. I want to put this xpath in a loop so it would work for many firm’s wikipage. The one works won’t work for this firm: en.m.wikipedia.org/wiki/Abbott_Laboratories
- So has the xpath in the answer helped?
- @MosheSlavin yep! It works except when the wiki page has a different structure. But that's pretty solvable! Many thanks!
- Your solution is also pretty cool!!! Like Moshe Slavin's solution, it works pretty well except when the wiki page has a different structure. But that's pretty solvable! Thanks so much!
- @QiaoWang Perhaps my answer was based on your exact wording, after the table's row header of "Website" :)
- The issue I ran into is the "td/span/a" part. For some firms it is just "td/a", in which case I need to adjust a little. Sorry for the typo in my last comment. You're correct... I specifically asked "after xxx". Your code is the most direct one.
- Your answer is the better Selenium practice, using WebDriverWait! +1, yet I like my XPath better... what do you think?
- 3M.com will be native to https://en.wikipedia.org/wiki/3M, but OP seems to be looking for a generic solution. //*[@class="url"]/a looks good enough, but OP again seems to be specifically looking for after the table's row header of "Website", which seems not to be within the initial viewport, so for a safer bet you need WebDriverWait.