How to get innerHTML of a node using scrapy Selector?
Suppose there are some html fragments like:
<a> text in a <b>text in b</b> <c>text in c</c> </a>
<a> <b>text in b</b> text in a <c>text in c</c> </a>
I want to extract the text inside each <a> tag, dropping the nested tags but keeping their text. For the fragments above, the results would be "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes with the Scrapy Selector css() function; how should I process those nodes to get what I want? Any ideas would be appreciated, thank you!
Here's what I managed to do:
from scrapy.selector import Selector

sel = Selector(text=html_string)
for node in sel.css('a *::text'):
    print(node.extract())
Here html_string is a variable holding the HTML from your question. This code produces the following output:
text in a text in b text in c text in b text in a text in c
a *::text matches all the text nodes that are descendants of the a elements.
Note that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements: response.css('title::text').get() returns 'Example website'. When scraping web pages, you extract parts of the HTML source with selectors, written as either XPath or CSS expressions. Selectors are built upon the lxml library, which processes XML and HTML in Python.
You can use XPath's string() function on the elements you select:
$ python
>>> import scrapy
>>> selector = scrapy.Selector(text="""<a>
...  text in a
...  <b>text in b</b>
...  <c>text in c</c>
... </a>
... <a>
...  <b>text in b</b>
...  text in a
...  <c>text in c</c>
... </a>""", type="html")
>>> for link in selector.css('a'):
...     print(link.xpath('string(.)').extract())
...
['\n text in a\n text in b\n text in c\n']
['\n text in b\n text in a\n text in c\n']
>>>
In Scrapy 1.5, you can append /* to the XPath to get the inner HTML:
content = response.xpath('//div[@class="viewbox"]/div[@class="content"]/*').extract_first()
- This is great, but I managed to do it with sel.css("a").extract() and then a regex to strip the HTML tags.
- @kuixiong Great! Note that parsing HTML with regex is generally not considered a good practice. If you control that HTML and it is simple enough, go ahead and use regex. Otherwise, consider relying on specialized tools.
- The solution collects the text, not the innerHTML.
- This is the best and safest solution.
- This will only extract the first node in .content. Use extract() with ''.join to get the full inner HTML as a string.