How to get innerHTML of a node using scrapy Selector?


Suppose there are some html fragments like:

<a>
   text in a
   <b>text in b</b>
   <c>text in c</c>
</a>
<a>
   <b>text in b</b>
   text in a
   <c>text in c</c>
</a>

I want to extract the text inside each a tag, dropping the child tags but keeping their text. For the fragments above, the content I want to extract would be "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes using the Scrapy Selector css() function; how should I process these nodes to get what I want? Any ideas would be appreciated, thank you!

Here's what I managed to do:

from scrapy.selector import Selector

sel = Selector(text=html_string)

for node in sel.css('a *::text'):
    print(node.extract())

Assuming that html_string is a variable holding the html in your question, this code produces the following output:

   text in a

text in b


text in c




text in b

   text in a

text in c

The selector a *::text matches all the text nodes that are descendants of a nodes.

From the Scrapy documentation on selectors: CSS selectors can select text or attribute nodes using CSS3 pseudo-elements. For example, response.css('title::text').get() returns 'Example website' on the documentation's sample page. When scraping web pages, you extract a certain part of the HTML source with selectors, written as either XPath or CSS expressions. Selectors are built on the lxml library, which processes XML and HTML in Python.

You can use XPath's string() function on the elements you select:

$ python
>>> import scrapy
>>> selector = scrapy.Selector(text="""<a>
...    text in a
...    <b>text in b</b>
...    <c>text in c</c>
... </a>
... <a>
...    <b>text in b</b>
...    text in a
...    <c>text in c</c>
... </a>""", type="html")
>>> for link in selector.css('a'):
...     print(link.xpath('string(.)').extract())
... 
['\n   text in a\n   text in b\n   text in c\n']
['\n   text in b\n   text in a\n   text in c\n']
>>> 


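Try this, which returns the child nodes of every a element (text nodes and serialized child elements) as a list of strings: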

response.xpath('//a/node()').extract()


In Scrapy 1.5, you can use /* to get the inner HTML. Example:

content = response.xpath('//div[@class="viewbox"]/div[@class="content"]/*').extract_first()


From the Scrapy documentation: Scrapy selectors are instances of the Selector class, constructed by passing either a TextResponse object or markup as a Unicode string (in the text argument). Usually there is no need to construct Scrapy selectors manually: the response object is available in spider callbacks, so in most cases it is more convenient to use response.css() and response.xpath().



Comments
  • This is great, but I managed to do it with sel.css("a").extract() and then a regex to strip the HTML tags.
  • @kuixiong Great! Note that parsing HTML with regex is generally not considered a good practice. If you control that HTML and it is simple enough, go ahead and use regex. Otherwise, consider relying on specialized tools.
  • The solution collects the text, not the innerHTML.
  • This is the best and safest solution.
  • This will only extract the first node in .content; use extract() with ''.join to get the full inner HTML as a string.