Why doesn't xpath work when processing an XHTML document with lxml (in python)?

why doesn't t
why doesn't it work
why doesn't my
why doesn't he want me
why doesn't youtube work
why doesn't he love me
why doesn't glue stick to the inside of the bottle
why doesn't voldemort have a nose

I am testing against the following test document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
        <title>hi there</title>
    </head>
    <body>
        <img class="foo" src="bar.png"/>
    </body>
</html>

If I parse the document using lxml.html, I can get the IMG with an xpath just fine:

>>> root = lxml.html.fromstring(doc)
>>> root.xpath("//img")
[<Element img at 1879e30>]

However, if I parse the document as XML and try to get the IMG tag, I get an empty result:

>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath("//img")
[]

I can navigate to the element directly:

>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>

But of course that doesn't help me process arbitrary documents. I would also expect to be able to query etree to get an xpath expression that will directly identify this element, which, technically I can do:

>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
'/*/*[2]/*'
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]

But that xpath is, again, obviously not useful for parsing arbitrary documents.

Obviously I am missing some key issue here, but I don't know what it is. My best guess is that it has something to do with namespaces but the only namespace defined is the default and I don't know what else I might need to consider in regards to namespaces.

So, what am I missing?

The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace.

Try this:

>>> tree.getroot().xpath(
...     "//xhtml:img", 
...     namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
...     )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]

Why doesn't the Leaning Tower of Pisa fall over?, If playback doesn't begin shortly, try restarting your device. Your browser does not currently Duration: 5:06 Posted: Dec 3, 2019 Robert De Niro on Tuesday trashed Donald Trump’s floundering handling of the coronavirus pandemic, slamming the president as a “lunatic” who “doesn’t even care how many people die” so long as he wins the 2020 election.

XPath considers all unprefixed names to be in "no namespace".

In particular the spec says:

"A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). "

See those two detailed explanations of the problem and its solution: here and here. The solution is to associate a prefix (with the API that's being used) and to use it to prefix any unprefixed name in the XPath expression.

Hope this helped.

Cheers,

Dimitre Novatchev

Why doesn't the US use the metric system?, Why doesn't the US use the metric system? By Benjamin Plackett - Live Science Contributor 5 days ago. Answer: it's been too long since the last revolution. Why doesn’t the U.S. use the metric system? Phil Lewis/shutterstock As of today, the entire world has adopted the metric system, with the exception of the United States, Myanmar, and Liberia.

If you are going to use tags from a single namespace only, as I see it the case above, you are much better off using lxml.objectify.

In your case it would be like

from lxml import objectify
root = objectify.parse(url) #also available: fromstring

You can access the nodes as

root.html
body = root.html.body
for img in body.img: #Assuming all images are within the body tag

While it might not be of great help in html, it can be highly useful in well structured xml.

For more info, check out http://lxml.de/objectify.html

IT Doesn't Matter, IT Doesn't Matter � Spend Less. Rigorously evaluate expected returns from IT investments. � Follow, Don't Lead. Delay IT investments to significantly cut costs and� Why does he keep coming back if he doesn’t want a relationship? So if a guy doesn’t want you, why doesn’t he just let you go? Why does he keep reappearing (and always just when you start moving on)? Well … it’s not always so cut and dry. He may not want a relationship with you, but that doesn’t mean he has zero feelings for you.

Why Doesn't China Deploy Fighter Jets to the Spratly Islands? – The , Why Doesn't China Deploy Fighter Jets to the Spratly Islands? In this Friday, April 21, 2017, file photo, an airstrip, structures, and buildings on� 2017-08-05T04:05:12Z. Teller shared the origin story of why he doesn’t speak during his act. He said, “When I was a teenager, I rebelled against the idea of Magic Patter because it always

Why doesn't medical school prioritize social justice?, Why doesn't medical school prioritize social justice? If academic medical institutions are serious about equity, work in the community should be� Figuring out why children are so unaffected could lead to breakthroughs in understanding how and why the virus sickens and kills other age groups, said Frank Esper, a pediatric infectious disease

If your computer doesn't recognize your iPhone, iPad, or iPod , If you connect your device to your computer with a USB cable and your computer doesn't recognize your iPhone, iPad, or iPod, get help. “Why doesn't my hydrangea bloom?" It wasn't until I became a Master Gardener that I ever thought about this question. I have always been familiar with the four main types of hydrangeas: the peegee (Hydrangea paniculata 'Grandiflora') tree that came with my parents' house in 1946; the 'Hills of Snow' variety (Hydrangea arborescens) between my parents' and the neighbor's backyard; my Uncle

Comments
  • Quoting from codespeak.net/lxml/xpathxslt.html <<Optionally, you can provide a namespaces keyword argument, which should be a dictionary mapping the namespace prefixes used in the XPath expression to namespace URIs>>
  • If you want to search with compact xpath expressions in the default namespace of the root element, you can use a trick that works for xhtml or other schemas, something like: nsmap = {'h': tree.getroot().nsmap[None]}; elem.xpath('//h:img', namespaces=nsmap — that makes it easy to write the query compactly.