undefinedfix

How to write XPath filter elements

noket edited on Thu, 22 Dec 2022

I'm a Python novice and this is a fairly basic problem, so please go easy on me. I need to crawl some data and have an XPath question. As the HTML below shows, if a <span class="media-caption__text"></span> tag is present, I want its inner text; otherwise I want the inner text of <figcaption></figcaption> itself. In both cases the content of <span class="off-screen"></span> must be filtered out.

The HTML code is as follows:

<figcaption class="media-caption">
    <span class="off-screen">Image caption</span> 
    <span class="media-caption__text"> &#32445;&#32422;&#24066;&#26159;&#32654;&#22269;&#30123;&#24773;&#30340;&#8220;&#38663;&#20013;&#8221;&#12290;    </span>
</figcaption>

or

<figcaption class="media-with-caption__caption">
    <span class="off-screen"></span>     
    &#22833;&#19994;&#20013;&#30340;&#32654;&#22269;&#38738;&#24180;&#65306;&#27882;&#27700;&#12289;&#24656;&#24807;&#19982;&#19981;&#23433;
</figcaption>
2 Replies
user2176971
commented on Thu, 22 Dec 2022

Why not handle this in code logic? It's ugly in XPath:

//figcaption/span[@class="media-caption__text"]/text()[normalize-space()] | //figcaption[not(span[@class="media-caption__text"])]/text()[normalize-space()]
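For comparison, the "code logic" route could look something like the sketch below. It is only an illustration, not a definitive implementation: the helper names caption_text and extract_captions are made up for this example, the sample markup uses placeholder English captions, and it assumes each figcaption holds at most one media-caption__text span.

```python
from lxml import etree


def caption_text(figcaption):
    """Return the caption of one <figcaption>, ignoring the off-screen span."""
    # Prefer the dedicated caption span when it exists.
    spans = figcaption.xpath('./span[@class="media-caption__text"]')
    if spans:
        return "".join(spans[0].itertext()).strip()
    # Otherwise fall back to the figcaption's own direct text nodes.
    # Text inside child elements (such as the off-screen span) is not included.
    return "".join(figcaption.xpath("./text()")).strip()


def extract_captions(markup):
    """Parse an HTML string and return one caption per <figcaption>."""
    tree = etree.HTML(markup)
    return [caption_text(fc) for fc in tree.xpath("//figcaption")]


# Placeholder markup mirroring the two variants from the question.
sample = '''
<figcaption class="media-caption">
    <span class="off-screen">Image caption</span>
    <span class="media-caption__text"> First caption </span>
</figcaption>
<figcaption class="media-with-caption__caption">
    <span class="off-screen"></span>
    Second caption
</figcaption>
'''

print(extract_captions(sample))  # ['First caption', 'Second caption']
```

Doing the choice per figcaption keeps each caption paired with its own element, which a single document-wide XPath union does not guarantee.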
jet_black82
commented on Thu, 22 Dec 2022
from lxml import etree

# Sample markup containing both caption variants from the question.
text = '''
<figcaption class="media-caption">
<span class="off-screen">Image caption</span>
<span class="media-caption__text"> &#32445;&#32422;&#24066;&#26159;&#32654;&#22269;&#30123;&#24773;&#30340;&#8220;&#38663;&#20013;&#8221;&#12290; </span>
</figcaption>
<figcaption class="media-with-caption__caption">
<span class="off-screen"></span>
&#22833;&#19994;&#20013;&#30340;&#32654;&#22269;&#38738;&#24180;&#65306;&#27882;&#27700;&#12289;&#24656;&#24807;&#19982;&#19981;&#23433;
</figcaption>
'''
html = etree.HTML(text)
# Select non-blank text nodes under each figcaption, skipping text inside the off-screen span.
result = html.xpath('//figcaption//text()[normalize-space()][not(parent::span[@class="off-screen"])]')
# Trim the whitespace that surrounds each text node.
print([t.strip() for t in result])