I'm a Python novice and this is a fairly basic question, so please go easy on me. I need to crawl some data and have run into an XPath problem I'd like help with. In the HTML below, if a <span class="media-caption__text"></span> tag is present, I want its inner text; otherwise I want the inner text of <figcaption></figcaption> itself. In both cases <span class="off-screen"></span> must be filtered out.
The HTML code is as follows:
<figcaption class="media-caption">
<span class="off-screen">Image caption</span>
<span class="media-caption__text"> 纽约市是美国疫情的“震中”。 </span>
</figcaption>
or
<figcaption class="media-with-caption__caption">
<span class="off-screen"></span>
失业中的美国青年:泪水、恐惧与不安
</figcaption>
Why not handle this with code logic instead? It gets ugly in pure XPath:
//figcaption/span[@class="media-caption__text"][count(//figcaption/span[@class="media-caption__text"]) > 0]/text()[normalize-space()]|//figcaption[count(//figcaption/span[@class="media-caption__text"]) = 0]/text()[normalize-space()]
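The "code logic" alternative could look something like the sketch below. It walks each figcaption separately, prefers the media-caption__text span when one exists, and otherwise falls back to the figcaption's own direct text nodes (which leaves out anything inside the off-screen span). The helper name caption_text and the variable sample are made up for illustration, and the sketch assumes the span is a direct child of figcaption, as in the sample HTML.

```python
from lxml import etree


def caption_text(figcaption):
    """Return the caption of one <figcaption> element ('' if it has none)."""
    # Prefer the dedicated caption span when it exists.
    spans = figcaption.xpath('./span[@class="media-caption__text"]')
    if spans:
        return "".join(spans[0].itertext()).strip()
    # Otherwise use only the figcaption's direct text nodes, which
    # skips text nested inside <span class="off-screen">.
    return "".join(figcaption.xpath("./text()")).strip()


# Sample markup copied from the question (both variants in one document).
sample = '''
<figcaption class="media-caption">
<span class="off-screen">Image caption</span>
<span class="media-caption__text"> 纽约市是美国疫情的“震中”。 </span>
</figcaption>
<figcaption class="media-with-caption__caption">
<span class="off-screen"></span>
失业中的美国青年:泪水、恐惧与不安
</figcaption>
'''

html = etree.HTML(sample)
captions = [caption_text(fc) for fc in html.xpath("//figcaption")]
print(captions)
```

Because the decision is made per figcaption, this should also work when a page mixes both kinds of caption.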
from lxml import etree
text = '''
<figcaption class="media-caption">
<span class="off-screen">Image caption</span>
<span class="media-caption__text"> 纽约市是美国疫情的“震中”。 </span>
</figcaption>
<figcaption class="media-with-caption__caption">
<span class="off-screen"></span>
失业中的美国青年:泪水、恐惧与不安
</figcaption>
'''
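html = etree.HTML(text)
# Keep non-blank text nodes, but skip any that sit inside <span class="off-screen">
result = html.xpath('//figcaption//text()[normalize-space() and not(parent::span[@class="off-screen"])]')
print(result)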
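If you would rather keep the "span first, figcaption otherwise" rule in a single XPath, one option is to test for the span relative to each figcaption instead of using a document-wide count(). The absolute path inside count() in the earlier expression is evaluated against the whole page, so when both kinds of figcaption appear together the fallback branch selects nothing. The following is only a sketch under the same assumption as the sample HTML (the span is a direct child of figcaption); CAPTION_XPATH and extract_captions are hypothetical names.

```python
from lxml import etree

# Branch 1: text of the caption span, where a figcaption has one.
# Branch 2: direct text of figcaptions that do NOT have such a span.
CAPTION_XPATH = (
    '//figcaption/span[@class="media-caption__text"]/text()[normalize-space()]'
    ' | '
    '//figcaption[not(span[@class="media-caption__text"])]/text()[normalize-space()]'
)


def extract_captions(markup):
    """Parse an HTML string and return the stripped caption texts."""
    tree = etree.HTML(markup)
    return [t.strip() for t in tree.xpath(CAPTION_XPATH)]


# Sample markup copied from the question.
sample = '''
<figcaption class="media-caption">
<span class="off-screen">Image caption</span>
<span class="media-caption__text"> 纽约市是美国疫情的“震中”。 </span>
</figcaption>
<figcaption class="media-with-caption__caption">
<span class="off-screen"></span>
失业中的美国青年:泪水、恐惧与不安
</figcaption>
'''

print(extract_captions(sample))
```

The strip() call is there because the matched text nodes still carry their surrounding spaces and newlines.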