
How can a crawler fetch a formatted article and save it to Word?

user282 edited on Fri, 29 Jul 2022

Problems encountered

I need to crawl articles from a site similar to Baidu Library. The articles are formatted and contain tables and other elements, but I can only extract plain text.


My thoughts

Reading the CSS styles to work out the format of every line is too much trouble.


Is there a better way? Is there a tool that converts HTML to Word?

2 Replies
commented on Sat, 30 Jul 2022

Does it have to be Word? A headless browser can save the page directly as a PDF.
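
As a rough sketch of this suggestion: Chromium-family browsers accept the --headless and --print-to-pdf command-line flags, so the page can be printed from a script. The executable name "chromium" and the helper names below are assumptions for illustration. Adjust the name to whatever is installed (for example "google-chrome" or "chromium-browser").

```python
import shutil
import subprocess


def build_pdf_command(url, output_path, browser="chromium"):
    """Build the argv list that prints `url` to a PDF in headless mode.

    `browser` is an assumed executable name; change it to match your system.
    """
    return [
        browser,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={output_path}",
        url,
    ]


def save_page_as_pdf(url, output_path, browser="chromium"):
    """Run the headless browser and write the rendered page to `output_path`."""
    if shutil.which(browser) is None:
        raise FileNotFoundError(f"browser executable not found: {browser}")
    # check=True raises CalledProcessError if the browser exits with an error.
    subprocess.run(build_pdf_command(url, output_path, browser), check=True)


# Example call (needs a Chromium-family browser and network access):
# save_page_as_pdf("https://example.com/article", "article.pdf")
```

Because the browser renders the page before printing, the layout, tables, and images end up in the PDF without any per-tag handling. Pages that load their content with JavaScript after the initial load may need extra waiting, which this sketch does not handle.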

commented on Sat, 30 Jul 2022

First of all, the Baidu Library content with tables that the questioner mentions does not contain real tables. The text is still displayed with <p> tags, and the table is just a background image.

Every piece of text you grab corresponds to an HTML tag, so you can choose the formatting from the tag when you write to Word. For example:

<p></p>: a paragraph. You can apply the default indentation when writing it to Word.
<h1></h1>: a level-1 heading. You can set bold and the font size when writing it to Word.

And so on for the other tags. You can wrap the handling for each tag in advance, then apply the matching format as you write the text. I once scraped content this way (shown in the attached picture) and kept both the formatting and the images, with each image in its original position.
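
The tag-to-format idea can be sketched roughly as follows. This is not the replier's actual code. It assumes the third-party python-docx package for the Word output, and the names TAG_LEVELS, BlockCollector, parse_blocks, and write_docx are hypothetical. Only <h1>, <h2>, and <p> are handled, and the 24 pt indent is an arbitrary choice.

```python
from html.parser import HTMLParser

# Assumed mapping: tag name -> heading level, or None for a plain paragraph.
TAG_LEVELS = {"h1": 1, "h2": 2, "p": None}


class BlockCollector(HTMLParser):
    """Collect (tag, text) pairs for the block tags listed in TAG_LEVELS."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._tag = None    # block tag currently open, if any
        self._parts = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in TAG_LEVELS:
            self._tag = tag
            self._parts = []

    def handle_data(self, data):
        if self._tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = "".join(self._parts).strip()
            if text:
                self.blocks.append((tag, text))
            self._tag = None


def parse_blocks(html):
    """Return the document as a list of (tag, text) blocks in page order."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


def write_docx(blocks, path):
    """Write the blocks to a .docx file, choosing the format from the tag."""
    from docx import Document      # python-docx, imported lazily
    from docx.shared import Pt

    doc = Document()
    for tag, text in blocks:
        level = TAG_LEVELS[tag]
        if level is not None:
            doc.add_heading(text, level=level)
        else:
            para = doc.add_paragraph(text)
            para.paragraph_format.first_line_indent = Pt(24)
    doc.save(path)
```

For example, parse_blocks("<h1>Title</h1><p>Body text.</p>") yields one heading block and one paragraph block, which write_docx then turns into a Word heading and an indented paragraph. Images could be handled in the same loop by downloading each <img> source and calling python-docx's add_picture at that position, which keeps pictures in document order.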