JSoup is an HTML parser for Java. It can parse a document directly from a URL or from an HTML string, and it provides a very convenient API for extracting and manipulating data via DOM traversal, CSS selectors, and jQuery-like methods.
This post is a quick look at JSoup: just a brief record of how to use it. On the whole, it is very simple.
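As a quick taste of that jQuery-style API, here is a minimal sketch (the HTML string, class names, and the `JsoupHello` class below are made up for illustration) that parses an in-memory fragment and extracts text and an attribute with a CSS selector:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupHello {
    public static void main(String[] args) {
        // Parse an HTML string (no network access) into a Document
        Document doc = Jsoup.parse("<div id='box'><a class='link' href='/p/1'>hello</a></div>");
        // CSS selector, much like jQuery's $("div#box > a.link")
        System.out.println(doc.select("div#box > a.link").text());        // hello
        System.out.println(doc.select("div#box > a.link").attr("href"));  // /p/1
    }
}
```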
Dependency Configuration
```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
```
Structure of the Crawled Page's Source
Below is a rough outline of the crawled page's HTML source, so you can compare it against the code:
```html
<div id="post_list" class="post-list">
    <article class="post-item">
        <section class="post-item-body">
            <div class="post-item-text">
                <a class="post-item-title" href="" target="_blank"></a>
                <p class="post-item-summary">
                    <a href=""><img src="" class="avatar" alt="头像"></a>
                </p>
            </div>
        </section>
    </article>
    <article class="post-item">
        <section class="post-item-body">
            <div class="post-item-text">
                <a class="post-item-title" href="" target="_blank"></a>
                <p class="post-item-summary">
                    <a href=""><img src="" class="avatar" alt="头像"></a>
                </p>
            </div>
        </section>
    </article>
    <article class="post-item">...</article>
    <article class="post-item">...</article>
    <article class="post-item">...</article>
</div>
```
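If you want to experiment against this markup without hitting the live site, jsoup can also parse a saved local copy of the page. A small sketch, assuming you saved the page as `page.html` (a hypothetical file name):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;

public class LocalParseDemo {
    public static void main(String[] args) throws Exception {
        // "page.html" is a hypothetical local copy of the markup above
        Document doc = Jsoup.parse(new File("page.html"), "UTF-8");
        // Count the <article class="post-item"> entries in the saved page
        System.out.println(doc.select("article.post-item").size());
    }
}
```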
1. Method One
```java
package com.tothefor.crawer;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class TestWebMagic {
    private static Document doc;

    public static void main(String[] args) {
        try {
            // Fetch and parse the page over HTTP
            doc = Jsoup.connect("https://www.cnblogs.com/").get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        blogs();
    }

    public static void blogs() {
        // Each <article class="post-item"> under <div id="post_list"> is one post
        Elements tests = doc.select("div#post_list>article.post-item");
        for (Element test : tests) {
            // Title text from the <a class="post-item-title"> element
            String txt = test.select("section.post-item-body>div.post-item-text>a.post-item-title").text();
            System.out.println("Title: " + txt);
            // href of the anchor inside the summary (the avatar/author link in this markup)
            String href = test.select("section.post-item-body>div.post-item-text>p.post-item-summary>a").attr("href");
            System.out.println("Link: " + href);
            System.out.println("---------------------------");
        }
    }
}
```
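One detail worth knowing here: `attr("href")` returns the attribute value exactly as written in the page, which may be a relative URL. jsoup's built-in `abs:` attribute prefix resolves it against the document's base URI instead. A minimal sketch of that variant (same page and selectors as above; the `AbsHrefDemo` class name is made up):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsHrefDemo {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.cnblogs.com/").get();
        for (Element a : doc.select("a.post-item-title")) {
            // "abs:" resolves a relative href (e.g. "/p/123") to a full URL
            System.out.println(a.attr("abs:href"));
        }
    }
}
```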
2. Method Two
```java
package cn.itcast.crawer;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.net.URL;

public class StudyJsoup {
    public static void main(String[] args) throws Exception {
        String url = "https://www.cnblogs.com/";
        // Fetch and parse the URL with a 13-second timeout
        Document document = Jsoup.parse(new URL(url), 13000);
        Element element = document.getElementById("post_list");
        System.out.println(element.html());
        System.out.println("============================================================");
        // Walk the DOM: every <article> inside #post_list is one post
        Elements elements = element.getElementsByTag("article");
        for (Element el : elements) {
            System.out.println("------------------------------------------");
            // eq(0) keeps only the first matched element
            String title = el.getElementsByClass("post-item-title").eq(0).text();
            System.out.println(title);
            String link = el.getElementsByClass("post-item-title").eq(0).attr("href");
            System.out.println(link);
        }
    }
}
```
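Both methods do the same job: method one leans on CSS selectors, while method two walks the DOM with `getElementById`/`getElementsByTag`/`getElementsByClass`. The selector style is usually shorter. As a sketch, the loop in method two could be collapsed into a single `select` call (the `SelectorOnly` class name is made up for this example):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorOnly {
    public static void main(String[] args) throws Exception {
        // timeout(13000) mirrors the 13-second timeout used in method two
        Document doc = Jsoup.connect("https://www.cnblogs.com/").timeout(13000).get();
        // One selector replaces getElementById + getElementsByTag + getElementsByClass
        for (Element title : doc.select("#post_list article.post-item a.post-item-title")) {
            System.out.println(title.text() + " -> " + title.attr("href"));
        }
    }
}
```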