JSoup Quick Start

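Jsoup is a Java library for parsing HTML, much as an XML parser parses XML. It is built to work with real-world HTML, which is often malformed. Its selector syntax closely resembles jQuery's, and it is flexible and easy to use for getting at the data you need. This tutorial walks through a number of Jsoup examples.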

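What can you do with Jsoup?

Scrape and parse HTML from a URL, a file, or a string
Find and extract data using DOM traversal or CSS selectors
Manipulate HTML elements, attributes, and text
Clean user-submitted content against a safe whitelist to prevent XSS attacks
Output tidy HTML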

Installation: runtime dependency

You can add the Jsoup jar to your project with the Maven dependency below.

<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.10.2</version>
</dependency>

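The main Jsoup classes

The library contains many classes, but in most cases the three described below are the ones you need to know well.

1. The org.jsoup.Jsoup class

The Jsoup class is the entry point of any Jsoup program. It provides methods for loading and parsing HTML documents from a variety of sources.

Some important methods of the Jsoup class: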

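Method - Description

static Connection connect(String url) - Creates and returns a connection to the given URL.

static Document parse(File in, String charsetName) - Parses the contents of a file as HTML, reading it with the given character set.

static Document parse(String html) - Parses the given HTML string into a document.

static String clean(String bodyHtml, Whitelist whitelist) - Returns safe HTML from the input HTML by parsing it and filtering it through a whitelist of permitted tags and attributes.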

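2. The org.jsoup.nodes.Document class

This class represents an HTML document loaded through the Jsoup library. Use it for operations that apply to the HTML document as a whole.

For the important methods of the Document class, see - http://jsoup.org/apidocs/org/jsoup/nodes/Document.html .

3. The org.jsoup.nodes.Element class

An HTML element consists of a tag name, attributes, and child nodes. With the Element class you can extract data, traverse nodes, and manipulate HTML.

For the important methods of the Element class, see - http://jsoup.org/apidocs/org/jsoup/nodes/Element.html .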
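To make the three uses of Element a little more concrete, here is a minimal sketch that extracts text, walks child elements, and then modifies the tree. The class name ElementSketch, the helper childTexts, and the sample markup are assumptions made for illustration; they do not come from the Jsoup documentation.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name, used only for this illustration.
class ElementSketch {

    // Returns the text of every direct child of the element with the given id,
    // joined by commas; returns "" when no such element exists.
    static String childTexts(String html, String id) {
        Document document = Jsoup.parse(html);
        Element parent = document.getElementById(id);
        if (parent == null) {
            return "";
        }
        StringBuilder result = new StringBuilder();
        for (Element child : parent.children()) {
            if (result.length() > 0) {
                result.append(",");
            }
            result.append(child.text());
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String html = "<ul id=\"menu\"><li>Home</li><li>Docs</li></ul>";

        // Extraction and traversal: prints "Home,Docs".
        System.out.println(childTexts(html, "menu"));

        // Manipulation: append a new list item and add a CSS class.
        Document document = Jsoup.parse(html);
        Element menu = document.getElementById("menu");
        menu.appendElement("li").text("About");
        menu.addClass("nav");
        System.out.println(menu.children().size()); // now 3 items
        System.out.println(menu.className());       // "nav"
    }
}
```

The null check matters because getElementById returns null when no element carries the requested id.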

Examples

Now let's look at some examples of working with HTML documents through the Jsoup API.

1. Loading a document from a URL

Use the Jsoup.connect() method to load HTML from a URL.

try
{
    Document document = Jsoup.connect("http://www.yiibai.com").get();
    System.out.println(document.title());
} 
catch (IOException e) 
{
    e.printStackTrace();
}

2. Loading a document from a file

Use the Jsoup.parse() method to load HTML from a file.

try
{
    Document document = Jsoup.parse( new File( "D:/temp/index.html" ) , "utf-8" );
    System.out.println(document.title());
} 
catch (IOException e) 
{
    e.printStackTrace();
}

3. Loading a document from a String

Use the Jsoup.parse() method to load HTML from a string.

String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
// Jsoup.parse(String) does no I/O and declares no checked exception,
// so no try/catch is needed here; catching IOException would not compile.
Document document = Jsoup.parse(html);
System.out.println(document.title());

4. Getting the title from HTML

As the examples above already show, call the document.title() method to get the title of an HTML page.

try
{
    Document document = Jsoup.parse( new File("C:/Users/xyz/Desktop/yiibai-index.html"), "utf-8");
    System.out.println(document.title());
} 
catch (IOException e) 
{
    e.printStackTrace();
}

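5. Getting the favicon of an HTML page

Assuming the favicon is the first .ico or .png link in the <head> section of the HTML document, you can use the code below. If no such link exists, it falls back to a meta tag with itemprop=image.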

String favImage = "Not Found";
try {
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    Element element = document.head().select("link[href~=.*\\.(ico|png)]").first();
    if (element == null) 
    {
        element = document.head().select("meta[itemprop=image]").first();
        if (element != null) 
        {
            favImage = element.attr("content");
        }
    } 
    else
    {
        favImage = element.attr("href");
    }
} 
catch (IOException e) 
{
    e.printStackTrace();
}
System.out.println(favImage);

6. Getting all links in an HTML page

To get all the links in a web page, use the following code.

try
{
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    Elements links = document.select("a[href]");  
    for (Element link : links) 
    {
         System.out.println("link : " + link.attr("href"));  
         System.out.println("text : " + link.text());  
    }
} 
catch (IOException e) 
{
    e.printStackTrace();
}
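Note that link.attr("href") returns the attribute exactly as written in the page, so relative links stay relative. As a minimal sketch, assuming the HTML is parsed with a base URI through Jsoup.parse(String html, String baseUri), absUrl("href") can resolve them to absolute URLs. The class name AbsoluteLinksSketch, the helper absoluteLinks, and the sample markup are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name, used only for this illustration.
class AbsoluteLinksSketch {

    // Collects the href of every link, resolved against the given base URI.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            // absUrl resolves relative values against the document's base URI.
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        String html = "<a href=\"/about\">About</a>"
                    + "<a href=\"http://jsoup.org/\">Jsoup</a>";
        for (String url : absoluteLinks(html, "http://www.yiibai.com/")) {
            System.out.println(url);
        }
    }
}
```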

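7. Getting all images in an HTML page

To get the images shown in a web page (those whose src contains a .png, .jpg, .jpeg, or .gif extension), use the following code.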

try
{
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
    for (Element image : images) 
    {
        System.out.println("src : " + image.attr("src"));
        System.out.println("height : " + image.attr("height"));
        System.out.println("width : " + image.attr("width"));
        System.out.println("alt : " + image.attr("alt"));
    }
} 
catch (IOException e) 
{
    e.printStackTrace();
}

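8. Getting the meta information of a page

Meta information is what search engines such as Google use to determine what a page is about when indexing it. It is stored in tags in the HEAD section of the HTML page. To get the meta information of a web page, use the code below. The code assumes both tags are present: get(0) throws IndexOutOfBoundsException and first() returns null when nothing matches.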

try
{
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");

    String description = document.select("meta[name=description]").get(0).attr("content");  
    System.out.println("Meta description : " + description);  

    String keywords = document.select("meta[name=keywords]").first().attr("content");  
    System.out.println("Meta keyword : " + keywords);  
} 
catch (IOException e) 
{
    e.printStackTrace();
}
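Many pages omit one of these meta tags, so a defensive variant can be useful. The following is a minimal sketch, not part of the Jsoup API: the class MetaSketch, the helper metaContent, and its fallback parameter are hypothetical names introduced here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class and method names, used only for this illustration.
class MetaSketch {

    // Returns the content attribute of <meta name="..."> or the fallback
    // when the page has no such tag.
    static String metaContent(String html, String name, String fallback) {
        Document document = Jsoup.parse(html);
        // first() returns null when the selector matches nothing.
        Element meta = document.select("meta[name=" + name + "]").first();
        return (meta == null) ? fallback : meta.attr("content");
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch; it has no keywords tag.
        String html = "<html><head>"
                    + "<meta name=\"description\" content=\"A demo page\">"
                    + "</head><body></body></html>";
        System.out.println("Meta description : " + metaContent(html, "description", "Not Found"));
        System.out.println("Meta keyword : " + metaContent(html, "keywords", "Not Found"));
    }
}
```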

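9. Getting form attributes in an HTML page

Getting the input elements of a form in a web page is straightforward. Find the FORM element by its unique ID, then find all the INPUT elements inside that form.

try {
    Document doc = Jsoup.parse(new File("c:/temp/yiibai-index.html"), "utf-8");
    Element formElement = doc.getElementById("loginForm");
    // getElementById returns null when no element has this id
    if (formElement != null) {
        Elements inputElements = formElement.getElementsByTag("input");
        for (Element inputElement : inputElements) {
            String key = inputElement.attr("name");
            String value = inputElement.attr("value");
            System.out.println("Param name: " + key + " \nParam value: " + value);
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}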
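If the form values are needed later, for example to resubmit them, collecting them into a map may be more convenient than printing them. This is a minimal sketch under that assumption; the class FormSketch, the helper formParams, and the sample form are hypothetical and exist only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name, used only for this illustration.
class FormSketch {

    // Maps each named input of the form to its value, in document order.
    // Returns an empty map when no element has the given id.
    static Map<String, String> formParams(String html, String formId) {
        Document doc = Jsoup.parse(html);
        Element form = doc.getElementById(formId);
        Map<String, String> params = new LinkedHashMap<>();
        if (form == null) {
            return params;
        }
        for (Element input : form.getElementsByTag("input")) {
            String name = input.attr("name");
            // Skip inputs without a name, such as a bare submit button.
            if (!name.isEmpty()) {
                params.put(name, input.attr("value"));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        // Sample form invented for this sketch.
        String html = "<form id=\"loginForm\">"
                    + "<input name=\"user\" value=\"alice\">"
                    + "<input type=\"submit\" value=\"Go\">"
                    + "</form>";
        System.out.println(formParams(html, "loginForm"));
    }
}
```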

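10. Updating the attributes or content of elements

Once you have found the elements you want with the methods above, you can use the Jsoup API to update their attributes or innerHTML. For example, suppose you want to set rel="nofollow" on every link in the document.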

try
{
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai.com.html"), "utf-8");
    Elements links = document.select("a[href]");  
    links.attr("rel", "nofollow");
} 
catch (IOException e) 
{
    e.printStackTrace();
}
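The snippet above changes the in-memory document but prints nothing. As a minimal sketch of how the result could be inspected, the helper below applies the same update to an HTML string and returns the serialized body. The class NofollowSketch, the helper addNofollow, and the sample markup are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical class name, used only for this illustration.
class NofollowSketch {

    // Sets rel="nofollow" on every <a> that has an href and returns
    // the HTML of the document body after the change.
    static String addNofollow(String html) {
        Document document = Jsoup.parse(html);
        // Elements.attr(key, value) sets the attribute on all matched elements.
        document.select("a[href]").attr("rel", "nofollow");
        return document.body().html();
    }

    public static void main(String[] args) {
        // The first link has an href and is updated; the named anchor is not.
        String html = "<a href=\"http://www.yiibai.com/\">Home</a>"
                    + "<a name=\"top\">Top</a>";
        System.out.println(addNofollow(html));
    }
}
```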

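11. Sanitizing untrusted HTML (to prevent XSS)

Suppose your application displays HTML fragments submitted by users, for example HTML entered in a comment box. Displaying that HTML directly can cause very serious problems: a user can embed a malicious script that redirects other users to a malicious site.

To clean such HTML, Jsoup provides the Jsoup.clean() method. It takes a string of HTML and returns cleaned HTML. To do this, Jsoup uses a whitelist filter, which parses the input HTML (in a safe, sandboxed environment) and then walks the parse tree, letting only known-safe tags and attributes (and attribute values) through to the cleaned output. In jsoup 1.14.2 and later, the Whitelist class is named Safelist.

It does not use regular expressions, which are unsuitable for this task.

The cleaner is useful not only for avoiding XSS but also for limiting the range of elements users can provide: you may accept text-level elements such as a and strong, but not structural elements such as div or table.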

String dirtyHTML = "<p><a href='http://www.yiibai.com/' onclick='sendCookiesToMe()'>Link</a></p>";

String cleanHTML = Jsoup.clean(dirtyHTML, Whitelist.basic());

System.out.println(cleanHTML);

Running the code produces the following output -

<p><a href="http://www.yiibai.com/" rel="nofollow">Link</a></p>
