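Summary

This article records and organizes what I learned while studying the official documentation and the source code.

When a crawler framework scrapes a target site, its most important job is parsing the content of the target pages. Colly mainly supports callback-based parsing for two markup languages, and it uses a different parsing library for each: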

Callback	Markup language	Parsing library	Selector type
OnHTML	HTML	goquery	CSS selector
OnXML	HTML	htmlquery	XPath selector
OnXML	XML	xmlquery	XPath selector

package colly

import (
	...
	"github.com/PuerkitoBio/goquery"
	"github.com/antchfx/htmlquery"
	"github.com/antchfx/xmlquery"
	...
)

`goquery`

goquery is a Go library whose syntax and features resemble jQuery. It is built on the HTML parser net/html and the CSS selector library cascadia.

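Because the net/html parser returns nodes rather than a full-featured DOM tree, goquery leaves out jQuery's stateful manipulation functions (such as height(), css(), and detach()).
Because the net/html parser requires UTF-8 input, so does goquery: the caller must make sure the HTML being parsed is UTF-8 encoded.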

Colly's OnHTML callback is built on goquery.

Official documentation

Installation

$ go get github.com/PuerkitoBio/goquery

# (Optional) run the unit tests:
$ cd $GOPATH/src/github.com/PuerkitoBio/goquery
$ go test

# (Optional) run the benchmarks (warning: this takes a few minutes):
$ cd $GOPATH/src/github.com/PuerkitoBio/goquery
$ go test -bench=".*"

Basic usage

package main

import (
  "fmt"
  "log"
  "net/http"

  "github.com/PuerkitoBio/goquery"
)

func ExampleScrape() {
  // Request the HTML page.
  res, err := http.Get("http://metalsucks.net")
  if err != nil {
    log.Fatal(err)
  }
  defer res.Body.Close()
  if res.StatusCode != 200 {
    log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
  }

  // Load the HTML document
  doc, err := goquery.NewDocumentFromReader(res.Body)
  if err != nil {
    log.Fatal(err)
  }

  // Find the review items
  doc.Find(".left-content article .post-title").Each(func(i int, s *goquery.Selection) {
    // For each item found, get the title
    title := s.Find("a").Text()
    fmt.Printf("Review %d: %s\n", i, title)
  })
}

func main() {
  ExampleScrape()
}

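Selector syntax

// Find by tag name
doc.Find("a")
// Find by class
doc.Find(".post-title-link")
// Find by id
doc.Find("#link")
// Find by attribute
doc.Find("a[href]")
doc.Find("a[href='/2022/06/09/go/go-colly/']")
// Combine conditions on the same element (no spaces between them)
doc.Find("a.post-title-link#link[href]")
// Find descendants (space-separated)
doc.Find("header a")
// Find direct children
doc.Find("header > h1 > a")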

Colly source code built on goquery

func (c *Collector) handleOnHTML(resp *Response) error {
	if len(c.htmlCallbacks) == 0 || !strings.Contains(strings.ToLower(resp.Headers.Get("Content-Type")), "html") {
		return nil
	}
	doc, err := goquery.NewDocumentFromReader(bytes.NewBuffer(resp.Body))
	if err != nil {
		return err
	}
	if href, found := doc.Find("base[href]").Attr("href"); found {
		resp.Request.baseURL, _ = resp.Request.URL.Parse(href)
	}
	for _, cc := range c.htmlCallbacks {
		i := 0
		doc.Find(cc.Selector).Each(func(_ int, s *goquery.Selection) {
			for _, n := range s.Nodes {
				e := NewHTMLElementFromSelectionNode(resp, s, n, i)
				i++
				if c.debugger != nil {
					c.debugger.Event(createEvent("html", resp.Request.ID, c.ID, map[string]string{
						"selector": cc.Selector,
						"url":      resp.Request.URL.String(),
					}))
				}
				cc.Function(e)
			}
		})
	}
	return nil
}
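
Two decisions in the function above are easy to overlook: the callbacks only run when the Content-Type header contains "html" (case-insensitive), and a <base href> in the page changes the URL that relative links resolve against. The sketch below reproduces both with the standard library only. looksLikeHTML and resolve are hypothetical helpers written for this illustration, and Colly's own URL resolution may differ in detail.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
	"strings"
)

// looksLikeHTML mirrors the Content-Type test in handleOnHTML:
// a case-insensitive substring match on "html".
func looksLikeHTML(contentType string) bool {
	return strings.Contains(strings.ToLower(contentType), "html")
}

// resolve (hypothetical helper) turns a possibly relative link into an
// absolute URL, relative to base.
func resolve(base, link string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u, err := b.Parse(link)
	if err != nil {
		return "", err
	}
	return u.String(), nil
}

func main() {
	fmt.Println(looksLikeHTML("text/HTML; charset=utf-8"))
	fmt.Println(looksLikeHTML("application/json"))

	// Without <base href>, links resolve against the page URL.
	abs, err := resolve("https://example.com/blog/", "page/2/")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(abs)

	// With <base href="https://cdn.example.com/assets/">, the same kind of
	// relative link resolves against the base URL instead.
	abs, err = resolve("https://cdn.example.com/assets/", "img.png")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(abs)
}
```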

Using it in Colly

package main

import (
	"fmt"
	"github.com/gocolly/colly/v2"
	"strings"
)

func main() {
	url := "https://c.isme.pub"

	c := colly.NewCollector(
		colly.MaxDepth(1),
	)
	
	// Extract article URLs
	c.OnHTML("h1 a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		if strings.HasPrefix(e.Request.AbsoluteURL(link), url) {
			text := strings.ReplaceAll(strings.ReplaceAll(e.Text, "\n", ""), " ", "")
			fmt.Printf("Link found: %q -> %s\n", text, e.Request.AbsoluteURL(link))
		}
	})
	// Follow pagination
	c.OnHTML("nav a[href]", func(e *colly.HTMLElement) {
		next := e.Attr("rel")
		if next == "next" {
			link := e.Attr("href")
			if strings.HasPrefix(e.Request.AbsoluteURL(link), url) {
				fmt.Printf("Page found: %s\n", e.Request.AbsoluteURL(link))
				c.Visit(e.Request.AbsoluteURL(link))
			}
		}
	})

	c.Visit(url)
}

`htmlquery`

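htmlquery is an XPath query package for HTML. It lets you extract data from HTML documents, or evaluate expressions against them, using XPath.

htmlquery has a built-in LRU-based query object cache that stores recently used XPath query strings. With the cache enabled, an XPath expression does not have to be recompiled on every query.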

XPath (1.0/2.0) syntax

XPath query packages for Go

Name	Description
htmlquery	XPath query package for HTML documents
xmlquery	XPath query package for XML documents
jsonquery	XPath query package for JSON documents

Installation

go get github.com/antchfx/htmlquery

Common functions

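// Parse an HTML document from a reader
func Parse(r io.Reader) (*html.Node, error)
// Find all nodes matching the XPath expression (panics if the expression is invalid)
func Find(top *html.Node, expr string) []*html.Node
// Find the first node matching the XPath expression
func FindOne(top *html.Node, expr string) *html.Node
// Return the text between the node's start and end tags
func InnerText(n *html.Node) string
// Return the value of the named attribute
func SelectAttr(n *html.Node, name string) (val string)
// Return the node's HTML including tags; self controls whether the node's own tag is included
func OutputHTML(n *html.Node, self bool) string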

Basic usage

// Find all matching elements
nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
	panic(`not a valid XPath expression.`)
}

// Load HTML from a URL
doc, err := htmlquery.LoadURL("http://example.com/")

// Load HTML from a file
filePath := "/home/user/sample.html"
doc, err := htmlquery.LoadDoc(filePath)

// Load HTML from a string
s := `<html>....</html>`
doc, err := htmlquery.Parse(strings.NewReader(s))

// Find by tag name: all a elements
list := htmlquery.Find(doc, "//a")

// Find by attribute: all a elements that have an href
list := htmlquery.Find(doc, "//a[@href]")

// Find all a elements that have an href and return only the href values
list := htmlquery.Find(doc, "//a/@href")
for _, n := range list {
	fmt.Println(htmlquery.SelectAttr(n, "href")) // output @href value
}

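// Find the first a element that is the third a child of its parent
a := htmlquery.FindOne(doc, "//a[3]")

// Find an a element, then its descendant img element, and print the src attribute
a := htmlquery.FindOne(doc, "//a")
img := htmlquery.FindOne(a, "//img")
fmt.Println(htmlquery.SelectAttr(img, "src")) // output @src value

// Count all img elements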
expr, _ := xpath.Compile("count(//img)")
v := expr.Evaluate(htmlquery.CreateXPathNavigator(doc)).(float64)
fmt.Printf("total count is %f", v)

Official example

func main() {
	doc, err := htmlquery.LoadURL("https://www.bing.com/search?q=golang")
	if err != nil {
		panic(err)
	}
	// Find all news item.
	list, err := htmlquery.QueryAll(doc, "//ol/li")
	if err != nil {
		panic(err)
	}
	for i, n := range list {
		a := htmlquery.FindOne(n, "//a")
		if a != nil {
			fmt.Printf("%d %s(%s)\n", i, htmlquery.InnerText(a), htmlquery.SelectAttr(a, "href"))
		}
	}
}

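FAQ

Find() vs QueryAll(): which is better?
- Find and QueryAll do the same thing: both search for all matching HTML nodes. Find panics if you give it an invalid XPath query, whereas QueryAll returns an error.
Can I save my query expression object for the next query?
- Yes. The QuerySelector and QuerySelectorAll methods accept a query expression object. Caching (or reusing) a query expression object avoids recompiling the XPath expression and improves query performance.
How do I disable caching?
- Set htmlquery.DisableSelectorCache = true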

Colly source code built on htmlquery

func (c *Collector) handleOnXML(resp *Response) error {
	if len(c.xmlCallbacks) == 0 {
		return nil
	}
	contentType := strings.ToLower(resp.Headers.Get("Content-Type"))
	isXMLFile := strings.HasSuffix(strings.ToLower(resp.Request.URL.Path), ".xml") || strings.HasSuffix(strings.ToLower(resp.Request.URL.Path), ".xml.gz")
	if !strings.Contains(contentType, "html") && (!strings.Contains(contentType, "xml") && !isXMLFile) {
		return nil
	}

	if strings.Contains(contentType, "html") {
		doc, err := htmlquery.Parse(bytes.NewBuffer(resp.Body))
		if err != nil {
			return err
		}
		if e := htmlquery.FindOne(doc, "//base"); e != nil {
			for _, a := range e.Attr {
				if a.Key == "href" {
					resp.Request.baseURL, _ = resp.Request.URL.Parse(a.Val)
					break
				}
			}
		}

		for _, cc := range c.xmlCallbacks {
			for _, n := range htmlquery.Find(doc, cc.Query) {
				e := NewXMLElementFromHTMLNode(resp, n)
				if c.debugger != nil {
					c.debugger.Event(createEvent("xml", resp.Request.ID, c.ID, map[string]string{
						"selector": cc.Query,
						"url":      resp.Request.URL.String(),
					}))
				}
				cc.Function(e)
			}
		}
	} else if strings.Contains(contentType, "xml") || isXMLFile {
		doc, err := xmlquery.Parse(bytes.NewBuffer(resp.Body))
		if err != nil {
			return err
		}

		for _, cc := range c.xmlCallbacks {
			xmlquery.FindEach(doc, cc.Query, func(i int, n *xmlquery.Node) {
				e := NewXMLElementFromXMLNode(resp, n)
				if c.debugger != nil {
					c.debugger.Event(createEvent("xml", resp.Request.ID, c.ID, map[string]string{
						"selector": cc.Query,
						"url":      resp.Request.URL.String(),
					}))
				}
				cc.Function(e)
			})
		}
	}
	return nil
}
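
The branching in handleOnXML decides which parser sees the response body. The sketch below restates that decision with the standard library only, so it can be checked in isolation. parserFor is a hypothetical helper written for this illustration and is not part of Colly.

```go
package main

import (
	"fmt"
	"strings"
)

// parserFor (hypothetical helper) mirrors the branching in handleOnXML.
// It returns "htmlquery", "xmlquery", or "" when the callbacks are skipped.
func parserFor(contentType, path string) string {
	ct := strings.ToLower(contentType)
	p := strings.ToLower(path)
	isXMLFile := strings.HasSuffix(p, ".xml") || strings.HasSuffix(p, ".xml.gz")
	switch {
	case strings.Contains(ct, "html"):
		// HTML wins even if the Content-Type also mentions xml (e.g. xhtml+xml).
		return "htmlquery"
	case strings.Contains(ct, "xml") || isXMLFile:
		return "xmlquery"
	default:
		return ""
	}
}

func main() {
	fmt.Println(parserFor("text/html; charset=utf-8", "/"))
	fmt.Println(parserFor("application/xml", "/feed"))
	fmt.Println(parserFor("application/octet-stream", "/sitemap.xml.gz"))
	fmt.Printf("%q\n", parserFor("application/json", "/api"))
}
```

Note that the file-extension check only matters when the Content-Type mentions neither "html" nor "xml", for example a sitemap served as a generic binary type.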

Using it in Colly

package main

import (
	"fmt"
	"github.com/gocolly/colly/v2"
	"strings"
)

func main() {
	url := "https://c.isme.pub"

	c := colly.NewCollector(
		colly.MaxDepth(1),
	)

	// Extract article URLs
	c.OnXML("//*[@id=\"posts\"]/article/div/header/h1/a", func(e *colly.XMLElement) {
		link := e.Attr("href")
		if strings.HasPrefix(e.Request.AbsoluteURL(link), url) {
			text := strings.ReplaceAll(strings.ReplaceAll(e.Text, "\n", ""), " ", "")
			fmt.Printf("Link found: %q -> %s\n", text, e.Request.AbsoluteURL(link))
		}
	})
	// Follow pagination
	c.OnXML("//*[@id=\"content\"]/nav/a", func(e *colly.XMLElement) {
		next := e.Attr("rel")
		if next == "next" {
			link := e.Attr("href")
			if strings.HasPrefix(e.Request.AbsoluteURL(link), url) {
				fmt.Printf("Page found: %s\n", e.Request.AbsoluteURL(link))
				c.Visit(e.Request.AbsoluteURL(link))
			}
		}
	})

	c.Visit(url)
}

XPath syntax

xpath

Basic syntax

node : Selects all child elements with nodeName of node.
* : Selects all child elements.
@attr : Selects the attribute attr.
@* : Selects all attributes.
node() : Matches an org.w3c.dom.Node.
text() : Matches a org.w3c.dom.Text node.
comment() : Matches a comment.
. : Selects the current node.
.. : Selects the parent of current node.
/ : Selects the document node.
a[expr] : Select only those nodes matching a which also satisfy the expression expr.
a[n] : Selects the nth node matching a. When a filter’s expression is a number, XPath selects based on position.
a/b : For each node matching a, add the nodes matching b to the result.
a//b : For each node matching a, add the descendant nodes matching b to the result.
//b : Returns elements in the entire document matching b.
a|b : All nodes matching a or b, union operation(not boolean or).
(a, b, c) : Evaluates each of its operands and concatenates the resulting sequences, in order, into a single result sequence
(a/b) : Selects all matches nodes as grouping set.

Axes

child::* : The child axis selects children of the current node.
descendant::* : The descendant axis selects descendants of the current node. It is roughly equivalent to ‘//’.
descendant-or-self::* : Selects descendants including the current node.
attribute::* : Selects attributes of the current element. It is equivalent to @*
following-sibling::* : Selects siblings after the current node.
preceding-sibling::* : Selects siblings before the current node.
following::* : Selects all nodes after the current node in document order, excluding descendants.
preceding::* : Selects all nodes before the current node in document order, excluding ancestors.
parent::* : Selects the parent if it matches. The ‘..’ pattern from the core is equivalent to ‘parent::node()’.
ancestor::* : Selects matching ancestors.
ancestor-or-self::* : Selects ancestors including the current node.
self::* : Selects the current node. ‘.’ is equivalent to ‘self::node()’.

Expressions

gxpath supports three types: number, boolean, and string.

path : Selects nodes based on the path.
a = b : Standard comparisons.
- a = b True if a equals b.
- a != b True if a is not equal to b.
- a < b True if a is less than b.
- a <= b True if a is less than or equal to b.
- a > b True if a is greater than b.
- a >= b True if a is greater than or equal to b.
a + b : Arithmetic expressions.
- - a Unary minus
- a + b Add
- a - b Subtract
- a * b Multiply
- a div b Divide
- a mod b Floating point mod, like Java.
a or b : Boolean or operation.
a and b : Boolean and operation.
(expr) : Parenthesized expressions.
fun(arg1, ..., argn) : Function calls:

Function	Supported
`boolean()`	✓
`ceiling()`	✓
`choose()`	✗
`concat()`	✓
`contains()`	✓
`count()`	✓
`current()`	✗
`document()`	✗
`element-available()`	✗
`ends-with()`	✓
`false()`	✓
`floor()`	✓
`format-number()`	✗
`function-available()`	✗
`generate-id()`	✗
`id()`	✗
`key()`	✗
`lang()`	✗
`last()`	✓
`local-name()`	✓
`matches()`	✓
`name()`	✓
`namespace-uri()`	✓
`normalize-space()`	✓
`not()`	✓
`number()`	✓
`position()`	✓
`replace()`	✓
`reverse()`	✓
`round()`	✓
`starts-with()`	✓
`string()`	✓
`string-length()`	✓
`substring()`	✓
`substring-after()`	✓
`substring-before()`	✓
`sum()`	✓
`system-property()`	✗
`translate()`	✓
`true()`	✓
`unparsed-entity-url()`	✗

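Differences and relationship between HTML and XML

	HTML	XML
Definition	HTML stands for HyperText Markup Language. It is a markup language made up of a set of tags; these tags give documents on the network a uniform format and link scattered Internet resources into a logical whole.	XML stands for Extensible Markup Language, a subset of the Standard Generalized Markup Language. It is a markup language for giving electronic documents structure.
Extensibility	HTML does not let users define their own tags or attributes.	In XML, users can define new tag and attribute names as needed to describe data more meaningfully.
Structure	HTML does not support deep structural description.	XML documents can be nested to arbitrary depth and can represent object-oriented hierarchies.
Validation	HTML provides no schema files that let applications validate the structure of an HTML document.	An XML document can include a grammar description so that applications can validate its structure.
Purpose	HTML is designed mainly to display data and to display it well; it cannot describe data, is hard to read, is slow to search, and so on.	XML was originally designed for EDI (Electronic Data Interchange), more precisely to give EDI a standard data format for transmitting data.
Case sensitivity	HTML is case-insensitive.	XML is case-sensitive.
Omitting tags	HTML sometimes allows end tags to be omitted.	XML cannot omit any tag; elements must be strictly nested with matching start and end tags.
Self-closing tags		XML allows self-closing tags for elements that have no content, only attributes: `<a class='abc'/>`
Attribute values	In HTML an attribute name may appear without a value.	In XML every attribute must have a value.
Quotes	In HTML attribute values may be unquoted.	In XML attribute values must be quoted.
Tags	HTML tags are fixed and cannot be user-defined.	XML has no fixed set of tags.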
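
Several of the XML rules in this table can be checked with Go's standard encoding/xml decoder, which is strict by default. The sketch below is an illustration only. wellFormed is a hypothetical helper, and the sample snippets are made up.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// wellFormed (hypothetical helper) reports whether s can be read to the end
// by encoding/xml in its default strict mode.
func wellFormed(s string) bool {
	dec := xml.NewDecoder(strings.NewReader(s))
	for {
		_, err := dec.Token()
		if err == io.EOF {
			return true // reached the end without a syntax error
		}
		if err != nil {
			return false
		}
	}
}

func main() {
	samples := []string{
		`<a class='abc'/>`,        // self-closing tag with a quoted attribute
		`<note><to>x</to></note>`, // user-defined tags
		`<p>text`,                 // missing end tag
		`<A></a>`,                 // tag names are case-sensitive
		`<a class=abc/>`,          // unquoted attribute value
		`<input disabled/>`,       // attribute without a value
	}
	for _, s := range samples {
		fmt.Printf("%-26s %v\n", s, wellFormed(s))
	}
}
```

An HTML parser such as net/html accepts input like the last four samples and repairs it, which is why Colly routes HTML and XML responses to different libraries.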