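pyquery: a jQuery-like library for Python

pyquery lets you run jQuery queries on XML documents, and its API is as close to jQuery's as possible. pyquery uses lxml for fast XML and HTML manipulation.

This is not (or at least not yet) a library for producing or interacting with JavaScript code. The author of pyquery simply liked the jQuery API so much that it was reimplemented in Python.

The project is hosted in a GitHub repository and is under active development. The author can grant push access to anyone who wants to contribute and will review their changes. If you want to contribute, send the author an email.

Bugs can be reported through the GitHub issue tracker.

Quickstart

You can use the PyQuery class to load an XML document from a string, an lxml document, a file, or a URL: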

>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> from urllib.request import urlopen
>>> d = pq("<html></html>")
>>> d = pq(etree.fromstring("<html></html>"))
>>> d = pq(url=your_url)
>>> d = pq(url=your_url,
...        opener=lambda url, **kw: urlopen(url).read())
>>> d = pq(filename=path_to_html_file)
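
The opener argument shown above accepts any callable that takes the URL plus keyword arguments and returns the document body. As a minimal sketch, assuming you want a timeout on the request, the lambda could be written as a named function. The name fetch_with_timeout and the 10-second value are hypothetical, and the block calls the function directly rather than going through pyquery:

```python
from urllib.request import urlopen


def fetch_with_timeout(url, **kw):
    """Hypothetical opener: fetch url and return the raw response body.

    Extra keyword arguments are accepted and ignored, mirroring the
    signature of the lambda used in the example above.
    """
    with urlopen(url, timeout=10) as response:
        return response.read()


# A data: URL keeps this demonstration offline; a real call would pass
# an http(s) URL instead.
body = fetch_with_timeout("data:text/plain,hello")
print(body)
```

Such a function could then be passed as pq(url=your_url, opener=fetch_with_timeout).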

Now d is like the $ in jQuery:

>>> d("#hello")
[<p#hello.hello>]
>>> p = d("#hello")
>>> print(p.html())
Hello world !
>>> p.html("you know <a href='http://python.org/'>Python</a> rocks")
[<p#hello.hello>]
>>> print(p.html())
you know <a href="http://python.org/">Python</a> rocks
>>> print(p.text())
you know Python rocks

You can also use some pseudo-classes that are available in jQuery but are not part of the CSS standard, such as :first, :last, :even, :odd, :eq, :lt, :gt, :checked, :selected and :file:

>>> d('p:first')
[<p#hello.hello>]

See http://pyquery.rtfd.org/ for the full documentation.

CSS

You can add, toggle, and remove CSS classes like this:

>>> p.addClass("toto")
[<p#hello.hello.toto>]
>>> p.toggleClass("titi toto")
[<p#hello.hello.titi>]
>>> p.removeClass("titi")
[<p#hello.hello>]
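
The toggleClass call above removes toto (already present) and adds titi (absent). A plain-Python sketch of that rule, working on a class string rather than on a pyquery object, might look like this. The function toggle_classes is a hypothetical illustration, not pyquery's implementation:

```python
def toggle_classes(current, names):
    """Toggle each space-separated name in `names` on the class string `current`."""
    classes = current.split()
    for name in names.split():
        if name in classes:
            classes.remove(name)   # present: switch it off
        else:
            classes.append(name)   # absent: switch it on
    return " ".join(classes)


# Same starting point as the example above: class="hello toto".
toggled = toggle_classes("hello toto", "titi toto")
print(toggled)
```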

Or manipulate CSS styles:

>>> p.css("font-size", "15px")
[<p#hello.hello>]
>>> p.attr("style")
'font-size: 15px'
>>> p.css({"font-size": "17px"})
[<p#hello.hello>]
>>> p.attr("style")
'font-size: 17px'

The same can be done in a more Pythonic way ('_' characters are converted to '-'):

>>> p.css.font_size = "16px"
>>> p.attr.style
'font-size: 16px'
>>> p.css['font-size'] = "15px"
>>> p.attr.style
'font-size: 15px'
>>> p.css(font_size="16px")
[<p#hello.hello>]
>>> p.attr.style
'font-size: 16px'
>>> p.css = {"font-size": "17px"}
>>> p.attr.style
'font-size: 17px'
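
To make the '_' to '-' conversion concrete, here is a toy model of an object that accepts Python-style names through attribute access, item access, and keyword arguments. The StyleProxy class is hypothetical and only mimics the naming rule described above; it is not how pyquery implements css:

```python
class StyleProxy:
    """Hypothetical helper that stores CSS declarations in a dict and
    converts '_' to '-' in property names."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the backing dict.
        object.__setattr__(self, "_props", {})

    @staticmethod
    def _css_name(name):
        return name.replace("_", "-")

    def __setattr__(self, name, value):
        self._props[self._css_name(name)] = value

    def __setitem__(self, name, value):
        self._props[self._css_name(name)] = value

    def __call__(self, **kwargs):
        for name, value in kwargs.items():
            self[name] = value
        return self

    def as_style(self):
        """Serialize the declarations the way a style attribute would hold them."""
        return "; ".join(f"{k}: {v}" for k, v in self._props.items())


css = StyleProxy()
css.font_size = "16px"                     # attribute access
css["font-size"] = "15px"                  # item access, same property
css(font_size="17px", line_height="1.2")   # keyword arguments
style = css.as_style()
print(style)
```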

Using pseudo-classes:

:button
Matches all button input elements and the button element
:checkbox
Matches all checkbox input elements
:checked
Matches all elements that are checked
:child
Matches when the right-hand element is an immediate child of the left-hand one
:contains()
Matches all elements that contain the given text
:descendant
Matches when the right-hand element is a child, grandchild or further descendant of the left-hand one
:disabled
Matches all elements that are disabled
:empty
Matches all elements that do not contain other elements
:enabled
Matches all elements that are enabled
:eq()
Matches a single element by its index
:even
Matches even elements, zero-indexed
:file
Matches all input elements of type file
:first
Matches the first selected element
:gt()
Matches all elements with an index over the given one
:header
Matches all header elements (h1, ..., h6)
:image
Matches all image input elements
:input
Matches all input elements
:last
Matches the last selected element
:lt()
Matches all elements with an index below the given one
:odd
Matches odd elements, zero-indexed
:parent
Matches all elements that contain other elements
:password
Matches all password input elements
:radio
Matches all radio input elements
:reset
Matches all reset input elements
:selected
Matches all elements that are selected
:submit
Matches all submit input elements
:text
Matches all text input elements
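
The index-based pseudo-classes in this list are all zero-indexed, which is easy to get wrong (":even" includes the first element). As a rough illustration, their meaning can be written as slices over a plain Python list standing in for the matched elements. The names below are placeholders, and no pyquery call is involved:

```python
# Stand-ins for five matched elements, in document order.
matched = ["p0", "p1", "p2", "p3", "p4"]

first = matched[:1]    # :first  -> the first matched element
last = matched[-1:]    # :last   -> the last matched element
even = matched[0::2]   # :even   -> indexes 0, 2, 4
odd = matched[1::2]    # :odd    -> indexes 1, 3
eq_2 = matched[2:3]    # :eq(2)  -> the element at index 2
lt_2 = matched[:2]     # :lt(2)  -> indexes below 2
gt_2 = matched[3:]     # :gt(2)  -> indexes above 2

for name, value in [("first", first), ("last", last), ("even", even),
                    ("odd", odd), ("eq(2)", eq_2), ("lt(2)", lt_2),
                    ("gt(2)", gt_2)]:
    print(name, value)
```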

Manipulating

You can also append content to the end of a tag:

>>> d = pq('<p class="hello" id="hello">you know Python rocks</p>')
>>> d('p').append(' check out <a href="http://reddit.com/r/python"><span>reddit</span></a>')
[<p#hello.hello>]
>>> print(d)
<p class="hello" id="hello">you know Python rocks check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>

Or prepend it to the beginning:

>>> p = d('p')
>>> p.prepend('check out <a href="http://reddit.com/r/python">reddit</a>')
[<p#hello.hello>]
>>> print(p.html())
check out <a href="http://reddit.com/r/python">reddit</a>you know ...

Prepend or append an element into another one:

>>> d = pq('<html><body><div id="test"><a href="http://python.org">python</a> !</div></body></html>')
>>> p.prependTo(d('#test'))
[<p#hello.hello>]
>>> print(d('#test').html())
<p class="hello" ...

Insert an element after another one:

>>> p.insertAfter(d('#test'))
[<p#hello.hello>]
>>> print(d('#test').html())
<a href="http://python.org">python</a> !

Or before it:

>>> p.insertBefore(d('#test'))
[<p#hello.hello>]
>>> print(d('body').html())
<p class="hello" id="hello">...

Do something for each element:

>>> p.each(lambda i, e: pq(e).addClass('hello2'))
[<p#hello.hello.hello2>]

Remove an element:

>>> d = pq('<html><body><p id="id">Yeah!</p><p>python rocks !</p></body></html>')
>>> d.remove('p#id')
[<html>]
>>> d('p#id')
[]

Remove the content of the selected elements:

>>> d('p').empty()
[<p>]

You can get the modified HTML back:

>>> print(d)
<html><body><p/></body></html>

You can generate HTML fragments:

>>> from pyquery import PyQuery as pq
>>> print(pq('<div>Yeah !</div>').addClass('myclass') + pq('<b>cool</b>'))
<div class="myclass">Yeah !</div><b>cool</b>

Remove all namespaces:

>>> d = pq('<foo xmlns="http://example.com/foo"></foo>')
>>> d
[<{http://example.com/foo}foo>]
>>> d.remove_namespaces()
[<foo>]
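
The {http://example.com/foo}foo form in the output above is the "{namespace-uri}localname" notation for qualified tag names. The standard library's xml.etree.ElementTree uses the same notation, so the idea of stripping namespaces can be sketched without pyquery. This is an illustration of the concept under that assumption, not a copy of what remove_namespaces does internally:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<foo xmlns="http://example.com/foo"><bar/></foo>')

# Qualified tag names before stripping.
before = [el.tag for el in root.iter()]

# Drop the "{uri}" prefix from every tag that has one.
for el in root.iter():
    if el.tag.startswith("{"):
        el.tag = el.tag.split("}", 1)[1]

after = [el.tag for el in root.iter()]
print(before)
print(after)
```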

Traversing

Some jQuery traversal methods are supported as well. Here are a few examples.

You can filter the selection list using a string selector:

>>> d = pq('<p id="hello" class="hello"><a/></p><p id="test"><a/></p>')
>>> d('p').filter('.hello')
[<p#hello.hello>]

You can pick out a single element with eq:

>>> d('p').eq(0)
[<p#hello.hello>]

You can find nested elements:

>>> d('p').find('a')
[<a>, <a>]
>>> d('p').eq(1).find('a')
[<a>]

Breaking out of one level of traversal with end is also supported:

>>> d('p').find('a').end()
[<p#hello.hello>, <p#test>]
>>> d('p').eq(0).end()
[<p#hello.hello>, <p#test>]
>>> d('p').filter(lambda i: i == 1).end()
[<p#hello.hello>, <p#test>]
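
One way to picture end is that every traversal step returns a new selection that remembers the selection it came from. The toy Selection class below models only that bookkeeping; it is a hypothetical sketch over plain strings and makes no claim about pyquery's internals:

```python
class Selection(list):
    """Toy selection: a list of items plus a link to the parent selection."""

    def __init__(self, items, parent=None):
        super().__init__(items)
        self._parent = parent

    def filter(self, predicate):
        # Keep the items whose index satisfies the predicate.
        kept = [item for i, item in enumerate(self) if predicate(i)]
        return Selection(kept, parent=self)

    def eq(self, index):
        # A one-element selection (empty if the index is out of range).
        return Selection(self[index:index + 1], parent=self)

    def end(self):
        # Step back to the selection this one was derived from.
        return self._parent


paragraphs = Selection(["p#hello", "p#test"])
print(paragraphs.eq(0))
print(paragraphs.eq(0).end())
print(paragraphs.filter(lambda i: i == 1).end())
```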

Scraping

pyquery can also load an HTML document from a URL:

>>> pq(your_url)
[<html>]

By default it uses Python's urllib.

If requests is installed, it is used instead, and you can pass most of the requests parameters:

>>> pq(your_url, headers={'user-agent': 'pyquery'})
[<html>]

>>> pq(your_url, {'q': 'foo'}, method='post', verify=True)
[<html>]

pyquery – PyQuery complete API: see http://pyquery.readthedocs.org/en/latest/api.html

pyquery.ajax – PyQuery AJAX extension

If WebOb is installed (it is not a pyquery dependency), you can query WSGI apps. In this example the test app returns a simple input at / and a submit button at /submit:

>>> d = pq('<form></form>', app=input_app)
>>> d.append(d.get('/'))
[<form>]
>>> print(d)
<form><input name="youyou" type="text" value=""/></form>

The app is also available in new nodes:

>>> d.get('/').app is d.app is d('form').app
True

You can also request another path:

>>> d.append(d.get('/submit'))
[<form>]
>>> print(d)
<form><input name="youyou" type="text" value=""/><input type="submit" value="OK"/></form>

If restkit is installed, you can fetch a URL directly through a HostProxy app:

>>> a = d.get(your_url)
>>> a
[<html>]

You can retrieve the app's response:

>>> print(a.response.status)
200 OK

Tips

You can make links absolute, which can be useful for screen scraping:

>>> d = pq(url=your_url, parser='html')
>>> d('form').attr('action')
'/form-submit'
>>> d.make_links_absolute()
[<html>]
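
What "absolute" means for a single attribute can be illustrated with the standard library's urljoin, which resolves a reference against a base URL. The base URL below is a hypothetical stand-in for your_url, and this sketch only shows the resolution rule, not pyquery's own code path:

```python
from urllib.parse import urljoin

# Hypothetical page URL, standing in for your_url.
base_url = "http://example.com/app/index.html"

# A root-relative form action, as in the example above.
absolute_action = urljoin(base_url, "/form-submit")

# A path-relative link resolves against the page's directory.
absolute_image = urljoin(base_url, "img/logo.png")

# An already absolute link is left unchanged.
unchanged = urljoin(base_url, "http://python.org/")

print(absolute_action)
print(absolute_image)
print(unchanged)
```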

Using different parsers

By default, pyquery uses the lxml XML parser and, if that fails, falls back to the HTML parser from lxml.html. The XML parser can sometimes be problematic when parsing XHTML pages, because it does not raise an error but returns an unusable tree (on w3c.org, for example).

You can also explicitly specify which parser to use:

>>> pq('<html><body><p>toto</p></body></html>', parser='xml')
[<html>]
>>> pq('<html><body><p>toto</p></body></html>', parser='html')
[<html>]
>>> pq('<html><body><p>toto</p></body></html>', parser='html_fragments')
[<p>]

The html and html_fragments parsers are the ones from lxml.html.
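
The "try the strict XML parser first, then fall back to a lenient HTML parser" strategy can be sketched with standard-library stand-ins: xml.etree.ElementTree plays the strict parser and html.parser.HTMLParser the lenient one. The helper names are hypothetical, and lxml's real parsers behave differently in detail, so treat this only as an outline of the fallback logic:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Lenient stand-in parser that simply records the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_with_fallback(markup):
    """Return (parser_used, tag_names), trying strict XML parsing first."""
    try:
        root = ET.fromstring(markup)
        return "xml", [el.tag for el in root.iter()]
    except ET.ParseError:
        # Not well-formed XML: fall back to the tolerant HTML parser.
        collector = TagCollector()
        collector.feed(markup)
        collector.close()
        return "html", collector.tags


well_formed = parse_with_fallback("<html><body><p>toto</p></body></html>")
malformed = parse_with_fallback("<html><body><p>toto<br></body></html>")
print(well_formed)
print(malformed)
```

Passing parser='xml' or parser='html' explicitly, as shown above, avoids relying on the fallback at all.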

Permalink: http://bookshadow.com/weblog/2014/10/02/pyquery-introduction/
Please respect the author's work and credit the source when reposting. 书影博客 reserves all rights to this article.
