Studying Python (Dive Into Python)

lastwinner · posted 2006-7-15 20:10

8.9. Putting it all together
It is time to put everything we have learned so far to good use. I hope you were paying attention.

Example 8.20. The translate function, part 1

def translate(url, dialectName="chef"):
    import urllib
    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()
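  The translate function has an optional argument, dialectName, a string that specifies the dialect to use. We will see how it is used in a minute.
  Hey, wait a minute, there is an import statement inside this function! That is perfectly legal in Python. You are used to seeing import statements at the top of a program, which makes the imported module available anywhere in the program. But you can also import a module inside a function, which means the imported module is available only within that function. If a module is only ever used in one function, this is an easy way to make your code more modular. (When you find that your weekend hack has turned into an 800-line work of art and decide to split it into a dozen reusable modules, you will appreciate this.)
  Now we have the raw source of the given URL.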
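As a minimal sketch of the function-level import described above (written for Python 3; the function names hypotenuse and imported_names are made up for illustration), the import binds the module name in the function's local namespace only. The module object itself is still cached in sys.modules, so repeating the import on later calls is cheap.

```python
def hypotenuse(a, b):
    """Return the hypotenuse of a right triangle with legs a and b."""
    import math  # binds the name 'math' in this function's local scope only
    return math.hypot(a, b)


def imported_names():
    """Show which local names a function-level import creates."""
    import math
    return sorted(locals())


print(hypotenuse(3, 4))    # 5.0
print(imported_names())    # ['math']
```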

lastwinner · posted 2006-7-15 20:11

Example 8.21. The translate function, part 2: curiouser and curiouser
    parserName = "%sDialectizer" % dialectName.capitalize()
    parserClass = globals()[parserName]
    parser = parserClass()
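  capitalize is a string method we have not seen before; it simply capitalizes the first letter of a string and forces all the other letters to lowercase. Combined with some string formatting, it turns the name of a dialect into the name of the corresponding Dialectizer class. If dialectName is the string 'chef', parserName will be the string 'ChefDialectizer'.
  We have the name of a class as a string (parserName), and we have the global namespace as a dictionary (globals()). Combined, they give us a reference to the class that the string names. (Remember, classes are objects, and they can be assigned to variables just like any other object.) If parserName is the string 'ChefDialectizer', parserClass will be the class ChefDialectizer.
  Finally, we have a class object (parserClass), and we want an instance of that class. Well, we already know how to do that: call the class like a function. The fact that the class is stored in a local variable makes absolutely no difference; we just call the local variable like a function, and out pops an instance of the class. If parserClass is the class ChefDialectizer, parser will be an instance of the class ChefDialectizer.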
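The three steps can be tried in isolation. The sketch below is written for Python 3 and uses two empty stand-in classes (hypothetical, they do no translating); only the name construction, the globals() lookup, and the call mirror the example.

```python
class ChefDialectizer:
    """Stand-in for the real dialectizer class (hypothetical, does nothing)."""


class FuddDialectizer:
    """A second stand-in, to show that the lookup is driven by the name."""


def make_parser(dialect_name="chef"):
    # 'chef' -> 'Chef' -> 'ChefDialectizer'
    parser_name = "%sDialectizer" % dialect_name.capitalize()
    # globals() is a dictionary mapping module-level names to objects
    parser_class = globals()[parser_name]
    # calling the class object creates an instance
    return parser_class()


print(type(make_parser("chef")).__name__)   # ChefDialectizer
print(type(make_parser("FUDD")).__name__)   # FuddDialectizer
```

An unknown dialect name raises KeyError from the dictionary lookup, which is worth catching if the name comes from outside the program.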

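Why bother? After all, there are only three Dialectizer classes; why not just use a case statement? (Well, there is no case statement in Python, but why not just use a series of if statements?) One reason: extensibility. The translate function has absolutely no idea how many Dialectizer classes we have defined. Imagine that we defined a new FooDialectizer class tomorrow; translate would work when passed 'foo' as the dialectName.

Even better, imagine putting FooDialectizer in a separate module and importing it with from module import. We have already seen that this places it in globals(), so translate would still work without modification, even though FooDialectizer lives in a separate file.

Now imagine that the name of the dialect comes from somewhere outside the program, maybe from a database or from a value the user entered in a form. You can use any number of server-side Python scripting architectures to generate web pages dynamically; this function could take a URL and a dialect name (both strings) from the query string of a page request, and output the "translated" web page.

Finally, imagine a Dialectizer framework with a plug-in architecture. You could put each Dialectizer class in its own file, leaving only the translate function in dialect.py. Assuming a consistent naming scheme, the translate function could dynamically import the appropriate class from the appropriate file, given nothing but the dialect name. (You have not seen dynamic importing yet, but I promise to cover it in a later chapter.) To add a new dialect, you would simply add an appropriately named file to the plug-ins directory (like foodialect.py, which contains the FooDialectizer class). Calling the translate function with the dialect name 'foo' would find the module foodialect.py, import the class FooDialectizer, and away we go.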
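One possible shape for that plug-in lookup, as a Python 3 sketch rather than the book's own implementation: it uses importlib.import_module from the standard library, and the helper name load_dialectizer is an assumption. To keep the example self-contained, the foodialect module is built in memory instead of being read from a plug-ins directory.

```python
import importlib
import sys
import types

# Stand-in for a plug-in file named foodialect.py: build the module in memory
# and register it so that the import machinery can find it by name.
plugin = types.ModuleType("foodialect")


class FooDialectizer:
    """Hypothetical plug-in class; a real one would rewrite text."""


plugin.FooDialectizer = FooDialectizer
sys.modules["foodialect"] = plugin


def load_dialectizer(dialect_name):
    # Naming scheme taken from the text: 'foo' -> foodialect.FooDialectizer
    module = importlib.import_module("%sdialect" % dialect_name.lower())
    return getattr(module, "%sDialectizer" % dialect_name.capitalize())


parser = load_dialectizer("foo")()
print(type(parser).__name__)   # FooDialectizer
```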

lastwinner · posted 2006-7-15 20:11

Example 8.22. The translate function, part 3
    parser.feed(htmlSource)
    parser.close()
    return parser.output()
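  After all that imagining, this is going to seem pretty boring, but the feed function is what does the entire transformation. We have the entire HTML source in a single string, so we only need to call feed once. However, you can call feed as often as you like, and the parser will just keep parsing. So if we were worried about memory usage (or we knew we would be dealing with very large HTML pages), we could call it in a loop, reading a few bytes of HTML at a time and feeding them to the parser. The result would be the same.
  Because feed maintains an internal buffer, you should always call the parser's close method when you are done (even if you fed it everything at once, as we did). Otherwise you may find that the output is missing the last few bytes.
  Remember, output is the function we defined on BaseHTMLProcessor that joins all the buffered pieces of output and returns them as a single string.

And just like that, we have "translated" a web page, given nothing but a URL and the name of a dialect.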
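The claim that feeding in pieces gives the same result can be checked with a small Python 3 sketch. It uses html.parser.HTMLParser from the standard library as a stand-in, since sgmllib is not available in Python 3; the class TagAndTextCollector and the chunk size are assumptions made for the demonstration.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Record start tags and text so two parsing runs can be compared."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)

    def result(self):
        return self.tags, "".join(self.text)


def parse_at_once(html_source):
    parser = TagAndTextCollector()
    parser.feed(html_source)
    parser.close()          # flush whatever is still buffered
    return parser.result()


def parse_in_chunks(html_source, chunk_size=7):
    parser = TagAndTextCollector()
    for start in range(0, len(html_source), chunk_size):
        # a chunk may end in the middle of a tag; the parser buffers it
        parser.feed(html_source[start:start + chunk_size])
    parser.close()
    return parser.result()


html_source = "<html><body><p>Hello, <b>world</b>!</p></body></html>"
print(parse_at_once(html_source) == parse_in_chunks(html_source))
```

Text may reach handle_data in several pieces when the input is chunked, which is why the sketch compares the joined text rather than the individual calls.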

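lastwinner · posted 2006-7-15 20:11

Further reading

You may have thought I was kidding about the server-side scripting idea. So did I, until I found this web-based dialectizer. Unfortunately, its source code does not appear to be available.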

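lastwinner · posted 2006-7-15 20:12

8.10. Summary
Python provides you with a powerful tool, sgmllib.py, for manipulating HTML by turning its structure into an object model. You can use this tool in many different ways:

Parsing the HTML looking for something specific
Aggregating the results, like the URL lister
Altering the structure along the way, like the attribute quoter
Transforming the HTML into something else by manipulating the text while leaving the tags alone, like the Dialectizer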
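As a reminder of what the "aggregating" use looks like, here is a minimal URL lister sketched for Python 3. It is built on html.parser.HTMLParser, which offers a similar event-driven interface, because sgmllib was removed in Python 3; it is not the chapter's original listing.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            self.urls.extend(value for name, value in attrs if name == "href")


lister = URLLister()
lister.feed('<p><a href="index.html">Home</a> <a href="toc.html">Contents</a></p>')
lister.close()
print(lister.urls)   # ['index.html', 'toc.html']
```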
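After working through these examples, you should be comfortable doing the following:

Using locals() and globals() to access namespaces
Formatting strings using dictionary-based substitutions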
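Both skills fit in a few lines. In this short sketch (the function describe and the sample values are made up), the %(name)s placeholders are filled from whatever dictionary appears on the right of the % operator, and locals() is simply one convenient dictionary to pass.

```python
def describe(tag, count):
    # inside the function, locals() is {'tag': ..., 'count': ...}
    return "<%(tag)s> appeared %(count)d times" % locals()


print(describe("a", 3))                                        # <a> appeared 3 times
print("%(host)s:%(port)d" % {"host": "localhost", "port": 80})  # localhost:80
```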

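lastwinner · posted 2006-7-15 20:34

Chapter 9. XML Processing
9.1. Overview
9.2. Packages
9.3. Parsing XML
9.4. Unicode
9.5. Searching for elements
9.6. Accessing element attributes
9.7. Segue
9.1. Overview
The next two chapters are about XML processing in Python. It helps if you already know what an XML document looks like: that it is made up of structured tags that form a hierarchy of elements, and so on. If that does not make sense to you, there are many XML tutorials that explain the basics.

If you are not particularly interested in XML, you should still read these chapters, which cover important topics such as Python packages, Unicode, command-line arguments, and how to use getattr for method dispatching.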

Being a philosophy major is not required, although if you have ever had the misfortune of being subjected to the writings of Immanuel Kant, you will appreciate the example program a lot more than if you majored in something useful, like computer science.

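There are two basic ways to work with XML. One is called SAX ("Simple API for XML"), and it works by reading the XML a little bit at a time and calling a method for each element it finds. (If you read Chapter 8, HTML Processing, this should sound familiar, because that is how the sgmllib module works.) The other is called DOM ("Document Object Model"), and it works by reading in the entire XML document at once and creating an internal representation of it using Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter deals only with the DOM.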
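To make the difference concrete, the following Python 3 sketch lists the tags of the same small document both ways, using the standard xml.sax and xml.dom.minidom modules. The sample XML and the names TagRecorder, tags_via_sax, and tags_via_dom are assumptions made for illustration.

```python
import xml.sax
from xml.dom import minidom

XML_TEXT = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"


class TagRecorder(xml.sax.ContentHandler):
    """SAX style: the parser calls startElement as each tag is encountered."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


def tags_via_sax(xml_text):
    handler = TagRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tags


def tags_via_dom(xml_text):
    # DOM style: build the whole tree first, then walk it
    document = minidom.parseString(xml_text)
    return [element.tagName for element in document.getElementsByTagName("*")]


print(tags_via_sax(XML_TEXT))   # ['grammar', 'ref', 'p', 'p']
print(tags_via_dom(XML_TEXT))   # ['grammar', 'ref', 'p', 'p']
```

With SAX the handler only ever sees one event at a time; with DOM the whole tree stays in memory and can be revisited in any order.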

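Below is a complete Python program that generates pseudo-random output based on a context-free grammar defined in an XML format. Do not worry if you do not understand what that means yet; the next two chapters examine both the program's input and its output in depth.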

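lastwinner · posted 2006-7-15 20:34

Example 9.1. kgp.py
If you have not already downloaded the example programs that accompany this book, you can download this and the other examples.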

"""Kant Generator for Python

Generates mock philosophy based on a context-free grammar

Usage: python kgp.py [options] [source]

Options:
  -g ..., --grammar=...   use specified grammar file or URL
  -h, --help              show this help
  -d                      show debugging information while parsing

Examples:
  kgp.py                  generates several paragraphs of Kantian philosophy
  kgp.py -g husserl.xml   generates several paragraphs of Husserl
  kgp.py "<xref id='paragraph'/>"  generates a paragraph of Kant
  kgp.py template.xml     reads from template.xml to decide what to generate
"""
from xml.dom import minidom
import random
import toolbox
import sys
import getopt

_debug = 0

class NoSourceError(Exception): pass

class KantGenerator:
    """generates mock philosophy based on a context-free grammar"""

    def __init__(self, grammar, source=None):
        self.loadGrammar(grammar)
        self.loadSource(source and source or self.getDefaultSource())
        self.refresh()

    def _load(self, source):
        """load XML input source, return parsed XML document

        - a URL of a remote XML file ("http://diveintopython.org/kant.xml")
        - a filename of a local XML file ("~/diveintopython/common/py/kant.xml")
        - standard input ("-")
        - the actual XML document, as a string
        """
        sock = toolbox.openAnything(source)
        xmldoc = minidom.parse(sock).documentElement
        sock.close()
        return xmldoc

    def loadGrammar(self, grammar):
        """load context-free grammar"""
        self.grammar = self._load(grammar)
        self.refs = {}
        for ref in self.grammar.getElementsByTagName("ref"):
            self.refs[ref.attributes["id"].value] = ref

    def loadSource(self, source):
        """load source"""
        self.source = self._load(source)

    def getDefaultSource(self):
        """guess default source of the current grammar

        The default source will be one of the <ref>s that is not
        cross-referenced. This sounds complicated but it's not.
        Example: The default source for kant.xml is
        "<xref id='section'/>", because 'section' is the one <ref>
        that is not <xref>'d anywhere in the grammar.
        In most grammars, the default source will produce the
        longest (and most interesting) output.
        """
        xrefs = {}
        for xref in self.grammar.getElementsByTagName("xref"):
            xrefs[xref.attributes["id"].value] = 1
        xrefs = xrefs.keys()
        standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
        if not standaloneXrefs:
            raise NoSourceError, "can't guess source, and no source specified"
        return '<xref id="%s"/>' % random.choice(standaloneXrefs)

    def reset(self):
        """reset parser"""
        self.pieces = []
        self.capitalizeNextWord = 0

    def refresh(self):
        """reset output buffer, re-parse entire source file, and return output

        Since parsing involves a good deal of randomness, this is an
        easy way to get new output without having to reload a grammar file
        each time.
        """
        self.reset()
        self.parse(self.source)
        return self.output()

    def output(self):
        """output generated text"""
        return "".join(self.pieces)

    def randomChildElement(self, node):
        """choose a random child element of a node

        This is a utility method used by do_xref and do_choice.
        """
        choices = [e for e in node.childNodes
                   if e.nodeType == e.ELEMENT_NODE]
        chosen = random.choice(choices)
        if _debug:
            sys.stderr.write('%s available choices: %s\n' % \
                (len(choices), [e.toxml() for e in choices]))
            sys.stderr.write('Chosen: %s\n' % chosen.toxml())
        return chosen

    def parse(self, node):
        """parse a single XML node

        A parsed XML document (from minidom.parse) is a tree of nodes
        of various types. Each node is represented by an instance of the
        corresponding Python class (Element for a tag, Text for
        text data, Document for the top-level document). The following
        statement constructs the name of a class method based on the type
        of node we're parsing ("parse_Element" for an Element node,
        "parse_Text" for a Text node, etc.) and then calls the method.
        """
        parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
        parseMethod(node)

    def parse_Document(self, node):
        """parse the document node

        The document node by itself isn't interesting (to us), but
        its only child, node.documentElement, is: it's the root node
        of the grammar.
        """
        self.parse(node.documentElement)

    def parse_Text(self, node):
        """parse a text node

        The text of a text node is usually added to the output buffer
        verbatim. The one exception is that <p class='sentence'> sets
        a flag to capitalize the first letter of the next word. If
        that flag is set, we capitalize the text and reset the flag.
        """
        text = node.data
        if self.capitalizeNextWord:
            self.pieces.append(text[0].upper())
            self.pieces.append(text[1:])
            self.capitalizeNextWord = 0
        else:
            self.pieces.append(text)

    def parse_Element(self, node):
        """parse an element

        An XML element corresponds to an actual tag in the source:
        <xref id='...'>, <p>, <choice>, etc.
        Each element type is handled in its own method. Like we did in
        parse(), we construct a method name based on the name of the
        element ("do_xref" for an <xref> tag, etc.) and
        call the method.
        """
        handlerMethod = getattr(self, "do_%s" % node.tagName)
        handlerMethod(node)

    def parse_Comment(self, node):
        """parse a comment

        The grammar can contain XML comments, but we ignore them
        """
        pass

    def do_xref(self, node):
        """handle <xref id='...'> tag

        An <xref id='...'> tag is a cross-reference to a <ref id='...'>
        tag. <xref id='sentence'/> evaluates to a randomly chosen child of
        <ref id='sentence'>.
        """
        id = node.attributes["id"].value
        self.parse(self.randomChildElement(self.refs[id]))

    def do_p(self, node):
        """handle <p> tag

        The <p> tag is the core of the grammar. It can contain almost
        anything: freeform text, <choice> tags, <xref> tags, even other
        <p> tags. If a "class='sentence'" attribute is found, a flag
        is set and the next word will be capitalized. If a "chance='X'"
        attribute is found, there is an X% chance that the tag will be
        evaluated (and therefore a (100-X)% chance that it will be
        completely ignored)
        """
        keys = node.attributes.keys()
        if "class" in keys:
            if node.attributes["class"].value == "sentence":
                self.capitalizeNextWord = 1
        if "chance" in keys:
            chance = int(node.attributes["chance"].value)
            doit = (chance > random.randrange(100))
        else:
            doit = 1
        if doit:
            for child in node.childNodes: self.parse(child)

    def do_choice(self, node):
        """handle <choice> tag

        A <choice> tag contains one or more <p> tags. One <p> tag
        is chosen at random and evaluated; the rest are ignored.
        """
        self.parse(self.randomChildElement(node))

def usage():
    print __doc__

def main(argv):
    grammar = "kant.xml"
    try:
        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            usage()
            sys.exit()
        elif opt == '-d':
            global _debug
            _debug = 1
        elif opt in ("-g", "--grammar"):
            grammar = arg

    source = "".join(args)

    k = KantGenerator(grammar, source)
    print k.output()

if __name__ == "__main__":
    main(sys.argv[1:])
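The two getattr lookups in kgp.py (one by node class name, one by tag name) are the method-dispatching idea mentioned in the overview. The Python 3 sketch below reproduces just that pattern with tiny stand-in node classes; Text, Element, Renderer, and the do_em handler are hypothetical and are not the minidom classes or the kgp.py handlers.

```python
class Text:
    def __init__(self, data):
        self.data = data


class Element:
    def __init__(self, tagName, children=()):
        self.tagName = tagName
        self.children = list(children)


class Renderer:
    def __init__(self):
        self.pieces = []

    def parse(self, node):
        # 'parse_Text' for a Text node, 'parse_Element' for an Element node
        method = getattr(self, "parse_%s" % node.__class__.__name__)
        method(node)

    def parse_Text(self, node):
        self.pieces.append(node.data)

    def parse_Element(self, node):
        # second level of dispatch, by tag name: 'do_p', 'do_em', ...
        handler = getattr(self, "do_%s" % node.tagName)
        handler(node)

    def do_p(self, node):
        for child in node.children:
            self.parse(child)

    def do_em(self, node):
        self.pieces.append("*")
        for child in node.children:
            self.parse(child)
        self.pieces.append("*")


tree = Element("p", [Text("hello "), Element("em", [Text("world")])])
renderer = Renderer()
renderer.parse(tree)
print("".join(renderer.pieces))   # hello *world*
```

Supporting a new tag means adding one do_ method; a tag with no matching method raises AttributeError from getattr.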

lastwinner · posted 2006-7-15 20:35

Example 9.2. toolbox.py
"""Miscellaneous utility functions"""

def openAnything(source):
    """URI, filename, or string --> stream

    This function lets you define parsers that take any input source
    (URL, pathname to local or network file, or actual data as a string)
    and deal with it in a uniform manner. Returned object is guaranteed
    to have all the basic stdio read methods (read, readline, readlines).
    Just .close() the object when you're done with it.

    Examples:
    >>> from xml.dom import minidom
    >>> sock = openAnything("http://localhost/kant.xml")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    >>> sock = openAnything("c:\\inetpub\\wwwroot\\kant.xml")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    >>> sock = openAnything("<ref id='conjunction'><text>and</text><text>or</text></ref>")
    >>> doc = minidom.parse(sock)
    >>> sock.close()
    """
    if hasattr(source, "read"):
        return source

    if source == '-':
        import sys
        return sys.stdin

    # try to open with urllib (if source is http, ftp, or file URL)
    import urllib
    try:
        return urllib.urlopen(source)
    except (IOError, OSError):
        pass

    # try to open with native open function (if source is pathname)
    try:
        return open(source)
    except (IOError, OSError):
        pass

    # treat source as string
    import StringIO
    return StringIO.StringIO(str(source))
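For readers on Python 3, the same fall-through idea might look like the sketch below. This is an adaptation under stated assumptions, not the book's code: urllib.urlopen became urllib.request.urlopen, StringIO moved to io.StringIO, and text that is not a URL at all makes urlopen raise ValueError, so that exception is caught as well. The name open_anything is made up.

```python
import io
import sys
import urllib.request


def open_anything(source):
    """Return a readable stream for a stream, '-', URL, pathname, or string."""
    if hasattr(source, "read"):
        return source                      # already a stream
    if source == "-":
        return sys.stdin
    try:
        # Python 3 raises ValueError for text that is not a URL at all
        return urllib.request.urlopen(source)
    except (OSError, ValueError):
        pass
    try:
        return open(source)                # maybe it is a pathname
    except (OSError, ValueError):
        pass
    return io.StringIO(str(source))        # treat the text itself as the data


stream = open_anything("<ref id='bit'><p>0</p><p>1</p></ref>")
print(stream.read())
stream.close()
```

One difference to keep in mind: in Python 3 the URL branch yields bytes while the file and string branches yield text, so a caller that needs uniform data has to decode or choose a mode explicitly.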
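Run the program kgp.py by itself, and it will parse the default XML-based grammar in kant.xml and print several paragraphs' worth of philosophy in the style of Immanuel Kant.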

lastwinner · posted 2006-7-15 20:35

例 9.3. Sample output of kgp.py
[you@localhost kgp]$ python kgp.py
   As is shown in the writings of Hume, our a priori concepts, in
reference to ends, abstract from all content of knowledge; in the study
of space, the discipline of human reason, in accordance with the
principles of philosophy, is the clue to the discovery of the
Transcendental Deduction.  The transcendental aesthetic, in all
theoretical sciences, occupies part of the sphere of human reason
concerning the existence of our ideas in general; still, the
never-ending regress in the series of empirical conditions constitutes
the whole content for the transcendental unity of apperception.  What
we have alone been able to show is that, even as this relates to the
architectonic of human reason, the Ideal may not contradict itself, but
it is still possible that it may be in contradictions with the
employment of the pure employment of our hypothetical judgements, but
natural causes (and I assert that this is the case) prove the validity
of the discipline of pure reason.  As we have already seen, time (and
it is obvious that this is true) proves the validity of time, and the
architectonic of human reason, in the full sense of these terms,
abstracts from all content of knowledge.  I assert, in the case of the
discipline of practical reason, that the Antinomies are just as
necessary as natural causes, since knowledge of the phenomena is a
posteriori.
   The discipline of human reason, as I have elsewhere shown, is by
its very nature contradictory, but our ideas exclude the possibility of
the Antinomies.  We can deduce that, on the contrary, the pure
employment of philosophy, on the contrary, is by its very nature
contradictory, but our sense perceptions are a representation of, in
the case of space, metaphysics.  The thing in itself is a
representation of philosophy.  Applied logic is the clue to the
discovery of natural causes.  However, what we have alone been able to
show is that our ideas, in other words, should only be used as a canon
for the Ideal, because of our necessary ignorance of the conditions.

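[...snip...]

This is, of course, complete gibberish. Well, not complete gibberish. It is syntactically and grammatically correct (although very verbose; Kant was not what you would call a get-to-the-point kind of guy). Some of it may actually be true (or at least the sort of thing Kant would have agreed with), some of it is blatantly false, and most of it is simply incoherent. But all of it is in the style of Immanuel Kant.

Let me repeat: this is much, much funnier if you are now or have ever been a philosophy major.

The interesting thing about this program is that there is nothing Kant-specific about it. All the content comes from the context-free grammar file, kant.xml. If you tell the program to use a different grammar file (which you can specify on the command line), the output will be completely different.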

lastwinner · posted 2006-7-15 21:16

Example 9.4. Simpler output from kgp.py
[you@localhost kgp]$ python kgp.py -g binary.xml
00101001
[you@localhost kgp]$ python kgp.py -g binary.xml
10110100
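The -g option used in these runs is handled by getopt in main. A reduced Python 3 sketch of that parsing follows; the option strings are the ones from the listing, while the helper name parse_args and its return value are assumptions, and the help option is accepted but ignored here.

```python
import getopt


def parse_args(argv):
    """Return (grammar, debug, source) from a kgp.py-style argument list."""
    grammar, debug = "kant.xml", False
    # "g:" and "grammar=" mean the option takes a value; "h" and "d" do not
    opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
    for opt, arg in opts:
        if opt in ("-g", "--grammar"):
            grammar = arg
        elif opt == "-d":
            debug = True
    return grammar, debug, "".join(args)


print(parse_args(["-g", "binary.xml"]))              # ('binary.xml', False, '')
print(parse_args(["--grammar=husserl.xml", "-d"]))   # ('husserl.xml', True, '')
print(parse_args([]))                                # ('kant.xml', False, '')
```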

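Later in this chapter you will take a closer look at the structure of the grammar file. For now, all you need to know is that the grammar file defines the structure of the output, and that the kgp.py program reads through the grammar and makes random decisions about which words to plug in where.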
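To give a feel for "reads through the grammar and makes random decisions", here is a much reduced Python 3 sketch. The grammar string is a guess at what a binary grammar could look like, not the contents of the real binary.xml, and the functions load_refs and expand are hypothetical simplifications of what KantGenerator does with ref, xref, and p tags.

```python
import random
from xml.dom import minidom

# Hypothetical grammar, not the real binary.xml: a byte is eight bits,
# and a bit is either 0 or 1.
GRAMMAR = """<grammar>
<ref id="bit"><p>0</p><p>1</p></ref>
<ref id="byte"><p>""" + '<xref id="bit"/>' * 8 + """</p></ref>
</grammar>"""


def load_refs(grammar_text):
    document = minidom.parseString(grammar_text)
    return {ref.getAttribute("id"): ref
            for ref in document.getElementsByTagName("ref")}


def expand(node, refs):
    """Return the text generated by one node of the grammar."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    if node.tagName == "xref":
        # pick one child element of the referenced <ref> at random
        target = refs[node.getAttribute("id")]
        choices = [child for child in target.childNodes
                   if child.nodeType == child.ELEMENT_NODE]
        return expand(random.choice(choices), refs)
    # <p> (and anything else): expand every child in order
    return "".join(expand(child, refs) for child in node.childNodes)


refs = load_refs(GRAMMAR)
start = minidom.parseString('<xref id="byte"/>').documentElement
print(expand(start, refs))   # eight random binary digits
```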
