[Writing a Crawler in Python] Ways to Fetch HTML

howklp · posted 2008-6-13 16:21

[Writing a Crawler in Python] Ways to Fetch HTML, Part 1: Using urllib

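Personally, I find Python a language that puts programmers very much at ease. It is a scripting language, it gives immediate results, it is open source, and all of that comes naturally. This is even more true when writing a crawler in Python.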

There is no syntax introduction here and no hello world, only applications and only code.

  # -*- coding: UTF-8 -*-
  # Python 2 code: in Python 3, urlopen lives in urllib.request instead.
  import urllib

  def getWebPageContent(url):
    'Fetch the content of a web page and return it.'
    f = urllib.urlopen(url)
    data = f.read()
    f.close()
    return data

  url = 'http://www.itpub.net'
  content = getWebPageContent(url)
  print content
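For anyone reading this on Python 3, here is a rough sketch of the same idea. It is not the code from the post: the name get_web_page_content and the timeout and default_charset parameters are my own assumptions, and falling back to UTF-8 when the server declares no charset is just a guess that will not suit every site.

```python
import urllib.request


def get_web_page_content(url, timeout=10, default_charset="utf-8"):
    """Fetch a page and return its body decoded to text.

    timeout and default_charset are illustrative parameters, not part of
    the original post.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the Content-Type header if there is one,
        # otherwise fall back to the assumed default.
        charset = response.headers.get_content_charset() or default_charset
    # Undecodable bytes are replaced rather than raising an exception.
    return raw.decode(charset, errors="replace")


# Usage, mirroring the post:
# content = get_web_page_content("http://www.itpub.net")
# print(content)
```

The with statement closes the response even if read() fails, which the explicit f.close() in the Python 2 version does not guarantee.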

[ Last edited by howklp on 2008-6-13 16:34 ]

howklp · posted 2008-6-13 16:30

# Pycurl reference: http://pycurl.sourceforge.net/
# Pycurl download: http://pycurl.sourceforge.net/download/pycurl-7.18.1.tar.gz

# -*- coding: UTF-8 -*-
import pycurl
import StringIO

def getURLContent_pycurl(url):
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    # Collect the response body in an in-memory buffer
    b = StringIO.StringIO()
    c.setopt(pycurl.WRITEFUNCTION, b.write)
    # Follow redirects, at most 5 of them
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    # Proxy (uncomment to use)
    #c.setopt(pycurl.PROXY, 'http://11.11.11.11:8080')
    #c.setopt(pycurl.PROXYUSERPWD, 'aaa:aaa')
    c.perform()
    c.close()
    return b.getvalue()

url = 'http://www.itpub.net'
content = getURLContent_pycurl(url)
print content
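If libcurl is not available, the two things configured on the Curl handle above (following redirects up to a limit, and an optional proxy) can be approximated with the Python 3 standard library alone. To be clear, this swaps pycurl for urllib.request and is only a sketch: LimitedRedirectHandler, make_opener, get_url_content and the proxy and timeout parameters are names I made up for illustration.

```python
import urllib.request


class LimitedRedirectHandler(urllib.request.HTTPRedirectHandler):
    # Cap on how many redirects are followed for one request; this plays
    # the role of pycurl.MAXREDIRS in the pycurl version.
    max_redirections = 5


def make_opener(proxy=None):
    """Build an opener that follows at most 5 redirects.

    proxy is a hypothetical optional argument such as
    'http://aaa:aaa@11.11.11.11:8080'; None keeps the default proxy handling.
    """
    handlers = [LimitedRedirectHandler()]
    if proxy is not None:
        # Route both http and https requests through the given proxy.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    return urllib.request.build_opener(*handlers)


def get_url_content(url, proxy=None, timeout=10):
    """Fetch url and return the raw response body as bytes."""
    opener = make_opener(proxy)
    with opener.open(url, timeout=timeout) as response:
        return response.read()


# Usage, mirroring the post:
# content = get_url_content("http://www.itpub.net")
# print(content)
```

Passing a subclass of HTTPRedirectHandler to build_opener replaces the default redirect handler rather than adding a second one, so the limit of 5 is the one that applies.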

[ Last edited by howklp on 2008-6-13 16:33 ]