[Writing a Web Crawler in Python] Ways to Fetch HTML

Federation · posted on 2009-5-31 16:10

Bookmarked~~

InfoSpherer · posted on 2009-6-1 09:49

Bookmarked~~~

bubill · posted on 2009-6-19 21:48

Where does the fetched web page content get saved?

bubill · posted on 2009-6-19 22:31

# -*- coding: utf-8 -*-
# WebPageContent.py for Python 2.5.4
'''Grabbing WebPageContent'''

import urllib

def getWebPageContent(url):
   f = urllib.urlopen(url)
   data = f.read()
   f.close()
   return data

url = 'http://www.itpub.com'
content = getWebPageContent(url)

# Save the fetched page to the file WebPageContent.txt

WebPageContent = open('G:\\WebPageContent.txt', 'a')
print >>WebPageContent, content
WebPageContent.close()

# I declared utf-8, but the output file ends up as ANSI. I have no idea how to fix this!?
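
For readers on a current interpreter, here is a rough Python 3 sketch of the same fetch-and-save step. It is not bubill's code: the names get_web_page_content and save_bytes are made up for illustration, and urllib.request stands in for the Python 2 urllib module. One point it tries to make visible: the "# -*- coding: utf-8 -*-" line only declares the encoding of the source file itself. The bytes returned by read() stay in whatever encoding the server sent, so a file written from them keeps that encoding (on a Chinese site of that era this was quite possibly GBK, which Windows editors label "ANSI").

```python
# Python 3 sketch; the post above targets Python 2.5.4.
# Function names here are hypothetical, chosen only for this example.
import urllib.request


def get_web_page_content(url):
    """Return the raw, undecoded bytes served at `url`."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def save_bytes(path, data):
    """Write `data` to `path` unchanged.

    Binary mode ("wb") means no newline translation and no re-encoding,
    so the file ends up in the same encoding as the fetched page.
    """
    with open(path, "wb") as out:
        out.write(data)
```

Something like save_bytes('WebPageContent.txt', get_web_page_content('http://www.itpub.com')) would mirror what the post does, except that it overwrites the file instead of appending to it.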

47009356 · posted on 2009-6-28 17:08

Bookmarked~~~

lkwt1982 · posted on 2009-7-24 17:14

Looks like Python really is powerful. Worth learning!

laou2008 · posted on 2009-8-1 23:03

Learning.

omencathay · posted on 2009-8-13 10:50

Originally posted by bubill on 2009-6-19 22:31
# -*- coding: utf-8 -*-
# WebPageContent.py for Python 2.5.4
'''Grabbing WebPageContent'''

import urllib

def getWebPageContent(url):
   f = urllib.urlopen(url)
   data = f.read()
   f.close()
   return data

url = 'http://www.itpub.com'
content = getWebPageContent(url)

# Save the fetched page to the file WebPageContent.txt

WebPageContent = open('G:\\WebPageContent.txt', 'a')
print >>WebPageContent, content
WebPageContent.close()

# I declared utf-8, but the output file ends up as ANSI. I have no idea how to fix this!?

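easy_install the chardet package.
Use chardet.detect(content) to check the encoding (it returns a dict), then transcode with content.decode(chardet.detect(content)['encoding']).encode("utf8") before saving the result to the file.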
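
The detect-then-transcode idea above could look roughly like this in Python 3. Treat it as a sketch under assumptions: guess_encoding and to_utf8 are hypothetical helper names, chardet is a third-party package that has to be installed separately, and the code falls back to a caller-supplied or default encoding when chardet is missing or cannot decide.

```python
# Python 3 sketch of "detect the encoding, then re-encode as UTF-8".
# Helper names are hypothetical; chardet is optional here.
try:
    import chardet  # third-party package
except ImportError:
    chardet = None


def guess_encoding(raw, fallback="utf-8"):
    """Return a best-guess encoding name for the bytes in `raw`."""
    if chardet is not None:
        # detect() returns a dict; its "encoding" entry may be None.
        guessed = chardet.detect(raw)["encoding"]
        if guessed:
            return guessed
    return fallback


def to_utf8(raw, source_encoding=None):
    """Decode `raw` and return the same text encoded as UTF-8 bytes.

    If `source_encoding` is given it is trusted; otherwise it is guessed.
    Undecodable bytes are replaced rather than raising an error.
    """
    encoding = source_encoding or guess_encoding(raw)
    return raw.decode(encoding, errors="replace").encode("utf-8")
```

Detection is a statistical guess and can be wrong on short inputs, so when the page declares its charset (in the Content-Type header or a meta tag), passing that value as source_encoding is usually the safer choice. The returned bytes should be written to a file opened in binary mode so nothing re-encodes them again.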

cnkiller · posted on 2009-9-7 10:56

Heh, learned something.

wxwxw2 · posted on 2014-11-22 07:15

Concise! Thanks
