Advantages of Scrapy

Scrapy's main advantages:

Uses the more readable XPath instead of regular expressions
A powerful statistics and logging system
Crawls multiple URLs at the same time
Supports an interactive shell, which makes standalone debugging easy
Middleware makes it easy to write unified filters (a minimal sketch follows this list)
Stores scraped data into a database through pipelines
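
To make the middleware point concrete, here is a minimal sketch of a downloader middleware used as a request filter. Only the process_request hook and the DOWNLOADER_MIDDLEWARES setting come from Scrapy itself; the class name, module path and blocked-host list are made-up placeholders for this example.

# middlewares.py -- hypothetical filter middleware, names are placeholders
from scrapy.exceptions import IgnoreRequest

class BlockedHostMiddleware(object):
    """Drop any request whose URL contains a blacklisted host."""
    blocked_hosts = ['ads.example.com']  # assumed list, adjust as needed

    def process_request(self, request, spider):
        for host in self.blocked_hosts:
            if host in request.url:
                # Raising IgnoreRequest drops the request before it is downloaded.
                raise IgnoreRequest('blocked host: %s' % host)
        # Returning None lets the request continue through the middleware chain.
        return None

It would then be enabled in settings.py, for example:

DOWNLOADER_MIDDLEWARES = {'dmoz.middlewares.BlockedHostMiddleware': 543}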

The following is based on development under Ubuntu 10.04 (lucid):
1 Add the following line to /etc/apt/sources.list:

deb http://archive.scrapy.org/ubuntu lucid main

2 Run the following command to import the archive key:

$ sudo curl -s http://archive.scrapy.org/ubuntu/archive.key | sudo apt-key add -

3 Update the package index and install:

$ sudo aptitude update
$ sudo aptitude install scrapy-0.12

Once Scrapy has been installed as described above:
1 Create a project named dmoz in the current directory:

$ scrapy startproject dmoz

This will create the following directories and files:
————————-

dmoz/
    scrapy.cfg
    dmoz/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

————————–
* scrapy.cfg: the project configuration file
* dmoz/: the project's Python module; you'll later import your code from here.
* dmoz/items.py: the project's items file.
* dmoz/pipelines.py: the project's pipelines file.
* dmoz/settings.py: the project's settings file.
* dmoz/spiders/: a directory where you'll later put your spiders.
2 Define the items.py file. Items are the containers that temporarily hold the scraped fields.
For example, items.py:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
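
Items behave much like Python dictionaries: fields are assigned and read by key, and only the declared fields are accepted. A quick illustration (the values are invented):

# interactive sketch -- values are made up
item = DmozItem()
item['title'] = ['Example Book']
item['link'] = ['http://www.example.com/book']
print(item['title'])        # ['Example Book']
# item['author'] = 'x'      # would raise KeyError: 'author' is not a declared field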
3 Spider files go in the spiders folder.
You can define several spiders there.
3.1 Subclass BaseSpider, for example:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dmoz.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
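
With DmozItem and DmozSpider in place, the spider is run by its name attribute (not the file name), as described in section 4 below:

$ scrapy crawl dmoz.org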
3.2 Subclass CrawlSpider, which lets you use Rule objects:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

# A bare Item() has no fields, so declare the fields this spider fills in.
class ExampleItem(Item):
    id = Field()
    name = Field()
    description = Field()

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = ExampleItem()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        return item
4 In the shell
To try out XPath selectors interactively against a URL:

$ scrapy shell url

To run a spider:

$ scrapy crawl spider_name
5 XPath
.select('...')  selects nodes, e.g. '/html', '//title/text()', '//a[contains(@href, "image")]/img/@src'
.extract()  returns the selected data as a list of unicode strings
.re('...')  applies a regular expression to the selected data and returns the matches
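
A quick sketch of how these calls chain together inside the shell started with "scrapy shell url" (the XPath expressions and results are only illustrative; response is provided by the shell, and the selector is built here explicitly):

hxs = HtmlXPathSelector(response)
hxs.select('//title/text()').extract()      # e.g. [u'Example Domain']
hxs.select('//a[contains(@href, "image")]/img/@src').extract()  # image URLs on the page
hxs.select('//title/text()').re(r'(\w+)')   # regex applied to the selected text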
6 Storing the scraped data
6.1 Small amounts of data:
The supported feed formats are json, jsonlines, csv and xml.

$ scrapy crawl dmoz.org --set FEED_URI=item.json --set FEED_FORMAT=json

FEED_URI accepts several storage backends:
Local filesystem: file:///tmp/export.csv
FTP: ftp://user:pass@ftp.example.com/path/to/export.csv
S3: s3://mybucket/path/to/export.csv
Standard output: stdout:
The URI can also contain parameters such as %(time)s and %(name)s, e.g. ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json
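
The same feed export options can go into settings.py instead of being passed with --set on every run; a minimal sketch (the file name and format are only examples):

# settings.py -- feed export settings (Scrapy 0.12-era names)
FEED_URI = 'item.json'     # any of the URI schemes listed above works here
FEED_FORMAT = 'json'       # 'json', 'jsonlines', 'csv' or 'xml'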
6.2 Large amounts of data:
Use an item pipeline (the pipelines.py file).
7 The pipeline file
A pipeline can store items in a database, and it can also act as a filter.
Add the following to settings.py:

ITEM_PIPELINES = ['dmoz.pipelines.DmozPipelines']

For example, pipelines.py:

from scrapy.exceptions import DropItem

class PricePipeline(object):
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
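
As a sketch of the "store in a database" use case, here is a hypothetical pipeline that writes each DmozItem into a SQLite table; the database file, table and columns are assumptions for this example, and it would be enabled the same way via ITEM_PIPELINES.

# pipelines.py -- hypothetical SQLite pipeline; adapt the schema to your items
import sqlite3

class SQLitePipeline(object):
    def __init__(self):
        # Open (or create) the database and table when the pipeline is instantiated.
        self.conn = sqlite3.connect('items.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS dmoz (title TEXT, link TEXT, desc TEXT)')

    def process_item(self, item, spider):
        # extract() returns lists, so join them into plain strings before storing.
        self.conn.execute(
            'INSERT INTO dmoz (title, link, desc) VALUES (?, ?, ?)',
            (u' '.join(item['title']), u' '.join(item['link']), u' '.join(item['desc'])))
        self.conn.commit()   # commit per item; fine for a small crawl
        return item

# settings.py
# ITEM_PIPELINES = ['dmoz.pipelines.SQLitePipeline']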

 

Original article: http://blog.libears.com/2011-06-11/python/%E5%9B%9E%E9%A1%BEscrapy
