Python Spider Study Note – Week Four/Unit Eleven/Basic Use of Scrapy

Chinese version: http://today2tmr.com/2017/07/18/python-爬虫学习笔记-第四周单元11scrapy爬虫基本使用/

First Example for Scrapy

Goal: fetch http://python123.io/ws/demo.html and save it as demo.html.

Steps:

  • STEP 1: Create a Scrapy project with the command scrapy startproject python123demo

     

    • python123demo/: Outer directory
      • scrapy.cfg: Deployment configuration for Scrapy, used when running the spider on a server and configuring the corresponding interface. No changes are needed for this example.
      • python123demo/: Custom code for the Scrapy framework
        • __init__.py: Initialization script
        • items.py: Template for Items (classes to inherit from)
        • middlewares.py: Template for Middlewares (classes to inherit from)
        • pipelines.py: Template for Pipelines (classes to inherit from)
        • settings.py: Configuration file for the Scrapy project
        • spiders/: Directory for Spider templates (classes to inherit from); holds all spiders in the project
          • __init__.py: Initialization file, no need to modify
          • __pycache__/: Cache directory, no need to modify

  • STEP 2: Generate a Scrapy spider with the command scrapy genspider demo python123.io

     

    • Content of demo.py

     

    • The spider class inherits from scrapy.Spider
    • start_urls: the initial URLs for the spider
    • parse: the method that analyzes a page. It handles the response, turns the parsed content into dictionaries, and discovers new URL requests
  • STEP 3: Configure the spider by editing demo.py

 

  • STEP 4: Run the spider to fetch the page with the command scrapy crawl demo

Full Code of demo.py

 

Use of Keyword yield

yield <==> generator

  • A generator is a function that produces values one after another.
  • Any function containing a yield statement is a generator.
  • A generator produces one value at a time; the function then freezes, and it produces the next value when it is woken up.
  • When it is woken up, its local variables hold the same values as before it froze.

e.g.
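
A minimal sketch of a generator, assuming a hypothetical function gen that yields square numbers. It is meant to show the freeze-and-resume behavior: each request runs the function up to the next yield and then pauses it.

```python
def gen(n):
    """Yield the squares of 0..n-1, one value per request."""
    for i in range(n):
        # Execution pauses here; i keeps its value until the next request.
        yield i ** 2


g = gen(5)
print(next(g))  # 0  (runs until the first yield, then freezes)
print(next(g))  # 1  (resumes with i still remembered)

# A for loop keeps waking the generator until it is exhausted.
for value in g:
    print(value, end=" ")  # 4 9 16
```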

 

Normally:
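
For comparison, a sketch of the ordinary approach, assuming a hypothetical function square. It computes every value up front and returns them together as a list, so all results exist in memory before the caller sees the first one.

```python
def square(n):
    """Return the squares of 0..n-1 all at once, as a list."""
    return [i ** 2 for i in range(n)]


for value in square(5):
    print(value, end=" ")  # 0 1 4 9 16
```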

 

  • Pros of generators
    1. Saves memory: a generator produces only one value per invocation.
    2. Responds faster: the first value is available without computing the rest.
    3. More flexible to use.
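
The points above can be illustrated with a small sketch. The function squares and the size of 100,000 elements are arbitrary choices for the illustration, and sys.getsizeof only measures the container object itself, so treat the comparison as a rough indication rather than a benchmark.

```python
import sys
from itertools import islice


def squares():
    """Yield 0, 1, 4, 9, ... without ever building a list."""
    i = 0
    while True:
        yield i ** 2
        i += 1


# Flexibility: take only as many values as needed from an endless stream.
first_five = list(islice(squares(), 5))
print(first_five)  # [0, 1, 4, 9, 16]

# Space: the list stores every element, the generator only its paused state.
as_list = [i ** 2 for i in range(100_000)]
as_gen = (i ** 2 for i in range(100_000))
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))  # True
```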

Basic Use of Scrapy

  • STEP 1: Create a project and a Spider template
  • STEP 2: Write the Spider
  • STEP 3: Write the Item Pipeline
  • STEP 4: Optimize the configuration strategy
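
As a rough illustration of STEP 3, an Item Pipeline is a plain class whose process_item method receives each item produced by the Spider. The class name TitleCleanupPipeline and the title field below are hypothetical; only the method names open_spider and process_item follow Scrapy's pipeline interface.

```python
class TitleCleanupPipeline:
    """Hypothetical pipeline: tidies the 'title' field and counts items."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.seen = 0

    def process_item(self, item, spider):
        # Called for every item; must return the item to pass it on.
        item["title"] = item["title"].strip()
        self.seen += 1
        return item


# To enable it (STEP 4), settings.py would list the class, for example:
# ITEM_PIPELINES = {"python123demo.pipelines.TitleCleanupPipeline": 300}
```

The number 300 is an ordering value: pipelines with lower values run first.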

Classes in Scrapy

  • Request Class
    • class scrapy.http.Request()
    • Represents one HTTP request
    • Generated by the Spider, executed by the Downloader
      Attribute or Method   Explanation
      .url                  URL of the Request
      .method               HTTP method of the Request, e.g. 'GET' or 'POST'
      .headers              dictionary-like headers of the Request
      .body                 body of the Request (bytes in Python 3)
      .meta                 user-added extension info, passed along between Scrapy modules
      .copy()               returns a copy of this Request
  • Response Class
    • class scrapy.http.Response()
    • Represents one HTTP response
    • Generated by the Downloader, handled by the Spider
      Attribute or Method   Explanation
      .url                  URL of the Response
      .status               HTTP status code, 200 (success) by default
      .headers              headers of the Response
      .body                 body of the Response (bytes in Python 3)
      .flags                a list of flags
      .request              the Request that produced this Response
      .copy()               returns a copy of this Response
  • Item Class
    • class scrapy.item.Item()
    • Represents content extracted from an HTML page
    • Generated by the Spider, handled by Item Pipelines
    • Dictionary-like; supports the usual dictionary operations

Ways to Extract Info in Scrapy

Mainly used in Spider module

  • Beautiful Soup
  • lxml
  • re
  • XPath Selector
  • CSS Selector

CSS Selector

  • <HTML>.css('a::attr(href)').extract()
  • Selects information by tag name and attribute

Unit Summary

  • A first Scrapy example and the project directory layout
  • yield and generators
  • Request class, Response class, Item class
  • Basic use of the CSS Selector
