Crawling the Hotcook recipe list with Scrapy

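I'm Yuya, a Hotcook evangelist. I'm currently plotting to cook my way through every Hotcook recipe. To pull that off I need a list of all the recipes, but I couldn't find any single page that collects them.

So I tried building the list myself from the official Hotcook website, using Scrapy, a Python framework.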

Who this is for

Anyone who wants to extract specific information from web pages

Installing Scrapy

The first step is installing Scrapy; for that, please see the following article.

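Related: How to install Scrapy
When I came back to Python's Scrapy for some scraping after a long break, installing it turned out to be surprisingly painful. I suspect a fair number of people will get stuck the same way, so I'm sharing how to install it.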

Creating the project

First, create the project by running the following command.

scrapy startproject hotcook

Running this command creates the following directory structure under the directory where you ran it.

hotcook
├── scrapy.cfg
└── hotcook
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__

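Creating the spider

Once the project is created, the next step is the spider. A spider is the main program that crawls a website and collects information.

I moved into the hotcook directory shown above, then ran the following command, specifying cook-healsio.jp, the official Hotcook site.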

cd hotcook
scrapy genspider recipe cook-healsio.jp

This command creates a file named recipe.py under spiders.

hotcook
├── scrapy.cfg
└── hotcook
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        └── recipe.py

Writing the code in recipe.py

Opening recipe.py shows the following generated code.

import scrapy


class RecipeSpider(scrapy.Spider):
    name = 'recipe'
    allowed_domains = ['cook-healsio.jp']
    start_urls = ['http://cook-healsio.jp/']

    def parse(self, response):
        pass

This is where I wrote the processing for the data I wanted to fetch.

import scrapy
from hotcook.items import HotcookItem


class RecipeSpider(scrapy.Spider):
    name = 'recipe'
    allowed_domains = ['cook-healsio.jp']
    start_urls = ['http://cook-healsio.jp/hotcook/HW24C/recipes/']

    def parse(self, response):
        # Request each of the 30 recipe list pages (page=1 to page=30).
        for i in range(30):
            yield scrapy.Request('http://cook-healsio.jp/hotcook/HW24C/recipes?page=' + str(i + 1), self.recipe_paging_list)

    def recipe_paging_list(self, response):
        # Collect the relative links to the individual recipe pages.
        links = response.css('.recipe_list a::attr("href")').extract()

        for link in links:
            yield scrapy.Request('http://cook-healsio.jp' + link, self.recipe_content)

    def recipe_content(self, response):
        # Pull out the recipe number, name, and ingredients, and print them.
        item = HotcookItem()
        item['recipe_no'] = response.url.split("/")[-1]
        print(item['recipe_no'])
        item['recipe_name'] = response.css('.mv_ttl::text').extract()
        print(item['recipe_name'])
        item['materials'] = response.xpath('//table/tbody/tr/td/text()').extract()
        print(item['materials'])

Let me walk through this recipe.py code piece by piece.

The parse method

    def parse(self, response):
        for i in range(30):
            yield scrapy.Request('http://cook-healsio.jp/hotcook/HW24C/recipes?page=' + str(i + 1), self.recipe_paging_list)

The Hotcook recipe list is spread across pages 1 to 30. The processing is simple: loop the page= value at the end of the URL from 1 to 30 and request each page.
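As a small illustration of the paging loop, here is a sketch that builds the same 30 list-page URLs as a plain list, so they can be inspected before any request is sent. The helper name build_page_urls and the LAST_PAGE constant are made up for this example, and the page count of 30 is simply what the site showed at the time, so it would need checking.

```python
BASE_URL = 'http://cook-healsio.jp/hotcook/HW24C/recipes'
LAST_PAGE = 30  # assumption: the list had 30 pages when this was written


def build_page_urls(base_url, last_page):
    """Return the list-page URLs for page=1 through page=last_page."""
    return [f'{base_url}?page={page}' for page in range(1, last_page + 1)]


page_urls = build_page_urls(BASE_URL, LAST_PAGE)
print(page_urls[0])
print(len(page_urls))
```

Inside the spider, parse could loop over a list like this in place of range(30).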

recipe_paging_list

recipe_paging_list, which parse passes as the callback to each request it yields, is the next step. Each of the 30 list pages carries the URLs of 12 recipes, and this method collects those 12 URLs.

    def recipe_paging_list(self, response):
        links = response.css('.recipe_list a::attr("href")').extract()

        for link in links:
            yield scrapy.Request('http://cook-healsio.jp' + link, self.recipe_content)

This is probably the hardest part to follow.

links = response.css('.recipe_list a::attr("href")').extract() fetches the href attribute of every a tag inside the element with the class recipe_list.

The URLs obtained here are relative paths, so 'http://cook-healsio.jp' + link joins them to the host to build the URL for the next crawl.
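String concatenation works here because every link starts with a slash. A slightly more forgiving way to do the same join is urljoin from the standard library, sketched below with a made-up recipe path (R0001 is a placeholder, not a real recipe number). Scrapy also offers response.urljoin(link), which resolves a link against the page URL in the same way.

```python
from urllib.parse import urljoin

# The list page the link was found on.
page_url = 'http://cook-healsio.jp/hotcook/HW24C/recipes?page=1'

# Hypothetical relative link, standing in for a value taken from the page.
link = '/hotcook/HW24C/recipes/R0001'

absolute_url = urljoin(page_url, link)
print(absolute_url)

# A link that is already absolute is passed through unchanged.
unchanged_url = urljoin(page_url, 'http://cook-healsio.jp/hotcook/')
print(unchanged_url)
```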

To pull data out of tags you use response.css or response.xpath. For how to write the selectors, the following site was helpful.

Commonly used XPath and CSS selectors in Scrapy

python.civic-apps.com

The URLs obtained here are then used to crawl further.

recipe_content

This method extracts the recipe number, recipe name, and ingredients from each individual recipe page.

    def recipe_content(self, response):
        item = HotcookItem()
        # Recipe number: the last segment of the page URL.
        item['recipe_no'] = response.url.split("/")[-1]
        print(item['recipe_no'])
        # Recipe name: the text of the element with class mv_ttl.
        item['recipe_name'] = response.css('.mv_ttl::text').extract()
        print(item['recipe_name'])
        # Ingredients: the text of every table cell.
        item['materials'] = response.xpath('//table/tbody/tr/td/text()').extract()
        print(item['materials'])

The last segment of the URL is the recipe number, so response.url.split("/")[-1] picks it up.
The recipe name is wrapped in an element with the class mv_ttl, so I got it with response.css('.mv_ttl::text').extract().
The ingredients are inside a table tag, so I got them with response.xpath('//table/tbody/tr/td/text()').extract().
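Taking the last element of split("/") is enough when the URL ends with the recipe number. As a sketch of a slightly more defensive version, the helper below (extract_recipe_no is a made-up name) ignores a trailing slash and any query string. The URLs are placeholders.

```python
from urllib.parse import urlsplit


def extract_recipe_no(url):
    """Return the last non-empty path segment of url, or '' if there is none."""
    path = urlsplit(url).path  # the path only, without query string or fragment
    segments = [segment for segment in path.split('/') if segment]
    return segments[-1] if segments else ''


# Placeholder URLs; R0001 is not a real recipe number.
plain = extract_recipe_no('http://cook-healsio.jp/hotcook/HW24C/recipes/R0001')
decorated = extract_recipe_no('http://cook-healsio.jp/hotcook/HW24C/recipes/R0001/?ref=list')
print(plain, decorated)
```

Note also that extract() returns a list, so recipe_name and materials are printed as lists. If only the first match is wanted as a string, the selector's get() method does that.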

To be able to set values on item here, I set up items.py.

Configuring items.py

In items.py I defined the fields that hold the values above.

import scrapy


class HotcookItem(scrapy.Item):
    # define the fields for your item here like:
    recipe_no = scrapy.Field()
    recipe_name = scrapy.Field()
    materials = scrapy.Field()

    pass

To use items.py you need to add one line to the imports in recipe.py. I forgot to, and was stuck for a while on a name 'HotcookItem' is not defined error that would not go away.

import scrapy

from hotcook.items import HotcookItem

In the end, despite the effort, I barely used this item and instead built the list from the output that print writes to the console.

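Configuring settings.py

With the code written, the next step is the run configuration in settings.py.

The settings are commented out, so remove the "#" only from the ones you use. I set the download delay to 1 second.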

DOWNLOAD_DELAY = 1
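For reference, here is a sketch of a few related politeness settings that appear in the generated settings.py. Apart from DOWNLOAD_DELAY, the values are illustrative and are not the ones used for this crawl.

```python
# settings.py (excerpt)

# Wait 1 second between consecutive requests to the same site.
DOWNLOAD_DELAY = 1

# Respect the rules in the site's robots.txt.
ROBOTSTXT_OBEY = True

# Assumption for illustration: let Scrapy adapt the delay to server load.
AUTOTHROTTLE_ENABLED = True
```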

Running the spider

If the code above is correct, the following crawl command runs the crawl.

scrapy crawl recipe

Once I had confirmed that it ran properly, I dumped the output to a text file with the following command.

scrapy crawl recipe --nolog >> output.txt

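The output is a plain-text dump of the values printed by recipe_content: each recipe's number, name, and ingredient list.

With the data in this form, Excel can reshape it however I like, so I was satisfied and stopped here.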
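If you would rather hand Excel a CSV than raw text, the printed output can be converted with a few lines of standard-library Python. This is only a sketch: it assumes output.txt holds exactly three lines per recipe in the order printed by recipe_content (number, name list, ingredient list), and the sample text below is invented.

```python
import ast
import csv
import io

# Invented sample in the shape that the three print calls produce.
sample_output = """R0001
['Sample curry']
['onion', '1', 'carrot', '2']
R0002
['Sample stew']
['potato', '3']
"""


def output_to_rows(text):
    """Group the printed lines in threes and turn each group into one row."""
    lines = [line for line in text.splitlines() if line.strip()]
    rows = []
    for i in range(0, len(lines) - 2, 3):
        recipe_no = lines[i].strip()
        # The name and ingredients were printed as Python lists, so parse them back.
        name = ''.join(ast.literal_eval(lines[i + 1]))
        materials = ' / '.join(ast.literal_eval(lines[i + 2]))
        rows.append([recipe_no, name, materials])
    return rows


rows = output_to_rows(sample_output)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['recipe_no', 'recipe_name', 'materials'])
writer.writerows(rows)
print(buffer.getvalue())
```

To write a real file, swap io.StringIO() for open('recipes.csv', 'w', newline='', encoding='utf-8-sig'); the file name is just an example, and the utf-8-sig encoding helps Excel recognize the text as UTF-8.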

Properly speaking, you would use a pipeline or similar to save the data to a database, but I had neither the need nor the energy, so I skipped that.
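For readers curious what that might look like, here is a minimal sketch of an item pipeline that stores each recipe in SQLite. It is not something this article ran. The class name, table layout, and database file name are all made up for illustration, and it assumes recipe_content is changed to yield item and that the class is registered under ITEM_PIPELINES in settings.py.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline: one row per recipe in a local SQLite file."""

    def __init__(self, db_path='recipes.db'):
        self.db_path = db_path  # assumed file name

    def open_spider(self, spider):
        # Called once when the crawl starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS recipes ('
            'recipe_no TEXT PRIMARY KEY, recipe_name TEXT, materials TEXT)'
        )

    def process_item(self, item, spider):
        # extract() returns lists, so join them into single strings.
        self.conn.execute(
            'INSERT OR REPLACE INTO recipes VALUES (?, ?, ?)',
            (
                item['recipe_no'],
                ''.join(item['recipe_name']).strip(),
                ', '.join(m.strip() for m in item['materials'] if m.strip()),
            ),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the crawl ends.
        self.conn.close()


# Quick check outside Scrapy, with a placeholder item and an in-memory database.
pipeline = SQLitePipeline(':memory:')
pipeline.open_spider(None)
pipeline.process_item(
    {'recipe_no': 'R0001', 'recipe_name': ['Sample curry'],
     'materials': ['onion', ' ', 'carrot']},
    None,
)
rows = pipeline.conn.execute('SELECT * FROM recipes').fetchall()
print(rows)
pipeline.close_spider(None)
```

In a real project the class would live in pipelines.py, and the check at the bottom would be dropped, since Scrapy calls open_spider, process_item, and close_spider itself.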

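Summary

This time I collected Hotcook recipes by crawling with Scrapy, a Python framework.

I felt the stumbling points with Scrapy are that there are several files to maintain, and that the CSS and XPath selector syntax for pulling data out of tags is hard to grasp.

Jumping straight into complex processing makes it likely you'll give up, so I recommend starting with something simple.

That's it for this time.

Thanks for reading.
See you! (^^)/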