I. Page information
II. Inspect the page and locate the target content
III. Write an ordinary crawler based on the page structure
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
}
url = 'http://tuijian.hao123.com/'
response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'

soup = BeautifulSoup(response.text, 'html.parser')
# The category links sit in the first <ul> inside <div class="v2-nav">
list_div = soup.find('div', class_='v2-nav')
ul_tags = list_div.find_all('ul')[0]
li_tags = ul_tags.find_all('li')
for li in li_tags:
    a_tag = li.find('a')
    if a_tag:
        title = a_tag.text
        href = a_tag['href']
        # Keep only these categories: Entertainment, Sports, Finance, Technology, History
        if title in ["娱乐", "体育", "财经", "科技", "历史"]:
            print(f"{title}: {href}")
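One fragile spot in the script above is the exact-match test on the link text: a title with stray whitespace would be skipped, and a relative href would be printed as-is. The sketch below isolates that filtering step so it can be checked without a network request. The function name pick_categories and the sample paths are hypothetical and are not taken from the real page.

```python
from urllib.parse import urljoin

BASE_URL = "http://tuijian.hao123.com/"
# Entertainment, Sports, Finance, Technology, History
WANTED = {"娱乐", "体育", "财经", "科技", "历史"}


def pick_categories(links, wanted=WANTED, base_url=BASE_URL):
    """Return {title: absolute_url} for the (title, href) pairs we want to keep."""
    picked = {}
    for title, href in links:
        title = title.strip()  # tolerate whitespace around the link text
        if title in wanted and href:
            picked[title] = urljoin(base_url, href)  # resolve relative links
    return picked


# Hypothetical sample data standing in for the parsed <a> tags.
sample = [
    (" 娱乐 ", "/yule"),
    ("体育", "http://tuijian.hao123.com/tiyu"),
    ("首页", "/"),  # "Home": not in WANTED, so it is dropped
]
print(pick_categories(sample))
```

In the crawler, the pairs would come from `(a_tag.text, a_tag.get('href'))` for each link found in the navigation list.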
IV. Create the Scrapy project haohao
1. Change to the target directory and run: scrapy startproject haohao
2. Result
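A quick way to confirm that the project was generated correctly is to check for the files that scrapy startproject normally creates. The list below reflects the standard template for a project named haohao; the exact set may differ slightly between Scrapy versions, and missing_files is a hypothetical helper introduced only for this check.

```python
from pathlib import Path

# Files usually created by `scrapy startproject haohao`,
# relative to the outer haohao/ directory.
EXPECTED = [
    "scrapy.cfg",
    "haohao/__init__.py",
    "haohao/items.py",
    "haohao/middlewares.py",
    "haohao/pipelines.py",
    "haohao/settings.py",
    "haohao/spiders/__init__.py",
]


def missing_files(project_root):
    """Return the expected files that are absent under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("haohao")
    print("missing:", missing if missing else "none")
```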
V. Create the spider file haotuijian.py
1. Change to the project directory and run: scrapy genspider haotuijian http://tuijian.hao123.com/
2. Result: the file haotuijian.py appears in the project's spiders directory
VI. Write the spider code and configure the related files
1. haotuijian.py
import scrapy
from bs4 import BeautifulSoup
from ..items import HaohaoItem


class HaotuijianSpider(scrapy.Spider):
    name = 'haotuijian'
    allowed_domains = ['tuijian.hao123.com']
    start_urls = ['http://tuijian.hao123.com/']

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        list_div = soup.find('div', class_='v2-nav')
        ul_tags = list_div.find_all('ul')[0]
        li_tags = ul_tags.find_all('li')
        for li in li_tags:
            a_tag = li.find('a')
            if a_tag:
                title = a_tag.text
                href = a_tag['href']
                # Entertainment, Sports, Finance, Technology, History
                if title in ["娱乐", "体育", "财经", "科技", "历史"]:
                    # Create a HaohaoItem instance to carry the data to the pipelines
                    item = HaohaoItem()
                    item['title'] = title
                    item['href'] = href
                    yield item
2. items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class HaohaoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()
3. pipelines.py (saves the data to MongoDB, MySQL, and Excel)
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from pymongo import MongoClient
import openpyxl
import pymysql


# Save to MongoDB
class HaohaoPipeline:
    def __init__(self):
        self.client = MongoClient('mongodb://localhost:27017/')
        self.db = self.client['qiangzi']
        self.collection = self.db['hao123']
        self.data = []

    def close_spider(self, spider):
        # Flush whatever is still buffered before closing the connection
        if len(self.data) > 0:
            self._write_to_db()
        self.client.close()

    def process_item(self, item, spider):
        self.data.append({
            'title': item['title'],
            'href': item['href'],
        })
        # Write to the database in batches of 100 items
        if len(self.data) == 100:
            self._write_to_db()
        return item

    def _write_to_db(self):
        self.collection.insert_many(self.data)
        self.data.clear()


# Save to MySQL
class MysqlPipeline:
    def __init__(self):
        # Replace the connection details with your own
        self.conn = pymysql.connect(host='localhost', port=3306, user='root',
                                    password='789456MLq', db='pachong', charset='utf8mb4')
        self.cursor = self.conn.cursor()
        self.data = []

    def close_spider(self, spider):
        if len(self.data) > 0:
            self._writer_to_db()
        self.conn.close()

    def process_item(self, item, spider):
        self.data.append((item['title'], item['href']))
        if len(self.data) == 100:
            self._writer_to_db()
        return item

    def _writer_to_db(self):
        self.cursor.executemany(
            'insert into haohao (title,href) values (%s,%s)',
            self.data
        )
        self.conn.commit()
        self.data.clear()


# Save to Excel
class ExcelPipeline:
    def __init__(self):
        self.wb = openpyxl.Workbook()
        self.ws = self.wb.active
        self.ws.title = 'haohao'
        self.ws.append(('title', 'href'))  # header row

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        self.wb.save('haohao.xlsx')

    def process_item(self, item, spider):
        self.ws.append((item['title'], item['href']))
        return item
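Both database pipelines above follow the same pattern: buffer items in a list, write once the buffer reaches 100 entries, and flush the remainder in close_spider. Here is a minimal, database-free sketch of that pattern. BatchBuffer and flush_fn are names made up for illustration, a plain list stands in for the database, and the example.com URLs are placeholder data.

```python
class BatchBuffer:
    """Collect items and hand them to flush_fn in batches."""

    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.items = []

    def add(self, item):
        self.items.append(item)
        if len(self.items) >= self.batch_size:
            self.flush()

    def flush(self):
        # Skip empty flushes: bulk-insert calls typically reject an empty list.
        if self.items:
            self.flush_fn(list(self.items))  # pass a copy, then reset the buffer
            self.items.clear()


batches = []  # stands in for the database
buffer = BatchBuffer(batches.append, batch_size=100)
for i in range(250):
    buffer.add({"title": f"title-{i}", "href": f"http://example.com/{i}"})
buffer.flush()  # the role close_spider plays in the pipelines
print([len(batch) for batch in batches])  # → [100, 100, 50]
```

Keeping the clear inside flush means callers cannot forget it, which is why the buffer is reset in one place only.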
4. settings.py configuration
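For the three pipelines in pipelines.py to run, they have to be registered in settings.py. The fragment below is a sketch of what that registration might look like, not the author's exact configuration. The class paths follow from the project name haohao and the class names above; the priority numbers and the reuse of the User-Agent string from section III are assumptions.

```python
# settings.py (fragment)

# Assumption: reuse the browser User-Agent from the plain crawler in section III.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
)

# Lower numbers run first; the specific values here are arbitrary.
ITEM_PIPELINES = {
    "haohao.pipelines.HaohaoPipeline": 300,  # MongoDB
    "haohao.pipelines.MysqlPipeline": 301,   # MySQL
    "haohao.pipelines.ExcelPipeline": 302,   # Excel
}
```

Each pipeline returns the item from process_item, so every item passes through all three in priority order.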
VII. Run the code
1. Change to the project directory and run: scrapy crawl haotuijian
2. Execution process
3. Results
(1) Excel: haohao.xlsx
(2) MySQL: haohao (the table must be created in advance)
(3) MongoDB: hao123
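The MySQL table mentioned in (2) has to exist before the crawl, because MysqlPipeline only issues INSERT statements. A sketch of creating it from Python follows. The column names title and href come from the pipeline's INSERT statement; the id column, the column types, and the lengths are assumptions, and create_table is a hypothetical helper.

```python
# Assumed schema: only the column names title and href come from the pipeline.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS haohao (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    href VARCHAR(1024) NOT NULL
) CHARACTER SET utf8mb4
"""


def create_table(conn):
    """Create the haohao table on an open DB-API connection (e.g. pymysql)."""
    with conn.cursor() as cursor:
        cursor.execute(CREATE_TABLE_SQL)
    conn.commit()


# Usage (needs the pymysql package and a running MySQL server):
#   import pymysql
#   conn = pymysql.connect(host='localhost', port=3306, user='root',
#                          password='your-password', db='pachong',
#                          charset='utf8mb4')
#   create_table(conn)
#   conn.close()
```

Because of IF NOT EXISTS, running it again before later crawls is harmless.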
VIII. Supplementary notes
1. Create a main.py file and write the launcher code
2. Run main.py directly
3. The result is the same as running the command (the run log is printed in red, but the run can be paused at any time, just like ordinary Python code)
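The main.py from step 1 is commonly written with scrapy.cmdline.execute, which runs the same command as the terminal from inside Python. The following is a sketch under that assumption rather than the author's exact file. build_argv is a hypothetical helper, and the script assumes it is run from the directory that contains scrapy.cfg.

```python
from pathlib import Path


def build_argv(spider_name):
    """Return the argument list for `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


def main(spider_name="haotuijian"):
    # Scrapy locates the project through scrapy.cfg, so check for it first.
    if not Path("scrapy.cfg").is_file():
        print("scrapy.cfg not found: run main.py from the project directory")
        return
    from scrapy import cmdline  # imported here so the check above runs first
    # Equivalent to typing `scrapy crawl haotuijian` in the terminal.
    cmdline.execute(build_argv(spider_name))


if __name__ == "__main__":
    main()
```

Running the crawl this way lets an IDE's run, pause, and debug controls be used on it like any other script.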