Scraping JD.com with Python

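Selenium was originally a tool for testing web applications. Selenium tests run directly in a browser, just as a real user would operate it. Many crawler engineers now choose Selenium to get around anti-scraping mechanisms. Because Selenium works by launching and driving a real browser, the price is a crawler that runs much more slowly than one issuing plain HTTP requests.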

This lab shows how to use Selenium to fetch and parse page data.

1. Attachment 1 (chrome_Xpath)
2. Attachment 2 (http://chromedriver.storage.googleapis.com/index.html [pick the release closest to your Chrome version])

3. Key concepts

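# selenium.webdriver.support.expected_conditions is a submodule of Selenium
# Purpose: it checks conditions on the page, such as whether an element exists or is clickable;
# it is generally used in assertions or together with WebDriverWait.
# Below is an example of expected_conditions used together with WebDriverWait
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.baidu.com')

# Wait up to 10 s; as soon as the element is located and visible, continue with the
# following code, otherwise raise a TimeoutException after 10 s.
# A locator tuple is passed here so that the lookup itself happens inside the wait.
element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, 'kw')))
element.send_keys('新梦想软件测试')

# Summary of the expected_conditions module
# Check whether the page title exactly equals the expected value; returns a boolean
WebDriverWait(driver, 10).until(EC.title_is("百度一下，你就知道"))
# Check whether the page title contains the expected string; returns a boolean
WebDriverWait(driver, 10).until(EC.title_contains('new'))
# Check whether the current URL exactly equals the expected value; returns a boolean
WebDriverWait(driver, 10).until(EC.url_to_be('https://www.baidu.com/'))
# Check whether the current URL contains the expected string; returns a boolean
WebDriverWait(driver, 10).until(EC.url_contains('baidu'))
# Check whether the current URL matches a regular expression; returns a boolean
WebDriverWait(driver, 10).until(EC.url_matches('.+baidu.+'))
# Check whether the element is present in the DOM; returns the element object
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'kw')))
# Check whether an already located element is visible; returns the element object
WebDriverWait(driver, 10).until(EC.visibility_of(driver.find_element(By.ID, 'kw')))
# Check whether the element contains the given text; returns a boolean
WebDriverWait(driver, 10).until(EC.text_to_be_present_in_element((By.NAME, 'tj_trnews'), '新闻'))
# Check whether the frame can be switched into; if so, switch into it and return True
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//iframe')))
# Check whether an element is visible and clickable; returns the element if so, otherwise False
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.NAME, 'tj_trnews')))
# Check whether an element is selected; typically used with drop-down options or checkboxes
WebDriverWait(driver, 10).until(EC.element_to_be_selected(driver.find_element(By.XPATH, '//input[@type="checkbox"]')))
# Check whether an alert is present; if so, switch to it and return the alert object
WebDriverWait(driver, 10).until(EC.alert_is_present())
# Note the parameters and return values above: some conditions take an element object,
# others take a locator tuple such as (By.NAME, 'tj_trnews')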
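
To make the relationship between WebDriverWait and expected_conditions more concrete, here is a minimal, browser-free sketch of the polling idea: a condition is called repeatedly until it returns something truthy or the timeout expires. The names wait_until, WaitTimeout, title_contains and FakeDriver are hypothetical and exist only for this illustration; this is not Selenium's own implementation, which among other things also ignores selected exceptions while polling.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(driver, condition, timeout=10.0, poll=0.5):
    """Call condition(driver) until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition(driver)
        if value:
            return value
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll)


def title_contains(text):
    """Build a condition in the style of EC.title_contains (illustrative only)."""
    def _condition(driver):
        return text in driver.title
    return _condition


class FakeDriver:
    """Stand-in for a browser: the title changes on the third read."""

    def __init__(self):
        self.reads = 0

    @property
    def title(self):
        self.reads += 1
        return "loading" if self.reads < 3 else "baidu search"


if __name__ == "__main__":
    driver = FakeDriver()
    # The condition is polled three times before it succeeds.
    print(wait_until(driver, title_contains("baidu"), timeout=2, poll=0.01))
```

The point of the design is that a condition is just a callable that receives the driver, which is why some of the conditions listed above take a locator tuple (the lookup is repeated on every poll) while others take an element that was located beforehand.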

4. Environment setup

Attachment download: http://chromedriver.storage.googleapis.com/index.html

Environment variables:

1. The JD home page is https://www.jd.com/

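Open the JD address in the Chrome browser. Once the page has loaded, type “笔记本” (laptop) into the search box and click the search button. The URL changes to: https://search.jd.com/Search?keyword=%E7%AC%94%E8%AE%B0%E6%9C%AC&enc=utf-8&wq=%E7%AC%94%E8%AE%B0%E6%9C%AC&pvid=6159d4eb2d204c45852f8a6e30d87f38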
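
The %E7%AC%94... sequences in this URL are simply the UTF-8 bytes of the keyword, percent-encoded. The sketch below rebuilds the query string with the standard library to show this. The helper names build_search_url and keyword_from_url are hypothetical, the pvid parameter is deliberately left out, and whether JD accepts a URL without it is an assumption that this lab does not verify.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

SEARCH_ENDPOINT = "https://search.jd.com/Search"


def build_search_url(keyword):
    """Return a search URL with the keyword percent-encoded as UTF-8."""
    query = urlencode({"keyword": keyword, "enc": "utf-8", "wq": keyword})
    return f"{SEARCH_ENDPOINT}?{query}"


def keyword_from_url(url):
    """Decode the keyword parameter back out of a search URL."""
    return parse_qs(urlsplit(url).query)["keyword"][0]


if __name__ == "__main__":
    print(quote("笔记本"))  # the percent-encoded form seen in the address bar
    url = build_search_url("笔记本")
    print(url)
    print(keyword_from_url(url))
```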

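It is easy to see that this page loads dynamically: when first opened it shows only 30 products, and scrolling down triggers loading of the remaining 30. We can therefore use Selenium to simulate scrolling the page down and so obtain every product on the page.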
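
One way to sketch the scrolling step is to scroll to the bottom repeatedly until the page height stops growing, instead of sleeping for a fixed time. The function name scroll_until_stable and the FakePage stub are hypothetical; the only driver method assumed is execute_script, which Selenium's WebDriver provides. The stub stands in for a browser so the loop can be followed without opening one.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=10):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give the lazily loaded items time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we are at the real bottom
        last_height = new_height
    return rounds


class FakePage:
    """Stub driver: the page grows once, after the first scroll."""

    def __init__(self):
        self.height = 3000
        self.scrolls = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.height
        self.scrolls += 1
        if self.scrolls == 1:
            self.height = 6000  # the second batch of products has loaded
        return None


if __name__ == "__main__":
    print(scroll_until_stable(FakePage(), pause=0))
```

With a real driver, pause would need to be long enough for the second batch of products to load; the right value depends on the network and is not something this sketch can determine.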

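Press Ctrl+Shift+I to open the developer tools, then Ctrl+Shift+C to activate the element picker. Clicking “下一页” (Next page) on the page now jumps to the corresponding location in the source. The element carries no real link; its source is as follows: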

<a class="pn-next" onclick="SEARCH.page(3, true)" href="javascript:;" title="使用方向键右键也可翻到下一页哦！">
<em>下一页</em>
<i>></i>
</a>

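It does, however, carry the attributes onclick="SEARCH.page(3, true)" and href="javascript:;". An href of the form javascript:... runs a piece of JavaScript when the link's default action fires, and javascript:; is an empty statement, so the link itself navigates nowhere. The onclick attribute is what does the work: it calls the JS function SEARCH.page(3, true). The simpler approach here is to let Selenium imitate the browser's paging: scroll to the bottom of the page, locate the “下一页” button, and click it. Constructing the URL of each results page directly is also possible but more troublesome, because it requires understanding SEARCH.page(3, true). We also observe that the search results span 100 pages in total.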
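
If the URL-construction route is explored, a first step would be to read the argument out of the onclick attribute. The sketch below does only that, with a regular expression; next_page_argument is a hypothetical helper, and what the number means inside SEARCH.page is not established here, so the code extracts it without interpreting it. With Selenium the attribute string itself would come from element.get_attribute('onclick').

```python
import re

# Matches the first numeric argument of a call such as SEARCH.page(3, true)
PAGE_CALL = re.compile(r"SEARCH\.page\(\s*(\d+)\s*,")


def next_page_argument(onclick):
    """Return the first argument of SEARCH.page(...) as an int, or None."""
    match = PAGE_CALL.search(onclick or "")
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(next_page_argument("SEARCH.page(3, true)"))
    print(next_page_argument(""))
```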

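Fetching the data

Because Selenium drives a real browser, we can read the page data directly, without caring whether the request is a POST or a GET, or whether it needs extra parameters or headers. We only need to parse the page that Selenium has loaded.

We parse each results page to obtain the data we need (elements are selected with Selenium):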

Product ID: browser.find_elements(By.XPATH, '//li[@data-sku]'), used to build the product's link;

Product price: browser.find_elements(By.XPATH, '//div[@class="gl-i-wrap"]/div[2]/strong/i');

Product name: browser.find_elements(By.XPATH, '//div[@class="gl-i-wrap"]/div[3]/a/em');

Number of comments: browser.find_elements(By.XPATH, '//div[@class="gl-i-wrap"]/div[4]/strong').
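
The four locators above each return a separate list, so the lists have to be combined into one record per product. The sketch below shows this step on plain strings. Product and build_products are hypothetical names, and the values in the demo are placeholders, not real JD data. The length check is an addition for illustration: zip silently stops at the shortest list, so if one product lacked a price the remaining columns would shift without any error.

```python
from typing import List, NamedTuple


class Product(NamedTuple):
    link: str
    price: str
    name: str
    comments: str


def build_products(skus, prices, names, comments) -> List[Product]:
    """Combine the four parallel lists into one record per product."""
    lengths = {len(skus), len(prices), len(names), len(comments)}
    if len(lengths) != 1:
        # Refuse to pair up columns that may be misaligned.
        raise ValueError(f"column lengths differ: {sorted(lengths)}")
    return [
        Product(f"https://item.jd.com/{sku}.html", price, name, count)
        for sku, price, name, count in zip(skus, prices, names, comments)
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    rows = build_products(
        ["1000001", "1000002"],
        ["4999.00", "5299.00"],
        ["laptop A", "laptop B"],
        ["10+", "20+"],
    )
    for row in rows:
        print(row)
```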

With the analysis done, we can now write the code.

First, import the required modules

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import selenium.common.exceptions
import json
import csv
import time

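The new class JdSpider

The JdSpider class defines the following methods:

def open_file(self): asks the user which text format to save as (txt, json, or csv) and opens the file so that data can be written later;

def open_browser(self): creates the driver and sets the implicit and explicit wait times;

def init_variable(self): initialises the instance attributes: data holds the data parsed from a page, isLast indicates whether the last page has been reached, and count records the number of pages crawled;

def parse_page(self): uses selenium.webdriver.support.expected_conditions to wait for the elements and extracts the product ID, price, name, and number of comments;

def turn_page(self): uses Selenium to click through the pages of the JD search results for “笔记本”, so that each page's data can be fetched in turn;

def write_to_file(self): writes the contents of data to the text file;

def close_file(self): closes the text file once crawling has finished;

def close_browser(self): quits the driver, i.e. the Chrome browser, once crawling has finished;

def crawl(self): runs the whole crawl by calling the methods above in order.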

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import selenium.common.exceptions
import json
import csv


class JdSpider():
    # Create the output file; the user chooses json, txt, or csv
    def open_file(self):
        self.fm = input('Enter the output format (json, txt, csv): ')
        while self.fm not in ('json', 'txt', 'csv'):
            self.fm = input('Invalid input, please enter json, txt, or csv: ')
        if self.fm == 'txt':
            self.fd = open('Jd.txt', 'w', encoding='utf-8')
        elif self.fm == 'json':
            self.fd = open('Jd.json', 'w', encoding='utf-8')
        elif self.fm == 'csv':
            # newline='' is what the csv module expects; it avoids blank rows on Windows
            self.fd = open('Jd.csv', 'w', encoding='utf-8', newline='')

    # Open the browser
    def open_browser(self):
        self.browser = webdriver.Chrome()
        self.browser.implicitly_wait(10)
        self.wait = WebDriverWait(self.browser, 10)

    # Initialise the crawl state
    def init_variable(self):
        self.data = zip()
        self.isLast = False
        self.count = 0

    # Extract the data from the current page
    def parse_page(self, retries=3):
        try:
            skus = self.wait.until(EC.presence_of_all_elements_located((By.XPATH, '//li[@class="gl-item"]')))
            skus = [item.get_attribute('data-sku') for item in skus]
            links = ['https://item.jd.com/{sku}.html'.format(sku=item) for item in skus]
            prices = self.wait.until(
                EC.presence_of_all_elements_located((By.XPATH, '//div[@class="gl-i-wrap"]/div[2]/strong/i')))
            prices = [item.text for item in prices]
            names = self.wait.until(
                EC.presence_of_all_elements_located((By.XPATH, '//div[@class="gl-i-wrap"]/div[3]/a/em')))
            names = [item.text for item in names]
            comments = self.wait.until(
                EC.presence_of_all_elements_located((By.XPATH, '//div[@class="gl-i-wrap"]/div[4]/strong')))
            comments = [item.text for item in comments]
            self.data = zip(links, prices, names, comments)
        except selenium.common.exceptions.TimeoutException:
            print('parse_page: timeout')
            if retries > 0:
                self.parse_page(retries - 1)
            else:
                # Give up on this page instead of recursing without limit
                self.data = zip()
        except selenium.common.exceptions.StaleElementReferenceException:
            # The page changed while it was being read; reload it (this page is skipped)
            print('parse_page: stale element reference')
            self.browser.refresh()
    # Go to the next page
    def turn_page(self, retries=3):
        try:
            self.wait.until(EC.element_to_be_clickable((By.XPATH, '//a[@class="pn-next"]'))).click()
            time.sleep(1)
            self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
            time.sleep(2)
        except selenium.common.exceptions.NoSuchElementException:
            self.isLast = True
        except selenium.common.exceptions.TimeoutException:
            # No clickable "next" link: stop after the 100 result pages or once retries run out
            if self.count >= 100 or retries <= 0:
                self.isLast = True
            else:
                print('turn_page: timeout')
                self.turn_page(retries - 1)
        except selenium.common.exceptions.StaleElementReferenceException:
            print('turn_page: stale element reference')
            self.browser.refresh()

    # Write the parsed data to the file
    def write_to_file(self):
        if self.fm == 'txt':
            for item in self.data:
                self.fd.write('------------------------------\n')
                self.fd.write('link:' + str(item[0]) + '\n')
                self.fd.write('price:' + str(item[1]) + '\n')
                self.fd.write('name:' + str(item[2]) + '\n')
                self.fd.write('comment:' + str(item[3]) + '\n')
        if self.fm == 'json':
            temp = ('link', 'price', 'name', 'comment')
            for item in self.data:
                json.dump(dict(zip(temp, item)), self.fd, ensure_ascii=False)
                # One JSON object per line, so the file can be parsed back line by line
                self.fd.write('\n')
        if self.fm == 'csv':
            writer = csv.writer(self.fd)
            for item in self.data:
                writer.writerow(item)

    # Close the file
    def close_file(self):
        self.fd.close()

    # Close the browser
    def close_browser(self):
        self.browser.quit()

    # Run the whole crawl
    def crawl(self):
        self.open_file()
        self.open_browser()
        self.init_variable()
        print('Starting crawl')
        self.browser.get('https://www.jd.com/')
        self.browser.find_element(By.ID, "key").send_keys('笔记本')
        time.sleep(1)
        self.browser.find_element(By.XPATH, '//*[@id="search"]/div/div[2]/button').click()

        time.sleep(1)
        self.browser.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        time.sleep(2)
        while not self.isLast:
            self.count += 1
            print('Crawling page ' + str(self.count))
            self.parse_page()
            self.write_to_file()
            self.turn_page()
        self.close_file()
        self.close_browser()
        print('Crawl finished')


# Entry point
if __name__ == '__main__':
    spider = JdSpider()
    spider.crawl()
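
The writing step can be tried out without a browser by sending a few records to an in-memory stream. The sketch below is a standalone variant of the three output formats, not part of JdSpider: write_records and FIELDS are hypothetical names, the sample record is a placeholder, and the json branch assumes a one-object-per-line layout so that the output can be read back line by line.

```python
import csv
import io
import json

FIELDS = ("link", "price", "name", "comment")


def write_records(records, fmt, stream):
    """Write (link, price, name, comment) tuples to stream as txt, json, or csv."""
    if fmt == "txt":
        for record in records:
            stream.write("------------------------------\n")
            for field, value in zip(FIELDS, record):
                stream.write(f"{field}:{value}\n")
    elif fmt == "json":
        for record in records:
            # One JSON object per line keeps the file parseable after many pages.
            stream.write(json.dumps(dict(zip(FIELDS, record)), ensure_ascii=False) + "\n")
    elif fmt == "csv":
        csv.writer(stream).writerows(records)
    else:
        raise ValueError(f"unsupported format: {fmt}")


if __name__ == "__main__":
    # Placeholder record for illustration only.
    sample = [("https://item.jd.com/1000001.html", "4999.00", "laptop A", "10+")]
    for fmt in ("txt", "json", "csv"):
        buffer = io.StringIO()
        write_records(sample, fmt, buffer)
        print(buffer.getvalue())
```

Passing the stream in as a parameter, instead of opening the file inside the function, is what makes the function usable with io.StringIO here and with a real file in the crawler.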
