python爬虫之利用Selenium+Requests爬取拉勾网

一、前言

利用selenium+requests访问页面爬取拉勾网招聘信息

二、分析url

观察页面可知，页面数据属于动态加载所以现在我们通过抓包工具，获取数据包

观察其url和参数

url=\"https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false\"
参数：
city=%E5%8C%97%E4%BA%AC  ==》城市
first=true  ==》无用
pn=1  ==》页数
kd=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90  ==》商品关键词

所以我们要想实现全站爬取，需要有city和页数

三、获取所有城市和页数

我们打开拉勾网，观察后发现，他的数据并不是完全展示的，比如说在城市筛选选择全国仅仅只显示30页但总页数是远远大于30页的；我又选择北京发现是30页又选择北京下的海淀区又是30页，可能我们无法把数据全部的爬取，但我们可以尽可能的将数据多的爬取

我们为了获取全站数据，必然离不开的有两个参数一个是城市一个是页数，所以我们利用selenium自动化去获取所有城市和对应页数

def City_Page(self):
    City_Page={}
    url=\"https://www.lagou.com/jobs/allCity.html?keyword=%s&px=default&companyNum=0&isCompanySelected=false&labelWords=\"%(self.keyword)
    self.bro.get(url=url)
    sleep(30)
    print(\"开始获取城市及其最大页数\")
    if \"验证系统\" in self.bro.page_source:
        sleep(40)
    html = etree.HTML(self.bro.page_source)
    city_urls = html.xpath(\'//table[@class=\"word_list\"]//li/input/@value\')
    for city_url in city_urls:
        try:
            self.bro.get(city_url)
            if \"验证系统\" in self.bro.page_source:
                sleep(40)
            city=self.bro.find_element_by_xpath(\'//a[@class=\"current_city current\"]\').text
            page=self.bro.find_element_by_xpath(\'//span[@class=\"span totalNum\"]\').text
            City_Page[city]=page
            sleep(0.5)
        except:
            pass
    self.bro.quit()
    data = json.dumps(City_Page)
    with open(\"city_page.json\", \'w\', encoding=\"utf-8\")as f:
        f.write(data)
    return City_Page

四、生成params参数

我们有了每个城市对应的最大页数，就可以生成访问页面所需的参数

def Params_List(self):
    with open(\"city_page.json\", \"r\")as f:
        data = json.loads(f.read())
    Params_List = []
    for a, b in zip(data.keys(), data.values()):
        for i in range(1, int(b) + 1):
            params = {
                \'city\': a,
                \'pn\': i,
                \'kd\': self.keyword
            }
            Params_List.append(params)
    return Params_List

五、获取数据

最后我们可以通过添加请求头和使用params url来访问页面获取数据

def Parse_Data(self,params):
    url = \"https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false\"
    header={
        \'referer\': \'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=&fromSearch=true&suginput=\',
        \'user-agent\': \'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36\',
        \'cookie\':\'\'
    }
    try:
        text = requests.get(url=url, headers=header, params=params).text
        if \"频繁\" in text:
            print(\"操作频繁，已被发现 当前为第%d个params\"%(i))
        data=json.loads(text)
        result=data[\"content\"][\"positionResult\"][\"result\"]
        for res in result:
            with open(\".//lagou1.csv\", \"a\",encoding=\"utf-8\") as f:
                writer = csv.DictWriter(f, res.keys())
                writer.writerow(res)
        sleep(1)
    except Exception as e:
        print(e)
        pass