Job Search Engine



2.14 User-Triggered Crawler Customization (1) - Entering and Displaying User Data

Copyright notice: please credit the source when reposting: http://www.codingsoho.com/


Let's start by examining the query parameters that Liepin uses for its job search.

Here is the simplest search, using only the keyword software:

https://www.liepin.com/zhaopin/?industries=&subIndustry=&dqs=&salary=&jobKind=&pubTime=&compkind=&compscale=&industryType=&searchType=1&clean_condition=&isAnalysis=&init=1&sortFlag=15&flushckid=0&fromSearchBtn=1&headckid=181ea51414511b11&d_headId=297e843e68fa1dc1781b19720ffee3dd&d_ckId=297e843e68fa1dc1781b19720ffee3dd&d_sfrom=search_fp_nvbar&d_curPage=0&d_pageSize=40&siTag=1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw&key=software

https://www.liepin.com/zhaopin/?
industries=
subIndustry=
dqs=
salary=
jobKind=
pubTime=
compkind=
compscale=
industryType=
searchType=1
clean_condition=
isAnalysis=
init=1
sortFlag=15
flushckid=0
fromSearchBtn=1
headckid=181ea51414511b11
d_headId=297e843e68fa1dc1781b19720ffee3dd
d_ckId=297e843e68fa1dc1781b19720ffee3dd
d_sfrom=search_fp_nvbar
d_curPage=0
d_pageSize=40
siTag=1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw
key=software

Search with every option set

Link to the first page:

https://www.liepin.com/zhaopin/?isAnalysis=&dqs=020010010&pubTime=7&salary=50$100&subIndustry=001004&industryType=industry_01&compscale=030&key=software&init=-1&searchType=1&headckid=82952d5916bbe9a0&flushckid=1&compkind=020&fromSearchBtn=2&sortFlag=15&ckid=54979e566e902845&jobKind=2&industries=010&clean_condition=&siTag=-foQupVsrPkdeHiGETnvuQ~pDiv2mpWvb7HnzSpHT_esw&d_sfrom=search_prime&d_ckId=bd38bd9ff2e5fb9d18c83d9c7bd719bd&d_curPage=0&d_pageSize=40&d_headId=a85a1b9b951a94a0b3e0a15495800d6f

https://www.liepin.com/zhaopin/?
isAnalysis=
dqs=020010010
pubTime=7
salary=50$100
subIndustry=001004
industryType=industry_01
compscale=030
key=software
init=-1
searchType=1
headckid=82952d5916bbe9a0
flushckid=1
compkind=020
fromSearchBtn=2
sortFlag=15
ckid=54979e566e902845
jobKind=2
industries=010
clean_condition=
siTag=-foQupVsrPkdeHiGETnvuQ~pDiv2mpWvb7HnzSpHT_esw
d_sfrom=search_prime
d_ckId=bd38bd9ff2e5fb9d18c83d9c7bd719bd
d_curPage=0
d_pageSize=40
d_headId=a85a1b9b951a94a0b3e0a15495800d6f

Link to the second page:

https://www.liepin.com/zhaopin/?isAnalysis=&dqs=020010010&pubTime=7&salary=50%24100&subIndustry=001004&industryType=industry_01&compscale=030&key=software&init=-1&searchType=1&headckid=82952d5916bbe9a0&compkind=020&fromSearchBtn=2&sortFlag=15&ckid=f4cf188239c69b4e&degradeFlag=1&jobKind=2&industries=010&clean_condition=&siTag=-foQupVsrPkdeHiGETnvuQ~2GEH_2GU37EZKEJJnBMLPg&d_sfrom=search_prime&d_ckId=970b4ac9a9280ef771b4f820be64bb63&d_curPage=0&d_pageSize=40&d_headId=a85a1b9b951a94a0b3e0a15495800d6f&curPage=1

Link to the last page:

https://www.liepin.com/zhaopin/?isAnalysis=&dqs=020010010&pubTime=7&salary=50%24100&subIndustry=001004&industryType=industry_01&compscale=030&key=software&init=-1&searchType=1&headckid=82952d5916bbe9a0&compkind=020&fromSearchBtn=2&sortFlag=15&ckid=f4cf188239c69b4e&degradeFlag=1&jobKind=2&industries=010&clean_condition=&siTag=-foQupVsrPkdeHiGETnvuQ~2GEH_2GU37EZKEJJnBMLPg&d_sfrom=search_prime&d_ckId=970b4ac9a9280ef771b4f820be64bb63&d_curPage=1&d_pageSize=40&d_headId=a85a1b9b951a94a0b3e0a15495800d6f&curPage=24
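Comparing the second-page and last-page links suggests that paging is driven by curPage (1 on the second page, 24 on the last). The following is only a minimal sketch, assuming Python 3 and assuming curPage is a zero-based page index; with_page is a hypothetical helper and not part of the project code.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_page(url, page):
    """Return url with its curPage parameter set to the given page number."""
    parts = urlsplit(url)
    # Keep every parameter (including blank ones) except an existing curPage
    params = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != "curPage"
    ]
    params.append(("curPage", str(page)))
    # urlencode re-escapes values, so "$" is written back as "%24"
    return urlunsplit(parts._replace(query=urlencode(params)))


# Shortened version of the second-page link, for illustration only
second_page = "https://www.liepin.com/zhaopin/?key=software&salary=50%24100&curPage=1"
print(with_page(second_page, 24))
```

Whether Liepin accepts a link in which only curPage changes (the ckid and siTag values also differ between pages) is not confirmed by the links above, so treat this as a starting point for experiments.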

Based on this analysis, the query parameters have the following meanings:

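init=1 whether this is a basic search only; -1 if advanced search was used
key=software search keyword
industryType= industry
industries= industry subdivision
subIndustry= ?
dqs= city / region
salary= salary
jobKind= job type
pubTime= publication time
compkind= company type
compscale= company size
searchType=1
clean_condition=
flushckid=0
fromSearchBtn=1 searched from the search box at the top; 2 came from the next page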
sortFlag=15
ckid=
d_ckId=297e843e68fa1dc1781b19720ffee3dd
headckid=181ea51414511b11
d_headId=297e843e68fa1dc1781b19720ffee3dd
siTag=1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw
d_sfrom=search_fp_nvbar, search_prime
d_curPage=0
d_pageSize=40
isAnalysis=
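The comparison behind this list can be reproduced by diffing the query strings of two links. This is a minimal sketch, assuming Python 3; query_params and diff_params are hypothetical helper names, and the two links are shortened versions of the first-page and second-page links above.

```python
from urllib.parse import urlsplit, parse_qsl


def query_params(url):
    """Return the query string of url as a dict, keeping blank values."""
    return dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))


def diff_params(url_a, url_b):
    """Map each parameter that differs to a (value_in_a, value_in_b) pair."""
    a, b = query_params(url_a), query_params(url_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Shortened links: only a few of the real parameters are kept
first_page = "https://www.liepin.com/zhaopin/?key=software&salary=50$100&flushckid=1&d_curPage=0"
second_page = "https://www.liepin.com/zhaopin/?key=software&salary=50%24100&degradeFlag=1&d_curPage=0&curPage=1"

print(diff_params(first_page, second_page))
```

Because parse_qsl decodes percent escapes, salary=50$100 and salary=50%24100 compare as equal, and only the parameters that really changed (flushckid, degradeFlag, curPage) are reported.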

Creating the Form

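The original plan was to let the user fill in every search field. It turned out that Liepin converts the field values into its own codes (for example dqs=020010010 for a region), so I changed approach: the user pastes a search link directly, and the search is driven by that link. The form therefore has only two fields, the email address and the link.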

from django import forms
# ugettext_lazy was removed in Django 4.0; on newer versions import gettext_lazy instead
from django.utils.translation import ugettext_lazy as _

class JobScrapRegisterForm(forms.Form):
    # User's email address
    mail = forms.EmailField(label=_('mail'), required=True)
    # Original plan (abandoned): one form field per search option
    # keyword = forms.CharField(label=_('keyword'), required=False)
    # industry = forms.CharField(label=_('industry'), required=False)
    # region = forms.CharField(label=_('region'), required=False)
    # salary = forms.CharField(label=_('salary'), required=False)
    # published = forms.CharField(label=_('published'), required=False)
    # position_type = forms.CharField(label=_('position type'), required=False)
    # company_size = forms.CharField(label=_('company size'), required=False)
    # company_type = forms.CharField(label=_('company type'), required=False)
    # Search link copied from Liepin; the search is driven by this URL
    link = forms.CharField(label=_('href'), widget=forms.Textarea, required=True)
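Since the link field is a free-text area, it may be worth checking that the pasted text really is a Liepin search link before parsing it. The sketch below is an assumption about how such a check could look, written for Python 3; is_liepin_search_link is a hypothetical helper (it could, for instance, be called from a clean_link method on the form), and the host and path are simply the ones seen in the example links.

```python
from urllib.parse import urlsplit

ALLOWED_HOST = "www.liepin.com"  # host seen in the example links
SEARCH_PATH = "/zhaopin/"        # path seen in the example links


def is_liepin_search_link(href):
    """Return True if href looks like a Liepin search link with a query string."""
    parts = urlsplit(href.strip())
    return (
        parts.scheme in ("http", "https")
        and parts.hostname == ALLOWED_HOST
        and parts.path == SEARCH_PATH
        and bool(parts.query)
    )


print(is_liepin_search_link("https://www.liepin.com/zhaopin/?key=software"))
print(is_liepin_search_link("https://example.com/zhaopin/?key=software"))
```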

Views

Form View

In this section we introduce a form view, which inherits from FormView.

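Authorization mixin class

from django.contrib.auth.decorators import login_required
from django.utils.decorators import method_decorator

class LoginRequiredMixin(object):
    @classmethod
    def as_view(cls, *args, **kwargs):
        # Wrap the generated view function so that anonymous users are sent to the login page
        view = super(LoginRequiredMixin, cls).as_view(*args, **kwargs)
        return login_required(view)

    @method_decorator(login_required)
    def dispatch(self, request, *args, **kwargs):
        # The same check again at dispatch time
        return super(LoginRequiredMixin, self).dispatch(request, *args, **kwargs)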

class JobScrapRegisterView(LoginRequiredMixin, FormView):
    ...  # the body is given in the complete code further down

Parsing and Building the Link

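def parse_lielin_search_link(href):
    """Split the query string of a Liepin search link into a dict."""
    keywords_extracted = {}
    parts = href.split("?", 1)
    if len(parts) == 1:
        # The link has no query string
        return keywords_extracted
    for pair in parts[1].split("&"):
        if not pair:
            continue  # skip empty segments such as a trailing "&"
        # partition() splits on the first "=" only, so a value containing "=" stays whole;
        # a parameter without "=" gets an empty value
        name, _sep, value = pair.partition("=")
        # Values are stored as they appear in the link (not URL-decoded)
        keywords_extracted[name] = value
    return keywords_extracted
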
def build_liepiin_search_link(keywords_dict):
    """Rebuild a search link from the parsed parameters, starting at page 0."""
    base_url = "https://www.liepin.com/zhaopin/"
    # Parameters carried over into the rebuilt link, in this order
    fields = [
        "industryType",
        "jobKind",
        "sortFlag",
        "degradeFlag",
        "industries",
        "salary",
        "compscale",
        "key",
        "clean_condition",
        "headckid",
        "d_pageSize",
        "siTag",
        "d_headId",
        "d_ckId",
        "d_sfrom",
        "d_curPage"
    ]
    # A parameter missing from the dict is emitted with an empty value
    query = [field + "=" + keywords_dict.get(field, "") for field in fields]
    query.append("curPage=0")
    return base_url + "?" + "&".join(query)

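As an alternative to concatenating strings by hand, the standard library's urlencode can assemble the query string and take care of escaping. This is only a sketch under stated assumptions: Python 3, parameter values that are already URL-decoded, and a shortened field list chosen for illustration; BASE_URL, FIELDS and build_link are hypothetical names, not the project's code.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.liepin.com/zhaopin/"
# Shortened whitelist for illustration; the real list is longer
FIELDS = ["key", "industries", "salary", "d_pageSize"]


def build_link(params, page=0):
    """Build a search link from whitelisted parameters plus a page number."""
    # Parameters missing from params are emitted with an empty value
    query = [(name, params.get(name, "")) for name in FIELDS]
    query.append(("curPage", str(page)))
    return BASE_URL + "?" + urlencode(query)


# "dqs" is not in FIELDS, so it is dropped from the rebuilt link
print(build_link({"key": "software", "salary": "50$100", "dqs": "020010010"}))
```

Note that urlencode escapes its input, so passing a value that is still percent-encoded (such as 50%24100) would encode it a second time; decode values first if they come straight from a pasted link.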
Complete Code

from django.http import HttpResponseRedirect
from django.views.generic.edit import FormView

class JobScrapRegisterView(LoginRequiredMixin, FormView):
    form_class = JobScrapRegisterForm
    template_name = 'jse/job_scrap_register_form.html'

    def post(self, *args, **kwargs):
        form = self.get_form()
        if not form.is_valid():
            return self.form_invalid(form)
        href = form.cleaned_data.get('link', None)
        if href:
            # Stay on the same page: render it again with the parsed
            # parameters and the rebuilt link added to the context
            href_dict = parse_lielin_search_link(href)
            context = self.get_context_data(form=form)
            context['href_dict'] = href_dict
            context['build_href'] = build_liepiin_search_link(href_dict)
            return self.render_to_response(context)
        # Not strictly necessary; this could be left to the base class
        return HttpResponseRedirect(self.get_success_url())

    def get_success_url(self, *args, **kwargs):
        # Redirect back to the current page
        return self.request.get_full_path()

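The view first validates the form. If the form is valid, the link is parsed. A FormView would normally redirect to success_url at this point, but here I want to stay on the current page, so the view calls self.render_to_response to render the current page again with the parsed results.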

Template

The page has two parts: the form on the left, and the parsed dictionary displayed on the right.

jse/job_scrap_register_form.html

{% extends "base.html" %}

{% load staticfiles %}
{% load crispy_forms_tags %}

{% block content %}
<div class="row">
  <div class="col-sm-6">
    <form method="post" action="">{% csrf_token %}
      {{form|crispy}}
      <input type="submit" name="submit" class="btn btn-primary">
    </form>
  </div>
  <div class="col-sm-6">
    <table class="table table-bordered">
      <caption>Job Search Query</caption>
      <thead>
        <tr>
          <th>Key</th>
          <th>Value</th>
        </tr>
      </thead>
      <tbody>
        {% for key, value in href_dict.items %}
        <tr>
          <td>{{key}}</td>
          <td>{{value}}</td>
        </tr>
        {% endfor %}
      </tbody>
    </table>
    <a href="{{build_href}}">Generated link</a>
  </div>
</div>
{% endblock %}