Job Search Engine
2.15 User-Triggered Crawler Customization (2) - AJAX Communication with the Crawler
Copyright notice: when reposting, please credit the source: http://www.codingsoho.com/用户触发数据爬虫定制 (2) - AJAX爬虫程序通信
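At first I wanted to keep things simple and run a Flask app on the crawler side of the communication, but the browser reported this error: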
Failed to load http://127.0.0.1:5000/?href=https%3A%2F%2Fwww.liepin.com%2Fzhaopin%2F%3FindustryType%3Dindustry_01%26jobKind%3D2%26sortFlag%3D15%26degradeFlag%3D1%26industries%3D010%26salary%3D50%2524100%26compscale%3D030%26key%3Dsoftware%26clean_condition%3D%26headckid%3D5eca8655d27b54bd%26d_pageSize%3D40%26siTag%3D-foQupVsrPkdeHiGETnvuQ~2GEH_2GU37EZKEJJnBMLPg%26d_headId%3Da85a1b9b951a94a0b3e0a15495800d6f%26d_ckId%3D970b4ac9a9280ef771b4f820be64bb63%26d_sfrom%3Dsearch_prime%26d_curPage%3D0%26curPage0&action=query_scrab_status: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://127.0.0.1:8000' is therefore not allowed access.
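I did not feel like hunting for the Flask fix, so I went back to Django and used the Django solution.
For the crawler program I likewise created a Django service and added the django-and-cors app to handle cross-origin requests.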
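For reference, here is a minimal settings sketch of what enabling CORS on the crawler service could look like. It assumes the widely used django-cors-headers package (app label `corsheaders`), which may not be the exact app used above; the allowed origin is the JSE page origin taken from the error message.

```python
# settings.py fragment (sketch). Assumes the django-cors-headers package,
# which may differ from the app the article actually installed.
INSTALLED_APPS = [
    "django.contrib.contenttypes",
    "django.contrib.staticfiles",
    "corsheaders",  # app label registered by django-cors-headers
]

MIDDLEWARE = [
    # CorsMiddleware should sit above middleware that can generate responses,
    # such as CommonMiddleware, so the CORS headers are added to them too.
    "corsheaders.middleware.CorsMiddleware",
    "django.middleware.common.CommonMiddleware",
]

# Origins allowed to call this service. Older releases of the package
# name this setting CORS_ORIGIN_WHITELIST instead.
CORS_ALLOWED_ORIGINS = [
    "http://127.0.0.1:8000",
]
```

In a real project these entries would be merged into the existing lists rather than replacing them.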
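Problem: starting the Django service fails with a [WinError 10013] error
"Error: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions."
Cause: port 8000 is already in use
Fix: the service starts on port 8000 by default, so specify a different port when starting it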
(env35) E:\Computer\virtualenv\webscrapping\src\liepin\lipin_scrap>netstat -ano|findstr 8080
TCP 0.0.0.0:8080 0.0.0.0:0 LISTENING 2228
TCP [::]:8080 [::]:0 LISTENING 2228
(env35) E:\Computer\virtualenv\webscrapping\src\liepin\lipin_scrap>
(env35) E:\Computer\virtualenv\webscrapping\src\liepin\lipin_scrap>tasklist |findstr 2228
TNSLSNR.EXE 2228 Services 0 18,000 K
taskkill /pid 2228 /F
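The first command finds the PID of the process listening on port 8080, which is 2228. The second command looks up the details of that process and shows that it is TNSLSNR.EXE. We can stop it with taskkill and its PID, after which the Django server can start; alternatively, simply switch to a different port.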
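The same check can be scripted. The sketch below is only an illustration (the helper names `port_is_free` and `first_free_port` are made up here): it tries to bind each candidate port and reports the first one that is available.

```python
import socket


def port_is_free(host, port):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Covers "address already in use" as well as permission
            # errors such as WinError 10013.
            return False
    return True


def first_free_port(host, candidates):
    """Return the first candidate port that can be bound, or None."""
    for port in candidates:
        if port_is_free(host, port):
            return port
    return None


if __name__ == "__main__":
    # 8000 is Django's default; 8080 and 8082 also appear in this article.
    print(first_free_port("127.0.0.1", [8000, 8080, 8082]))
```

The port it prints can then be passed to the development server, for example `python manage.py runserver 8082`.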
JSON handling in the JSE app
Backend
Convert the fields to JSON and render them
import json
from django.http import HttpResponse

def ajax(request):
    # Collect the fields in a dict, then serialize the dict to JSON.
    response_data = {}
    response_data['result'] = 'failed'
    response_data['message'] = 'You messed up'
    return HttpResponse(json.dumps(response_data), content_type="application/json")
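To make the round trip concrete, here is a small standalone sketch (no Django required) of what happens to that dict: it is serialized to text for the response body, and the browser parses it back into an object.

```python
import json

response_data = {"result": "failed", "message": "You messed up"}

# json.dumps turns the dict into the text that goes into the response body.
body = json.dumps(response_data)

# With dataType "json", the browser parses that text back into an object,
# which is why a success callback can read s.result and s.message.
parsed = json.loads(body)
print(parsed["result"], parsed["message"])
```

Django also provides `django.http.JsonResponse`, which serializes the dict and sets the content type in one step; the explicit `json.dumps` form above simply mirrors the view as written.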
Frontend
<script type="text/javascript">
function register_scrab() {
    event.preventDefault();
    $.ajax({
        url: "http://127.0.0.1:8082/",
        type: "GET",
        dataType: "json",
        data : {
            href: $('#href_id').val(),
            keywords: $('#keywords_id').val(),
            action:"register_job_scrab"
        },
        beforeSend : function(){
            // console.log('beforeSend');
            if ($('#href_id').val() == ""){
                alert('You need to set href');
                return false;
            }
        },
        success: function(s) {
            console.log(s);
            $(".response").html(s.result + " " + s.message)
            return;
        },
        error: function() {
            console.log('error')
            return;
        },
        complete: function() {
            return;
        },
        timeout: 5000,
    })
}
function query_scrab() {
    event.preventDefault();
    $.ajax('http://127.0.0.1:8082/',{
        // url: "http://127.0.0.1:5000/",
        type: "GET",
        dataType: "json",
        data : {
            href: $('#href_id').val(),
            keywords: $('#keywords_id').val(),
            action:"query_scrab_status"
        },
        beforeSend : function(){
            // console.log('beforeSend');
            if ($('#href_id').val() == ""){
                alert('You need to set href');
                return false;
            }
        },
        success: function(s) {
            console.log(s);
            $(".response").html(s.result + " " + s.message)
            return;
        },
        error: function() {
            console.log('error')
            return;
        },
        complete: function() {
            return;
        },
        timeout: 5000,
    })
}
</script>
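With type "GET", the fields in `data` are sent as a URL-encoded query string, which is why the href appears percent-encoded in the error message earlier. The sketch below builds the same kind of URL in Python, which can be handy for exercising the crawler endpoint without a browser. The helper name `build_query_url` and the shortened liepin URL are illustrative assumptions.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_query_url(base, href, keywords, action):
    """Build the GET URL that the AJAX call would request."""
    params = {"href": href, "keywords": keywords, "action": action}
    return base + "?" + urlencode(params)


url = build_query_url(
    "http://127.0.0.1:8082/",
    "https://www.liepin.com/zhaopin/?key=software",  # shortened example href
    "software",
    "query_scrab_status",
)
print(url)

# Decoding the query string recovers the original values, which is what
# request.GET does on the Django side.
print(parse_qs(urlsplit(url).query)["href"][0])
```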
The corresponding form is as follows:
<form method='get' action=''>
    <label>{% trans "href" %}: </label>
    <input id="href_id" name="review_id" value="{{build_href}}">
    <input id="keywords_id" type="hidden" name="keywords" value="{{url_query}}">
    <input class = 'btn btn-primary' type='submit' value='Submit' onclick="register_scrab()" />
    <input class = 'btn btn-info' type='submit' value='Query' onclick="query_scrab()" />
</form>
The keywords serve as the key for the search.
Differences between the two url usages below:
$.ajax({
    url: "{{ajax_host}}", // (1)
$.ajax('{{ajax_host}}',{ // (2)
    // url: "http://127.0.0.1:5000/",
At first, with 127.0.0.1:5000, I saw no difference; the problem only showed up later, with the cross-origin address http://120.55.59.0:8082.
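- (1) the url is treated as a relative address, which is what happens when the value has no http:// prefix: accessed through jse.codingsoho.com, it ends up as http://jse.codingsoho.com/120.55.59.0:8082
- (2) an absolute address, scheme included, is kept as-is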
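Since the relative/absolute distinction comes down to whether the address carries a scheme, one possible safeguard is to normalize the value on the Django side before it reaches the template as `ajax_host`. The helper below is a hypothetical sketch, not part of the original project.

```python
def ensure_absolute(host, default_scheme="http"):
    """Return host with a scheme so the browser treats it as absolute.

    Hypothetical helper: values that already start with http://, https://
    or // (protocol-relative) are returned unchanged.
    """
    if host.startswith(("http://", "https://", "//")):
        return host
    return default_scheme + "://" + host


print(ensure_absolute("120.55.59.0:8082"))
print(ensure_absolute("http://127.0.0.1:8082/"))
```

The view could then pass `ensure_absolute(...)` of the configured crawler address into the template context, so either call form receives a full URL.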
Backend handling in the crawler app
Different scenarios are handled by different actions:
import json
from django.http import HttpResponse

def ajax(request):
    # Default reply, returned when the method or action is not recognized.
    response_data = {}
    response_data['result'] = 'failed'
    response_data['message'] = 'invalid input'
    if request.method.lower() == "get":
        href = request.GET.get("href", None)
        keywords = request.GET.get("keywords", None)
        action = request.GET.get("action", None)
        # Each action gets its own branch; the real logic comes later.
        if "query_scrab_status" == action:
            response_data['result'] = 'success'
            response_data['message'] = 'last sync @ ?'
        elif "register_job_scrab" == action:
            response_data['result'] = 'success'
            response_data['message'] = 'registration confirmed'
    return HttpResponse(json.dumps(response_data), content_type="application/json")
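As the number of actions grows, the if/elif chain can be replaced with a lookup table. The following is a sketch of that alternative, written as plain functions so it runs without Django; `handle_query`, `handle_register`, `ACTIONS` and `build_response` are illustrative names, and the messages are the placeholders from the view above.

```python
FAILED = {"result": "failed", "message": "invalid input"}


def handle_query(href, keywords):
    # Placeholder reply, same as the query_scrab_status branch.
    return {"result": "success", "message": "last sync @ ?"}


def handle_register(href, keywords):
    # Placeholder reply, same as the register_job_scrab branch.
    return {"result": "success", "message": "registration confirmed"}


# Maps the "action" query parameter to the function that handles it.
ACTIONS = {
    "query_scrab_status": handle_query,
    "register_job_scrab": handle_register,
}


def build_response(method, params):
    """Return the response dict for a request method and its GET params."""
    if method.lower() != "get":
        return dict(FAILED)
    handler = ACTIONS.get(params.get("action"))
    if handler is None:
        return dict(FAILED)
    return handler(params.get("href"), params.get("keywords"))


print(build_response("GET", {"action": "register_job_scrab", "href": "x"}))
```

Inside the Django view, this would be called as `build_response(request.method, request.GET)` and the result passed to `json.dumps`.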
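The code above handles each action in its own branch. This section gets the communication working end to end; the actual logic is covered in the next chapter.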