# Transcription Factor Binding Site Prediction

> Today I wanted to analyze transcription factor binding site data using <http://gene-regulation.com/cgi-bin/pub/programs/patch/bin/patch.cgi> for prediction. The site offers no file-upload function and caps how much data each request can process, and the thought of several hundred mouse clicks gave me a headache; so, drawing on someone else's blog post, I wrote a crawler.

Script download: <https://github.com/zpliu1126/booknote/raw/master/Script/patch.tar.gz>

**Usage**

Requires Python 3; the third-party libraries `tqdm` (progress bar) and `beautifulsoup4` (HTML parsing) must be installed

```shell
# Step 1: run login.py to obtain the cookie file
python3 login.py > cookie.txt
# Step 2: run patch.py to crawl the data via POST requests
python3 patch.py <gene_fasta_file> <output_file>
```

:warning: Note that the path to the `cookie.txt` file must be set inside `patch.py`

* **:jack\_o\_lantern: First, log in to the site with the `urllib` and `http.cookiejar` libraries**

  ```python
  import urllib.error, urllib.request, urllib.parse
  import http.cookiejar

  LOGIN_URL = 'http://gene-regulation.com/login'
  values = {'user': 'username', 'password': 'password'}  # replace with your credentials
  postdata = urllib.parse.urlencode(values).encode()
  user_agent = r'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36' \
               r' (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'
  headers = {'User-Agent':user_agent, 'Connection':'keep-alive'}
  # Save the cookie locally as cookie.txt
  cookie_filename = 'cookie.txt'
  cookie_aff = http.cookiejar.MozillaCookieJar(cookie_filename)
  handler = urllib.request.HTTPCookieProcessor(cookie_aff)
  opener = urllib.request.build_opener(handler)

  request = urllib.request.Request(LOGIN_URL, postdata, headers)
  try:
      response = opener.open(request)
  except urllib.error.URLError as e:
      print(e.reason)

  cookie_aff.save(ignore_discard=True, ignore_expires=True)
  # Persist the session cookie to disk
  ```
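
  To confirm the cookie file was written correctly, you can round-trip a cookie through a Mozilla-format file. A minimal sketch with a hypothetical cookie (`session`/`abc123` are made-up values, independent of the real site):

  ```python
  import http.cookiejar

  # Build a jar backed by a Mozilla-format file and add a dummy cookie
  jar = http.cookiejar.MozillaCookieJar('cookie_check.txt')
  jar.set_cookie(http.cookiejar.Cookie(
      version=0, name='session', value='abc123', port=None, port_specified=False,
      domain='gene-regulation.com', domain_specified=True, domain_initial_dot=False,
      path='/', path_specified=True, secure=False, expires=None, discard=True,
      comment=None, comment_url=None, rest={}))
  jar.save(ignore_discard=True, ignore_expires=True)

  # Load it back into a fresh jar, as patch.py does
  jar2 = http.cookiejar.MozillaCookieJar()
  jar2.load('cookie_check.txt', ignore_discard=True, ignore_expires=True)
  print([c.name for c in jar2])  # → ['session']
  ```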
* **:checkered\_flag: Log in using the saved cookie file**

  ```python
  import urllib.error, urllib.request, urllib.parse
  import http.cookiejar

  # get_url is the page accessed with the cookie; it requires a prior login
  get_url = 'http://gene-regulation.com/cgi-bin/pub/programs/patch/bin/patch.cgi'
  # Log in with the saved cookie file
  cookie_filename = 'cookie.txt'
  cookie_aff = http.cookiejar.MozillaCookieJar(cookie_filename)
  cookie_aff.load(cookie_filename, ignore_discard=True, ignore_expires=True)
  handler = urllib.request.HTTPCookieProcessor(cookie_aff)
  opener = urllib.request.build_opener(handler)
  # Build request headers to mimic a browser
  user_agent = r'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36' \
               r' (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'
  headers = {'User-Agent': user_agent, 'Connection': 'keep-alive'}
  # Build the POST form
  searchvalue = {'Status': 'First',
                 'searchName': 'default',
                 'usr_seq': 'default',
                 'seqStat': 'DEL',
                 'sequenceName': 'default.seq',
                 'site_opt': 'OUR',
                 'group': 'plants',
                 'minLen': 8,
                 'mismatch': 1,
                 'penalty': 100,
                 'boundary': 87.5}
  # Initialize the sequence field
  searchvalue['theSequence'] = ''
  ```
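
  One detail worth noting: `urlencode` percent-encodes the newlines inside `theSequence`, so multi-line FASTA text travels safely in the POST body. A tiny sketch with an abbreviated field set (the values are illustrative, not a real submission):

  ```python
  from urllib.parse import urlencode

  # Abbreviated form with a one-record FASTA entry in theSequence
  demo = {'Status': 'First', 'theSequence': '>gene1 \nATGCATGC\n'}
  body = urlencode(demo).encode()
  # '>' becomes %3E, space becomes '+', and '\n' becomes %0A
  print(body)  # → b'Status=First&theSequence=%3Egene1+%0AATGCATGC%0A'
  ```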
* **:carousel\_horse: How to inspect the POST key-value pairs**

  Submit the form once by hand with the browser's developer tools open (Network tab) and read the request payload:

  ![Batch-crawling the site's data](https://2517010162-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LrsPId8uR5c88w6ybRs%2F-LrsPbPPVoasswJBa4KY%2F-LrsPgTCt7Gokxfq_d8K%2FXUOYU0KC6IOMEY3C8N4G-768x384.png?generation=1571830812568082\&alt=media)
* **:diamond\_shape\_with\_a\_dot\_inside: Read sequences from the file and send POST requests**

  ```python
  import sys
  import time
  from tqdm import tqdm
  from bs4 import BeautifulSoup

  # Read the gene sequences into a dict (fastaread is a user-defined FASTA parser)
  genelist = fastaread(sys.argv[1])
  # Open the output file
  patchout = open(sys.argv[2], 'w')
  # Counts how many sequences have been queued
  flag = 0
  # Accumulate sequences and submit them in batches
  for gene in tqdm(genelist.keys(), desc="request is doing"):
      flag += 1
      # Append this sequence in FASTA format
      searchvalue['theSequence'] = '%s>%s \n%s\n' % (searchvalue['theSequence'], gene, genelist[gene])
      # Send one request per 200 sequences; the final batch of fewer
      # than 200 sequences is handled by the elif branch below
      if flag % 200 == 0:
          # URL-encode the POST body
          searchtdata = urllib.parse.urlencode(searchvalue).encode()
          # Access get_url with the saved cookie
          get_request = urllib.request.Request(get_url, searchtdata, headers=headers)
          get_response = opener.open(get_request)
          # Parse the response with BeautifulSoup
          # (docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh)
          soup = BeautifulSoup(get_response.read().decode(), features="html.parser")
          for form in soup.find_all("form", action="/cgi-bin/pub/programs/patch/bin/files.cgi"):
              # Each result form is preceded by its sequence ID
              if form.previous_sibling.previous_sibling.previous_sibling.string != None:
                  patchout.write(">" + form.previous_sibling.previous_sibling.previous_sibling.string + "\n")
              else:
                  print("Batch ending at sequence %d has an empty result!" % flag)
              # The sibling node after the form holds the result block
              pre = form.next_sibling.next_sibling
              # If nothing was found, its string is "no result find"
              if pre.string != None:
                  patchout.write(pre.string + "\n")
              else:
                  for string in pre.strings:
                      patchout.write(string)
                  patchout.write("\n")
          searchvalue['theSequence'] = ''
          # Wait 1 s before the next request
          time.sleep(1)
      # Handle the last batch, which may hold fewer than 200 sequences
      elif flag == len(genelist):
          # URL-encode the POST body
          searchtdata = urllib.parse.urlencode(searchvalue).encode()
          # Access get_url with the saved cookie
          get_request = urllib.request.Request(get_url, searchtdata, headers=headers)
          get_response = opener.open(get_request)
          # Parse the response with BeautifulSoup
          soup = BeautifulSoup(get_response.read().decode(), features="html.parser")
  ```
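
  The script calls a `fastaread` helper that is never shown. Judging from how it is used (a dict mapping sequence IDs to sequences), a minimal implementation could look like this. This is my sketch, not the author's original:

  ```python
  def fastaread(path):
      """Parse a FASTA file into {sequence_id: sequence} (minimal sketch)."""
      genes = {}
      name = None
      with open(path) as fh:
          for line in fh:
              line = line.strip()
              if line.startswith('>'):
                  name = line[1:].split()[0]  # ID up to the first whitespace
                  genes[name] = ''
              elif line and name is not None:
                  genes[name] += line
      return genes

  # Tiny self-check on a temporary two-record FASTA file
  import tempfile, os
  tmp = tempfile.NamedTemporaryFile('w', suffix='.fa', delete=False)
  tmp.write('>g1 some description\nATG\nCCC\n>g2\nTTT\n')
  tmp.close()
  genes = fastaread(tmp.name)
  print(genes)  # → {'g1': 'ATGCCC', 'g2': 'TTT'}
  os.unlink(tmp.name)
  ```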
* **:bride\_with\_veil: Parse the returned HTML**

  ```python
          # Continuation of the elif branch above: same parsing as the
          # 200-sequence batches, applied to the final partial batch
          for form in soup.find_all("form", action="/cgi-bin/pub/programs/patch/bin/files.cgi"):
              # Each result form is preceded by its sequence ID
              patchout.write(">" + form.previous_sibling.previous_sibling.previous_sibling.string + "\n")
              # The sibling node after the form holds the result block
              pre = form.next_sibling.next_sibling
              # If nothing was found, its string is "no result find"
              if pre.string != None:
                  patchout.write(pre.string + "\n")
              else:
                  for string in pre.strings:
                      patchout.write(string)
                  patchout.write("\n")
          searchvalue['theSequence'] = ''
          # Wait 1 s before the next request
          time.sleep(1)
      else:
          continue
  patchout.close()
  ```

  ![The parsed result file](https://2517010162-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LrsPId8uR5c88w6ybRs%2F-LrsPbPPVoasswJBa4KY%2F-LrsPgTEqpiwmD_JgcrQ%2FNKIM6H6M429A0%40JJI%24%60H~EN.png?generation=1571830812259730\&alt=media)
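
  The sibling hopping above (`previous_sibling` three times, `next_sibling` twice) depends on the exact layout of the PATCH results page; the extra hops skip the whitespace text nodes the parser keeps between tags. A toy fragment (not the real PATCH output) illustrates the pattern:

  ```python
  from bs4 import BeautifulSoup

  # Toy HTML mimicking the assumed layout: the result text sits in a
  # <pre> node that follows each result <form>
  html = '<form action="/cgi-bin/pub/programs/patch/bin/files.cgi"></form>\n<pre>no result find</pre>'
  soup = BeautifulSoup(html, features="html.parser")
  form = soup.find("form", action="/cgi-bin/pub/programs/patch/bin/files.cgi")
  # form.next_sibling is the '\n' text node; the node after it is the <pre>
  print(form.next_sibling.next_sibling.string)  # → no result find
  ```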
* Convert the parsed web output into a tabular format

  ```bash
  sed 's/Scanning sequence: //g' Ga_gene.txt | awk -F "  " '{print $1}' | awk '
  NR==1{flag=substr($1,2)}
  {a[NR]=$1}
  END{
      print flag;
      for(i=2;i<=NR-1;i++){
          if(a[i]~/^>/){
              flag=substr(a[i],2);
              print flag;
          } else if(a[i]==""){
              print
          } else{
              print flag
          }
      }
  }' | paste -d "\t" - Ga_gene.txt | sed 's/\(\s\s\)\+/\t/g'
  ```
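
  The awk logic above carries the most recent `>id` down to every following result line. For readers more comfortable in Python, here is an equivalent sketch of the same idea (`tag_with_gene_id` is a hypothetical helper, not part of the original scripts):

  ```python
  def tag_with_gene_id(lines):
      """Prefix each result line with the most recent '>id' seen."""
      flag = None
      out = []
      for line in lines:
          line = line.rstrip('\n')
          if line.startswith('>'):
              flag = line[1:]   # remember the current sequence ID
              out.append(flag)
          elif line == '':
              out.append('')    # keep blank separators
          else:
              out.append('%s\t%s' % (flag, line))
      return out

  print(tag_with_gene_id(['>g1', 'site A', '', '>g2', 'site B']))
  # → ['g1', 'g1\tsite A', '', 'g2', 'g2\tsite B']
  ```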
* **The converted file**

  ![After format conversion](https://github.com/zpliu1126/booknote/tree/60e66b201e3f20f7dee247ff11294c85ae65e8b8/Script/img/%40%4085\[Z8SN_YO%40%60%40%Z]QYV2.png)
