PTT Web Crawler: Scraping the Article URLs on Each Page

- December 04, 2019

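This post shows how to scrape the URL of every article from the web version of PTT while getting past the over-18 check on the Gossiping board.
Once the URLs are collected, the follow-up post "PTT Web Crawler: Scraping Each Article's Content" covers how to visit each URL and scrape the article itself.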

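The example uses Python and the web page of the Gossiping board.
The board's URLs have one handy property, which you can see by clicking the "previous page" (上頁) button at the top right of this page: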
https://www.ptt.cc/bbs/Gossiping/index.html

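After you click it, the URL carries a number after "index". That number is the page's position among all pages of the board. The previous page is 39061, so the latest page is 39062, and a loop over the page numbers can collect the article URLs from every page.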
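The page number does not have to be typed in by hand. As a minimal sketch of the reasoning above, the latest index can be derived from the previous-page link by adding one. SAMPLE_HTML is a simplified assumption about the paging-button markup on the board page, and latest_page_index is a hypothetical helper name, not part of the original script.

```python
import re

# Simplified, assumed markup of the paging buttons on the board's index page.
SAMPLE_HTML = '''
<a class="btn wide" href="/bbs/Gossiping/index1.html">最舊</a>
<a class="btn wide" href="/bbs/Gossiping/index39061.html">‹ 上頁</a>
<a class="btn wide disabled">下頁 ›</a>
<a class="btn wide" href="/bbs/Gossiping/index.html">最新</a>
'''

def latest_page_index(html):
    """Return the newest page's index: the largest numbered index link plus 1."""
    # "/index.html" (the latest page itself) has no digits and is not matched.
    numbers = [int(n) for n in re.findall(r'/index(\d+)\.html', html)]
    if not numbers:
        raise ValueError("no numbered index link found")
    return max(numbers) + 1

print(latest_page_index(SAMPLE_HTML))  # 39062
```

In a real run the HTML would come from requesting index.html first, so the loop bound follows the board as it grows.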

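The script below does the following:
Loop over pages 1 to 39062
Build each page URL by putting the page number after "index"
payload is the form data posted to the over-18 check; the session keeps the cookie it gets back
Request the page and fetch its HTML
Select the div elements with class "title", which contain the article links
Take the href of each link and append it to article_href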

import requests
from bs4 import BeautifulSoup

# Collected article links from every page (created once, outside the loop)
article_href = []

# One session for the whole run, so the over-18 cookie is reused
rs = requests.session()

# Form data for the age-confirmation POST
payload = {'from': '/bbs/Gossiping/index.html',
           'yes': 'yes'
          }
rs.post('https://www.ptt.cc/ask/over18', data=payload)

for page in range(1, 39063):  # pages 1 to 39062 inclusive
    url = 'https://www.ptt.cc/bbs/Gossiping/index' + str(page) + '.html'
    res = rs.get(url)

    soup = BeautifulSoup(res.text, "html.parser")
    results = soup.select("div.title")

    for item in results:
        link = item.select_one("a")
        if link is None:
            # Deleted articles have a title block but no link; skip them
            continue
        article_href.append(link.get("href"))
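The href values collected this way are site-relative paths that start with /bbs/, so they need the host name in front before they can be requested. A minimal sketch, assuming the base URL https://www.ptt.cc; to_absolute is a hypothetical helper and the sample path is made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ptt.cc"  # assumed host for all collected links

def to_absolute(hrefs, base=BASE_URL):
    """Turn site-relative article paths into full URLs."""
    return [urljoin(base, href) for href in hrefs]

# Made-up article path in the same shape as the scraped href values.
sample = ["/bbs/Gossiping/M.1575432000.A.ABC.html"]
print(to_absolute(sample))
```

urljoin is used instead of plain string concatenation so that an href that is already absolute is left unchanged.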
