The website I want to extract links from: http://hindi-movies-songs.com/films/index-previous-listen.html
I want to build a list of all the links on these pages plus the .mp3 links inside them (approx. 20K links in total). For example, the parent page http://hindi-movies-songs.com/films/index-previous-listen.html contains 14 links, each of those contains more links, and so on. To make it concrete: the first link on the parent page is http://hindi-movies-songs.com/films/index-listen-20131118.html, and its first link (1.1.1) is http://hindi-films-songs.com/main/roberto-48.html. I then need all the links under 1.1.1, and so on, so three levels of pages get crawled. The problem is that at the end of every page there are links back to the main page, which should not be crawled. How do I exclude them at each level?
My code is:
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama

# init the colorama module
colorama.init()
GREEN = colorama.Fore.GREEN
GRAY = colorama.Fore.LIGHTBLACK_EX
RESET = colorama.Fore.RESET

# initialize the set of links (unique links)
internal_urls = set()
total_urls_visited = 0

def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website.
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.findAll("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # href empty tag
            continue
        # join the URL if it's relative (not absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        urls.add(href)
        internal_urls.add(href)
    return urls

def crawl(url, max_urls=50):
    """
    Crawls a web page and extracts all links.
    You'll find all links in the `internal_urls` global set variable.
    params:
        max_urls (int): number of max URLs to crawl, default is 50.
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="Link Extractor Tool with Python")
    parser.add_argument("url", help="The URL to extract links from.")
    parser.add_argument("-m", "--max-urls", help="Number of max URLs to crawl, default is 50.", default=50, type=int)
    args = parser.parse_args()
    url = args.url
    max_urls = args.max_urls

    crawl(url, max_urls=max_urls)

    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total URLs:", len(internal_urls))

    domain_name = urlparse(url).netloc
    # save the internal links to a file
    with open(f"{domain_name}_internal_links.txt", "w") as f:
        for internal_link in internal_urls:
            print(internal_link.strip(), file=f)
Adding href != "http://hindi-movies-songs.com/index.html" to the empty-href condition did not help.
Is there any solution?
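One way to handle this (a minimal sketch, untested against the live site; EXCLUDED_URLS and get_page_links are illustrative names, not part of the original script) is to keep a set of URLs that should never be followed and check it after the href has been resolved to an absolute URL. Checking the raw href inside the empty-href branch may fail if the home-page link is written as a relative path, because it only matches the full URL after urljoin:

import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup

# links that appear at the bottom of every page and must not be crawled
EXCLUDED_URLS = {
    "http://hindi-movies-songs.com/index.html",
}

def get_page_links(url):
    """Return all absolute links found on `url`, skipping EXCLUDED_URLS."""
    urls = set()
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.findAll("a"):
        href = a_tag.attrs.get("href")
        if not href:
            continue
        # resolve relative links, then drop query strings and fragments
        parsed = urlparse(urljoin(url, href))
        href = parsed.scheme + "://" + parsed.netloc + parsed.path
        if href in EXCLUDED_URLS:
            continue  # skip the link back to the main page
        urls.add(href)
    return urls

The same check can be dropped into get_all_website_links right after href is rebuilt from parsed_href.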
Posted on 2020-06-24 16:36:01
Using href not in ["http://hindi-movies-songs.com/index.html"] works for me:
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import lxml

url = "http://hindi-movies-songs.com/films/index-previous-listen.html"
urls = set()
soup = BeautifulSoup(requests.get(url).content, "lxml")
for a_tag in soup.findAll("a"):
    if a_tag['href'] not in ["http://hindi-movies-songs.com/index.html"]:
        print(a_tag.get('href'))
The output is:
http://hindi-movies-songs.com/films/index-listen-20131118.html
http://hindi-movies-songs.com/films/index-listen-20121231.html
http://hindi-movies-songs.com/films/index-listen-20120327.html
http://hindi-movies-songs.com/films/index-listen-20110831.html
http://hindi-movies-songs.com/films/index-listen-20101215.html
http://hindi-movies-songs.com/films/index-listen-20100404.html
http://hindi-movies-songs.com/films/index-listen-20091201.html
http://hindi-movies-songs.com/films/index-listen-20090611.html
http://hindi-movies-songs.com/films/index-listen-20090105.html
http://hindi-movies-songs.com/films/index-listen-20080523.html
http://hindi-movies-songs.com/films/index-batch4.html
http://hindi-movies-songs.com/films/index-batch3.html
http://hindi-movies-songs.com/films/indexbatch2.html
http://hindi-movies-songs.com/films/index11to25.html
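If the goal is still the three-level crawl from the question, the same exclusion check can be combined with a depth limit. A rough sketch (untested; crawl_levels, EXCLUDED_URLS and collected are illustrative names, and the .mp3 check is an assumption about how the leaf links look):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

EXCLUDED_URLS = {"http://hindi-movies-songs.com/index.html"}
collected = set()

def crawl_levels(url, depth=3):
    """Collect links recursively up to `depth` levels, ignoring EXCLUDED_URLS."""
    if depth == 0:
        return
    soup = BeautifulSoup(requests.get(url).content, "lxml")
    for a_tag in soup.findAll("a"):
        href = a_tag.get("href")
        if not href:
            continue
        href = urljoin(url, href)  # make relative links absolute
        if href in EXCLUDED_URLS or href in collected:
            continue
        collected.add(href)
        # record .mp3 links but only recurse into HTML pages
        if not href.lower().endswith(".mp3"):
            crawl_levels(href, depth - 1)

crawl_levels("http://hindi-movies-songs.com/films/index-previous-listen.html", depth=3)
print(len(collected), "links collected")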
https://stackoverflow.com/questions/62544520