Q: Python thread pool problem (waiting for something)

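Stack Overflow user

Asked 2010-09-06 02:57:55

1 answer, 1.4K views, 0 followers, 1 vote

I wrote a simple website crawler using a thread pool. The problem: once the crawler has fetched every page of the site it should finish, but in practice it ends up waiting for something and the script never exits. Why does this happen?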

import re
import sys
from Queue import Queue, Empty
from threading import Thread
from urllib import urlopen

from BeautifulSoup import BeautifulSoup, SoupStrainer

visited = set()
queue = Queue()

class Worker(Thread):
    """Thread executing tasks from a given tasks queue"""
    def __init__(self, tasks):
        Thread.__init__(self)
        self.tasks = tasks
        self.daemon = True
        self.start()

    def run(self):
        while True:
            func, args, kargs = self.tasks.get()
            print "startcall in thread",self
            print args
            try: func(*args, **kargs)
            except Exception, e: print e
            print "stopcall in thread",self
            self.tasks.task_done()

class ThreadPool:
    """Pool of threads consuming tasks from a queue"""
    def __init__(self, num_threads):
        self.tasks = Queue(num_threads)
        for _ in range(num_threads): Worker(self.tasks)

    def add_task(self, func, *args, **kargs):
        """Add a task to the queue"""
        self.tasks.put((func, args, kargs))

    def wait_completion(self):
        """Wait for completion of all the tasks in the queue"""
        self.tasks.join()


def process(pool,host,url):

    try:
        print "get url",url
        #content = urlopen(url).read().decode(charset)
        content = urlopen(url).read()
    except UnicodeDecodeError:
        return

    for link in BeautifulSoup(content, parseOnlyThese=SoupStrainer('a')):
        #print "link",link
        try:
            href = link['href']
        except KeyError:
            continue


        if not href.startswith('http://'):
            href = 'http://%s%s' % (host, href)
        if not href.startswith('http://%s%s' % (host, '/')):
            continue



        if href not in visited:
            visited.add(href)
            pool.add_task(process,pool,host,href)
            print href




def start(host,charset):

    pool = ThreadPool(7)
    pool.add_task(process,pool,host,'http://%s/' % (host))
    pool.wait_completion()

start('simplesite.com','utf8')

python

multithreading

pool

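1 Answer

Stack Overflow user

Accepted answer

Posted 2010-09-06 11:23:42

The problem I see is that you never exit the while loop in run, so it will block forever. You need to break out of that loop when the jobs are done.

You could try:

1) Insert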

if not func: break

right after the self.tasks.get() call in run.

2) Append

pool.add_task(None, None, None)

at the end of process.

This is a way for process to notify the pool that it has no more tasks to handle.
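
The sentinel idea above can be sketched roughly as follows. This is a minimal illustration, not the asker's crawler: it is written for Python 3 (the `queue` module rather than the Python 2 `Queue` module used in the question), and the names `SENTINEL`, `worker`, `run_pool` and `countdown` are hypothetical. It also makes two assumptions that differ from the literal suggestion: the sentinels are sent from the main thread after `join()` returns, one per worker, rather than from the end of `process`, and `task_done()` is called for the sentinel as well so that the queue's unfinished-task counter stays balanced.

```python
import queue
import threading

SENTINEL = None  # hypothetical marker meaning "no more work"


def worker(tasks, results):
    """Run tasks until the sentinel arrives, then leave the loop."""
    while True:
        item = tasks.get()
        try:
            if item is SENTINEL:
                break  # exit instead of blocking on get() forever
            func, args = item
            # Each task receives the queue so it can enqueue follow-up work,
            # the way process() enqueues newly found links.
            results.append(func(tasks, *args))
        except Exception as exc:
            print("task failed:", exc)
        finally:
            tasks.task_done()  # also runs for the sentinel


def run_pool(jobs, num_threads=3):
    tasks = queue.Queue()  # assumption: unbounded, so put() never blocks
    results = []
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for job in jobs:
        tasks.put(job)
    tasks.join()             # returns once every queued task is done
    for _ in threads:
        tasks.put(SENTINEL)  # one sentinel per worker
    for t in threads:
        t.join()             # workers have left their loops
    return results


def countdown(tasks, n):
    """Toy task that schedules a child task, like a page yielding a link."""
    if n > 0:
        tasks.put((countdown, (n - 1,)))
    return n


if __name__ == "__main__":
    print(sorted(run_pool([(countdown, (3,))])))  # prints [0, 1, 2, 3]
```

Because a task enqueues its children before its own `task_done()` runs, `join()` cannot return while follow-up work is still pending. The queue is left unbounded here because a worker calling `put()` on a full bounded queue blocks until another thread removes an item.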

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/3648781
