In this chapter, we focus on web crawler design: an interesting and classic system design interview question.
在本章,我们重点讨论网络爬虫的设计,这也是一个有趣且经典的系统设计面试问题。
A web crawler is known as a robot or spider. It is widely used by search engines to discover new or updated content on the web. Content can be a web page, an image, a video, a PDF file, etc. A web crawler starts by collecting a few web pages and then follows links on those pages to collect new content. Figure 1 shows a visual example of the crawl process.
网络爬虫(Web Crawler,下文简称为“爬虫”)也称为机器人(Bot)或者蜘蛛(Spider),被搜索引擎广泛地用于发现网络上的新内容或者更新的内容。这些内容可以是网页、图片、视频、PDF文件等。爬虫从收集网页开始,然后顺着这些网页上的链接收集新的内容。图1展示了爬虫爬取页面的示例。
A crawler is used for many purposes:
爬虫有很多用途。
The complexity of developing a web crawler depends on the scale we intend to support. It could be either a small school project, which takes only a few hours to complete or a gigantic project that requires continuous improvement from a dedicated engineering team. Thus, we will explore the scale and features to support below.
爬虫开发的复杂性取决于我们想要支持的爬虫规模。它可以是一个小的学校项目,只需要几小时就可以完成,也可以是一个需要专业开发团队持续优化的巨型项目。因此,下面我们会先确定我们需要支持的爬虫规模和特性。
The basic algorithm of a web crawler is simple:
爬虫的基本算法很简单。
Does a web crawler work truly as simple as this basic algorithm? Not exactly. Designing a vastly scalable web crawler is an extremely complex task. It is unlikely for anyone to design a massive web crawler within the interview duration. Before jumping into the design, we must ask questions to understand the requirements and establish design scope:
爬虫的工作真的像基本算法所述的这样简单吗?并不完全是。设计大规模的可扩展爬虫是一个极度复杂的任务。在面试时间内,没有人能设计出一个巨型爬虫。在着手设计之前,我们必须通过提问来理解需求并确定设计的边界。
Candidate: What is the main purpose of the crawler? Is it used for search engine indexing, data mining, or something else? Interviewer: Search engine indexing.
候选人:爬虫的主要目的是什么?是用于搜索引擎索引、数据挖掘还是其他什么?
面试官:搜索引擎索引。
Candidate: How many web pages does the web crawler collect per month? Interviewer: 1 billion pages.
候选人:爬虫每个月收集多少网页?
面试官:10亿个。
Candidate: What content types are included? HTML only or other content types such as PDFs and images as well? Interviewer: HTML only.
候选人:包括哪些内容类型?是只有HTML页面,还是也包括其他内容类型,比如PDF文件、图片等?
面试官:仅包括HTML页面。
Candidate: Shall we consider newly added or edited web pages? Interviewer: Yes, we should consider the newly added or edited web pages.
候选人:我们需要考虑新增的或者有更新的网页吗?
面试官:是的,我们应该要考虑新增的或者有更新的网页。
Candidate: Do we need to store HTML pages crawled from the web? Interviewer: Yes, up to 5 years
候选人:我们需要保存从网络上爬取的HTML页面吗?
面试官:是的,最多要保存5年。
Candidate: How do we handle web pages with duplicate content? Interviewer: Pages with duplicate content should be ignored.
候选人:我们如何处理有重复内容的网页?
面试官:忽略有重复内容的网页。
Above are some of the sample questions that you can ask your interviewer. It is important to understand the requirements and clarify ambiguities. Even if you are asked to design a straightforward product like a web crawler, you and your interviewer might not have the same assumptions.
以上是你可以向面试官提出的问题的一些示例。理解和厘清需求是很重要的。即使你被要求设计一个像爬虫这样简单的产品,你和面试官也可能会有不一样的假设。
Beside functionalities to clarify with your interviewer, it is also important to note down the following characteristics of a good web crawler:
除了要与面试官厘清功能需求,还要确定爬虫是否具备如下特性,它们都是一个好的爬虫应该具备的。
The following estimations are based on many assumptions, and it is important to communicate with the interviewer to be on the same page.
下面的估算基于很多假设,与面试官交流并达成共识很重要。
Once the requirements are clear, we move on to the high-level design. Inspired by previous studies on web crawling, we propose a high-level design as shown in Figure 2.
一旦明确了需求,我们就可以考虑高层级设计了。受前人关于爬虫研究的启发,我们提出如图2所示的高层级设计。
First, we explore each design component to understand their functionalities. Then, we examine the crawler workflow step-by-step.
首先,我们探索每个组件以了解它们的功能,然后一步步分析这个爬虫的工作流程。
Seed URLs
种子URL
A web crawler uses seed URLs as a starting point for the crawl process. For example, to crawl all web pages from a university’s website, an intuitive way to select seed URLs is to use the university’s domain name.
爬虫使用种子URL作为爬取的起点。例如,要爬取一所大学网站上的所有网页,直观的方法是用该大学的域名作为种子URL。
To crawl the entire web, we need to be creative in selecting seed URLs. A good seed URL serves as a good starting point that a crawler can utilize to traverse as many links as possible. The general strategy is to divide the entire URL space into smaller ones. The first proposed approach is based on locality as different countries may have different popular websites. Another way is to choose seed URLs based on topics; for example, we can divide URL space into shopping, sports, healthcare, etc. Seed URL selection is an open-ended question. You are not expected to give the perfect answer. Just think out loud.
为了爬取整个网络,我们需要有创意地选择种子URL。好的种子URL让我们有一个好的起点,爬虫可以利用它来遍历尽可能多的链接。一般的策略是把整个URL空间分成若干小块。因为不同国家可能有不同的热门网站,所以我们提议的第一个方法是基于物理位置来划分。另一个方法是基于话题来选择种子URL,比如我们可以把URL空间分为购物、体育、健康等部分。种子URL的选择是一个开放式问题。你不必给出完美的答案。只要边想边说出来就好。
URL Frontier
URL前线
Most modern web crawlers split the crawl state into two: to be downloaded and already downloaded. The component that stores URLs to be downloaded is called the URL Frontier. You can refer to this as a First-in-First-out (FIFO) queue. For detailed information about the URL Frontier, refer to the deep dive.
大部分现代爬虫把爬取状态分为两种:即将下载和已经下载。用来存储即将下载的URL的组件叫URL前线。你可以把它看作一个先进先出(FIFO)的队列。关于URL前线的详细信息,将在3.2节讲解。
HTML Downloader
HTML下载器
The HTML downloader downloads web pages from the internet. Those URLs are provided by the URL Frontier.
HTML下载器从互联网上下载网页。要下载的URL由URL前线来提供。
DNS Resolver
DNS解析器
To download a web page, a URL must be translated into an IP address. The HTML Downloader calls the DNS Resolver to get the corresponding IP address for the URL. For instance, URL www.wikipedia.org is converted to IP address 198.35.26.96 as of 3/5/2019.
要下载网页,必须将URL转换成IP地址。HTML下载器请求DNS解析器来获取URL对应的IP地址。举个例子,截至2019年5月3日,www.wikipedia.org会被转换成198.35.26.96这个IP地址。
Content Parser
内容解析器
After a web page is downloaded, it must be parsed and validated because malformed web pages could provoke problems and waste storage space. Implementing a content parser in a crawl server will slow down the crawling process. Thus, the content parser is a separate component.
网页被下载以后,必须进行解析和校验,因为有问题的网页可能会引发问题且浪费存储空间。在爬虫服务器中实现内容解析器会减慢爬虫的速度。因此,内容解析器是一个独立的组件。
Content Seen?
已见过的内容?
Online research reveals that 29% of the web pages are duplicated contents, which may cause the same content to be stored multiple times. We introduce the “Content Seen?” data structure to eliminate data redundancy and shorten processing time. It helps to detect new content previously stored in the system. To compare two HTML documents, we can compare them character by character. However, this method is slow and time-consuming, especially when billions of web pages are involved. An efficient way to accomplish this task is to compare the hash values of the two web pages.
有研究显示,29%的网页是重复内容,这可能导致同样的内容被多次存储。我们引入了“已见过的内容?”数据结构,来消除数据冗余和缩短处理时间。它帮助检测内容是否已存储在系统中。在比较两个HTML文件时,我们可以逐字符地比较。但是这个方法太慢且耗时,特别是有数十亿的网页要比较时。完成这个任务的一个高效方法是比较两个网页的哈希值。
Content Storage
内容存储
It is a storage system for storing HTML content. The choice of storage system depends on factors such as data type, data size, access frequency, life span, etc. Both disk and memory are used.
它是存储HTML内容的存储系统。选择什么样的存储系统,取决于数据的类型、大小、访问频率、生命周期等因素。硬盘和内存都被用到。
URL Extractor
URL提取器
URL Extractor parses and extracts links from HTML pages. Figure 3 shows an example of a link extraction process. Relative paths are converted to absolute URLs by adding the “https://en.wikipedia.org” prefix.
URL提取器从HTML页面中解析和提取链接。图3展示了链接提取过程。通过添加前缀“https://en.wikipedia.org”,相对路径被转换成绝对URL。
URL Filter
URL过滤器
The URL filter excludes certain content types, file extensions, error links and URLs in “blacklisted” sites.
URL过滤器用于排除特定内容类型、文件扩展名、问题链接和“黑名单”网站的URL。
URL Seen?
已见过的URL?
“URL Seen?” is a data structure that keeps track of URLs that are visited before or already in the Frontier. “URL Seen?” helps to avoid adding the same URL multiple times as this can increase server load and cause potential infinite loops.
“已见过的URL?”是一种数据结构,可以用来追踪记录已访问过的或者已经在URL前线里的URL。“已见过的URL?”可以帮我们避免多次添加同一个URL,重复添加URL会增加服务器负载并导致潜在的无限循环。
Bloom filter and hash table are common techniques to implement the “URL Seen?” component. We will not cover the detailed implementation of the bloom filter and hash table here. For more information, refer to the reference materials.
布隆过滤器和哈希表都是实现“已见过的URL?”组件的常见技术。在这里我们不会介绍布隆过滤器和哈希表的详细实现细节。如果你感兴趣,请参考Burton H.Bloom的文章“Space/Time Trade-Offs in Hash Coding with Allowable Errors”,以及Allan Heydon与Marc Najork合著的文章“Mercator:A Scalable,Extensible Web Crawler”。
URL Storage
URL存储
URL Storage stores already visited URLs.
URL存储用于保存已访问过的URL。
So far, we have discussed every system component. Next, we put them together to explain the workflow.
到目前为止,我们讨论了所有系统组件。接下来,我们把它们组合在一起来解释爬虫的工作流程。
Web crawler workflow
爬虫工作流程
To better explain the workflow step-by-step, sequence numbers are added in the design diagram as shown in Figure 4.
为了更好地分步骤解释爬虫工作流程,我们在设计图里加了序号,如图4所示。
Step 1: Add seed URLs to the URL Frontier
第1步:将种子URL添加到URL前线中。
Step 2: HTML Downloader fetches a list of URLs from URL Frontier.
第2步:HTML下载器从URL前线中获取URL列表。
Step 3: HTML Downloader gets IP addresses of URLs from DNS resolver and starts downloading.
第3步:HTML下载器从DNS解析器中获取URL对应的IP地址并开始下载。
Step 4: Content Parser parses HTML pages and checks if pages are malformed.
第4步:内容解析器解析HTML页面并检查页面是否有问题。
Step 5: After content is parsed and validated, it is passed to the “Content Seen?” component.
第5步:内容解析器将解析和验证后的内容传给“已见过的内容?”组件。
Step 6: “Content Seen” component checks if a HTML page is already in the storage.
第6步:“已见过的内容?”组件检查这个HTML页面是否已经在数据库中。
Step 7: Link extractor extracts links from HTML pages.
第7步:链接提取器从HTML页面中提取链接。
Step 8: Extracted links are passed to the URL filter.
第8步:提取出来的链接被传递给URL过滤器进行筛选。
Step 9: After links are filtered, they are passed to the “URL Seen?” component.
第9步:经过筛选的链接被传递给“已见过的URL?”组件。
Step 10: “URL Seen” component checks if a URL is already in the storage, if yes, it is processed before, and nothing needs to be done.
第10步:“已见过的URL?”组件检查这个URL是否已经在数据库中,如果是,则意味着它之前被处理过,无须再做处理。
Step 11: If a URL has not been processed before, it is added to the URL Frontier.
第11步:如果这个URL之前没有被处理过,就将被添加到URL前线中。
Up until now, we have discussed the high-level design. Next, we will discuss the most important building components and techniques in depth:
到目前为止,我们已经讨论过高层级的设计。接下来,我们将深入讨论几个最重要的构建组件和技术。
You can think of the web as a directed graph where web pages serve as nodes and hyperlinks (URLs) as edges. The crawl process can be seen as traversing a directed graph from one web page to others. Two common graph traversal algorithms are DFS and BFS. However, DFS is usually not a good choice because the depth of DFS can be very deep.
你可以把网络想成一个有向图,其中网页就是节点,超链接(URL)是边。爬虫在网络上爬行的过程可以看作是从一个网页到其他网页的有向图遍历。常见的两种图遍历算法是DFS和BFS。但是,因为DFS的深度可能非常深,所以它通常不是一个好的选择。
BFS is commonly used by web crawlers and is implemented by a first-in-first-out (FIFO) queue. In a FIFO queue, URLs are dequeued in the order they are enqueued. However, this implementation has two problems:
BFS是爬虫常用的方法,通过先进先出(FIFO)队列来实现。在一个FIFO队列中,URL按照它们入列的顺序出列。尽管如此,这种实现方式还有以下两个问题。
URL frontier helps to address these problems. A URL frontier is a data structure that stores URLs to be downloaded. The URL frontier is an important component to ensure politeness, URL prioritization, and freshness. A few noteworthy papers on URL frontier are mentioned in the reference materials. The findings from these papers are as follows:
URL前线帮我们解决了这些问题。URL前线是一个重要组件,它是一个存储待下载URL的数据结构,能确保爬虫礼貌地访问网页,确定URL优先级并保证内容新鲜度。关于URL前线,建议细读Christopher Olston与Marc Najork合著的文章“Web Crawling”。这篇文章给出了如下结论。
Generally, a web crawler should avoid sending too many requests to the same hosting server within a short period. Sending too many requests is considered as “impolite” or even treated as denial-of-service (DOS) attack. For example, without any constraint, the crawler can send thousands of requests every second to the same website. This can overwhelm the web servers.
一般来说,爬虫应该避免在短时间内对同一个服务器发送太多的请求。发送过多请求会被认为“不礼貌”,甚至可能被视为拒绝服务攻击(DoS)。举个例子,如果没有任何限制,爬虫可以对同一个网站每秒发送数千个请求。但这可能会让Web服务器不堪重负。
The general idea of enforcing politeness is to download one page at a time from the same host. A delay can be added between two download tasks. The politeness constraint is implemented by maintain a mapping from website hostnames to download (worker) threads. Each downloader thread has a separate FIFO queue and only downloads URLs obtained from that queue. Figure 6 shows the design that manages politeness.
确保礼貌性的大致思路是,从同一个主机每次只下载一个网页。可以在两次下载任务之间加入一定的延时。礼貌性约束是通过维护网站主机名和下载线程(Worker)的映射来实现的。每个下载线程都有独立的FIFO队列且只下载这个队列里的URL。图6展示了实现爬虫礼貌性的设计。
A random post from a discussion forum about Apple products carries very different weight than posts on the Apple home page. Even though they both have the “Apple” keyword, it is sensible for a crawler to crawl the Apple home page first.
某论坛上的一篇关于苹果产品的随机帖子与苹果官网上的文章相比,权重的差异很大。尽管它们都含有“苹果”这个关键字,但是对爬虫而言,先爬取苹果官网的网页肯定是更明智的选择。
We prioritize URLs based on usefulness, which can be measured by PageRank, website traffic, update frequency, etc. “Prioritizer” is the component that handles URL prioritization. Refer to the reference materials for in-depth information about this concept.
我们对URL基于有用性来排优先级,可以通过PageRank、网站流量、更新频率等指标来度量。“优先级排序器”(Prioritizer)是处理URL优先级排序的组件。请参阅相关文章来了解关于这个组件的更多信息。
Figure 7 shows the design that manages URL priority.
图7展示了实现URL优先级排序的设计。
Figure 8 presents the URL frontier design, and it contains two modules:
图8展示了URL前线的设计,它包含两个模块。
Web pages are constantly being added, deleted, and edited. A web crawler must periodically recrawl downloaded pages to keep our data set fresh. Recrawl all the URLs is time-consuming and resource intensive. Few strategies to optimize freshness are listed as follows:
由于互联网上不断有网页被添加、删除和修改,所以爬虫必须定期重新爬取下载过的网页,以确保数据是最新的。重新爬取所有的URL是非常消耗时间和资源的。下面是两个优化新鲜度的策略。
In real-world crawl for search engines, the number of URLs in the frontier could be hundreds of millions. Putting everything in memory is neither durable nor scalable. Keeping everything in the disk is undesirable neither because the disk is slow; and it can easily become a bottleneck for the crawl.
在搜索引擎的实际爬取过程中,URL前线中的URL数量可能上亿。把所有内容放在内存中,既不可持续也不可扩展;而把所有内容放在硬盘中也不是理想的方案,因为硬盘的访问速度很慢,很容易成为爬虫爬取数据的瓶颈。
We adopted a hybrid approach. The majority of URLs are stored on disk, so the storage space is not a problem. To reduce the cost of reading from the disk and writing to the disk, we maintain buffers in memory for enqueue/dequeue operations. Data in the buffer is periodically written to the disk.
我们采用了一种混合方案。将大部分的URL存储在硬盘上,这样存储空间就不是问题。为了降低从硬盘读/写的开销,我们在内存中维护了缓冲区以进行入队/出队操作。缓冲区中的数据会被定期写入硬盘。
The HTML Downloader downloads web pages from the internet using the HTTP protocol. Before discussing the HTML Downloader, we look at Robots Exclusion Protocol first ——robots.txt.
HTML下载器通过HTTP协议从互联网下载网页。在讨论HTML下载器之前,我们先看看机器人排除协议(Robots Exclusion Protocol)——robots.txt。
Robots.txt, called Robots Exclusion Protocol, is a standard used by websites to communicate with crawlers. It specifies what pages crawlers are allowed to download. Before attempting to crawl a web site, a crawler should check its corresponding robots.txt first and follow its rules.
robot.txt是网站和爬虫之间通信的标准。它标明了允许爬虫下载哪些网页。在尝试爬取一个网站之前,爬虫应该先检查对应的robots.txt并遵守其中的规则。
To avoid repeat downloads of robots.txt file, we cache the results of the file. The file is downloaded and saved to cache periodically. Here is a piece of robots.txt file taken from https://www.amazon.com/robots.txt. Some of the directories like creatorhub are disallowed for Google bot.
为了避免重复下载robots.txt文件,我们会缓存这个文件的结果。这个文件会被定期下载并保存在缓存中。下面是从https://www.amazon.com/robots.txt中截取的一段robots.txt文件内容。其中规定了如creatorhub之类的目录是不允许谷歌机器人爬取的。
User-agent: Googlebot
Disallow: /creatorhub/\*
Disallow: /rss/people/\*/reviews
Disallow: /gp/pdp/rss/\*/reviews
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/
Besides robots.txt, performance optimization is another important concept we will cover for the HTML downloader.
除了robots.txt,性能优化是HTML下载器中的另一个重要概念。
Below is a list of performance optimizations for HTML downloader.
下面是HTML下载器的性能优化方法。
To achieve high performance, crawl jobs are distributed into multiple servers, and each server runs multiple threads. The URL space is partitioned into smaller pieces; so, each downloader is responsible for a subset of the URLs. Figure 9 shows an example of a distributed crawl.
为了实现高性能,爬取任务被分配给多个服务器,每个服务器中运行着多个线程。URL空间被分成较小的部分,这样每个下载器只负责处理一部分URL。图9展示了一个分布式爬取的例子。
DNS Resolver is a bottleneck for crawlers because DNS requests might take time due to the synchronous nature of many DNS interfaces. DNS response time ranges from 10ms to 200ms. Once a request to DNS is carried out by a crawler thread, other threads are blocked until the first request is completed. Maintaining our DNS cache to avoid calling DNS frequently is an effective technique for speed optimization. Our DNS cache keeps the domain name to IP address mapping and is updated periodically by cron jobs.
因为很多DNS接口是同步的,所以DNS请求可能要花较长时间,导致DNS解析器成为爬虫的一个性能瓶颈。DNS响应时间在10ms到200ms之间。一旦爬虫的一个线程发出DNS请求,其他线程就会被阻塞,直到该DNS请求完成。维护DNS缓存,避免频繁向DNS服务器发起请求,是一个提高爬虫爬行速度的有效技术。我们的DNS缓存保存了域名与IP地址之间的映射,并通过定时任务进行更新。
Distribute crawl servers geographically. When crawl servers are closer to website hosts, crawlers experience faster download time. Design locality applies to most of the system components: crawl servers, cache, queue, storage, etc.
将爬虫服务器按地理位置分布。爬虫服务器离网站主机越近,爬虫的下载速度会越快。本地性设计可以应用到大部分系统组件上:爬虫服务器、缓存、队列、存储等。
Some web servers respond slowly or may not respond at all. To avoid long wait time, a maximal wait time is specified. If a host does not respond within a predefined time, the crawler will stop the job and crawl some other pages.
一些Web服务器响应慢或者根本不响应。为了避免长时间等待,需要确定一个最长等待时间。如果一个主机在预定时间内没有响应,爬虫就会停止该任务转而爬取其他页面。
Besides performance optimization, robustness is also an important consideration. We present a few approaches to improve the system robustness:
除了性能优化,健壮性也是一个重要的考虑因素。以下是一些提升系统健壮性的方法。
As almost every system evolves, one of the design goals is to make the system flexible enough to support new content types. The crawler can be extended by plugging in new modules. Figure 10 shows how to add new modules.
因为几乎所有系统都在演进,所以系统的设计目标之一就是要足够灵活以支持新的内容类型。爬虫可以通过插入新的模块来进行扩展。图10展示了如何添加新模块。
This section discusses the detection and prevention of redundant, meaningless, or harmful content.
本节讨论检测及避免重复、无意义或者有害内容的方法。
As discussed previously, nearly 30% of the web pages are duplicates. Hashes or checksums help to detect duplication.
如前所述,接近30%的网页是重复的。哈希和校验和(Checksum)可以帮助我们检测出重复内容。
A spider trap is a web page that causes a crawler in an infinite loop. For instance, an infinite deep directory structure is listed as follows: http://www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/ … Such spider traps can be avoided by setting a maximal length for URLs. However, no one-size-fits-all solution exists to detect spider traps. Websites containing spider traps are easy to identify due to an unusually large number of web pages discovered on such websites. It is hard to develop automatic algorithms to avoid spider traps; however, a user can manually verify and identify a spider trap, and either exclude those websites from the crawler or apply some customized URL filters.
蜘蛛陷阱是可以导致爬虫陷入无限循环的网页,例如,一个无限深的目录结构www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/…。可以通过设置最大URL长度来避免这样的蜘蛛陷阱。尽管如此,并不存在检测蜘蛛陷阱的通用解决方案。含有蜘蛛陷阱的网站是容易识别的,因为在这种网站上网页的数量异常多。但是很难开发出一个自动算法来躲避蜘蛛陷阱。不过,用户可以手动验证和识别蜘蛛陷阱,然后要么在爬取时排除这些网站,要么应用一些定制的URL过滤器。
Some of the contents have little or no value, such as advertisements, code snippets, spam URLs, etc. Those contents are not useful for crawlers and should be excluded if possible.
有些内容只有很少的或者根本没有任何价值,比如广告、代码片段、垃圾邮件URL等。这些内容对于爬虫来说没有用,如果有可能,应该将其排除。
In this chapter, we first discussed the characteristics of a good crawler: scalability, politeness, extensibility, and robustness. Then, we proposed a design and discussed key components. Building a scalable web crawler is not a trivial task because the web is enormously large and full of traps. Even though we have covered many topics, we still miss many relevant talking points:
在本章中,我们先讨论了好爬虫的特点——它应该具有可扩展性、礼貌性和健壮性。接着,我们给出了设计方案并讨论了关键组件。因为互联网异常庞大且充满陷阱,所以创建一个可扩展的爬虫并非易事。即使讨论了很多内容,但我们依然漏掉了很多相关的讨论点,比如:
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
恭喜你已经看到这里了。给自己一些鼓励。干得不错!