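I recently tried HtmlAgilityPack for parsing HTML, and during testing the program threw a StackOverflowException. According to MSDN, starting with .NET Framework 2.0 a StackOverflowException can no longer be caught with a try-catch block, and by default the process is terminated.
Looking into the cause, I found that when an HTML document is very complex (for example, deeply nested), HtmlAgilityPack recurses very deeply, and that is what triggers the StackOverflowException. After some googling, I found the solution below.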
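To make the cause concrete, here is a minimal sketch of how recursion depth follows nesting depth. DemoNode and DepthDemo are hypothetical names made up for this illustration; they are not HtmlAgilityPack types, and the library's real traversal code is more involved.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal tree node, used only for illustration.
public class DemoNode
{
    public List<DemoNode> Children = new List<DemoNode>();
}

public static class DepthDemo
{
    // Builds a chain in which every node has exactly one child,
    // the shape produced by markup such as <div><div><div>...
    public static DemoNode BuildChain(int depth)
    {
        var root = new DemoNode();
        var current = root;
        for (int i = 1; i < depth; i++)
        {
            var child = new DemoNode();
            current.Children.Add(child);
            current = child;
        }
        return root;
    }

    // Recursive walk: the return value equals the number of nested calls
    // that were active at the deepest point, i.e. the stack frames needed.
    public static int MaxRecursionDepth(DemoNode node)
    {
        int deepest = 0;
        foreach (DemoNode child in node.Children)
            deepest = Math.Max(deepest, MaxRecursionDepth(child));
        return deepest + 1;
    }
}
```

Every extra level of nesting costs one more stack frame, so a sufficiently deep document will exhaust the thread's stack no matter how small each frame is.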
First, add a new class to the library:
using System;
using System.Runtime.InteropServices;

public class StackChecker
{
    // Returns true if at least `bytes` of stack, plus a safety reserve, remain on the
    // current thread. Windows only; the project must be compiled with /unsafe.
    public unsafe static bool HasSufficientStack(long bytes)
    {
        var stackInfo = new MEMORY_BASIC_INFORMATION();
        // We subtract one page for our request. VirtualQuery rounds UP to the next page.
        // Unfortunately, the stack grows down. If we're on the first page (last page in the
        // VirtualAlloc), we'll be moved to the next page, which is off the stack! Note this
        // doesn't work right for IA64 due to bigger pages.
        // Pointer arithmetic is used instead of a uint cast so the address is not
        // truncated in a 64-bit process.
        IntPtr currentAddr = new IntPtr((byte*)&stackInfo - 4096);
        // Query for the current stack allocation information. A return value of zero
        // means the query failed, so conservatively report insufficient stack.
        UIntPtr length = new UIntPtr((uint)sizeof(MEMORY_BASIC_INFORMATION));
        if (VirtualQuery(currentAddr, ref stackInfo, length) == UIntPtr.Zero)
            return false;
        // If the current address minus the base (remember: the stack grows downward in the
        // address space) is greater than the number of bytes requested plus the reserved
        // space at the end, the request has succeeded.
        return (currentAddr.ToInt64() - stackInfo.AllocationBase.ToInt64()) >
            (bytes + STACK_RESERVED_SPACE);
    }

    // We are conservative here. We assume that the platform needs a whole 16 pages to
    // respond to stack overflow (using an x86/x64 page-size, not IA64). That's 64KB,
    // which means that for very small stacks (e.g. 128KB) we'll fail a lot of stack checks
    // incorrectly.
    private const long STACK_RESERVED_SPACE = 4096 * 16;

    // The return value and dwLength are SIZE_T in the native signature, hence UIntPtr.
    [DllImport("kernel32.dll")]
    private static extern UIntPtr VirtualQuery(
        IntPtr lpAddress,
        ref MEMORY_BASIC_INFORMATION lpBuffer,
        UIntPtr dwLength);

    // Pointer-sized fields keep the layout correct in both 32-bit and 64-bit processes.
    [StructLayout(LayoutKind.Sequential)]
    private struct MEMORY_BASIC_INFORMATION
    {
        internal IntPtr BaseAddress;
        internal IntPtr AllocationBase;
        internal uint AllocationProtect;
        internal UIntPtr RegionSize;
        internal uint State;
        internal uint Protect;
        internal uint Type;
    }
}
Then, at the places where recursion runs deep (such as HtmlNode.WriteTo(TextWriter outText) and HtmlNode.WriteTo(XmlWriter writer)), add the following code:
if (!StackChecker.HasSufficientStack(4*1024))
throw new Exception("The document is too complex to parse");
OK, all done!
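For reference, here is a rough sketch of where such a guard sits inside a recursive method. Note that it swaps in a different technique from the StackChecker class above: it assumes .NET Framework 4.0 or later and uses the built-in RuntimeHelpers.EnsureSufficientExecutionStack(), which throws a catchable InsufficientExecutionStackException. SketchNode and GuardDemo are hypothetical names for this illustration and are not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for a DOM node; this is NOT HtmlAgilityPack's HtmlNode.
public class SketchNode
{
    public string Name;
    public List<SketchNode> Children = new List<SketchNode>();

    public SketchNode(string name) { Name = name; }

    // Recursive serializer: one stack frame per nesting level.
    public void WriteTo(TextWriter output)
    {
        // Guard at the top of the recursive method. Unlike StackOverflowException,
        // the InsufficientExecutionStackException thrown here can be caught.
        RuntimeHelpers.EnsureSufficientExecutionStack();

        output.Write("<" + Name + ">");
        foreach (SketchNode child in Children)
            child.WriteTo(output);
        output.Write("</" + Name + ">");
    }
}

public static class GuardDemo
{
    // Builds `depth` nested <div> elements without recursing.
    public static SketchNode BuildNested(int depth)
    {
        var root = new SketchNode("div");
        var current = root;
        for (int i = 1; i < depth; i++)
        {
            var child = new SketchNode("div");
            current.Children.Add(child);
            current = child;
        }
        return root;
    }

    // Returns false instead of crashing the process when the tree is too deep.
    public static bool TryRender(SketchNode root, out string html)
    {
        var writer = new StringWriter();
        try
        {
            root.WriteTo(writer);
            html = writer.ToString();
            return true;
        }
        catch (InsufficientExecutionStackException)
        {
            html = null;
            return false;
        }
    }
}
```

Either way the idea is the same: check the remaining stack before descending another level, and turn a fatal overflow into an ordinary exception the caller can handle.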