Q: Convert large string output to a dictionary
Stack Overflow user

Asked 2017-05-10 12:17:23

Answers: 2 | Views: 300 | Followers: 0 | Votes: 0

I have a function like the one below that, given a URL, looks up the domain on who.is:

import whois    

def who_is(url):
    w = whois.whois(url)
    return w.text

It returns the following as one large string:

Domain name:
    amazon.co.uk

Registrant:
    Amazon Europe Holding Technologies SCS

Registrant type:
    Unknown

Registrant's address:
    65 boulevard G-D. Charlotte
    Luxembourg City
    Luxembourg
    LU-1311
    Luxembourg

Data validation:
    Nominet was able to match the registrant's name and address against a 3rd party data source on 10-Dec-2012

Registrar:
    Amazon.com, Inc. t/a Amazon.com, Inc. [Tag = AMAZON-COM]
    URL: http://www.amazon.com

Relevant dates:
    Registered on: before Aug-1996
    Expiry date:  05-Dec-2020
    Last updated:  23-Oct-2013

Registration status:
    Registered until expiry date.

Name servers:
    ns1.p31.dynect.net
    ns2.p31.dynect.net
    ns3.p31.dynect.net
    ns4.p31.dynect.net
    pdns1.ultradns.net
    pdns2.ultradns.net
    pdns3.ultradns.org
    pdns4.ultradns.org
    pdns5.ultradns.info
    pdns6.ultradns.co.uk      204.74.115.1  2610:00a1:1017:0000:0000:0000:0000:0001

WHOIS lookup made at 21:09:42 10-May-2017

 -- 
   This WHOIS information is provided for free by Nominet UK the central registry
for .uk domain names. This information and the .uk WHOIS are:

Copyright Nominet UK 1996 - 2017.

You may not access the .uk WHOIS or use any data from it except as permitted
by the terms of use available in full at http://www.nominet.uk/whoisterms,
 which includes restrictions on: (A) use of the data for advertising, or its
 repackaging, recompilation, redistribution or reuse (B) obscuring, removing
 or hiding any or all of this notice and (C) exceeding query rate or volume
limits. The data is provided on an 'as-is' basis and may lag behind the
register. Access may be withdrawn or restricted at any time.

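Just by looking at it, I can see that the layout lends itself to being turned into a dictionary, but I don't know how to do that as efficiently as possible. I need to remove the unwanted text at the bottom and strip all the line breaks and indentation. Doing each of those steps separately is not very efficient. I would like to be able to pass any URL to the function and get back a dictionary I can work with. Any help would be appreciated.

The expected output would be: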

result = {
    'Domain name': 'amazon.co.uk',
    'Registrant': 'Amazon Europe Holding Technologies',
    'Registrant type': 'Unknown',
    # and so on for all the available fields
}

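So far I have tried removing all the \n newlines, using the remove function for \r, and then replacing all the indentation with the replace function. However, I have no idea how to remove the large block of text at the bottom.

However, the Python package's documentation says to just print w, and when I do that it returns the following: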

{
  "domain_name": null,
  "registrar": null,
  "registrar_url": "http://www.amazon.com",
  "status": null,
  "registrant_name": null,
  "creation_date": "before Aug-1996",
  "expiration_date": "2020-12-05 00:00:00",
  "updated_date": "2013-10-23 00:00:00",
  "name_servers": null
 }

As you can see, most of these values are null, yet they do have values in the text returned by w.text.

string

python-3.x

dictionary

python

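2 Answers

Stack Overflow user

Accepted answer

Posted 2017-05-10 12:37:30

Apparently you are using python-whois.

Take a look at the example. You can get all the data in structured form rather than as text that needs parsing: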

import whois
w = whois.whois('webscraping.com')
w.expiration_date  # dates converted to datetime object
# datetime.datetime(2013, 6, 26, 0, 0)
w.text  # the content downloaded from whois server
# u'\nWhois Server Version 2.0\n\nDomain names in the .com and .net ...'

print(w)  # print values of all found attributes
# creation_date: 2004-06-26 00:00:00
# domain_name: [u'WEBSCRAPING.COM', u'WEBSCRAPING.COM']
# emails: [u'WEBSCRAPING.COM@domainsbyproxy.com', u'WEBSCRAPING.COM@domainsbyproxy.com']
# expiration_date: 2013-06-26 00:00:00

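You can read the attributes you need from the whois object (w) one by one and store them in a dict, or pass the object itself to whatever function needs the information.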
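
As a minimal sketch of that idea, and assuming the object returned by whois.whois() behaves like a mapping (the dict-style output shown later in this answer suggests it does), the populated fields could be copied into a plain dict. The helper name populated_fields is hypothetical, and a hand-written dict stands in for w so the snippet runs without a network lookup.

```python
import datetime


def populated_fields(entry):
    """Return a plain dict holding only the fields that have a value."""
    return {key: value for key, value in entry.items() if value is not None}


# Stand-in for the object returned by whois.whois("amazon.co.uk");
# the values mirror the output quoted in this thread.
w = {
    "domain_name": None,
    "registrar_url": "http://www.amazon.com",
    "creation_date": "before Aug-1996",
    "expiration_date": datetime.datetime(2020, 12, 5, 0, 0),
    "name_servers": None,
}

info = populated_fields(w)
print(sorted(info))
```

This only filters what the library already parsed; it does not recover fields that came back empty.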

Is there any information in w.text that cannot be accessed as an attribute of w?

Edit:

It works for me using the same example URL as yours.

pip install python-whois
pip freeze |grep python-whois
# python-whois==0.6.5

import whois
w = whois.whois("amazon.co.uk")
w
# {'updated_date': datetime.datetime(2013, 10, 23, 0, 0), 'creation_date': 'before Aug-1996', 'registrar': None, 'registrar_url': 'http://www.amazon.com', 'domain_name': None, 'expiration_date': datetime.datetime(2020, 12, 5, 0, 0), 'name_servers': None, 'status': None, 'registrant_name': None}

Edit 2:

I think I found the problem in the parser.

The regular expression should not be

'Registrant:\n\s*(.*)'

but

'Registrant:\r\n\s*(.*)'

You could try cloning whois locally and modifying it like this (adding \r). If that works, submit a patch, or at least mention it in a bug report.
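
To illustrate why the missing \r matters, here is a small self-contained check. It assumes the server response uses \r\n line endings, as this edit suggests; the sample string and the tolerant \r?\n pattern are illustrative and are not taken from the python-whois source.

```python
import re

# Sample response fragment with Windows-style line endings (an assumption).
text = "Registrant:\r\n    Amazon Europe Holding Technologies SCS\r\n"

# Pattern as quoted above: expects "\n" immediately after the colon.
strict = re.search(r"Registrant:\n\s*(.*)", text)

# Tolerant variant: optional "\r" before "\n", and the capture stops
# at the end of the line so no trailing "\r" ends up in the value.
tolerant = re.search(r"Registrant:\r?\n\s*([^\r\n]*)", text)

print(strict)             # None, because "\r" follows the colon
print(tolerant.group(1))  # Amazon Europe Holding Technologies SCS
```

Making the \r optional keeps the pattern working for responses that use plain \n as well.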

Votes: 1

Stack Overflow user

Posted 2017-05-10 13:36:47

Try this:

from collections import OrderedDict

key_value = OrderedDict()  # use dict() if the order of keys is not important

# textstring holds the string returned by w.text.
# Blocks are separated by blank lines; each block is a "Key:" line
# followed by one or more indented value lines.
for block in textstring.split("\n\n"):
    try:
        key, value = block.split(":\n", 1)
    except ValueError:
        continue  # no "Key:" header in this block, e.g. the notice at the bottom
    key_value[key.strip()] = "\n".join(line.strip() for line in value.split("\n"))

# print the result
for key in key_value:
    print(key)
    print(key_value[key])
    print("\n")
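
The same block-splitting idea could be wrapped in a function that also drops the notice at the bottom, which is what the question asks for. The sketch below is one possible shape, not a definitive parser: the name parse_whois_text is hypothetical, the "WHOIS lookup made at" cut-off marker is taken from the sample output in the question and may differ for other registries, and the \r\n normalization is an assumption based on the accepted answer.

```python
def parse_whois_text(text):
    """Turn Nominet-style WHOIS text into a dict of {heading: value}."""
    # Normalize line endings so "\r\n" responses parse like "\n" ones.
    text = text.replace("\r\n", "\n")
    # Keep only what precedes the lookup timestamp and the legal notice.
    text = text.split("WHOIS lookup made at", 1)[0]

    result = {}
    for block in text.split("\n\n"):
        header, sep, body = block.partition(":\n")
        if not sep:
            continue  # not a "Heading:" block
        lines = [line.strip() for line in body.split("\n") if line.strip()]
        # Single-line values become strings, multi-line values become lists.
        result[header.strip()] = lines[0] if len(lines) == 1 else lines
    return result


sample = (
    "Domain name:\r\n    amazon.co.uk\r\n\r\n"
    "Registrant:\r\n    Amazon Europe Holding Technologies SCS\r\n\r\n"
    "Name servers:\r\n    ns1.p31.dynect.net\r\n    ns2.p31.dynect.net\r\n\r\n"
    "WHOIS lookup made at 21:09:42 10-May-2017\r\n\r\n"
    " -- \r\n   This WHOIS information is provided for free by Nominet UK\r\n"
)

parsed = parse_whois_text(sample)
print(parsed["Domain name"])
print(parsed["Name servers"])
```

In practice the function would be called as parse_whois_text(w.text). Returning lists for multi-line fields such as the address and name servers is a design choice; joining them with "\n" as in the answer above would work equally well.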