This chapter discusses setting up data, preparing data, and premodel dimensionality reduction. These are not the attractive parts of machine learning (ML), but they often turn out to be what determines whether a model will work or not.
There are three main parts to the chapter. First, we'll create fake data; this might seem trivial, but creating fake data and fitting models to it is an important step in model testing. It's most useful when we implement an algorithm from scratch, but I'll cover it here for completeness, and in the event you don't have data of your own, you can simply create some. Second, we'll look at broadly handling data transformations as a preprocessing step, which includes data imputation, categorical variable encoding, and so on. Third, we'll look at situations where we have a large number of features relative to the number of observations.
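If you want a taste of the fake-data idea right away, scikit-learn's generator functions will produce a toy problem in a couple of lines. This is only a minimal sketch; the generator and parameters below are one illustrative choice among many, and the recipes later in the chapter cover these functions properly.

>>> from sklearn import datasets
>>> # make_regression draws a random linear regression problem: X holds the
>>> # features, y the response; random_state makes the draw repeatable
>>> X, y = datasets.make_regression(n_samples=100, n_features=3,
...                                 noise=1.0, random_state=0)
>>> X.shape, y.shape
((100, 3), (100,))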
This chapter, especially the first half, will set the stage for the later chapters. In order to use scikit-learn, data is required. The first two sections will discuss acquiring the data; the rest of the first half will discuss preparing this data for use.
This book is written using scikit-learn 0.15, NumPy 1.9, and pandas 0.13. There are other packages used as well, so it's advisable that you refer to the installation instructions included in this book.

Getting sample data from external sources

If possible, try working with a familiar dataset while working through this book; in order to level the field, built-in datasets will be used. The built-in datasets can be used as stand-ins to test several different modeling techniques, such as regression and classification. These are, for the most part, famous datasets. This is very useful, as papers in various fields will often use these datasets, allowing authors to show how their model fits compared with other models.
I recommend you use IPython to run these commands as they are presented. Muscle memory is important, and it's best to get to the point where basic commands take no extra mental effort. An even better way might be to run IPython Notebook. If you do, make sure to use the %matplotlib inline command; this will allow you to see the plots in Notebook.
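For instance, a first Notebook cell might look like the following; the matplotlib import is only there to show where the magic sits relative to your plotting code (a sketch, not a required setup):

%matplotlib inline
import matplotlib.pyplot as plt  # plots will now render inline in the Notebook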
The datasets in scikit-learn are contained within the datasets module. Use the following commands to import these datasets:
>>> from sklearn import datasets
>>> import numpy as np
From within IPython, run datasets.*?, which will list everything available within the datasets module.
How to do it…
There are two main types of data within the datasets module. Smaller test datasets are included in the sklearn package and can be viewed by running datasets.load_*?. Larger datasets are also available for download as required. The latter are not included in sklearn by default; however, at times, they are better for testing models and algorithms because they have enough complexity to represent realistic situations.
>>> boston = datasets.load_boston()
>>> print(boston.DESCR) # output omitted due to length
DESCR will present a basic overview of the data to give you some context.
>>> housing = datasets.fetch_california_housing()
downloading Cal. housing from http://lib.stat.cmu.edu ...
>>> print(housing.DESCR) # output omitted due to length
How it works…
When these datasets are loaded, they aren't loaded as NumPy arrays. They are of type Bunch. A Bunch is a common data structure in Python; it's essentially a dictionary whose keys are also added to the object as attributes.
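A quick way to see the dictionary-with-attributes behavior is the following; the exact set of keys varies from dataset to dataset, but data and target are always among them:

>>> 'data' in boston.keys()  # it behaves like a dictionary...
True
>>> boston['data'] is boston.data  # ...and the keys double as attributes
True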
Access the data using the (surprise!) data attribute, which is a NumPy array containing the independent variables; the target attribute holds the dependent variable:
>>> X, y = boston.data, boston.target  # X and y are NumPy arrays
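As a sanity check, the shapes follow scikit-learn's usual convention of one row per observation and one column per feature; the Boston data has 506 observations and 13 features:

>>> X.shape  # 506 observations by 13 features
(506, 13)
>>> y.shape  # one target value per observation
(506,)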
There are various implementations of the Bunch object available on the Web; it's not too difficult to write one on your own. scikit-learn defines Bunch (as of this writing) in the base module. It's available on GitHub at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/base.py.
There's more…
When you fetch a dataset from an external source, it will, by default, place the data in your home directory under scikit_learn_data/; this behavior is configurable in two ways:
1. To modify the default behavior, set the SCIKIT_LEARN_DATA environment variable to point to the desired folder.
2. The first argument of the fetch methods is data_home, which will specify the home folder on a case-by-case basis.
It is easy to check the default location by calling datasets.get_data_home().
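Putting those two options together, a short sketch follows; the paths shown are purely illustrative, and data_home is passed as a keyword argument here:

>>> datasets.get_data_home()  # the default location under your home directory
'/home/user/scikit_learn_data'
>>> # redirect a single fetch to another folder via data_home
>>> housing = datasets.fetch_california_housing(data_home='/tmp/sklearn_data')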