Numbers & Text: Feature Extraction

Feature Extraction

Numeric Feature Extraction

Numeric values can generally be used as features directly, but in a multi-dimensional feature vector, a single feature with an especially large value range can drown out the influence of the other features on the result. In such cases we need to preprocess the numeric features. Common preprocessing methods include the following.

1. Standardization: rescale each feature (column) to zero mean and unit variance:

>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[1.,-1.,2],[2.,0.,0.],[0.,1.,-1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
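
After scaling, every column should have zero mean and unit variance, which is easy to verify (a quick check; the exact float formatting depends on your NumPy version):

>>> X_scaled.mean(axis=0)
array([ 0.,  0.,  0.])
>>> X_scaled.std(axis=0)
array([ 1.,  1.,  1.])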

2. Normalization: rescale each sample (row) so that it has unit norm, here the L2 norm:

>>> X = [[1.,-1.,2.],[2.,0.,0.],[0.,1.,-1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized
array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
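
Every row of X_normalized now has unit L2 norm, which we can check directly:

>>> np.linalg.norm(X_normalized, axis=1)
array([ 1.,  1.,  1.])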

3. Min-max scaling: rescale each feature to the range [0, 1]:

>>> X_train = np.array([[1.,-1.,2.],[2.,0., 0.],[0.,-1.,-1.]])
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  1.        ,  0.33333333],
       [ 0.        ,  0.        ,  0.        ]])
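
The fitted scaler can then be applied to new data via transform; values outside the training range map outside [0, 1]. A minimal sketch with a made-up sample X_new:

>>> X_new = np.array([[-3., -1., 4.]])
>>> min_max_scaler.transform(X_new)
array([[-1.5       ,  0.        ,  1.66666667]])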

Text Feature Extraction

Text feature extraction essentially amounts to word segmentation, with each distinct word treated as a new feature. Take a hash (dict) structure as an example:

>>> measurements=[{'city':'Dubai','temperature':33.},
...     {'city':'London','temperature':12.},
...     {'city':'San Francisco','temperature':18.}
... ]

The key city takes multiple values, 'Dubai', 'London', and 'San Francisco', and each value can simply be turned into a new feature. The key temperature is numeric and can be used as a feature directly.

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
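
The fitted DictVectorizer can likewise transform new measurements into the same feature space (city values it has never seen are silently dropped). A small sketch with a made-up record:

>>> vec.transform([{'city': 'London', 'temperature': 20.}]).toarray()
array([[  0.,   1.,   0.,  20.]])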

Text feature extraction has two very important models.

  • Set-of-words model: the set of words appearing in a document; every element of the set is unique, so the word set records each word only once.
  • Bag-of-words model: if a word appears in the document more than once, its number of occurrences (its frequency) is counted as well.

The essential difference is that the bag of words adds a frequency dimension on top of the word set: the word set cares only about whether a word is present, while the bag of words also cares about how many times it appears.
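
In scikit-learn the two models correspond to the binary parameter of CountVectorizer; a minimal sketch on a one-document corpus (the corpus is made up):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> docs = ['apple apple banana']
>>> CountVectorizer().fit_transform(docs).toarray()             # bag of words: counts
array([[2, 1]])
>>> CountVectorizer(binary=True).fit_transform(docs).toarray()  # set of words: presence only
array([[1, 1]])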

Import the relevant library:

>>> from sklearn.feature_extraction.text import CountVectorizer

Instantiate the vectorizer object:

>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Run the text through the bag-of-words transform:

>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?']
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<type 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

Get the corresponding feature names:

>>> vectorizer.get_feature_names() == (
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
True

Get the bag-of-words data; at this point the bag-of-words conversion is complete. But how can other text in the program be vectorized using the existing bag-of-words features?

>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

We call the feature space of the bag of words its vocabulary:

>>> vocabulary=vectorizer.vocabulary_
>>> vocabulary
{u'and': 0, u'third': 7, u'this': 8, u'is': 3, u'one': 4, u'second': 5, u'the': 6, u'document': 1, u'first': 2}

When applying bag-of-words processing to other text, the existing vocabulary can be reused directly:

>>> new_vectorizer = CountVectorizer(min_df=1, vocabulary=vocabulary)
>>> new_vectorizer
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None,
        vocabulary={u'and': 0, u'third': 7, u'this': 8, u'is': 3, u'one': 4, u'second': 5, u'the': 6, u'document': 1, u'first': 2})
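
With the vocabulary pinned, the new vectorizer maps unseen text onto exactly the same feature space; a quick sketch with a made-up sentence:

>>> new_vectorizer.fit_transform(['This is the second document.']).toarray()
array([[0, 1, 0, 1, 0, 1, 1, 0, 1]])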

TensorFlow has a similar implementation. Its VocabularyProcessor maps each document to a fixed-length sequence of word IDs (padded or truncated to MAX_DOCUMENT_LENGTH) rather than to a count vector:

import numpy as np
from tensorflow.contrib import learn

MAX_DOCUMENT_LENGTH = 100

# Build the vocabulary from the training text and convert each document
# into a fixed-length sequence of word IDs.
vocab_processor = learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
X_train = np.array(list(vocab_processor.fit_transform(X_train)))
# Reuse the same vocabulary when converting the test text.
X_test = np.array(list(vocab_processor.transform(X_test)))
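
On a toy corpus the shape is easy to see: each document becomes one row of word IDs, zero-padded to the requested length (a small sketch; the IDs themselves depend on insertion order, with 0 reserved for padding):

docs = ['this is the first document', 'and the third one']
toy_processor = learn.preprocessing.VocabularyProcessor(max_document_length=6)
ids = np.array(list(toy_processor.fit_transform(docs)))
# ids.shape == (2, 6); trailing unused positions are 0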

Reading Data

When working with data, CSV is the most common format: each line of the file records one vector, with the last column holding the label. TensorFlow provides a very convenient way to read a dataset from a CSV file.

Load the corresponding libraries:

>>> import tensorflow as tf
>>> import numpy as np

Read the data from the CSV file:

>>> training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
... filename="iris_training.csv",
... target_dtype=np.int,
... features_dtype=np.float32)
>>> feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]
>>> x=training_set.data
>>> y=training_set.target
>>> x
array([[ 6.4000001 ,  2.79999995,  5.5999999 ,  2.20000005],
       [ 5.        ,  2.29999995,  3.29999995,  1.        ],
       [ 4.9000001 ,  2.5       ,  4.5       ,  1.70000005],
       [ 4.9000001 ,  3.0999999 ,  1.5       ,  0.1       ],
       [ 5.69999981,  3.79999995,  1.70000005,  0.30000001],
       [ 4.4000001 ,  3.20000005,  1.29999995,  0.2       ],
       [ 5.4000001 ,  3.4000001 ,  1.5       ,  0.40000001],
       [ 6.9000001 ,  3.0999999 ,  5.0999999 ,  2.29999995],
       [ 6.69999981,  3.0999999 ,  4.4000001 ,  1.39999998],
       [ 5.0999999 ,  3.70000005,  1.5       ,  0.40000001],
       [ 5.19999981,  2.70000005,  3.9000001 ,  1.39999998],
       [ 6.9000001 ,  3.0999999 ,  4.9000001 ,  1.5       ],
       [ 5.80000019,  4.        ,  1.20000005,  0.2       ],
       [ 5.4000001 ,  3.9000001 ,  1.70000005,  0.40000001],
       [ 7.69999981,  3.79999995,  6.69999981,  2.20000005],
       [ 6.30000019,  3.29999995,  4.69999981,  1.60000002],
       [ 6.80000019,  3.20000005,  5.9000001 ,  2.29999995],
       [ 7.5999999 ,  3.        ,  6.5999999 ,  2.0999999 ],
       [ 6.4000001 ,  3.20000005,  5.30000019,  2.29999995],
       [ 5.69999981,  4.4000001 ,  1.5       ,  0.40000001],
       [ 6.69999981,  3.29999995,  5.69999981,  2.0999999 ],
       [ 6.4000001 ,  2.79999995,  5.5999999 ,  2.0999999 ],
       [ 5.4000001 ,  3.9000001 ,  1.29999995,  0.40000001],
       [ 6.0999999 ,  2.5999999 ,  5.5999999 ,  1.39999998],
       [ 7.19999981,  3.        ,  5.80000019,  1.60000002],
       [ 5.19999981,  3.5       ,  1.5       ,  0.2       ],
       [ 5.80000019,  2.5999999 ,  4.        ,  1.20000005],
       [ 5.9000001 ,  3.        ,  5.0999999 ,  1.79999995],
       [ 5.4000001 ,  3.        ,  4.5       ,  1.5       ],
       [ 6.69999981,  3.        ,  5.        ,  1.70000005],
       [ 6.30000019,  2.29999995,  4.4000001 ,  1.29999995],
       [ 5.0999999 ,  2.5       ,  3.        ,  1.10000002],
       [ 6.4000001 ,  3.20000005,  4.5       ,  1.5       ],
       [ 6.80000019,  3.        ,  5.5       ,  2.0999999 ],
       [ 6.19999981,  2.79999995,  4.80000019,  1.79999995],
       [ 6.9000001 ,  3.20000005,  5.69999981,  2.29999995],
       [ 6.5       ,  3.20000005,  5.0999999 ,  2.        ],
       [ 5.80000019,  2.79999995,  5.0999999 ,  2.4000001 ],
       [ 5.0999999 ,  3.79999995,  1.5       ,  0.30000001],
       [ 4.80000019,  3.        ,  1.39999998,  0.30000001],
       [ 7.9000001 ,  3.79999995,  6.4000001 ,  2.        ],
       [ 5.80000019,  2.70000005,  5.0999999 ,  1.89999998],
       [ 6.69999981,  3.        ,  5.19999981,  2.29999995],
       [ 5.0999999 ,  3.79999995,  1.89999998,  0.40000001],
       [ 4.69999981,  3.20000005,  1.60000002,  0.2       ],
       [ 6.        ,  2.20000005,  5.        ,  1.5       ],
       [ 4.80000019,  3.4000001 ,  1.60000002,  0.2       ],
       [ 7.69999981,  2.5999999 ,  6.9000001 ,  2.29999995],
       [ 4.5999999 ,  3.5999999 ,  1.        ,  0.2       ],
       [ 7.19999981,  3.20000005,  6.        ,  1.79999995],
       [ 5.        ,  3.29999995,  1.39999998,  0.2       ],
       [ 6.5999999 ,  3.        ,  4.4000001 ,  1.39999998],
       [ 6.0999999 ,  2.79999995,  4.        ,  1.29999995],
       [ 5.        ,  3.20000005,  1.20000005,  0.2       ],
       [ 7.        ,  3.20000005,  4.69999981,  1.39999998],
       [ 6.        ,  3.        ,  4.80000019,  1.79999995],
       [ 7.4000001 ,  2.79999995,  6.0999999 ,  1.89999998],
       [ 5.80000019,  2.70000005,  5.0999999 ,  1.89999998],
       [ 6.19999981,  3.4000001 ,  5.4000001 ,  2.29999995],
       [ 5.        ,  2.        ,  3.5       ,  1.        ],
       [ 5.5999999 ,  2.5       ,  3.9000001 ,  1.10000002],
       [ 6.69999981,  3.0999999 ,  5.5999999 ,  2.4000001 ],
       [ 6.30000019,  2.5       ,  5.        ,  1.89999998],
       [ 6.4000001 ,  3.0999999 ,  5.5       ,  1.79999995],
       [ 6.19999981,  2.20000005,  4.5       ,  1.5       ],
       [ 7.30000019,  2.9000001 ,  6.30000019,  1.79999995],
       [ 4.4000001 ,  3.        ,  1.29999995,  0.2       ],
       [ 7.19999981,  3.5999999 ,  6.0999999 ,  2.5       ],
       [ 6.5       ,  3.        ,  5.5       ,  1.79999995],
       [ 5.        ,  3.4000001 ,  1.5       ,  0.2       ],
       [ 4.69999981,  3.20000005,  1.29999995,  0.2       ],
       [ 6.5999999 ,  2.9000001 ,  4.5999999 ,  1.29999995],
       [ 5.5       ,  3.5       ,  1.29999995,  0.2       ],
       [ 7.69999981,  3.        ,  6.0999999 ,  2.29999995],
       [ 6.0999999 ,  3.        ,  4.9000001 ,  1.79999995],
       [ 4.9000001 ,  3.0999999 ,  1.5       ,  0.1       ],
       [ 5.5       ,  2.4000001 ,  3.79999995,  1.10000002],
       [ 5.69999981,  2.9000001 ,  4.19999981,  1.29999995],
       [ 6.        ,  2.9000001 ,  4.5       ,  1.5       ],
       [ 6.4000001 ,  2.70000005,  5.30000019,  1.89999998],
       [ 5.4000001 ,  3.70000005,  1.5       ,  0.2       ],
       [ 6.0999999 ,  2.9000001 ,  4.69999981,  1.39999998],
       [ 6.5       ,  2.79999995,  4.5999999 ,  1.5       ],
       [ 5.5999999 ,  2.70000005,  4.19999981,  1.29999995],
       [ 6.30000019,  3.4000001 ,  5.5999999 ,  2.4000001 ],
       [ 4.9000001 ,  3.0999999 ,  1.5       ,  0.1       ],
       [ 6.80000019,  2.79999995,  4.80000019,  1.39999998],
       [ 5.69999981,  2.79999995,  4.5       ,  1.29999995],
       [ 6.        ,  2.70000005,  5.0999999 ,  1.60000002],
       [ 5.        ,  3.5       ,  1.29999995,  0.30000001],
       [ 6.5       ,  3.        ,  5.19999981,  2.        ],
       [ 6.0999999 ,  2.79999995,  4.69999981,  1.20000005],
       [ 5.0999999 ,  3.5       ,  1.39999998,  0.30000001],
       [ 4.5999999 ,  3.0999999 ,  1.5       ,  0.2       ],
       [ 6.5       ,  3.        ,  5.80000019,  2.20000005],
       [ 4.5999999 ,  3.4000001 ,  1.39999998,  0.30000001],
       [ 4.5999999 ,  3.20000005,  1.39999998,  0.2       ],
       [ 7.69999981,  2.79999995,  6.69999981,  2.        ],
       [ 5.9000001 ,  3.20000005,  4.80000019,  1.79999995],
       [ 5.0999999 ,  3.79999995,  1.60000002,  0.2       ],
       [ 4.9000001 ,  3.        ,  1.39999998,  0.2       ],
       [ 4.9000001 ,  2.4000001 ,  3.29999995,  1.        ],
       [ 4.5       ,  2.29999995,  1.29999995,  0.30000001],
       [ 5.80000019,  2.70000005,  4.0999999 ,  1.        ],
       [ 5.        ,  3.4000001 ,  1.60000002,  0.40000001],
       [ 5.19999981,  3.4000001 ,  1.39999998,  0.2       ],
       [ 5.30000019,  3.70000005,  1.5       ,  0.2       ],
       [ 5.        ,  3.5999999 ,  1.39999998,  0.2       ],
       [ 5.5999999 ,  2.9000001 ,  3.5999999 ,  1.29999995],
       [ 4.80000019,  3.0999999 ,  1.60000002,  0.2       ],
       [ 6.30000019,  2.70000005,  4.9000001 ,  1.79999995],
       [ 5.69999981,  2.79999995,  4.0999999 ,  1.29999995],
       [ 5.        ,  3.        ,  1.60000002,  0.2       ],
       [ 6.30000019,  3.29999995,  6.        ,  2.5       ],
       [ 5.        ,  3.5       ,  1.60000002,  0.60000002],
       [ 5.5       ,  2.5999999 ,  4.4000001 ,  1.20000005],
       [ 5.69999981,  3.        ,  4.19999981,  1.20000005],
       [ 4.4000001 ,  2.9000001 ,  1.39999998,  0.2       ],
       [ 4.80000019,  3.        ,  1.39999998,  0.1       ],
       [ 5.5       ,  2.4000001 ,  3.70000005,  1.        ]], dtype=float32)
>>> y
array([2, 1, 2, 0, 0, 0, 0, 2, 1, 0, 1, 1, 0, 0, 2, 1, 2, 2, 2, 0, 2, 2, 0,
       2, 2, 0, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 0, 2,
       0, 2, 0, 2, 0, 1, 1, 0, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 0, 2, 2,
       0, 0, 1, 0, 2, 2, 0, 1, 1, 1, 2, 0, 1, 1, 1, 2, 0, 1, 1, 1, 0, 2, 1,
       0, 0, 2, 0, 0, 2, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 2, 1, 0, 2, 0,
       1, 1, 0, 0, 1])

The parameters are defined as follows:

  • filename: the file name;
  • target_dtype: the data type of the labels;
  • features_dtype: the data type of the features.

Download address for the test file: iris_training.csv
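
For reference, load_csv_with_header expects the first line of the file to carry the sample count, the feature count, and the class names. The iris_training.csv file from the TensorFlow tutorials starts like this (abridged; shown as a sketch of the expected layout):

120,4,setosa,versicolor,virginica
6.4,2.8,5.6,2.2,2
5.0,2.3,3.3,1.0,1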

Validating Results

Validation is a very important part of machine learning, and cross-validation is the most commonly used technique. A typical validation workflow is shown in the figure.

Taking SVM as an example, import the SVM module and scikit-learn's built-in sample library datasets:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> from sklearn import svm

Load the sample data:

>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))

To ensure a fair evaluation, use the train_test_split function to randomly split the samples into a training set and a test set:

>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)
>>> X_train.shape,y_train.shape
((90, 4), (90,))
>>> X_test.shape,y_test.shape
((60, 4), (60,))

Train the SVM:

>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

The trained classifier:

>>> clf
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Comparing the predictions against the test sample labels yields the accuracy:

>>> clf.score(X_test, y_test)
0.96666666666666667

To make validation more reliable, a common approach is K-fold cross-validation. In K-fold cross-validation the initial sample is partitioned into K subsamples: one subsample is held out as validation data, and the remaining K-1 subsamples are used for training. The process is repeated K times, so that each subsample is used for validation exactly once, and the K results are averaged (or combined in some other way) into a single estimate. The principle of 3-fold cross-validation is illustrated in the figure below.

The advantage of this method is that randomly generated subsamples are used repeatedly for both training and validation, with each subsample validated exactly once; 10-fold cross-validation is the most commonly used in practice. Continuing the example above, a 5-fold cross-validation is implemented as follows:

>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores
array([ 0.96666667,  1.        ,  0.96666667,  0.96666667,  1.        ])
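
The K scores are usually summarized by their mean (the long float repr below comes from Python 2; newer versions print it more compactly):

>>> scores.mean()
0.98000000000000009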