from pyspark.ml.feature import Tokenizer

sentenceDataFrame = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(1, "I wish Java could use case classes"),
(2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentenceDataFrame)
If I run the command
tokenized.head()
I would expect a result like this:
Row(id=0, sentence='Hi I heard about Spark',
    words=['H', 'i', ' ', 'h', 'e', 'a', ...])
But the actual result is:
Row(id=0, sentence='Hi I heard about Spark',
    words=['hi', 'i', 'heard', 'about', 'spark'])
Is there a way to achieve this with Tokenizer or RegexTokenizer in PySpark?
Posted on 2018-01-16 11:02:13
Take a look at the pyspark.ml documentation. Tokenizer splits only on whitespace, but RegexTokenizer, as the name suggests, uses a regular expression to find either the split points or the tokens to be extracted (configurable via the gaps parameter).
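To illustrate the two modes concretely (a minimal sketch reusing the question's sentenceDataFrame), with gaps=True the pattern describes the delimiters, while with gaps=False it describes the tokens themselves:

from pyspark.ml.feature import RegexTokenizer

# gaps=True (the default): the pattern marks the split points (delimiters)
splitter = RegexTokenizer(inputCol="sentence", outputCol="words",
                          pattern=",", gaps=True)
splitter.transform(sentenceDataFrame).collect()[2]
# Row(id=2, ..., words=['logistic', 'regression', 'models', 'are', 'neat'])

# gaps=False: the pattern describes the tokens to extract
matcher = RegexTokenizer(inputCol="sentence", outputCol="words",
                         pattern="\\w+", gaps=False)
matcher.transform(sentenceDataFrame).collect()[0]
# Row(id=0, ..., words=['hi', 'i', 'heard', 'about', 'spark'])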
If you pass an empty pattern and leave gaps=True (which is the default), you should get the result you want:
from pyspark.ml.feature import RegexTokenizer

# an empty pattern with gaps=True splits the string between every character
tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="")
tokenized = tokenizer.transform(sentenceDataFrame)
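Note that RegexTokenizer also lowercases its output by default, so the characters come back in lower case; if you want to preserve the original casing shown in the expected output above, disable it with toLowercase=False:

tokenized.head()
# Row(id=0, sentence='Hi I heard about Spark',
#     words=['h', 'i', ' ', 'h', 'e', 'a', 'r', 'd', ...])

# keep the original case by disabling the default lowercasing
tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words",
                           pattern="", toLowercase=False)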
https://stackoverflow.com/questions/48278489