from pyspark.ml.feature import Tokenizer

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentenceDataFrame)

If I run the command

tokenized.head()

I expect to get a result like this:
Row(id=0, sentence='Hi I heard about Spark',
    words=['H', 'i', ' ', 'h', 'e', 'a', ...])

But the result I actually get is:
Row(id=0, sentence='Hi I heard about Spark',
    words=['Hi', 'I', 'heard', 'about', 'spark'])

Is there a way to achieve this with Tokenizer or RegexTokenizer in PySpark?
Posted on 2018-01-16 19:02:13
Take a look at the pyspark.ml documentation. Tokenizer simply splits on whitespace, but RegexTokenizer, as its name suggests, uses a regular expression to find either the split points or the tokens to be extracted (configurable via the gaps parameter).
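To make the gaps parameter concrete, here is a minimal sketch using the column names from the question (only documented RegexTokenizer parameters are used): with gaps=True the pattern describes the separators between tokens, and with gaps=False it describes the tokens themselves.

from pyspark.ml.feature import RegexTokenizer

# gaps=True (the default): the pattern matches the gaps between tokens,
# so "\s+" splits each sentence on runs of whitespace.
gap_tok = RegexTokenizer(inputCol="sentence", outputCol="words",
                         pattern="\\s+", gaps=True)

# gaps=False: the pattern matches the tokens themselves,
# so "\w+" extracts runs of word characters instead.
match_tok = RegexTokenizer(inputCol="sentence", outputCol="words",
                           pattern="\\w+", gaps=False)

Both of these produce word-level tokens on sentences like the ones above; the trick below relies on the gaps=True behavior with an empty separator pattern.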
If you pass an empty pattern and leave gaps=True (which is the default), you should get the result you want:
from pyspark.ml.feature import RegexTokenizer
tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="")
tokenized = tokenizer.transform(sentenceDataFrame)
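One caveat worth checking when you run this: RegexTokenizer lowercases its output by default, so the first characters come back as 'h', 'i', ... rather than 'H', 'i', .... Below is a small sketch of verifying the result and, if needed, preserving case via the documented toLowercase parameter (the variable names here are just illustrative):

# Inspect the first row; with the default toLowercase=True the
# characters are lowercased, e.g. ['h', 'i', ' ', ...].
tokenized.select("sentence", "words").head()

# Assumption: you want the original case ('H' rather than 'h'),
# so disable lowercasing explicitly.
cased_tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words",
                                 pattern="", toLowercase=False)
tokenized_cased = cased_tokenizer.transform(sentenceDataFrame)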
Source: https://stackoverflow.com/questions/48278489