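Using a dictionary, I need to find and replace terms in a pandas Series according to the following criterion:
If a dictionary key is found in the pandas Series, it is replaced by the corresponding dictionary value (for example, with 'mastersphd': 'masters phd', the result is 'masters phd' wherever 'mastersphd' occurs).
Data: term_fixes is the dictionary, and df['job_description'] is the tokenized Series of interest.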
import pandas as pd

term_fixes = {'rf': 'random forest',
              'mastersphd': 'masters phd',
              'curiosity': 'curious',
              'trustworthy': 'ethical',
              'realise': 'realize'}
df = pd.DataFrame(data={'job_description': [['knowledge', 'of', 'algorithm', 'like', 'rf'],
                                            ['must', 'have', 'a', 'mastersphd'],
                                            ['trustworthy', 'and', 'possess', 'curiosity'],
                                            ['we', 'realise', 'performance', 'is', 'key']]})
Note: I have also tried an untokenized data structure (without success), but I prefer the tokenized one because I have more NLP work to do.
df = pd.DataFrame(data={'job_description': ['knowledge of algorithm like rf',
                                            'must have a mastersphd',
                                            'must be trustworthy and possess curiosity',
                                            'we realise performance is critical']})
Desired result (note that the 'rf' in 'performance' is not replaced with 'random forest'):
df['job_description']
0 ['knowledge' 'of' 'algorithm' 'like' 'random' 'forest']
1 ['must' 'have' 'a' 'masters' 'phd']
2 ['must' 'be' 'ethical' 'and' 'possess' 'curious']
3 ['we' 'realize' 'performance' 'is' 'critical']
I have tried a number of approaches.
Failed: df['job_description'].replace(list(term_fixes.keys()), list(term_fixes.values()), regex=False, inplace=True)
Failed: df['job_description'].replace(dict(zip(list(term_fixes.keys()), list(term_fixes.values()))), regex=False, inplace=True)
Failed: df['job_description'] = df['job_description'].str.replace(term_fixes, regex=False)
Failed: df['job_description'] = df['job_description'].str.replace(str(term_fixes.keys()), str(term_fixes.values()), regex=True)
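The closest I have come is:
df['job_description'] = df['job_description'].replace(term_fixes, regex=True)
However, with regex=True every match is replaced, including substrings inside longer words (such as the 'rf' in 'performance' mentioned above). Unfortunately, changing the flag to regex=False replaces nothing at all. I looked through the documentation for another argument I could use, but had no luck. Note that this uses the untokenized structure.
Any help would be greatly appreciated. Thanks!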
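A note on why the two flags behave so differently: with regex=False, Series.replace compares each dictionary key against the entire cell value, so a key that is only part of a longer string never matches. With regex=True, each key is treated as a pattern and is replaced wherever it occurs, including inside other words. The sketch below illustrates this on a small hypothetical Series named s, which is not part of the original data.

```python
import pandas as pd

term_fixes = {'rf': 'random forest', 'realise': 'realize'}

# Hypothetical sample: one cell equal to a key, two cells that merely contain a key
s = pd.Series(['rf', 'algorithm like rf', 'performance'])

# Whole-cell matching: only the cell that is exactly 'rf' changes
whole_cell = s.replace(term_fixes, regex=False)

# Pattern matching: 'rf' is replaced everywhere, even inside 'performance'
substring = s.replace(term_fixes, regex=True)

print(whole_cell.tolist())
print(substring.tolist())
```

This is why the tokenized column (one term per cell once the lists are flattened) is a natural fit for plain replace, while the untokenized strings need a pattern that is anchored to word boundaries.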
Posted on 2022-02-08 22:54:08
Use the "tokenized" version of df.
df['job_description'] = df['job_description'].explode().replace(term_fixes).groupby(level=-1).agg(list)
# explode to get single terms per "cell"
# replace to replace the terms in "term_fixes"
# groupby to reverse the previous explode and return to a column of lists
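The three steps named in the comments above can be traced one at a time. The following is a minimal sketch on an abbreviated copy of the question's data; the intermediate names exploded, replaced and regrouped are introduced here purely for illustration.

```python
import pandas as pd

term_fixes = {'rf': 'random forest', 'mastersphd': 'masters phd'}
df = pd.DataFrame({'job_description': [['knowledge', 'of', 'rf'],
                                       ['must', 'have', 'a', 'mastersphd']]})

# Step 1: one token per row; the original row label is repeated in the index
exploded = df['job_description'].explode()

# Step 2: each cell now holds a single token, so whole-cell replacement applies
replaced = exploded.replace(term_fixes)

# Step 3: regroup on the repeated index label to rebuild one list per original row
# (level=0 and level=-1 refer to the same level on a single-level index)
regrouped = replaced.groupby(level=0).agg(list)

print(exploded.index.tolist())
print(regrouped.tolist())
```

Because explode keeps the original row label on every token, the final groupby has enough information to reassemble the lists in their original order.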
job_description
0 [knowledge, of, algorithm, like, random forest]
1 [must, have, a, masters phd]
2 [ethical, and, possess, curious]
3 [we, realize, performance, is, key]
If the new terms need to be split on whitespace, you can add another intermediate step, .str.split().explode(), before the final groupby.
df['job_description'] = df['job_description'].explode().replace(term_fixes).str.split().explode().groupby(level=-1).agg(list)
job_description
0 [knowledge, of, algorithm, like, random, forest] # random forest is now split
1 [must, have, a, masters, phd] # masters phd is now split
2 [ethical, and, possess, curious]
3 [we, realize, performance, is, key]
Posted on 2022-02-08 17:30:37
You can use something like the following to handle the untokenized data.
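for k in term_fixes:
    # Match k only when it is delimited by spaces or by the start/end of the string.
    # regex=True is stated explicitly because str.replace defaults to literal matching in pandas >= 2.0.
    df['job_description'] = df['job_description'].str.replace(
        r'(^|(?<= )){}((?= )|$)'.format(k), term_fixes[k], regex=True)
print(df)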
job_description
0 knowledge of algorithm like random forest
1 must have a masters phd
2 must be ethical and possess curious
3 we realize performance is critical
https://stackoverflow.com/questions/71038063
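The per-key loop in the answer above could also be collapsed into a single pass. The following is a sketch of a different technique, not the answerer's method: it builds one alternation pattern from all keys, uses \b word boundaries instead of the space-based lookarounds, and escapes each key with re.escape. Note the assumption that punctuation may count as a boundary, which \b allows and the lookaround version does not. The names pattern and fixed are illustrative.

```python
import re
import pandas as pd

term_fixes = {'rf': 'random forest',
              'mastersphd': 'masters phd',
              'curiosity': 'curious',
              'trustworthy': 'ethical',
              'realise': 'realize'}
df = pd.DataFrame({'job_description': ['knowledge of algorithm like rf',
                                       'must have a mastersphd',
                                       'must be trustworthy and possess curiosity',
                                       'we realise performance is critical']})

# One pattern for all keys; \b keeps 'rf' from matching inside 'performance',
# and re.escape guards against keys that contain regex metacharacters
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, term_fixes)) + r')\b')

# The callable looks up the replacement for whichever key was matched
fixed = df['job_description'].str.replace(
    pattern, lambda m: term_fixes[m.group(0)], regex=True)

print(fixed.tolist())
```

A single pattern scans each string once rather than once per dictionary entry, which may matter when term_fixes is large.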