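I have a column of strings containing full names. Last names are written in all caps, while first names are given in proper case. Most names are ordered as (Firstname LASTNAME), but many have the LASTNAME part in the middle or at the start of the string, as in the last two entries here.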
0 Manuel JOSE
1 Vincent MUANDUMBA
2 Alejandro DE LORRES
3 Luis FILIPE da Rivera
4 LIM Jock Hoi

I want to split this column into separate FirstName and LastName columns, depending on whether the text in the string is in proper case (first name) or all caps (last name).
new = df["FullName"].str.split(pat=r'(?=[A-Z][a-z])', n=1, expand = True)
df['FirstName'] = new[0]
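df['LastName'] = new[1]

All proper-case or lowercase words should be grouped into new[0], and all all-caps words into new[1].
However, because my regex is incorrect, I can't seem to get the desired output. I also tried pat=r'[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)'.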
Posted 2021-11-29 20:06:55
You can use a regex.
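To see what each pattern picks up, here is a minimal sketch on plain strings using only the standard-library re module. The helper name split_name is hypothetical and only for illustration. It assumes ASCII letters, so accented or hyphenated names would need wider character classes.

```python
import re

# Runs of all-caps words, e.g. "JOSE" or "DE LORRES"
LAST_NAME = re.compile(r"\b[A-Z]+(?:\s+[A-Z]+)*\b")
# Proper-case or lowercase words, e.g. "Manuel" or "da"
FIRST_NAME = re.compile(r"[A-Z]?[a-z]+")


def split_name(full_name):
    """Hypothetical helper: return (first_name, last_name) for one string."""
    last = " ".join(LAST_NAME.findall(full_name))
    first = " ".join(FIRST_NAME.findall(full_name))
    return first, last


# Sample names taken from the question
for sample in ["Manuel JOSE", "Alejandro DE LORRES",
               "Luis FILIPE da Rivera", "LIM Jock Hoi"]:
    print(sample, "->", split_name(sample))
```

The trailing \b in the last-name pattern is what keeps the capital "M" of "Manuel" from matching: there is no word boundary between "M" and "a". On the DataFrame, the same two patterns are applied with str.findall and str.join: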
df['LastName'] = df['FullName'].str.findall(r'\b[A-Z]+(?:\s+[A-Z]+)*\b').str.join(' ')
df['FirstName'] = df['FullName'].str.findall(r"[A-Z]{0,1}[a-z]+").str.join(' ')

Output:
   names                  last_names  first_names
0  Manuel JOSE            JOSE        Manuel
1  Vincent MUANDUMBA      MUANDUMBA   Vincent
2  Alejandro DE LORRES    DE LORRES   Alejandro
3  Luis FILIPE da Rivera  FILIPE      Luis da Rivera
4  LIM Jock Hoi           LIM         Jock Hoi

Posted 2021-11-29 19:43:41
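This code is a bit longer than the str-pattern approach, but it makes sure every part of the name string is sent to the first name or the last name as needed. The trick is to use the .istitle() method.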
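As a quick illustration of the check involved, this sketch prints what the built-in string methods return for a few words from the question. The helper is_first_name_part is a hypothetical name, not part of the answer.

```python
# Words taken from the sample names in the question
for word in ["Manuel", "JOSE", "da", "DE"]:
    print(word, word.istitle(), word.islower(), word.isupper())


def is_first_name_part(word):
    """Hypothetical helper: True for proper-case or lowercase words."""
    return word.istitle() or word.islower()
```

One caveat: a mixed-case word such as "McDonald" (my own example, not from the question) is neither title case nor lowercase by these methods, so it would fall into the all-caps branch. The loop from the answer applies the same check to every word of every name: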
# Split every string in the FullName column into a list of words
new = df["FullName"].str.split(' ')
# Create empty lists to hold the new columns for df
FirstName = []
LastName = []
# Iterate over every split string (sample)
for name in new:
    Propercase = []  # Keeps values for the FirstName condition
    Allcaps = []  # Keeps values for LastName (all-caps)
    # Iterate over every word in the sample
    for n in name:
        # Check if it is proper case or lowercase ('da')
        if n.istitle() or n.islower():
            Propercase.append(n)
        # If not, it is all-caps
        else:
            Allcaps.append(n)
    # Add proper-case words to the FirstName list
    FirstName.append(' '.join(Propercase))
    # All-caps words go to the LastName list
    LastName.append(' '.join(Allcaps))
# Create columns
df['FirstName'] = FirstName
df['LastName'] = LastName

Output:
   FullName               FirstName       LastName
0  Manuel JOSE            Manuel          JOSE
1  Vincent MUANDUMBA      Vincent         MUANDUMBA
2  Alejandro DE LORRES    Alejandro       DE LORRES
3  Luis FILIPE da Rivera  Luis da Rivera  FILIPE
4  LIM Jock Hoi           Jock Hoi        LIM

If you are sure that the first word in the name is either the complete first name or the complete last name (true in most cultures, but hard to generalize), this may be faster:
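# Split each name once, at the first space (n is keyword-only in recent pandas)
new = df["FullName"].str.split(' ', n=1)
FirstName = []
LastName = []
for name in new:
    # A proper-case first word is the first name; the rest is the last name
    if name[0].istitle():
        FirstName.append(name[0])
        LastName.append(name[1])
    # Otherwise the first word is the all-caps last name
    else:
        FirstName.append(name[1])
        LastName.append(name[0])
df['FirstName'] = FirstName
df['LastName'] = LastName

https://stackoverflow.com/questions/70159898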