Extracting page numbers from text with regex


First, use re.search to extract the digits, dropping the "pg" prefix and skipping any text that contains other letters:

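import re

texts = ['pg 199-200,202.205', 'pg 7', 'pg 2, 14', 'pg 69-71','pg 159. 97 -98', 'pg 60, 62, 64-65', 'pg 3-4-5', 'pg Summary Data', 'pg FC1-1', 'pg 16-']

for text in texts:
    # The lookbehind requires "pg " right before the match; the character class
    # accepts only digits, hyphens, commas, periods and spaces.
    matched = re.search(r"(?<=pg )[\d\-,. ]+", text)
    if matched:
        print(matched.group())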


# 199-200,202.205
# 7
# 2, 14
# 69-71
# 159. 97 -98
# 60, 62, 64-65
# 3-4-5
# 16-
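
Why 'pg Summary Data' and 'pg FC1-1' produce no output may be easier to see in isolation. The following is a minimal sketch, not part of the original code: it wraps the same pattern in a helper, and the names PAGE_FIELD and extract_page_field are made up for illustration. Because the lookbehind only succeeds right after "pg ", and the first character there must be a digit, hyphen, comma, period or space, text that starts with a letter yields no match.

```python
import re

# Same pattern as above, compiled once.
# PAGE_FIELD and extract_page_field are hypothetical names for this sketch.
PAGE_FIELD = re.compile(r"(?<=pg )[\d\-,. ]+")

def extract_page_field(text):
    """Return the raw page field that follows 'pg ', or None if there is none."""
    matched = PAGE_FIELD.search(text)
    return matched.group() if matched else None

print(extract_page_field("pg 69-71"))         # 69-71
print(extract_page_field("pg Summary Data"))  # None
print(extract_page_field("pg FC1-1"))         # None
```

Returning None instead of printing makes it possible to tell "no page field" apart from an empty result in later steps.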

Building on that, split each match into a list:


for text in texts:
    matched = re.search(r"(?<=pg )[\d\-,. ]+", text)
    if matched:
        # Split on a comma or period, plus any spaces that follow it
        page_numbers_raw = re.split(r"[,.] *", matched.group())
        print(page_numbers_raw)

# ['199-200', '202', '205']
# ['7']
# ['2', '14']
# ['69-71']
# ['159', '97 -98']
# ['60', '62', '64-65']
# ['3-4-5']
# ['16-']
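
One case the sample data does not cover is a trailing separator, where re.split leaves an empty string at the end of the list. The sketch below is one possible way to handle that, assuming empty pieces should simply be dropped; split_page_field is a hypothetical helper name and not part of the original code.

```python
import re

def split_page_field(field):
    """Split a raw page field on commas and periods, dropping empty pieces."""
    pieces = re.split(r"[,.]", field)
    # strip() removes the spaces around each piece; the filter drops pieces
    # that are empty after stripping, such as the one after a trailing comma.
    return [piece.strip() for piece in pieces if piece.strip()]

print(split_page_field("159. 97 -98"))  # ['159', '97 -98']
print(split_page_field("2, 14,"))       # ['2', '14']
```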

For the entries above that contain a hyphen, expand the range with the following function:


def unfold_range(hyphenated):
    # Split on hyphens (with optional surrounding spaces); drop empty pieces such as the one in '16-'
    folded = [part for part in re.split(r" *- *", hyphenated) if part]
    if len(folded) == 1:
        return folded
    else:
        # Expand from the first number to the last, inclusive
        return list(map(str, range(int(folded[0]), int(folded[-1]) + 1)))

range_texts = ['69-71', '97 -98', '3-4-5', '16-']
for text in range_texts:
    print(unfold_range(text))

# ['69', '70', '71']
# ['97', '98']
# ['3', '4', '5']
# ['16']
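
The three steps can also be chained into a single function. This is only a sketch of how they might fit together, not code from the original post: extract_pages is a hypothetical name, and returning an empty list for text without a page field is an assumption.

```python
import re

def extract_pages(text):
    """Return a flat list of page-number strings found after 'pg ' in text."""
    # Step 1: grab the raw page field.
    matched = re.search(r"(?<=pg )[\d\-,. ]+", text)
    if not matched:
        return []  # assumption: no page field means no pages
    pages = []
    # Step 2: split the field on commas and periods.
    for piece in re.split(r"[,.] *", matched.group()):
        # Step 3: expand hyphenated ranges, using only the first and last number.
        bounds = [part for part in re.split(r" *- *", piece.strip()) if part]
        if not bounds:
            continue  # skip empty pieces, e.g. after a trailing comma
        if len(bounds) == 1:
            pages.extend(bounds)
        else:
            pages.extend(str(n) for n in range(int(bounds[0]), int(bounds[-1]) + 1))
    return pages

print(extract_pages("pg 60, 62, 64-65"))  # ['60', '62', '64', '65']
print(extract_pages("pg 159. 97 -98"))    # ['159', '97', '98']
print(extract_pages("pg Summary Data"))   # []
```

Note that, as with unfold_range, a descending range such as '98-97' expands to nothing, because range() is empty when the start is greater than the end.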



Author: BrambleXu
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY-NC-SA 4.0. Please credit BrambleXu when reposting!