Saturday, 27 June 2015

Parse text between multiple lines - Python 2.7 and re Module

I have a text file I want to parse. The file contains several items I want to extract: for each one, I want to capture everything between a colon (":") and a particular word. Take the following example.

Description : a pair of shorts
amount : 13 dollars
requirements : must be blue
ID1 : 199658
----

The following code parses the information out.

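import re

# Read the whole file into one string; "with" closes the file automatically.
with open("parse.txt", "r") as f:
    fileRead = f.read()

# Each pattern captures the text between "<label> :" and the next label.
# re.DOTALL lets "." match newlines, so a value may span several lines.
Description = re.findall(r"Description :(.*?)amount", fileRead, re.DOTALL)
amount = re.findall(r"amount :(.*?)requirements", fileRead, re.DOTALL)
requirements = re.findall(r"requirements :(.*?)ID1", fileRead, re.DOTALL)
ID1 = re.findall(r"ID1 :(.*?)-", fileRead, re.DOTALL)

# Print the first match for each field.
print(Description[0])
print(amount[0])
print(requirements[0])
print(ID1[0])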

The problem is that sometimes the text file will have a line break between the label and the colon, such as this:

Description 
: a pair of shorts
amount 
: 13 dollars
requirements: must be blue
ID1: 199658
----

In this case my code does not work, because it can no longer find "Description :" now that the label and the colon are on separate lines. If I change the search to ":(.*?)requirements", it does not return just "13 dollars"; it returns both "a pair of shorts" and "13 dollars", because all of that text lies between the first colon and the word "requirements". I want a way to parse out the information whether or not there is a line break. I have hit a roadblock, and your help would be greatly appreciated.
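One possible direction, offered as a sketch rather than a definitive answer: allow any whitespace, including a line break, between each label and its colon by writing `\s*:` instead of a literal " :". The names `FIELDS`, `parse_record`, and `sample` below are hypothetical, and the sketch assumes the labels always appear in the order shown in the example and that a record ends with a run of dashes.

```python
import re

# Labels in the order they appear in a record (assumed from the example above).
FIELDS = ["Description", "amount", "requirements", "ID1"]


def parse_record(text):
    """Return a dict mapping each label to the text that follows its colon."""
    result = {}
    for i, name in enumerate(FIELDS):
        if i + 1 < len(FIELDS):
            # A value ends where the next label (followed by a colon) begins.
            stop = re.escape(FIELDS[i + 1]) + r"\s*:"
        else:
            # The last value ends at the dashed separator line.
            stop = r"-{2,}"
        # \s* around the colon tolerates spaces and line breaks;
        # the lookahead stops the lazy capture without consuming the next label.
        pattern = re.escape(name) + r"\s*:\s*(.*?)\s*(?=" + stop + r"|\Z)"
        match = re.search(pattern, text, re.DOTALL)
        if match:
            result[name] = match.group(1)
    return result


# The second example from the question, with labels and colons split across lines.
sample = (
    "Description \n"
    ": a pair of shorts\n"
    "amount \n"
    ": 13 dollars\n"
    "requirements: must be blue\n"
    "ID1: 199658\n"
    "----\n"
)

record = parse_record(sample)
```

Here `record["amount"]` holds only "13 dollars", because the search is anchored on the label itself rather than on a bare colon. The same function also handles the first layout, where the label and colon share a line. A value that itself contains the next label followed by a colon would still confuse this sketch.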
