How do I count occurrences of a particular string in a dataset with some text allowed in between?

User 4544 | 4/9/2016, 10:23:31 AM

Hi there,

I want to search for strings like "IIT Roorkee" or "Indian Inst Technol Roorkee;" inside a longer string such as "Indian Inst Technol, Dept Mech & Ind Engn, Roorkee;"

The text can be very long, and it is divided into blocks by semicolons (;). The names I'm searching for can have extra text between their parts, as in the example above, but a match must not run past a semicolon. The search should continue to the end of the text, starting afresh after each semicolon, since each one begins a new block of information.

Thank you

Comments

User 91 | 4/9/2016, 3:12:50 PM

If you are familiar with regular expressions (https://docs.python.org/2/library/re.html) then I would suggest you try that.
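A minimal sketch of the regular-expression route, assuming the name can be broken into parts that must appear in order within one block. The function name `count_with_gaps` and the sample text are made up for illustration. The gap pattern `[^;]*?` allows any text between parts but never crosses a semicolon, so each block is searched on its own.

```python
import re

def count_with_gaps(text, parts):
    """Count matches where `parts` appear in order with no ';' in between."""
    # Escape each part so punctuation in names is matched literally,
    # then join with a lazy gap that cannot contain a semicolon.
    pattern = r"[^;]*?".join(re.escape(p) for p in parts)
    return len(re.findall(pattern, text))

# Hypothetical sample data: three semicolon-terminated blocks.
text = ("Indian Inst Technol, Dept Mech & Ind Engn, Roorkee; "
        "Indian Inst Technol, Dept Phys, Delhi; "
        "Univ Roorkee, Dept Civil Engn, Roorkee;")

print(count_with_gaps(text, ["Indian Inst Technol", "Roorkee"]))
```

Only the first block is counted here: the second has no "Roorkee" before its semicolon, and the third has no "Indian Inst Technol". An abbreviation such as "IIT Roorkee" would not be found this way unless it appears literally in the text, so it would need its own list of parts.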

Another option is to split the string on ; using the split function. This returns a list, and you can then search within each element. I wasn't sure whether you meant an exact substring search within each string (which can be done with find in Python). If you are trying to do approximate matching, you can perform 3-character shingling and compute distances between the resulting shingle sets.
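A rough sketch of both ideas, with hypothetical helper names and sample data. `count_blocks` splits on ; and uses `str.find` to check that the parts occur in order inside each block. `jaccard` compares 3-character shingle sets; Jaccard similarity is just one possible distance measure, chosen here as an assumption.

```python
def contains_in_order(block, parts):
    """True if every part occurs in `block`, each after the previous one."""
    pos = 0
    for part in parts:
        idx = block.find(part, pos)
        if idx == -1:
            return False
        pos = idx + len(part)
    return True

def count_blocks(text, parts):
    """Split on ';' and count the blocks that contain all parts in order."""
    blocks = [b.strip() for b in text.split(";") if b.strip()]
    return sum(contains_in_order(b, parts) for b in blocks)

def shingles(s, k=3):
    """Set of overlapping k-character substrings, case- and space-normalized."""
    s = " ".join(s.lower().split())
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of the two shingle sets (1.0 means identical sets)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical sample data.
text = ("Indian Inst Technol, Dept Mech & Ind Engn, Roorkee; "
        "Univ Delhi, Dept Phys;")
query = "Indian Inst Technol Roorkee"

print(count_blocks(text, ["Indian Inst Technol", "Roorkee"]))
for block in (b.strip() for b in text.split(";") if b.strip()):
    print(round(jaccard(query, block), 2), block)
```

For approximate matching you would count a block when its similarity to the query exceeds a threshold. The right threshold depends on the data and has to be tuned by inspection.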