Text Processing

User 2459 | 10/21/2015, 2:08:20 PM

Hey, 1. I have .txt files from PubMed and I would like to convert this data, potentially into a table, to use with GraphLab Create. 2. Each .txt file is an article, and I would like to extract information from these articles, for example a treatment or a drug that is discussed in the article. I was thinking of using a mixture of NLTK, deep learning and classification. Any advice or thoughts about this?

Here is an extract from one .txt file:

"Selenium can bioaccumulate in aquatic organisms resulting in adverse effects when it exceeds threshold levels. In fish, these effects can include reduced production of viable eggs, post-hatch mortality, deformities in growing stages, and various pathological effects in the kidneys, liver, heart, and ovaries (Hamilton , ; Lemly ). In severe cases, these effects may lead to population declines (Lemly )."


User 18 | 10/22/2015, 10:54:13 PM

Hi @antalexa,

Is your ultimate goal to tag each article with the relevant key concepts? If you can come up with a list of the concepts you'd like to tag, then the best way to go about this might be to use the Autotagger toolkit. I think this would do exactly what you'd like to do. The underlying implementation finds closely related strings using a distance metric. It's much more lightweight than using deep learning or other classification models to learn to classify the texts. We have example notebooks on autotagging Hacker News articles and StackOverflow posts. Check them out.
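
To give a rough feel for the distance-metric idea, here is a minimal sketch in plain Python. It is not the Autotagger implementation or its API; it only uses the standard-library difflib module, and the function names, the sliding-window matching, and the 0.85 threshold are assumptions made up for this illustration.

```python
import re
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tag_document(text, tags, threshold=0.85):
    """Return {tag: best_score} for every tag that closely matches the text.

    Hypothetical helper: each tag is compared against every run of words in
    the text that has the same number of words as the tag.
    """
    words = re.findall(r"[A-Za-z]+", text)
    matched = {}
    for tag in tags:
        n = len(tag.split())
        best = 0.0
        # Slide a window of n words over the text and keep the best score.
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            best = max(best, similarity(candidate, tag))
        if best >= threshold:
            matched[tag] = best
    return matched


extract = (
    "Selenium can bioaccumulate in aquatic organisms resulting in adverse "
    "effects when it exceeds threshold levels. In fish, these effects can "
    "include various pathological effects in the kidneys, liver, heart, "
    "and ovaries."
)
result = tag_document(extract, ["selenium", "kidney", "aspirin"])
print(result)
```

With a fixed tag list like this, "kidney" still matches "kidneys" because the strings are close, while a tag that never appears in the text is dropped.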

If you don't know the interesting tags ahead of time, then I would use NLTK or some other NLP package to extract entities out of the unstructured text, then clean the list of extracted entities to get a set of tags to use.
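
The clean-up step could look something like the sketch below. It assumes the entities have already been extracted by whatever NLP package you pick and are handed over as a plain list of strings; the function name, the normalization rules, and the min_count and min_length defaults are hypothetical choices for illustration, not a recommended recipe.

```python
import re
from collections import Counter


def clean_entities(raw_entities, min_count=2, min_length=3):
    """Normalize extracted entity strings and keep the frequent ones as tags."""
    counts = Counter()
    for entity in raw_entities:
        # Collapse runs of whitespace, trim surrounding punctuation, lowercase.
        normalized = re.sub(r"\s+", " ", entity).strip(" .,;:()[]\"'").lower()
        # Skip very short strings and bare numbers such as years.
        if len(normalized) >= min_length and not normalized.isdigit():
            counts[normalized] += 1
    # Keep only entities seen at least min_count times, most frequent first.
    return [tag for tag, n in counts.most_common() if n >= min_count]


# Example input, as it might come out of an entity extractor (made up here).
raw = ["Selenium", "selenium ", "(Lemly)", "Lemly", "Hamilton",
       "2004", "  liver", "Liver."]
tags = clean_entities(raw)
print(tags)
```

The resulting tag list could then be used as the fixed set of concepts for the tagging approach described above.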


User 2459 | 10/26/2015, 3:17:47 AM

Hey @alicez, I won't know the tags ahead of time, so I'll end up using NLTK, since it can look for potential keywords to highlight. I am not too familiar with the NLTK toolkit; do you have any suggestions for where I could find an example notebook or tutorial on this topic? Cheers, Anish