Building a dataset from PDF documents

User 5248 | 6/7/2016, 10:55:11 AM

Hi, I need to create a dataset. Imagine we have 1000 PDF documents and want to extract the abstract, keywords, and journal name from each one into a tabular dataset. I have tried to illustrate my goal in a Word file to better explain what I want to do.

How can I do that for text analytics?

I really need to figure out how to do this.

Best regards


User 1207 | 6/7/2016, 6:33:58 PM

Hello Mohsenbarani,

GraphLab Create focuses on predictive analytics and machine learning on text data and provides powerful tools for that, but we don't have a built-in PDF extractor. For your use case, you should check out pdftk -- it has a nice Linux command-line interface that should allow you to script your job.
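
As a rough sketch of what such a scripted job might look like in Python: this assumes the `pdftotext` command-line tool from poppler-utils is installed (a different tool from pdftk, which mainly splits, merges, and dumps metadata rather than body text), and that each paper labels its sections with the words "Abstract" and "Keywords". The function names, the regular expressions, and the guess that the journal name is the first non-empty line are all illustrative assumptions and will need tuning for real papers.

```python
import csv
import re
import subprocess
from pathlib import Path


def pdf_to_text(pdf_path):
    """Return the text of one PDF via the pdftotext CLI (assumed installed)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_fields(text):
    """Pull abstract, keywords, and a guessed journal name out of raw text."""
    # Abstract: everything after "Abstract" up to "Keywords" or a blank line.
    abstract = re.search(
        r"abstract\s*[:.\-]?\s*(.+?)(?=\n\s*key\s?words?\b|\n\s*\n|\Z)",
        text, re.IGNORECASE | re.DOTALL)
    # Keywords: everything after "Keywords" up to the next blank line.
    keywords = re.search(
        r"key\s?words?\s*[:.\-]?\s*(.+?)(?=\n\s*\n|\Z)",
        text, re.IGNORECASE | re.DOTALL)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {
        # Assumption: the journal name is the first non-empty line.
        "journal": lines[0] if lines else "",
        "abstract": " ".join(abstract.group(1).split()) if abstract else "",
        "keywords": " ".join(keywords.group(1).split()) if keywords else "",
    }


def build_dataset(pdf_dir, out_csv):
    """Write one CSV row per PDF found in pdf_dir."""
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["file", "journal", "abstract", "keywords"])
        writer.writeheader()
        for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
            row = extract_fields(pdf_to_text(pdf))
            row["file"] = pdf.name
            writer.writerow(row)
```

Calling `build_dataset("papers", "papers.csv")` (both names are placeholders) would produce a table with one row per document, which should then be loadable with `graphlab.SFrame.read_csv` for the text analytics step.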

-- Hoyt