I have a PDF document that doesn't convert well to text or other formats because it contains many tables. I have tried converting it to text/XML using various commercial and free packages, and none of them converts the file well.
The best approach seems to be to convert the PDF to XML using PDFMiner (Python) and then slice and dice the XML to retrieve the relevant data. This requires some understanding of data science and information extraction: clustering, chunking, etc.
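
To make the "slicing and dicing" step concrete, here is a rough sketch of the kind of clustering I have in mind. It assumes the XML contains `<textline bbox="x0,y0,x1,y1">` elements with one `<text>` child per character, which is how I understand PDFMiner's XML output to look. The function names, the sample XML, and the `y_tol` tolerance are all made up for illustration; real tables will likely need tuning and a similar clustering pass on the x axis to find columns.

```python
import xml.etree.ElementTree as ET


def textlines(xml_string):
    """Yield (x0, y0, text) for every <textline> element in the XML."""
    root = ET.fromstring(xml_string)
    for line in root.iter("textline"):
        x0, y0, _x1, _y1 = (float(v) for v in line.get("bbox").split(","))
        # Assumption: each <text> child holds a single character.
        text = "".join(t.text or "" for t in line.iter("text")).strip()
        if text:
            yield x0, y0, text


def group_into_rows(items, y_tol=3.0):
    """Cluster text lines into table rows by their bottom y coordinate.

    Lines whose y0 lies within y_tol of the first line of the current
    row are treated as the same row. Rows come out top to bottom (PDF
    y grows upwards) and cells left to right.
    """
    rows, current, row_y = [], [], None
    for x0, y0, text in sorted(items, key=lambda it: -it[1]):
        if row_y is not None and abs(y0 - row_y) > y_tol:
            rows.append(current)
            current = []
        if not current:
            row_y = y0
        current.append((x0, text))
    if current:
        rows.append(current)
    return [[text for _x, text in sorted(row)] for row in rows]


# Hypothetical, hand-written sample shaped like the assumed XML layout.
SAMPLE_XML = """<pages><page id="1">
<textline bbox="200,701,230,713"><text>Q</text><text>t</text><text>y</text></textline>
<textline bbox="50,680,80,692"><text>B</text><text>o</text><text>l</text><text>t</text></textline>
<textline bbox="50,700,90,712"><text>N</text><text>a</text><text>m</text><text>e</text></textline>
<textline bbox="200,679.5,215,691.5"><text>1</text><text>2</text></textline>
</page></pages>"""

rows = group_into_rows(textlines(SAMPLE_XML))
print(rows)  # [['Name', 'Qty'], ['Bolt', '12']]
```

This only groups by vertical position with a fixed tolerance. Cells that wrap onto several lines, merged cells, or tables spanning pages would need extra handling.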