SOLR search engine from OCR'ed, indexed PDFs

$1500-3000 USD

Closed
Posted over 6 years ago

Paid on delivery
The attached Word document is ESSENTIAL to understanding this project, as it contains very important images. I will ask whether you have read the attached brief before I accept your bid. This is a short description of the project. Please read the attached document for the whole story.

We need a SOLR search engine built from old, multi-page PDFs. All of the indexed documents will be PDFs, and many will need to go through OCR first. We will probably use something like Foxit to do the image-to-text conversion. We know the output will be messy, but the text will only be used in the indexing process. When a user does a search, s/he will access the PDF directly. Note: All of our work is in Java. This will be running on a large Linux server.

This project is not that simple, though. Let’s take a look at this example > [login to view URL] We will want to index this 30-page document, but it contains more than one form (unique section). State Oil & Gas sites will often put an entire wellbore’s files in a single PDF. 20 years of paperwork can be sitting in a single PDF. If we index as-is and return results with a 30 to 100-page PDF attached, the user will never be able to find the single mention of their search string after opening the very long PDF file.

For this reason, we need to break the 30+ page PDF into individual pages, OCR each page, and index each page separately. When doing a search, the user is actually searching individual pages. We tell the user we found the queried text on page 19 of the PDF. S/he clicks to get the full 30 pages, but knows to go to page 19. We may even load the PDF in a frame and keep a header at the top that reminds the user to look on page 19. There may also be multiple mentions of the search query in a single PDF file.

A lot of it will be nasty looking. Documentation goes back 50+ years to typewriters. If this all seems pretty impossible, you would be right. In fact, we believe the OCR will be so incomplete in places that we cannot even show a snippet (10-20 words) of text on the search results page, because it will be nonsensical. But this is OK. If we can OCR 70% of the data from these PDFs, that’s 70% we didn’t have yesterday. And no one will ever see the OCR text to complain how incomplete it is…

Why are we going to all this effort? We plan on using SOLR to build a metadata engine around these documents. We are less interested in the content of each page and more interested in the page type, for example that a particular wellbore even has a C-144 form. We'd like to get as much data as we can, but realize we won't be able to get it all. The end user will probably do very little “free text” searching of SOLR. Instead, we will process 10,000 of our own search phrases (tokenization and algorithms), e.g. “Tank Closure” or “C-144”, and build a table of all the document types that are inside the PDFs for each wellbore. We may tell a user that wellbore [Removed by Freelancer.com Admin - please see Section 13 of our Terms and Conditions]

Now it starts to make sense why we are breaking apart all the PDFs for OCR and indexing. We may store page 1, 2, 3, 4 and 5 in a database row for wellbore [Removed by Freelancer.com Admin - please see Section 13 of our Terms and Conditions]

We cannot stress this enough: the user never sees the OCR text or the broken-apart PDFs. That would be way too confusing. Instead, we will direct the user to open the original PDFs and go to page 6 or page 1 or page 27 and read further about a tank disclosure for this particular wellbore.

Expect 10-15 million PDFs. If this work is good, we have many more follow-on projects from this that we will LOVE for you to work on.

OK! That should be enough to communicate the main purpose of this project. Please read the attached document, which has more detailed information about the entire project.
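
As a rough illustration only (it is not taken from the attached brief), the page-level tagging described above could be sketched in plain Java as follows. Everything named here is an assumption made for the example: the `PageDoc` class, the `tagPages` method, the `pdfName#page` key format, and the sample wellbore ID and text are hypothetical, and the OCR step and the SOLR index itself are left out. Each page of a PDF is treated as its own record, the messy OCR text is normalized, and each known phrase is checked against each page to build a wellbore-to-document-type table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names come from Solr or from the brief.
class PageTypeTable {

    // One record per PDF page; ocrText is the (possibly messy) OCR output.
    static final class PageDoc {
        final String wellboreId;
        final String pdfName;
        final int pageNumber;
        final String ocrText;

        PageDoc(String wellboreId, String pdfName, int pageNumber, String ocrText) {
            this.wellboreId = wellboreId;
            this.pdfName = pdfName;
            this.pageNumber = pageNumber;
            this.ocrText = ocrText;
        }

        // Per-page key (assumed format), so a hit can point back to a page of the original PDF.
        String id() {
            return pdfName + "#" + pageNumber;
        }
    }

    // Lowercase and collapse whitespace runs so "Tank   CLOSURE" matches "Tank Closure".
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }

    // Builds wellbore -> phrase -> list of page keys where the phrase was found.
    static Map<String, Map<String, List<String>>> tagPages(List<PageDoc> pages, List<String> phrases) {
        Map<String, Map<String, List<String>>> table = new TreeMap<>();
        for (PageDoc page : pages) {
            String text = normalize(page.ocrText);
            for (String phrase : phrases) {
                if (text.contains(normalize(phrase))) {
                    table.computeIfAbsent(page.wellboreId, k -> new LinkedHashMap<>())
                         .computeIfAbsent(phrase, k -> new ArrayList<>())
                         .add(page.id());
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; real text would come from the OCR step.
        List<PageDoc> pages = Arrays.asList(
            new PageDoc("WELL-A", "well-a.pdf", 1, "FORM C-144  application"),
            new PageDoc("WELL-A", "well-a.pdf", 6, "Tank   closure plan, see C-144"),
            new PageDoc("WELL-A", "well-a.pdf", 19, "illegible typewriter text"));
        List<String> phrases = Arrays.asList("C-144", "Tank Closure");
        System.out.println(tagPages(pages, phrases));
    }
}
```

With the sample data, pages 1 and 6 are tagged with “C-144”, page 6 is also tagged with “Tank Closure”, and page 19 gets no tag. Plain substring matching is the simplest possible stand-in for the “tokenization and algorithms” mentioned above: it would also match “C-1445”, and it does not tolerate OCR misreads, so a real implementation would likely need token-aware or fuzzy matching.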
Project ID: 15634361

About the project

8 proposals
Remote project
Active 6 years ago

8 freelancers are bidding on average $2,259 USD for this job
I have worked with Lucene search in Java, so I fully understand Solr search. I have also worked with OCR tools such as Tesseract, Ephesoft, and Asprise, and I understand how OCR engines work. I can really help you.
Relevant Skills and Experience: I have over 9 years of experience with Java and related frameworks, know Spring and other web frameworks inside and out, and have led teams building products from scratch.
Proposed Milestones: $2500 USD - Will discuss and decide on milestones
Additional Services Offered: $50 USD - per hour of work, if you want to work on an hourly basis
Is the OCR work part of the project, or just Solr search and indexing?
$2,222 USD in 20 days
4.7 (17 reviews)
6.8
Hi, I reviewed the Word file and understand the requirements. I propose to use C# to produce the index file and a simple GUI where you can enter your queries.
Relevant Skills and Experience: Algorithm
Proposed Milestones: $3000 USD - Full
I have my own VPN, so if you provide some PDFs, I can work on a working demo.
$3,000 USD in 20 days
5.0 (12 reviews)
5.7
We have already worked on something of this sort. We are a team of scientists and developers with rich experience in artificial intelligence and machine learning techniques such as neural networks and NLP.
Relevant Skills and Experience: We have done extensive research on facial recognition and CVIP. The technical team consists of programmers with more than 6 years of experience.
$2,500 USD in 30 days
5.0 (2 reviews)
4.0

About the client

Oklahoma City, United States
5.0
9
Payment method verified
Member since Aug 16, 2017

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)