
Extract data from files compressed in gz archives

$30-250 USD

Completed
Posted almost 8 years ago

Paid on delivery
I need a script that can scan a large number of .gz files as fast as possible and extract certain data from them. I need the script to be able to do this without having to decompress all the files (there are over 40,000 of them!). Speed is critical. I would prefer a solution in Python, but I am open to other languages, especially if they will run faster. Also, the source files should not be modified: the script should only check them for matches and write the results to a "results" file.

Every .gz file contains a single JSON file holding information on up to 750 different keywords/items, so the script needs to be able to recurse through the file and match the data found with the correct keyword.

Here is an example. The data I need to locate is entries in the JSON file that include the "images" tag. Here is a sample:

{
  "date": "2016-06-10",
  "gl": "us",
  "hl": "en",
  "custom_id": "2",
  "keyword": "cat drinking water",
  "data": {
    "1": {
      "pos": 1,
      "href": "[login to view URL]",
      "title": "",
      "description": "",
      "tags": [ "images" ]
    },
    "2": {
      "pos": 2,
      "href": "[login to view URL]",
      "title": "Amazing Slow Motion Cat Drinking - YouTube",
      "description": "Discover the beauty of a cat in super slow motion thanks to a high definition ... Giant 6ft Water Balloon - The ...",
      "tags": [ "video" ]
    },
    "3": {
      "pos": 3,
      "href": "[login to view URL]",
      "title": "Problems With a Cat Drinking Excessive Water - Pets",
      "description": "Perhaps better known as finicky eaters, cats aren't prolific water drinkers. If your cat is drinking a lot of water , it could be a sign of a serious health issue including \u00a0...",
      "tags": []
    ...

As you can see, the first entry under "data" has the "images" tag.

Once the "images" tag is found, I need to get this information from the "top" of the entry, as seen in the example above:

"gl": "us", "hl": "en", "custom_id": "2", "keyword": "cat drinking water",

This information should be saved to a text file in this format: "GL;HL;ID;keyword". Based on this example, the script would save this to the file: "US;EN;2;cat drinking water"

For every match found across all the .gz files, a new line should be written to the "results" file so that everything is stored in a single file.

I have attached a sample .gz file you can test with. I need this completed as soon as possible, so how soon you can begin and complete the work will factor into my bid selection. Thanks, and feel free to ask questions!
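The requirement above could be sketched in Python roughly as follows. This is a minimal illustration, not the awarded solution. It assumes that each archive decodes as UTF-8 JSON, that the top level is either one keyword object (as in the sample) or a list/dict of such objects, and that one line is written per keyword with at least one result tagged "images". The function names and the results path are hypothetical.

```python
import gzip
import json
from pathlib import Path


def keyword_entries(doc):
    """Yield keyword objects (dicts with a "data" key) found anywhere in doc."""
    if isinstance(doc, dict) and "data" in doc:
        yield doc
    elif isinstance(doc, dict):
        for value in doc.values():
            yield from keyword_entries(value)
    elif isinstance(doc, list):
        for item in doc:
            yield from keyword_entries(item)


def has_images_tag(entry):
    """True if any result under entry["data"] carries the "images" tag."""
    data = entry.get("data")
    if not isinstance(data, dict):
        return False
    return any(
        "images" in (result.get("tags") or [])
        for result in data.values()
        if isinstance(result, dict)
    )


def format_line(entry):
    """Build the "GL;HL;ID;keyword" line, upper-casing gl and hl as in the example."""
    return "{};{};{};{}".format(
        str(entry.get("gl", "")).upper(),
        str(entry.get("hl", "")).upper(),
        entry.get("custom_id", ""),
        entry.get("keyword", ""),
    )


def scan_file(path):
    """Decompress one archive in memory (read-only) and return its result lines."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        doc = json.load(fh)
    return [format_line(e) for e in keyword_entries(doc) if has_images_tag(e)]


def scan_directory(src_dir, results_path):
    """Scan every .gz file under src_dir and write all matches to one results file."""
    count = 0
    with open(results_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).rglob("*.gz")):
            for line in scan_file(path):
                out.write(line + "\n")
                count += 1
    return count
```

Because `gzip.open` streams the decompressed bytes straight into the JSON parser, nothing is extracted to disk and the source archives are only ever opened for reading.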
Project ID: 10742030

About this project

16 proposals
Remote project
Active 8 years ago

Awarded to:
I think a mixture of Python (to parse the JSON and build the results) and a shell script (to decompress) will work best for the files. Only one sample file was attached. I will run a test on, say, 10 files to determine the average time taken per file and then, if required, I will parallelize the scripts to decompress and parse multiple files simultaneously. I have 6+ years of experience in Python and Linux tools. I work as a full-time employee at Google. To know more about me, visit: [login to view URL] (www [dot] ashishkedia [dot] me)
$83 USD in 1 day
4.9 (10 reviews)
3.9
16 freelancers are bidding an average of $240 USD for this job
I can do this, no problem.
$199 USD in 3 days
5.0 (157 reviews)
8.4
Web scraping expert. I use Python. My scripts work on Windows, Mac or Linux, but Linux is preferable. I can schedule scripts on a server if required. I have more than 100 finished projects (Google scraping, Facebook scraping, Yellow Pages, LinkedIn, Amazon, webshops and other sites with lists of any items). I can scrape secured and protected sites; my crawlers can log in through login forms, emulate AJAX requests, etc. If a site blocks an IP, I can use a proxy or Tor. I can try to get around a site's CAPTCHA in automatic or manual mode. I can export data to JSON, XML, CSV (Excel), or any database (MySQL, MongoDB, MSSQL, etc.). I can develop a web interface for managing a running script (start, stop, etc.) using PHP, HTML and JS.
$200 USD in 3 days
4.8 (103 reviews)
6.2
Hello, Thank you very much for this Web Scraping project. I read through the job details extremely carefully and understand your requirements, so I am absolutely sure that I can do the project very well. I can complete this Web Scraping project on time and within your budget. I have worked on similar Web Scraping projects, and I am confident I can exceed your expectations. Please click on Chat and reply to me to see demo work or to discuss more details. Regards, Feroz Ahmed. See my feedback: www.freelancer.com/u/ferozstk.html
$100 USD in 3 days
4.8 (47 reviews)
5.5
Dear Sir/Ma'am, I am a web research, data entry and web scraping expert. I have checked and understood your requirements. I can handle this job very well, to your satisfaction. I can find and extract the information from different websites into an Excel sheet. I am ready to hear about the project in more detail now. I have always built long-term collaborations with my clients through hard work and quality output for a reasonable price. If you have questions or doubts about anything, please feel free to ask me. Sincerely, Mir
$250 USD in 5 days
4.9 (27 reviews)
5.1
A proposal has not yet been provided
$222 USD in 3 days
4.9 (15 reviews)
4.3
Hi. I can write such a script in Python. I hope each JSON file is small enough to fit into memory when decoded with json. I have already built a prototype and have a result for your example file.
$111 USD in 0 days
5.0 (14 reviews)
4.3
Hi, I am an experienced Python programmer and can offer you my solution, with work on it starting today. There is no way to know the content of the files without unzipping them (whoever says otherwise is fooling you), but we only need to unzip the ones the script is working on at a given moment, so if space complexity is your concern, it shouldn't be. Looking forward to hearing from you to discuss the full solution details. If necessary, I can create a proof of concept with a time chart report before the bid is accepted. Yours sincerely, Ivan
$200 USD in 2 days
5.0 (14 reviews)
3.8
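The point made in this proposal, that each archive must be decompressed but only in memory and only while it is being processed, can be illustrated with a streaming check. The sketch below goes a step beyond the proposal and is purely an assumption: it uses a cheap byte search for `"images"` as a pre-filter, so that full JSON parsing can be skipped for files that cannot match. The function name and chunk size are hypothetical, and a hit is only a candidate (the string could also appear as a title or description), so matches still need a real parse.

```python
import gzip

NEEDLE = b'"images"'


def may_contain_images(path, chunk_size=1 << 20):
    """Stream-decompress path in memory; True if the raw bytes contain "images"."""
    tail = b""
    with gzip.open(path, "rb") as fh:  # read-only: the archive is never modified
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return False
            # Prepend the end of the previous chunk so a needle that straddles
            # a chunk boundary is still found.
            window = tail + chunk
            if NEEDLE in window:
                return True
            tail = window[-(len(NEEDLE) - 1):]
```

Only one chunk plus a few bytes of overlap is held in memory at a time, and nothing is written to disk.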
I have 7+ years of work experience in data collection, bulk email campaigns, Excel VBA and internet research at IT companies here. I can create crawlers and scrape data from sites using C++, Python and Perl as per your requirements, delivering the results in Excel, with multiple IP rotation. I have dealt successfully with presidents, directors and managers of US, UK and Australian companies on web design and development projects, and I have good written communication skills. I am well versed in the internet, MS Office applications and phone etiquette, and in the latest technologies. I can accept your payment terms.
$155 USD in 2 days
3.9 (6 reviews)
4.5
Dear sir or madam, I have more than 5 years of experience in PHP programming. I know how to process gz files, how to analyze JSON data, etc. I can handle this project in a few hours. Kind regards, Alen
$200 USD in 1 day
5.0 (3 reviews)
2.3
I have been doing the exact same work throughout my professional career. The bid is low because I am getting started on freelancer but the work will be of very high quality.
$98 USD in 3 days
0.0 (0 reviews)
0.0

About the client

Andalusia, United States
5.0
172
Payment method verified
Member since July 9, 2012

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)