Computer Forums

#1
01-24-2007, 03:42 PM
OllieB (Newbie)
Join Date: 24 Jan 2007
Posts: 6
Website scraping?

Hi All,
I have a challenge and need some help/advice.
In a nutshell, I want to extract a lot of data from a website (it's in the public domain) and strip out the data I need (it's wrapped in HTML and isn't too difficult to see), eventually exporting this to an Excel spreadsheet in column format.
I know there is a lot of software out there that will create macros and a bunch of other stuff to do the job, but I need to know that whatever I use will work and be something I can run every week with a few clicks.
Any help, advice, or offers to create it (paid, of course) would be greatly appreciated!

Thanks, OllieB
#2
01-24-2007, 03:50 PM
Ash (CF owner)
Join Date: 27 Jul 2005
Location: Devon, UK
Posts: 4,138

Hi OllieB,
Welcome to CompuForums - great to have you here. I hope you can visit us often in the future and it would be great if you could add an entry to our member map.

Firstly, what sort of data are you extracting? There are some free applications that can do this sort of thing, but it depends on what you are grabbing.

Secondly, what site is it, and is it a big one? Fetching that many pages can use a lot of bandwidth, and you could push the site past its monthly allowance, which would either take the site offline for the rest of the month or leave the webmaster with a heavy bill.
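
To put a rough number on the bandwidth concern, here is a small back-of-the-envelope sketch. The figures (300 pages, about 50 KiB per page, 4 runs a month) are assumptions for illustration only, not measurements of any real site, and the estimate ignores images, headers, and compression.

```python
def monthly_transfer_bytes(pages: int, avg_page_bytes: int, runs_per_month: int) -> int:
    """Total bytes downloaded per month for a repeated scrape.

    Counts only the HTML itself; images, headers, and compression are ignored.
    """
    return pages * avg_page_bytes * runs_per_month


# Hypothetical figures: 300 pages of about 50 KiB each, fetched weekly (4 runs).
total = monthly_transfer_bytes(pages=300, avg_page_bytes=50 * 1024, runs_per_month=4)
print(f"{total / (1024 * 1024):.1f} MiB per month")
```

Swapping in the real page count and a measured page size gives a figure you can compare against the host's allowance before running anything.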
__________________
Thanks,
Ash
CF Founder

Great Webhosting. Shared starting at $2 per month. VPSes starting at $6 per month.
www.Centicero.com

Want to get in touch? Send me a PM | Do you want to continue receiving free help? Or do you want this site to close? Become a premium member.
#3
01-24-2007, 03:59 PM
OllieB (Newbie)
Join Date: 24 Jan 2007
Posts: 6

Thanks for the swift reply, Ash.
The data is simply a list of names, locations, reference IDs, and dates. There is a drop-down menu that specifies each location, and then a calendar specifying dates. It's this kind of data that I need to automate. With regard to the URL, if I can get a feel that this is possible and can be automated (which I'm sure it can), then I'll send you the link (bear with me on this).
As an example, if the source code gave me all the info I needed on the first page, then I think I'd probably be able to strip out the data myself, but unfortunately it's spread over several hundred pages.
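
If the drop-down and calendar end up as query parameters in the page address, the several hundred pages can be listed up front by combining every location with every date. The sketch below assumes exactly that; the endpoint `https://example.com/search` and the parameter names `location` and `date` are made up for illustration, and a site that submits the form by POST would need a different approach.

```python
from datetime import date, timedelta
from itertools import product
from urllib.parse import urlencode

BASE_URL = "https://example.com/search"  # hypothetical endpoint, not the real site


def date_range(start: date, end: date):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def build_urls(locations, start, end):
    """Return one URL per (location, date) pair.

    The parameter names 'location' and 'date' are assumptions.
    """
    return [
        f"{BASE_URL}?{urlencode({'location': loc, 'date': day.isoformat()})}"
        for loc, day in product(locations, date_range(start, end))
    ]


urls = build_urls(["Devon", "New York"], date(2007, 1, 24), date(2007, 1, 26))
print(len(urls))
print(urls[0])
```

Looking at the address bar after picking a location and date by hand is usually enough to tell whether the site works this way.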
#4
01-24-2007, 04:06 PM
Ash (CF owner)
Join Date: 27 Jul 2005
Location: Devon, UK
Posts: 4,138

It might be simpler to contact the owner of the website and ask whether they can provide a database dump (a MySQL dump, if the site runs on MySQL). The database holds the raw information, which is formatted into pages when you access the site. The owner may also be able to export the data as a CSV file through their control panel.
#5
01-24-2007, 05:30 PM
OllieB (Newbie)
Join Date: 24 Jan 2007
Posts: 6

Thanks Ash
I doubt the site administrators would do this for me, and I'd need the CSV file once, maybe twice a month.
Do you have the skills to do this kind of thing, including stripping out the relevant data and exporting it to Excel once it has been pulled from the site?
I can email you the site and details if you think you might be able to help.
Thanks, OllieB
#6
01-24-2007, 06:09 PM
Ash (CF owner)
Join Date: 27 Jul 2005
Location: Devon, UK
Posts: 4,138

I am not a programmer myself; however, there are people here who may be able to help, and they will reply if they can offer assistance. You should still check with the site administrator that it's okay to do this, otherwise you could end up taking their site offline.
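
Alongside asking the administrator, a script can check the site's robots.txt before fetching anything. This is only a sketch: the robots.txt content and the user agent name `my-scraper` are invented for the example, and a robots.txt check complements asking permission rather than replacing it.

```python
from urllib.robotparser import RobotFileParser


def allowed_to_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed_to_fetch(SAMPLE_ROBOTS, "my-scraper", "https://example.com/listings?page=1"))
print(allowed_to_fetch(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/data"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` loads the real file instead of the sample text.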
#7
01-26-2007, 07:40 AM
robputt796 (Secondary Administrator)
Join Date: 20 Feb 2006
Location: United Kingdom
Posts: 579

If you have a Linux/Unix machine, or even a virtual machine, I would write a shell script that uses wget to fetch the web pages and save them to a directory. If you have access to the server, the script could also dump the MySQL database to an SQL file, or log in over FTP and grab the files that way. This could be set up in cron to run, say, every Sunday night. Just an idea. But if you're a Win32 user, I guess manually is the way. I wouldn't know too much; I tend to be a Linux/Mac/Unix user.
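
The fetch-and-save part of that idea can be sketched in Python instead of a wget shell script, which is a swapped-in technique rather than what was suggested above. The function names, the file naming scheme, and the two-second pause are all assumptions for illustration.

```python
import time
import urllib.request
from pathlib import Path


def fetch_url(url: str) -> bytes:
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def save_pages(urls, out_dir, fetch=fetch_url, delay_seconds=2.0):
    """Save each URL into out_dir as page_0001.html, page_0002.html, and so on.

    `fetch` can be replaced with another function, which makes the loop
    easy to try without touching the network.
    """
    urls = list(urls)
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        target = out_path / f"page_{index:04d}.html"
        target.write_bytes(fetch(url))
        saved.append(target)
        if index < len(urls):
            time.sleep(delay_seconds)  # pause between requests to go easy on the server
    return saved
```

A crontab entry along the lines of `0 23 * * 0 python3 /path/to/script.py` (the path is a placeholder) would run a script built on this every Sunday at 23:00. On Windows, Task Scheduler plays the same role as cron.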
__________________
-Rob Putt - Blog!
CompuForums Secondary Administrator
+ Email Me! - rob at compuforums dot org
#8
01-26-2007, 09:36 AM
OllieB (Newbie)
Join Date: 24 Jan 2007
Posts: 6

Hi Rob
Thanks for the advice. I used to have a Linux box, but no longer. I understand the benefits of using a shell script, using wget in a loop to get the data and output it to a file, and using cron to schedule it. I'm not familiar with SQL file formats or how to use them. I see the whole process in 3 steps:
1. Use wget to fetch the data pages.
2. Use awk (or whatever) to strip out the relevant data.
3. Export this data to Excel in columns.
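
Steps 2 and 3 above might look something like the following in Python, used here in place of awk. It is a sketch under a big assumption: that the data sits in ordinary HTML table rows (`<tr>` with `<td>` cells). The sample HTML and its values are invented, and the real pages may well be laid out differently.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows with no <td> cells, such as header rows
                self.rows.append(self._row)
            self._row = None


def extract_rows(html_text):
    """Step 2: pull the table rows out of one page of HTML."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


def rows_to_csv(rows, header):
    """Step 3: turn the rows into CSV text, which Excel opens in columns."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample page, standing in for one saved result page.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Location</th><th>Reference ID</th><th>Date</th></tr>
  <tr><td>A. Example</td><td>Devon</td><td>REF-001</td><td>2007-01-24</td></tr>
  <tr><td>B. Sample</td><td>London</td><td>REF-002</td><td>2007-01-25</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows, ["Name", "Location", "Reference ID", "Date"]))
```

Running `extract_rows` over every saved page and writing the combined rows to a single .csv file gives something Excel can open directly, with one column per field.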

Are you able to do this kind of work yourself?
All times are GMT.