
Crawling websites in PHP

tan
Posted: Friday, May 30, 2008 8:44:07 AM

Hi guys,

What syntax do we need to write so that a PHP page can crawl web pages over the internet?
rasheed
Posted: Friday, May 30, 2008 8:45:35 AM

Well, you can use fopen() with allow_url_fopen enabled in php.ini. You can pass a stream context to supply additional parameters (user agent, timeout, redirect limits and so on).
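For what it's worth, here is a minimal sketch of the fopen() route, assuming PHP 7.1+ and allow_url_fopen switched on. The helper names (make_http_context, fetch_with_fopen) and the "ExampleBot/0.1" user agent are made up for illustration; the option keys are the standard "http" stream context options.

```php
<?php
// Hypothetical helper: builds a stream context carrying the HTTP options
// a polite crawler would normally want to set.
function make_http_context(string $userAgent, int $timeoutSeconds)
{
    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'user_agent'      => $userAgent,
            'timeout'         => $timeoutSeconds, // read timeout in seconds
            'follow_location' => 1,               // follow redirects...
            'max_redirects'   => 5,               // ...but not forever
            'ignore_errors'   => true,            // still return the body on 4xx/5xx
        ],
    ]);
}

// Hypothetical helper: returns the response body, or null if the
// resource could not be opened or read.
function fetch_with_fopen(string $url, $context): ?string
{
    $handle = @fopen($url, 'r', false, $context);
    if ($handle === false) {
        return null;
    }
    $body = stream_get_contents($handle);
    fclose($handle);
    return $body === false ? null : $body;
}

// Usage (needs network access, so left commented out):
// $ctx  = make_http_context('ExampleBot/0.1', 10);
// $html = fetch_with_fopen('http://example.com/', $ctx);
```

This is fine for a quick experiment, but you only get one blocking request at a time, which is part of why it stops being flexible enough fairly quickly.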

Alternatively you can use curl (if enabled).
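A rough sketch of the curl route, assuming the curl extension is loaded and PHP 7.1+. The function name fetch_with_curl, the returned array layout and the "ExampleBot/0.1" user agent are my own choices for illustration, not anything standard.

```php
<?php
// Hypothetical helper: fetches one URL with the curl extension and returns
// the status code, final URL, content type and body, or null on failure.
function fetch_with_curl(string $url, string $userAgent = 'ExampleBot/0.1'): ?array
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_MAXREDIRS      => 5,    // cap redirect chains
        CURLOPT_CONNECTTIMEOUT => 5,    // seconds allowed for connecting
        CURLOPT_TIMEOUT        => 15,   // seconds allowed for the whole transfer
        CURLOPT_USERAGENT      => $userAgent,
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        return null; // the handle is released when $ch goes out of scope
    }

    return [
        'status'       => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'final_url'    => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'content_type' => curl_getinfo($ch, CURLINFO_CONTENT_TYPE),
        'body'         => $body,
    ];
}

// Usage (needs network access, so left commented out):
// $page = fetch_with_curl('http://example.com/');
// if ($page !== null && $page['status'] === 200) { /* parse $page['body'] */ }
```

Having the status code and final URL handy makes it easier to deal with redirects and error pages than with plain fopen().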

As a third option, you can use one of the other HTTP clients already made (there are at least two in PEAR, I've not tried either of them).

As a fourth option you can write your own HTTP implementation (which is what I ultimately did after it became obvious that fopen() wasn't flexible enough even with the stream context options).

You'll also need an HTML parser. Fortunately PHP 5 has one built in via libxml2: the DOMDocument::loadHTML() method will do what you want.
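As a small sketch of that, assuming the dom extension is available: the function name extract_links is invented for this example, and it only collects raw href values from <a> tags without resolving relative URLs.

```php
<?php
// Hypothetical helper: returns the unique href values of all <a> elements,
// in document order, tolerating malformed markup.
function extract_links(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();

    // Real pages are full of markup errors; collect libxml warnings
    // internally instead of letting them surface as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return array_values(array_unique($links));
}

// Usage:
// $links = extract_links('<a href="/a">A</a> <a href="/b">B</a>');
```

Relative links still have to be resolved against the page URL before they go into the crawl queue.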

Making a web crawler is VERY involved and takes a huge amount of work. Real web pages contain a lot of errors, and you'll run into a lot of problems.

Issues I found:
- Multithreading efficiently
- Database locking / contention issues
- Startup/shutdown and remembering what pages are done
- Parsing robots.txt
- Handling broken things (for example, servers which return a 200 status even for pages which don't exist).
- Handling spam sites created just to **Banned Word** robots off (believe me, there are a LOT of these)
- Gracefully handling errors / exceptions thrown from inside the crawler itself and deciding what to do with those URLs in the queue
- Handling encodings correctly - even when the page has several conflicting messages (headers, meta) about which encoding it's in or just plain lies.
- Handling non-HTML pages
- Redirect handling
- Deciding what to spider next / prioritisation
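To give an idea of what one item on that list involves, here is a deliberately simplified robots.txt sketch. It assumes plain prefix matching only (no wildcards, no Allow lines, no Crawl-delay), and it merges the rules of the "*" group with those of any group naming your bot rather than letting the more specific group win. The function names and the "ExampleBot" agent are invented for the example.

```php
<?php
// Hypothetical helper: collects the Disallow prefixes from every group
// that applies to $agent (either "User-agent: *" or a name found in $agent).
function parse_robots_disallows(string $robotsTxt, string $agent = '*'): array
{
    $disallows    = [];
    $applies      = false; // does the current group apply to us?
    $inAgentLines = false; // are we inside a run of User-agent lines?

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$inAgentLines) {
                $applies = false; // a new group starts here
            }
            $inAgentLines = true;
            if ($value === '*' || ($value !== '' && stripos($agent, $value) !== false)) {
                $applies = true;
            }
        } else {
            $inAgentLines = false;
            // An empty Disallow value means "nothing is disallowed".
            if ($field === 'disallow' && $applies && $value !== '') {
                $disallows[] = $value;
            }
        }
    }

    return $disallows;
}

// Hypothetical helper: true if $path matches none of the Disallow prefixes.
function is_path_allowed(string $path, array $disallows): bool
{
    foreach ($disallows as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return false;
        }
    }
    return true;
}
```

Real-world robots.txt files are much messier than this handles, which is rather the point of the list above.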

These are just a few of the issues I found when trying to do this.
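For the "remembering what pages are done" part, a single-process sketch might look like the following. The class name Frontier, its methods and the JSON file layout are all assumptions made for illustration; it does no locking, so it sidesteps the threading and database contention problems entirely, and it is plain first-in-first-out with no prioritisation.

```php
<?php
// Hypothetical crawl frontier: a FIFO queue of pending URLs plus a set of
// every URL ever queued, which can be saved and reloaded between runs.
final class Frontier
{
    private $pending = []; // URLs waiting to be fetched, oldest first
    private $seen    = []; // normalised URL => true

    // Queues a URL unless it is unusable or was already seen.
    public function enqueue(string $url): bool
    {
        $key = self::normalise($url);
        if ($key === null || isset($this->seen[$key])) {
            return false;
        }
        $this->seen[$key] = true;
        $this->pending[]  = $key;
        return true;
    }

    // Returns the next URL to fetch, or null when the queue is empty.
    public function next(): ?string
    {
        return array_shift($this->pending);
    }

    // Writes the queue and the seen set to a JSON file.
    public function save(string $file): void
    {
        file_put_contents($file, json_encode([
            'pending' => $this->pending,
            'seen'    => array_keys($this->seen),
        ]));
    }

    // Restores a frontier previously written by save().
    public static function load(string $file): self
    {
        $data     = json_decode((string) file_get_contents($file), true);
        $frontier = new self();
        $frontier->pending = $data['pending'] ?? [];
        foreach ($data['seen'] ?? [] as $url) {
            $frontier->seen[$url] = true;
        }
        return $frontier;
    }

    // Lower-cases scheme and host, drops the fragment, rejects non-HTTP URLs.
    public static function normalise(string $url): ?string
    {
        $parts = parse_url($url);
        if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
            return null;
        }
        $scheme = strtolower($parts['scheme']);
        if ($scheme !== 'http' && $scheme !== 'https') {
            return null;
        }
        $port  = isset($parts['port']) ? ':' . $parts['port'] : '';
        $path  = $parts['path'] ?? '/';
        $query = isset($parts['query']) ? '?' . $parts['query'] : '';
        return $scheme . '://' . strtolower($parts['host']) . $port . $path . $query;
    }
}
```

Calling save() after each batch and load() on startup lets the crawler pick up where it left off, at least as long as only one process touches the file.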

My conclusion was that PHP isn't a very suitable language for an HTTP spider - it simply doesn't give you enough low-level control over most things (such as sockets, processes, threads, locking, high-performance db stuff).

But it did work and I spidered hundreds of thousands of web pages with it.