Web scraping

Q. Is there any FOSS tool for detecting the layout of a web page?

A web page usually has a left pane, a right pane and so on, as on a Wikipedia article page. I need to extract only the article, not the content of the side panes. I have downloaded the Tamil Wikipedia HTML dump and some blog pages, and I need to extract the content from them.

A. Firstly, what you are describing (extracting only some elements from a web page) is commonly referred to as web scraping. Various libraries and tools are available for scraping web pages; the right one depends on how you wish to do it (i.e. single pages or multiple pages, choice of programming language, etc.). Do a Google search, and if you need more help narrowing down the choices, ask again with more specifics about the type of tools you would prefer.
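As a rough illustration of the idea, here is a minimal sketch in Python using only the standard library. It keeps the text inside one container element and ignores everything else on the page. The container id "mw-content-text" is an assumption about how Wikipedia pages mark the article body, and the names ContentExtractor and extract_article are made up for this example; check the actual HTML in your dump or blog pages and adjust the id to match.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect the text inside the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.div_depth = 0    # > 0 while we are inside the target <div>
        self.skip_depth = 0   # > 0 while we are inside <script> or <style>
        self.done = False     # True once the target <div> has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.div_depth == 0:
            # Still outside: look for the container that holds the article.
            if tag == "div" and dict(attrs).get("id") == self.target_id:
                self.div_depth = 1
            return
        if tag == "div":
            self.div_depth += 1   # nested <div> inside the article
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.done or self.div_depth == 0:
            return
        if tag == "div":
            self.div_depth -= 1
            if self.div_depth == 0:
                self.done = True  # left the article container
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is inside the article and not script/style.
        if self.div_depth > 0 and self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_article(html, target_id="mw-content-text"):
    """Return the article text, one text fragment per line."""
    parser = ContentExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Tiny made-up page: a side pane, the article container, and a footer.
sample = (
    '<html><body><div id="sidebar">Navigation</div>'
    '<div id="mw-content-text"><p>Article text.</p>'
    '<div class="note">Nested note</div><script>var x = 1;</script></div>'
    '<div id="footer">Footer</div></body></html>'
)
print(extract_article(sample))
```

For a saved page, read the file into a string and pass it to extract_article. Blog pages will most likely use a different container id, so pass that as target_id. A dedicated scraping library may cope better with badly formed HTML than this sketch does.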

Secondly, if you intend to get a large amount of content from Wikipedia, it is recommended that you /do not/ use an automated tool:
Instead use one of the alternate methods mentioned in the page above.
Thirdly, if you just want to remove unnecessary elements from a page and save only the content, while browsing, I would suggest using one of these Firefox tools:
This answer was provided by Mr. Steve.
You can mail him at steve@lonetwin.net.