Web scraping

Q. Is there any FOSS tool for detecting the layout of a web page?

A web page usually has a left pane, a right pane and so on, as on a Wikipedia article page. I need to extract only the article, not the content of the side panes. I have downloaded the Tamil Wikipedia HTML dump and some blog pages, and I need to extract the content from them.

A. Firstly, what you are describing (extracting only some elements from a web page) is commonly referred to as web scraping. Various libraries and tools are available for it; which one fits depends on how you wish to do it (i.e. single pages or multiple pages, choice of programming language, etc.). Search the web for "web scraping", and if you need more help narrowing down the choices, ask again with more specifics about the type of tools you would prefer.
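To make the idea concrete, here is a minimal sketch in Python that uses only the standard library. It keeps the text found inside one container element and ignores everything else (side panes, footer, scripts). The names ContentExtractor and extract_text are made up for this illustration, and the id "content" is an assumption: look at the HTML of your own dump or blog pages to find the id of the element that actually wraps the article. The sketch also assumes reasonably well-balanced tags; for messy real-world pages a dedicated scraping library is likely a better choice.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling, not article content.
SKIP_TAGS = {"script", "style"}


class ContentExtractor(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id  # assumed id of the article container
        self.depth = 0              # nesting depth inside the container
        self.skip_depth = 0         # > 0 while inside <script> or <style>
        self.done = False           # True once the container has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth == 0:
            # Still outside: start collecting only at the target element.
            if dict(attrs).get("id") == self.target_id:
                self.depth = 1
            return
        self.depth += 1
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        self.depth -= 1
        if self.depth == 0:
            self.done = True  # the container itself has just closed

    def handle_data(self, data):
        if self.depth > 0 and not self.skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text(html, target_id="content"):
    """Return the text of the element with the given id, space-joined."""
    parser = ContentExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For a saved page you would read the file into a string (with the right encoding, for example UTF-8) and call extract_text(page_html, "content"), replacing "content" with whatever id your pages use.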

Secondly, if you intend to get a large amount of content from Wikipedia, it is recommended that you /do not/ use an automated tool to crawl the live site:

http://en.wikipedia.org/wiki/Wikipedia_database#Please_do_not_use_a_web_crawler

Instead, use one of the alternative methods mentioned on the page above.

Thirdly, if you just want to remove unnecessary elements from a page and save only the content while browsing, I would suggest one of these Firefox tools:

Aardvark: http://karmatics.com/aardvark/

Readability: http://lab.arc90.com/experiments/readability/

This answer was provided by Mr. Steve.

You can mail him at steve@lonetwin.net

Mano

I have the light, come get enlightened …
