Databases and Information Systems
 

Automatic Web Data Extraction for Information Monitoring

Overview

Much of the information available on the World Wide Web is published as presentation-oriented, semi-structured HTML pages, which makes it difficult for machines to access the content automatically. Tools are therefore required that can extract the information contained in such resources and transform it into machine-readable and machine-understandable formats.

Because the vision of the Semantic Web still seems far from being accomplished, owing to the lack of simple mechanisms, we propose a fully automatic wrapper prototype called ViPER (Visual Perception-based Extraction of Records), which extracts repetitive structured data records from HTML pages with high precision and recall. ViPER first tries to identify repetitive structures in the HTML source code, and then weights and separates the candidate patterns according to the 2D layout information provided by the browser's rendering. After extraction, the pattern with the highest weight is aligned into a table, which makes the data convenient to post-process. By mapping the data into a structured format, machines are finally able to process the relevant content automatically.
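The first step, finding repetitive structures in the HTML source, can be illustrated with a minimal Python sketch. It is not ViPER's actual algorithm: it assumes that records repeat exactly and back to back, ignores the layout-based weighting entirely, and the function names `tag_sequence` and `best_tandem_repeat` are hypothetical.

```python
import re


def tag_sequence(html):
    """Flatten HTML into a list of tag tokens such as 'li' and '/li'."""
    return [t.lower() for t in re.findall(r"<\s*(/?[a-zA-Z][a-zA-Z0-9]*)", html)]


def best_tandem_repeat(tokens, min_unit=2, max_unit=20):
    """Return (start, unit_length, repeats) for the exact back-to-back
    repetition that covers the most tokens, or None if nothing repeats.

    Assumption: records are identical in their tag structure. Real pages
    need approximate matching, which this sketch does not attempt.
    """
    best, best_cover = None, 0
    n = len(tokens)
    for unit in range(min_unit, min(max_unit, n // 2) + 1):
        for start in range(n - 2 * unit + 1):
            pattern = tokens[start:start + unit]
            repeats = 1
            # Count how often the pattern follows itself without a gap.
            while tokens[start + repeats * unit:start + (repeats + 1) * unit] == pattern:
                repeats += 1
            if repeats >= 2 and repeats * unit > best_cover:
                best, best_cover = (start, unit, repeats), repeats * unit
    return best


page = (
    "<ul>"
    "<li><b>A</b><i>1</i></li>"
    "<li><b>B</b><i>2</i></li>"
    "<li><b>C</b><i>3</i></li>"
    "</ul>"
)
print(best_tandem_repeat(tag_sequence(page)))  # (1, 6, 3)
```

Here the six-token unit `li b /b i /i /li` starts at token 1 and repeats three times, so each repetition corresponds to one candidate data record.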

The ViPER system has been developed on top of JREX, which enables users to access Mozilla's XPCOM interface from Java. ViPER is integrated into a meta search engine environment called ASTRO, and we are currently working on a plugin prototype that lets a user take advantage of ViPER's extraction capabilities while browsing the Web in Mozilla or Firefox.

For example, the ViPER plugin enables a user to easily generate an agent that monitors a dynamic Web page and sends a notification as soon as the price of an item drops below a certain limit. The number of online stores the agent can monitor is not limited.
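The core check such a monitoring agent performs can be sketched in Python as follows. This is only an illustration under assumptions: the `Record` type, its fields, and the `price_alerts` function are hypothetical and do not reflect the plugin's real interface, and the records are assumed to have already been extracted from any number of stores.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One extracted data record (hypothetical structure)."""
    store: str
    item: str
    price: float


def price_alerts(records, item, limit):
    """Return the records for `item` priced strictly below `limit`, cheapest first."""
    hits = [r for r in records if r.item == item and r.price < limit]
    return sorted(hits, key=lambda r: r.price)


# Records from several stores; the number of stores is not restricted.
records = [
    Record("shop-a", "camera", 199.0),
    Record("shop-b", "camera", 149.5),
    Record("shop-c", "camera", 139.0),
    Record("shop-a", "tripod", 20.0),
]

for hit in price_alerts(records, "camera", limit=150.0):
    print(f"Notify: {hit.item} at {hit.store} now costs {hit.price:.2f}")
```

An agent would rerun this check each time the monitored pages are extracted again and send a notification for every hit.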

A sample snapshot of the plugin, demonstrating the automatic extraction functionality applied to a Web site containing repetitive data records, is shown below. We plan to make the plugin freely available in the future.



For further snapshots taken while using the ViPER plugin, see pictures 1, 2, and 3.

ViPER uses several novel techniques to extract relevant information from dynamic Web sites. Most of the algorithms for pattern detection and alignment come from the bioinformatics domain. To demonstrate that our approach extracts repetitive data records with high precision and recall, we compared our system with current state-of-the-art systems on several third-party data sets, with very good results.
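To give an impression of how a bioinformatics-style alignment lines up the fields of two records, here is a minimal Python sketch of global pairwise alignment in the style of Needleman-Wunsch. The text above does not say which alignment algorithm ViPER uses, so this technique is swapped in purely for illustration; the scoring values and the field labels are assumptions as well.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists; None marks a gap in the result."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for placing a[i-1] and b[j-1] in the same column.
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover the columns.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append(None)
            i -= 1
        else:
            row_a.append(None)
            row_b.append(b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1]


# Two records; the second one lacks a price field (hypothetical labels).
first = ["img", "title", "price", "rating"]
second = ["img", "title", "rating"]
for column in zip(*align(first, second)):
    print(column)
```

Each printed pair is one table column. The missing price in the second record shows up as `None`, which is how aligned records with optional fields can still be placed in a common table.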



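Extraction systems (an asterisk marks results reported in [1])

Data set           | System     | #Search engines | #Pages per search engine | #Search Result Records (SRRs) | #SRRs extracted | #SRRs correct | Recall | Precision
Data Set 2 [1]     | ViNTs* [1] | 100             | 5                        | 6905                          | 6872            | 6740          | 97.6%  | 98.1%
Data Set 2 [1]     | ViPER      | 100             | 1                        | 1390                          | 1419            | 1378          | 99.1%  | 97.1%
Data Set 3 [1]     | ViNTs* [1] | 42              | 1                        | 795                           | 795             | 785           | 98.7%  | 98.7%
Data Set 3 [1]     | MDR* [3]   | 42              | 1                        | 795                           | 479             | 420           | 52.8%  | 87.7%
Data Set 3 [1]     | ViPER      | 42              | 1                        | 795                           | 790             | 779           | 98.0%  | 98.6%
TBDW Ver. 1.02 [2] | ViNTs      | 51              | 1                        | 693                           | 661             | 618           | 89.2%  | 93.5%
TBDW Ver. 1.02 [2] | ViPER      | 51              | 1                        | 693                           | 686             | 676           | 97.6%  | 98.5%
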
[1] W. Meng, V. Raghavan, and C. Yu. Fully automatic wrapper generation for search engines. In WWW14, May 2005.
[2] Y. Yamada, N. Craswell, T. Nakatoh, and S. Hirokawa. Testbed for information extraction from deep web. In WWW Alt.’04, pages 346–347, New York, NY, USA, 2004. ACM Press.
[3] B. Liu, R. Grossman, and Y. Zhai. Mining data records in web pages. In SIGKDD’03, 2003.
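
The recall and precision figures in the comparison follow from the record counts. The short Python sketch below recomputes them for Data Set 3, assuming the standard definitions (recall is the share of all search result records that were extracted correctly, precision is the share of extracted records that are correct). The function name `recall_precision` is hypothetical.

```python
def recall_precision(total, extracted, correct):
    """Recall = correct / total records; precision = correct / extracted records."""
    return correct / total, correct / extracted


# Counts for Data Set 3 as listed above: (total SRRs, extracted, correct).
data_set_3 = {
    "ViNTs": (795, 795, 785),
    "MDR": (795, 479, 420),
    "ViPER": (795, 790, 779),
}

for system, counts in data_set_3.items():
    recall, precision = recall_precision(*counts)
    print(f"{system}: recall {recall:.1%}, precision {precision:.1%}")
# ViNTs: recall 98.7%, precision 98.7%
# MDR: recall 52.8%, precision 87.7%
# ViPER: recall 98.0%, precision 98.6%
```

MDR's low recall comes from the small number of records it extracts at all (479 of 795), not only from incorrect extractions.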

Publications

The following list identifies papers relevant to the project that were published at prestigious conferences. Please feel free to contact us to obtain the most recent, still unpublished work.

Project Members