Web information extraction (IE) is the process of retrieving the exact text fragments of record attributes from HTML web pages. Most existing approaches require substantial feature engineering: selecting or computing the underlying content, layout, and contextual features of web pages. Another disadvantage is that they require a great deal of human labor to annotate training examples. Methods based on wrapper adaptation drastically reduce the annotation work but still need many labeled pages from the seed website. In this work, we present a hierarchical attention recurrent neural network, an end-to-end model that does not require traditional, domain-specific feature engineering. The network can also be trained with only a few pages from a site, i.e., few-shot learning. Because the model automatically learns the semantics of text fragments in pages, we adapt the network to extract records from previously unseen websites. Experiments on a publicly available dataset demonstrate that our networks achieve competitive results against state-of-the-art approaches for both wrapper induction and wrapper adaptation.
Liu, S., Li, Y., & Fan, B. (2018). Hierarchical RNN for few-shot information extraction learning. In Communications in Computer and Information Science (Vol. 902, pp. 227–239). Springer Verlag. https://doi.org/10.1007/978-981-13-2206-8_20