boilerpipe

Welcome to the Web API for the boilerpipe Java library.

boilerpipe provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

Demo

If you just want to see what boilerpipe does with a page, enter a URL below and click on "Extract".



Limitations

Please note: Due to heavy use of this free service in the past, the number of requests per user is limited.

The restriction can be removed by purchasing a commercial license for this Web API directly from Kohlschütter Search Intelligence for a modest fee.

Usage

This Web API lets you see for yourself how boilerpipe works without having to use the boilerpipe core library directly.

Just call http://boilerpipe-web.appspot.com/extract?url=http://someurl to highlight the main content of an arbitrary URL.
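As a minimal sketch, the request URL can be assembled in Python as shown here. The helper name build_extract_url is hypothetical, and percent-encoding the target URL is an assumption (the example above passes it unencoded); encoding is used so that a target URL containing its own query string is not mistaken for extra parameters of the Web API.

```python
from urllib.parse import urlencode

# Endpoint as given in the text above.
EXTRACT_ENDPOINT = "http://boilerpipe-web.appspot.com/extract"


def build_extract_url(page_url: str) -> str:
    """Return the Web API URL that highlights the main content of page_url.

    Hypothetical helper. Percent-encoding the target URL is an assumption
    made here so that '?' and '&' inside page_url survive intact.
    """
    return EXTRACT_ENDPOINT + "?" + urlencode({"url": page_url})


if __name__ == "__main__":
    # Prints the URL to open in a browser or pass to an HTTP client.
    print(build_extract_url("http://someurl"))
```

Opening the printed URL in a browser should give the same result as using the demo form, subject to the request limit described under Limitations.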

This usually works fairly well, but you can adjust the extraction parameters to suit your needs. First, you can use several extraction strategies that come with boilerpipe. Second, you can choose from several output modes. These options can be specified using additional GET parameters:

To change the extraction strategy, add the extractor parameter, with one of the following values:

Strategy                 Description
ArticleExtractor         (default). Uses ArticleExtractor: A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor.
DefaultExtractor         Uses DefaultExtractor: A quite generic full-text extractor, but usually not as good as ArticleExtractor.
LargestContentExtractor  Uses LargestContentExtractor: Like DefaultExtractor, but only keeps the largest content block. Good for non-article style texts with only one main content block.
KeepEverythingExtractor  Uses KeepEverythingExtractor: Treats everything as "content". Useful to track down SAX parsing errors.
CanolaExtractor          Uses CanolaExtractor: A full-text extractor trained on krdwrd Canola. If you are curious :-)

To change the output format, add the output parameter, with one of the following values:

Output Format  Description
html           (default). Output the whole HTML document and highlight the extracted main content.
htmlFragment   Output only those HTML fragments that are regarded as main content.
text           Output the extracted main content as plain text.
json           Output the extracted main content as JSON. For details, see this page.
debug          Output debug information to understand how boilerpipe internally represents a document.
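The two parameters can be combined in a single request. The following Python sketch builds such a URL and checks the values against the two tables above. The names build_request_url and fetch_main_text are hypothetical, and both percent-encoding the target URL and decoding the response as UTF-8 are assumptions that this page does not confirm.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as given in the Usage section.
EXTRACT_ENDPOINT = "http://boilerpipe-web.appspot.com/extract"

# Allowed values, copied from the two tables above.
EXTRACTORS = {
    "ArticleExtractor",
    "DefaultExtractor",
    "LargestContentExtractor",
    "KeepEverythingExtractor",
    "CanolaExtractor",
}
OUTPUTS = {"html", "htmlFragment", "text", "json", "debug"}


def build_request_url(page_url, extractor="ArticleExtractor", output="html"):
    """Build a Web API URL with explicit extractor and output parameters.

    Hypothetical helper; the defaults mirror the documented defaults.
    """
    if extractor not in EXTRACTORS:
        raise ValueError("unknown extractor: " + extractor)
    if output not in OUTPUTS:
        raise ValueError("unknown output format: " + output)
    query = urlencode({"url": page_url, "extractor": extractor, "output": output})
    return EXTRACT_ENDPOINT + "?" + query


def fetch_main_text(page_url, extractor="ArticleExtractor", timeout=30):
    """Request the plain-text output for page_url and return it as a string.

    Assumption: the response body is UTF-8; undecodable bytes are replaced.
    """
    url = build_request_url(page_url, extractor, "text")
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(build_request_url("http://someurl", "LargestContentExtractor", "text"))
```

Validating the values locally avoids spending one of the limited requests on a misspelled parameter. fetch_main_text performs a real network call, so it is subject to the request limit described under Limitations.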

More Info

The boilerpipe Web application is hosted on Google App Engine. The underlying library, boilerpipe, is available under the Apache 2.0 license on Google Code. If you want to know more about the algorithms, see this paper: Boilerplate Detection using Shallow Text Features.

Important Note

This Web application probably uses a more recent version of boilerpipe than the releases available on the boilerpipe Google Code page. You might thus get slightly different (hopefully better) results.

Copyright

Copyright © 2010, 2013 Christian Kohlschütter. All rights reserved.