Welcome to the Web API for the boilerpipe Java library.
boilerpipe provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
If you just want to see what boilerpipe does with a page, enter a URL below and click on "Extract".
Please note: Due to heavy use of this free service in the past, the number of requests per user is limited.
The restriction can be removed by purchasing a commercial license for this Web API directly from Kohlschütter Search Intelligence for a modest fee.
To see how boilerpipe works without using the boilerpipe core library directly, you can use this Web API instead.
Just call http://boilerpipe-web.appspot.com/extract?url=http://someurl
to highlight the main content of an arbitrary URL.
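As a minimal sketch, the request URL can be assembled programmatically. The endpoint and the url parameter come from the text above; the helper name build_extract_url is hypothetical, and the sketch assumes the service accepts a percent-encoded url value, as is usual for query strings.

```python
from urllib.parse import urlencode

# Endpoint as given in the text above.
ENDPOINT = "http://boilerpipe-web.appspot.com/extract"


def build_extract_url(page_url: str) -> str:
    """Return the request URL for page_url (hypothetical helper)."""
    # urlencode percent-encodes the target URL so that its own "?" and "&"
    # characters are not mistaken for parameters of the API call.
    return ENDPOINT + "?" + urlencode({"url": page_url})


if __name__ == "__main__":
    print(build_extract_url("http://example.com/news?id=1&lang=en"))
```

Encoding matters mainly when the target URL carries its own query string; for a plain URL such as http://someurl the literal form shown above should behave the same way.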
This usually works fairly well, but you can adjust the extraction parameters to suit your needs. First, you can use several extraction strategies that come with boilerpipe. Second, you can choose from several output modes. These options can be specified using additional GET parameters:
To change the extraction strategy, add the extractor parameter with one of the following values:
| Strategy | Description |
|---|---|
| ArticleExtractor | (default). Uses ArticleExtractor: A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor. |
| DefaultExtractor | Uses DefaultExtractor: A quite generic full-text extractor, but usually not as good as ArticleExtractor. |
| LargestContentExtractor | Uses LargestContentExtractor: Like DefaultExtractor, but only keeps the largest content block. Good for non-article style texts with only one main content block. |
| KeepEverythingExtractor | Uses KeepEverythingExtractor: Treats everything as "content". Useful to track down SAX parsing errors. |
| CanolaExtractor | Uses CanolaExtractor: A full-text extractor trained on krdwrd Canola. If you are curious :-) |
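
To compare strategies on the same page, one option is to generate one request URL per extractor value. The sketch below only builds the URLs and does not contact the service; the strategy names are copied from the table above, while the helper name urls_for_all_extractors is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint as given in the text above.
ENDPOINT = "http://boilerpipe-web.appspot.com/extract"

# Strategy names copied from the table above.
EXTRACTORS = (
    "ArticleExtractor",
    "DefaultExtractor",
    "LargestContentExtractor",
    "KeepEverythingExtractor",
    "CanolaExtractor",
)


def urls_for_all_extractors(page_url: str) -> dict:
    """Map each strategy name to a request URL for page_url (hypothetical helper)."""
    return {
        name: ENDPOINT + "?" + urlencode({"url": page_url, "extractor": name})
        for name in EXTRACTORS
    }


if __name__ == "__main__":
    for name, request_url in urls_for_all_extractors("http://example.com/").items():
        print(name, request_url)
```

Because requests per user are limited, it is probably wise to open only the variants you actually need rather than all five.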
To change the output format, add the output parameter with one of the following values:
| Output Format | Description |
|---|---|
| html | (default). Output the whole HTML document and highlight the extracted main content |
| htmlFragment | Output only those HTML fragments that are regarded as main content |
| text | Output the extracted main content as plain text |
| json | Output the extracted main content as JSON. For details, see this page. |
| debug | Output debug information to understand how boilerpipe internally represents a document. |
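
The following sketch shows one way to request a specific output format from a script. The format names are copied from the table above; the helper names build_request_url and fetch are hypothetical, and decoding the response as UTF-8 is an assumption, not something this page documents. The structure of the json output is not described here, so the sketch returns the raw body instead of parsing it.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as given in the text above.
ENDPOINT = "http://boilerpipe-web.appspot.com/extract"

# Output format names copied from the table above.
OUTPUT_FORMATS = ("html", "htmlFragment", "text", "json", "debug")


def build_request_url(page_url: str, output: str = "html") -> str:
    """Return the request URL for page_url with the given output format."""
    if output not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output!r}")
    return ENDPOINT + "?" + urlencode({"url": page_url, "output": output})


def fetch(page_url: str, output: str = "text", timeout: float = 10.0) -> str:
    """Send the request and return the response body as a string.

    Assumption: the body is UTF-8; undecodable bytes are replaced.
    """
    with urlopen(build_request_url(page_url, output), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only prints the URL; call fetch(...) to actually contact the service.
    print(build_request_url("http://example.com/", "text"))
```

Validating the format name locally avoids spending one of the limited requests on a value the service would not recognize.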
The boilerpipe Web application is hosted on Google App Engine. The underlying library, boilerpipe, is available under the Apache 2.0 license on Google Code. If you want to know more about the algorithms, see this paper: Boilerplate Detection using Shallow Text Features.
This Web application probably uses a more recent version of boilerpipe than the releases available on the boilerpipe Google Code page. You might thus get slightly different (hopefully better) results.
Copyright © 2010, 2011 Christian Kohlschütter. All rights reserved.