Welcome to the Web API for the boilerpipe Java library.
boilerpipe provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
If you just want to see what boilerpipe does with a page, enter a URL below and click on "Extract".
Please note: Due to heavy use of this free service in the past, the number of requests per user is limited.
The restriction can be removed by purchasing a commercial license for this Web API directly from Kohlschütter Search Intelligence for a modest fee.
You can see for yourself how boilerpipe works by calling this Web API, without using the boilerpipe core library directly.
Just call http://boilerpipe-web.appspot.com/extract?url=http://someurl to highlight the main content of an arbitrary URL.
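As a minimal sketch of the call above, the request URL can be assembled in Python. The helper name build_extract_url is hypothetical, and percent-encoding the target URL is an assumption (this page shows the target appended as-is); encoding keeps a target's own query string from being read as parameters of the Web API.

```python
from urllib.parse import urlencode

# Endpoint as given on this page.
EXTRACT_ENDPOINT = "http://boilerpipe-web.appspot.com/extract"


def build_extract_url(page_url: str) -> str:
    """Return the Web API URL that requests extraction for page_url.

    Hypothetical helper: it only builds the URL and sends no request.
    """
    # urlencode percent-encodes ':', '/', '?', '=' and '&' in the target
    # (assumption: the service accepts a percent-encoded url value).
    return EXTRACT_ENDPOINT + "?" + urlencode({"url": page_url})


print(build_extract_url("http://example.com/news?id=1&lang=en"))
```

Opening the printed URL in a browser should show the page with its main content highlighted, since html is the default output.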
This usually works fairly well, but you can adjust the extraction parameters to suit your needs. First, you can use several extraction strategies that come with boilerpipe. Second, you can choose from several output modes. These options can be specified using additional GET parameters:
To change the extraction strategy, add the extractor parameter, with one of the following values:
Strategy | Description |
---|---|
ArticleExtractor | (default). A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor. |
DefaultExtractor | A quite generic full-text extractor, but usually not as good as ArticleExtractor. |
LargestContentExtractor | Like DefaultExtractor, but only keeps the largest content block. Good for non-article style texts with only one main content block. |
KeepEverythingExtractor | Treats everything as "content". Useful to track down SAX parsing errors. |
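The strategy choice can be sketched as one more GET parameter. The function name extract_url_with_strategy is hypothetical; the strategy names are copied from the table above, and checking them on the client side is a convenience of this sketch, not something the service is known to require.

```python
from urllib.parse import urlencode

# Endpoint as given on this page.
EXTRACT_ENDPOINT = "http://boilerpipe-web.appspot.com/extract"

# Strategy names copied from the table above.
EXTRACTORS = (
    "ArticleExtractor",
    "DefaultExtractor",
    "LargestContentExtractor",
    "KeepEverythingExtractor",
)


def extract_url_with_strategy(page_url: str, extractor: str = "ArticleExtractor") -> str:
    """Build a Web API URL that selects an extraction strategy.

    Hypothetical helper: it only builds the URL and sends no request.
    """
    if extractor not in EXTRACTORS:
        raise ValueError(f"unknown extractor: {extractor!r}")
    # Both values are percent-encoded (assumption: the service accepts this).
    return EXTRACT_ENDPOINT + "?" + urlencode({"url": page_url, "extractor": extractor})


# A page with a single main block of text that is not a news article.
print(extract_url_with_strategy("http://example.com/", "LargestContentExtractor"))
```

Comparing the result for KeepEverythingExtractor with that of another strategy is a quick way to see what was classified as boilerplate.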
To change the output format, add the output parameter, with one of the following values:
Output Format | Description |
---|---|
html | (default). Output the whole HTML document and highlight the extracted main content. |
htmlFragment | Output only those HTML fragments that are regarded as main content. |
text | Output the extracted main content as plain text. |
json | Output the extracted main content as JSON. For details, see this page. |
debug | Output debug information to understand how boilerpipe internally represents a document. |
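A rough sketch of requesting a specific output format and reading the response follows. The names build_request_url and fetch_extraction are hypothetical, as are the 10-second timeout and the UTF-8 fallback when the response declares no charset. The layout of the json output is not described here, so the sketch returns the raw response body and leaves parsing to the caller.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as given on this page.
EXTRACT_ENDPOINT = "http://boilerpipe-web.appspot.com/extract"

# Output format names copied from the table above.
OUTPUT_FORMATS = ("html", "htmlFragment", "text", "json", "debug")


def build_request_url(page_url: str, output: str = "html") -> str:
    """Build a Web API URL that selects an output format (no request is sent)."""
    if output not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output!r}")
    return EXTRACT_ENDPOINT + "?" + urlencode({"url": page_url, "output": output})


def fetch_extraction(page_url: str, output: str = "text", timeout: float = 10.0) -> str:
    """Send the request and return the response body as a string.

    Raises urllib.error.HTTPError or URLError if the service rejects the
    request or cannot be reached, for example once the request limit is hit.
    """
    with urlopen(build_request_url(page_url, output), timeout=timeout) as response:
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


if __name__ == "__main__":
    # Only the URL is printed here; call fetch_extraction() to contact the service.
    print(build_request_url("http://example.com/", output="text"))
```

Because requests per user are limited, it is worth building and inspecting the URL first and calling fetch_extraction only once the parameters look right.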
The boilerpipe Web application is hosted on Google App Engine. The underlying library, boilerpipe, is available under the Apache 2.0 license on GitHub. If you want to know more about the algorithms, see this paper: Boilerplate Detection using Shallow Text Features.
This Web application may use a more recent version of boilerpipe than the ones released in the boilerpipe project on GitHub. You might thus get slightly different (hopefully better) results.
Copyright © 2010-2016 Christian Kohlschütter. All rights reserved.