Exposing a web API

Create a crawler with web API support

Two crawler implementations provide web API support:

CrawlerWithWebApi
CrawlerWithSecuredWebApi

These implementations allow users to register HTTP and WebSocket endpoints that can be used to interact with the crawler while it is running.

CrawlerWithWebApi

This implementation is intended to be used for public web APIs. It does not provide access control.

Configure the web server

The default configuration can be obtained with the ServerConfiguration.createDefault method. To create a custom configuration, use the ServerConfigurationBuilder class.

The default settings are the following:

Port: 8080
CORS allowed origins: *
CORS allowed methods: GET, POST, HEAD
CORS allowed headers: X-Requested-With, Content-Type, Accept, Origin
CORS exposed headers: None
SSL: Disabled

Configure SSL

In order to use SSL, an SSL context configuration is required.

This configuration specifies the following:

The path to the key store that holds the SSL certificate
The password used to access the key store
The password used to access the key in the key store (optional)
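
For orientation, the sketch below shows how these three values are typically consumed when an SSL context is built with plain JDK APIs. It is not the library's implementation: the class name SslContextSketch, the method fromKeyStore, the use of the JDK's default key store type, and the fallback to the key store password when no key password is given are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;

public final class SslContextSketch {

    private SslContextSketch() {
    }

    /**
     * Builds an SSLContext from a key store file.
     *
     * @param keyStorePath     path to the key store that holds the SSL certificate
     * @param keyStorePassword password used to access the key store
     * @param keyPassword      password of the key, or null to reuse the key store password (assumption)
     */
    public static SSLContext fromKeyStore(
            final Path keyStorePath,
            final char[] keyStorePassword,
            final char[] keyPassword) throws IOException, GeneralSecurityException {
        // Assumption: the file uses the JDK's default key store type
        KeyStore keyStore = KeyStore.getInstance(KeyStore.getDefaultType());
        try (InputStream in = Files.newInputStream(keyStorePath)) {
            keyStore.load(in, keyStorePassword);
        }

        // The key password is optional; fall back to the key store password
        char[] effectiveKeyPassword = keyPassword != null ? keyPassword : keyStorePassword;

        KeyManagerFactory keyManagerFactory =
                KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        keyManagerFactory.init(keyStore, effectiveKeyPassword);

        // Passing null for the trust managers and SecureRandom selects the JDK defaults
        SSLContext sslContext = SSLContext.getInstance("TLS");
        sslContext.init(keyManagerFactory.getKeyManagers(), null, null);
        return sslContext;
    }
}
```

In the library itself these values are passed to the SSL context configuration rather than to the JDK classes directly.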

Register HTTP endpoints

The addHttpEndpoint method is used to add HTTP endpoints to the web API.

The following parameters are required:

The HTTP method of the endpoint (e.g. GET)
The path of the endpoint (e.g. /api/foo)
The handler of the endpoint which handles the incoming HTTP request

Register WebSocket endpoints

The addWebSocketEndpoint method is used to add WebSocket endpoints to the web API.

The following parameters are required:

The path of the endpoint (e.g. /api/bar)
The handler of the endpoint which handles WebSocket events (e.g. when a client sends a text message)

Obtain open WebSocket sessions

The getOpenWebSocketSessions method returns the set of open WebSocket sessions, each representing a client connection to a given endpoint. This is useful when the crawler needs to send a message to every client connected to that endpoint.
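
The broadcast pattern described above can be sketched as follows. The TextSession interface is a hypothetical stand-in for whatever session type getOpenWebSocketSessions returns, and BroadcastSketch is an illustrative name; neither is part of the library.

```java
import java.io.IOException;
import java.util.Set;

public final class BroadcastSketch {

    private BroadcastSketch() {
    }

    /** Hypothetical stand-in for a WebSocket session; not the library's session type. */
    public interface TextSession {
        void sendText(String message) throws IOException;
    }

    /** Sends the message to every session and returns the number of successful sends. */
    public static int broadcast(final Set<? extends TextSession> sessions, final String message) {
        int delivered = 0;
        for (TextSession session : sessions) {
            try {
                session.sendText(message);
                delivered++;
            } catch (IOException e) {
                // A client may disconnect mid-broadcast; skip it and continue with the rest
            }
        }
        return delivered;
    }
}
```

Catching the failure per session keeps one closed connection from aborting the whole broadcast.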

Communicate using JSON format

By default, the web server communicates using JSON objects. It is recommended that custom HTTP and WebSocket endpoints follow the same convention. The JsonReaderWriter helper class makes this easier: it can read and write JSON structures from and to HTTP or WebSocket streams.

Example

Implement the crawler with web API support:

public class MyCrawler extends CrawlerWithWebApi {

    protected MyCrawler(
            final ServerConfiguration serverConfig,
            final CrawlerConfiguration crawlerConfig) {
        super(serverConfig, crawlerConfig);

        // Add HTTP endpoint for retrieving crawl stats
        addHttpEndpoint(HttpMethod.GET, "/api/crawler/stats",
                (request, response) -> JsonReaderWriter.writeJsonResponse(response, getCrawlStats()));
    }

    // ...
}

Configure and create the crawler:

// Create the web server configuration
ServerConfiguration serverConfig = ServerConfiguration.createDefault();

// Create the crawler configuration
CrawlerConfiguration crawlerConfig = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(CrawlRequest.createDefault("http://example.com/"))
        .build();

// Create the crawler using the configurations above
MyCrawler crawler = new MyCrawler(serverConfig, crawlerConfig);

CrawlerWithSecuredWebApi

This implementation is intended for private web APIs. Users must authenticate before they can access restricted endpoints, and access is granted only if they are authorized for the endpoint in question.

Configure the access control

To create a configuration, use the AccessControlConfigurationBuilder class. At least one user, called the root user, must be specified. Here "root" does not imply root privileges.

The default settings are the following:

Authentication path: /api/auth
Secret key used by the JWT signing algorithm: None (will be generated on the fly)
Cookie authentication: Disabled
JWT expiration duration: 1 hour

Authenticate

Send an HTTP POST request to the authentication endpoint. The request body should be a JSON object containing your credentials:

{
    "username": "username",
    "password": "password"
}

If the provided credentials are valid, the response will also be a JSON object containing the JWT.

If cookie authentication is enabled, two cookies will also be set:

JWT: with the token as value
XSRF-TOKEN: with the CSRF token as value
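
A client-side sketch of the authentication request, using the JDK's java.net.http API (Java 11 or later). The class name AuthRequestSketch and the base URL are assumptions, and the path is the default /api/auth listed above. The escaping helper handles only backslashes and double quotes, which is enough for an illustration but not a general JSON encoder.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public final class AuthRequestSketch {

    private AuthRequestSketch() {
    }

    /** Escapes backslashes and double quotes so the value can be embedded in a JSON string. */
    public static String escapeJson(final String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /**
     * Builds the POST request for the authentication endpoint.
     *
     * @param baseUrl scheme, host and port of the web API, e.g. http://localhost:8080 (hypothetical)
     */
    public static HttpRequest build(final String baseUrl, final String username, final String password) {
        String body = String.format("{\"username\": \"%s\", \"password\": \"%s\"}",
                escapeJson(username), escapeJson(password));

        // Assumes the default authentication path; adjust if it was customized
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/auth"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }
}
```

The request can be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). The name of the response field that carries the JWT is not stated here, so inspect the returned JSON before parsing it.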

Access HTTP endpoints that require authentication

The server will look for the JWT in the HTTP Authorization request header. If the header is not present and cookie authentication is enabled, it will look for the authentication cookie. To prevent CSRF when using cookie authentication, an X-XSRF-TOKEN header must be present in the request. This header's value should match the value of the XSRF-TOKEN cookie.
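
The two ways of presenting the token can be sketched from the client side with the JDK's java.net.http API. The "Bearer" scheme in the Authorization header is an assumption, since the text above does not name the scheme. The cookie names JWT and XSRF-TOKEN are the ones listed in the Authenticate section, and AuthenticatedRequestSketch is an illustrative class name.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public final class AuthenticatedRequestSketch {

    private AuthenticatedRequestSketch() {
    }

    /** Header-based variant: the JWT travels in the Authorization header. */
    public static HttpRequest withAuthorizationHeader(final URI endpoint, final String jwt) {
        return HttpRequest.newBuilder(endpoint)
                // Assumption: the server expects the "Bearer" scheme
                .header("Authorization", "Bearer " + jwt)
                .GET()
                .build();
    }

    /** Cookie-based variant: both cookies are sent back and the CSRF token is echoed in a header. */
    public static HttpRequest withCookies(final URI endpoint, final String jwt, final String xsrfToken) {
        return HttpRequest.newBuilder(endpoint)
                .header("Cookie", "JWT=" + jwt + "; XSRF-TOKEN=" + xsrfToken)
                // Must match the value of the XSRF-TOKEN cookie
                .header("X-XSRF-TOKEN", xsrfToken)
                .GET()
                .build();
    }
}
```

A browser attaches the cookies on its own, so in that setting only the X-XSRF-TOKEN header has to be added by client code.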

Access WebSocket endpoints that require authentication

The server will first look for the JWT in the access_token query parameter. If the query parameter is not present and cookie authentication is enabled, it will look for the authentication cookie. To prevent CSRF when using cookie authentication, the server will check the origin of the request.
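
A minimal sketch of building the connection URI with the access_token query parameter. The class name WebSocketUriSketch and the endpoint URL used in the usage note are assumptions.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public final class WebSocketUriSketch {

    private WebSocketUriSketch() {
    }

    /** Appends the JWT as the access_token query parameter of a WebSocket endpoint URL. */
    public static URI withAccessToken(final String endpointUrl, final String jwt) {
        // Encoding is a precaution; the characters of a well-formed JWT pass through unchanged
        String encoded = URLEncoder.encode(jwt, StandardCharsets.UTF_8);
        return URI.create(endpointUrl + "?access_token=" + encoded);
    }
}
```

For example, withAccessToken("ws://localhost:8080/api/bar", jwt) yields a URI that can be passed to HttpClient.newHttpClient().newWebSocketBuilder().buildAsync(uri, listener). Because the token ends up in the URL, it may appear in server and proxy logs.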

Register HTTP endpoints that require authentication

The addHttpEndpoint method is used to add restricted HTTP endpoints to the web API.

The following parameters are required:

The HTTP method of the endpoint (e.g. GET)
The path of the endpoint (e.g. /api/foo)
The set of allowed roles ("**" means any role is acceptable)
The handler of the endpoint which handles the incoming HTTP request
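
The role rule can be illustrated with a small check. This is a sketch of the semantics as described, not the library's code: it assumes that a user holding at least one of the allowed roles is granted access, and that "**" in the allowed set lets any authenticated user through. How the library treats a user with no roles at all is not stated here.

```java
import java.util.Collections;
import java.util.Set;

public final class RoleCheckSketch {

    /** Wildcard meaning that any role is acceptable. */
    public static final String ANY_ROLE = "**";

    private RoleCheckSketch() {
    }

    /** Returns true if a user with the given roles may access an endpoint with the given allowed roles. */
    public static boolean isAllowed(final Set<String> allowedRoles, final Set<String> userRoles) {
        if (allowedRoles.contains(ANY_ROLE)) {
            return true;
        }
        // Assumption: one matching role is sufficient
        return !Collections.disjoint(allowedRoles, userRoles);
    }
}
```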

Register WebSocket endpoints that require authentication

The addWebSocketEndpoint method is used to add restricted WebSocket endpoints to the web API.

The following parameters are required:

The path of the endpoint (e.g. /api/bar)
The set of allowed roles ("**" means any role is acceptable)
The handler of the endpoint which handles WebSocket events (e.g. when a client sends a text message)

Example

Implement the crawler with secured web API support:

public class MyCrawler extends CrawlerWithSecuredWebApi {

    protected MyCrawler(
            final ServerConfiguration serverConfig,
            final AccessControlConfiguration accessControlConfig,
            final CrawlerConfiguration crawlerConfig) {
        super(serverConfig, accessControlConfig, crawlerConfig);

        // Add HTTP endpoint for retrieving crawl stats
        // Only users with role "custom-role" can access this endpoint
        addHttpEndpoint(HttpMethod.GET, "/api/crawler/stats", Collections.singleton("custom-role"),
                (request, response) -> JsonReaderWriter.writeJsonResponse(response, getCrawlStats()));
    }

    // ...
}

Configure and create the crawler:

// Create the web server configuration
ServerConfiguration serverConfig = ServerConfiguration.createDefault();

// Create the access control configuration
User rootUser = new User("username", "password", Collections.singleton("custom-role"));

AccessControlConfiguration accessControlConfig = new AccessControlConfigurationBuilder(rootUser).build();

// Create the crawler configuration
CrawlerConfiguration crawlerConfig = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(CrawlRequest.createDefault("http://example.com/"))
        .build();

// Create the crawler using the configurations above
MyCrawler crawler = new MyCrawler(serverConfig, accessControlConfig, crawlerConfig);
