Generating Input Data

To generate input data for KeLP, we developed a dedicated project: kelp-input-generator.

This project relies on third-party software components, such as the Stanford Parser, and provides the functionality needed to extract KeLP data structures from text snippets. As a general-purpose machine learning platform, KeLP is not limited to Natural Language Processing tasks. However, for the moment, we do not provide feature extraction capabilities for other fields.

To keep the main KeLP project lightweight, kelp-input-generator is not included in kelp-full. If you want to use kelp-input-generator in your Maven project, you can include it by adding the following Maven repositories:

	<repositories>
		<repository>
			<id>kelp_repo_snap</id>
			<name>KeLP Snapshots Repository</name>
			<releases>
				<enabled>false</enabled>
				<updatePolicy>always</updatePolicy>
				<checksumPolicy>warn</checksumPolicy>
			</releases>
			<snapshots>
				<enabled>true</enabled>
				<updatePolicy>always</updatePolicy>
				<checksumPolicy>fail</checksumPolicy>
			</snapshots>
			<url>http://sag.art.uniroma2.it:8081/artifactory/kelp-snapshot/</url>
		</repository>
		<repository>
			<id>kelp_repo_release</id>
			<name>KeLP Stable Repository</name>
			<releases>
				<enabled>true</enabled>
				<updatePolicy>always</updatePolicy>
				<checksumPolicy>warn</checksumPolicy>
			</releases>
			<snapshots>
				<enabled>false</enabled>
				<updatePolicy>always</updatePolicy>
				<checksumPolicy>fail</checksumPolicy>
			</snapshots>
			<url>http://sag.art.uniroma2.it:8081/artifactory/kelp-release/</url>
		</repository>
	</repositories>

Then, the Maven dependency for the kelp-input-generator project is:

	<dependencies>
		<dependency>
			<groupId>it.uniroma2.sag.kelp</groupId>
			<artifactId>kelp-input-generator</artifactId>
			<version>1.0.1-SNAPSHOT</version>
		</dependency>
	</dependencies>

Currently, kelp-input-generator makes it easy to generate TreeRepresentations from text snippets. In particular, it can extract the LOCT, LCT and GRCT representations, which are tree views of a dependency graph, as introduced in (Croce et al., 2011).
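
As a rough illustration of what a tree view of a dependency graph looks like, the sketch below turns a toy dependency parse into a bracketed tree that keeps only the lexical nodes, in the spirit of the LOCT. It is not the kelp-input-generator implementation: the class LoctSketch, the method toLoct and the plain-word node labels are hypothetical, and the labels actually produced by the library may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this class is NOT part of kelp-input-generator.
class LoctSketch {

    // Builds a bracketed tree in which every node is a word and its
    // children are the words that depend on it.
    // heads[i] is the index of the head of word i, or -1 for the root.
    public static String toLoct(String[] words, int[] heads) {
        List<List<Integer>> children = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            children.add(new ArrayList<>());
        }
        int root = -1;
        for (int i = 0; i < words.length; i++) {
            if (heads[i] < 0) {
                root = i;
            } else {
                children.get(heads[i]).add(i);
            }
        }
        if (root < 0) {
            throw new IllegalArgumentException("the dependency graph has no root");
        }
        StringBuilder sb = new StringBuilder();
        write(root, words, children, sb);
        return sb.toString();
    }

    // Depth-first visit that emits "(word child1 child2 ...)" without spaces.
    private static void write(int node, String[] words,
                              List<List<Integer>> children, StringBuilder sb) {
        sb.append('(').append(words[node]);
        for (int child : children.get(node)) {
            write(child, words, children, sb);
        }
        sb.append(')');
    }

    public static void main(String[] args) {
        // Toy parse: "the" depends on "cat", "cat" depends on "sleeps".
        String[] words = {"the", "cat", "sleeps"};
        int[] heads = {1, 2, -1};
        System.out.println(toLoct(words, heads)); // prints (sleeps(cat(the)))
    }
}
```

The LCT and GRCT views described in (Croce et al., 2011) additionally carry grammatical relation and part-of-speech nodes, which this sketch leaves out.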

KeLP uses its own format for representing graph data. However, a converter from the popular gSpan format is available in kelp-input-generator: it.uniroma2.sag.kelp.input.graph.GspanFormatConverter. The main method of this class can be invoked by passing the gSpan file as a parameter (and, optionally, a file with the target labels, if they are available and not already included in the gSpan file).
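
For readers who have not seen gSpan input before, the following sketch writes a tiny graph in the commonly used gSpan text layout ("t # id" to open a graph, "v id label" for a vertex, "e from to label" for an edge). The class GspanSample and its method toGspan are hypothetical helpers, and the exact dialect accepted by GspanFormatConverter is an assumption that should be checked against the converter itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for producing a test input; NOT part of kelp-input-generator.
class GspanSample {

    // Serializes one graph in the common gSpan layout.
    // edges[k] = {source vertex id, target vertex id}; edgeLabels[k] is its label.
    public static String toGspan(int graphId, String[] vertexLabels,
                                 int[][] edges, String[] edgeLabels) {
        StringBuilder sb = new StringBuilder();
        sb.append("t # ").append(graphId).append('\n');
        for (int v = 0; v < vertexLabels.length; v++) {
            sb.append("v ").append(v).append(' ').append(vertexLabels[v]).append('\n');
        }
        for (int e = 0; e < edges.length; e++) {
            sb.append("e ").append(edges[e][0]).append(' ').append(edges[e][1])
              .append(' ').append(edgeLabels[e]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A two-vertex graph with a single labeled edge.
        String text = toGspan(0, new String[]{"C", "O"},
                              new int[][]{{0, 1}}, new String[]{"1"});
        Path file = Files.createTempFile("sample", ".gspan");
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        // The resulting path is what would be passed as the gSpan file
        // argument to the main method of GspanFormatConverter.
        System.out.println(file);
        System.out.print(text);
    }
}
```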

If your input graphs are in a format supported by Open Babel, the following script converts graphs from any of the Open Babel formats to gSpan. Therefore, all 111 Open Babel formats are indirectly supported as well.

In the future we plan to extend kelp-input-generator by adding the possibility to extract shallow and constituency tree representations, as well as SequenceRepresentations and DirectedGraphRepresentations.

References

Danilo Croce, Alessandro Moschitti, and Roberto Basili. Structured lexical similarity via convolution kernels on dependency trees. In Proceedings of EMNLP, Edinburgh, Scotland, UK, 2011.