Who we are

Founded in 2009 in Hannover, we have been drawn to the big data ecosystem ever since, and we make sure that our international customers understand these new technologies and get them tailored to their individual needs. Read more…

What we do

We design architectures for large volumes of unstructured data. We build robust connections to your existing systems. We implement big data technologies and provide training for them. We help you ask entirely new kinds of questions. Read more…

Technologies

We scale your data processing with Hadoop, Spark, or streaming technologies such as Apache Flink or Apache Storm. We build analytics platforms based on Apache HUE, Apache Pig, Presto, Hive, or HBase. We implement robust, tightly interlocking processes with Oozie, Airflow, or Schedoscope. Read more…

From our developer blog

On our English-language site we blog about technologies and the day-to-day work with Hadoop, Spark, and co.:

GDELT on SCDF: Implementing a custom reactive source application

In the second part of our blog post series “Processing GDELT data with SCDF on Kubernetes”, we create a custom source application based on Spring Cloud Stream and the reactive framework to pull GDELT data and use it in a very simple flow.
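
To give a rough idea of the shape of such a source application, here is a minimal sketch using the functional programming model of a current Spring Cloud Stream release; the post itself targets the SCDF 1.7 era, so its actual implementation may differ, and the emitted placeholder records merely stand in for real GDELT downloads:

    import java.time.Duration;
    import java.util.function.Supplier;

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.context.annotation.Bean;

    import reactor.core.publisher.Flux;

    @SpringBootApplication
    public class GdeltSourceApplication {

        public static void main(String[] args) {
            SpringApplication.run(GdeltSourceApplication.class, args);
        }

        // Spring Cloud Stream binds a Supplier bean as the source of a stream.
        // Every 15 minutes a placeholder record is emitted; a real implementation
        // would download and parse the latest GDELT export here.
        @Bean
        public Supplier<Flux<String>> gdeltSource() {
            return () -> Flux.interval(Duration.ofMinutes(15))
                             .map(tick -> "gdelt-update-" + tick);
        }
    }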

GDELT on SCDF: Bootstrapping Spring Cloud Data Flow on Kubernetes

In the first part of our planned blog post series “Processing GDELT data with SCDF on Kubernetes”, we go through the steps to deploy the latest Spring Cloud Data Flow (SCDF) release 1.7.0 on Kubernetes, including the latest version of the starter apps that will be used in the examples.

Fixing Spark classpath issues on CDH5 when accessing Accumulo 1.7.2

We experienced a strange NoSuchMethodError while migrating an Accumulo-based application from 1.6.0 to 1.7.2 running on CDH5. A couple of code changes were necessary to move from 1.6.0 to 1.7.2, but these were pretty straightforward (member visibility changed, some getters were introduced). Everything compiled fine, but when we executed the Spark application on the cluster we got an exception pointing directly to a line we had changed during the migration.
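
A generic first step for narrowing down errors like this (not the specific fix described in the post) is to check which jar the conflicting class was actually loaded from on the cluster:

    import org.apache.accumulo.core.client.Connector;

    public class ClasspathCheck {
        public static void main(String[] args) {
            // Prints the location of the jar that provided the Accumulo client class.
            // On a CDH cluster this often reveals a distribution-supplied Accumulo
            // version shadowing the version the application was compiled against.
            System.out.println(Connector.class.getProtectionDomain()
                    .getCodeSource().getLocation());
        }
    }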

How to access a remote HA-enabled HDFS in an (Oozie) DistCp action

Accessing a remote HDFS with high availability enabled is not as straightforward as it used to be with non-HA HDFS setups. In a non-HA setup you simply use the namenode hostname and port, but with HA enabled an alias is used instead, and your local Hadoop configuration needs to contain a couple of properties so that the HDFS client can determine the active namenode and fail over if necessary.
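
The client-side configuration looks roughly like the following sketch; the nameservice alias remotecluster and the namenode hostnames are made up, and the same key/value pairs can equally be passed to an Oozie DistCp action through its configuration section:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RemoteHaHdfsAccess {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Logical alias of the remote HA nameservice and its namenodes.
            conf.set("dfs.nameservices", "remotecluster");
            conf.set("dfs.ha.namenodes.remotecluster", "nn1,nn2");

            // RPC addresses of both namenodes of the remote cluster.
            conf.set("dfs.namenode.rpc-address.remotecluster.nn1", "namenode1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.remotecluster.nn2", "namenode2.example.com:8020");

            // Proxy provider that lets the client find the active namenode
            // and fail over when necessary.
            conf.set("dfs.client.failover.proxy.provider.remotecluster",
                    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            // The alias can now be used like a hostname in HDFS URIs.
            FileSystem remoteFs = FileSystem.get(URI.create("hdfs://remotecluster/"), conf);
            System.out.println(remoteFs.exists(new Path("/")));
        }
    }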

Using Accumulo's RangePartitioner in an m/r job (and Oozie workflow)

In Hadoop, HashPartitioner is used by default to partition the datasets among the reducers. It hashes the keys to distribute the data evenly across all reducers, and the data is then sorted within each reducer. While this works just fine in general, in a MapReduce job that writes into an existing Accumulo table HashPartitioner leads to each reducer writing to every tablet (sequentially). In this case, you might want to configure your MapReduce job to use RangePartitioner instead, to improve the job's write performance and thereby its speed. In this blog post, we discuss how to configure your job to use RangePartitioner as its partitioner and, assuming that you are running the job as part of a workflow, how to incorporate it in Oozie.
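
The job setup itself then boils down to a few lines, sketched below; the split file path, reducer count, and job name are placeholders, and generating the splits file from the existing table is covered in the post:

    import org.apache.accumulo.core.client.mapreduce.lib.partition.RangePartitioner;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class RangePartitionedJobSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "write-to-accumulo");

            // Replace the default HashPartitioner so that each reducer only
            // writes to a contiguous range of tablets of the target table.
            job.setPartitionerClass(RangePartitioner.class);

            // HDFS file with the split points of the existing Accumulo table,
            // one split point per line.
            RangePartitioner.setSplitFile(job, "/tmp/accumulo-splits.txt");

            // Roughly one reducer per range, i.e. number of splits + 1.
            job.setNumReduceTasks(10);
        }
    }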

Many thanks to the organizers of Java Forum Nord 2018 #jfn18. We are looking forward to being there again as a gold sponsor next year.