Who we are

Founded in Hannover in 2009, we have been drawn to the big data ecosystem ever since, successfully making sure that our international customers understand these new technologies and get them tailored to their individual needs. Read more…

What we do

We design architectures for large volumes of unstructured data. We build robust connections to your existing systems. We implement big data technologies and provide training in them. We help you ask new kinds of questions. Read more…

Technologies

We scale your data processing with Hadoop, Spark, or streaming technologies such as Apache Flink or Apache Storm. We build analysis platforms on the basis of Hue, Apache Pig, Presto, Hive, or HBase. We implement robust, interlocking processes with Oozie, Airflow, or Schedoscope. Read more…

From our developer blog

On our English-language site, we blog about technologies and the day-to-day work with Hadoop, Spark & co.:

Fixing Spark classpath issues on CDH5 accessing Accumulo 1.7.2

We experienced a strange NoSuchMethodError while migrating an Accumulo-based application from 1.6.0 to 1.7.2 running on CDH5. A couple of code changes were necessary when moving from 1.6.0 to 1.7.2, but these were pretty straightforward (member visibility changed, some getters were introduced). Everything compiled fine, but when we executed the Spark application on the cluster, we got an exception pointing directly to a line we had changed during the migration.
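
The excerpt stops short of the fix, but classpath conflicts like this on CDH5 typically come down to the cluster's bundled Accumulo jars shadowing the ones the application was compiled against. A minimal sketch of that remedy, assuming Spark's userClassPathFirst switches resolve the conflict (our assumption, not a quote from the post):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ClasspathFixSketch {
        public static void main(String[] args) {
            // Prefer the Accumulo 1.7.2 jars bundled with the application
            // over the older ones shipped with the CDH5 cluster
            // (hypothetical fix, not taken verbatim from the post).
            SparkConf conf = new SparkConf()
                    .setAppName("accumulo-1.7.2-migration")
                    .set("spark.executor.userClassPathFirst", "true");
            // The driver-side equivalent has to be set at submit time:
            //   spark-submit --conf spark.driver.userClassPathFirst=true ...
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... job code accessing Accumulo 1.7.2 goes here ...
            sc.stop();
        }
    }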

How to access a remote HA-enabled HDFS in an (Oozie) distcp action

Accessing a remote HDFS that has high availability enabled is not as straightforward as it used to be with non-HA HDFS setups. In a non-HA setup you simply use the namenode hostname and port, but with HA enabled, an alias is used instead, and your local Hadoop configuration needs to contain a couple of properties so that the HDFS client can determine the active namenode and fail over if necessary.
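
To illustrate the kind of properties involved, here is a minimal client-side sketch in Java; the nameservice alias "remotecluster" and the hostnames are placeholders. In an actual Oozie distcp action, the same properties would typically go into the action's configuration block or a job-xml file.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class RemoteHaHdfsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Logical alias for the remote HA cluster (placeholder name).
            conf.set("dfs.nameservices", "remotecluster");
            conf.set("dfs.ha.namenodes.remotecluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.remotecluster.nn1", "namenode1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.remotecluster.nn2", "namenode2.example.com:8020");
            // Lets the HDFS client find the active namenode and fail over.
            conf.set("dfs.client.failover.proxy.provider.remotecluster",
                    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            // Note: the URI uses the alias, not a namenode hostname and port.
            FileSystem fs = FileSystem.get(URI.create("hdfs://remotecluster/"), conf);
            System.out.println(fs.getUri());
        }
    }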

Best Practices using PigServer (embedded Pig)

Using PigServer in your own Java application is a great way to leverage the simplicity of Pig scripts, especially if you generate your Pig scripts dynamically and then execute them on demand or via a scheduler.

But if you are using multiple PigServer instances in a multi-threaded, long-running application, there are quite a few pitfalls you need to avoid.
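
For readers new to the API, this is roughly what embedded use looks like; the script and paths are made up for the example. PigServer itself is not thread-safe, so a multi-threaded application needs one instance per thread, presumably among the pitfalls the post covers.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class EmbeddedPigSketch {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.LOCAL); // or ExecType.MAPREDUCE
            try {
                // A dynamically generated script would be registered the same way.
                pig.registerQuery("data = LOAD 'input.txt' AS (line:chararray);");
                pig.store("data", "output"); // triggers execution
            } finally {
                pig.shutdown(); // release resources in long-running applications
            }
        }
    }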

Passing many parameters from a Java action to an Oozie workflow

When executing Java actions in an Oozie workflow, there are going to be cases where you want certain data to be passed from your action to the workflow itself, making it available for use within the workflow. This can easily be achieved by adding a "capture-output" element to your workflow action, which tells Oozie to capture the output of your action.
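
Concretely, the Java action writes a properties file to the path Oozie hands it via a system property; a minimal sketch (the property name and value are made up):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.util.Properties;

    public class CaptureOutputAction {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.setProperty("recordCount", "42"); // hypothetical value to pass back
            // Oozie tells the action where to write its output via this property.
            File out = new File(System.getProperty("oozie.action.output.properties"));
            try (OutputStream os = new FileOutputStream(out)) {
                props.store(os, null);
            }
        }
    }

Downstream workflow nodes can then read the value with an EL expression such as ${wf:actionData('java-node')['recordCount']}, where 'java-node' stands for the name of the action.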

Using Accumulo's RangePartitioner in an M/R job (and Oozie workflow)

In Hadoop, the HashPartitioner is used by default to partition a dataset among the reducers. It hashes the keys to distribute the data evenly across all reducers; the data is then sorted within each reducer. While this works just fine in general, in a MapReduce job that writes into an existing Accumulo table, the HashPartitioner leads to each reducer writing to every tablet (sequentially). In this case, you might want to configure your MapReduce job to use the RangePartitioner instead, improving the job's write performance and therefore its speed. In this blog post, we discuss how to configure your job to use the RangePartitioner and, assuming you run the job as part of a workflow, how to incorporate it in Oozie.
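
In code, the switch is small; the sketch below shows the job setup only, with the splits path and reducer count as placeholders. Accumulo's RangePartitioner reads a file containing one Base64-encoded split point per line, usually dumped from the target table's current tablet splits.

    import org.apache.accumulo.core.client.mapreduce.lib.partition.RangePartitioner;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class RangePartitionerSketch {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "accumulo-bulk-write");
            // ... mapper, reducer, input and output setup omitted ...

            job.setPartitionerClass(RangePartitioner.class);
            RangePartitioner.setSplitFile(job, "hdfs:///tmp/splits.txt"); // placeholder path
            // One reducer per tablet: number of split points plus one.
            job.setNumReduceTasks(100 + 1);
        }
    }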

Upgrading an existing Accumulo 1.6 CDH5 cluster to HDFS HA

Upgrading an existing CDH5 cluster to HDFS HA is well documented, and Cloudera Manager guides you through the process with a wizard. But if you are running Accumulo 1.6 on that cluster, your instance will still try to access old HDFS paths like hdfs://youroldnamenodehost:8020/accumulo/…, as these fully qualified paths are not migrated by the HA upgrade wizard.
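
The excerpt does not spell out the fix, but Accumulo 1.6 ships a volume-replacement mechanism for exactly this kind of migration; assuming that is the route taken, accumulo-site.xml would map the old namenode URI to the new nameservice ("nameservice1" below is a placeholder for your cluster's logical name):

    <!-- accumulo-site.xml -->
    <property>
      <name>instance.volumes</name>
      <value>hdfs://nameservice1/accumulo</value>
    </property>
    <property>
      <name>instance.volumes.replacements</name>
      <!-- pairs of "old-uri new-uri", comma-separated if there are several -->
      <value>hdfs://youroldnamenodehost:8020/accumulo hdfs://nameservice1/accumulo</value>
    </property>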