Talend Basic Install on Linux

Categories: BigData

Article Overview

Talend sells a suite of tools for processing enterprise data. Unfortunately, their installation documentation is rubbish. Truly rubbish, some of the worst I have ever seen. This article provides a walk-through of installing the core Talend components on a Linux server (optionally with a basic cluster of execution-servers for running jobs).

The focus here is on installation of a licensed Talend suite, with the core proprietary components such as the Talend Administration Console and the logging components. This article is not likely to be helpful if you want to use only the (few) open-source Talend components available for free.

The instructions here are also not intended to be a complete guide to setting up a production environment. The process for installing a robust and secure Talend environment is so extremely complex that I strongly recommend any company which is considering purchasing a Talend license also pay Talend to install the product.

What this article does provide is instructions for creating a basic test/development environment. It may also be useful for understanding and validating what Talend will (or should) do when setting up a production environment in an on-premise datacenter.

The version of Talend used to write this guide was 7.0.1 (released mid 2018).

To be precise, these instructions:

  • are only relevant to companies with at least a license to the TAC component
  • do try to follow good security practices with regard to separation of user accounts
  • partially follow good administration practices with regard to file locations (and note where good practice is not followed)

and do not cover:

  • creating automated install scripts for provisioning systems such as Puppet, Chef, Ansible
  • SSL encryption of connections
  • single-signon integration
  • enabling high-availability

These instructions are also my own, ie they do not reflect the opinion of Talend themselves.

Recently (since mid 2018), Talend has released a “cloud-based” Talend environment, and is trying to push new customers to using that approach rather than an on-premise deployment. I am not sure how successful they will be with that. These instructions are for the on-premise approach, not the cloud - though this article might possibly be helpful by shedding some light on what is happening in the cloud VMs “behind the scenes”.

Relevant Components

The following servers will be created and configured with the following software:

  • master
    • TAC, AMC, commandline
    • ELK-based logserver
    • nexus
    • git server (your choice of gitblit or gitlab)
    • mysql
  • workers x 2
    • jobserver or runtime (your choice)

Note that these instructions include setting up a mysql database rather than using the “toy” H2 database that is included with Talend’s “easy install”. However in general, this is still a “play” environment rather than a production-ready setup.

Not covered (ie no install-instructions here):

  • ssl encryption of connections
  • high-availability configuration
  • proper configuration automation (puppet/chef/ansible/etc)
  • proper systemd-init service scripts (basic ones are provided, but they could be improved)
  • single signon (SSO)
  • dqportal, data-steward, MDMServer, dictionary-service, dataprep
  • ESB Container
  • zookeeper, kafka, activemq, etc

Downloadable Files

When you purchase (or otherwise obtain) a Talend license, you receive an email from Talend like the following, with a “license” file attached:

Dear Customer,

You will find in this email all the information you need to install your copy of 
Talend Data Fabric 7.0.1 with the following runtime (s):
- 1 ESB non-production core runtime(s)
- 1 ESB non-production core runtime(s)

Your license key for YOURCOMPANY - License Talend Data Fabric ( / 5 MDM Admin User(s) / 99 MDM Interactive User(s) ) 
is available in the file attached to this email. The expiry date is SOMEDATE. Once you have downloaded the
application and related documentation, save the attached License key file onto your hard drive then place it 
at the root of the application extraction folder.

Use the following login details to connect to the download website:
User: xxxxxxxxx
Password: xxxxxxxxx

Caution : Please keep this email in a safe place, as the user and password information will be required 
for the Software Update configuration. For more information, see the Installation Guide.

Download the application from:

Windows Platform Installer (via download manager):

Platform Installer (without download manager):

Studio-only installers (via download manager):

Studio-only installers (without download manager):

Manual Installation:
... and urls for various other components ...

Local update sites for installation without internet:

Get the related documentation, including the installation guide, from:
Online: https://help.talend.com/search/books?filters=EnrichProdName~%2522Talend+Data+Fabric%2522*EnrichVersion~%25227.0%2522

PDF English:
.. and other languages ..
Talend's consulting programs offer a comprehensive set of services, from Quickstart implementations and Accelerators to more
adoption and strategically-oriented programs.
Talend consultants have years of experience, and proven success with strategic and technical planning in a multitude of
implementation circumstances. You will receive one-on-one attention and support to see your projects through to a successful

Note that at the bottom of the email are links to “DocumentationSet” downloads. Within the zip are installation manuals for different operating systems.

The installation manuals are large (240 pages for linux) but rather poorly presented, and in my opinion do not make good choices with respect to security. See file “Talend_DataFabric_IG_Linux_7.0.1_EN.pdf” for Linux.

The install-instructions below are hopefully clearer than the official guide - or at least get straight to the point for the specific use-case of setting up a test cluster on Linux.

Create the VMs

(Note: it is assumed below that each node on which Talend software is being installed is a virtual machine (aka VM). Of course physical servers could also be used - but that is less usual these days, particularly for a test/dev environment.)

In your server environment, allocate three VMs:

  • name=talend-tac (or similar)
    • Ubuntu 16.04 LTS
    • 2 CPU cores (recommended)
    • 16GB ram (ie 16384 MB)
    • 256 GB disk (core system takes around 20GB, leaving rest for user-data)
    • 32GB swap (ie 2x ram)
  • name=talend-work1 (or similar)
    • Ubuntu 16.04 LTS
    • 2 CPU cores (recommended)
    • 8GB ram (ie 8192 MB)
    • 256 GB disk
    • 8GB swap (ie same size as ram)
  • name=talend-work2 # config options are same as talend-work1

Update the VMs.

If you do not have DNS set up to resolve the above names, then edit your local /etc/hosts file to add the names for each server; various webservers on these machines will use hostnames which must be resolvable on client machines on which the browser is running.
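As a minimal sketch of the /etc/hosts approach, the following prints one entry per VM. The IP addresses are placeholders taken from the 192.0.2.0/24 documentation range, and the function name talend_hosts_entries is invented for this example; substitute the real addresses of your VMs.

```shell
# Print /etc/hosts entries for the three VMs.
# ASSUMPTION: the 192.0.2.x addresses are placeholders, not real values.
talend_hosts_entries() {
    cat <<'EOF'
192.0.2.10  talend-tac
192.0.2.11  talend-work1
192.0.2.12  talend-work2
EOF
}

# Review the output first, then append it as root with:
#   talend_hosts_entries >> /etc/hosts
talend_hosts_entries
```

The same entries are needed on every machine that runs a browser against these servers, not just on the VMs themselves.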

Install your SSH key into each VM to make later logins easier:

ssh-copy-id root@talend-tac
ssh-copy-id root@talend-work1
ssh-copy-id root@talend-work2

On each VM, apply all OS upgrades:

ssh root@talend-tac
apt update
apt dist-upgrade
reboot now
# repeat for workers
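Rather than repeating the commands by hand, the upgrade can be looped over all three hosts from your local machine. This is only a sketch: the function name upgrade_all_vms and the SSH override variable are my own inventions, and it assumes the ssh keys installed above allow password-less root login.

```shell
# Upgrade every VM in turn instead of repeating the commands by hand.
# SSH can be overridden (eg SSH=echo) to preview the commands without running them.
SSH=${SSH:-ssh}

upgrade_all_vms() {
    for host in talend-tac talend-work1 talend-work2; do
        # -y answers apt's confirmation prompt. The reboot drops the ssh
        # session, so a non-zero exit status from ssh is expected here.
        $SSH "root@$host" 'apt update && apt -y dist-upgrade && reboot' || true
    done
}

# Usage:
#   upgrade_all_vms
```

Running it once with SSH=echo is a cheap way to check the host list before anything is actually rebooted.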

Configure Shared Filesystem

It is recommended by Talend that all components be configured to write their logfiles to a shared filesystem. Presumably this is so that in the case of failure of a node, logs that were previously on that system are still accessible. If you are setting up a highly-available system, then you will need to allocate such a shared filesystem, and mount it in each server.

Install Version Control System

Talend needs a git or svn repository for developers to store data in, and for the Talend build-tools to compile code from.

The install guide p52 describes how to set up a hosted SVN instance. The install guide also includes instructions related to Git - but the instructions are just nonsense as far as I can see; having read them multiple times I still have no idea what they are trying to say. It is best to just ignore them…

Assuming you wish to use git, a server is necessary and there are several options:

  • use an existing Atlassian Bitbucket or Github repo (but creating new users is obviously somewhat tricky)
  • install Gitlab
  • install gitblit

Gitlab can be run in a docker container on the Talend master VM - or on a dedicated VM if you wish (installing Gitlab without a container is rather complicated). The necessary instructions are here - and fortunately no Talend service runs on port 80, so the Gitlab container can be published on that port. Note that Gitlab requires a reasonable amount of resources - it is a full suite, not just a simple Git wrapper (among other things, it runs a full Postgres database). The talend-tac server resources shown above are, however, sufficient to handle a Gitlab instance.
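For orientation, starting the Gitlab container looks roughly like the sketch below. The options are those of the standard gitlab/gitlab-ce image; the host directory /srv/gitlab, the ssh port mapping 2222, the function name start_gitlab and the DOCKER override variable are assumptions made for this example, not something Talend prescribes.

```shell
# Start GitLab CE in a container, publishing its web UI on port 80.
# DOCKER can be overridden (eg DOCKER=echo) to print the command instead of running it.
DOCKER=${DOCKER:-docker}
# ASSUMPTION: /srv/gitlab is an arbitrary host directory for persistent data.
GITLAB_HOME=${GITLAB_HOME:-/srv/gitlab}

start_gitlab() {
    # Host port 2222 is mapped to the container's sshd because the VM's own
    # sshd already occupies port 22.
    $DOCKER run --detach \
        --hostname talend-tac \
        --name gitlab \
        --restart always \
        --publish 80:80 --publish 2222:22 \
        --volume "$GITLAB_HOME/config:/etc/gitlab" \
        --volume "$GITLAB_HOME/logs:/var/log/gitlab" \
        --volume "$GITLAB_HOME/data:/var/opt/gitlab" \
        gitlab/gitlab-ce:latest
}

# Usage (as root on talend-tac, with docker installed):
#   start_gitlab
```

With this mapping, git-over-ssh urls need the non-default port 2222; git-over-http works on port 80 without further changes.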

Gitblit is much simpler than Gitlab, but still provides all the necessary functionality. It is a simple war-file that is deployed within a jetty or tomcat instance. It then provides a web-ui through which git-repos can be created, users can be defined, user-access-rights assigned, and then it handles git-clone/git-push/etc. The project is very dormant, but it still appears to work fine (tested under tomcat8), and I would recommend it as the simplest solution. The necessary instructions are here.

Configure master node “talend-tac”

Install mysql

You should of course choose better passwords than the ones used here!

ssh root@talend-tac
apt install mysql-server mysql-client

# reconfigure mysql
vi /etc/mysql/mysql.conf.d/mysqld.cnf

# replace "max_allowed_packet = 16M" with "max_allowed_packet = 64M"
# replace "bind-address = 127.0.0.1" (the Ubuntu default) with "bind-address = 0.0.0.0", so that mysql accepts connections from other hosts

service mysql restart
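The two config edits can also be scripted rather than made in vi. The sketch below applies them with sed, and demonstrates the result on a scratch copy so that it is safe to try out. The bind address is an assumption: 0.0.0.0 (listen on all interfaces) is used here because the database users in this guide are allowed to connect from any host. The function name patch_mysqld_cnf and the BIND_ADDR variable are invented for this example.

```shell
# Sketch: apply the two mysqld.cnf edits with sed.
# ASSUMPTION: BIND_ADDR defaults to 0.0.0.0 (all interfaces); adjust as needed.
BIND_ADDR=${BIND_ADDR:-0.0.0.0}

patch_mysqld_cnf() {
    # $1 = path of the config file to rewrite
    sed -E \
        -e 's/^(max_allowed_packet[[:space:]]*=).*/\1 64M/' \
        -e "s/^(bind-address[[:space:]]*=).*/\1 $BIND_ADDR/" \
        "$1" > "$1.new" && mv "$1.new" "$1"
}

# Demonstration on a scratch file holding the stock Ubuntu values.
cnf=$(mktemp)
printf 'max_allowed_packet\t= 16M\nbind-address\t\t= 127.0.0.1\n' > "$cnf"
patch_mysqld_cnf "$cnf"
cat "$cnf"

# On the real server (as root):
#   patch_mysqld_cnf /etc/mysql/mysql.conf.d/mysqld.cnf && service mysql restart
```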

Create databases and users

# create dbs and users
mysql # or mysql -p if you set a password during mysql install

## for TAC
create database talend_admin;
GRANT ALL ON talend_admin.* TO talend_admin@'%' identified by 'pwd1';

## for MDM
create database talend_mdm;
GRANT ALL on talend_mdm.* to talend_mdm@'%' identified by 'pwd1';

## for DQ
create database talend_dq;
GRANT ALL on talend_dq.* to talend_dq@'%' identified by 'pwd1';

## done
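The "GRANT ... identified by" form above works on the MySQL 5.7 shipped with Ubuntu 16.04, but it was removed in MySQL 8.0, where the user must be created first. A sketch that generates statements valid on both versions follows; the function name talend_db_sql is invented for this example, and the password is passed in as an argument rather than hard-coded.

```shell
# Emit CREATE DATABASE / CREATE USER / GRANT statements for each Talend schema.
# $1 = password to assign ('pwd1' is only the placeholder used in this guide).
# NOTE: the password is inserted verbatim, so it must not contain a single quote.
talend_db_sql() {
    for db in talend_admin talend_mdm talend_dq; do
        printf "CREATE DATABASE IF NOT EXISTS %s;\n" "$db"
        printf "CREATE USER IF NOT EXISTS '%s'@'%%' IDENTIFIED BY '%s';\n" "$db" "$1"
        printf "GRANT ALL ON %s.* TO '%s'@'%%';\n" "$db" "$db"
    done
}

# Usage (as root on talend-tac):
#   talend_db_sql 'a-better-password' | mysql
talend_db_sql 'pwd1'
```

As in the manual steps, each database gets a user of the same name; using one shared password for all three is a convenience of this sketch, not a recommendation.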

Note that the mysql users created here are “any network” users (domain="@%"). We could use “@localhost” instead, as we are installing all apps which need mysql access on the same host. However I want to keep these instructions applicable also to cases where the DB is on another host, or where some services are moved to other VMs.

For production, the mysql rights granted can be better limited - this guide does not attempt to do so.

As noted earlier, the “DQ” (Data Quality) component requires an external db (embedded H2 not supported) - and the installer verifies the connection during install configuration.

Install Software

See page Install Manually for details on how to get the basic software components installed.

Or see page Install with Wizard if you want to install faster, but end up with a completely broken/useless system.

RPMs are also available for some licenses. For example, the install-documentation for “Talend Data Integration” describes RPM-based installation, while the install-documentation for “Talend Data Fabric” does not. Given Talend’s generally poor approach to system security, I would be reluctant to run Talend-provided RPMs on a system - manual install seems safer.

Configure TAC

Visit the TAC website at http://talend-tac:8080/org.talend.administrator (password = “password” or maybe “admin”).

Initial Config Page

This first page is an “admin page” where the following can be set up:

  • database connectivity (already defined if the “install wizard” has been used)
  • license (must be uploaded from PC running browser)

Database connect params should be:

database type: mysql
driver: org.gjt.mm.mysql.Driver
url: jdbc:mysql://{yourhostname}:3306/talend_admin
username: talend_admin # as specified when creating database earlier
password: pwd1 # as specified when creating database earlier

Then click on “check” at bottom, before clicking on “save” in the middle.

The “finalize” button writes to file tomcat/webapps/org.talend.administrator/WEB-INF/classes/configuration.properties, permanently storing the config params.

The “go to logon” link then takes you to the real page.

Note that the buttons “Project Check” and “Transfer Libraries” are used when migrating from one major Talend version to another (migrating code in version-control, and libraries in Nexus respectively). For a new Talend install, these features are not needed.

Create Admin User

TAC comes with a default user: “security@company.com”, password: “admin” - or whatever was entered in the install wizard. This user has the single role “Security Administrator”, which allows them to create users, and to perform some system-config tasks but not all. Therefore the first thing you need to do is create an “application administrator” user, then log in as that user. For maximum access, the new user should have roles:

  • Security Administrator
  • Administrator
  • Operation Manager
  • Viewer

This new user now has a full set of options in their “Settings/Configuration” menu. Note that in the page associated with this menu option:

  • there is a refresh button in the top left corner; after changing a setting this can be clicked to verify the setting is correct
  • the page (somewhat annoyingly) automatically invokes refresh every 30 seconds or so
  • a “lightning bolt” symbol means “verifying…” (working on it, or blocked waiting for some other config)
  • the “finalize” button should not be used.

Note also that in the Configuration page, it is apparently normal for features that are not used (eg DataPrep when you have no dataprep module installed) to show as “errors”.

Set Up Version Control

From the main page for an administrator user, version control can be set up via menu option “Settings/Configuration”.

For Git access, you first need to create (eg via gitblit or gitlab UI) a repository in the target git server, and a user for the TAC. Then in TAC web ui, under settings/configuration/git, define:

  • Git server url: ssh://tac@{githostname}:29418/talendrepo.git (assuming user=tac, and a repo name of talendrepo)
  • username: whatever user you created
  • password: whatever password you defined

AFAICT, the TAC does not support authenticating against a git server using an ssh key - only username/password are possible.

As noted in the manual install page, using git ssh protocol requires the TAC unix user have appropriate entries in its ~/.ssh/known_hosts file.
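One way to create that known_hosts entry is with OpenSSH's ssh-keyscan, run as the unix user that the TAC runs under. This is a sketch only: the function name add_git_host_key, the KEYSCAN override variable and the hostname githost.example are made up for illustration, and port 29418 is simply the one used in the example url above.

```shell
# Record the git server's ssh host key in a known_hosts file.
# KEYSCAN can be overridden for a dry run; the default is OpenSSH's ssh-keyscan.
KEYSCAN=${KEYSCAN:-ssh-keyscan}

add_git_host_key() {
    # $1 = git server hostname, $2 = ssh port, $3 = known_hosts file to append to
    mkdir -p "$(dirname "$3")"
    $KEYSCAN -p "$2" "$1" >> "$3"
}

# Usage, run as the TAC unix user (githost.example is a placeholder):
#   add_git_host_key githost.example 29418 ~/.ssh/known_hosts
```

Note that ssh-keyscan simply trusts whatever key the server presents; compare the fingerprint (ssh-keygen -lf on the known_hosts file) with the one shown by the git server before relying on it.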

See the manual install page for more details.

Set up Other Config

The following options are all under menu “Settings/Configuration” (available only for a Talend application admin user).

  • Commandline/primary
  • Monitoring
    • AMC url: {yourhostname}:8080/amc => do NOT use “localhost” in this URL, as this is output as a “src” attribute on an html iframe, ie is interpreted by the user’s browser.
    • Kibana: {yourhostname}:5601/... => do not use “localhost” here either.
  • Artifact Repository
    • Nexus url: {yourhostname}:8081 => default usually ok, unless you are using an external nexus or have set up HA
  • Job conductor:
    • Generated jobs folder => /var/talend/admin/generated-jobs (or whatever you created; see manual install instructions)
    • Tasks logs folder => /var/talend/admin/execution-logs (or whatever you created)
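If the two job-conductor folders do not exist yet, they can be created as sketched below. TALEND_ROOT is a made-up variable that lets the function be tried out under a scratch directory; leave it unset on the real server. The function name make_conductor_dirs is likewise invented, and ownership is deliberately not set because the name of the TAC unix user depends on your install.

```shell
# Create the two job-conductor folders, optionally under a scratch prefix.
# ASSUMPTION: TALEND_ROOT is a hypothetical variable for trial runs only.
make_conductor_dirs() {
    for d in generated-jobs execution-logs; do
        mkdir -p "${TALEND_ROOT:-}/var/talend/admin/$d"
    done
}

# Usage (as root on talend-tac), followed by a chown to the TAC unix user:
#   make_conductor_dirs
```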

Other setup

For more details, see the official Talend Administration Center user manual.

Check Other Components

If you have set up additional components, then visit those apps:

  • MDM: http://{yourhostname}:8180/talendmdm
  • Data Stewardship: http://{yourhostname}:19999 (user="tds-user", password="duser")
  • DQ Portal: http://{yourhostname}:8580/tdqportal (user="tdq_admin", password="tdq")
  • DataPrep: http://{yourhostname}:9999 (user="dataprep-user" password="duser")
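A quick way to see which of these applications are actually answering is to request each url with curl and print the HTTP status code. This is a rough sketch: the function names are invented for this example, and any response code other than 000 merely shows that something is listening on the port, not that the application is healthy.

```shell
# Print the urls of the optional web applications for a given host.
talend_component_urls() {
    # $1 = hostname of the server running the components
    printf 'http://%s:8180/talendmdm\n' "$1"
    printf 'http://%s:19999\n' "$1"
    printf 'http://%s:8580/tdqportal\n' "$1"
    printf 'http://%s:9999\n' "$1"
}

# Report the HTTP status code for each url (000 means no connection was made).
check_talend_components() {
    talend_component_urls "$1" | while read -r url; do
        code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url")
        printf '%s %s\n' "$code" "$url"
    done
}

# Usage:
#   check_talend_components talend-tac
```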

Random Notes

Component “CI-Builder” is documented in Talend’s “Software Development Lifecycle Guide”. It is an advanced tool for sites that wish to drive their deployment processes using an external tool (eg Jenkins) rather than the standard Talend tools. It can be useful when a system has hundreds of jobs.

References and Further Reading