Talend Basic Install on Linux - Manual

Categories: BigData

Overview

This page describes how to install the core components of the Talend software suite on Linux using “manual” setup (rather than the official installer, which produces a poor result). The overall install process description should be read first; it covers setting up the VMs on which these instructions are carried out.

The instructions here do not replace the Talend installation guide; the guide has far more detail than included here. However the guide is also incomplete, and poorly structured - hopefully this page provides a better “flow” through the necessary steps.

The following components are addressed in this page:

  • Nexus
  • Admin Center (aka TAC)
  • CommandLine
  • LogServer
  • AMC (activity monitoring console)
  • Jobserver
  • Runtime (aka Talend Container)

together with all of their dependencies (eg Java, Tomcat, mysql, systemd-init).

The following issues and components are NOT addressed in this page:

  • Setting up ssl
  • Setting up high availability
  • Setting up single sign-on (SSO)
  • Setting up more complex Talend components including:
    • IAM (optionally used by DataStewardship, DataPrep, DictionaryService)
    • BRMS (business rules server)
    • SAP RFC Server (SAP integration component)
    • MDM (Master Data Management)
    • ESB Container and subcomponents
      • service-locator
      • service-activity-monitor (SAM)
      • security-token-services (STS)
    • DQPortal (data quality component)
    • DataStewardship
    • DataPrep and subcomponents
      • streams-runner
      • spark-job-server
      • components-catalog
    • DictionaryService
    • Repository manager (an obsolete component)
    • CI Builder (a maven plugin for integrating external CI tools such as Jenkins into the Talend dev/rollout cycle)
  • Setting up core dependencies:
    • zookeeper (required by ESB-locator-service and Kafka)
    • kafka (optionally used by DataSteward)
    • activemq (optionally used by Talend ESB)
    • MongoDB (used by DataStewardship, DataPrep, DictionaryService)

Unless mentioned otherwise, all commands below are expected to be run as the root user. The result is:

  • a tree of files at /opt/talend
  • a set of system user accounts with home-dirs under /opt/talend and
  • a set of systemd configuration files in /etc/systemd

Install Necessary OS Packages

Some basic software is needed:

apt install openjdk-8-jre-headless
apt install zip unzip
apt install libmysql-java  # assuming you are using mysql of course..

Download Talend Files

The following will download all necessary files from the Talend download site to a directory on the local machine. This list of links is usually provided by Talend in an email, together with the licence file.

mkdir /usr/share/talend
cd /usr/share/talend

cat >urls.txt <<EOF
http://www.opensourceetl.net/tis/tdf_701/Talend-Studio-20180411_1414-V7.0.1.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-Studio-20180411_1414-V7.0.1.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-AdministrationCenter-20180411_1414-V7.0.1.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-AdministrationCenter-20180411_1414-V7.0.1.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-JobServer-20180411_1414-V7.0.1.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-JobServer-20180411_1414-V7.0.1.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-AMC_Web-20180411_1414-V7.0.1.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-AMC_Web-20180411_1414-V7.0.1.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-BRMS-20180411_1414-V7.0.1.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-BRMS-20180411_1414-V7.0.1.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-SAP-RFC-Server-20180411_1414-V7.0.1.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-SAP-RFC-Server-20180411_1414-V7.0.1.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-IAM-V7.0.1.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-IAM-V7.0.1.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-DataPreparation-Server-full-V2.5.0.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-DataPreparation-Server-full-V2.5.0.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-DataStewardship-V7.0.1.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-DataStewardship-V7.0.1.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-DictionaryService-2.1.2.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-DictionaryService-2.1.2.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-DQPortal-20180411_1414-V7.0.1.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-DQPortal-20180411_1414-V7.0.1.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-MDMServer-20180411_1414-V7.0.1.jar
http://www.opensourceetl.net/tis/tdf_701/Talend-MDMServer-20180411_1414-V7.0.1.jar.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-MDMServer-20180411_1414-V7.0.1.war
http://www.opensourceetl.net/tis/tdf_701/Talend-MDMServer-20180411_1414-V7.0.1.war.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-MDMServer-20180411_1414-V7.0.1-HOME.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-MDMServer-20180411_1414-V7.0.1-HOME.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-ESB-V7.0.1-20180411143409.tar.gz
http://www.opensourceetl.net/tis/tdf_701/Talend-ESB-V7.0.1-20180411143409.tar.gz.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-ESB-V7.0.1-20180411143409.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-ESB-V7.0.1-20180411143409.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-Runtime-V7.0.1-20180411143409.zip
http://www.opensourceetl.net/tis/tdf_701/Talend-Runtime-V7.0.1-20180411143409.zip.MD5
http://www.opensourceetl.net/tis/tdf_701/Talend-LogServer-V7.0.1-linux-x86_64.tar.gz
http://www.opensourceetl.net/tis/tdf_701/Talend-LogServer-V7.0.1-linux-x86_64.tar.gz.MD5
EOF

wget --input-file=urls.txt --continue --show-progress --user=.... --password=....
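The .MD5 files fetched above are not otherwise used on this page. The following is a minimal sketch for checking them, assuming (this has not been confirmed against the real files) that each .MD5 file holds the hex digest as its first whitespace-separated field. The helper name verify_md5 is made up for this example.

```shell
# Hypothetical helper: compare every FILE.MD5 in a directory against "md5sum FILE".
# Assumption: the digest is the first whitespace-separated field of the .MD5 file.
verify_md5() {
    dir=$1
    status=0
    for sumfile in "$dir"/*.MD5; do
        [ -e "$sumfile" ] || continue          # directory holds no .MD5 files
        target=${sumfile%.MD5}
        expected=$(awk 'NR==1 {print tolower($1)}' "$sumfile")
        actual=$(md5sum "$target" 2>/dev/null | awk '{print $1}')
        if [ -n "$actual" ] && [ "$expected" = "$actual" ]; then
            echo "OK       $target"
        else
            echo "MISMATCH $target"
            status=1
        fi
    done
    return $status
}
```

Running "verify_md5 /usr/share/talend" prints one line per checked file and returns non-zero if any download is missing or does not match.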

Yes, Talend Studio is on the list: the talend-commandline tool, which (usually) needs to be installed on the server, is actually implemented as an Eclipse plugin and distributed as part of the Studio. See the installation instructions for the commandline component for more details.

The above list does include some components whose installation is not covered in this page. You may of course remove from the above list any components that you do not intend to install.
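Trimming the list can be done by hand, or with a small filter run before the wget step. This is only a sketch: the helper name filter_urls, the output file urls-core.txt and the exclusion pattern are illustrative choices, not something Talend provides.

```shell
# Hypothetical helper: print only the URLs that do NOT match an extended regex.
filter_urls() {
    urlfile=$1      # file with one URL per line, eg urls.txt
    exclude=$2      # extended regex naming the components to skip
    grep -v -E "$exclude" "$urlfile"
}
```

For example, "filter_urls urls.txt 'BRMS|SAP-RFC|IAM|DataPrep|DataStewardship|DictionaryService|DQPortal|MDMServer|Talend-ESB' > urls-core.txt" keeps only the components covered on this page; wget can then be pointed at urls-core.txt instead of urls.txt.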

Also download Tomcat - the exact version of course depends on what has most recently been released.

  • wget https://www-eu.apache.org/dist/tomcat/tomcat-9/v9.0.14/bin/apache-tomcat-9.0.14.zip

For some components (eg commandline) you will also need the Talend licence file on the host on which the component is being installed.

If you see mention of component “RepositoryManager”, you can ignore it - it is deprecated. Note however that the Artifact Repository (aka Nexus) is something different, and very much needed.

Notes on Installing Tomcat

Many Talend components need to run within a Java Servlet Container; the one chosen for this installation walkthrough is Apache Tomcat.

The approach used below is to create different system-accounts for different components of the Talend suite, for security and cleanliness.

When a component (running as a system-account) needs a Tomcat instance to run within, a Tomcat “config dir” is created in the account but the Tomcat core files are taken from a central/shared location. In this approach, standard Tomcat environment variables CATALINA_HOME and CATALINA_BASE are different: CATALINA_HOME points to the shared base install, and CATALINA_BASE to the config-files and extra libs needed by a specific instance.
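As a small illustration of that split, the sketch below prints the two variables for one instance, using the /opt/talend layout that the rest of this page builds. The helper name tomcat_instance_env is invented for the example.

```shell
# Hypothetical helper: print the Tomcat environment for one component.
# CATALINA_HOME is the shared install; CATALINA_BASE is the per-component directory.
tomcat_instance_env() {
    component=$1                      # eg "admin"
    root=${2:-/opt/talend}            # install root used on this page
    echo "CATALINA_HOME=$root/tomcat"
    echo "CATALINA_BASE=$root/$component/tomcat"
}
```

Calling "tomcat_instance_env admin" gives the same two values that appear later in the talend-admin service file.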

An alternative would be to create a single user account for all Talend components, running a single Tomcat instance with various Talend web-app components running within that single Tomcat instance. However that has two significant disadvantages:

  • security - a problem with one app can lead to breach of that single user-account, and thus access to data for all Talend components
  • stability - an out-of-memory problem with one component would take all components down, high CPU in one component would interfere with others, etc.

The approach used below uses a Tomcat release downloaded directly from the Apache site and unpacked. Some Linux distributions include Tomcat as a standard package (eg Ubuntu 18.04 can “apt install tomcat8”); this approach does potentially have benefits for security support.

It might also be a good idea to configure all Tomcat instances to listen only on the localhost interface, and then to use a reverse proxy to forward requests for specific ports to the Tomcat instance (eg nginx over http; forwarding over AJP needs a proxy with AJP support, such as Apache httpd). SSL termination could then be done at the proxy rather than within Tomcat - easier to configure.

The Talend TAC can be configured in a “high availability mode” which is also described as “Tomcat in cluster mode”. However this is actually nothing to do with Tomcat itself - the high availability mode is implemented via code in each TAC instance writing “heartbeat” records into the shared database, and client tools (particularly Talend Studio instances) using data from the DB to choose the “active” server instance. No special Tomcat configuration is required.

Notes on systemd-init Service Files

All the main Linux distros (including RedHat, CentOS, and Ubuntu) use systemd-init to manage services.

When using the Talend wizard installer, directory /opt/Talend-*/utils is filled with sysv-init scripts and systemd-init service-unit files. As services are configured by the wizard the service-unit files are migrated to /etc/systemd/system (and renamed). Sadly, the service-unit files mostly just point to the sysv-init scripts, ie don’t take advantage of any of systemd-init’s features. The service-unit files are nevertheless useful as a starting-point for writing real service-unit files.

The Talend install manual for Linux does have a few systemd-init service-unit file examples (starting from page 216; search for “systemd”), but not all services are covered - and the examples are not particularly good.

This article provides a basic systemd-init service unit file for each component. However they could be significantly improved; there are many security features in systemd-init that I have not bothered to enable/configure.
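As a starting point for those security features, the sketch below writes a systemd “drop-in” file that switches on a few standard sandboxing options for one service. The helper name write_hardening_dropin and the file name hardening.conf are invented here, and the chosen options have not been tested against any Talend component - treat them as assumptions to verify rather than a recommendation.

```shell
# Hypothetical helper: write a systemd drop-in enabling a few sandboxing options.
# Assumption: the service needs no access to /home and never writes under /usr or /etc.
write_hardening_dropin() {
    service=$1                               # eg "talend-admin"
    unitdir=${2:-/etc/systemd/system}        # directory holding the unit files
    mkdir -p "$unitdir/$service.service.d" || return 1
    cat > "$unitdir/$service.service.d/hardening.conf" <<EOF
[Service]
NoNewPrivileges=true
PrivateTmp=true
ProtectHome=true
ProtectSystem=full
EOF
}
```

After "write_hardening_dropin talend-admin", run "systemctl daemon-reload" and restart the service; removing the drop-in file and reloading again undoes the change.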

Configure Java

Much of the Talend suite runs in a JVM, and ideally wants variable JAVA_HOME to be set. Rather than have variables everywhere, it seems easiest to create a single symbolic-link that all configuration can refer to:

mkdir -p /opt/talend
cd /opt/talend
ln -s /usr/lib/jvm/java-8-openjdk-amd64/jre jre-curr  # or wherever "apt install openjdk-8-jre-headless" put its files

This provides a link /opt/talend/jre-curr which points to a JRE installation and can be modified at any time to point to an updated JRE.
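Since every service file on this page relies on that link, it is worth checking that it really resolves to a Java runtime. A minimal sketch (the helper name check_jre_link is made up; "readlink -f" is the GNU form):

```shell
# Hypothetical helper: succeed only if the link leads to a directory with an executable bin/java.
check_jre_link() {
    link=${1:-/opt/talend/jre-curr}
    if [ -x "$link/bin/java" ]; then
        echo "ok: $link -> $(readlink -f "$link")"
    else
        echo "broken: $link has no executable bin/java" >&2
        return 1
    fi
}
```

Run "check_jre_link" after creating the link, and again whenever the Java package is upgraded or replaced.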

Install Tomcat Core

Various components need to be run as separate Tomcat instances. It is therefore useful to install Tomcat once, and reference it from the other instances.

mkdir -p /opt/talend
cd /opt/talend
unzip /usr/share/talend/apache-tomcat-*.zip
mv apache-tomcat-* tomcat

Install Nexus

Overview

A Maven artifact repository is needed by various Talend components; if you need to install the one provided by Talend (Nexus) then it makes sense to install it first.

If you already have a Nexus instance on some other machine that you wish to use for storing Talend-generated artifacts (workflow jobs), then this can be skipped. Note however that the “migration scripts” embedded within the file Artifact-Repository-Nexus-*.zip (which is embedded in the TAC zipfile) should be applied when not using the provided Nexus image.

Unpack Files

mkdir -p /opt/talend/artifactrepo
cd /opt/talend/artifactrepo
unzip /usr/share/talend/Talend-AdministrationCenter-*.zip
unzip Talend-AdministrationCenter-*/Artifact-Repository-Nexus-*.zip
mv Artifact-Repository-Nexus-*-unix nexus
rm -rf Talend-AdministrationCenter-*
rm -rf Artifact-Repository-Nexus-*
# should leave just "nexus" and "migration*" remaining
cd nexus
ln -s nexus-* curr

Create User Account for Nexus

useradd --system --home /opt/talend/artifactrepo/nexus --shell /usr/sbin/nologin talendnexus
chown -R talendnexus:talendnexus /opt/talend/artifactrepo/nexus

Configure systemd-init to start Nexus

Create a systemd-init unit file with:

cat > /etc/systemd/system/talend-nexus.service <<EOF
[Unit]
Description=Talend Nexus service
After=syslog.target network.target

[Service]
User=talendnexus
Group=talendnexus
LimitNOFILE=65536

Type=forking
Environment=JAVA_HOME=/opt/talend/jre-curr
WorkingDirectory=/opt/talend/artifactrepo/nexus/curr
ExecStart=/opt/talend/artifactrepo/nexus/curr/bin/nexus start
ExecStop=/opt/talend/artifactrepo/nexus/curr/bin/nexus stop
Restart=on-abort

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now talend-nexus

Summary

The above steps start an application listening on http port 8081.
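To confirm that something is actually listening, a quick TCP check can be scripted. This sketch relies on bash's /dev/tcp feature and the coreutils "timeout" command; the helper name port_open is invented for the example.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be opened.
# Assumes bash (for /dev/tcp) and coreutils "timeout" are installed.
port_open() {
    host=$1
    port=$2
    secs=${3:-5}                      # give up after this many seconds
    timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}
```

For example, "port_open localhost 8081 && echo listening". The service may need some time after "systemctl enable --now" before the port opens, so a failed first attempt is not necessarily an error. The same check works for the other ports mentioned on this page.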

There are probably a bunch of options that should be configured for a truly production-quality install of Nexus - I haven’t bothered with these.

By default, Nexus appears to write logs and artifacts into curr/sonatype-work/nexus3 - which is acceptable for a toy environment, but not for production (large amounts of data will be written). A prod environment should probably configure the app (somehow) to store data under /var/nexus or similar.

Install TAC

Overview

This section creates a Tomcat instance running the Talend admin-center webapp. This Tomcat instance may run other webapps (depending on what you install), and various other tools need to run as the same user that runs the TAC webapp, so the user-account created is named “talendadmin” rather than “talendtac”.

Install TAC files

mkdir -p /opt/talend/admin
cd /opt/talend/admin

sh /opt/talend/tomcat/bin/makebase.sh tomcat

unzip /usr/share/talend/Talend-AdministrationCenter*.zip
mv Talend-AdministrationCenter-*/endorsed/*.jar tomcat/lib/
mv Talend-AdministrationCenter-*/org.talend.administrator-*.war tomcat/webapps/org.talend.administrator.war
rm -rf Talend-AdministrationCenter-*  # no longer needed
# unpack into a directory named after the webapp; gives access to config-files
unzip tomcat/webapps/org.talend.administrator.war -d tomcat/webapps/org.talend.administrator

Set Database Connection Params

Edit file tomcat/webapps/org.talend.administrator/WEB-INF/classes/configuration.properties, setting:

database.url=jdbc:mysql://{yourhostname}:3306/talend_admin
database.driver=org.gjt.mm.mysql.Driver
database.username=talend_admin
database.password=pwd1

Note that this is a “workaround” - theoretically, the db parameters can be set from the TAC web interface and then saved to this file with the “finalize” button. However I was unable to get to the point in the web interface where the “finalize” button was enabled without first setting the parameters manually in this file! When the “finalize” button is finally pressed in the UI, the password will be updated with an encrypted version.

Note: File webapps/org.talend.administrator/META-INF/context.xml also has DB connection parameters in it. However AFAICT, these are only used if the webapp is configured to allow Tomcat to manage the database connection pool at servlet-container level, rather than the default approach of having the webapp manage its own database connection pool. It isn’t clear to me what the advantages/disadvantages of these two approaches are. Config-item “database.useContext” (in file WEB-INF/classes/configuration.properties) selects the active option - and defaults to webapp-level connection pooling, ie db connection params are in configuration.properties itself. See the Talend installation guide for more information.

It is also recommended that you change option “database.config.password” so that arbitrary users cannot reconfigure the database that the TAC uses.
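If the file is to be edited from a script rather than by hand, something like the following can be used. It is a sketch only: the helper name set_prop is invented, and it assumes the file uses plain key=value lines with no spaces around the “=”.

```shell
# Hypothetical helper: set key=value in a Java-style properties file,
# replacing the existing line for that key or appending one if none exists.
# Assumption: lines look like "key=value" with no spaces around "=".
set_prop() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    # Values are passed via the environment so awk does not interpret backslashes.
    if SP_KEY=$key SP_VALUE=$value awk '
        BEGIN { FS = "="; k = ENVIRON["SP_KEY"]; v = ENVIRON["SP_VALUE"] }
        $1 == k { print k "=" v; done = 1; next }
        { print }
        END { if (!done) print k "=" v }
    ' "$file" > "$tmp"; then
        cat "$tmp" > "$file"          # overwrite in place, keeping owner and mode
    fi
    rm -f "$tmp"
}
```

For example: "set_prop tomcat/webapps/org.talend.administrator/WEB-INF/classes/configuration.properties database.username talend_admin".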

Install mysql driver

If using mysql as the central database, then:

  • cp /usr/share/java/mysql-connector-java-*.jar tomcat/lib

which will later allow you to configure the TAC webapp to use that driver.

Create a User

useradd --system --home /opt/talend/admin --shell /usr/sbin/nologin talendadmin
chown -R talendadmin:talendadmin /opt/talend/admin

Ports

By default, Tomcat opens the following ports - which are fine for the TAC:

  • admin port: 8005
  • http port: 8080
  • AJP port: 8009

Note that Talend supports “clustering” of TAC instances for high-availability. This is not done by clustering Tomcat itself, but instead by adding a Talend-specific module to Talend which updates “heartbeat records” in the central Talend database instance; Talend client apps check the database to find out which http server address is the “current master” and make requests to that instance. Setting this up is NOT described on this page - see the Talend install guide.

The TAC application itself dynamically opens listening ports; each time it launches a job on an “execution server” it opens a “free” port so that the execution-server (jobserver, runtime, or esb) can report statistics. The port is closed when the job completes. The default port-range that it uses is 10000-11000.

Configure systemd-init to start Tomcat

cat > /etc/systemd/system/talend-admin.service <<EOF
[Unit]
Description=Talend Admin Console (Tomcat instance)
After=syslog.target network.target mysql.service
 
[Service]
User=talendadmin
Group=talendadmin
UMask=0007

Type=simple
 
Environment=JAVA_HOME=/opt/talend/jre-curr
Environment='JAVA_OPTS=-Djava.awt.headless=true -Djava.security.egd=file:/dev/./urandom'
Environment=CATALINA_HOME=/opt/talend/tomcat
Environment=CATALINA_BASE=/opt/talend/admin/tomcat
Environment='CATALINA_OPTS=-Xms512M -Xmx1024M -server -XX:+UseParallelGC'
 
ExecStart=/bin/sh /opt/talend/tomcat/bin/catalina.sh run
 
RestartSec=10
Restart=always
 
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now talend-admin
systemctl status talend-admin
journalctl --unit=talend-admin

Configure SSH for Git Access

The TAC instance will later be configured with credentials to access a git or svn repository in which developer code is to be stored. When using SVN, or using Git with access-protocol http, nothing needs to be done yet. However when using Git with access-protocol ssh, some setup is needed. Running command “git clone ssh://someuser@somehost:someport/reponame” at the command-line as user talendadmin would result in the following prompt:

The authenticity of host '[somehost]:someport' can't be established.
RSA key fingerprint is SHA256:ABCDEF0123456789...
Are you sure you want to continue connecting (yes/no)?

The TAC cannot handle such a prompt, so it is necessary to approve the associated ssh key beforehand. The easiest way to do this is (as root):

mkdir --mode 0700 ~talendadmin/.ssh
ssh-keyscan -p someport somehost > ~talendadmin/.ssh/known_hosts
chown -R talendadmin:talendadmin ~talendadmin/.ssh

where somehost and someport are the address of the git server, eg “{yourhostname}” and “29418” when using gitblit on the local host.
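ssh-keyscan writes nothing if it cannot reach the server, so it is worth checking that an entry really landed in the file. A minimal sketch, assuming the entries are unhashed (ie ssh-keyscan was run without -H); the helper name has_known_host is made up:

```shell
# Hypothetical helper: succeed if known_hosts has an unhashed entry for host:port.
# OpenSSH writes non-default ports as "[host]:port" and port 22 as plain "host".
has_known_host() {
    file=$1 host=$2 port=${3:-22}
    if [ "$port" = 22 ]; then
        entry=$host
    else
        entry="[$host]:$port"
    fi
    # The first field of each line is a comma-separated list of names for one key.
    awk -v e="$entry" '
        { n = split($1, names, ","); for (i = 1; i <= n; i++) if (names[i] == e) found = 1 }
        END { exit !found }
    ' "$file"
}
```

For example: "has_known_host ~talendadmin/.ssh/known_hosts somehost someport || echo 'no host key recorded'".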

Configure Storage Directories

Create a few storage dirs that will be needed later when configuring the TAC via its UI:

mkdir -p /var/talend/audit/reports
chown -R talendadmin:talendadmin /var/talend/audit

mkdir -p /var/talend/admin/generated-jobs
mkdir -p /var/talend/admin/execution-logs
chown -R talendadmin:talendadmin /var/talend/admin

Note: when setting up an HA system, it might be necessary to map these dirs to a shared fs.

During TAC graphical configuration, entries “Settings/Configuration/Audit” and “Settings/Configuration/Job conductor” need to be updated to point to these dirs.

Verify Installation

Visit “http://{yourhostname}:8080/org.talend.administrator” to see the Talend admin page

See the parent install page for info on configuring the TAC webapp itself.

Install Activity Monitoring Console (AMC)

Overview

Various Talend components record status and errors by writing data into a central database. The AMC web app then provides a way for users/admins to view and search these records; Talend Studio also provides a UI for viewing these records.

For more information, see the Talend Activity Monitoring Console User Guide

AMC is a single java servlet engine warfile, and the easiest way to deploy it is to use the TAC server’s Tomcat instance. There might be slight security and stability benefits to running it under a separate Tomcat instance, but I am not sure that Talend supports this (ie it seems from initial experiments that the TAC expects AMC to be running on the same port as it is).

AMC is (afaik) only used via the TAC; when option “monitoring/activity monitoring console” is chosen in the TAC UI, the TAC generates an html page which contains an iframe whose “src” attribute points back to the AMC instance. The AMC instance then renders html through which the user can view records in the AMC schema within the database. Note:

  • it is not currently clear to me how the database connection parameters are passed to the AMC webapp; they are configured in the TAC interface, not in AMC. Perhaps AMC assumes that “localhost:myport/org.talend.administrator” points to a TAC instance?
  • the “AMC url” configured within the TAC needs to use an absolute hostname, not “localhost”, as the url is evaluated in the user’s browser.
  • it is not currently clear to me how authentication works for AMC requests

Talend studio also has an “AMC view” through which logging data can be viewed, but this code reads directly from the database rather than going through the AMC webapp.

Even though the only user of the AMC webapp is the TAC, the code cannot simply be integrated into the TAC for technical reasons. The TAC is implemented as a fairly traditional webapp - it uses GWT (Google Web Toolkit) to generate HTML; it does use some eclipse libraries but not the eclipse framework. The AMC webapp is instead implemented as an Eclipse application using RAP to generate HTML rather than using SWT to generate native graphical widgets; this approach allows the AMC webapp to reuse the same eclipse code that the Talend Studio uses to present the AMC ui - but means that it must be deployed as a separate app rather than merged into the TAC codebase.

Create Table Schema

(following information extracted from the Talend Activity Monitoring Console User Guide)

cat >/tmp/amcsetup.sql <<"EOF"
CREATE DATABASE `amc`;
USE `amc`;

DROP TABLE IF EXISTS `tflowmetercatcher`;
CREATE TABLE `tflowmetercatcher` (
  `moment` datetime DEFAULT NULL,
  `pid` varchar(20) DEFAULT NULL,
  `father_pid` varchar(20) DEFAULT NULL,
  `root_pid` varchar(20) DEFAULT NULL,
  `system_pid` bigint(8) DEFAULT NULL,
  `project` varchar(50) DEFAULT NULL,
  `job` varchar(255) DEFAULT NULL,
  `job_repository_id` varchar(255) DEFAULT NULL,
  `job_version` varchar(255) DEFAULT NULL,
  `context` varchar(50) DEFAULT NULL,
  `origin` varchar(255) DEFAULT NULL,
  `label` varchar(255) DEFAULT NULL,
  `count` int(3) DEFAULT NULL,
  `reference` int(3) DEFAULT NULL,
  `thresholds` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `tlogcatcher`;
CREATE TABLE `tlogcatcher` (
  `moment` datetime DEFAULT NULL,
  `pid` varchar(20) DEFAULT NULL,
  `root_pid` varchar(20) DEFAULT NULL,
  `father_pid` varchar(20) DEFAULT NULL,
  `project` varchar(50) DEFAULT NULL,
  `job` varchar(255) DEFAULT NULL,
  `context` varchar(50) DEFAULT NULL,
  `priority` int(3) DEFAULT NULL,
  `type` varchar(255) DEFAULT NULL,
  `origin` varchar(255) DEFAULT NULL,
  `message` varchar(255) DEFAULT NULL,
  `code` int(3) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `tstatcatcher`;
CREATE TABLE `tstatcatcher` (
  `moment` datetime DEFAULT NULL,
  `pid` varchar(20) DEFAULT NULL,
  `father_pid` varchar(20) DEFAULT NULL,
  `root_pid` varchar(20) DEFAULT NULL,
  `system_pid` bigint(8) DEFAULT NULL,
  `project` varchar(50) DEFAULT NULL,
  `job` varchar(255) DEFAULT NULL,
  `job_repository_id` varchar(255) DEFAULT NULL,
  `job_version` varchar(255) DEFAULT NULL,
  `context` varchar(50) DEFAULT NULL,
  `origin` varchar(255) DEFAULT NULL,
  `message_type` varchar(255) DEFAULT NULL,
  `message` varchar(255) DEFAULT NULL,
  `duration` bigint(8) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


GRANT ALL ON amc.* TO amc@'%' identified by 'pwd1'; -- create user
EOF

mysql < /tmp/amcsetup.sql   # or "mysql -p < /tmp/amcsetup.sql" to be prompted for a password, etc

A developer using Talend Studio usually configures a whole “project” with some DB credentials that have rights to write to the above tables (often via variables so the values can be different in each deployment environment). The “monitoring” section of the TAC web ui must also be configured with DB credentials that have rights to at least read the above tables.

The above SQL creates an “amc” user which has all rights on the amc schema, which works for both the TAC and Talend workflow execution - but a properly secured system might want to restrict this (maybe creating multiple db users).
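One possible split is a write-only account for jobs and a read-only account for the TAC. The sketch below only prints the SQL so that it can be reviewed first. The account names amc_writer and amc_reader are invented, and whether Talend jobs need more than INSERT (or the TAC more than SELECT) has not been verified here.

```shell
# Hypothetical helper: print SQL for a write-only job account and a read-only TAC account.
# Assumptions: the account names are invented; passwords must not contain single quotes.
amc_split_users_sql() {
    writer_pw=$1
    reader_pw=$2
    cat <<EOF
CREATE USER 'amc_writer'@'%' IDENTIFIED BY '$writer_pw';
GRANT INSERT ON amc.* TO 'amc_writer'@'%';
CREATE USER 'amc_reader'@'%' IDENTIFIED BY '$reader_pw';
GRANT SELECT ON amc.* TO 'amc_reader'@'%';
EOF
}
```

The output can be piped into the mysql client in the same way as /tmp/amcsetup.sql above, eg "amc_split_users_sql secret1 secret2 | mysql".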

Install AMC Webapp into TAC Tomcat Instance

cd /opt/talend
unzip /usr/share/talend/Talend-AMC_Web-*.zip  # the ".zip" suffix stops the glob also matching the .MD5 file
mv Talend-AMC_Web-*/amc.war admin/tomcat/webapps/
chown talendadmin:talendadmin admin/tomcat/webapps/amc.war
rmdir Talend-AMC_Web-*

Install MySQL driver

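The AMC app will obviously need a suitable JDBC driver to communicate with the AMC database. However because amc.war is an eclipse-based application rather than a normal webapp (see overview), the Tomcat standard lib directory is ignored; the mysql driver file must instead be installed as follows. Note that the target directory exists only once Tomcat has unpacked amc.war, which it normally does automatically while the talend-admin service is running.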

cp /usr/share/java/mysql-connector-java-*.jar admin/tomcat/webapps/amc/WEB-INF/plugins/org.talend.amc.libraries_*/lib/ext/

Install Log Server

Overview

The Talend LogServer consists of:

  • elasticsearch - a distributed database optimised for full-text search and real-time analytics
  • kibana - for building realtime graphical dashboards of data in elasticsearch
  • logstash - for parsing logfiles (splitting records into fields) and uploading the results into elasticsearch
  • filebeat - for shipping logfiles from each monitored host to logstash

The above components are all open-source but primarily developed by elastic.co and together are known as the ELK stack (Elasticsearch + Logstash + Kibana).

Together, these provide a web UI through which administrators and users can view/search-for all information written to monitored logfiles.

Note: this is separate from talend-amc, the “activity monitoring console”, which is based on writing to a relational database.

File “/usr/share/talend/Talend-LogServer-*-linux-*.tar.gz” contains all four components. Theoretically:

  • elasticsearch should be installed on N hosts for scalability (data is sharded across all instances) and high-availability; N is usually in the range 1..3.
  • kibana should be installed on N hosts for high-availability; N is usually 1 or 2.
  • logstash should be installed on N hosts for high-availability; N is usually 1 or 2
  • filebeat needs to be installed on each host where logfiles exist that should be monitored, sending files to logstash

In practice, talend-logserver will usually be installed on the same hosts as the TAC component (ie 1 host for non-ha environment, and 2 hosts for ha-environment). Filebeat is not usually needed on other hosts, as:

  • the TAC launches jobs via the jobserver, and the jobserver streams data back and stores it on the filesystem of the controlling TAC instance; and
  • jobserver instances are typically configured to store their logs on a shared-filesystem which is mounted on all hosts; the filebeat instances on just 1 or 2 instances can therefore upload the files from the shared-fs.

TODO: when filebeat is on multiple hosts, and watching files on a shared filesystem, won’t filebeat upload the same data multiple times?

Actually, talend-logserver is often not considered critical, ie is often not configured as high-availability even when the TAC is.

Installing Files

# initial setup
mkdir -p /opt/talend
cd /opt/talend

# unpack files
tar zxf /usr/share/talend/Talend-LogServer-*-linux-*.tar.gz
cd logserv

# create user
useradd --system --home /opt/talend/logserv --shell /usr/sbin/nologin talendlogserv
chown -R talendlogserv:talendlogserv /opt/talend/logserv

No configuration is usually necessary.

Warning: File /usr/share/talend/Talend-LogServer-*.zip is actually the windows-specific version! Make sure you unpack the one with “linux” in the name.

Configure systemd-init to start ELK core

cat > /etc/systemd/system/talend-log.service <<EOF
#
# This service is a little odd - file "start_logserver.sh" actually starts
# three processes: elasticsearch, logstash, and kibana. All three create
# pid-files in the working-directory. Because systemd-init is really
# intended only for monitoring a single primary process, and there is
# no primary process here, restart-on-failure is not really applicable.
#

[Unit]
Description=Talend LogServ
After=network.target

[Service]
User=talendlogserv
Group=talendlogserv

Type=forking
Environment=JAVA_HOME=/opt/talend/jre-curr
WorkingDirectory=/opt/talend/logserv
ExecStart=/opt/talend/logserv/start_logserver.sh
ExecStop=/opt/talend/logserv/stop_logserver.sh
Restart=no

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now talend-log
systemctl status talend-log
journalctl --unit=talend-log

Configure Filebeat

This step is actually needed on each server on which files to monitor are present, even when the ELK core is not present. However the following instructions assume the current host already has the rest of the Talend logging components installed - ie you will need to adapt these instructions when installing Filebeat on other servers.

cd /opt/talend
ln -s logserv/filebeat-* filebeat-curr
cd filebeat-curr
vi filebeat.yml # see following instructions

File filebeat.yml should contain one section under “filebeat.prospectors” for each set of files to be monitored. Here is the config for monitoring the Tomcat “admin” logs:

filebeat.prospectors:
- type: log
  enabled: true
  paths:
    - ${LOG_PATH:/opt/talend/admin/tomcat/logs/*}
  fields:
    app_id: ${APP_NAME:TAC}
  fields_under_root: true

And of course, filebeat needs to be able to read the specified files. Filebeat will be run as user “talendlogserv” below, so grant that user access to the logs:

usermod --append --groups talendadmin talendlogserv 

Configure systemd-init to start filebeat

cat > /etc/systemd/system/talend-filebeat.service <<EOF
[Unit]
Description=Talend Filebeat (part of logserv)
After=network.target

[Service]
User=talendlogserv
Group=talendlogserv

Type=simple
Environment=JAVA_HOME=/opt/talend/jre-curr
WorkingDirectory=/opt/talend/filebeat-curr
ExecStart=/opt/talend/filebeat-curr/filebeat -e -c filebeat.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now talend-filebeat
systemctl status talend-filebeat

Install Commandline Tool

Overview

This component is a server that:

  • listens for requests on a network port
  • connects to version-control (git or svn) and checks out the repo
  • compiles a specific Talend workflow (ie transforms xml into a jarfile)
  • and uploads the resulting jarfile into Nexus

The TAC then downloads artifacts from Nexus and executes them on an “execution server” via the jobserver/runtime/esb components.

It is possible for developers to generate jarfiles themselves on their development PCs, and run them on jobserver etc. for testing purposes. However uploading into Nexus is not automatic. In a production-quality environment, it is usual for all Nexus artifacts to be generated via a commandline instance running on a controlled server, rather than allowing developers to push arbitrary code to Nexus.

While commandline instances can be deployed anywhere, it is typical to run one on each server where the TAC is running - there should be more than one for high-availability, but more than two is not needed as the rate at which new compilations are done is not likely to be high.

Why is it called “commandline”? Afaik, because the “network protocol” with which clients communicate with this component is effectively an interactive commandline, rather than REST or other application protocol.

Implementation Notes

The compilation feature is actually implemented as an Eclipse plugin. That obviously makes it available for developers running the IDE. Running it as a server is done by starting a “headless” Talend studio (Eclipse) instance on the server.

Does it make sense to implement a system service on Eclipse? Well, it can - the core of Eclipse is a very small OSGi container, with the rest being OSGi modules (plugins). If the commandline tool was just the OSGi core plus the few required plugins, that would be a reasonable implementation. Sadly, the provided install is the complete studio, including graphical components - rather nonsensical.

Ports

The server opens only a single port (8002 by default).

Installation

cd /opt/talend
unzip /usr/share/talend/Talend-Studio-*.zip
ln -s Talend-Studio-* cmdline
# also copy your Talend license file into /opt/talend/cmdline

useradd --system --home /opt/talend/cmdline --shell /usr/sbin/nologin talendcmdline
chown -R talendcmdline:talendcmdline /opt/talend/cmdline # necessary to allow user talendcmdline to write files...

cat >/etc/systemd/system/talend-cmdline.service <<EOF
[Unit]
Description=Talend Commandline
After=network.target
[Service]
User=talendcmdline
Group=talendcmdline
Type=simple
Environment=JAVA_HOME=/opt/talend/jre-curr
WorkingDirectory=/opt/talend/cmdline
ExecStart=/bin/sh /opt/talend/cmdline/commandline-linux.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now talend-cmdline
systemctl status talend-cmdline

Note that after startup, the application logs a bunch of errors. These appear to be harmless, and are typical for Talend.

To test, “telnet localhost 8002” and at the prompt, type “help”.
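
The port may not be open immediately after systemctl returns, so a scripted install might want to wait for it before testing. The following is a minimal sketch, assuming bash (it relies on bash's /dev/tcp pseudo-device, which plain sh does not provide); the function name wait_for_port is invented for this example.

```shell
#!/usr/bin/env bash
# wait_for_port HOST PORT [TIMEOUT_SECONDS]  (hypothetical helper)
# Retries a TCP connect once per second until it succeeds or the timeout expires.
wait_for_port() {
    host=$1
    port=$2
    timeout=${3:-60}
    deadline=$(( $(date +%s) + timeout ))
    # bash-only: opening /dev/tcp/HOST/PORT attempts a TCP connection;
    # the subshell closes the descriptor again as soon as it exits
    until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 1
    done
    return 0
}

# Example:
#   wait_for_port localhost 8002 120 && telnet localhost 8002
```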

Configure Storage Directories

Create a storage dir (on the commandline host) that will be needed later when configuring the TAC via its UI:

mkdir -p /var/talend/commandline/exports
chown -R talendcmdline:talendcmdline /var/talend/commandline

The directory path must be identical on each host running the commandline server.

AFAICT, the commandline tool itself does not need to be configured with the location of these dirs; that info is presumably passed as part of the network protocol.

Note: when setting up an HA system, it might be necessary to put this dir on a shared fs - not clear at the moment.

During TAC graphical configuration, entry “Settings/Configuration/CommandLine” needs to be updated to point to this dir.

Install Jobserver

Overview

The jobserver is a simple agent that a client (usually the TAC or TalendStudio) can send a Java jarfile to; the jobserver then executes that jarfile in a separate JVM process. The jobserver also reports various statistics back to its client while the job is running.

The installation process should be repeated on each host on which you wish to run Talend jobs - which might or might not include the server on which the TAC(s) run.

NOTE: the “talend runtime” component provides multiple features including an alternate implementation of the jobserver, so normally either jobserver or “talend runtime” should be installed on a host, not both. When only jobserver functionality is needed, use the jobserver component as it is simpler and less resource-intensive. When ESB-related functionality is also needed, use the runtime instead.

It is theoretically possible to install both talend-jobserver and talend-runtime on the same host, and this might provide slight security benefits (launching jobs can be done via the simpler-and-thus-more-secure jobserver binary). However this could be rather confusing (multiple servers on the same host), might interfere with resource management (estimating free capacity on a host), and would require changing ports on one of them as both use 8000, 8001 and 8888 by default.

Install and Configure Application

cd /opt/talend
unzip /usr/share/talend/Talend-JobServer-*.zip
mv Talend-JobServer-* jobserver

mkdir -p /var/talend/jobserver # for data storage

vi jobserver/conf/TalendJobServer.properties
# uncomment line "*.TAC_URLS=..." and modify to point to the actual TAC servers that have been installed
# set ROOT_PATH to point to /var/talend/jobserver

useradd --system --home /opt/talend/jobserver --shell /usr/sbin/nologin talendjobs
chown -R talendjobs:talendjobs /opt/talend/jobserver
chown -R talendjobs:talendjobs /var/talend/jobserver
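
For scripted installs, the manual vi edit above could be replaced by something like the following sketch. The helper set_prop is hypothetical (not a Talend tool); it updates the first property whose key is, or ends in “.” plus, the given name, uncommenting the line if necessary. The TAC URL in the example is a placeholder, and values containing backslashes are not handled.

```shell
# set_prop FILE KEY VALUE  (hypothetical helper)
# Sets the first property whose key equals KEY or ends with ".KEY",
# removing a leading "#" if the line was commented out.
# Fails (leaving FILE untouched) if no such property is found.
set_prop() {
    file=$1
    suffix=$2
    value=$3
    tmp=$(mktemp) || return 1
    awk -v suffix="$suffix" -v value="$value" '
        !done {
            line = $0
            sub(/^[ \t]*#[ \t]*/, "", line)   # ignore a leading comment marker
            eq = index(line, "=")
            if (eq > 1) {
                key = substr(line, 1, eq - 1)
                sub(/[ \t]+$/, "", key)
                tail = substr(key, length(key) - length(suffix))
                if (key == suffix || tail == "." suffix) {
                    print key "=" value
                    done = 1
                    next
                }
            }
        }
        { print }
        END { exit !done }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps owner and mode of FILE
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Example (the URL is a placeholder for your own TAC address):
#   set_prop jobserver/conf/TalendJobServer.properties ROOT_PATH /var/talend/jobserver
#   set_prop jobserver/conf/TalendJobServer.properties TAC_URLS http://tac.example.com:8080/tac
```

It is worth diffing the file against the original afterwards, since the match is by key suffix only.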

Note that the default config settings use ports 8000, 8001 and 8888 - this should not need to be changed.

It is possible to configure the jobserver to launch jobs as another user - but that requires setting up suid scripts, allowing sudo, etc. In particular, the Talend installation guide states that enabling this requires the user-account that runs the jobserver itself to have sudo-rights for command “/bin/sh” - ie that user-account is effectively root on the machine. IMO, it is more secure to not enable this “user impersonation” and instead run all jobs as the jobserver user.

Configure systemd-init to start Jobserver

cat > /etc/systemd/system/talend-jobserver.service <<EOF
[Unit]
Description=Talend Jobserver
After=syslog.target network.target

[Service]
User=talendjobs
Group=talendjobs

Type=simple
WorkingDirectory=/opt/talend/jobserver 
ExecStart=/opt/talend/jobserver/start_rs.sh
# ExecStop not needed when type=simple
Restart=on-failure
RestartPreventExitStatus=255

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now talend-jobserver
systemctl status talend-jobserver
journalctl --unit=talend-jobserver

After the jobserver installations have been started, they need to be registered with the TAC via the TAC web ui - see the Talend installation guide.

Install Runtime

Overview

The component named “runtime” is rather too generically named; it is actually a component that provides a single Java process running an Apache Karaf OSGi environment, with a collection of plugins. The “camel” plugin provides generic message-bus functionality (data transformation and routing), and “endpoints” for integrating with other software - ie an “enterprise service bus” aka ESB. Talend studio users can develop transformations, routing rules, etc. and then upload them into the Nexus artifact repository, from where a “runtime” will download and install them.

The runtime component includes a plugin that supports executing “jobserver” jobs, ie when runtime is installed, jobserver should not be. If ESB functionality is not needed, just install the (simpler) jobserver component.

Install Component Files

cd /opt/talend
unzip /usr/share/talend/Talend-Runtime-*.zip
mv Talend-Runtime-* runtime

useradd --system --home /opt/talend/runtime --shell /usr/sbin/nologin talendrt
chown -R talendrt:talendrt runtime

Note that in version 7.0.1, file runtime/bin/setmem line 68 is not compatible with openjdk - the code parses the output of “java -version”, but expects only oracle-java, not openjdk. This can be fixed by replacing line 68 with:

# if [ ${JAVA_VERSION_MINOR} -lt 8 ]; then
if [ 0 ]; then
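
The edit above can also be scripted. The sketch below (function name patch_setmem is invented here) only touches line 68 if it still contains the uncommented JAVA_VERSION_MINOR test, so it refuses to run on a file that has a different layout or has already been patched. Note that in sh, “[ 0 ]” tests a non-empty string and is therefore always true, ie the branch is always taken after the change.

```shell
# patch_setmem FILE  (hypothetical helper)
# Comments out line 68 and inserts "if [ 0 ]; then" after it, but only when
# line 68 is the uncommented JAVA_VERSION_MINOR check. Fails otherwise.
patch_setmem() {
    file=$1
    tmp=$(mktemp) || return 1
    awk '
        NR == 68 && /JAVA_VERSION_MINOR/ && !/^[ \t]*#/ {
            print "# " $0
            print "if [ 0 ]; then"
            patched = 1
            next
        }
        { print }
        END { exit !patched }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps owner and mode of FILE
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Example:
#   cp runtime/bin/setmem runtime/bin/setmem.orig
#   patch_setmem runtime/bin/setmem || echo "setmem not patched - check line 68 by hand"
```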

Ports

By default, talend-runtime opens the following ports:

  • karaf rmi access: 1099 (rmi registry port) and 44444 (rmi server port)
  • karaf ssh access: 8101
  • jobserver support: port 8000 (command), 8001 (filetransfer), 8555 (message-port for status info) and 8888 (monitoring)
  • esb support: port 8040 (http) and 9001 (https)

In addition:

  • looks like some Talend ESB components expect zookeeper to be on port 2181 on localhost
  • looks like some Talend ESB components expect a WSDL service to be on port 8042, but none is started
  • looks like some Talend ESB components expect an OIDC auth service to be on port 9080, but none is started

If port or ssl changes are necessary, see files in directory “runtime/etc/”.

Other Config

The config settings mentioned for “jobserver” above also apply to talend-runtime; see file “runtime/etc/org.talend.remote.jobserver.server.cfg”.

Configure systemd-init to start Runtime

cat >/etc/systemd/system/talend-runtime.service <<EOF
[Unit]
Description=Talend Runtime
After=syslog.target network.target

[Service]
User=talendrt
Group=talendrt

Type=forking
WorkingDirectory=/opt/talend/runtime
PIDFile=/opt/talend/runtime/karaf.pid
ExecStart=/opt/talend/runtime/bin/start
ExecStop=/opt/talend/runtime/bin/stop 
Restart=on-failure
RestartPreventExitStatus=255

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now talend-runtime
systemctl status talend-runtime
journalctl --unit=talend-runtime

Note that the Talend install guide recommends executing command “runtime/bin/trun” rather than “runtime/bin/start”. However doing that from systemd-init results in a NullPointerException in the talend-runtime; afaict the cause is that “trun” starts an interactive console reading from stdin, but systemd-init provides /dev/null as the stdin for processes it launches. Running “runtime/bin/trun </dev/null” reproduces the NullPointerException problem. Using script “start” appears to work from systemd-init.
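
Once the service is up, a quick way to confirm that the ports listed under “Ports” are actually open is to compare them against the output of “ss -ltn”. The helper below is a sketch with an invented name (missing_ports); it assumes the usual ss column layout, where the fourth column is the local address and port.

```shell
# missing_ports PORT...  (hypothetical helper)
# Reads `ss -ltn` style output on stdin and prints every given port
# for which no listening socket was seen.
missing_ports() {
    awk -v want="$*" '
        BEGIN { n = split(want, ports, " ") }
        {
            addr = $4              # e.g. 0.0.0.0:8000, *:8000 or [::]:8000
            sub(/.*:/, "", addr)   # keep only the part after the last colon
            seen[addr] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(ports[i] in seen)) print ports[i]
        }
    '
}

# Example (jobserver, esb and karaf ssh ports):
#   ss -ltn | missing_ports 8000 8001 8555 8888 8040 9001 8101
```

No output means all of the given ports are listening; the runtime may take a little while after startup before all of them are open.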

Re Zookeeper

Zookeeper has 2 different use-cases within Talend Platform:

  • As the Infrastructure Resource Manager of Apache Kafka, in order to allow for internal communication with Talend Data Preparation, Talend Data Stewardship and Talend Dictionary Service.
  • As the Service Discovery and Failover functionality of Talend ESB called Talend Service Locator and running inside Talend Runtime.

It is not needed if you do not intend to run DataPrep/DataStewardship/DictionaryService, and do not intend to use a service-locator with ESB-based services.

The ESB-container and Runtime-container components include a Zookeeper server. However this is for simple setup only; for high-availability, Zookeeper should be run as a standalone cluster of either 3 or 5 nodes (depending on your uptime requirements).

Re Kafka

As noted under Zookeeper, Kafka is needed by DataPrep/DataStewardship/DictionaryService. If you need one of these, then also install a Kafka cluster (which can be one node for testing).

Re MongoDB

Some components require a MongoDB instance. If you are running these components, then also install a MongoDB cluster (which can be one node for testing).