Hot questions for Spring Cloud Data Flow


Question:

I am migrating from Spring XD to Spring Cloud Data Flow. While going through the module list, I realised that some of the sources are not listed in Spring Cloud Data Flow - one of them is the Kafka source.

My question is: why was the Kafka source removed from the standard sources list in Spring Cloud Data Flow?


Answer:

While going through the module list, I realised that some of the sources are not listed in Spring Cloud Data Flow

The majority of the applications have been ported over, and the remaining ones are being incrementally prioritized - you can keep track of the remaining subset in the backlog.

My question is: why was the Kafka source removed from the standard sources list in Spring Cloud Data Flow?

Kafka is not removed. In fact, we are highly opinionated about Kafka in the context of streaming use-cases, so much so that it is baked directly into the DSL. More details here.

For instance,

(i) if you have to consume from a Kafka topic (as a source), your stream definition would be:

stream create --definition ":someAwesomeTopic > log" --name subscribe_to_broker --deploy

(ii) if you have to write to a Kafka topic (as a sink), your stream definition would be:

stream create --definition "http --server.port=9001 > :someAwesomeTopic" --name publish_to_broker --deploy

(where *someAwesomeTopic* is the named destination, i.e., the Kafka topic name)
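A named destination can also appear on both sides of the pipe, so you can bridge two topics with a processor in between. A minimal sketch, assuming the out-of-the-box transform processor is registered (the topic names and the expression are illustrative):

stream create --definition ":inputTopic > transform --expression=payload.toUpperCase() > :outputTopic" --name bridge_topics --deploy

Here the stream consumes from inputTopic, uppercases each payload, and publishes the result to outputTopic, all without writing any custom source or sink code.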

Question:

We are exploring various programming/library options (on the Java side of the world) for faster batch processing that can also be deployed to the cloud. We came across Spring Batch/XD/Cloud Data Flow. From a quick review of the documentation at http://cloud.spring.io/spring-cloud-dataflow/, we could not assess whether Spring Cloud Data Flow also has all the batch-processing features that Spring Batch offers. For example, here is what the Spring Batch documentation (http://projects.spring.io/spring-batch/) says: "Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management."

If someone has any insight into the batch-processing capabilities of Spring Cloud Data Flow, could you please post it here? Many thanks!


Answer:

Please review the Spring Cloud Task project. It provides the framework and programming model for developing "short-lived" microservice applications.

At a high level, a Task can be any process that does not run indefinitely, including Spring Batch jobs. This gives you the flexibility to develop Spring Batch jobs using all of Spring Batch's core capabilities, and to run them as standalone Spring Boot applications. There are some samples here.

Spring Cloud Data Flow builds upon Spring Cloud Task to provide orchestration capabilities for batch data pipelines. A wide range of options, including the Shell, the DSL, the Admin UI, and the Flo UI, are available to orchestrate batch workloads. You can also use the utility Task applications in Spring Cloud Data Flow, and that list is growing.
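To illustrate the Shell-based orchestration, here is a minimal sketch of registering a Spring Batch job packaged as a task application and launching it; the application name and Maven coordinates are hypothetical placeholders for your own artifact:

app register --name mybatchjob --type task --uri maven://com.example:mybatchjob-task:1.0.0
task create mybatchjob-task --definition "mybatchjob"
task launch mybatchjob-task

Once launched, the task's execution status (including the underlying Spring Batch job's exit code) can be inspected from the same shell or from the Admin UI.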

Question:

Team, I am currently working with Spring XD, using it as a runtime container for data analytics and YARN jobs.

My questions are

1) Can I leverage the same environment setup which I used for Spring XD?

2) From the documentation, I read that it can be deployed as microservices. Is it using embedded drivers for stream processing? If it is using embedded drivers, can I deploy it in a clustered environment with the same infrastructure leveraged for Spring XD?

3) Are there any specific wrappers built for Apache Spark?

My environment: Spark 1.6.1, Hadoop 2.7.2, ZooKeeper 3.6.8, Redis 3.2, Spring XD 1.3.1

Any help on these specific queries would be highly appreciated.


Answer:

Can I leverage the same environment setup which I used for Spring XD?

Spring Cloud Data Flow (SCDF) relies on the spring-cloud-deployer SPI, and there are implementations for Cloud Foundry, Apache YARN, Apache Mesos, and Kubernetes. Given that you already have a Hadoop cluster with YARN in use, you could provision the YARN implementation of the SCDF server.

Is it using embedded drivers for stream processing?

It is not clear what you mean by this. If you're referring to JDBC drivers: specifically, when using jdbc as a sink application, we do embed OSS-friendly drivers, yes.

If it is using embedded drivers, can I deploy it in a clustered environment with the same infrastructure leveraged for Spring XD?

Perhaps the answer to #1 covers this. You could leverage the same infrastructure and provision the YARN SCDF server using the Ambari plugin.

Are there any specific wrappers built for Apache Spark?

We have Spark client and Spark cluster task applications. You can register them in SCDF to build task/batch pipelines.
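As a sketch of how such a pipeline might look from the SCDF shell, the following registers a Spark client task application and launches a job with it; the Maven coordinates, version, property names, and jar path are illustrative assumptions, so check them against the task-app-starters documentation for your release:

app register --name spark-client --type task --uri maven://org.springframework.cloud.task.app:spark-client-task:1.0.0.RELEASE
task create spark-pi --definition "spark-client --spark-app-class=org.apache.spark.examples.SparkPi --spark-app-jar=/path/to/spark-examples.jar"
task launch spark-pi

The same pattern applies to the cluster variant; only the registered application and its deployment properties change.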