Home - Apache Gobblin

Over the years, LinkedIn's data infrastructure team built custom solutions for ingesting diverse data entities into our Hadoop ecosystem. At one point, we were running 15 types of ingestion pipelines, which created significant data quality, metadata management, development, and operational challenges.

Our experiences and challenges motivated us to build Gobblin. Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources (e.g., databases, REST APIs, FTP/SFTP servers, and filers) onto Hadoop. Gobblin handles the routine tasks common to all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, and data publishing. Gobblin ingests data from different data sources in the same execution framework and manages the metadata of all sources in one place. This, combined with other features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability to handle data model evolution, makes Gobblin an easy-to-use, self-service, and efficient data ingestion framework.
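To make the idea of a single execution framework more concrete, here is a minimal Python sketch of how partitioned work from different sources might pass through the same convert, quality-check, and publish steps. It is an illustration only: `WorkUnit`, `JobState`, `run_task`, and `run_job` are hypothetical names chosen for this example and are not Gobblin's API, and the per-source record count stands in for whatever state a real job would keep.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class WorkUnit:
    """Hypothetical: one partition of a source, processed by one task."""
    source_name: str
    records: List[Record]


@dataclass
class JobState:
    """Hypothetical: count of records published per source so far."""
    published_counts: Dict[str, int] = field(default_factory=dict)


def run_task(
    unit: WorkUnit,
    convert: Callable[[Record], Record],
    is_valid: Callable[[Record], bool],
) -> List[Record]:
    """Convert and quality-check every record in one work unit."""
    staged = []
    for record in unit.records:
        converted = convert(record)
        if not is_valid(converted):
            raise ValueError(f"quality check failed in {unit.source_name}")
        staged.append(converted)
    return staged


def run_job(
    units: List[WorkUnit],
    convert: Callable[[Record], Record],
    is_valid: Callable[[Record], bool],
    state: JobState,
) -> List[Record]:
    """Run all work units through the same steps, whatever their source."""
    published = []
    for unit in units:
        try:
            staged = run_task(unit, convert, is_valid)
        except ValueError:
            # Error handling: a failed unit publishes nothing,
            # and the remaining units still run.
            continue
        published.extend(staged)
        previous = state.published_counts.get(unit.source_name, 0)
        state.published_counts[unit.source_name] = previous + len(staged)
    return published
```

In this sketch a work unit is published only if every record in it passes the quality check, so one bad partition does not block data from the other sources.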

You can find a lot of useful resources in our wiki pages, including how to get started with Gobblin, an architecture overview of Gobblin, and the Gobblin user guide. We also provide a discussion group: Google Gobblin-Users Group. Please feel free to post any questions or comments.

For a detailed overview, please take a look at the VLDB 2015 paper and LinkedIn's Gobblin blog post.
