Goby Benchmark

GOBY: An Enterprise Benchmark for Data Integration

¹MIT, ²Intel Labs, ³AWS AI Labs, ⁴Technical University of Munich, ⁵University of Washington

Overview

What is GOBY?
GOBY is a benchmark dataset for evaluating data integration techniques on enterprise data. It was derived from a real-world production workload in the event promotion and marketing domain, compiled around 2017. Unlike public benchmarks, GOBY draws on private datasets, which makes it more representative of enterprise challenges.

Where was it collected?
The data was collected over several years using more than 1,000 wrappers developed by professionals. These wrappers converted web pages and APIs into relational tables.

What does it represent?
GOBY contains over 4 million rows of event data. It includes detailed semantic labels such as event locations, organizers, and other metadata relevant to the domain. It reflects the structural and semantic complexity often found in enterprise datasets.

What does it contain?
- Source Tables: Nearly 1,200 source tables, each generated from a wrapper.
- Semantic Types: A hierarchy of semantic types developed by domain experts.
- Universal Schema: A unified schema combining all source tables.
- Statistics:
   - 4.04 million rows
   - 23,203 columns
   - Average 3,405 rows per table
   - Average 20 columns per table
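As a quick consistency check, the reported totals and averages can be related to the "nearly 1,200 source tables" figure. The sketch below uses only the numbers listed above; the variable names are illustrative, and the row total is the rounded value from the list, so the results are approximate.

```python
# Figures reported in the statistics list above.
total_rows = 4_040_000        # "4.04 million rows" (rounded in the source)
total_columns = 23_203
avg_rows_per_table = 3_405

# Implied table count: total rows divided by the average rows per table.
implied_tables = total_rows / avg_rows_per_table

# Implied average width: total columns divided by the implied table count.
implied_cols_per_table = total_columns / implied_tables

print(f"implied tables: {implied_tables:.0f}")
print(f"implied columns per table: {implied_cols_per_table:.1f}")
```

This gives roughly 1,186 tables and about 19.6 columns per table, in line with the reported "nearly 1,200" tables and an average of 20 columns.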

GOBY is semantically richer and structurally more complex than typical public benchmarks, such as VizNet or T2Dv2, making it well-suited for enterprise data integration tasks.

File Structure of GOBY

The primary data archive, goby.tar.gz, contains the following key directories:

dump/: PostgreSQL dump files that include:
- doit_categories: Data categories with record counts.
- doit_data: Triple-style (entity, attribute, value) data, stored as rows of (category_id, source_id, entity_id, name, value).
- Additional mapping and result files.
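To illustrate how rows in the doit_data layout might be turned back into per-source tables, here is a minimal sketch. It assumes that each row carries one attribute (name, value) of the entity entity_id from the source source_id; this reading of the field list is an assumption, and the sample rows are invented for illustration rather than taken from the dataset.

```python
from collections import defaultdict

# Hypothetical sample rows in the
# (category_id, source_id, entity_id, name, value) layout.
rows = [
    (1, 10, 100, "title", "Jazz Night"),
    (1, 10, 100, "venue", "Blue Room"),
    (1, 10, 101, "title", "Art Fair"),
    (1, 11, 200, "event_name", "Food Festival"),
]


def pivot_by_source(rows):
    """Group attribute rows into one record per entity, keyed by source."""
    tables = defaultdict(lambda: defaultdict(dict))
    for _category_id, source_id, entity_id, name, value in rows:
        tables[source_id][entity_id][name] = value
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {
        source_id: {entity_id: dict(record) for entity_id, record in entities.items()}
        for source_id, entities in tables.items()
    }


tables = pivot_by_source(rows)
for source_id, entities in tables.items():
    # The column set of a source table is the union of its attribute names.
    columns = sorted({name for record in entities.values() for name in record})
    print(f"source {source_id}: columns = {columns}")
    for entity_id, record in entities.items():
        # Attributes missing for an entity show up as None.
        print(f"  entity {entity_id}: {[record.get(c) for c in columns]}")
```

Each source ends up with its own column set, which mirrors the idea of one source table per wrapper.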

Download Instructions

To access the GOBY dataset:

Download the goby.zip file with the button below. The archive is protected with the following password:
```
GOBY2025
```

Download ZIP

Extract the ZIP file with your file explorer, or from the terminal using a command like `unzip -P your_password goby.zip -d /path/to/extract/`.
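The extraction step can also be scripted. The following is a minimal sketch using Python's standard zipfile module; the function name extract_goby is hypothetical. Note that zipfile can only decrypt archives that use the legacy ZIP encryption scheme, so if goby.zip uses a different scheme (for example AES), the unzip command or another tool would be needed instead.

```python
import zipfile
from pathlib import Path


def extract_goby(zip_path, dest_dir, password):
    """Extract a password-protected ZIP archive and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # zipfile expects the password as bytes; it is ignored for
        # members that are not encrypted.
        archive.extractall(path=dest, pwd=password.encode("utf-8"))
        return archive.namelist()
```

A call such as extract_goby("goby.zip", "/path/to/extract/", "your_password") would then unpack the archive into the chosen directory.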

BibTeX

If you use this dataset, cite our companion CIDR 2025 paper:


@inproceedings{cidr-goby,
  author       = {Moe Kayali and Fabian Wenz and Nesime Tatbul and Cagatay Demiralp},
  title        = {Mind the Data Gap: Bridging Large Language Models (LLMs) to Enterprise Data Integration},
  booktitle    = {15th Conference on Innovative Data Systems Research, {CIDR} 2025, Amsterdam, The Netherlands, January 19-22, 2025},
  publisher    = {www.cidrdb.org},
  year         = {2025}
}
