Selected projects

    The projects below reflect my main research interests, but I enjoy working on other topics too. Check my profile on Google Scholar for my full publication record.

    Dataset Search [WIP, 2024]
    1) Survey results surfacing why, what, and how people search for data, along with key open challenges and system desiderata.
    2) System (tbc).
    1) survey paper

    GitSchemas [DBML@ICDE 2022, SIGMOD 2024]
    A dataset of approximately 50K real-world database schemas extracted from SQL files from GitHub.
    paper | code/dataset

    Observatory [PVLDB, NeurIPS, 2023]
    1) Framework for analyzing table embeddings based on the relational model, and desiderata for table representation learning (TRL) models.
    2) Library for extracting table embeddings at the row, column, and cell level.
    1) analysis paper | 2) library paper | code

    GitTables [SIGMOD, 2023]
    Corpus of 1.7M relational tables extracted from GitHub CSVs. Columns are annotated with semantic types.
    paper | website | dataset | code | video presentation | slides | podcast

    AdaTyper [CIDR, 2022]
    Adaptive semantic column type detection system focusing on productization in industry contexts.
    paper | video presentation

    Sherlock [KDD, 2019]
    Deep learning method for semantic data type detection of table columns (among the top 5 MIT Media Lab repositories as of 2 Aug 2023).
    paper | website | code

    VizNet [CHI, 2019]
    Corpus of over 31 million datasets from open data repositories, for benchmarking visualization studies.
    paper | website