Gemini

Find similar code in Git repositories

Gemini is a tool for searching for similar 'items' in source code repositories. The supported granularity levels for items are:

  • repositories (TBD)
  • files
  • functions

Gemini is based on its sister research project codenamed Apollo.

Run

./hash   <path-to-repos-or-siva-files>
./query  <path-to-file>
./report

If you run Gemini in Docker, you need to prefix the commands with docker-compose exec gemini. Read below how to start Gemini in Docker or standalone mode.

Hash

To pre-process a number of repositories for quick duplicate detection, run:

./hash ./src/test/resources/siva

The input format of the repositories is the same as in src-d/Engine.

To pre-process repositories for a search of similar functions, run:

./hash -m func ./src/test/resources/siva

Besides the local file system, Gemini supports several distributed storages, described below.

Query

To find all duplicates of a single file, run:

./query <path-to-single-file>

To find all similar functions defined in a file, run:

./query -m func <path-to-single-file>

If you are interested in similarities of only one function defined in the file, you can run:

./query -m func <path-to-single-file>:<function name>:<line number where the function is defined>
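
For instance, to query for functions similar to a (hypothetical) function Consume defined at line 42 of the consumer.go file mounted under /query:

./query -m func /query/consumer.go:Consume:42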

Report

To find all duplicate files and similar functions in all repositories, run:

./report

All repositories must be hashed beforehand, and a community detection library must be installed.
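
For machine-readable output you can switch the report format (see CLI arguments below); redirecting to a file is just an example and assumes the report is printed to stdout:

./report --output-format json > report.json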

Requirements

Docker

Start containers:

docker-compose up -d

The local directories repositories and query are available as /repositories and /query inside the container.

Examples:

docker-compose exec gemini ./hash /repositories
docker-compose exec gemini ./query /query/consumer.go
docker-compose exec gemini ./report

Standalone

You would need:

  • JVM 1.8
  • Apache Cassandra or ScyllaDB
  • Apache Spark 2.2.x
  • Python 3
  • Bblfshd v2.5.0

By default, all commands use:

  • Apache Cassandra or ScyllaDB instance available at localhost:9042
  • Apache Spark, available through $SPARK_HOME
# save some repos in .siva files using Borges
echo -e "https://github.com/src-d/borges.git\nhttps://github.com/erizocosmico/borges.git" > repo-list.txt

# get Borges from https://github.com/src-d/borges/releases
borges pack --loglevel=debug --workers=2 --to=./repos -f repo-list.txt

# start Apache Cassandra
docker run -p 9042:9042 \
  --name cassandra -d rinscy/cassandra:3.11

# or ScyllaDB, with a workaround for https://github.com/gocql/gocql/issues/987
docker run -p 9042:9042 --volume $(pwd)/scylla:/var/lib/scylla \
  --name some-scylla -d scylladb/scylla:2.0.0 \
  --broadcast-address 127.0.0.1 --listen-address 0.0.0.0 --broadcast-rpc-address 127.0.0.1 \
  --memory 2G --smp 1

# to get access to DB for development
docker exec -it some-scylla cqlsh

Configuration for Apache Spark

Use environment variables to set the memory for the hash job:

export DRIVER_MEMORY=30g
export EXECUTOR_MEMORY=60g

To use an external cluster, set the URL of the Spark master through an environment variable:

MASTER="spark://<spark-master-url>" ./hash <path>

CLI arguments

All three commands accept parameters for database connection and logging:

  • -h/--host - cassandra/scylla db hostname, default 127.0.0.1
  • -p/--port - cassandra/scylla db port, default 9042
  • -k/--keyspace - cassandra/scylla db keyspace, default hashes
  • -v/--verbose - produce more verbose output, default false

The query and hash commands also accept parameters for bblfsh/feature extractor configuration:

  • -m/--mode - similarity modes: file or function, default file
  • --bblfsh-host - babelfish server host, default 127.0.0.1
  • --bblfsh-port - babelfish server port, default 9432
  • --features-extractor-host - features-extractor host, default 127.0.0.1
  • --features-extractor-port - features-extractor port, default 9001

Hash command-specific arguments:

  • -l/--limit - limit the number of repositories to process; all repositories are processed by default
  • -f/--format - format of the stored repositories; supported formats are siva, bare, or standard, default siva
  • --gcs-keyfile - path to JSON keyfile for authentication in Google Cloud Storage

Report-specific arguments:

  • --output-format - output format: text or json
  • --cassandra - enable advanced CQL queries for an Apache Cassandra database
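
As a sketch of how these options combine (the host names, keyspace, and paths below are placeholders):

# hash bare repositories, using a remote database and a remote bblfsh server
./hash -f bare -h db.internal -k my_hashes --bblfsh-host bblfsh.internal /path/to/repos

# query a file against the same keyspace with verbose logging
./query -v -h db.internal -k my_hashes /path/to/file.py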

Limitations

Currently Gemini targets medium-sized repositories and datasets.

We set reasonable defaults and pre-filtering rules to provide the best results for this case. The rules are:

  • Exclude binary files
  • Exclude empty files from full duplication results
  • Exclude files less than 500B from file-similarity results
  • Similarity deduplication works only for languages supported by Babelfish and for syntactically correct files
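
To get a rough idea of which files the 500-byte rule above would skip in a local checkout, a quick check with standard shell tools (the path is a placeholder):

# list files smaller than 500 bytes; these are excluded from file-similarity results
find /path/to/repo -type f -size -500c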

Performance tips

We recommend running Spark with 10 GB of memory for each executor and for the driver. Gemini would not benefit from more than 1 CPU per task.

Horizontal scaling does not work well for the first stage of the pipeline, whose runtime depends on the size of the biggest repositories in the dataset, but the rest of the pipeline scales well.
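
Following that recommendation, a hash run on an external cluster might be started roughly like this (the Spark master URL is a placeholder):

export DRIVER_MEMORY=10g
export EXECUTOR_MEMORY=10g
MASTER="spark://spark-master:7077" ./hash /repositories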

Distributed storages

Gemini supports several distributed storages in both local and cluster mode. All the necessary jars are already included as part of the fat jar.

HDFS

Path format to git repositories: hdfs://hdfs-namenode/path

To configure HDFS in local or cluster mode, please consult the Hadoop documentation.

Google Cloud Storage

Path format to git repositories: gs://bucket/path

To connect to GCS locally, use the --gcs-keyfile flag with the path to a JSON keyfile.

To use GCS in cluster mode, please consult the Google Cloud Storage Connector documentation.

Amazon Web Services S3

Path format to git repositories: s3a://bucket/path

To connect to S3 locally, use the following flags:

  • --aws-key - AWS access key
  • --aws-secret - AWS access secret
  • --aws-s3-endpoint - region endpoint of your S3 bucket

Due to some limitations, passing the key and secret as part of the URI is not supported.

To use AWS S3 in cluster mode, please consult the hadoop-aws documentation.
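
As a sketch, a local hash run against each of these storages could look roughly like this (bucket names, the endpoint, and the keyfile path are placeholders):

# HDFS
./hash hdfs://hdfs-namenode/path/to/repos

# Google Cloud Storage
./hash --gcs-keyfile /path/to/keyfile.json gs://my-bucket/repos

# Amazon Web Services S3
./hash --aws-key <key> --aws-secret <secret> --aws-s3-endpoint s3.eu-west-1.amazonaws.com s3a://my-bucket/repos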

Known bugs

  • Search for similarities in C# code isn't supported right now (patch with workaround)
  • In our experience, the timeout for UAST extraction is relatively low on real datasets, and it isn't configurable (patch1 and patch2 with workaround)
  • For the standard & bare formats Gemini prints a wrong repositories listing (issue)

Development

Compile & Run

If the env var DEV is set, ./sbt is used to compile and run all non-Spark commands: ./hash and ./report. This is convenient for local development: not requiring a separate "compile" step allows for a dev workflow similar to working with interpreted languages.
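
For example (any non-empty value of DEV should do; the value here is an assumption):

# compile and run through ./sbt instead of the pre-built jars
DEV=1 ./report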

Build

To build the final .jars for all commands, run:

./sbt assemblyPackageDependency
./sbt assembly

Instead of one fat jar we build two, separating all the dependencies from the actual application code to allow for lower build times in case of simple changes.

Test

To run the tests, which rely on the services described in the requirements above, run:

./sbt test

Re-generate code

The latest generated gRPC code is already checked in under src/main/scala/tech/sourced/featurext. In case you update any of the src/main/proto/*.proto files, you need to re-generate the gRPC code for the feature extractors:

./src/main/resources/generate_from_proto.sh

To generate new protobuf message fixtures for tests, you may use bblfsh-sdk-tools:

bblfsh-sdk-tools fixtures -p .proto -l <LANG> <path-to-source-code-file>

License

Copyright (C) 2018 source{d}. This project is licensed under the GNU General Public License v3.0.