Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Remove duplicated repositories #200

Merged
merged 1 commit into from
Feb 19, 2019
Merged

Remove duplicated repositories #200

merged 1 commit into from
Feb 19, 2019

Conversation

smacker
Copy link
Contributor

@smacker smacker commented Feb 15, 2019

In case of bare or standard repositories engine can return more than 1
repository for the same directory.

The most common case is when you just "git clone" remote repository and
run gemini on it.

Engine would produce 2 different repositories:

As a result all files in the repository will have duplicates which is
incorrect from user point of view.

Signed-off-by: Maxim Sukharev max@smacker.ru

@se7entyse7en
Copy link

A couple of things, correct me if I'm wrong:

  • here in case of limit > 0 and format != "siva", you're querying again the repos dataframe. Given that you already computed the repos ids, you could just take the first limit elements and filter just once.
  • I always prefer immutable variables as IMO makes the code easier to follow, wdyt of something like this?
  def getRepos(reposPath: String, limit: Int = 0, format: String = "siva"): DataFrame = {
    val engine = Engine(session, reposPath, format)
    val repos = engine.getRepositories

    val reposIds = format match {
      case "siva" => if (limit <= 0) None else Some(
        repos.limit(limit).select($"id").collect().map(_ (0)))
      case _ => Some(getReposIds(repos, limit))
    }

    reposIds.fold(repos)(rIds => repos.filter($"id".isin(rIds: _*)))
  }

  def getReposIds(repos: DataFrame, limit: Int): Array[Any] = {
    val df = repos
      .getHEAD
      .groupByKey(_.getAs[String]("hash"))
      .reduceGroups {(r1, r2) =>
        // In case of multiple repositories have the same HEAD we keep only one
        // we prefer remote over local one
        // in case of multiple remotes we choose random one
        val v1 = r1.getAs[String]("repository_id")
        if (v1.isEmpty || v1.startsWith("file://")) {
          r2
        } else {
          r1
        }
      }
      .withColumnRenamed("ReduceAggregator(org.apache.spark.sql.Row)", "tmp")
      .select("tmp.repository_id")

    (if (limit <= 0) df else df.limit(limit))
      .collect()
      .map(_ (0))
  }

@smacker
Copy link
Contributor Author

smacker commented Feb 15, 2019

  1. Thanks! You are right. I just didn't touch old logic. But I can improve it as you suggested. Will do.
  2. I'm not scala expert but to be honest it's much harder for me to read your example compare to current code. But I got your idea, I'll think about it and try to come up with immutable and still simple code.

@se7entyse7en
Copy link

se7entyse7en commented Feb 15, 2019

I'm not scala expert

Me neither, it would be great to have an input from someone with more experience for this kind of more idiomatic stuff.

In case of bare or standard repositories engine can return more than 1
repository for the same directory.

The most common case is when you just "git clone" remote repository and
run gemini on it.

Engine would produce 2 different repositories:
- local one: file://path/to/directory
- remote one: https://github.com/...

As a result all files in the repository will have duplicates which is
incorrect from user point of view.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
@smacker
Copy link
Contributor Author

smacker commented Feb 19, 2019

@se7entyse7en I have changed the code to improve limit logic and made it immutable.

It's not as short as you suggest and maybe less idiomatic. But imo it's easy to understand which was my point. Taking into account that all gemini devs aren't hard-core scala developers, I value readiablity higher than idiomacity.

@smacker smacker merged commit b386c0e into src-d:master Feb 19, 2019
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants