spark-dev mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Documenting the various DataFrame/SQL join types
Date Tue, 08 May 2018 13:13:35 GMT
The documentation for DataFrame.join()
<https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join>
lists all the join types we support:

   - inner
   - cross
   - outer
   - full
   - full_outer
   - left
   - left_outer
   - right
   - right_outer
   - left_semi
   - left_anti

Some of these join types are also listed on the SQL Programming Guide
<http://spark.apache.org/docs/2.3.0/sql-programming-guide.html#supported-hive-features>.

Is it obvious to everyone what all these different join types are? For
example, I had never heard of a LEFT ANTI join until stumbling on it in the
PySpark docs. It’s quite handy! But I had to experiment with it a bit just
to understand what it does.
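
For anyone else who hasn't met these two, here is a minimal plain-Python sketch of how I understand LEFT SEMI and LEFT ANTI to behave. It is a model of the semantics only, not Spark code: rows are plain dicts, the helpers `left_semi` and `left_anti` and the sample data are made up for illustration, and NULL keys are ignored. In PySpark the equivalent would be passing how="left_semi" or how="left_anti" to DataFrame.join().

```python
# Plain-Python model of LEFT SEMI / LEFT ANTI semantics (illustrative only).
# `left_semi` and `left_anti` are hypothetical helpers, not Spark APIs.

def left_semi(left, right, key):
    """Keep left rows whose key appears in right; only left columns survive."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]

def left_anti(left, right, key):
    """Keep left rows whose key does NOT appear in right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]

# Made-up sample data: customer 1 has two orders, customer 2 has none.
customers = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cy"},
]
orders = [
    {"id": 1, "item": "book"},
    {"id": 1, "item": "pen"},
    {"id": 3, "item": "lamp"},
]

print(left_semi(customers, orders, "id"))  # customers with at least one order
print(left_anti(customers, orders, "id"))  # customers with no orders
```

Note that in this model a left row appears at most once in the semi join result even when it matches several right rows, which is the main way it differs from an inner join.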

I think it would be a good service to our users if we either documented
these join types ourselves clearly, or provided a link to an external
resource that documented them sufficiently. I’m happy to file a JIRA about
this and do the work myself. It would be great if the documentation could
be expressed as a series of simple doc tests, but brief prose describing
how each join works would still be valuable.
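
To give a rough idea of the doc test style I have in mind, here is a small sketch. It is an assumption-laden toy: `join` is a hypothetical plain-Python helper over lists of dicts, not the real DataFrame.join(), and it covers only two join types. The actual doc tests would exercise Spark DataFrames instead.

```python
def join(left, right, key, how="inner"):
    """Toy keyed join over lists of dicts (hypothetical helper, not a Spark API).

    >>> people = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
    >>> ages = [{"id": 1, "age": 30}]
    >>> join(people, ages, "id", how="inner")
    [{'id': 1, 'name': 'Ann', 'age': 30}]
    >>> join(people, ages, "id", how="left_outer")
    [{'id': 1, 'name': 'Ann', 'age': 30}, {'id': 2, 'name': 'Bob', 'age': None}]
    """
    if how not in ("inner", "left_outer"):
        raise ValueError(f"unsupported join type in this sketch: {how}")
    # Non-key columns contributed by the right side, used to pad unmatched rows.
    right_cols = [col for row in right for col in row if col != key]
    result = []
    for left_row in left:
        matches = [r for r in right if r[key] == left_row[key]]
        if matches:
            # One output row per matching right row.
            result.extend({**left_row, **r} for r in matches)
        elif how == "left_outer":
            # Unmatched left rows are kept, with None for right-side columns.
            result.append({**left_row, **{col: None for col in right_cols}})
    return result
```

Saved to a file, the examples in the docstring can be checked with `python -m doctest <file>`, so each example serves as both documentation and test.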

Does this seem worthwhile to folks here? And does anyone want to offer
guidance on how best to provide this kind of documentation so that it’s
easy to find by users, regardless of the language they’re using?

Nick
​
