spark-user mailing list archives

From roni <roni.epi...@gmail.com>
Subject Re: which database for gene alignment data ?
Date Tue, 09 Jun 2015 19:21:15 GMT
Hi Frank,
Thanks for the reply. I downloaded ADAM and built it, but it does not seem
to list these functions among its command-line options.
Are they exposed as a public API that I can call from code?

Also, I need to save all my intermediate data. It seems that ADAM stores
data in Parquet on HDFS.
I want to save something in an external database, so that we can re-use
the saved data in multiple ways by multiple people.
Any suggestions on the DB selection or on keeping data centralized for use by
multiple distinct groups?
Thanks
-Roni
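
Whichever database is chosen, the overlap lookups described in the quoted messages below come down to two range predicates on the start and end columns. The following is a minimal sketch of that idea using Python's built-in sqlite3 module. SQLite is only a stand-in here, not a recommendation for multi-gigabyte data, and the table name, column names, and the helper features_overlapping are hypothetical.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE features (
        chrom     TEXT    NOT NULL,
        start_pos INTEGER NOT NULL,  -- 0-based, inclusive (BED convention)
        end_pos   INTEGER NOT NULL,  -- exclusive (BED convention)
        name      TEXT,
        score     INTEGER,
        strand    TEXT
    )
    """
)
# An index on (chrom, start_pos) lets the engine narrow the scan by
# chromosome and start position; it does not make overlap queries free.
conn.execute("CREATE INDEX idx_features_region ON features (chrom, start_pos)")

# The two sample rows from the .bed excerpt in this thread.
rows = [
    ("chr1", 713776, 714375, "peak.1", 599, "+"),
    ("chr1", 752401, 753000, "peak.2", 599, "+"),
]
conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?, ?)", rows)


def features_overlapping(conn, chrom, start, end):
    """Return names of features overlapping the half-open range [start, end)."""
    cur = conn.execute(
        "SELECT name FROM features "
        "WHERE chrom = ? AND start_pos < ? AND end_pos > ? "
        "ORDER BY start_pos",
        (chrom, end, start),
    )
    return [row[0] for row in cur]


# The query range ends exactly where peak.2 starts, so only peak.1 matches.
hits = features_overlapping(conn, "chr1", 714000, 752401)
print(hits)
```

Because BED end coordinates are exclusive, the comparisons are strict: a feature that starts exactly at the query's end position is not reported as overlapping.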



On Mon, Jun 8, 2015 at 12:47 PM, Frank Austin Nothaft <fnothaft@berkeley.edu> wrote:

> Hi Roni,
>
> We have a full suite of genomic feature parsers that can read BED,
> narrowPeak, GATK interval lists, and GTF/GFF into Spark RDDs in ADAM
> <https://github.com/bigdatagenomics/adam>. Additionally, we have support
> for efficient overlap joins (query 3 in your email below). You can load the
> genomic features with ADAMContext.loadFeatures
> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L438>.
> We have two tools for the overlap computation: you can use a
> BroadcastRegionJoin
> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/BroadcastRegionJoin.scala>
> if one of the datasets you want to overlap is small or a ShuffleRegionJoin
> <https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ShuffleRegionJoin.scala>
> if both datasets are large.
>
> Regards,
>
> Frank Austin Nothaft
> fnothaft@berkeley.edu
> fnothaft@eecs.berkeley.edu
> 202-340-0466
>
> On Jun 8, 2015, at 9:39 PM, roni <roni.epi112@gmail.com> wrote:
>
> Sorry for the delay.
> The files (called .bed files) have a format like this:
>
> Chromosome start  end    feature score  strand
>
> chr1	 713776	 714375	 peak.1	 599	+
> chr1	 752401	 753000	 peak.2	 599	+
>
> The mandatory fields are:
>
>    1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or
>       scaffold (e.g. scaffold10671).
>    2. chromStart - The starting position of the feature in the chromosome or
>       scaffold. The first base in a chromosome is numbered 0.
>    3. chromEnd - The ending position of the feature in the chromosome or
>       scaffold. The *chromEnd* base is not included in the display of the
>       feature. For example, the first 100 bases of a chromosome are defined
>       as *chromStart=0, chromEnd=100*, and span the bases numbered 0-99.
>
> There can be more data, as described at https://genome.ucsc.edu/FAQ/FAQformat.html#format1
> Many times the use cases are like:
> 1. Find the features between given start and end positions.
> 2. Find features which have overlapping start and end points with another feature.
> 3. Read external (reference) data which will have a similar format
>    (chr10	48514785	49604641	MAPK8	49514785	+) and find all the data points
>    which overlap with the other .bed files.
>
> The data is huge: .bed files can range from 0.5 GB to 5 GB (or more).
> I was thinking of using Cassandra, but I am not sure whether the overlap
> queries can be supported and will be fast enough.
>
> Thanks for the help
> -Roni
>
>
> On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> Can you describe your use case in a bit more detail since not all people
>> on this mailing list are familiar with gene sequencing alignments data ?
>>
>> Thanks
>>
>> On Fri, Jun 5, 2015 at 11:42 PM, roni <roni.epi112@gmail.com> wrote:
>>
>>> I want to use Spark to read compressed .bed files containing gene
>>> sequencing alignment data.
>>> I want to store the .bed file data in a database and then use external gene
>>> expression data to find overlaps etc. Which database is best for this?
>>> Thanks
>>> -Roni
>>>
>>>
>>
>
>

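The BED coordinate rules and the "small reference set against a large feature set" use case quoted above can be sketched in plain Python. This is only an illustration of the half-open interval logic and of holding the smaller dataset in memory while streaming the larger one. It is not ADAM's BroadcastRegionJoin and does not use Spark. The names Feature, parse_bed_line, overlaps, and overlap_join are hypothetical, as is the reference row gene.x.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str
    score: int
    strand: str


def parse_bed_line(line: str) -> Feature:
    """Parse the first six whitespace-separated BED columns."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return Feature(chrom, int(start), int(end), name, int(score), strand)


def overlaps(a: Feature, b: Feature) -> bool:
    """Half-open intervals overlap when each starts before the other ends."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def overlap_join(
    small: Iterable[Feature], large: Iterable[Feature]
) -> Iterator[Tuple[Feature, Feature]]:
    """Keep the small set in memory, grouped by chromosome, and stream the large set."""
    by_chrom: Dict[str, List[Feature]] = {}
    for ref in small:
        by_chrom.setdefault(ref.chrom, []).append(ref)
    for feat in large:
        for ref in by_chrom.get(feat.chrom, []):
            if overlaps(feat, ref):
                yield feat, ref


# The two peaks come from the .bed excerpt in this thread.
peaks = [
    parse_bed_line("chr1\t 713776\t 714375\t peak.1\t 599\t+"),
    parse_bed_line("chr1\t 752401\t 753000\t peak.2\t 599\t+"),
]
# Hypothetical reference row; it ends exactly where peak.2 starts.
reference = [parse_bed_line("chr1\t714000\t752401\tgene.x\t0\t+")]

pairs = list(overlap_join(reference, peaks))
for feat, ref in pairs:
    print(feat.name, "overlaps", ref.name)
```

Only peak.1 is paired with gene.x: since chromEnd is exclusive, an interval ending at 752401 does not overlap one starting at 752401. The inner loop scans every in-memory reference on the same chromosome, which is adequate for a small reference set; a sorted list or interval tree would be the usual next step for larger ones.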