spark-user mailing list archives

From Frank Austin Nothaft <fnoth...@berkeley.edu>
Subject Re: which database for gene alignment data ?
Date Mon, 08 Jun 2015 19:47:56 GMT
Hi Roni,

We have a full suite of genomic feature parsers in ADAM that can read BED, narrowPeak, GATK
interval lists, and GTF/GFF into Spark RDDs. Additionally, we have support for efficient overlap
joins (query 3 in your email below). You can load the genomic features with ADAMContext.loadFeatures.
We have two tools for the overlap computation: you can use a BroadcastRegionJoin if one of
the datasets you want to overlap is small, or a ShuffleRegionJoin if both datasets are large.
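
To make the broadcast idea concrete, here is a minimal pure-Python sketch of an overlap join
where the small dataset is indexed in memory and the large one is streamed past it. This is
not ADAM's API or implementation: the names Feature, parse_bed_line, overlaps and
broadcast_overlap_join are hypothetical, and the geneA record is made up for illustration.
It assumes BED's 0-based, half-open coordinates as described in your email.

```python
from collections import defaultdict
from typing import Iterable, Iterator, NamedTuple, Tuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED chromStart)
    end: int    # exclusive (BED chromEnd)
    name: str


def parse_bed_line(line: str) -> Feature:
    """Parse the first four BED columns; assumes whitespace-separated fields."""
    fields = line.split()
    name = fields[3] if len(fields) > 3 else "."
    return Feature(fields[0], int(fields[1]), int(fields[2]), name)


def overlaps(a: Feature, b: Feature) -> bool:
    """Half-open intervals overlap only if each starts before the other ends."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def broadcast_overlap_join(
    small: Iterable[Feature], large: Iterable[Feature]
) -> Iterator[Tuple[Feature, Feature]]:
    """Index the small side by chromosome, then scan the large side once."""
    by_chrom = defaultdict(list)
    for f in small:
        by_chrom[f.chrom].append(f)
    for f in large:
        for g in by_chrom.get(f.chrom, ()):
            if overlaps(f, g):
                yield (g, f)


# The two peaks from the email, plus a hypothetical reference feature.
peaks = [
    parse_bed_line("chr1 713776 714375 peak.1 599 +"),
    parse_bed_line("chr1 752401 753000 peak.2 599 +"),
]
genes = [parse_bed_line("chr1 714000 752401 geneA 0 +")]

pairs = list(broadcast_overlap_join(genes, peaks))
```

Here geneA overlaps peak.1 but not peak.2: geneA ends at 752401 (exclusive) and peak.2 starts
at 752401, so the two are adjacent rather than overlapping. In Spark the indexed side would be
shipped to every executor, which is why this approach only suits a small dataset.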

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466

On Jun 8, 2015, at 9:39 PM, roni <roni.epi112@gmail.com> wrote:

> Sorry for the delay.
> The files (called .bed files) have a format like:
> Chromosome start  end    feature score  strand 
> chr1	 713776	 714375	 peak.1	 599	+
> chr1	 752401	 753000	 peak.2	 599	+
> The mandatory fields are 
> 
> chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
> chromStart - The starting position of the feature in the chromosome or scaffold. The
> first base in a chromosome is numbered 0.
> chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd
> base is not included in the display of the feature. For example, the first 100 bases of a
> chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
> There can be more data as described - https://genome.ucsc.edu/FAQ/FAQformat.html#format1
> Typical use cases are:
> 1. Find the features between given start and end positions.
> 2. Find features which have overlapping start and end points with another feature.
> 3. Read external (reference) data which will have a similar format (chr10	48514785	49604641
> MAPK8	49514785	+) and find all the data points which overlap with the other .bed
> files.
> 
> The data is huge. .bed files can range from 0.5 GB to 5 GB (or more).
> I was thinking of using Cassandra, but I am not sure whether the overlap queries can be supported
> and will be fast enough.
> 
> Thanks for the help
> -Roni
> 
> On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> Can you describe your use case in a bit more detail, since not all people on this mailing
> list are familiar with gene sequencing alignment data?
> 
> Thanks
> 
> On Fri, Jun 5, 2015 at 11:42 PM, roni <roni.epi112@gmail.com> wrote:
> I want to use Spark to read compressed .bed files containing gene sequencing alignment
> data.
> I want to store the .bed file data in a database and then use external gene expression data to find
> overlaps, etc. Which database is best for this?
> Thanks
> -Roni
> 
> 
> 

