Hi Roni,

We have a full suite of genomic feature parsers that can read BED, narrowPeak, GATK interval lists, and GTF/GFF into Spark RDDs in ADAM. Additionally, we have support for efficient overlap joins (query 3 in your email below). You can load the genomic features with ADAMContext.loadFeatures. We have two tools for the overlap computation: you can use a BroadcastRegionJoin if one of the datasets you want to overlap is small or a ShuffleRegionJoin if both datasets are large.

Regards,

On Jun 8, 2015, at 9:39 PM, roni <firstname.lastname@example.org> wrote:

Sorry for the delay.

The files (called .bed files) have a format like this:

Chromosome start end feature score strand
chr1 713776 714375 peak.1 599 +
chr1 752401 753000 peak.2 599 +

The mandatory fields are
- chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
- chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
- chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

There can be more fields, as described at https://genome.ucsc.edu/FAQ/FAQformat.html#format1

Many times the use cases are like:
1. Find the features between given start and end positions.
2. Find features whose start and end points overlap with another feature.
3. Read external (reference) data in a similar format (chr10 48514785 49604641 MAPK8 49514785 +) and find all the data points that overlap with the other .bed files.

The data is huge. The .bed files can range from 0.5 GB to 5 GB (or more).

I was thinking of using Cassandra, but I am not sure if the overlapping queries can be supported and will be fast enough.

Thanks for the help
-Roni

On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu <email@example.com> wrote:

Can you describe your use case in a bit more detail, since not all people on this mailing list are familiar with gene sequencing alignment data?

Thanks

On Fri, Jun 5, 2015 at 11:42 PM, roni <firstname.lastname@example.org> wrote:

I want to use Spark to read compressed .bed files containing gene sequencing alignment data. I want to store the bed file data in a database and then use external gene expression data to find overlaps etc. Which database is best for this?

Thanks
-Roni
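
For readers unfamiliar with BED coordinates, the half-open intervals and the overlap use cases described in this thread can be illustrated with a small plain-Python sketch. This is not ADAM code and says nothing about how BroadcastRegionJoin or ShuffleRegionJoin work internally; the names `BedFeature`, `parse_bed_line`, `overlaps`, `features_in_range`, and `overlap_pairs` are hypothetical helpers made up for this example. It also assumes that "between given start and end positions" means "overlapping that range", and the nested loop is only reasonable for small inputs, not multi-GB files.

```python
from typing import List, NamedTuple, Tuple


class BedFeature(NamedTuple):
    chrom: str
    start: int  # chromStart: 0-based, inclusive
    end: int    # chromEnd: exclusive
    name: str


def parse_bed_line(line: str) -> BedFeature:
    """Parse the three mandatory BED fields plus an optional name column."""
    fields = line.split()
    name = fields[3] if len(fields) > 3 else ""
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), name)


def overlaps(a: BedFeature, b: BedFeature) -> bool:
    """Half-open overlap test: [start, end) intervals on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def features_in_range(features: List[BedFeature], chrom: str,
                      start: int, end: int) -> List[BedFeature]:
    """Use case 1 (assumed meaning): features overlapping [start, end)."""
    query = BedFeature(chrom, start, end, "query")
    return [f for f in features if overlaps(f, query)]


def overlap_pairs(left: List[BedFeature],
                  right: List[BedFeature]) -> List[Tuple[BedFeature, BedFeature]]:
    """Use cases 2 and 3: naive O(n*m) join, for illustration only."""
    return [(l, r) for l in left for r in right if overlaps(l, r)]


# The two example peaks quoted earlier in the thread.
peaks = [parse_bed_line(s) for s in (
    "chr1 713776 714375 peak.1 599 +",
    "chr1 752401 753000 peak.2 599 +",
)]

# Only peak.1 is returned: peak.2 starts exactly at the exclusive query end.
print(features_in_range(peaks, "chr1", 714000, 752401))
```

Because chromEnd is exclusive, two features that merely touch (one ending at 100, the next starting at 100) do not count as overlapping in this sketch.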