spark-user mailing list archives

From jakeheller <>
Subject Good Spark consultants?
Date Mon, 08 Jun 2015 06:06:35 GMT
I was wondering if there are any consultants in high standing in the
community. We are considering using Spark, and we'd love to have someone
with a lot of experience help us get up to speed and migrate an existing
data pipeline to Spark (and perhaps first help answer the question of
whether we should be using Spark to begin with for our use case).

Here's a bit more background fwiw:
We would like help setting up Spark and moving a data processing pipeline
onto it (perhaps after a bit of exploration of whether it is a technology
we should adopt to begin with).

We're doing a lot of processing of legal documents -- in particular, the
entire corpus of American law. It's about 10 million documents, many of
which are quite long (hundreds of pages of text).

We'd like to (a) transform these documents from the various (often borked)
formats they come to us in into a standard XML format, (b) once they are in
a standard format, extract information from them (e.g., which judicial cases
cite each other?) and annotate the documents with the information extracted,
and then (c) deliver the end result to a repository (like S3) where it can
be accessed by the user-facing application.
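
To make the shape of stages (a) through (c) concrete, here is a rough sketch in plain Python, not our actual code. Everything in it is an assumption made for illustration: the XML layout, the function names, and the citation regex (which only catches "410 U.S. 113"-style cites). The idea is that each document is handled by one self-contained function, which is the kind of per-record work that could presumably be handed to something like Spark's `rdd.map`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified pattern: matches only U.S. Reports
# citations such as "410 U.S. 113". Real citation extraction is far messier.
CITATION_RE = re.compile(r"\b\d{1,3} U\.S\. \d{1,4}\b")


def to_standard_xml(doc_id, raw_text):
    """Stage (a): wrap raw text in a made-up 'standard' XML envelope."""
    root = ET.Element("document", id=doc_id)
    body = ET.SubElement(root, "body")
    body.text = raw_text.strip()
    return root


def annotate_citations(root):
    """Stage (b): extract citations and record them on the document."""
    text = root.findtext("body") or ""
    cites = ET.SubElement(root, "citations")
    # De-duplicate and sort so the output is deterministic.
    for cite in sorted(set(CITATION_RE.findall(text))):
        ET.SubElement(cites, "cite").text = cite
    return root


def process(doc_id, raw_text):
    """Run (a) and (b); return a (key, xml_string) pair ready for stage (c).

    Stage (c), the upload to the repository, is deliberately left out;
    the pair returned here is what a writer step would store under `key`.
    """
    root = annotate_citations(to_standard_xml(doc_id, raw_text))
    return doc_id, ET.tostring(root, encoding="unicode")
```

Because `process` takes one document and returns one result without touching shared state, it could be applied to each record independently, for example as `rdd.map(lambda kv: process(*kv))` over (id, text) pairs.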

Of course, we'd also like to do all of this quickly -- optimally, running
the entire database through the whole pipeline in a few hours.

We currently use a mix of Python and Java scripts (including XSLT, and
NLP/unstructured data tools like UIMA and Stanford's CoreNLP) in various
places along the pipeline we built for ourselves to handle these tasks. The
current pipeline infrastructure was built a while back -- it's basically a
number of HTTP servers that each have a single task and pass the document
along from server to server as it goes through the processing pipeline. It's
great, although it's having trouble scaling, and there are some reliability
issues. It's also a headache to handle all the infrastructure. For what it's
worth, metadata about the documents resides in SQL, and the actual text of
the documents lives in S3.
