As part of the Australian Research Data Commons-supported Research Platforms program, the ATAP project aims to provide accessible tools and training for researchers working with large volumes of unstructured text, supported by a community of practice.
Text analytics is the process of extracting data from large volumes of unstructured text to derive machine-readable information for research purposes. Unstructured text is text-based material that lacks descriptive data and cannot be readily organised or defined, for example documents, social media and audio transcripts. The analytics workflow provides tools to clean and organise this material by introducing structure, such as dates, and by identifying entities of interest, transforming it into information that computers can process. This enables rich insights through advanced queries and data visualisations, and prepares the data for machine learning.
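As a rough illustration of what "introducing structure" can mean in practice, the following minimal Python sketch cleans a piece of raw text, pulls out dates and flags entities of interest. It is not ATAP's tooling: the date pattern, the entity list, the `Record` type, the function names and the sample sentence are all invented for this example, and real workflows would typically rely on dedicated text analytics libraries rather than hand-written rules.

```python
import re
from dataclasses import dataclass, field

# Assumption: dates are written like "3 May 2021". Other formats are ignored.
DATE_PATTERN = re.compile(
    r"\b(\d{1,2}) (January|February|March|April|May|June|July|August|"
    r"September|October|November|December) (\d{4})\b"
)

# Assumption: a small hand-made list of entities a researcher cares about.
ENTITIES_OF_INTEREST = ["AARNet", "CERN", "SWAN"]


@dataclass
class Record:
    """A structured view of one piece of unstructured text."""
    text: str
    dates: list = field(default_factory=list)
    entities: list = field(default_factory=list)


def clean(raw: str) -> str:
    """Collapse runs of whitespace (including line breaks) and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


def structure(raw: str) -> Record:
    """Clean the text, then attach the dates and entities found in it."""
    text = clean(raw)
    dates = [match.group(0) for match in DATE_PATTERN.finditer(text)]
    entities = [
        name
        for name in ENTITIES_OF_INTEREST
        if re.search(rf"\b{re.escape(name)}\b", text)
    ]
    return Record(text=text, dates=dates, entities=entities)


# Invented sample sentence, used only to exercise the sketch.
sample = "  The workshop on 3 May 2021   used SWAN,\n a tool developed by CERN.  "
record = structure(sample)
print(record.text)
print(record.dates)
print(record.entities)
```

Once text is held in records like this, it can be filtered by date or entity, counted, or passed on to visualisation and machine learning steps.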
SWAN (Service for Web-based Analysis) will form a crucial component of ATAP. Based on Jupyter Notebooks, SWAN is a cloud-based workbench for writing, running and sharing data analysis code. It was developed by CERN and complements other projects supported by AARNet, including the Language Data Commons of Australia. AARNet will also support ATAP by delivering hands-on workshops and online training modules. AARNet’s CloudStor and SWAN assist in the publication of datasets and the reproducibility of results, both of which are important to research outputs.
The outcomes of ATAP will benefit a broad range of research fields in which growing volumes of text-based material are a valuable source of information. Techniques for processing and analysing unstructured text are applicable to the humanities, engineering and the sciences. ATAP aims to transform and accelerate data-driven research across disciplines, and demonstrates AARNet’s continuing commitment to advancing national research infrastructure and building data skills.